Data Investigation

Tuesday, February 6, 2018 11:11 PM

Milestone #1: Data Investigation

My first step is to investigate my data options for this project. As discussed in my plan, I am considering Google Streetview data and LiDAR data. The Streetview data is my first choice, but I realized it might differ from what I expect or have complications that make what I have in mind difficult or impossible. I wanted to consider alternatives, and there's a lot that interests me about LiDAR data. Of course, that data might be impossible to work with too. In any case, I needed to find these things out right away, while it is still easy to change course on this project.

The summary of Google Streetview data is that it is easy to work with and close to what I expected. They provide a convenient API that is properly documented. Unfortunately, the depth data discussed in this blog post does not come from the API, and that information is compressed in a format I have not yet parsed. The author of that post does provide C++ code for doing so; I am optimistic that I will be able to translate that to Python and/or integrate their process into my code.

LiDAR data is also well documented but extremely complex. I've worked with complex data before and am confident I can manage this if I put in the time. My objection is that taking the project in that direction would take a good portion of the class. I would have less time to learn about the topics I want to be learning about.

Additionally, the challenges I would face with the Google Streetview data resonate with me in a way that the LiDAR data challenges do not.

My conclusion is that I will use the Google Streetview data for this project. Sometime after the semester is over I might spend more time with the LiDAR data and get some experience working with it. It would be a great choice for a future project.

Tuning Hyperparameters

Thursday, December 14, 2017 1:50 PM

Our last Learning Machines assignment is to calibrate the hyperparameters for a Multilayer Perceptron. Patrick gave us a working model using the MNIST database of handwritten digits. The model uses a Restricted Boltzmann Machine to reduce the dimensionality of the data and then a Multilayer Perceptron to classify the digits.

I was able to achieve an out-of-sample accuracy of almost 96%. This is in line with the results of other researchers.

Multi-Layer Perceptron Study

Monday, December 11, 2017 8:58 AM

Our next assignment is to use a Multi-Layer Perceptron to study a dataset.

The dataset I selected is the commonly studied Poker Hand data. Each record contains data for 5 playing cards and a poker hand classification, such as full house or straight.

This dataset proved to be difficult to work with. It is an example of an imbalanced dataset in that the more common poker hands like two-of-a-kind are heavily represented and the less common hands like straight and flush are not.

I found that the Perceptron was able to correctly classify some poker hands very well while performing terribly for others. I suspect a very different training methodology is required to properly train a Perceptron with this dataset.
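The gap between overall accuracy and per-hand performance can be made concrete with a per-class recall calculation. The following is a minimal sketch and not the code from this study: the function name `per_class_recall` and the tiny label lists are hypothetical, made up only to show how a classifier that always predicts the common hand can look accurate while failing completely on a rare one.

```python
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Return {label: fraction of that label's records predicted correctly}."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, prediction in zip(y_true, y_pred):
        totals[truth] += 1
        if truth == prediction:
            correct[truth] += 1
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical, made-up labels: 8 common hands and 2 rare hands.
y_true = ["pair"] * 8 + ["flush"] * 2
# A model that always predicts the common class.
y_pred = ["pair"] * 10

overall_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
```

Here the overall accuracy is 0.8 even though recall for the rare class is 0.0, which is why a single accuracy number can hide poor results on the less common hands.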

Modified Pulse Sensing Algorithm

Monday, December 4, 2017 11:56 PM

Our Physical Computing final project depends on a Pulse Sensor to detect a user's heartbeat. The people at World Famous Electronics created an Arduino library for their customers to use with their sensor. The library adds a lot of value because it provides users with a well researched algorithm for using the sensor to properly detect a heartbeat. Pulse Sensor users don't have to re-invent the wheel and code their own algorithms. Writing your own algorithm to do this is difficult, and the one provided by the company is better than the one that I came up with for our midterm.

Still, the provided algorithm isn't perfect. For some people it seems to miss some heartbeats and add extra ones. A fellow ITP student, Ellen, showed me that it would produce odd spikes in the beats-per-minute (BPM) value. It wasn't clear why this was happening. Since I had previously been analyzing the sensor's data in Python, I came up with a plan to find out why the Arduino code was doing this and whether there was anything I could do about it. After studying the data and making some plots, I was able to make some improvements to the algorithm. It still isn't perfect, but my changes address many of its weaknesses.

The original Pulse Sensor Arduino code is available online on GitHub. I am sharing this code with my fellow students who are also using the same sensor. After our projects are complete I will submit my modified code to GitHub as a pull request to share with the rest of the community.

Perceptrons

Friday, November 24, 2017 4:30 PM

Basic Perceptron

This week's assignment is to code a Perceptron in Python and train it to learn the basic AND, OR, and XOR logic operations.

I created a Perceptron function with parameters that will let me study the operation of this algorithm.
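For readers who want a feel for the assignment, here is a minimal sketch of a single-layer perceptron with a step activation. It is an illustration under my own assumptions, not the function from the post: the names `train_perceptron`, `predict`, and `accuracy`, the zero initial weights, and the default `learning_rate` and `epochs` values are all hypothetical choices.

```python
def predict(weights, bias, inputs):
    """Step activation: output 1 when the weighted sum is positive, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if activation > 0 else 0


def train_perceptron(samples, learning_rate=0.1, epochs=50):
    """Train on a list of ((x1, x2), target) pairs using the perceptron rule.

    Returns the learned (weights, bias).
    """
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Nudge the weights and bias only when the prediction was wrong.
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


def accuracy(samples, weights, bias):
    """Fraction of samples the trained perceptron classifies correctly."""
    hits = sum(predict(weights, bias, x) == t for x, t in samples)
    return hits / len(samples)


AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
OR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
XOR_DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
```

AND and OR are linearly separable, so this rule can learn them. XOR is not, so a single perceptron cannot classify all four XOR cases correctly no matter how long it trains.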

Clustering and NumPy

Saturday, November 18, 2017 8:24 PM

K-means clustering

Our second assignment in our Learning Machines class is to implement k-means clustering in Python. I've implemented this in other programming languages but not in Python. Normally I'd use scikit-learn for this but it is a worthwhile exercise to think through how to do this in Python.
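As a rough illustration of what thinking it through involves, here is a minimal k-means sketch in plain Python with no third-party dependencies. It is not the assignment code: the function names, the optional `initial_centroids` parameter, the `seed` default, and the choice to keep an empty cluster's previous centroid are all assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda i: squared_distance(p, centroids[i]))
        for p in points
    ]


def kmeans(points, k, initial_centroids=None, max_iterations=100, seed=0):
    """Alternate assignment and update steps until the centroids stop moving."""
    if initial_centroids is None:
        # Assumption: start from k distinct points chosen at random.
        initial_centroids = random.Random(seed).sample(points, k)
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(max_iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean_point(members) if members else centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assign(points, centroids)
```

Calling `kmeans(points, 2)` on a list of `(x, y)` tuples returns the final centroids and one cluster label per point. The result depends on the starting centroids, which is why the sketch lets the caller pass them in explicitly.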

Run Length Encoding

Saturday, November 11, 2017 11:33 PM

Our first assignment in our Learning Machines class is to implement a run length encoder and decoder. This is a simple data compression algorithm that benefits from repeated patterns.
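To show the idea in miniature, here is a minimal sketch of a run length encoder and decoder. It is an illustration only and not the implementation from the assignment: the names `rle_encode` and `rle_decode` and the choice to represent the encoded data as a list of `(value, count)` pairs are assumptions.

```python
def rle_encode(data):
    """Collapse each run of equal items into a (value, count) pair."""
    runs = []
    for item in data:
        if runs and runs[-1][0] == item:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([item, 1])  # start a new run
    return [(value, count) for value, count in runs]


def rle_decode(runs):
    """Expand (value, count) pairs back into a flat list of items."""
    items = []
    for value, count in runs:
        items.extend([value] * count)
    return items


encoded = rle_encode("aaabcc")
decoded = "".join(rle_decode(encoded))
```

Data with long runs shrinks to a handful of pairs, while data with no repeats grows, because every item then carries a count of 1.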

It happens that I previously had an idea for an Arduino project that requires a lightweight data decompression algorithm to decode audio data. I was going to use run length encoding because it is simple to implement and the code itself won't take up much of the Arduino's precious memory. I'll also need to encode the audio files in Python, and I'll use the code below to do it.

Heartbeat Detection Algorithm

Tuesday, October 24, 2017 4:41 PM

Purpose of detecting heartbeat data

Our Midi Meditation project is a physical computing device that will repeatedly play a single note in sync with the user's heartbeat. Fundamental to this is the ability to reliably detect when a user's heart is beating.

We want our device to work effectively for most or all people. This means it should play one note in sync with the user's pulse without extra notes between beats.

We had a pulse sensor suitable for an Arduino to use for this project. One approach for prototyping this is to code a heartbeat detection algorithm on an Arduino after viewing the sensor readings on the Serial monitor for a couple of people. This approach could work but would require a lot of parameter tweaking to get it "just right" with repeated user testing between parameter adjustments.
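To make the parameter tweaking concrete, here is a minimal sketch of one simple kind of detector: a beat is counted when the signal rises through a threshold, and further crossings are ignored for a short refractory period so that a noisy pulse does not trigger extra notes. This is an assumption-laden illustration in Python, not the algorithm used in our project or in any sensor library. The names `detect_beats` and `beats_per_minute`, the `threshold` value, and the `refractory_s` default are all hypothetical, and they are exactly the sort of parameters that would need tuning across users.

```python
def detect_beats(samples, sample_rate_hz, threshold, refractory_s=0.25):
    """Return sample indices where the signal rises through the threshold.

    Crossings that occur within refractory_s seconds of the previous
    detected beat are ignored.
    """
    refractory_samples = round(refractory_s * sample_rate_hz)
    beats = []
    for i in range(1, len(samples)):
        rising = samples[i - 1] < threshold <= samples[i]
        if rising and (not beats or i - beats[-1] >= refractory_samples):
            beats.append(i)
    return beats


def beats_per_minute(beat_indices, sample_rate_hz):
    """Average BPM from the intervals between beats, or None if too few."""
    if len(beat_indices) < 2:
        return None
    intervals = [(b - a) / sample_rate_hz
                 for a, b in zip(beat_indices, beat_indices[1:])]
    return 60.0 / (sum(intervals) / len(intervals))


# Synthetic, made-up signal sampled at 100 Hz: one pulse per second,
# plus a spurious bump 0.1 s after the first pulse.
signal = [0] * 300
for index in (50, 60, 150, 250):
    signal[index] = 10

beats = detect_beats(signal, sample_rate_hz=100, threshold=5)
bpm = beats_per_minute(beats, sample_rate_hz=100)
```

In this synthetic example the bump at index 60 falls inside the refractory window and is skipped. Real sensor data is much messier, which is where the repeated testing comes in.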

First Jupyter Notebook Post

Friday, July 21, 2017 12:00 PM

This is a blog post created in Jupyter notebook.

The goal is to see how well this feature works. I'd like to be able to post Python code to my blog. Happily, Nikola supports that seamlessly.

Normally Nikola preserves the width of each notebook cell. That makes sense, but it doesn't work so well with this template because of the navigation bar on the left side of the screen. That's OK; if I need to, I can override it by changing the notebook styling with this CSS:

#notebook-container {
  width: 800px;
}

And here is some Python code:

In [1]:

def square(x):
    return x**2

for i in range(10):
    print(square(i))

And a plot:

In [2]:

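%matplotlib inline
import matplotlib

import numpy as np
import pandas as pd

In [3]:

# Four columns of random data for 20 days. pandas.util.testing, used in the
# original version of this cell, is no longer available in current pandas.
df = pd.DataFrame(np.random.randn(20, 4), columns=list('ABCD'))
df.index = pd.date_range(start=pd.Timestamp.now().floor('D'), periods=df.shape[0])

df.plot(figsize=(10, 5))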

Out[3]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f57609a2fd0>

[Plot: four randomly colored lines ranging from -2 to 2 on the y axis across 20 days on the x axis]

Magnificent!

JupyterDay NYC

Sunday, October 25, 2015 1:45 PM

Yesterday I had the pleasure of attending the first JupyterDay Conference in NYC. This was a one-day event discussing the open source project Jupyter, formerly known as IPython Notebook.

I had a wonderful time at the event. All of the speakers were engaging and I got a lot of great ideas for what I want to learn about to strengthen my technology and data science skills.

I took extensive notes and can't compile them all here. Instead, here are a few highlights from the event:

Jeremy Singer-Vine, BuzzFeed - Jeremy is a Data Editor at BuzzFeed and does investigative data journalism. BuzzFeed does quantitative analysis for some of their news stories and backs those stories up with research posted on GitHub that readers can verify. For example, this news story and this notebook. I wish more journalists were this transparent.
Doug Blank, Bryn Mawr - Doug talked about how Jupyter is changing education at his college. Everything is a notebook there. Students submit notebooks for their homework assignments. They've built many extensions to Jupyter to support this. The most fascinating is that they have kernels for many other languages, like BASIC, Assembly, and Pascal. I am going to set these up on my computer very soon.
Sylvain Corlay, Bloomberg - Sylvain is a quant at Bloomberg. He showed us a demo of a new plotting library called bqplot that they will share with the community. He used IPython widgets to interact with the charts.

These were just a few of yesterday's speakers. The attendees were supportive and bright as well. I had many thought-provoking conversations about data analysis and now have a list of tools I want to learn about as soon as I can.

All in all, a great day. Very glad I signed up for this.