Python Data Science

My notes, resources and examples using Python, NumPy, SciPy and Matplotlib as alternatives to R and Matlab for data science and analysis.

Load Data from Text File

import pylab
filename = "cool_data.dat"

# use skiprows if your data file has headers
data = pylab.loadtxt(filename, skiprows=1)

An example loading comma-delimited data using NumPy:

import numpy as np
# pass the file name directly so NumPy opens and closes the file itself
data = np.loadtxt('comma_delim.csv', delimiter=",")
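
As a self-contained sketch of the two loading calls above, the snippet below feeds np.loadtxt an in-memory string instead of a file on disk, so it runs without any data file. The sample values and the x, y column names are made up for illustration.

```python
import io
import numpy as np

# Hypothetical sample data: one header row, then comma-separated numbers.
sample = io.StringIO("x,y\n1.0,2.0\n3.0,4.0\n5.0,6.0\n")

# skiprows=1 drops the header line; delimiter="," splits on commas.
data = np.loadtxt(sample, delimiter=",", skiprows=1)

# Each line of the input becomes a row of the array,
# so columns are selected with slicing.
x = data[:, 0]
y = data[:, 1]
print(data.shape)
```

Replacing the StringIO object with a file name should give the same result for a file with the same contents.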

Plotting and Graphing

Log Scale

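import math
import matplotlib.pyplot as pyplot

# x runs from 1 to 24 and y = 10**x
X = list(range(1, 25))
Y = [math.pow(10, i) for i in X]
pyplot.xlim(0, 25)
# set both ends of the y range; a log axis needs a positive lower limit
pyplot.ylim(min(Y), max(Y))
pyplot.yscale('log')
pyplot.plot(X, Y)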
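
The same curve can be built without an explicit loop. This is only a sketch: it swaps in NumPy arrays and the semilogy call (a plot with a log-scaled y axis) for the math.pow loop above. The Agg backend and the output file name log_scale.png are my assumptions, chosen so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as pyplot
import numpy as np

X = np.arange(1, 25)   # 1, 2, ..., 24
Y = 10.0 ** X          # float base keeps the result in floating point

fig, ax = pyplot.subplots()
ax.semilogy(X, Y)      # plot with a log-scaled y axis
ax.set_xlim(0, 25)
ax.set_ylim(Y.min(), Y.max())
fig.savefig("log_scale.png")  # hypothetical output file name
```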

Labels for Titles and Axes

import numpy as np
import matplotlib.pyplot as pyplot

# 500 samples each from a normal distribution with mean 0 and standard deviation 1
X = np.random.normal(0, 1, 500)
Y = np.random.normal(0, 1, 500)
pyplot.scatter(X,Y)
pyplot.title("Scatter Plot Example")
pyplot.xlabel("X-Axis")
pyplot.ylabel("Y-Axis")

Saving a Graph

The following creates a PNG image of 648×432 pixels (9×6 inches at 72 dpi). You will most likely want to keep the dpi set to 72, since it has a direct effect on the font sizes in the rendered image.

import pylab
# ... setup of data ...

# figure size in inches
pylab.rcParams['figure.figsize'] = 9, 6
pylab.plot(X, Y)
pylab.savefig("graph.png", dpi=72)  # dots per inch
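
The pixel dimensions come from multiplying the figure size in inches by the dpi: 9 × 72 = 648 and 6 × 72 = 432. The sketch below checks this by saving a figure to an in-memory buffer and reading the width and height back from the PNG header. The Agg backend, the buffer and the placeholder data are my assumptions, not part of the recipe above.

```python
import io
import struct
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no display needed
import matplotlib.pyplot as pyplot

fig = pyplot.figure(figsize=(9, 6))  # figure size in inches
pyplot.plot([1, 2, 3], [1, 4, 9])    # placeholder data

# Save to memory instead of disk so the bytes can be inspected.
buf = io.BytesIO()
fig.savefig(buf, format="png", dpi=72)

# A PNG file stores width and height as big-endian 32-bit integers
# at byte offsets 16-23, inside the IHDR chunk.
width, height = struct.unpack(">II", buf.getvalue()[16:24])
print(width, height)
```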

Installing NumPy, SciPy and Matplotlib on OS X

I had a little trouble with the initial setup of some of the key libraries used for machine learning, stats and data science.

Here’s what worked for me on Mac OS X. The key was to not use the built-in Python and instead download the binary from python.org. Then make sure your environment variables are set to use this binary.

The symlinks to the python binaries were put in /usr/local/bin

The actual binaries are installed in /Library/Frameworks/Python.framework/Versions/3.3/bin

You need to install gfortran as a prerequisite, which I did using Homebrew:

$ brew install gfortran

If you are still using Python 2.7, pip install works fine. Make sure the pip you are using belongs to the binary you installed and not to the base system. Most likely it should be /usr/local/bin/pip and not /usr/bin/pip.

$ pip install numpy
$ pip install scipy
$ pip install matplotlib
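
One way to check which interpreter and which pip are being picked up is to ask Python itself. This is a small sketch using only the standard library (shutil.which needs Python 3.3 or later); the paths it prints depend on your machine.

```python
import shutil
import sys

# Path and version of the interpreter running this script.
print("python:", sys.executable)
print("version:", sys.version.split()[0])

# First pip found on PATH, or None if there is none.
pip_path = shutil.which("pip")
print("pip:", pip_path)

# /usr/bin/pip is the base-system location mentioned above.
if pip_path is not None and pip_path.startswith("/usr/bin/"):
    print("warning: this looks like the base-system pip")
```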

If you are using Python 3.3, which I have switched to and have had no problems with once installed, the packages available through pip seem to be out of date or to require the latest code, so I checked out each library from source and built it:

$ git clone https://github.com/numpy/numpy.git
$ cd numpy
$ python setup.py install
$ cd ..

$ git clone https://github.com/scipy/scipy.git
$ cd scipy
$ python setup.py install
$ cd ..

$ git clone https://github.com/matplotlib/matplotlib.git
$ cd matplotlib
$ python setup.py install
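
After installing, a quick way to confirm that the libraries are visible to the interpreter is to try importing each one. This is a minimal sketch; installed_version is a helper name I made up for illustration, and it reports a missing package instead of raising an error.

```python
import importlib


def installed_version(name):
    """Return the package's __version__, or None if it cannot be imported."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    # Some modules do not define __version__ at all.
    return getattr(module, "__version__", "unknown")


for name in ("numpy", "scipy", "matplotlib"):
    version = installed_version(name)
    print(name, version if version is not None else "NOT INSTALLED")
```

Run it with the same python binary you used for the install, otherwise it reports on a different set of packages.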

Resources