Conversion between standard Python data types and numpy data types

Introduction to numpy and scipy packages

Basic statistical analysis with numpy tools

Reading and writing tabular data with pandas

Compare performance using vector math and for loops.

Introduction

As Python matured into the multifunctional language it is today, more and more features were added. We've seen most of the original functionality of the language, which was originally intended for scripting of UNIX and web processes, and writing user interfaces. But as Python became popular as a scripting language, it found its way into the arms of scientists, who generally speaking, care more about efficient data manipulation than web programming. The big piece that was missing in Python was a numerical library, which means a collection of tools for dealing with large amounts of numbers at once. After a few efforts, an integrated one was developed. Numerical Python, or NumPy provided a large collection of mathematical functions similar to those found in a language like MATLAB or R, and also similarly to those languages, provides datatypes that are manipulatable as vectors and matrices. We'll introduce these data types, then their associated functions.

NumPy Basics

Numerical Python is a powerful library of functions, methods, and data types we can used to analyze our data. Unforunately for those of us whose heads continue to spin in a crash-course of syntax, it also uses a different set of rules. I hope you'll understand why when you see the power and speed NumPy's data types afford us. Let's start off creating some empty arrays, which look sorta like lists, and are in fact vectors.

They differ in a few fundamental ways from lists:

Arrays cannot be of mixed types. They can be all integers, floats, strings, logical (or boolean) values, or other immutable values. But they cannot be some characters, some numbers, or any other olio of data types. They also cannot contain mutable types such as lists. So, we can have a list of lists, but not an array of lists. We can, however, have an array of arrays (sortof). Which brings us to:

Arrays can be multidimensional, but they must be rectangular. You can have a list of lists, where the first interior list is 3 elements long, the second 5, and the third 12, but your multidimensional array must be expressible as "a m by n (by j by k by...) array". I have never encountered a situation where Python says there are too many dimensions (but I've never had need beyond, maybe, 4 dimensions).

We can perform vector operations on them, which can be algebraic functions (like a dot product), or simple replacements of values in slice of the array.

For the some of the lecture today, I'll be using the format

>>>code goes here

to indicate things that I'm doing in IPython, as opposed to in a code file. It's the same format that the regular python interactive interpreter uses, (rather than In[] and Out[]), but it lets us use cpaste, which strips out the >>> when interpreting code.

Arrays

Here's one way: start with a list and turn it into an array with the array method:

>>>import numpy as np
>>> a =[0] * 40>>> a = np.array(a)

Or this can be shortened:

>>> a = np.array([0] * 40)

You know have an array a of 1 row and 40 columns with zeros. But there's a better way to get a vector of zeros:

>>> a = np.zeros(40)

Note that the default type when declaring an array is float64:

>>>type(a[0])<type'numpy.float64'>

And here's the simplest way to change that:

>>> a = np.zeros(40,int)>>>type(a[0])<type'numpy.int32'>

And here's how to declare something that's not all zeros:

>>> a = np.arange(40)>>> a
array([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39])>>>type(a[0])<type'numpy.int32'>

Notice the int type.

What if we want a float? There's a couple ways to do it:

>>> a = np.arange(40, dtype=float)#Explicitly tell it to make a float>>>type(a[0])<type'numpy.float64'>>>> a = np.arange(40.0)#Implicitly give a float, and it will still work>>>type(a[0])<type'numpy.float64'>

Like with range(), you can also give arange() more parameters:

>>> np.arange(40,50)# Start and Stoparray([40,41,42,43,44,45,46,47,48,49])>>> np.arange(40,50,1)# Start, Stop, and incrementarray([40,41,42,43,44,45,46,47,48,49])>>> np.arange(40,50,.25)# If any of the parameters are floats, the output will be a floatarray([40. ,40.25,40.5,40.75,41. ,41.25,41.5,41.75,42. ,42.25,42.5,42.75,43. ,43.25,43.5,43.75,44. ,44.25,44.5,44.75,45. ,45.25,45.5,45.75,46. ,46.25,46.5,46.75,47. ,47.25,47.5,47.75,48. ,48.25,48.5,48.75,49. ,49.25,49.5,49.75])

As I said above, you can have arrays with more than one dimension

>>> a = np.zeros((10,10))# Note the inner set of parentheses. (Rows, Columns)>>> a
array([[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]])

And you can even modify a particular element with the same syntax, or a subtly different syntax, as our list-of-lists:

>>> a[5][5]=3# row, then column>>> a[6,6]=42# still row, column>>> a
array([[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,3.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,42.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]])#You can even add a number to a specific position using the '+=' notation.>>>a[6,6]+=10# Add 10 to the nth row, nth columnarray([[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,3.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,52.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]])

So far, the coolest thing I've shown you isn't really that exciting: a range function that can have floats. The real power of arrays is the ability to have one statement affect a large chunk of an array:

>>> a[1,:]=1>>> a
array([[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[1.,1.,1.,1.,1.,1.,1.,1.,1.,1.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,3.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,42.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[0.,0.,0.,0.,0.,0.,0.,0.,0.,0.]])>>> a[:,0]=7>>> a
array([[7.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[7.,1.,1.,1.,1.,1.,1.,1.,1.,1.],[7.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[7.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[7.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[7.,0.,0.,0.,0.,3.,0.,0.,0.,0.],[7.,0.,0.,0.,0.,0.,42.,0.,0.,0.],[7.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[7.,0.,0.,0.,0.,0.,0.,0.,0.,0.],[7.,0.,0.,0.,0.,0.,0.,0.,0.,0.]])

Let us pause for a moment and think about how we would do this with a for loop in lists:

LoL =[[0]*10for i inrange(10)]for i, elem inenumerate(LoL[1]):
LoL[1][i]=1for L in LoL:
L[0]=7

We can also take slices of arrays, just as if they were lists:

>>> a = np.arange(10)>>> a[2:5]array([2,3,4])>>> a[-1]9>>> a[::-1]array([9,8,7,6,5,4,3,2,1,0])

Maybe you can see the advantage of the array syntax. But wait, there's more! Act now, and we'll throw in math operations for free!

Vector Math with Arrays

We can do math on many values at once with arrays, no for loop required.

>>> a = np.arange(0,100,2)>>> b = np.arange(50)# Simple math across the whole array:>>> a
array([0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98])>>> a + 10array([10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46,48,50,52,54,56,58,60,62,64,66,68,70,72,74,76,78,80,82,84,86,88,90,92,94,96,98,100,102,104,106,108])>>> b
array([0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49])>>> b/2.0array([0. ,0.5,1. ,1.5,2. ,2.5,3. ,3.5,4. ,4.5,5. ,5.5,6. ,6.5,7. ,7.5,8. ,8.5,9. ,9.5,10. ,10.5,11. ,11.5,12. ,12.5,13. ,13.5,14. ,14.5,15. ,15.5,16. ,16.5,17. ,17.5,18. ,18.5,19. ,19.5,20. ,20.5,21. ,21.5,22. ,22.5,23. ,23.5,24. ,24.5])# pairwise multiplication, consider the for loop alternative>>> a * b
array([0,2,8,18,32,50,72,98,128,162,200,242,288,338,392,450,512,578,648,722,800,882,968,1058,1152,1250,1352,1458,1568,1682,1800,1922,2048,2178,2312,2450,2592,2738,2888,3042,3200,3362,3528,3698,3872,4050,4232,4418,4608,4802])>>>>>> a = np.arange(0,100,2)>>> b = np.arange(50)>>>>>>sum(a * b)# alternatively, the function np.dot(a,b)80850

In general, you'll want your arrays to be the same shape and size when doing math with them, although there are arcane rules for what to do when they aren't (it may or may not just give an error).

Boolean (logical) Values with Arrays

>>> a = np.zeros(10, dtype=bool)>>> a
array([False,False,False,False,False,False,False,False,False,False], dtype=bool)# Slicing and mass-assignment still works>>> a[2:5]=True>>> a
array([False,False,True,True,True,False,False,False,False,False], dtype=bool)# Comparison and assignment>>> b =(a ==False)>>> b
array([True,True,False,False,False,True,True,True,True,True], dtype=bool)>>> b = -a
>>> b
array([True,True,False,False,False,True,True,True,True,True], dtype=bool)>>> b =not a # It would be nice if this worked, but no such luck
Traceback (most recent call last):
File "<stdin>", line 1,in<module>ValueError: The truth value of an arraywith more than one
element is ambiguous. Use a.any()or a.all()

Basic Statistics with Numpy

NumPy is freaking huge, with around 1200 pages of reference documentation, but all of you will, at some point, use some basic statistics to get a feel for your data. So let's make sure we hit some of those functions:

Summary Statistics

>>> a = np.arange(99)>>> np.mean(a)49.0# Standard Deviation>>> np.std(a)28.577380332470412# Variance>>> np.var(a)816.66666666666674

To get a full catalog of all the possible numpy functions, you can type np., then press "tab" and it will give you a list of possible methods to fill it in with.

Random distributions

>>> a = np.random.uniform(0,100,10)# Low, High, Size of Output.# Another way to write this with a specified number of rows and columns:>>> a = np.random.uniform(0,100,(1,10))#3rd argument is (rows,columns)>>> a
array([59.29058916,6.92921008,91.93188435,45.38511999,34.46457059,41.29478046,33.51351871,42.5493656,50.54698303,19.79408663])>>> a = np.random.uniform(0,100,(3,3))>>> a
array([[12.06197457,90.90232851,51.27616458],[61.12517983,88.14112687,29.36606864],[77.35499074,53.7975086,72.63088545]])>>> a = np.random.uniform(0,100,1000000)>>> np.mean(a)50.022576176522485# exponential sample with mean = e>>> np.mean(np.random.exponential(np.exp(1),1000000))2.7188135956330033

Pylab

In the early exploratory phases of your data analysis, sometimes every second counts when you're trying to convert some idea in your head into code that you can try out. IPython has a mode called Pylab that has lots of things already imported into the main namespace, so you can start playing around with your data without having to import things, and without having to type out numpy. (or even np.) in front of each and every Numpy function you want to use. To get it, simply start up IPython with a flag:

$ ipython --pylab

This starts it up with all the numpy functions imported, as well as lots of Scipy (which I'll tell you about next) and matplotlib (more on this on Friday) set up so that you can do pretty heavy math without a lot of mental overhead remembering which package everything is in. The way I often like to go about things is to figure out what I want to do, try it in IPython with Pylab on, then use the history or save magic words in IPython to get what I did, put it into a file, trim out all the failed approaches, and turn key parts into functions. In that file, I still usually do an import numpy as np, and then go to all the numpy functions i used and change them from, for example, x = array(...) to x = np.array(...). This is because there are a few functions (like min and max) that numpy overrides the built-in versions, and it's good to be explicit about which one you want.

SciPy and Fitting

SciPy (pronounced "Sigh Pie") is a collection of libraries that builds on NumPy, and has lots of convenient, fast functions for working with large amounts of scientific data. It's slightly smaller than NumPy, with only 900-odd pages of documentation. That includes sections on integrating C or Fortran code into Python, which is way outside the scope of this course, but if you ever do get to the point where you need a super-efficient implementation of something, you're covered. Especially in the one-off nature of academic science, you're often better served spending less time writing code that takes longer to run, compared to spending lots and lots of time writing code that runs slightly faster.

The stats module of SciPy has functions for even more statistical distributions, statistical tests, and other assorted functions that a good statistician might need. As an example, let's see how we might use the linregress function, which does a linear regression on some data. Linear regression is the process of finding a line that minimizes the sum of the square of the vertical distances from each point to the line.

First, we'll set up some noisy data:

import numpy as np
slope =.5
intercept = -10
x = np.arange(0,100)
y = slope*x + intercept
noise =5 * np.random.randn(len(x))
y = y + noise

I'll show you how to generate plots like this in later in the week, but if you know any Matlab, it's eerily similar.

This isn't a math class, so we're going to start with the equations for slope, intercept, and correlation coefficient of the best-fit line as given, y = 0.464316064259x - 8.00654988105 (best fit line in green):

Now we just translate the math to Python code:

n =len(x)
m =(n * sum(x * y) - sum(x) * sum(y)) / (n * sum(x**2) - (sum(x))**2)
b =(sum(y) - m * sum(x))/n
r =(n * sum(x * y) - sum(x) * sum(y)) / np.sqrt((n*sum(x**2) - sum(x)**2)
* (n * sum(y**2) - sum(y)**2))print m, b, r

0.486677343735 -9.05040165994 0.928979337505

This gives us pretty much the right result, but it was kind of a pain to type in. If only the libraries had some sort of function that could do linear regression for us...

One other function that you might find useful is the corrcoef function, which gives you a correlation matrix between two data sets. For two 1-D sets like we have (x and y), this will give a 2x2 matrix.

Note that the 0,1 and 1,0 (correlation of x with y, and y with x, respectively) entries of this matrix are the same as the correlation coefficient from before. This two dimensional version of the output is kind of a pain for 2 sets of 1-D data, but it really does make sense with datasets with more variables.

There's a lot more to the stats module alone than we can cover in just one morning, so I'll just point you to the documentation for Scipy: http://docs.scipy.org/doc/scipy/reference/index.html You might also look into the cluster, interpolate, and optimize modules, depending on what exactly you want to do. The others could be helpful as well, but those are among the most likely for biologists.

Performance Enhancing Arrays

Remember how I told you that it's better to write something that's kind of good enough, rather than spending your time optimizing the code, as long as the "kind of good enough" version is a lot faster to write? Well, it turns out with arrays we get to have our cake, and eat it too! Not only is the code a lot faster to write, since you don't need to do explicit for loops, it also runs a lot faster. Let's investigate this by generating two lists (or arrays) with 10 million random numbers, then add them together:

importrandom
list1 =[random.randrange(0,100)for i inrange(int(1e7))]
list2 =[random.randrange(0,100)for i inrange(int(1e7))]
list3 =[a + b for a, b inzip(list1, list2)]

Note that this is the fast "pure Python" version, since we're using list comprehensions. If you had to append onto a list, it would be even slower!

Because numpy is able to know that everything is going to be an integer, it can do a lot of optimizations to the arrays that it wouldn't be able to do if each element could, conceivably be a different type. Furthermore, a lot of the time is spent doing checks on the input to the randrange function that, because we're using an array, can be done twice, rather than 20 million times.

Tabular data

One of the things that gets boring really fast is reading and writing tables of data, especially in a way that preserves the information we care about. Recently, a really nice package has caught on for doing exactly that, called pandas (short for PANel DATA). Pandas uses numpy internally, so it's really fast, but it also has lots of nice features for dealing with tables that may contain several types of data.

Let's look at an example (which I'll do in IPython, in line with the exploratory nature of poking at someone else's data):

import pandas as pd
calls = pd.read_table('operonCalls.tsv')
calls
# calls is a pandas data type called a DataFrame# which is modeled after an equivalent data type# in the language R
calls['SysName1']# One way to access a column
calls.SysName1# Another way to look at a column
calls.ix[0]# Using .ix is the only good way to look at a rowdef do_something(row):
pass# This function does nothingfor index in calls.index:
do_something(calls.ix[index])# orfor row_num, row in calls.iterrows():
do_something(row)
row.index
row['pOp']# This column is numbers, so that's what it's stored as
row['Name1']# This one is strings, so that easily pops out too!
calls.SysName1=='b0019'# This is the simplest way to do a search
calls.pOp.mean()# You can even do math on the columns#(or, if it's a homogenous table, the rows)
np.array(calls.pOp)# You can convert the pandas types into plain old numpy arrays

Exercises

1) Writing Mathematical Functions
a) Write a function that accepts an array of floats. Return an array where every value of the input array is divided by 1.5.
b) Use the random functions to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation from the mean of the array.
c) Write a function that evaluate the function y = e^x for values from 0 to 10 and stores the values in an array. The function should also generate a random exponential sample of the same length. Sort the random sample, then return the sum of the distances at each point from the random sample to the exponential model.

2) Strings to arrays
So we had this idea that we might be able to find a periodicity in the spacing of pyrimidine residues downstream of the termination site in Rho dependent genes (by and large, we don't). Nevertheless:
a) Make a function that finds all locations of a certain substring (like 'C', or 'CT', or whatever) and returns it as an array.
b) Find the difference between each pair of adjacent substrings
c) Use the BioPython module to read in a fasta file that has lots of 3' UTRs (we'll calculate these later) and uses the previous functions to generate a full list of the spacings between the pyrimidines.
d) Compute the histogram of these spacings (we'll show you how to plot them later).

import numpy as np
##### 1a ######def divide(array):
returnarray/1.5
a = np.arange(0,10)print divide(a)##### 1b ######def getOutliers(array):
mean=np.mean(array)
std=np.std(array)
highCutoff=mean+std
lowCutoff=mean-std
greater =array> highCutoff
less =array< lowCutoff
greater_or_less = greater + less
returnarray[greater_or_less]
a = np.random.uniform(0,100,1000)print getOutliers(a)#### OR #####def getOutliers(array):
mean=np.mean(array)
std=np.std(array)
highCutoff=mean+std
lowCutoff=mean-std
outliers=[]for i inarray:
if i > highCutoff or i < lowCutoff:
outliers.append(i)return np.array(outliers)##### 1c ######****def yex():
a = np.arange(0,11)
yEx = np.exp(a)# Generate a random exponential sample of length 11# Try np.random."tab" to get a list of possible random functions
z= np.random.uniform(0,10,len(a))
randomExp = exp(z)# OR# z = np.random.exponential(1,11)
sortedarray = np.sort(randomExp)# Calculate distances at each point from random sample to exponential model
distance =sum(yEx - sortedarray)return distance

2) Strings to arrays

###### 2a #########def findsubstring(string,Substring):
fullLength=len(string)
length =len(Substring)
posList=[]for pos,letter inenumerate(string):
ifstring[pos:pos+length]==Substring:
posList.append(pos)return np.array(posList)
x = findsubstring('ACTAGGGCTAATAGATTACGGACTATG','CT')print x
###### 2b #########
difference = np.diff(x)###### 2c ########## To download a fasta file of all 3' UTRs, try flybase: ftp://flybase.net/genomes/Drosophila_melanogaster/current/fasta/from Bio import SeqIO
importsys
all_diffs =[]for rec in SeqIO.parse('dmel-2L-three_prime_UTR-r5.57.fasta','fasta'):
# rec.seq will return the sequence, but this must first be converted to a string type!
seq =str(rec.seq)
hits = findsubstring(seq,'C')
diffs = np.diff(hits)
all_diffs.extend(diffs)###### 2c #########print np.histogram(all_diffs,range(0,30))

3) Updating old code

from collections import defaultdict
import pandas as pd
def get_operon_pairs(operon_filename):
# operons = open(operon_filename)
# operons.readline()
# # This skips the header line
operons = pd.read_table(operon_filename)
calls = defaultdict(bool)
# # If no call, it is not an operon
# calls[('rrmJ', 'ftsH')] = True
# calls[('ftsH', 'rrmJ')] = True
for row_num, row in operons:
#line = line.strip()
#elements = line.split('\t')
#gene1 = elements[4]
#gene2 = elements[5]
#is_operon = (elements[6] == "TRUE")
gene1 = row['SysName1']
gene2 = row['SysName2']
is_operon = row['bOp']
calls[(gene1, gene2)] = is_operon
calls[(gene2, gene1)] = is_operon
return calls
operons = get_operon_pairs("operonCalls.txt")
# gene_info = open('geneInfo.txt')
# gene_info.readline()
# # Skip the header line
gene_info = pd.read_table('geneInfo.txt')
current_name = ''
current_start = -1
current_stop = -1
current_strand = '.'
current_operon_names = []
GFF_base = '\t'.join(['Chromosome', 'MicrobesOnline', 'operon',
'%d', # low end position
'%d', # high end position
'.', '%s', # strand
'.', 'operonName "%s";\n'])
outfh = open('operons2.gff', 'w')
genes_thus_far = set()
#for line in gene_info:
for row_num, row in gene_info.iterrows():
#next_line = line.strip()
if 'CRISPR' in row['name'] or 'insE' in row['name'] or 'insC' in row['name']:
continue
# next_elements = line.split('\t')
# next_start = int(next_elements[4])
# next_stop = int(next_elements[5])
# next_strand = next_elements[6]
# next_name = next_elements[8]
next_start = row['start']
next_stop = row['stop']
next_strand = row['strand']
next_name = row['sysName']
next_hname = row['name']
if not operons[(current_name, next_name)]:
if current_name:
# current_name starts out empty, so on the first valid line, this
# won't write out to the file
GFF_line = GFF_base % (min(current_start, current_stop),
max(current_start, current_stop),
current_strand,
'-'.join(current_operon_names))
outfh.write(GFF_line)
# current_operon_names = [next_name]
current_operon_names = [next_hname]
current_start = next_start
current_stop = next_stop
current_strand = next_strand
else: # Genes are in the same operon
if current_strand != next_strand:
print line
# This is just a consistency check.
if current_strand == "+":
current_start = min(current_start, next_start)
current_stop = max(current_stop, next_stop)
# current_operon_names.append(next_name)
current_operon_names.append(next_hname)
elif current_strand == "-":
current_start = max(current_start, next_start)
current_stop = min(current_stop, next_stop)
# current_operon_names.insert(0, next_name)
current_operon_names.insert(0, next_hname)
current_strand = next_strand
current_name = next_name
# Clear out the last one, which has no "next"
GFF_line = GFF_base % (min(current_start, current_stop),
max(current_start, current_stop),
current_strand,
'-'.join(current_operon_names))
outfh.write(GFF_line)
outfh.close()

## Numerical Python (NumPy)

## Topics

arraydata typesnumpyandscipypackagespandasforloops.## Introduction

As Python matured into the multifunctional language it is today, more and more features were added. We've seen most of the original functionality of the language, which was originally intended for scripting of UNIX and web processes, and writing user interfaces. But as Python became popular as a scripting language, it found its way into the arms of scientists, who generally speaking, care more about efficient data manipulation than web programming. The big piece that was missing in Python was a numerical library, which means a collection of tools for dealing with large amounts of numbers at once. After a few efforts, an integrated one was developed. Numerical Python, or NumPy provided a large collection of mathematical functions similar to those found in a language like MATLAB or R, and also similarly to those languages, provides datatypes that are manipulatable as vectors and matrices. We'll introduce these data types, then their associated functions.## NumPy Basics

Numerical Python is a powerful library of functions, methods, and data types we can used to analyze our data. Unforunately for those of us whose heads continue to spin in a crash-course of syntax, it also uses a different set of rules. I hope you'll understand why when you see the power and speed NumPy's data types afford us. Let's start off creating some empty arrays, which look sorta like lists, and are in fact vectors.

They differ in a few fundamental ways from lists:

integers, floats, strings, logical(orboolean) values, or otherimmutablevalues. But they cannot be some characters, some numbers, or any other olio of data types. They also cannot contain mutable types such as lists. So, we can have a list of lists, but not an array of lists. We can, however, have an array of arrays (sortof). Which brings us to:For the some of the lecture today, I'll be using the format

to indicate things that I'm doing in IPython, as opposed to in a code file. It's the same format that the regular python interactive interpreter uses, (rather than In[] and Out[]), but it lets us use

cpaste, which strips out the >>> when interpreting code.## Arrays

Here's one way: start with a list and turn it into an array with thearraymethod:Or this can be shortened:

You know have an array

aof 1 row and 40 columns with zeros. But there's a better way to get a vector of zeros:Note that the default type when declaring an

arrayisfloat64:And here's the simplest way to change that:

And here's how to declare something that's not all zeros:

Notice the

inttype.What if we want a float? There's a couple ways to do it:

Like with

range(), you can also givearange()more parameters:As I said above, you can have arrays with more than one dimension

And you can even modify a particular element with the same syntax, or a subtly different syntax, as our list-of-lists:

So far, the coolest thing I've shown you isn't really that exciting: a range function that can have floats. The real power of arrays is the ability to have one statement affect a large chunk of an array:

Let us pause for a moment and think about how we would do this with a for loop in lists:

We can also take slices of arrays, just as if they were lists:

Maybe you can see the advantage of the array syntax. But wait, there's more! Act now, and we'll throw in math operations for free!

## Vector Math with Arrays

We can do math on many values at once with arrays, no for loop required.In general, you'll want your arrays to be the same shape and size when doing math with them, although there are arcane rules for what to do when they aren't (it may or may not just give an error).

## Boolean (logical) Values with Arrays

## Basic Statistics with Numpy

NumPy is freaking huge, with around 1200 pages of reference documentation, but all of you will, at some point, use some basic statistics to get a feel for your data. So let's make sure we hit some of those functions:

Summary StatisticsTo get a full catalog of all the possible numpy functions, you can type np., then press "tab" and it will give you a list of possible methods to fill it in with.

Random distributions## Pylab

In the early exploratory phases of your data analysis, sometimes every second counts when you're trying to convert some idea in your head into code that you can try out. IPython has a mode called Pylab that has lots of things already imported into the main namespace, so you can start playing around with your data without having to import things, and without having to type out numpy. (or even np.) in front of each and every Numpy function you want to use. To get it, simply start up IPython with a flag:

$

ipython --pylabThis starts it up with all the numpy functions imported, as well as lots of Scipy (which I'll tell you about next) and matplotlib (more on this on Friday) set up so that you can do pretty heavy math without a lot of mental overhead remembering which package everything is in. The way I often like to go about things is to figure out what I want to do, try it in IPython with Pylab on, then use the

historyorsavemagic words in IPython to get what I did, put it into a file, trim out all the failed approaches, and turn key parts into functions. In that file, I still usually do animport numpy as np, and then go to all the numpy functions i used and change them from, for example,x = array(...)tox = np.array(...). This is because there are a few functions (like min and max) that numpy overrides the built-in versions, and it's good to be explicit about which one you want.## SciPy and Fitting

SciPy (pronounced "Sigh Pie") is a collection of libraries that builds on NumPy, and has lots of convenient, fast functions for working with large amounts of scientific data. It's slightly smaller than NumPy, with only 900-odd pages of documentation. That includes sections on integrating C or Fortran code into Python, which is way outside the scope of this course, but if you ever do get to the point where you need a super-efficient implementation of something, you're covered. Especially in the one-off nature of academic science, you're often better served spending less time writing code that takes longer to run, compared to spending lots and lots of time writing code that runs slightly faster.

The stats module of SciPy has functions for even more statistical distributions, statistical tests, and other assorted functions that a good statistician might need. As an example, let's see how we might use the linregress function, which does a linear regression on some data. Linear regression is the process of finding a line that minimizes the sum of the square of the vertical distances from each point to the line.

First, we'll set up some noisy data:

I'll show you how to generate plots like this in later in the week, but if you know any Matlab, it's eerily similar.

This isn't a math class, so we're going to start with the equations for slope, intercept, and correlation coefficient of the best-fit line as given, y = 0.464316064259x - 8.00654988105 (best fit line in green):

Now we just translate the math to Python code:

0.486677343735 -9.05040165994 0.928979337505This gives us pretty much the right result, but it was kind of a pain to type in. If only the libraries had some sort of function that could do linear regression for us...

Regression Slope: 0.486677343735Regression Intercept: -9.05040165994Regression correlation: 0.928979337505R^2:, 0.86300260951p(slope is 0): 4.31945319634e-44One other function that you might find useful is the

corrcoeffunction, which gives you a correlation matrix between two data sets. For two 1-D sets like we have (x and y), this will give a 2x2 matrix.Note that the 0,1 and 1,0 (correlation of x with y, and y with x, respectively) entries of this matrix are the same as the correlation coefficient from before. This two dimensional version of the output is kind of a pain for 2 sets of 1-D data, but it really does make sense with datasets with more variables.

There's a lot more to the stats module alone than we can cover in just one morning, so I'll just point you to the documentation for Scipy: http://docs.scipy.org/doc/scipy/reference/index.html You might also look into the

cluster, interpolate, andoptimizemodules, depending on what exactly you want to do. The others could be helpful as well, but those are among the most likely for biologists.## Performance Enhancing Arrays

Remember how I told you that it's better to write something that's kind of good enough, rather than spending your time optimizing the code, as long as the "kind of good enough" version is a lot faster to write? Well, it turns out with arrays we get to have our cake, and eat it too! Not only is the code a lot faster to write, since you don't need to do explicit

forloops, it alsorunsa lot faster. Let's investigate this by generating two lists (or arrays) with 10 million random numbers, then add them together:Note that this is the

fast"pure Python" version, since we're using list comprehensions. If you had to append onto a list, it would beeven slower!Because numpy is able to know that everything is going to be an integer, it can do a lot of optimizations to the arrays that it wouldn't be able to do if each element could, conceivably be a different type. Furthermore, a lot of the time is spent doing checks on the input to the randrange function that, because we're using an array, can be done twice, rather than 20 million times.

## Tabular data

One of the things that gets boring really fast is reading and writing tables of data, especially in a way that preserves the information we care about. Recently, a really nice package has caught on for doing exactly that, called

pandas(short for PANel DATA). Pandas uses numpy internally, so it's really fast, but it also has lots of nice features for dealing with tables that may contain several types of data.Let's look at an example (which I'll do in IPython, in line with the exploratory nature of poking at someone else's data):

## Exercises

1) Writing Mathematical Functionsa) Write a function that accepts an array of floats. Return an array where every value of the input array is divided by 1.5.

b) Use the random functions to generate an array of floats. Write a function that accepts this array, and returns a list of values that are more than one standard deviation from the mean of the array.

c) Write a function that evaluate the function y = e^x for values from 0 to 10 and stores the values in an array. The function should also generate a random exponential sample of the same length. Sort the random sample, then return the sum of the distances at each point from the random sample to the exponential model.

2) Strings to arraysSo we had this idea that we might be able to find a periodicity in the spacing of pyrimidine residues downstream of the termination site in Rho dependent genes (by and large, we don't). Nevertheless:

a) Make a function that finds all locations of a certain substring (like 'C', or 'CT', or whatever) and returns it as an array.

b) Find the

difference between each pair of adjacent substringsc) Use the BioPython module to read in a fasta file that has lots of 3' UTRs (we'll calculate these later) and uses the previous functions to generate a full list of the spacings between the pyrimidines.

d) Compute the histogram of these spacings (we'll show you how to plot them later).

3) Updating old codeLet's go back and look at the code for computing the operons into a GFF file. How would you modify this code to use Pandas to read in the tab-delimited files?

## Solutions

1) Writing Mathematical Functions2) Strings to arrays3) Updating old code