How to make fancy figures


Okay, there's a lot more to this than we can realistically cover in one lecture, but there are lots of things you can do to make really cool figures in Python. We're going to be using matplotlib, which is a plotting library that took a lot of the plotting functionality from the popular MATLAB software, re-wrote it in Python, and (in my opinion) made it about 10 times saner and easier to use.

Nevertheless, we'll try to cover the basics, and give you a good enough understanding of how everything is set up so you can look at the extensive gallery of examples and figure out how to make similar plots with your own data.

Basic plotting

There are two main ways we could go through this plotting lecture. One is by good ole' fashioned scripting in aquamacs, and the other is through the "pylab" mode of IPython. IPython in pylab mode is meant to be set up pretty close to the MATLAB environment, in terms of plotting and doing other math heavy stuff. To use the pylab mode of ipython, type:

$ ipython --pylab

Pylab will automatically load matplotlib and will automatically show the graphs that you plot. But if you are scripting this in aquamacs, you actually want to import the MATLAB plotting library. For today, we'll work primarily out of aquamacs, and use ipython as a quick reference. This is just my personal preference; if you prefer to work out of one or the other, go for it!



Informative Interlude: Differences between IPython with and without --pylab


If you start IPython without the --pylab flag, and you discover that you want to plot things, you have a couple options. By far the best of these is to use the %pylab magic word, which will load all the pylab related things that you need. Next best is to use from pylab import *, which will load all the plotting functions, but they won't work quite properly for interactive plotting. Instead, every time you want to see what you've plotted, you'd need to use the show() function, which will block anything else you do until you close the window. If you want to have a program that does its plotting, then saves the figures out, then you can do something like the above, where you have

import numpy as np
import matplotlib.pyplot as plt
 
# Plotting code here
# ...
# ...
 

In a perfect world, someone ought to be able to run a script and have almost all their figures for a paper just pop out, with only relatively minor tweaking. Then, if that code were available too, it should be possible for a reasonably savvy reviewer to see that all the data is on the up-and-up.



Now let's start plotting! Let's say we have some data that's approximately a line, but there's some noise in it. Let's plot it:

#!/usr/bin/env python
 
# 9.2-1.py
 
import numpy as np
import matplotlib.pyplot as plt
 
x = np.arange(0,100)
y = 0.5 * x + 5 + 10*np.random.uniform(0,3,len(x))
plt.plot(x,y)
plt.show()
 

That was pretty easy! Now, let's say we don't want to have lines connecting each data point, but instead just a marker. We can look at the documentation for plot like we would anything else in IPython:

In [6]: plot?
 
Type: function
Base Class: <type 'function'>
String Form:<function plot at 0x6312770>
Namespace: Interactive
File: /Library/Frameworks/.../site-packages/matplotlib/pyplot.py
Definition: plot(*args, **kwargs)
Docstring:
Plot lines and/or markers to the
:class:`~matplotlib.axes.Axes`. *args* is a variable length
argument, allowing for multiple *x*, *y* pairs with an
optional format string. For example, each of the following is
legal::
 plot(x, y) # plot x and y using default line style and color
 plot(x, y, 'bo') # plot x and y using blue circle markers
 plot(y) # plot y using x as index array 0..N-1
 plot(y, 'r+') # ditto, but with red plusses
 
If *x* and/or *y* is 2-dimensional, then the corresponding columns
will be plotted.
 
An arbitrary number of *x*, *y*, *fmt* groups can be
specified, as in::
 a.plot(x1, y1, 'g^', x2, y2, 'g-')
Return value is a list of lines that were added.
The following format string characters are accepted to control
the line style or marker:
================ ===============================
character description
================ ===============================
``'-'`` solid line style
``'--'`` dashed line style
``'-.'`` dash-dot line style
``':'`` dotted line style
``'.'`` point marker
``','`` pixel marker
``'o'`` circle marker
``'v'`` triangle_down marker
``'^'`` triangle_up marker
...

You can see that there's a lot of different things you can do for something as simple as plotting... Markers, colors, lines. If you keep reading, you can even incorporate labels for the lines. Let's try this code, now, and see what it looks like:

x = np.arange(0,100)
y = 0.5 * x + 5 + 10*np.random.uniform(0,3,len(x))
plt.plot(x,y, 'bo')
plt.show()
 
from scipy import stats
 
r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)
plt.plot(x, x * r_slope + r_int, 'g-.', label='Dash-dotted Line')
plt.show()


Hey, what about the labels? To incorporate these, use the legend() function.

## To add label to the graph, call the legend function and add the kwarg "label" to the plot function.
plt.plot(x, x * r_slope + r_int, 'g-.',label='Dash-dotted Line')
plt.legend()
plt.show()
 

Or, if we decide that we don't like the labels that we gave it before, we can put a list of labels into legend(). This case, we're plotting two lines, and each line will take the label corresponding to the string in the position in the list by the order in which it's plotted:

x = np.arange(0,100)
y = 0.5 * x + 5 + 10*np.random.uniform(0,3,len(x))
r_slope, r_int, r_rval, r_pval, r_stderr = stats.linregress(x, y)
 
plt.plot(x, y, x * r_slope + r_int, 'g-.',label='Dash-dotted Line')
plt.legend(["Noisy data", "Linear regression"])
plt.show()
 

Here are some other tweaks in the call to the legend() function:

plt.legend(["Noisy data", "Linear regression"], loc='lower right', numpoints=1,fancybox=True, shadow=True)
 
If you want to look at the full list of options you can set with the legend() function, try plt.legend? in ipython.

Ok, so let's say you've spent all this time and you're reasonably satisfied with the figure you've created. To save the figure into a file, use the savefig function:

plt.savefig('filename.png',format='png')
#OR
plt.savefig('filename.pdf',format='pdf')
 

Pretty cool! You can load your data, graph it in the way that you want, and then save that figure, ready to go, or import into Illustrator or any other image editor of your choice for further editing.

Making our own plotting functions


You know how in papers, they will sometimes have a kind of fancy figure, and then they'll have things in the same style, but for a bunch of different ways of slicing and dicing their data? It's really pretty effective scientific story-telling. It allows them to connect all those figures together conceptually, and readers only have to look for the relevant differences.

The thing is, if you're going to actually make those figures, it can be annoying to tweak the plots in the same way every time. Fortunately, we've spent almost two weeks now taking boring things that a person could do and making the computer automate them.

Creating a scatter plot


Yesterday, we had you run cuffquant, cuffnorm, and cuffdiff on your RNA-seq samples. Cuffnorm returned an output file that contained a list of each transcript and its FPKM value across the different treatments. Let's say you are interested in what the correlation is between FPKM values between samples (to compare replicates, let's say, or even different treatment conditions). Let's create a function that will take the cuffnorm output and will make a scatter plot of the FPKM values in one sample vs another.

#!/usr/bin/evn python
#9.2-2.py
 
import numpy as np
import matplotlib.pyplot as plt
import sys
 
 
# Function to parse and plot the data
def scatterFPKMs(table,sample1,sample2):
    # Parse table
    tableFile = open(table,'r')
    header=tableFile.readline().strip()
    colNames=header.split()
    if sample1 in colNames:
        idx1=colNames.index(sample1)
    else:
        return 'Could not find',sample1,'!. Please try again!'
    if sample2 in colNames:
        idx2=colNames.index(sample2)
    else:
        return 'Could not find',sample2,'!. Please try again!'
    FPKM_sample1 = []
    FPKM_sample2 = []
    for line in tableFile:
        line_parts=line.split()
        FPKM1 = float(line_parts[idx1])
        FPKM_sample1.append(FPKM1)
        FPKM2 = float(line_parts[idx2])
        FPKM_sample2.append(FPKM2)
        # Add values to the array
    plt.scatter(FPKM_sample1,FPKM_sample2)
    plt.show()
 
 
scatterFPKMs(sys.argv[1],'control_20min_0','20bicyclomycin_20min_0')
 

Looks good! Except all the interesting business seems to be going on in the lower left hand corner. Next, we'll talk about some tweaks we can make to be able to view things in a clearer way. But first...


Informative Interlude: How Matplotlib is laid out


As long as we're grabbing information from the axes, it's worth spending a few moments talking about how Matplotlib is organized. We're going to use the code from the Log plots example to make a pretty looking set of pixels:

FigureAxes.png


The window that this is being plotted in corresponds to a Figure. This is everything inside the window (but not the tools on the bottom), and when you want to save an image to the disk (so you can include it in your manuscript), this is what actually gets saved. Figures control things like the size of the image if you print it out, and the resolution (for on screen, something like 72 dpi is fine, but if you're printing, you want it to be more like 300).

A Figure can contain zero or more sets of Axes, which are the subplots. In this case, we have four. A set of Axes is usually what you'll want to be trying to modify. Axes have properties like x and y limits, a set of major (and sometimes minor) ticks for the x and y axes, an optional legend, and lots more data. Furthermore, each axis has its own plotting commands, which are very similar to the top-level commands, but lets you be certain that they're going within the same Axes. This is particularly important if you have several subplots going on, each of which could conceivably receive the plotting data.

This is one of those places where we can't spend all our time talking about every single feature available, but the inline documentation is pretty good, and the gallery of examples is also really helpful if you kinda know what you want to do.




#!/usr/bin/evn python
 
# 9.2-3.py
 
import numpy as np
import matplotlib.pyplot as plt
import sys
 
# Let's change the function so that the axes are in log-scale
 
# Function to parse and plot the data
def scatterFPKMs(table,sample1,sample2):
 
    # Parse table
    tableFile = open(table,'r')
    header=tableFile.readline().strip()
    colNames=header.split()
    if sample1 in colNames:
        idx1=colNames.index(sample1)
    else:
        return 'Could not find',sample1,'!. Please try again!'
    if sample2 in colNames:
        idx2=colNames.index(sample2)
    else:
        return 'Could not find',sample2,'!. Please try again!'
    FPKM_sample1 = []
    FPKM_sample2 = []
    for line in tableFile:
        line_parts=line.split()
        FPKM1 = float(line_parts[idx1])
        FPKM_sample1.append(FPKM1)
        FPKM2 = float(line_parts[idx2])
        FPKM_sample2.append(FPKM2)
    plt.scatter(FPKM_sample1,FPKM_sample2)
     # To be able to set more parameters, we can get use the "get current axes" or gca() function
     # plt.gca? to get more info
     # or frame."tab" to see all possiblities
     # For example, try frame.axes.errorbar?
    frame = plt.gca()
    frame.axes.set_yscale('log')
    frame.axes.set_xscale('log')
    # Set limits on x-axis. Because the axis is log-scale, the min must be > 0
    plt.xlim(1,1000000)
    # Set limits on y-axis
    plt.ylim(1,1000000)
    plt.show()
 
scatterFPKMs(sys.argv[1],'control_20min_0','20bicyclomycin_20min_0')
 

Plotting a histogram of all RPKMs/FPKMs


Another thing that would be good to know is the overall distribution of expression values across all genes in the genome. Let's create a function that will take a sample name from this cuffnorm output table and plot a histogram of the FPKM values in that sample:

#!/usr/bin/env python
 
# 9.2-4.py
 
import numpy as np
import matplotlib.pyplot as plt
import sys
 
# Let's modify the function to grab a set of just one FPKM values.
# You could also do this using pandas
 
# Function to parse and plot the data
def scatterFPKMs(table,sample1):
 
    # Parse table
    tableFile = open(table,'r')
    header=tableFile.readline().strip()
    colNames=header.split()
    if sample1 in colNames:
        idx=colNames.index(sample1)
    else:
        return 'Could not find',sample1,'!. Please try again!'
    FPKMs = []
    for line in tableFile:
        line_parts=line.split()
        FPKM = float(line_parts[idx])
        FPKMs.append(FPKM)
    plt.hist(FPKMs,bins=30,range=(0,1000))
    plt.show()
 
scatterFPKMs(sys.argv[1],'control_20min_0')
 
 


Another approach would be to calculate the histogram values, then plot() them as a sort of density function. You can use the return values of either histogram (which only calculates the histogram) or hist (which calculates and draws the histogram) to get out what you need.

Finally, you can even set axis ticks and you can draw partially transparent graphs. To do the first, you can again use the gca() function to set a frame variable, then use the methods below to set x-ticks. To make a graph transparent (to enable overlapping graphs, for example), you can pass in the keyword argument "alpha" to hist(), then the bars are drawn partially transparent.

#!/usr/bin/env python
 
# 9.2-5.py
 
import numpy as np
import matplotlib.pyplot as plt
import sys
 
# Function to parse and plot the data
def scatterFPKMs(table,sample1):
 
    # Parse table
    tableFile = open(table,'r')
    header=tableFile.readline().strip()
    colNames=header.split()
    if sample1 in colNames:
        idx=colNames.index(sample1)
    else:
        return 'Could not find',sample1,'!. Please try again!'
    FPKMs = []
    for line in tableFile:
        line_parts=line.split()
        FPKM = float(line_parts[idx])
        FPKMs.append(FPKM)
    # We can set the transparency level of the graph with the 'alpha' argument. 0 = fully transparent, 1 = opaque.
    # bins can also be a sequence of numbers
    plt.hist(FPKMs,bins=range(0,1050,50),range=(0,1000),alpha=0.5)
 
    # Again, we can use gca() to get the axis instance and set parameter
    frame = plt.gca()
    frame.axes.get_xaxis().set_ticks(range(0,1050,50))
    plt.show()
 
scatterFPKMs(sys.argv[1],'control_20min_0')
 
 

Alpha can be any number between 0 and 1, where 0 is fully transparent (in which case why are you even drawing it?), and 1 is fully opaque (this is the default). Most of the matplotlib functions know how to deal with this alpha property, so that can sometimes be useful when your plots start getting visually busy.

One final note: the matplotlib documentation can be immensely helpful in aiding you to make a plot that fits your science needs. Not only is there a gallery of example plots, but there are also demo scripts that give you the code for how to make these plots. These demo scripts can easily be adapted for your own plotting purposes. No need to try and remember all of these functions and methods! Plus, unless you make the same kind of plot many many times, it would be virtually impossible to remember all of the different methods and functions that this plotting library contains.

Happy plotting!

Exercises



Exercise 1

1a) We've learned a lot about how to make graphs and how to manipulate certain parameters to our liking. But what if we want multiple panels in our graph? Let's say we want to compare the scatter plots of more than one pair-wise comparison in the same figure. Are we doomed to making individual graphs and then combining them in illustrator?! Of course not! That's what the subplot() function is for. Use "plt.subplot?" to read about the subplot function and how it works. Then, make a single figure that contains separate scatter plots, in different colors, for the FPKMs of the following samples:
1. 0mg/ml Drug, 20 min x 0mg/ml Drug, 60 min
2. 0mg/ml Drug, 20 min x 20mg/ml Drug 20 min
3. 0mg/ml Drug x 100mg/ml Drug 20 min
4. 20mg/ml Drug 20 min x 100mg/ml Drug 20 min

Save this plot as an image file in the format of your choice. To see all the different possible formats, try plt.savefig? in python.

1b) Add labels to each subplot in part (a) and a "Figure title" to your plot.

Exercise 2

(a) Tomorrow, we'll learn how to use python for clustering analysis. Then, we'll use the clustered data to make a heatmap of all the differentially expressed genes in our RNA-seq experiment. To prepare for that, however, we need to do a little more parsing. Using the cuffdiff output file, create a tab-delimited text file where each row is the gene, and each column is the fold-change (NOT the log2fold change) in FPKM values for the following two comparisons. Exclude genes in which a test was not performed ("NOTEST").

(i) 0mg/ml Bicyclomycin 20 min vs 20mg/ml Bicyclomycin 20min
(ii) 0mg/ml Bicyclomycin 20min vs 100 Bicyclomycin mg/ml 20 min

(b) Write another script that takes the above output table as input and creates an array of (n-genes) x 2 of the different FPKM values. Make sure you include a way to reference the identities of the gene positions in the array.



Solutions


Exercise 1

#!/usr/bin/env python
 
import numpy as np
import matplotlib.pyplot as plt
import sys
 
 
# This time, let's split up the parsing function and the graphing function for simplicity.
 
# Function to parse the data. This will return a dictionary, where each sample is the value and keys are lists of the FPKMs for that sample.
def parseCounts(table):
    FPKM_dict={}
    # Parse table
    tableFile = open(table,'r')
    header=tableFile.readline().strip()
    sampleNames=header.split()[1:]
    condition_dict={}
    for i,name in enumerate(sampleNames):
        condition_dict[i]=name
        FPKM_dict[name]=[]
    for line in tableFile:
        line_parts=map(float,line.split()[1:])
        for ind,fpkm in enumerate(line_parts):
            condition=condition_dict[ind]
            FPKM_dict[condition].append(fpkm)
    return FPKM_dict
 
x = parseCounts(sys.argv[1])
 
conditionsToGraph=[('control_20min_0','control_60min_0'),('control_20min_0','20bicyclomycin_20min_0'),('control_20min_0','100bicyclomycin_20min_0'),('20bicyclomycin_20min_0','100bicyclomycin_20min_0')]
 
colorList=['red','orange','green','purple']
 
# Create a "figure" object to set the title of the whole figure.
 
fig = plt.figure("FPKM Scatter Plots")
 
for ind,pair in enumerate(conditionsToGraph):
    cond1_fpkm=x[pair[0]]
    cond2_fpkm=x[pair[1]]
    # Use the .add_subplot method on the fig object to add a subplot.
    # This returns an axes instance just like the above plt.subplot(n,n,n)
    ax = fig.add_subplot(2,2,ind+1)
    ax.scatter(cond1_fpkm,cond2_fpkm,color=colorList[ind],edgecolor='black')
    ax.set_ylim(1,1000000)
    ax.set_xlim(1,1000000)
     # Use ax.set_title to add a title to the current axes instance
    ax.set_title(pair[0]+' vs '+pair[1])
    ax.set_yscale('log')
    ax.set_xscale('log')
 
plt.savefig('Ex1.pdf',format='pdf')
 

Exercise 2

#!/usr/bin/env python
 
import numpy as np
import pandas as pd
 
# 9.2_sol2.py
 
#2 a
 
# Write a function to parse the cuffdiff output table for the conditions we want
# This function takes the following inputs: cuffdiff output, a list of tuples containing the pair of comparisons it wants us to return
# Returns a table with rows = gene names, columns = fold-change values for each comparison.
 
def getFoldChanges(diffTable,tupleList):
    # Open diffTable for parsing
    diffFile=open(diffTable,'r')
    header=diffFile.readline().split()
    geneID_dict={}
    for line in diffFile:
        parts=line.split()
        gene=parts[2]
        comparison=parts[4],parts[5]
        revComparison=parts[5],parts[4]
        status = parts[6]
        if (comparison in tupleList) or (revComparison in tupleList):
            try:
                positionInList=tupleList.index(comparison)
            except:
                positionInList=tupleList.index(revComparison)
            if status=='OK':
                # Convert log2fold change to fold-change
                log2fold=parts[9]
                if log2fold=='-inf':
                    log2fold=-10
                elif log2fold=='inf':
                    log2fold=10
                else:
                    log2fold=float(parts[9])
                if log2fold > 0:
                    foldChange=str(round(2**log2fold,2))
                else:
                    foldChange=str(round(1/(2**log2fold,2))
                if gene in geneID_dict:
                    pass
                else:
                    geneID_dict[gene]=['NoTest']*len(tupleList)
                    geneID_dict[gene][positionInList]=foldChange
 
    diffFile.close()
    listOfKeysToRemove=[]
    for key in geneID_dict:
        value=geneID_dict[key]
        if 'NoTest' in value:
            listOfKeysToRemove.append(key)
    for k in listOfKeysToRemove:
        del geneID_dict[k]
    return geneID_dict
 
foldChangesDict = getFoldChanges('E_coli_gene_exp.diff',[('control_20min','20bicyclomycin_20min'),('control_60min','20bicyclomycin_60min')])
 
print foldChangesDict
 
# Now, write out the contents of this dictionary to a tab-delimited file.
 
outFile=open('FoldChangesTable3.tab','w')
outFile.write('\t'.join(['Gene','control_20min vs 20bicyclomycin_20min','control_60min vs 20bicyclomycin_60min'])+'\n')
for key in foldChangesDict:
    listOfChanges=foldChangesDict[key]
    line='\t'.join([key]+listOfChanges)+'\n'
    outFile.write(line)
 
outFile.close()
 
 
#2 b
 
# Function that takes a table and outputs an array of fold-change values.
 
def makeArray(table):
    data = pd.read_table(table)
    # Get column names
    colNames=data.columns.values
    #get the number of rows for the array
    nrows=len(data[colNames[0]])
    # Number of Non-Gene Columns
    dataColumns=len(colNames[1:])
    array=np.zeros((nrows,dataColumns))
    # Create a dictionary of indexes for each gene to be able to reference
    # the row number of each gene
    genes=data[colNames[0]]
    indexGene=dict(enumerate(genes))
    print indexGene
    # Fill in the array
    for ind,name in enumerate(colNames[1:]):
        foldChanges=np.array(map(float,data[name]))
        array[:,ind]+=np.array(foldChanges)
    return array
 
print makeArray('FoldChangesTable.tab')