In Depth RNA-seq analysis

This morning, we left off having all the tools we'd need to do the RNA-seq analysis, and asked you to map the reads for your respective samples. Now we're going to combine our data, and actually try to draw some conclusions about what actually happens when we inhibit Rho.

First, though, let's check the results of Tophat. Whenever I have lab meeting with new sequencing data, I always have a standard table format, where I have
  • # of reads from each sample (for index 3: 30,471,634)
  • # of mapped reads from each sample and % of reads that mapped (29,290,400 (96.1%))
  • Some estimate of the library complexity: # of unique reads, as called by samtools (samtools flagstat index3/accepted_hits.bam)


It's all great to have the reads aligned to the genome, but there's still a few things we need to be aware of. First off, there's thousands of genes in the genome, and we would like to have a single number for each one. That's still a lot of data points, but it becomes tractable to deal with that number. The community has settled on RPKM (Reads per Kilobase per Million Reads) and FPKM (the paired-end equivalent, where one pair of reads is one fragment). This corrects for a few of the most obvious biases one might encounter in RNAseq. By dividing by the length of the gene, you correct for a longer transcript producing more sequencing reads than a shorter one; and by dividing by the number of reads, you correct for different sequencing runs having different numbers of reads, but potentially even being derived from the same library. While the term FPKM is still generally used, there are more robust ways for normalizing read depth between samples than using 'per million reads'. We'll discuss these a bit later.

Let's furthermore say that you are interested in whether differential splicing is happening in your sample. This is an even harder problem because calculating the FPKM for a particular isoform is difficult because you must somehow assign a read that aligns to a common exon to one of multiple isoforms.

Add in the fact that there are a number of biases present in the library preparation process, and you almost start to wonder why you thought RNAseq was a good idea in the first place. At least, you do until someone points out Cufflinks, the last major tool in the Tuxedo suite. Cufflinks does three main things: assemble reads into transcripts (if you care about novel isoforms), assign FPKM values to genes and isoforms, and perform differential expression tests. (see Cufflinks website for workflow:


This tool can assemble transcript structures from aligned reads. Importantly, it can find novel isoforms based on reads that cross splice junctions. We don't care about this for our E. coli dataset so we won't use it, but I'll quickly describe how you might run it for, say, human RNA-seq data where you're interested in finding new things. Cufflinks will also report FPKM values but I don't recommend using them because the normalization procedure is not ideal. Use Cuffquant/Cuffnorm to get FPKMs.

$ cufflinks
Usage: cufflinks [options] <hits.sam>

There are many options Cufflinks can take, including ones like -p (threads) and -o (output directory). You should check --library-type for your dataset. One option I recommend is -g/--GTF-guide <reference_annotation.(gtf/gff)> which uses the known transcript structure to assemble transcripts and help find novel isoforms. The output you'll use going forward would be transcripts.gtf which contains the assembled transcripts for the particular sample you ran Cufflinks on.


The program Cuffmerge merges the GTF files that Cufflinks generates. You first run Cufflinks on each sample independently and then use Cuffmerge to merge them all into one 'universal' transcript assembly to use for comparing sample in future analyses. You should also give Cuffmerge the reference annotation gtf and it will annotate the transcripts with the appropriate gene names, and classify isoforms as known, novel, etc. Cuffmerge is a wrapper script for the tool Cuffcompare (which can also be run independently). Check out the manual information for Cuffcompare to get descriptions of the Cuffmerge output files.

$ cuffmerge
cuffmerge [options]* <assembly_GTF_list.txt>

The options you probably want here are -g/--ref-gtf (the reference annotation) and -s/--ref-sequence (genome sequence used to remove artifacts).


The newest version of Cufflinks (which we just downloaded) has modularized the functions of the Cuffdiff to make the workflow more efficient. Cuffdiff was used to (and still can be) calculate FPKM values and perform differential expression testing. The part that takes a long time is getting the read counts for each isoform and gene (the FPK part), so now you can Cuffquant to do this once, and then run Cuffdiff or Cuffnorm for differential testing or just calculating FPKMs. This is the step we will start with today for our E. coli data.

$ cuffquant
Usage: cuffquant [options]* <annotation.(gtf/gff)> <aligned_reads.(sam/bam)>

Cuffquant takes a gtf. This can be the output of Cuffmerge if you want to use a novel transcript assembly, or a reference annotation which is what we'll use. It also takes the aligned read file and counts up the reads aligning to each transcript in the gtf. There are two options that help correct for known biases in RNA-seq data that I recommend you always use. They are -b/--frag-bias-correct (controls for GC and 5' UTR biases and requires genome sequence) and -u/--multi-read-correct (better weights reads mapping to multiple places in the genome). The output is a human non-readable abundances.cbx file which is then fed into Cuffnorm or Cuffdiff.


Cuffnorm does something very simple - it normalizes the read counts across samples to control for read depth and allow comparisons. Here is the step where you combine all your samples into one file.

$ cuffnorm
cuffnorm [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>... [sampleN.sam_replicate1.sam[,...,sample2_replicateM.sam]]

Cuffnorm again takes in a gtf (either Cuffmerge output or reference annotation) and the reads. While the read files indicate they are SAM files, they should be the abundances.cbx files if you have them. Here you can take into account replicates to get FPKMs per condition if desired. There are three normalization methods available. 'classic-fpkm' is exactly as it sounds, but subject to outlier bias and other artifacts and is not recommended. The other two options are 'geometric' and 'quartile' and both should work. We can use the default 'geometric' today. The output of Cuffnorm is an easily parse-able tab file with expression values for all genes in all conditions. I recommend using the -L/--labels option to keep track of the different samples. You can use the condition names instead of the index names.


Finally we come to probably the most important tool: Cuffdiff. Cuffdiff performs sophisticated statistical testing in order to find significantly differentially expressed genes and isoforms. The output will contain, for each pair of conditions you input, and for each gene, the FPKM values in the conditions, the log(fold change) between the conditions, whether or not the gene had sufficient data for a test (I would ignore anything not tagged 'OK'), and the corrected p-value for significant differences.

$ cuffdiff
cuffdiff [options]* <transcripts.gtf> <sample1_replicate1.sam[,...,sample1_replicateM.sam]> <sample2_replicate1.sam[,...,sample2_replicateM.sam]>... [sampleN.sam_replicate1.sam[,...,sample2_replicateM.sam]]

You'll note the usage is pretty much the same as for Cuffnorm. Again, you should use abundances.cbx files instead of SAM files if you have them. If you're using SAM/BAM files, use the -u and -b options. Use the usual -o and -L options and use the same library type as for quantnorm.


Exercise 1: Find FPKM values for all genes

Run Cuffquant on the accepted_hits.bam output from Tophat using the Ensembl annotation . Remember to use the -b and -u options. The abundances.cxb output files from Cuffquant for each index will be posted on the server as they finish: server

Once you have all 6 files, run Cuffnorm to get FPKM values for all the genes. Visualization and further analyses on this type of data will be covered on Friday.

Exercise 2: Find differentially expressed genes

Run Cuffdiff with the 6 samples as different conditions. The gene_exp.diff file will have a p-value for every pairwise comparison for each gene. From that file, pull the following comparisons into their own files: index3 vs index4, index3 vs index5, index6 vs index8, index6 vs index11.

index3: control, no bicyclomycin (20min treatment)
index4: 20mg/ml bicyclomycin (20min treatment)
index5: 100mg/ml bicyclomycin (20min treatment)
index6: control, no bicyclomycin (60min treatment)
index8: 20mg/ml bicyclomycin (60min treatment)
index11: 100mg/ml bicyclomycin (60min treatment)

For each of these four interesting comparisons, how many genes are significantly differentially increased? How many change by 2-fold or 4-fold? These are potentially Rho-dependent operons (it would be better to just look at the read coverages of the 3' UTRs but those changes may affect overall operon expression). How does the number of operons that increase compare to the number that decrease for each comparison?

Exercise 3: Venn diagrams

How many of the genes differentially expressed in the high drug cases are also differentially expressed in the medium-drug cases? How about 20min treatment vs 60min treatment? Does this give you more or less confidence in the results?

As a hint, the set() datatype will likely be helpful here, since you can calculate the intersections and differences of two lists at a time.

Another thing we can check is that in genes that have a terminator hairpin sequence, the response to Rho is much less pronounced than in those without. Of course, it's probably too much to expect that the every gene with a terminator is rho independent, and every gene without one is dependent, but by and large, that trend should hold true.

Using the list of genes you made from the file from the program GeSTer on the first day of class (here if needed), see if the Rho-dependent genes are enriched for genes without a terminator sequence (get a p-value with Scipy's Fisher's exact test). What percentage of genes with a terminator are Rho-dependent compared to the percentage of genes without a terminator?

Exercise 4: Sequence histograms (optional)

Using the spacing-finder program from Monday, see if you can determine any differences in the pyrimidine spacing of 3' UTRs of Rho-dependent and Rho-independent genes. How would you try to test whether these are significant? (3' UTR gff from this morning here if needed)



cuffquant -o index3_cuffquant/ -b E_coli_genome.fasta Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.gtf index3/accepted_hits.bam
cuffnorm -o cuffnorm_output/ -L 20min_control,20min_20,20min_100,60min_control,60min_20 Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.gtf cuffquant_outputs/index3_abundances.cxb cuffquant_outputs/index4_abundances.cxb cuffquant_outputs/index5_abundances.cxb cuffquant_outputs/index6_abundances.cxb cuffquant_outputs/index8_abundances.cxb


cuffdiff -o cuffdiff_output/ -L 20min_control,20min_20,20min_100,60min_control,60min_20 Escherichia_coli_str_k_12_substr_mg1655.GCA_000005845.1.19.gtf cuffquant_outputs/index3_abundances.cxb cuffquant_outputs/index4_abundances.cxb cuffquant_outputs/index5_abundances.cxb cuffquant_outputs/index6_abundances.cxb cuffquant_outputs/index8_abundances.cxb

Split gene_exp.diff (this will print all combinations):

#!/usr/bin/env python
in_name = 'E_coli_gene_exp.diff'
outbase = 'E_coli_gene_exp.diff'
out_lines = {}
inp = open(in_name, 'r')
head = inp.readline().strip()
lines = inp.readlines()
for line in lines:
 line = line.strip()
 ln = line.split()
 cond1, cond2 = ln[4], ln[5]
 comp = cond1 + "-" + cond2
 if comp in out_lines:
 out_lines[comp] = [line]
for c in out_lines:
 out_file = outbase + "_" + c + ".txt"
 outp = open(out_file, 'w')
 print >>outp, head
 print >>outp, "\n".join(out_lines[c])

Counting program:

#!/usr/bin/env python
import sys
in_file =sys.argv[1]
fh = open(in_file, 'r')
head = fh.readline()
lines = fh.readlines()
sigUp = 0
sigDown = 0
twoUp = 0
twoDown = 0
fourUp = 0
fourDown = 0
totalTested = 0
for line in lines:
 ln = line.strip().split()
 if ln[6] == 'OK':
 totalTested += 1
 ln[9] = float(ln[9])
 if ln[-1] == 'yes' and ln[9] > 0:
 sigUp += 1
 elif ln[-1] == 'yes' and ln[9] < 0:
 sigDown += 1
 if ln[9] > 1:
 twoUp += 1
 if ln[9] > 2:
 fourUp +=1
 elif ln[9] < -1:
 twoDown += 1
 if ln[9] < -2:
 fourDown += 1
print "tested:", totalTested
print "sig increase:", sigUp
print "sig decrease:", sigUp
print "2x increase:", twoUp
print "2x decrease:", twoDown
print "4x increase:", fourUp
print "4x decrease:", fourDown


Get intersection:

#!/usr/bin/env python
import sys
def getSet(filename):
 genes = set()
 fh = open(filename, 'r')
 head = fh.readline()
 lines = fh.readlines()
 for line in lines:
 ln = line.strip().split()
 if ln[6] == 'OK':
 ln[9] = float(ln[9])
 if ln[9] > 1:
 return genes
set_one = getSet(sys.argv[1])
set_two = getSet(sys.argv[2])
print "genes in file one:", len(set_one)
print "genes in file set:", len(set_two)
print "genes in both files:", len(set_one.intersection(set_two))

filter for terminator genes:

#!/usr/bin/env python
import sys
gene_list = []
f1 = open(sys.argv[1])
for l in f1.readlines():
 l = l.strip()
term_genes_out = open("term_genes_" + sys.argv[2], 'w')
no_term_genes_out = open("no_term_genes_" + sys.argv[2], 'w')
f2 = open(sys.argv[2])
head = f2.readline().strip()
print >> term_genes_out, head
print >> no_term_genes_out, head
lines = f2.readlines()
for line in lines:
 line = line.strip()
 ln = line.split()
 if ln[2] in gene_list:
 print >> term_genes_out, line
 print >> no_term_genes_out, line

Are Rho-dependent genes enriched for no-terminator genes?

Contingency table:

Total tested
% Rho+
No terminator


from scipy import stats
cont_table = [[684, 1787], [435, 1446]]
oddsratio, pvalue = stats.fisher_exact(cont_table)
print pvalue