SNP calling: finding polymorphism

Although in our current project, we are not interested per se in finding where the sample sequence differs from the reference sequence. However, in many kinds of experiments and analyses, identifying variable sites is a key step in the analysis pipeline. For example, if you might be conducting an experiment to find the genetic basis of a mutant phenotype in which some samples have a mutant phenotype (let's call these guys cases) and other samples don't (let's call these guys controls). One option in this case is to do an association study, which looks for all the variable sites where one allele is enriched in the cases and the other allele is enriched in the controls. Another, increasing common application of sequencing data is to look at sequence variation in natural populations. This kind of data can be extremely informative about the evolutionary and demographic history of a species.

Errors or SNPs?

If you were guaranteed that every read were 100% accurate, there would be no SNP calling problem. However, because some of the calls that map to a particular site may be errors, we need some way of figuring out whether or not the site is truly a SNP.

Let's first consider a haploid organism. This is both most relevant to this class, since the data we have is from E. coli, but also an easy place to start. Let's take a look at a little bit of a reference sequence and then the reads that map to it. I'm going to write the data in a similar form to what's called a pileup file; you'll see a proper one later.
The first column is the reference allele while the second column summarizes the nucleotides from the reads. For example, in the first row, the reference allele is an A and you obtained 8 reads that had an A at that position and 1 read with a C at that position.

The first row is relatively easy: the reference allele is an A, and 8 out of 9 reads are A. Probably that C is an error, and we would think that this site does not have a SNP in this individual.

The second row is the opposite: the reference allele is G, but only 1 out of the 9 reads is a G. Instead, because every other read has a C, we might think that this site is a SNP and this sample has a C, instead of a G.

The last row is a hard case: the reference is a C, but only 4 out of the 9 reads is a C. However, unlike before, the other reads are roughly evenly split between 3 Gs and 2 As. What do you do in this case?

One approach is to adopt a statistical framework that attempts to estimate error rates and makes use of all the data. This is relatively complicated and we won't talk about it here. Another approach makes implicit use of prior knowledge you may have. First of all, in a haploid you expect only 1 allele at any site; so if there aren't enough reads with the same nucleotide, it might be wise to ignore that site. Second, assuming that your reference and your sample are not too diverged, you should expect SNPs to be relatively rare. Even between humans and chimp you expect only about 1 site in 100 to be different! So you probably want to put more weight on a site not being a SNP. Then, you can simply choose an arbitrary cutoff and declare a site a SNP or not.

A diploid organism can be much harder, because you have to deal with heterozygous sites. In a perfect world, your reads would come equally from both copies of the allele, but that isn't always the case. Consider the following sites:
The first site has 4 Gs and 6 Cs. This is reasonably close to 50% G and 50% C, exactly what you would expect from a heterozygous site. But what about the second site? It has only two Ts but eight Gs. Is that possibly a heterozygous site, or a site homozygous for the non-reference allele? The final site is once again a mess.

Again, a statistical framework can estimate error rates, sequencing bias, and model the process that assigns alleles to reads. However, another option is to again make arbitrary, but informed, cutoffs.

Pre-packaged SNP calling

There are a number of software packages that will perform SNP calling for you. One of the simplest to use is samtools. Samtools uses a relatively simple command line that takes as an input a indexed fasta file and a sorted bam file (we'll talk about that in a moment). The general framework can be found here.

The first thing you need to do to use samtools is index your fasta file using samtools faidx:

$ samtools faidx E_coli_genome.fasta
$ head E_coli_genome.fasta.fai
Chromosome 4639675 77 60 61

As you can see, running samtools faidx results in a file with the same name as the input fasta but with ".fai" tagged on the end of it. The faidx file simply shows the name and size of the sequence, the place in the file where the sequence begins and the number of characters per line. Using this information it is quite easy to get an arbitrary sub-sequence if you know the start and end positions of that subsequence.

The other task is to generate a sorted bam file with your reads. A bam file is simply a binary version of a sam file and samtools plays more nicely with them. To generate a bam file from a sam file of mapped reads, we have to first convert our sam file file into a bam file and then sort it. To conver the sam file into a bam file, use samtools view:

$ samtools view -Sb E_coli_mapped_reads.sam > E_coli_mapped_reads.bam
[samopen] SAM header is present: 2 sequences.

These bam files are NOT human-readable. the -Sb option tells it that the input is in Sam format and that it should output in bam format. To sort the bam file, use samtools sort:

$ samtools sort E_coli_mapped_reads.bam E_coli_mapped_reads.sorted

which results in a file E_coli_mapped_reads.sorted.bam. The first argument is the input bam file and the second argument is the prefix for the output bam file (in other words, the output will be called prefix.bam).

Now we can finally call SNPs! We do this using a series of programs. The first one, samtools mpileup, will be used to generate bcf file. However, first it's instructive to take a brief look at a pileup file:

$ samtools mpileup -f E_coli_genome.fasta E_coli_mapped_reads.sorted.bam > E_coli_mapped_reads.pileup
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

the -f option simply means that the reference genome is given as a fasta file. Let's take a brief look inside E_coli_mapped_reads.pileup:

Chromosome 21150 G 17 c,,,,,,,,,,,,,,,^~, +JIIJJGGJJH@JCJHC
Chromosome 21151 T 17 ,,,,,,,,,,,,,,,,, AIJGIJBFJJE;J>JGF
Chromosome 21152 C 17 ,,,,,,,,,,,,,,,,, GGJJJJGHJJJDJEJHH
Chromosome 21153 A 17 ,,,,,,,,,,,,,,,,, EJJJJJIEJJJIJHJCH
Chromosome 21154 A 17 ,,,,,,,,,,,,,,,,, >IIIJJIJJIJEJDJFE
Chromosome 21155 T 17 ,,,,,,,,,,,,,,,,, FJIIJJJIJIJEJ@IGF

The first column is the chromosome, followed by the position, the reference allele and the coverage. The next two columns tell you about the reads at that position: , signifies matching reference on forward strand while . is matching reference on reverse strand; similarly upper and lower case letters indicate a read from a non-reference allele on either the forward or reverse strand, respectively. There are a few other characters, for example indicating the beginning of a read, but this was just meant to be a short detour rather than a full exploration of the pileup format.

However for actual SNP calling we need this to be in bcf format. This is done using an almost identical command line, the only difference being that the option should be "-Suf" instead of simply "-f". The "S" will tell samtools to compute per-site strand bias values, which we'll come back to later.

$ samtools mpileup -Suf E_coli_genome.fasta E_coli_mapped_reads.sorted.bam > E_coli_mapped_reads.bcf
[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

Now, we use bcftools view to do all the genotype calling. bcftools uses a statistical method to call SNPs.

$ bcftools view -vg E_coli_mapped_reads.bcf > E_coli.vcf
[afs] 0:279.924 1:4.392 2:3.684

The options -vg tell it to output only variable sites and to call genotypes. Calling genotypes is a bit weird for E. coli, since it's haploid, but this can be a good practice: sites that are called heterozygous might be indicative of low quality sites or that you are not sequencing a clonal culture. The output is stored in a vcf. "vcf" stands for variant call format, and is a compressed way to store variant information. Taking a look inside, first there is a long header:

##samtoolsVersion=0.1.17 (r973:277)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=-1,Type=Integer,Description="List of Phred-scaled genotype likelihoods, number of values is (#ALT+1)*(#ALT+2)/2">

and then some lines indicating the variants
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT E_coli_mapped_reads.sorted.bam
Chromosome      2863059 .       T       C       3.55    .       DP=1;AF1=1;AC1=2;DP4=0,0,0,1;MQ=42;FQ=-30       GT:PL:SP:GQ     0/1:31,3,0:0:4
This looks complicated but it's rather simple. First let's look at the line describing the variant. The first column is the chromosome, then the position. The next column is an ID for that SNP, for example if you had a list of known SNPs. Finally you get the good stuff: the reference allele and the alternative allele. In this case, samtools thinks that while the reference is a T at position 2863059 in the E. coli chromosome, our sample actually has a C. The rest of the fields give you some information about the SNP. For example, the field that starts with "DP=1" is called the info field. If you want to know what all the things there mean, take a look up at the header. For example, this line
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
means that "DP=1" indicates that only 1 read mapped to this position in the genome. The last column tells you the actual genotype, as well as the genotype likelihood scores (essentially -log(probabilty of a given genotype)). So here it thinks that the E. coli is heterozygous for the alternative allele.

Your vcf file contains all the information about variant positions in your reference genome, but doesn't usually contain any information about the genomic positions that did not vary. This may not matter if you are only concerned with variant positions - say if you are trying to track down the mutation underlying a new phenotype. However, there are many cases where you might want to know the most likely sequence of all the bases in the genome, not just the variant ones.

To generate a consensus sequence for your organism, you can use another part of vcfutils - a library of perl scripts used to work with vcf files. Since we are looking not just at variant positions, we won't run vcftools with the '-v' flag, so that it outputs every position in the genome

bcftools view -cg E_coli_mapped_reads.bcf | vcf2fq > E_coli_genome_cns.fq

Population genomics

Analyzing genomic variation within and between populations is one of the most powerful uses of high-throughput sequencing. We are able to sequence more individuals to higher quality than ever before and so we can get a genome-wide understanding of variation within populations.

Imagine that you have found all the variable sites in a set of individuals. Although you have all the sequence for each sample, you really only care about those places where at least one sample differs from one other sample. Below is a cartoon of a small dataset with 3 chromosomes. Variant sites are marked with a red x, the rest of the chromosome is assumed to be the same between all individuals. If two chromosomes have a variant that (roughly) line up vertically, they have the same variant at that site.

There are three summaries of population genomic variation that can act as a first pass at exploratory data analysis. Two of these summaries are just single numbers, and one of them is a bit more complicated. The two simplest summaries are the number of segregating sites, S, and the average number of pairwise differences, pi. Let's walk through calculating those on the example data above.

S is exactly what it sounds like: simply count up the number of sites have at least one chromosome with a variant,


so that S is equal to 6 in this case.

Calculating pi is a little bit harder. First, compare every pair of sequences and ask how many differences there are between them. Then, take the average of this number over all the comparisons you did,


so that in this case,


Individual list:
1. Download the above VCF file containing variants from human chromosome 21 in various human populations. Individuals are labeled with anonymous identifiers, so you will also need to download a list that tells which population each individual is from. Then,
a. Compute pi and S within each population SEPARATELY. Note that because these are diploids, you can just select a random allele from each individual for calculating pi (you may need to look at the module "random" for this purpose, specifically the function random.randint).
b. Compute the average distance between a random Yoruban (YRI) and European individual (CEU). How does it compare to within population diversity?

2. Using samtools, call SNPs in the reads from yesterday. How many heterozygous sites are called? Why might there be "heterozygous sites" in a haploid genome? Make a vcf file with just the variable sites in it.