Topics:

  • Common Errors
  • Error messages from Python
  • Exception Handling (try, except)
  • Print statement debugging

Introduction


"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it." – Brian W. Kernighan

Let's be honest, most of us aren't perfect. And in our zen self-awareness, we are well-served to be sensitive to our predispositions to err in particular ways. We are probably even better served to listen to the computer when it tells us we've made a big mistake. Python can be quite good at communicating our flaws to us, if only we're receptive to the constructive criticism in its spartanly high-contrast black and white print.

This morning we're going to look at common sources of error, see what to look for as feedback from Python, and learn a couple of tricks to both obviate, or if necessary, track down bugs, should they occur.

A few commonly encountered sources of errors include:

  • mis-typed variable names
  • mis-typed variables
  • non-existent variables (or indices in a variable)
  • unexpected input

If you're very lucky, an error will cause Python to give up right away, while others will cause insidious bugs that sneak through unnoticed until you present your work in lab meeting and someone calls you on your exciting, but seemingly impossible (and ultimately bogus) result.

Strategic Initiative 1: Test Early, Test Often


There's an approach to software engineering called "Test Driven Development". In that way of doing things, before you write a function that does something, you write code that tests it, often in a separate file. It's really convenient to use modules to do this, where you could have a test.py file that imports the relevant functions from your other modules and tests them on as many cases as it can. This also has the advantage that you notice fairly quickly when your new code breaks something that used to work. Especially in scientific programming, this isn't always practical to do, but the general ideas hold: testing is good.

Even if you don't do TDD, you can save yourself lots of time by testing frequently. Debugging 100 lines of code is often more than 10 times as hard as debugging 10 lines. Writing a ton of code that generates output without checking if each component works individually does NOT make you a coding rock star; it makes you sloppy.

Strategic Initiative 2: Be verbose


"Errors should never pass silently." -- The Zen of Python, by Tim Peters

One of the easiest ways to debug code is to print out the value of variables at the point things start going wrong. Of course, if you knew where things were going wrong, you would probably know what was going wrong in the first place, so to find this, a divide-and-conquer approach is often fastest. Start out by putting a bunch of print statements in throughout your code, then narrow things down gradually until you're right at the broken line of code.

It's not a bad idea to include these print statements in moderation before you need them. If you think there will be a subset of your data that will fail a logical test, setup your if and else statements to report the incidence of failure.

# lab_members = {name : [status]}
# pretend this dictionary is already filled with data
 
for name in lab_members:
     if lab_members[name] == 'Post-doc':
          get_a_job(name)
          print "Time for this post-doc to get a job..."
     elif lab_members[name] == 'Grad Student':
          do_research()
          print "Be patient, this one's just a grad student."
     else:
          print "Warning:", name, "is not a post-doc nor a grad student!"

Strategic Initiative 3: Be Boring, Be Obvious


"Programs must be written for people to read, and only incidentally for machines to execute." --Structure and Interpretation of Computer Programs Hal Abelson and Gerald Jay Sussman

As you get better at coding, you will start to take shortcuts and combine lines. As soon as something doesn't behave as you expect, you should decompose your compound statements, as this is a common source of error.
import sys
##Call in first gene and print out sequence
myfile = open(sys.argv[1],mode='r')
myfile = myfile.read()
 
##COMPOUND
xx = {int(myfile[5:7].strip()):myfile[7:40].strip()+myfile[40:72].strip()+myfile[72:].strip()}[int(myfile[5:7].strip())][3:8]
print 'Compound:',xx
 
 
##EXPANDED
myl1 = myfile[5:7]
myl1 = myl1.strip()
myl2 = myfile[7:40]
myl2 = myl2.strip()
myl3 = myfile[40:72]
myl3 = myl3.strip()
myl4 = myfile[72:]
myl4 = myl4.strip()
mykey = int(myl1)
myseq1 = myl2+myl3+myl4
mydict = {mykey:myseq1}
xx = mydict[mykey][3:8]
print 'Expanded:', xx

These two code snippets do the same thing, and Python doesn't much care which you use. On the other hand, it's much easier to look at each line of code in the expanded case and see that it does what it should be doing, whereas in the compound form, you could very easily have parentheses that are matched in unexpected ways that throws the whole meaning of things off.

On a related note, it's good practice to comment your code. The Python Style Guide has some good recommendations about comments. Anything you did that required some thinking on your part should ideally be commented, since you might come back to the code weeks or months later and have forgotten why you did it that way. Also try to keep your comments up to date with the code.

Error Messages from Python: Syntactical Mistakes and Others


Much like in spoken languages, Python has rules for what is and is not grammatically allowed. Before running your script, Python will check it to make sure that it does follow all these rules, and will let you know if it doesn't. Keep in mind, though, that it can't catch errors in meaning, so long as your code follows the rules of the language (as a natural language equivalent of things Python can't catch, consider Groucho Marx: "One morning I shot an elephant in my pajamas. How he got into my pajamas I'll never know.").

#!/usr/bin/python
 
while True
    print 'Bioinformatics is AWESOME!'

$ ./errors.py

Traceback (most recent call last):
  File "./errors.py", line 3
    while True
              ^
SyntaxError: invalid syntax

This error message, for its brevity, is quite helpful. We quickly learn the following:

  1. The problem is on line 3 of the file errors.py. This may seem trivial in this case, but when you have nested functions conjured from various modules in various directories, knowing which file to look in is a big, big help.
  2. There is a problem after True. Note that, for syntax errors, the caret will often point somewhere after the place where the actual error is, and sometimes can point quite a bit after, though in this case it's pretty close.
  3. The problem is that there is invalid syntax: we haven't followed the rules of the language. As you've likely seen already, and will no doubt see much more of in the future, there are all kinds of different errors that can show up, usually with a terse message explaining what went wrong.

What is the problem and how is it fixable?

There are lots of other kinds of errors that will almost inevitably crop up in your code. You've even already seen some in our lecture on data structures.

instructors = ['Aisha', 'Peter', 'Diana']
print instructors[10]

$./errors.py

Traceback (most recent call last):
  File "errors.py", line 4, in <module>
    print instructors[10]
IndexError: list index out of range

Again, let's see what we can work out from this error message:
  1. The error is now on line 4 of errors.py.
  2. There's some problem with the expression print instructors[10] (presumably in this case you could have looked that up yourself, but Python does try to be helpful).
  3. That problem is of the type IndexError, and in particular there is a list index out of range.

By now, it should be clear what the problem is (assuming it wasn't clear to you before I even ran it): instructors isn't long enough to have an item at index 10, so Python throws up its hands and gives up. Sometimes, though, we don't want Python to surrender at the first sign of trouble. If only there were some way to help it deal with problems that we can anticipate...

Easier to Ask Forgiveness than Permission: Try and Except


"Errors should never pass silently. Unless explicitly silenced." -- The Zen of Python, by Tim Peters

The try and except statements let us attempt to do something that may or may not work, and then respond appropriately if or when something goes wrong. For example, sometimes the file we want to work with isn't available for some reason (and if you're really lucky, it's only because you forgot to plug in an external hard drive).

fh = open('AwesomeDataFile.txt', 'r')

$ ./errors.py

Traceback (most recent call last):
  File "badsyntax.py", line 4, in <module>
    fh = open('AwesomeDataFile.txt', 'r')
IOError: [Errno 2] No such file or directory: 'AwesomeDataFile.txt'

try:
    f1 = open('AwesomeDataFile.txt', 'r')
except IOError:
    print "The file 'AwesomeDataFile.txt' doesn't exist!"
 

Sometimes you might have a section of code where multiple things could go wrong. In that case, you can respond to each of them appropriately using multiple except statements. It's even possible to have an except statement without listing the type of error, which will deal with anything not handled by one of the previous except statements.

try:
    do_something()
except IOError:
    handle_ioerror()
except IndexError:
    handle_indexerror()
except:
    print "Some other error happened"

import sys
 
def dividetwo(num):
    return 2/num
 
try:
    print dividetwo(float(sys.argv[1]))
except TypeError:
    print 'not a float!'
except IndexError:
    print 'did you put a number in the command line?'
except:
    print 'Some other error!'
What happens for each of these cases?

$ python errors.py 1

$ python errors.py a

$ python errors.py

$ python errors.py 0

It's considered slightly bad form, however, to have a catch-all except block. If you didn't anticipate the error, it will probably be helpful to you to get the default error message, even if it is a little arcane, rather than a message like the above, which is pretty darn vague.

For a listing of the default kinds of errors (also called "exceptions") that can happen, see the list of Built-in Exceptions in the Python Documentation.

Else and Finally


Similar to the use of else in for and while loops, anything in an else block will run *only if* no errors happened. On the other hand, anything in a finally block runs after the try/except blocks, and will run regardless of whether an exception was thrown; if there was an exception, it doesn't stop it from being thrown, it just waits a bit before it does.

def divide(x,y):
    try:
        result = x/y
    except ZeroDivisionError:
        print "Warning: divide by zero!"
        return float('NaN') #Not a number
    else:
        return result
    finally:
        print "executing final clause"
 
print divide(2,1)
print
print divide(2,0)
print
print divide('A','B')

$./errors.py
executing final clause
2
 
Warning: divide by zero!
executing final clause
nan
 
executing final clause
Traceback (most recent call last):
  File "errors.py", line 16, in <module>
    print divide('A','B')
  File "errors.py", line 3, in divide
    result = x/y
TypeError: unsupported operand type(s) for /: 'str' and 'str'

The first case made it through the division just fine.

The second case raised an exception (a division by zero error), but we anticipated that, so we replace it with a special Not a Number value, and move on.

The third case we found an exception that we didn't anticipate. The exception came up, so we still ran the finally clause, but then raised the exception. Any questions?



Exercises


1) Exception handling we have known.


If you don't know them already, there are two functions to get rid of elements in a list: remove() and discard().
#!/usr/bin/env python
 
list_of_letters = ['a', 'a', 'b', 'c','c','c','d','e']
print 'ORIGINAL'
set_of_letters = set(list_of_letters)
print set_of_letters
 
print 'DISCARD'
set_of_letters.discard('q')
print set_of_letters
 
print 'REMOVE'
set_of_letters.remove('q')
print set_of_letters

a) Now that we've learned more about exception handling, explain what is happening here.

b) Create a script that contains a list of 5 of your favorite beers or wines or soft drinks (depending on your preference) for a tasting party stored as a set. Each time someone drinks a beverage, it is removed from your fridge and cannot be drunk again. Fernando drinks a beverage. Adjust the set as appropriate. Emily sees Fernando drinking his beverage and wants the same one. Tell her you're out of that beverage.

2) Modify the FASTA parser from Session 4.1 to handle:


1) non-existent filenames
2) files that do not conform to the FASTA format (i.e. >gene for IDs, and strings of A,T,G, or C for sequence).
3) sequences that are in both cases

3) Modify the FASTA parser to identify file compression format and read compressed files.


We need to add functionality to our FASTA parser for use next week. We will be using very large FASTA files (> 3GB) when they are uncompressed. In order to save disk space, we'll be leaving them compressed (~1 GB) while we use them. However, this will require us to use the gzip module to read a compressed file directly. We will also need the mimetypes module, which can identify the type of file we are using.

First, write a new function called open_file_by_mimetype that determines whether a file is gzipped or not (using the mimetypes modules) and returns an open file handle. Then make a function called read_fasta() using your FASTA parser script, but modify it to call this new function such that your parser can read both types of FASTA files automatically.

When you're done, store your new file-opening function in a module called file_tools.py and your improved read_fasta() parser in the module you already made called sequence_tools.py.

4) All code is bug-free until your first user.


You have another coworker who made an AMAZING secondary structure analysis script. She used it on the protein 3GV2, inputting the pdb file for 3GV2. She asks if you will analyze her protein, interleukin-19, as well.(HINT: use PDB code 1N1F). Crud! This protein breaks your code. Why? Rewrite your code to work on both interleukin-19 and on the original 3GV2 HIV capsid protein.

The script she shared is below. I recommend also downloading the pdb file for 3GV2, to see what a working final output should look like.

#!/usr/bin/python
##SCRIPT TO PARSE OUT SECONDARY STRUCTURE INFORMATION
 
import sys, os
 
full_seq = []
helix_aa = []
sheet_aa = []
 
import sys, os
 
full_seq = []
helix_aa = []
sheet_aa = []
atoms = []
f1 = open('1N1F.pdb' ,'r')
for next in f1:
    tmp = next.strip().split()
    if tmp[0] == 'SEQRES':
        if tmp[2] == 'A':
            full_seq.extend(tmp[4:])
    elif tmp[0] == 'HELIX':
        try:
            int(tmp[5])
        except:
            tmp[5] = tmp[5][:-1]
        helix_aa.append(tmp[:9])
    elif tmp[0] == 'SHEET':
        sheet_aa.append(tmp[:10])
    elif tmp[0] == 'ATOM':
        if len(tmp) < 12:
            begin = tmp[0:2]
            end = tmp[3:]
            middle = [tmp[2][:3], tmp[2][4:]]
            tmp = begin + middle + end
        try:
            int(tmp[5])
        except:
            continue
        atoms.append(tmp)
 
######################
 
num_helix_res = 0.0
print "There are %s residues in the sequence" % len(full_seq)
 
# Set up a listing of features by residue, then fill it in as we go along
feature = ['Other']*(10000)
 
for aa in helix_aa:
    # We add 1 because there are b-a+1 residues between a and b, inclusive
    num_helix_res += float(aa[8]) - float(aa[5]) + 1
    for i in range(int(aa[5]), int(aa[8])+1):
        feature[i] = 'Helix'
 
num_sheet_res = 0.0
for sheet in sheet_aa:
    num_sheet_res += float(sheet[9]) - float(sheet[6]) + 1
    for i in range(int(sheet[6]), int(sheet[9])+1):
        feature[i] = 'Sheet'
 
 
 # atom[4] == chain id
 # atom[5] == residue #
 # atom[10] == b-factor
 
 
helix_bfactors = {}
sheet_bfactors = {}
other_bfactors = {}
 
for atom in atoms:
    Chain = atom[4]
    BFactor = float(atom[10])
    ResidueNum = int(atom[5])
 
    if feature[ResidueNum] == 'Helix':
        if Chain not in helix_bfactors:
            helix_bfactors[Chain] = []
        helix_bfactors[Chain].append(BFactor)
    elif feature[ResidueNum] == 'Sheet':
        if Chain not in sheet_bfactors:
            sheet_bfactors[Chain] = []
        sheet_bfactors[Chain].append(BFactor)
    else:
        if Chain not in sheet_bfactors:
            other_bfactors[Chain] = []
        other_bfactors[Chain].append(BFactor)
 
for chain in helix_bfactors:
    # I could have used any of the different bfactor listings
    avg_helix = sum(helix_bfactors[chain])/len(helix_bfactors[chain])
    avg_sheet = sum(sheet_bfactors[chain])/len(sheet_bfactors[chain])
    avg_other = sum(other_bfactors[chain])/len(other_bfactors[chain])
    print '%s\t %5f\t %5f\t %5f' % (chain, avg_helix, avg_sheet, avg_other)
 

Solutions


1. Exception handling we have known.

a) Discard something that's not there doesn't result in an error;
removing something that's not there does.

b)
print
listOfBeers = ['Racer 5', 'Dogfishhead 90 Minute IPA', 'Chimay',
'Unibroue Maudite', 'Josephsbrau PLZNR']
print 'The party is all stocked with my favorite beers:'
setOfBeers = set(listOfBeers)
for beer in setOfBeers:
    print '\t', beer
print
 
def BxxrHour(x):
    print 'Craving some %s.  Goes to get one.  ' % (x)
    try:
        setOfBeers.remove(x)
    except:
        print "\"Hey! Someone totally pwned my %s!!\"" % (x)
        print 'Beverage FAIL!!!'
    else:
        print 'Epic beverage WINS!'
    finally:
        print "The remaining beverages are:"
        for beer in setOfBeers:
            print '\t', beer
        print
        return setOfBeers
        print
 
print 'Ben wants a beverage:'
setOfBeers = BxxrHour('Racer 5')
 
print 'Mel wants a beverage:'
BxxrHour('Racer 5')
 

2. Modify the FASTA parser from Session 4.1 to handle:

Your new fasta parser should be:
#!/usr/bin/env python
##FROM SECTION 5.1 EXERCISES #2
def fastaParser2(fastafilename):
    current_gene = ""   # Start with an empty string, just in case
    genes = { }         # Make an empty dictionary of genes
    try:
        fh = open(fastafilename, 'r')
    except IOError:
        print 'Could not find file with filename %s' % (fastafilename)
        result = 'Please verify that your filename is correct and try again.'
        return result
    for lineInd, line in enumerate(fh.readlines()):
        if lineInd == 0:
            if not line[0] == '>':
                print 'File does not conform to FASTA format.'
                result = 'Please try again with FASTA formatted file.'
                fh.close( )
                return result
            else:
                pass
        else:
            pass
        line = line.strip()  # Clear out leading/trailing whitespace
        line = line.upper()  # Deals with whatever case the
                             # sequence is by making it all upper case
        if len(line) > 0 and line[0] == ">":   # This one is a new gene
            current_gene = line[1:]
            genes[current_gene] = ""
        else:                # Add onto the current gene
            genes[current_gene] += line
    fh.close()
    return genes
 
 
And this should run the new script:
from sequence_tools import fastaParser2
 
parsed = fastaParser2('seq.FASTA')
print parsed

3. Modify the FASTA parser to identify file compression format and read compressed files.

Below is your updated fasta parser:
##FOR EXER 3 of 5.1###
def read_fasta(fastaFilename):
    from file_tools import open_file_by_mimetype as ofbm
    current_gene = ""   # Start with an empty string, just in case
    geneDict = { }      # Make an empty dictionary of genes
    fh = ofbm(fastaFilename)
    for lineInd, line in enumerate(fh.readlines()):
        if lineInd == 0:
            if not line[0] == '>':
                print 'File does not conform to FASTA format.'
                result = 'Please try again with a FASTA formatted file.'
                fh.close( )
                return result
            else:
                pass
        else:
            pass
        line = line.strip()  # Clear out leading/trailing whitespace
        if lineInd % 10000 == 0:
            print "At line %s." % (lineInd)
        else:
            pass
        line = line.upper() # Deals with whatever case the
                            # sequence is by making it all upper case
        if len(line) > 0 and line[0] == ">":   # This one is a new gene
            current_gene = line[1:]
            geneDict[current_gene] = ""
        else:                # Add onto the current gene
            geneDict[current_gene] += line
    fh.close()
    return geneDict
 
And a script to make this run is:

###EXER 3###
from sequence_tools import *
import sys
 
fastafilename = sys.argv[1]
 
x = read_fasta(fastafilename)
print x

4. All code is bug-free until your first user.


There are three sections at the end of the code. The first is the original code. The second is my fix for the script (you have to comment out the first section). The third is another fix from a previous year (again comment out both of the top two sections. I also altered the file input to take from command line for easier testing of script and printed a header for the table so I could understand my output.

import sys, os
 
full_seq = []
helix_aa = []
sheet_aa = []
atoms = []
f1 = open(sys.argv[1] ,'r')
for next in f1:
    tmp = next.strip().split()
    if tmp[0] == 'SEQRES':
        if tmp[2] == 'A':
            full_seq.extend(tmp[4:])
    elif tmp[0] == 'HELIX':
        try:
            int(tmp[5])
        except:
            tmp[5] = tmp[5][:-1]
        helix_aa.append(tmp[:9])
    elif tmp[0] == 'SHEET':
        sheet_aa.append(tmp[:10])
    elif tmp[0] == 'ATOM':
        if len(tmp) < 12:
            begin = tmp[0:2]
            end = tmp[3:]
            middle = [tmp[2][:3], tmp[2][4:]]
            tmp = begin + middle + end
        try:
            int(tmp[5])
        except:
            continue
        atoms.append(tmp)
 
######################
 
num_helix_res = 0.0
print "There are %s residues in the sequence" % len(full_seq)
 
# Set up a listing of features by residue, then fill it in as we go along
feature = ['Other']*(10000)
 
for aa in helix_aa:
    # We add 1 because there are b-a+1 residues between a and b, inclusive
    num_helix_res += float(aa[8]) - float(aa[5]) + 1
    for i in range(int(aa[5]), int(aa[8])+1):
        feature[i] = 'Helix'
 
num_sheet_res = 0.0
for sheet in sheet_aa:
    num_sheet_res += float(sheet[9]) - float(sheet[6]) + 1
    for i in range(int(sheet[6]), int(sheet[9])+1):
        feature[i] = 'Sheet'
 
 
 # atom[4] == chain id
 # atom[5] == residue #
 # atom[10] == b-factor
 
 
helix_bfactors = {}
sheet_bfactors = {}
other_bfactors = {}
 
for atom in atoms:
    Chain = atom[4]
    BFactor = float(atom[10])
    ResidueNum = int(atom[5])
 
    if feature[ResidueNum] == 'Helix':
        if Chain not in helix_bfactors:
            helix_bfactors[Chain] = []
        helix_bfactors[Chain].append(BFactor)
    elif feature[ResidueNum] == 'Sheet':
        if Chain not in sheet_bfactors:
            sheet_bfactors[Chain] = []
        sheet_bfactors[Chain].append(BFactor)
    else:
        if Chain not in sheet_bfactors:
            other_bfactors[Chain] = []
        other_bfactors[Chain].append(BFactor)
 
 
print "Chain\tHelix\tSheet\tOther"
 
######################
#Method 1
# This is the original code that breaks with 1N1F
######################
for chain in helix_bfactors:
    # I could have used any of the different bfactor listings
    avg_helix = sum(helix_bfactors[chain])/len(helix_bfactors[chain])
    #print helix_bfactors
    avg_sheet = sum(sheet_bfactors[chain])/len(sheet_bfactors[chain])
    avg_other = sum(other_bfactors[chain])/len(other_bfactors[chain])
    print '%s\t %5f\t %5f\t %5f' % (chain, avg_helix, avg_sheet, avg_other)
 
'''
 
######################
#Method 2
# This is my code to modify the program in order to make it robust to different .pdb's
######################
for chain in helix_bfactors:
    try:
        avg_helix = sum(helix_bfactors[chain])/len(helix_bfactors[chain])
    except KeyError:
        avg_helix = 0
    try:
        avg_sheet = sum(sheet_bfactors[chain])/len(sheet_bfactors[chain])
    except KeyError:
        avg_sheet = 0
    try:
        avg_other = sum(other_bfactors[chain])/len(other_bfactors[chain])
    except KeyError:
        avg_other = 0
    print '%s\t %5f\t %5f\t %5f' % (chain, avg_helix, avg_sheet, avg_other)
'''
'''
######################
#Method 3
# This is another way to modify the program in order to make it robust to different .pdb's
######################
 
bfactors = {'Helix':helix_bfactors, 'Beta sheet':sheet_bfactors, 'Other':other_bfactors}
 
for y in bfactors:
    for x in bfactors[y]:
        if len(x) > 0:
            bfactordict = bfactors[y]
            for chain in bfactordict:
                average = sum(bfactordict[chain])/len(bfactordict[chain])
                print 'Chain: %s\t %s: %5f\t' % (chain, y, average)
'''
The output you should get for 3GV2 with Method 2 is:

There are 342 residues in the sequence
Chain Helix Sheet Other
A 42.750833 35.196250 44.535339
C 42.750833 35.196250 44.535339
B 42.750833 35.196250 44.535339
E 42.750833 35.196250 44.535339
D 42.750833 35.196250 44.535339
F 42.750833 35.196250 44.535339

and with 1N1F, it is:
There are 159 residues in the sequence
Chain Helix Sheet Other
A 36.237181 0.000000 57.820000