skleene-221x300.gif

Stephen Cole Kleene, inventor of regular expressions

This problem asks:

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection.

Required reading

  1. DNA sequence alignment, specifically motif finding
  2. DNA consensus sequences
  3. Most likely common ancestor

Restate the problem

I’m going to get a set of DNA strings of equal length. I need to return two items:

  1. a string with the most common symbol shown at each position.
  2. a matrix with four rows (A, C, G, and T) and as many columns as the DNA strings are long. In each position in the matrix, include the count of the specific symbol at that location.

Solution steps

Biopython has a module called “motifs” that includes both a consensus function and a count matrix function.

My code reads the DNA strings into a list and sends that list to the Biopython methods, then writes the results to a file.

This solution requires a manual edit of the solution file to remove the column headers from the matrix section.

Python concepts

I used string replace() to remove the brackets and commas from the output so that it matched what Project Rosalind was expecting.

I used method chains to perform multiple string replacements on a single line like this:

return str(m.counts).replace('.00','').replace('   ', ' ')

Bioinformatics concepts

DNA consensus tools are used to make estimates about most likely common ancestors.

Problem-solving concepts

Long after I had resolved the central conflict of the challenge, I struggled to programmatically remove the matrix header row from the output file. Biopython returned the header row along with the results, and I could not figure out how to use Python string slicing to remove it.

I did have the idea to write delimiters into the result file, then use regular expressions to remove the text between the delimiters and the delimiters themselves.

However, before writing code to do that, I had the clear sense that doing so would be going in the wrong direction. That is, making things more complicated instead of less complicated.

After taking a break and thinking about the challenge away from the keyboard, I had the idea to break with my established habit of writing the solution to a file and uploading it directly to Project Rosalind. Instead, I could use my program to generate the solution file, then edit it manually before uploading it.

Making things more complicated than they need to be is a common pitfall. It’s one that I fall into often. Going forward, I’m going to be on the lookout for simple solutions that are good enough.

Bonus: The man who invented regular expressions in 1951 was named Stephen Cole Kleene, and he lived an interesting life worth reading about.