Computing GC Content
This problem asks:
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string.
Restate the problem
They’re going to send me 10 DNA strings, and I need to find which one has the highest concentration of ‘G’ and ‘C’ elements. Then, I need to return the ID of that string along with the GC-content of that string.
Before beginning
Before I could start on this challenge, I needed to read to learn how the FASTA file format works.
In the process of learning about FASTA, I found the Bio.SeqUtils.GC function that returns the G+C content for a sequence, which turned out to be a good fit for this challenge.
Not so fast
Although I have the latest version of the biopython library installed, I was not able to:
from Bio.SeqUtils import GC
because there is no GC function in the Bio.SeqUtils library, despite the documentation saying there is.
I went to the Bio.SeqUtils repository on GitHub and saw that the function has been renamed to gc_fraction.
List comprehension in Python
While writing this code, I reminded myself how to use list comprehension in Python so that I could write:
for record in SeqIO.parse(file, "fasta"):
String formatting in Python
Having the gc_fraction function made this challenge easier, but I still struggled to convert the highest gc_fraction to a string so that I could save it in my solution file.
Then I learned about all the new methods for string formatting in Python.
The trick, as you can see in my code, is to use:
solution = GCname + '\n' + '%f' % GCmax
Lessons learned
I learned a few important lessons in this challenge:
- how to work with FASTA files using the SeqIO.parse function
- how to look through the source code for biopython in GitHub
- how to format floats into strings in Python versions 3.6 and later