This problem asks:

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string.

Restate the problem

I'll be given up to 10 DNA strings in FASTA format, and I need to find which one has the highest proportion of ‘G’ and ‘C’ bases. Then I need to return the ID of that string along with its GC-content.

Before beginning

Before I could start on this challenge, I needed to read up on how the FASTA file format works.

In the process of learning about FASTA, I found the Bio.SeqUtils.GC function that returns the G+C content for a sequence, which turned out to be a good fit for this challenge.

Not so fast

Although I have the latest version of the Biopython library installed, this import failed:

from Bio.SeqUtils import GC

because the GC function is no longer part of Bio.SeqUtils, even though the documentation I was reading still described it.

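I went to the Biopython repository on GitHub, looked at the Bio.SeqUtils source, and saw that the function has been replaced by gc_fraction. It is not a pure rename: gc_fraction returns a fraction between 0 and 1 rather than a percentage, so the result has to be multiplied by 100 to get the GC-content in the form this problem expects.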
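To make the fraction-versus-percentage difference concrete, here is a minimal hand-rolled sketch of the calculation. It is not Biopython's implementation: gc_fraction_sketch and gc_percent are hypothetical names of my own, and the sketch only counts plain G and C characters without trying to handle ambiguity codes.

```python
def gc_fraction_sketch(seq: str) -> float:
    """Hypothetical stand-in: fraction of G and C bases, between 0 and 1."""
    if not seq:
        # Avoid dividing by zero on an empty sequence.
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_percent(seq: str) -> float:
    """Same value expressed as a percentage (0 to 100)."""
    return 100 * gc_fraction_sketch(seq)


print(gc_fraction_sketch("ATGC"))  # fraction form
print(gc_percent("ATGC"))          # percentage form
```

With the real library, the same idea would be to multiply the value returned by gc_fraction by 100 before reporting it.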

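Looping over records in Python

While writing this code, I reminded myself how loops and list comprehensions work in Python. The line I ended up with is a plain for loop over the records that SeqIO.parse yields, rather than a list comprehension: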

for record in SeqIO.parse(file, "fasta"):
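The loop above depends on Biopython. To show what it is doing, here is a rough pure-Python sketch of the same pattern: walk through the records one at a time and keep track of the best GC-content seen so far. parse_fasta and highest_gc are hypothetical helpers written for this illustration, not Biopython functions, and the two records are made-up sample data.

```python
import io


def parse_fasta(handle):
    """Yield (record_id, sequence) pairs from a FASTA-formatted text handle."""
    record_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if record_id is not None:
                yield record_id, "".join(chunks)
            header = line[1:].split()
            # The ID is the first word after '>'.
            record_id, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def highest_gc(handle):
    """Return (id, GC percentage) of the record with the highest GC-content."""
    best_id, best_gc = None, -1.0
    for record_id, seq in parse_fasta(handle):
        upper = seq.upper()
        gc = 100 * (upper.count("G") + upper.count("C")) / len(upper) if upper else 0.0
        if gc > best_gc:
            best_id, best_gc = record_id, gc
    return best_id, best_gc


# Made-up sample data: two short records, each split over two lines.
sample = io.StringIO(">seq_a\nATAT\nATGC\n>seq_b\nGGCC\nGCAT\n")
best_id, best_gc = highest_gc(sample)
print(best_id, best_gc)
```

In the Biopython version, each record exposes the ID as record.id and the sequence as record.seq, so the bookkeeping inside the loop stays the same.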

String formatting in Python

Having the gc_fraction function made this challenge easier, but I still struggled to convert the highest gc_fraction to a string so that I could save it in my solution file.

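Then I read up on the different ways to format strings in Python: the older printf-style % operator, str.format, and f-strings (Python 3.6 and later).

The trick, as you can see in my code, is to use the % operator with the %f conversion, which writes the float with six decimal places: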

solution = GCname + '\n' + '%f' % GCmax
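For comparison, here is a small sketch showing that the three formatting styles produce the same string. The values assigned to GCname and GCmax are made-up placeholders, not results from my run.

```python
# Placeholder values for illustration only.
GCname = "example_id"
GCmax = 60.91954

# Older printf-style formatting, as used in my solution.
old_style = GCname + "\n" + "%f" % GCmax

# str.format with an explicit six-decimal precision.
format_method = "{}\n{:.6f}".format(GCname, GCmax)

# f-string (Python 3.6 and later) with the same precision.
f_string = f"{GCname}\n{GCmax:.6f}"

print(old_style == format_method == f_string)
print(f_string)
```

The %f conversion defaults to six digits after the decimal point, which is why the other two styles need an explicit .6f to match it.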

Lessons learned

I learned a few important lessons in this challenge:

  • how to work with FASTA files using the SeqIO.parse function
  • how to look through the source code for biopython in GitHub
  • how to format floats into strings in Python, both with the older % operator and with the f-strings available in versions 3.6 and later