This problem asks:

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string.

Restate the problem

I’ll be given up to 10 DNA strings in FASTA format, and I need to find which one has the highest GC-content, that is, the percentage of its bases that are ‘G’ or ‘C’. Then I need to return the ID of that string along with its GC-content.
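To make the definition concrete, here is a minimal sketch of computing GC-content by hand, without any library. The function name gc_content and the choice to return 0 for an empty string are my own assumptions, not part of the problem statement.

```python
def gc_content(dna: str) -> float:
    """Return the GC-content of a DNA string as a percentage (0 to 100)."""
    if not dna:
        return 0.0  # assumption: treat an empty string as 0% instead of failing
    dna = dna.upper()
    gc_count = dna.count("G") + dna.count("C")
    return 100.0 * gc_count / len(dna)


print(gc_content("AGCTATAG"))  # 3 of the 8 bases are G or C
```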

Before beginning

Before I could start on this challenge, I needed to learn how the FASTA file format works.
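In short, each FASTA record starts with a header line beginning with ">" followed by the ID, and the sequence follows on one or more lines. The sketch below parses that layout in plain Python. The function name parse_fasta and the sample IDs are made up for illustration, and it assumes every header line actually contains an ID.

```python
def parse_fasta(text: str) -> dict:
    """Map each record ID to its sequence, joining wrapped sequence lines."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: the ID is the first word after ">".
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence line: append it to the record currently being read.
            records[current_id] += line
    return records


sample = """>seq_a
ACGT
ACGT
>seq_b
GGCC
"""
print(parse_fasta(sample))
```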

In the process of learning about FASTA, I found the Bio.SeqUtils.GC function that returns the G+C content for a sequence, which turned out to be a good fit for this challenge.

Not so fast

Although I have the latest version of the biopython library installed, this import failed:

from Bio.SeqUtils import GC

because the GC function is no longer in the Bio.SeqUtils module, even though the documentation I was reading still described it.

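I went to the Biopython repository on GitHub, looked at the Bio.SeqUtils source, and saw that GC has been replaced by gc_fraction. One difference to keep in mind: gc_fraction returns a fraction between 0 and 1 rather than a percentage as GC did, so the result has to be multiplied by 100 to get a percentage.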
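One way to cope with the rename is a small import shim. This is only a sketch: it prefers gc_fraction, falls back to GC on older releases, and as a last resort (my own addition, so the snippet runs even without Biopython) uses a plain-Python stand-in that only handles unambiguous DNA.

```python
try:
    # Newer Biopython releases: returns a fraction between 0 and 1.
    from Bio.SeqUtils import gc_fraction
except ImportError:
    try:
        # Older releases: GC returns a percentage, so rescale it to a fraction.
        from Bio.SeqUtils import GC

        def gc_fraction(seq):
            return GC(seq) / 100
    except ImportError:
        # Biopython not installed: stand-in for plain A/C/G/T sequences only.
        def gc_fraction(seq):
            s = str(seq).upper()
            return (s.count("G") + s.count("C")) / len(s) if s else 0.0


# Multiply by 100 to get the GC-content as a percentage.
gc_percent = 100 * gc_fraction("ATGC")
print(gc_percent)
```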

List comprehension in Python

While writing this code, I brushed up on list comprehensions in Python, although what I ended up writing is a plain for loop over the records that SeqIO.parse yields one at a time:

for record in SeqIO.parse(file, "fasta"):
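Here is a rough sketch of what the body of such a loop might look like: track the best ID and GC percentage seen so far. The names gc_percent, highest_gc and Record are hypothetical. The namedtuple only stands in for the SeqRecord objects that SeqIO.parse yields, which expose the same id and seq attributes, and the GC calculation is done by hand so the snippet runs without Biopython.

```python
from collections import namedtuple


def gc_percent(seq) -> float:
    """GC-content as a percentage, counted by hand."""
    s = str(seq).upper()
    return 100.0 * (s.count("G") + s.count("C")) / len(s) if s else 0.0


def highest_gc(records):
    """Return (id, GC percentage) for the record with the highest GC-content."""
    best_id, best_gc = None, -1.0
    for record in records:  # same loop shape as over SeqIO.parse(file, "fasta")
        gc = gc_percent(record.seq)
        if gc > best_gc:
            best_id, best_gc = record.id, gc
    return best_id, best_gc


# Hypothetical stand-in records with made-up IDs and sequences.
Record = namedtuple("Record", ["id", "seq"])
records = [
    Record("seq_a", "ATATGC"),
    Record("seq_b", "GGCCAT"),
    Record("seq_c", "ATAT"),
]
print(highest_gc(records))
```

With real data, the same function could be called as highest_gc(SeqIO.parse(file, "fasta")). The loop could also be collapsed into max(records, key=lambda r: gc_percent(r.seq)), which is closer in spirit to a comprehension.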

String formatting in Python

Having the gc_fraction function made this challenge easier, but I still struggled to convert the highest gc_fraction to a string so that I could save it in my solution file.

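Then I read up on the different ways to format strings in Python: the older % operator, str.format, and f-strings (Python 3.6 and later).

The trick, as you can see in my code, is the % operator with the %f conversion, which formats a float with six decimal places: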

solution = GCname + '\n' + '%f' % GCmax
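For comparison, here is the same line written three ways. The ID and GC value are placeholders I made up for the example. All three produce the same string, because the f conversion defaults to six decimal places in each style.

```python
GCname = "Rosalind_0001"  # placeholder ID
GCmax = 60.91954          # placeholder GC percentage

# The % operator binds tighter than +, so '%f' % GCmax is formatted first.
old_style = GCname + '\n' + '%f' % GCmax
with_format = "{}\n{:f}".format(GCname, GCmax)
with_fstring = f"{GCname}\n{GCmax:f}"  # f-strings need Python 3.6 or later

print(with_fstring)
```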

Lessons learned

I learned a few important lessons in this challenge:

how to work with FASTA files using the SeqIO.parse function
how to look through the source code for Biopython on GitHub
how to format floats into strings in Python, with the % operator or with f-strings in versions 3.6 and later