Example_of_simple_undirected_graph_3.svg.png

By Michel Bakni - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=151762031

This problem asks:

Given: A collection of DNA strings in FASTA format having total length at most 10 kbp.

Return: The adjacency list corresponding to O3. You may return edges in any order.

Required reading

The following topics were required to restate the problem:

  1. Adjacency list
  2. Directed graph
  3. Overlap graph

Restate the problem

The little paragraphs that Project Rosalind uses for introductions and problem statements are some of the densest bits of text I’ve ever seen. Consider the following “Brief Introduction to Graph Theory”:

overlap-graphs-intro.png

Graph theory is a massive topic. How well do I need to understand graph theory to solve this problem? They don’t say, but I’m not able to restate the problem in my own words without spending considerable time looking up words and reading wikipedia pages.

They’re going to send me a set of DNA strings in FASTA format. I need to return, in any order, every instance where the first three values in one string are the last three values in the second string.

Solution steps

On a personal note, I need to mention that this is the first challenge where I feel like I’m starting from scratch. In the previous challenges, I remembered some or all of the processes I went through to solve these back in 2018.

Not so with this one.

Project Rosalind says I solved this on March 6, 2018 at 8:18 a.m., but I have absolutely no memory of this.

I started writing nested for loops, but could not get them to work correctly, despite a lot of print-statement debugging.

In the end, the problem wasn’t with my for loops at all. I had not written my read_fasta method to correctly return a list.

Finally, when I had my code working and returning the correct result for the sample dataset in the console, I got a “Wrong answer” response from Project Rosalind. It took some time for me to figure out that I had put line breaks in the console print statements, but not in the solution file write statements.

Python concepts

I was forced to review the difference between iterators, generator functions, and lists in Python because SeqIO.parse() returns a generator function, not a list.

Print statements for debugging also played a role in solving this challenge, along with string slicing.

Bioinformatics concepts

I read more about the fasta format because I didn’t understand how complicated that specification was.

Graph theory came into play for the first time in this challenge. Although this was a relatively simple example, I predict that future Project Rosalind challenges will make extensive use of graph theory, so spending time reading about the basics seemed like time well invested.

Problem-solving topics

This was one of the first challenges where I had quick ideas with long implementation times. That is, I had a clear idea about how to go about solving this challenge quickly, but it took me a long time to write the code that put that idea into practice, mainly because I had unrecognized assumptions.