This problem asks:

Given: A collection of up to 1000 (possibly repeating) DNA strings of equal length (not exceeding 50 bp) corresponding to a set S of (k+1)-mers.

Return: The adjacency list corresponding to the de Bruijn graph corresponding to S∪Src.

References

  1. Read coverage, shotgun sequencing
  2. K-mer
  3. de Bruijn graph
  4. More on de Bruijn graph
  5. Vacuous truth
  6. Directed graph
  7. More on digraphs

Restating the problem

I’m going to get a long list of DNA strings that represent short fragments of a longer genome. I need to create a de Bruin graph of the list and its reverse complements such that:

  1. Each node is a substring of length k-1 from the given set of strings and their reverse complements.
  2. The edges are all directed such that (r[1:k], r[2:k+1])

Solution steps

First, from the given set, I added all the reverse complements. Then I made a list of edges where each edge went from: all but the last character of the string to all but the first character of the string This almost matched the sample output, but was off by a single edge. I decided to try a full dataset anyway. Not surprisingly, the result was wrong.

For the sample dataset, the difference between my result and the given result is that I have two copies of edge (CAT, ATC) while the sample output only has one copy of that edge. Other edges have duplicates.

I changed my code to remove duplicates from the S list at the beginning, then my result matched the given result exactly.

I tried my second full dataset. Correct result! I did this one without using any external libraries or looking at anyone else’s code. I solved this in about 90 minutes. I was the first person to solve this in over 10 days. 1,185 people solved this before me. One person solved this with a shell script! Most of the Python solutions are similar to mine.

I’m glad I wasn’t too intimidated by all the complex theory material at the beginning. Implementing the solution didn’t require understanding any of that.