Genome Assembly as Shortest Superstring
This problem asks:
Given: At most 50 DNA strings of approximately equal length.
Return: A shortest superstring containing all the given strings.
Reference
- Wikipedia article on the shortest common supersequence problem
- Geeks for geeks tutorial on this algorithm
Restating the problem
I’m going to get a set of DNA strings, and I need to return the shortest possible string that contains all given strings.
Solution steps
Personal note: I was surprised to see that I tried and failed to solve this challenge in June 2021 and March 2022. I have no memory of these attempts.
First, I read all the DNA strings into a list.
Next, I wrote a function to calculate the longest possible overlap between two given strings. The function checks for the longest overlap in each direction and returns the greater of the two values.
Then, I wrote a Python implementation of the Geeks for geeks algorithm to crawl through every combination of strings, find the pair with the greatest overlap, combine them, remove them from the list of strings to be combined, then start again.
This solution does not take into account the Project Rosalind condition that:
The dataset is guaranteed to satisfy the following condition: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length.
I had three incorrect responses because I wasn’t properly handling the counter variables in my nested for loops. You can see my code here.
3,952 people solved this before me.