This problem asks:

Given: Two DNA strings s and t.

Return: A shortest common supersequence of s and t.

References

  1. Supersequence
  2. Common supersequence
  3. Shortest common supersequence
  4. More on shortest common supersequence
  5. Set cover problem
  6. Subsequence

Restating the problem

Given two strings, I need to find the shortest possible string that has both of the given strings as subsequences. A subsequence is one that can be created by deleting some or none of the characters from the given sequence. Another way to think of subsequences is that they contain the same symbols in the same order as the given string, but not necessarily consecutively.

Solution steps

I read in Wikipedia that this problem is closely related to the longest common subsequence problem we’ve already solved, so I reviewed my code for that challenge.

I found this helpful hint on the Wikipedia page for Shortest common supersequence:

scsp_hint.png

I focused on this phrase for a while:

By inserting the non-LCS symbols into Z while preserving their original order, we obtain a shortest common supersequence…

Specifically, I wonder about: “… preserving their original order…”. What can that mean?

I decided to try writing a shortest common supersequence function given two strings and their longest common subsequence, lcs.

First, I made an empty list, scs, to hold the shortest common supersequence.

Then, I added to scs all the characters in s up to the first lcs character. Next, I added to scs all the characters in t up to the first lcs character. Then, I added the first lcs character.

I repeated that process until there were no more lcs characters left to add.

To finish, I added any remaining characters from s and t to the scs.

This combination of steps returned the correct result for the sample dataset.

Success on the first attempt! I spent about 90 minutes on this challenge. 1,182 people have solved this before me. Nobody had solved this in a little over a week.

I unlocked two Project Rosalind badges with this result. Level 4 String Algorithms for solving 20 of the 26 string algorithm problems in the set, and Level 2 Dynamic Programming for solving 10 of the 23 dynamic programming problems in the set.