This problem asks:

Given: A collection of k (k≤100) DNA strings of length at most 1 kbp each in FASTA format.

Return: A longest common substring of the collection.

Restate the problem

I’m going to get a collection of up to 100 DNA strings, each of which is up to 1,000 characters long. I need to find the longest substring that is common to every string in the collection.

Solution steps

This is a classic programming exercise that is so common that it has it’s own Wikipedia page, complete with a pseudocode example using dynamic programming.

I found the pylcs library in pypi. pylcs includes an lcs_string_idx method that takes two strings and returns the index of the longest common substring within the second string.

I installed the itertools package and used itertools.combinations() to find the longest common substring for each possible pair of string in a set of strings.

This ran quickly on the sample dataset and gave the expected results for each pair of strings, but did not seem to get me any closer to finding the longest common substring for all three strings.

I re-read the Wikipedia article on longest common substring problem, paying specific attention to the suffix tree.

I found a library, suffix-trees, that implements the suffix tree algorithm and includes an lcs method that takes a list of strings and returns their longest common substring.

Python concepts

Nothing new. I built up the list of strings and passed it to the lcs method.

Bioinformatics concepts

We revisit the topic of motifs here, but nothing new was required to solve the challenge.

Problem-solving concepts

There’s a pattern developing here. Here’s the pattern: I make an initial attempt to solve the problem by writing code from scratch. If that works, I solve the challenge and move on. If I’m not able to write the code from scratch in less than an hour, I go looking for a library that solves the problem for me.

So what’s going on here? What am I learning by finding libraries to solve these challenges? Probably not a lot. I’m learning how to find libraries and use them, and that’s not much.

Going forward, I should set rules around what libraries I can use, and what I should write myself. I think general-purpose libraries like itertools, math, and biopython are good to use. I will stop searching for libraries that solve the exact problem I’m trying to solve.