Creating a Distance Matrix
This problem asks:
Given: A collection of n (n≤10) DNA strings s1,…,sn of equal length (at most 1 kbp).
Return: The matrix D corresponding to the p-distance dp on the given strings.
Required reading
Restate the problem
I’m going to get up to 10 DNA strings. I need to construct a table of the p-distance between each string and every other string. p-distance is the proportion of symbols that are different between the two strings.
Solution steps
First, I wrote a p_dist function that returns the proportion of symbols that are different between two strings rounded to 4 decimal places. The code is straightforward and looks like this:
def p_dist(s1, s2):
diff = 0
for x in range(len(s1)):
if s1[x] != s2[x]:
diff += 1
return str(round(diff/len(s1), 4))
Next, I wrote two nested loops to write the p_dist for every possible combination of strings to a table in a file:
for s1 in sequences:
for s2 in sequences:
file.write(p_dist(s1, s2))
file.write(' ')
file.write('\n')
My solution returned the correct result for the sample dataset, so I downloaded a test dataset and tried my solution with that. My first attempt failed because I did not rename my sample dataset before downloading the real dataset, but once I did that, my second attempt succeeded.
Post-solution notes
Challenges solved so far: 46
How many people solved this before me: 2,653
Most recent solve before me: First solution of the day
Time spent on challenge: 30 minutes
Most time-consuming facet: Making sure the p distances were written out in the correct order
Solutions from others: Several clever solvers reused their Hamming distance code from the Counting Point Mutations challenge. That didn’t occur to me, and probably would have saved a little time.
Closing thoughts: My solution run in O(n^2 * l) for n sequences of length l. If the challenge dataset were orders of magnitude larger, my solution would take a long time to run, and I’m not sure what I would do to speed things up.