This problem asks:

Given: A collection of n (n≤10) DNA strings s1,…,sn of equal length (at most 1 kbp).

Return: The matrix D corresponding to the p-distance dp on the given strings.

Required reading

Restate the problem

I’m going to get up to 10 DNA strings. I need to construct a table of the p-distance between each string and every other string. p-distance is the proportion of symbols that are different between the two strings.

Solution steps

First, I wrote a p_dist function that returns the proportion of symbols that are different between two strings rounded to 4 decimal places. The code is straightforward and looks like this:

def p_dist(s1, s2):
    diff = 0
    for x in range(len(s1)):
        if s1[x] != s2[x]:
            diff += 1
    return str(round(diff/len(s1), 4))

Next, I wrote two nested loops to write the p_dist for every possible combination of strings to a table in a file:

for s1 in sequences:
    for s2 in sequences:
        file.write(p_dist(s1, s2))
        file.write(' ')
    file.write('\n')

My solution returned the correct result for the sample dataset, so I downloaded a test dataset and tried my solution with that. My first attempt failed because I did not rename my sample dataset before downloading the real dataset, but once I did that, my second attempt succeeded.

Post-solution notes

Challenges solved so far: 46

How many people solved this before me: 2,653

Most recent solve before me: First solution of the day

Time spent on challenge: 30 minutes

Most time-consuming facet: Making sure the p distances were written out in the correct order

Solutions from others: Several clever solvers reused their Hamming distance code from the Counting Point Mutations challenge. That didn’t occur to me, and probably would have saved a little time.

Closing thoughts: My solution run in O(n^2 * l) for n sequences of length l. If the challenge dataset were orders of magnitude larger, my solution would take a long time to run, and I’m not sure what I would do to speed things up.