References

Restate the problem

I’m going to set a DNA strands that fall into two groups:

First, correct reads that are confirmed by being in the dataset more than once, either directly or via reverse complement.

Second, incorrect reads, which only appear in the dataset one time.

For each incorrect read, there is a correct read that requires exactly one character change to match the incorrect read.

I need to return each incorrect read with its corresponding correct read.

Solution steps

First, I wrote a function to find all the incorrect reads from the list of reads by removing every string that appeared more than one time either directly or as a reverse complement. That function looked like this:

def find_bad_reads(seqs):
    good_reads = []
    bad_reads = []
    for s in seqs:
        bad_reads.append(s)
    for i in range(len(seqs)):
        for j in range(len(seqs)):
            if i != j:
                if seqs[i] == seqs[j]:
                    if i not in good_reads:
                        good_reads.append(i)
                if seqs[i] == Seq.reverse_complement(seqs[j]):
                    if i not in good_reads:
                        # print(sequences[i], " is the reverse complement of ", sequences[j])
                        good_reads.append(i)
    good_reads.sort(reverse = True)
    for k in good_reads:
        del bad_reads[k]
    return bad_reads

Then, I wrote a function to find all the correct reads by keeping only those reads that appear more than once, either directly or as a reverse complement.

def find_good_reads(seqs):
    good_reads = []
    result = []
    for i in range(len(seqs)):
        for j in range(len(seqs)):
            if i != j:
                if seqs[i] == seqs[j]:
                    if i not in good_reads:
                        good_reads.append(i)
                if seqs[i] == Seq.reverse_complement(seqs[j]):
                    if i not in good_reads:
                        # print(sequences[i], " is the reverse complement of ", sequences[j])
                        good_reads.append(i)
    for k in good_reads:
        result.append(seqs[k])
    return result

Next, I wrote a function to find the match each incorrect read with a correct read or its reverse complement that differed by exactly one character.

def corrections(errors, corrects):
    for a in range(len(corrects)):
        corrects.append(Seq.reverse_complement(corrects[a]))
    # print(corrects)
    response = {}
    for s in errors:
        for t in corrects:
            if hamm(s, t) == 1:
                # print(s, "->", t)
                response.update({s: t})
    return response

My answers were incorrect for the sample dataset, but I couldn’t make any progress troubleshooting with the sample dataset, so I submitted two incorrect answers in order to unlock the “Questions” section, where I got the hint that incorrect responses shouldn’t map to other incorrect responses, which was a case I allowed.

Once I fixed that, I got a successful result on my first attempt.

Python concepts

I used a Python dictionary to store the list of corrections because I wanted to take advantage of the dictionary property of not allowing duplicates.

Post-solution notes

Challenges solved so far: 48

How many people solved this before me: 2,410

Time spent on challenge: 1.5 hours of focused work over two days

Most time-consuming facet: Not allowing incorrect reads to match other incorrect reads

Questions from others: This is the hint that led to my being able to solve this challenge:

Solutions from others: I saw several cases of posted responses with notes that they had run into the same issue I had with matching incorrect reads with other incorrect reads.

I also saw solutions made a list of correct reads by simply removing all the bad reads from the list instead of writing a function to find all the good reads based on their appearing more than once like I did. Their approach is much simpler.

Closing thoughts: My code is extremely inefficient. The challenge dataset had 960 DNA reads of 50 base pairs each. My code takes 28 seconds to produce the correct result. 378 of the 960 reads in my dataset were incorrect.