This problem asks:

Given: Three positive integers k, m, and n, representing a population containing k+m+n organisms: k individuals are homozygous dominant for a factor, m are heterozygous, and n are homozygous recessive.

Return: The probability that two randomly selected mating organisms will produce an individual possessing a dominant allele (and thus displaying the dominant phenotype). Assume that any two organisms can mate.

Restate the problem

What I read before I could put this challenge into my own words:

  1. alleles
  2. homozygous and heterozygous
  3. dominant and recessive
  4. Punnett squares

They’re going to send me the count of homozygous dominant, heterozygous, and homozygous recessive individuals in a system.

They want to know the likelihood that a random child from this population will have at least one dominant allele.

There is a hint, which I’ll keep in mind:

Consider simulating inheritance on a number of small test cases in order to check your solution.

Solution steps

First, I read this LibreText Biology textbook section about Mendel’s First Law, which was helpful, but didn’t get into calculating probabilities.

This article on probability trees was nice, but didn’t move my solution forward.

I had the idea to run a million random trials to estimate the probability of the offspring of two random parent having a dominant allele, but I didn’t try to code that because it didn’t seem like a “real” solution, and I was afraid that a million trails wouldn’t be enough to get the right answer.

Instead, I wanted to count each one of the possible combinations of parents one time. One helpful search term was “Combinatorics”. That lead me to the itertools library in Python, and specifically combinations. However, I wasn’t able to make any progress on a solution, even though this iterator looks perfect for the job.

Then, I read this “solution”, which seems to prove that there’s a direct mathematical way to calculate this probability, but I don’t understand that math, so it was no help to me. This person comes to the same conclusion, but again, I don’t understand that math.

After some struggling, I had this idea:

population = k * [“AA”] + m * [“Aa”] + n * [“aa”]

This generates a set where each item in the set being the genome of an individual in the population.

Then I could use itertools.combinations() to make a set of all the possible pairs of parents in the population.

possible_pairs = sorted(itertools.combinations(population, 2))

Then the only math I needed was:

If your parents are (‘AA’, ‘AA’), (‘AA’, ‘Aa’), or (‘AA, ‘aa’), you have a 100% chance of getting an allele. If your parents are (‘Aa’, ‘Aa’), you have a 75% chance of getting an allele. If your parents are (‘Aa’, ‘aa’), you have a 50% chance of getting an allele.

The core of my code looks like this:

        for x in range(trial_size):
            parents = random.choice(possible_pairs)
            if parents in (('AA', 'AA'), ('AA', 'Aa'), ('AA', 'aa')):
                count += 1
            if parents == ('Aa', 'Aa'):
                count += .75
            if parents == ('Aa', 'aa'):
                count += .50
        return str(round((count/trial_size), 3))

The full code is here.

Python concepts

This is the first time during Project Rosalind that I’ve used the random library. I used the choice(seq) method.

There is a whole rich history of the quest for randomness in mathmatics. Specifically, I enjoyed reading about the Hot Lotto fraud scandal, where an official in charge of validating the randomness of the lottery drawing snuck a back door into the software that generated the lottery numbers.

Bioinformatics concepts

I learned about the genotype-phenotype distinction, which was helpful in understanding why the Punnett Squares work the way they do.