Having spent some time trying to grasp the underlying concept of the Greedy Motif Search problem in chapter 3 of Bioinformatics Algorithms (Part 1) I hoped to cement my understanding and perhaps even make life a little easier for others by attempting to explain the algorithm step by step below.
I will try to provide an overview of the algorithm as well as addressing each section of the pseudo code provided in the course:
Pseudo Code
GREEDYMOTIFSEARCH(Dna, k, t) BestMotifs ← motif matrix formed by first k-mers in each string from Dna for each k-mer Motif in the first string from Dna Motif1 ← Motif for i = 2 to t form Profile from motifs Motif1, …, Motifi - 1 Motifi ← Profile-most probable k-mer in the i-th string in Dna Motifs ← (Motif1, …, Motift) if Score(Motifs) < Score(BestMotifs) BestMotifs ← Motifs output BestMotifs
I haven’t gone into any detail for creating profiles from k-mers, finding profile-most probable k-mers or scoring as I found these topics were relatively clear in the course text. Furthermore I haven’t posted my code for obvious reasons.
Overview
The basic idea of the greedy motif search algorithm is to find the set of motifs across a number of DNA sequences that match each other most closely. To do this we:
- Run through each possible k-mer in our first dna string (illustrated for the first three 3-mers in orange below),
- Identify the best matches for this initial k-mer within each of the following dna strings (using our profile-most probable function, illustrated below in green) thus creating a set of motifs at each step
- Score each set of motifs to find and return the best scoring set.
For example the first three iterations in the main loop if k = 3 (k being the length of our k-mers, or in this case 3-mers) will give us the following motif sets and scores:
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG Score = 7 |
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG Score = 5 |
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG Score = 4 |
The best motif set in the first three iterations, as illustrated above, is therefore the third which scores 4, the best overall score happens at the seventh iteration:
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG Score = 2
The Code
BestMotifs ← motif matrix formed by first k-mers in each string from Dna
First we initialise best_motifs as a list featuring the first k-mer from each dna string.
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG
So using our example DNA list above we get:
best_motifs = [‘GGC‘, ‘AAG‘, ‘CAA‘, ‘CAC‘, ‘CAA‘]
This list will be used to compare against other motif lists that we will build later in the algorithm.
for each k-mer Motif in the first string from Dna
Motif1 ← Motif
In our outer loop we will be iterating through each k-mer in the first dna string (in this case ‘GGCGTTCAGGCA‘ ).
At the beginning of each of these iterations we will first initialise a new list called motif and then add the respective k-mer as its first entry
First iteration:
GGCGTTCAGGCA
So on the first iterations the motif list will be initialised as motif= [‘GGC‘]
Second-iteration
GGCGTTCAGGCA
On the second iteration the motif list will be initialised as motif = [‘GCG‘]
And so on…
Within each iteration of the outer loop we will be repeating the following instructions within our ‘inner loop’ below:
for i = 2 to t
form Profile from motifs Motif1, …, Motifi – 1
Motifi ← Profile-most probable k-mer in the i-th string in Dna
Motifs ← (Motif1, …, Motift)
Our inner loop effectively iterates through the second dna string the final dna string (or rather dna[1] to dna[t]). At each step we will add the most likely candidate k-mer from the current dna string to our current motif list.
As an example we will run through the first iteration of our main loop for which we initialised our motif list as [‘GGC‘] above:
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG
To tie things together I have tried to colour code the current k-mer from the first dna string as orange, the current dna string as blue and the best matches from previously analysed dna strings as green.
Please note that we iterate through dna[1] to dna[t] within the inner loop which excludes dna[0]. dna[0] is used to create the first item in our motif list upon each iteration of our outer loop as described above. This ‘seed’ motif is then compared to each of the other dna strings. If we included dna[0] in our inner loop we would be comparing it to itself at each iteration.
Create a profile from all motifs in our list
We only have one item in our motif list when we first start our loop which results in the profile:
G | G | C | |
A | 0 | 0 | 0 |
C | 0 | 0 | 1.0 |
G | 1.0 | 1.0 | 0 |
T | 0 | 0 | 0 |
And so our first profile might be represented in a form similar to:
profile = [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)]
Find Profile-most probable motif in dna[i] using the current profile
We can now run our profile-most probable function passing the profile above and dna[i] as arguments. In the first iteration of our inner loop we will be using the second dna string; ‘AAGAATCAGTCA‘ and the profile; [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)] which returns the motif AAG
Add the resulting motif to motif list
We now add the resulting k-mer to our motif list:
motif= [‘GGC‘, ‘AAG‘]
Rinse and repeat using the next dna string and the updated motif list:
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG
Again we create a profile from all motifs in our list
G | G | C | |
A | A | G | |
A | 0.5 | 0.5 | 0 |
C | 0 | 0 | 0.5 |
G | 0.5 | 0.5 | 0.5 |
T | 0 | 0 | 0 |
Which we might represent in a form similar to:
profile =[(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)]
And again we find the Profile-most probable motif in dna[i] using the current profile
Using dna[i]; ‘CAAGGAGTTCGC‘ and the profile; [(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)] which returns the motif AAG which results in the updated motif list:
motifs = [‘GGC‘, ‘AAG‘, ‘AAG‘]
And so on…
We carry on this way until we reach the last dna string, dna[t]. In this case our motif list will look something like this:
motif = [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘]
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG
ifScore(Motifs) < Score(BestMotifs) BestMotifs ← Motifs outputBestMotifs
Finally we need to compare our current motif list with our best motif list to see which has the lower score. In the case above motif [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘] scores 7 compared to best_motif which scores 6. As such best_motif is not updated on this iteration.
I have duplicated the results for the first three iterations below now that we have some additional context.
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG Score = 7 |
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG Score = 5 |
GGCGTTCAGGCA AAGAATCAGTCA CAAGGAGTTCGC CACGTCAATCAC CAATAATATTCG Score = 4 |
This entire process is then repeated with the next k-mer in dna[0] and so on until you have reached the last k-mer in the first dna string.
In each case the motif list is emptied and initialised with the current k-mer from the first dna string. As such the last motif list would be initialised as motif = ”GCA”.
Finishing off
Hopefully the above details will give you a good idea how to implement the algorithm.
If you do have any trouble it might be worth attempting to code the inner loop first to recreate the first motif set (starting with ‘GGC‘). Once you have that in hand it should be fairly straight forward to iterate through the remaining k-mers and keep track of the best scoring motif set.
And I think that covers everything… I really hope that this has been helpful! Please let me know if I have missed anything (literally or figuratively) as this is as much an education for myself as anybody else.
Edit: A few people have kindly asked if they can donate a tiny bit towards hosting costs and so I have tentatively added a donate (via paypal) button below. Thanks 🙂
[wpedon id=554]
Leave a Reply