Output measures & Confusion Matrix

  Predicted 0 Predicted 1
Actual 0 True Negative False Positive
Actual 1 False Negative True Positive

 

Accuracy Error Rate
( True Negative + True Positive ) /
Number of observations
( False Negative + False Positive ) /
Number of Observations
Sensitivity Specificity
True Positive /
( True Positive + False Negative )
True Negative /
( True Negative + False Positive )
False Negative Error Rate False Positive Error Rate
False Negative /
( True Positive + False Negative )
False Positive /
( True Negative + False Positive )

Installing rjags on OSX Mavricks

I am currently working to remove some of the thornier dependencies for BAYSIC (source) and though we have just about managed to replace VCF merge (which has been especially ill behaved for us) we have decided that for now R2jags is still our best bet for Gibbs sampling.

I know some people have a bad time installing RJags but the process below seems to have worked pretty well for me.

First you need to install JAGS, for me this was as easy as downloading and installing:
JAGS-3.3.0.dmg
(http://sourceforge.net/projects/mcmc-jags/files/latest/download?source=files)

Once that’s installed we can enter your R console and run:

> install.packages(“rjags”)
and
> install.packages(“R2jags”)

And there might be one or two dependencies left to install,  I had to run:

> install.packages(‘getopt’)
and
> install.packages(‘reshape2’)

Finally check if everything is running with:
> library(rjags)

Hope that went well!  Let me know.

BWA (Burrows-Wheeler Aligner) installation quickie

Download the latest / required version of BWA:
http://sourceforge.net/projects/bio-bwa/files/bwa-0.7.12.tar.bz2/download

Unzip the downloaded file and navigate to the resulting directory:

  • tar -xvf bwa-0.7.12.tar.bz2
  • cd bwa-0.7.12

BWA doesn’t come with a ./configure file so we can just run

  • make

Which should compile everything nicely and result in a executable bwa binary file in the bwa-0.7.12 directory.  Once you have made this binary file all of the other files in the unzipped directory are no longer needed, they’re sole purpose  in life was to make that binary…

From here you can either run the binary from the bwa-0.7.12 directory. e.g.:

…/bwa-0.7.12/bwa aln…

or  move / copy the binary into /usr/bin/ if that’s your kind of thing.

 

Greedy Motif Search

Having spent some time trying to grasp the underlying concept of the Greedy Motif Search problem in chapter 3 of Bioinformatics Algorithms (Part 1) I hoped to cement my understanding and perhaps even make life a little easier for others by attempting to explain the algorithm step by step below.

I will try to provide an overview of the algorithm as well as addressing each section of the pseudo code provided in the course:

Pseudo Code

   GREEDYMOTIFSEARCH(Dna, k, t)
        BestMotifs ← motif matrix formed by first k-mers in each string
                      from Dna
        for each k-mer Motif in the first string from Dna
            Motif1Motif
            for i = 2 to t
                form Profile from motifs Motif1, …, Motifi - 1
                MotifiProfile-most probable k-mer in the i-th string
                          in Dna
            Motifs ← (Motif1, …, Motift)
            if Score(Motifs) < Score(BestMotifs)
                BestMotifsMotifs
        output BestMotifs

I haven’t gone into any detail for creating profiles from k-mers, finding profile-most probable k-mers or scoring as I found these topics were relatively clear in the course text.  Furthermore I haven’t posted my code for obvious reasons.


Overview

The basic idea of the greedy motif search algorithm is to find the set of motifs across a number of DNA sequences that match each other  most closely. To do this we:

  • Run through each possible k-mer in our first dna string (illustrated for the first three 3-mers in orange below),
  • Identify the best matches for this initial k-mer within each of the following dna strings (using our profile-most probable function, illustrated below in green) thus creating a set of  motifs at each step
  • Score each set of motifs to find and return the best scoring set.

For example the first three iterations in the main loop if k = 3 (k being the length of our k-mers, or in this case 3-mers) will give us the following motif sets and scores:

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 7
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 5
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 4

The best motif set in the first three iterations, as illustrated above, is therefore the third which scores 4, the best overall score happens at the seventh iteration:

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 2

The Code

BestMotifs ← motif matrix formed by first k-mers in each string from Dna

First we initialise best_motifs as a list featuring the first k-mer from each dna string.

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

So using our example DNA list above we get:

best_motifs = [‘GGC‘, ‘AAG‘, ‘CAA‘, ‘CAC‘, ‘CAA‘]

This list will be used to compare against other motif lists that we will build later in the algorithm.


for each k-mer Motif in the first string from Dna

Motif1 ← Motif

In our outer loop we will be iterating through each k-mer in the first dna string (in this case ‘GGCGTTCAGGCA‘ ).

At the beginning of each of these iterations we will first initialise a new list called motif and then add the respective k-mer as its first entry

First iteration:

GGCGTTCAGGCA

So on the first iterations the motif list will be initialised as motif= [‘GGC‘]

Second-iteration

GGCGTTCAGGCA

On the second iteration the motif list will be initialised as motif = [‘GCG‘]

And so on…

Within each iteration of the outer loop we will be repeating the following instructions within our ‘inner loop’ below:


for i = 2 to t

form Profile from motifs Motif1, …, Motifi – 1

MotifiProfile-most probable k-mer in the i-th string in Dna

Motifs ← (Motif1, …, Motift)

Our inner loop effectively iterates through the second dna string the final dna string (or rather dna[1] to dna[t]). At each step we will add the most likely candidate k-mer from the current dna string to our current motif list.

As an example we will run through the first iteration of our main loop for which we initialised our motif list as [‘GGC‘] above:

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

To tie things together I have tried to colour code the current k-mer from the first dna string as orange, the current dna string as blue and the best matches from previously analysed dna strings as green.

Please note that we iterate through dna[1] to dna[t] within the inner loop which excludes dna[0]. dna[0] is used to create the first item in our motif list upon each iteration of our outer loop as described above. This ‘seed’ motif is then compared to each of the other dna strings. If we included dna[0] in our inner loop we would be comparing it to itself at each iteration.

Create a profile from all motifs in our list

We only have one item in our motif list when we first start our loop which results in the profile:

G G C
A 0 0 0
C 0 0 1.0
G 1.0 1.0 0
T 0 0 0

And so our first profile might be represented in a form similar to:

profile = [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)]

Find Profile-most probable motif in dna[i] using the current profile

We can now run our profile-most probable function passing the profile above and dna[i] as arguments. In the first iteration of our inner loop we will be using the second dna string;  ‘AAGAATCAGTCA‘ and the profile; [(0.0, 0.0, 1.0, 0.0), (0.0, 0.0, 1.0, 0.0), (0.0, 1.0, 0.0, 0.0)] which returns the motif  AAG

Add the resulting motif to motif list

We now add the resulting k-mer to our motif list:

motif= [‘GGC‘, ‘AAG‘]

Rinse and repeat using the next dna string and the updated motif list:

 GGCGTTCAGGCA
 AAGAATCAGTCA
 CAAGGAGTTCGC
 CACGTCAATCAC
 CAATAATATTCG

Again we create a profile from all motifs in our list

G G C
A A G
A 0.5 0.5 0
C 0 0 0.5
G 0.5 0.5 0.5
T 0 0 0

Which we might represent in a form similar to:

profile =[(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)]

And again we find the Profile-most probable motif in dna[i] using the current profile

Using dna[i];  ‘CAAGGAGTTCGC‘ and the profile; [(0.5, 0.0, 0.5, 0.0), (0.5, 0.0, 0.5, 0.0), (0.0, 0.5, 0.5, 0.0)] which returns the motif  AAG which results in the updated motif list:

motifs = [‘GGC‘, ‘AAG‘, ‘AAG‘]

And so on…

We carry on this way until we reach the last dna string, dna[t]. In this case our motif list will look something like this:

motif = [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘]

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG

ifScore(Motifs) < Score(BestMotifs) BestMotifsMotifs outputBestMotifs

Finally we need to compare our current motif list with our best motif list to see which has the lower score. In the case above motif [‘GGC‘, ‘AAG‘, ‘AAG‘, ‘CAC‘, ‘CAA‘] scores 7 compared to best_motif which scores 6. As such best_motif is not updated on this iteration.

I have duplicated the results for the first three iterations below now that we have some additional context.

GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 7
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 5
GGCGTTCAGGCA 
AAGAATCAGTCA 
CAAGGAGTTCGC 
CACGTCAATCAC 
CAATAATATTCG
Score = 4

This entire process is then repeated with the next k-mer in dna[0] and so on until you have reached the last k-mer in the first dna string.

In each case the motif list is emptied and initialised with the current k-mer from the first dna string. As such the last motif list would be initialised as motif = ”GCA”.


Finishing off

Hopefully the above details will give you a good idea how to implement the algorithm.

If you do have any trouble it might be worth attempting to code the inner loop first to recreate the first motif set (starting with ‘GGC‘). Once you have that in hand it should be fairly straight forward to iterate through the remaining k-mers and keep track of the best scoring motif set.

And I think that covers everything… I really hope that this has been helpful!   Please let me know if I have missed anything (literally or figuratively) as this is as much an education for myself as anybody else.

Edit: A few people have kindly asked if they can donate a tiny bit towards hosting costs and so I have tentatively added a donate (via paypal) button below. Thanks 🙂

Git and Bitbucket Quickie

pixel_tapHi soon-to-be Git fans!

This guide is for people who kind of know what Git is but perhaps haven’t had much experience in using it.

The first thing to do is to install Git if you haven’t already. Everything you need to do that is over at http://git-scm.com/. If you are installing on Windows make sure you are installing it with Git Bash if you want to follow the guide below (you should be fine using the default options throughout the install), otherwise the instructions provided on the site should see you through nicely.

Now head over to bitbucket.org, sign up if you haven’t already and create a new repository to play around with. I won’t go in to to much detail about this step either as Bitbucket is nice and easy to use and will hold your hand through the process.

Next, a quick Git primer might be useful if you are a bit rusty or new to all this. Now, I don’t want anyone to think that I am an on-line course fan-boy or too lazy to write my own guide to Git from scratch but it just so happens that there is a rather fancy free interactive GIT on-line training course available at try.github.io which covers the basics really nicely.

Beautiful!

Rightio, now that you have had some time with Git online we can set a Git repository up on our local machine and link it to our remote repository on Bitbucket!

First of all we need to create a local Git repository on our machine. Each local repository (which is really just a bunch of files describing the directory’s structure) sits in the same directory as the files it manages. To add a repository:

  • Using command line (or Git Bash) to navigate to the folder you want to keep all your files in. For example you could use: ‘ cd work/project
    • If the project directory doesn’t exist yet you can use, as an example, ‘ mkdir project ‘ from the /work directory
  • Once you are inside your work/project directory you can create a repository with ‘ git init

Go team! we have a repository! Type ‘ git status ‘ to check that everything is working out so far.

Next up we need to connect with the remote version on Bitbucket. For now we can do that quickly though you will have to re-authenticate with your password every time you want to push / pull… We’ll fix that in another post though there are loads of guides out there already for the impatient.

  • Making sure you are in the directory with your new repository use:
    git remote add origin https://your_username@bitbucket.org/repository_owner_name/remote_repository_name.git ‘ (you can copy and paste this address from your Bitbucket repository page).
  • To make sure that has worked pull all the remote files to your local repository with ‘ git pull origin master

Fingers crossed you should have all the previously remote files in your local file.

Now lets make sure you can push stuff back.

  • Save a .txt file within your local folder
  • use ‘ git add -A ‘ (The -a bit is asking git to add all changes – this includes deletions and amendments, quick and handy for later)
  • Use ‘ git status ‘ to see what file changes are waiting to be committed
  • Now use ‘ git commit -am ‘A description of the commit ‘ Remember to add these -m messages to all of your commits, it makes me sad when people forget.
    • At this point Git might ask you to add some details about yourself if you haven’t provided them already, just follow the instructions given.

Your local Git repository is now up to date, we are ready to send your changes back to BitBucket!

  • First of all I always like to do a pull so that my local repository is as up-to-date as possible before doing a push – do that! ‘ git pull origin master
  • Now all we need to do is push! You may have already guessed the command… ‘ git push origin master

And there you have the basics!  Of course this covers a only a tiny fraction of everything that Git is good at but it is the first step on that long road.

Let me know if you have any problems!

MrGraeme

3n/2 comparisons to find the smallest and largest numbers in the list

After so much time studying for MIT6.00X and running through Rosalind problems I was rather frustrated to find the first exercise in An Introduction to Bioinformatics Algorithms below a bit tricky…

Design an algorithm that performs only 3n/2 comparisons to find the smallest and largest numbers in a list

As a linear search would end up performing 2n comparisons I looked at all sorts of ways to split the sequence up recursively but all my ideas ended up a bit messy and confused. After much longer than I would have liked I ended up:

  • Pairing up my value into n/2 pairs
  • Comparing the minimum value in each pair with the global minimum and vice versa with the maximum value

As such we see at most 3 comparisons for each pair of values giving us 3n/2 comparisons in total.

The only thing to be wary of is a sequence with an odd number of values, I think I have avoided the issue in the code below with a cheeky list slice and a bit of redundancy.

After fiddling with all sorts of messy ideas I settled for the following:


seq = [1,2,2,3,2,5,7,4,7,9,1,2,7]

def find_max_min(seq):
    mini,maxi = seq[0], seq[0]
    if len(seq) % 2 != 0:
        seq = seq[1:]
    ''' Initalising mini and maxi as seq[0] gives us the redundancy to 
        remove it from any seq with an odd length '''
    while len(seq) > 1:
        if seq[0] < seq[1]:
            if mini >= seq[0]:
                mini = seq[0]
            if maxi <= seq[1]:
                maxi = seq[1]
        else:
            if mini >= seq[1]:
                mini = seq[1]
            if maxi <= seq[0]:
                maxi = seq[0] 
        ''' Here we are performing at most 3 comparisons 
            on each pair of items in our sequence '''
        seq = seq[2:]
    return mini, maxi

print find_max_min(seq)

On life regrets and finishing MIT6.00X

My meatiest lingering life regret has been that I spent most of my Artificial Intelligence major at the student union pub…

My second most meaty regret has been that programming lectures were the worst casualty of my determination to drink rather than learn.

Well, they say that life is all about second chances (do people say that?) and just to prove it MIT6.00X – Introduction to Computer Science course was recently published on EDX. Right in the middle of our world trip. I was going to learn to code at last!

The first thing that I learned was that I was right to regret shirking my lectures at university – I just couldn’t believe how much amazing stuff I missed out on the first time around. One beautiful concept after another was covered, as well as core programming concepts such as Object Orientation and recursion within a couple of lectures we were into algorithm design and complexity theory and then further in to specific problems such as search and sorting.

By the end of the course, as well as other things, I had programmed a cleaning robot, simulated viruses and drugs within a human body, used graphs to build the kind of route-finder you might find used on your GPS and proved that P = NP. O.K. the P = NP bit was a lie… but I do know what P = NP means!

Anyway, after 13 weeks of WIFI dependence, beach dodging and wife neglect I finally completed the final exam and… managed an overall score of 93%… which was crazy. Just to satiate your curiosity as to how the grades were distributed across all students I have generously marked myself in red below.

Screenshot-6.00x Course Info - Google Chrome

So there we are… with that I have managed to partially quell one of my all time all time life regrets. Which feels really good. But… The other thing that I learned is that being able to code a bit better is not enough. Every time I completed a problem set or watched a lecture or did some background reading I wanted to be solving real world problems to use my new powers for good.

And so I have a new regret to work on… that of not getting started in my coding career earlier. I wonder where that might lead…