Field of Science

Yes but what does it do...

I am currently trying to get myself to finish writing an essay (rather terrifyingly, my first essay of term) on the different approaches to gene annotation in vertebrates. As I've just woken up (afternoon naps seem like such a good idea until you wake up with a mouth that feels like a hamster died in it), I thought I'd give a quick summary of gene annotation methods:

Gene annotation is the 'interesting' bit of genomics. Quite a lot of genome sequencing work has been done, some of it (especially the human bits) very highly publicised. And while genome sequencing is probably useful (more on that maybe in a more ethically-inclined post), on its own it's not terribly exciting. You're left with a big database full of mindless streams of nucleotides and one big embarrassing question:

What does it all do?

Gene annotation attempts to answer that: it tries to work out where the genes are and which proteins each one codes for, essentially what the end function of the genome is and what each piece of DNA is used for. There are two main methods: just using DNA, and using data from protein/cDNA sources. Both of these methods can be either comparative or non-comparative:

1) Just using DNA: Non-Comparative
This relies on getting a program such as GENSCAN to, quite literally, scan along the DNA looking for the beginning and end of genes based on sequence patterns it has been told to recognise. Not so good for function, but useful enough for finding the damn genes in the first place. Also relatively cheap, and you can run it overnight.

2) Just using DNA: Comparative
Like it says, this compares your DNA with other previously annotated pieces of DNA to see if there are any very similar bits it can ascribe function to. It's a good starting point, especially now the pool of annotated genomes is increasing, but it's really bad at finding gene start points, especially when there are 'introns', or bits of DNA that are not actually turned into protein. Which is around 95% of the human genome, incidentally. (An example of this, if anyone's interested, is TWINSCAN.)

3) cDNA/Protein data: Non-Comparative
cDNA, just to clarify, is DNA that has been reverse transcribed from RNA templates, i.e. it's all the DNA that will get turned into protein, without any of the introns. A good way to use this is to make cDNA 'libraries', i.e. all the cDNA within the cell stored on plasmids, choose one at random, see what it makes and, at the same time, find where it is in the genome. Simple and useful.

4) cDNA/Protein data: Comparative
This compares your genome with bits of cDNA from other genomes, where the cDNA has known function. Protein comparison is even more useful, as seeing what protein your protein most resembles provides structural information as well as functional, and allows you to build up homologous families of proteins with similar function (if you have enough genomes). Also, if you have enough protein data you can say you're doing 'proteomics', and the more 'omics' words in your project, the more funding you're likely to get :)
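
For anyone who wants to see what 'scanning along the DNA' in method 1 might look like, here's a very rough sketch in Python. It is not how GENSCAN works (that uses a proper probabilistic model and copes with introns); it just walks each forward reading frame looking for a start codon followed by an in-frame stop codon. The function name find_orfs and the min_codons parameter are my own inventions for illustration.

```python
# Toy open reading frame (ORF) scan: a drastically simplified stand-in
# for ab initio gene finders. It ignores introns, the reverse strand,
# and everything else that makes real gene prediction hard.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=3):
    """Return sorted (start, end) index pairs of ORFs on the forward strand.

    start is the index of the ATG; end is the index just past the stop
    codon. min_codons counts both the start and the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # close this ORF and look for the next ATG
    return sorted(orfs)


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGGG"))
```

On that made-up example sequence the scan picks out the stretch ATGAAATAG. Vertebrate genes are broken up by introns, so a scan this naive would miss most of them, which is rather the point of the fancier programs.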

By the way, all of these comparative methods are based on homologous evolutionary relationships between the genomes, so anyone who says that scientists never use evolution is WRONG. (and probably pissing off the evodevo people as well)
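
To make the comparative idea a bit more concrete, here's a toy sketch, again in Python: take a new sequence, compare it against a handful of already-annotated ones, and borrow the label of the most similar. Real tools do proper alignments; this just measures the overlap of short subsequences (k-mers), and the names best_match, geneA and geneB are made up for the example.

```python
# Toy comparative annotation: score a query against annotated sequences
# by shared k-mers (Jaccard similarity) and return the closest one.
# This is an illustration of the idea only, not a real alignment method.


def kmers(seq, k=4):
    """Return the set of all length-k substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def best_match(query, annotated, k=4):
    """Return (name, score) for the annotated sequence most like query.

    annotated maps a label to a sequence. Returns (None, 0.0) if
    nothing shares a single k-mer with the query.
    """
    q = kmers(query, k)
    best_name, best_score = None, 0.0
    for name, seq in annotated.items():
        ref = kmers(seq, k)
        union = q | ref
        score = len(q & ref) / len(union) if union else 0.0
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


if __name__ == "__main__":
    # Hypothetical 'annotated genome': two invented sequences.
    known = {"geneA": "ATGGCCATTGTA", "geneB": "TTTTTTTTTTTT"}
    print(best_match("ATGGCCATTGTA", known))
```

The more annotated genomes go into that dictionary, the better the odds of a decent match, which is why this approach keeps improving as more genomes get sequenced.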

As always, any questions are welcomed, leave them in the comments and I'll get back to you.

Disclaimer: This post was written while half asleep. Any spelling/grammar mistakes are therefore completely the fault of the writer's Brain On Sleep.


Luke said...

Sequence analysis... you're in my territory now, Lab Rat!

First up, a minor correction; introns are actually only bits of DNA that are transcribed but not translated - so they are read and turned into RNA, but then they are cut out by the spliceosome. The 95% junk DNA is almost entirely intergenic regions and heterochromatin.

Secondly, genome scans can produce far more information on function than you give them credit for. They can find start and stop codons easily, but the more advanced annotation software (like the Ensembl gene scans at the Sanger) can make a guess at the transcript, the UTR and the exon/intron boundaries too. Once they have done that, they can make a predicted protein sequence.

You can then a) fit the protein into a protein family, which gives you an idea of its function and b) look for known binding sites and functional motifs in the protein. Generally, you can make a pretty good bet on what the protein function is from this information.

Of course, the analysis will often miss genes, and sometimes it won't be able to pick out much information, but it can still do an awful lot.

Lab Rat said...

Thanks for the information! I didn't know that about introns, I assumed they were untranscribed as well as untranslated. That's what over two years of top university education does for you *feels silly*

Heh, my lack of trust in genome programs is probably caused by listening to too many bacteria people complaining about how no programs will properly find their PKA sequences. Although that's mainly caused by how difficult they are to sequence as well. :)

rhan said...

I'd say the problem for bacteriologists (and the rest of us, too) isn't so much the genome scan as the lack of known protein structures. While we have determined a good number of structures, they only cover a very small percentage of proteins. The dang things just don't crystallize well, and it's slow going (but essential) getting high resolution structures via most techniques. Until we have structures for most families, full genome sequences are good for drug design, but we still face that question: what the bleeping bleep does it DO?