Expressed Sequence Tags and the UniGene Repository


Abstract

A gene can be uniquely identified by one or more “sequence tagged sites” (STSs), stretches of DNA about nine bases long. The UniGene repository collates the STSs derived by independent researchers to reduce confusion. This document describes the process by which the notion of the STS was derived. It was originally written in 1999, while the author was in the Bioinformatics department at SmithKline Beecham Pharmaceuticals R&D.



J. Random Genomicist (“JRG”) decides he wants to look for human genes. He says, “I could start sequencing each chromosome from one end to the other, but that would be inefficient, even if possible, because most of each eukaryotic chromosome is non-coding DNA. I only want to know about genes (since they code for proteins); and even those are interrupted by introns.”

So JRG thinks awhile, and says, “Say I have a pancreas cell, churning out lots of insulin. So the cell will contain many insulin-producing ribosomes, and many copies of the mRNA to direct them.” He blenderizes the pancreas cells to extract their mRNA, then mixes it with the enzyme reverse transcriptase to create single-stranded DNA copies, which he calls complementary DNA or cDNA.

But because of limitations of the process, he doesn't get a single continuous cDNA strand; he gets lots of short ones. “I'll call these ESTs, expressed sequence tags, because they're made from the ‘expression’ of the gene, rather than its raw form; and because they're not the whole sequence, but a tag for it.” Because the fragments have certain regions in common, JRG can mix and match and overlap them into a contiguous sequence. (Other researchers have done this before, and call such collections contigs.) JRG calls this collection of ESTs an assembly.
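The overlap-and-merge idea can be sketched in a few lines of Python. This is only a toy illustration, not how any real assembler works: the fragment strings are made up, the function names `overlap` and `greedy_assemble` are hypothetical, and the sketch assumes error-free fragments that all read in the same direction.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest overlap."""
    contigs = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left that overlaps
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up fragments that overlap end to end.
ests = ["ATGGCCCTG", "CCCTGTGGATG", "GGATGCGCCTC"]
print(greedy_assemble(ests))
```

If the fragments share no overlaps, the function simply returns them unmerged, which corresponds to several separate assemblies.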

But even though each EST is shorter than the complete mRNA strand, and much shorter than the raw gene, it's still long and unwieldy. “A single EST is overkill; it's so long, it overspecifies the target gene. I know,” JRG decides with sudden inspiration. “I'll chop each EST into shorter parts until I find one that's the smallest unique identifying tag.” He names them sequence tagged sites, or STSs. “Then I can hybridize the cDNA, carrying a chemical marker, to a metaphase chromosome and figure out where the gene is physically located.”

“How short an STS is too short?” JRG wonders. “Let's apply some probability. Since there are 4 bases in DNA, a single base should appear one-quarter of the time, a two-base sequence should appear one-sixteenth, and so forth.”

Bases	Probability	Fraction	Percent
1	(1/4)^1	1/4	25.00%
2	(1/4)^2	1/16	6.25%
3	(1/4)^3	1/64	1.56%
4	(1/4)^4	1/256	0.39%
5	(1/4)^5	1/1024	0.098%
6	(1/4)^6	1/4096	0.024%
7	(1/4)^7	1/16384	0.0061%
8	(1/4)^8	1/65536	0.0015%
9	(1/4)^9	1/262144	0.00038%

“And since there are three billion bases, but only about 100 thousand genes, a sequence that will statistically appear only one hundred-thousandth of the time, or 0.001%, should be adequate. Since 4^9 is over 260 thousand, nine bases should be enough.”
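JRG's arithmetic is easy to check with a short Python sketch. It makes the same simplifying assumption he does, namely that the four bases are equally likely and independent at every position; the function names and the `GENE_COUNT` constant are illustrative only.

```python
GENE_COUNT = 100_000  # JRG's rough estimate of the number of human genes


def chance(n: int) -> float:
    """Probability that a given n-base sequence appears at a given position."""
    return 0.25 ** n


def shortest_tag_length(target_count: int) -> int:
    """Smallest n for which 4^n exceeds target_count."""
    n = 1
    while 4 ** n <= target_count:
        n += 1
    return n


# Reproduce the table for lengths 1 through 9.
for n in range(1, 10):
    print(f"{n}\t1/{4 ** n}\t{chance(n) * 100:.5f}%")

print("Shortest tag length:", shortest_tag_length(GENE_COUNT))
```

Eight bases give only 65,536 possible sequences, fewer than the 100 thousand genes, so nine is the first length that clears the bar.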

So JRG plays with restriction enzymes and slices out a candidate STS from one of his insulin ESTs. He adds a fluorescent flag to the single-stranded DNA, then mixes it with the chromosome he knows the insulin gene is on. He waits for hybridization to occur, then puts the dish under his microscope, finds the chromosome, and shines a laser on it to make the fluorescent flag glow.

“I don't see anything!” JRG complains. “Oops, I forgot how dim a single fluorescent molecule would be.” So he hooks up the photomultiplier and sets the camera for a long exposure time.

“Oh dear, it's lit up like a Christmas tree!” JRG exclaims. “Obviously that wasn't a very good STS. That DNA sequence doesn't happen only 0.00038% of the time, on average once per the length of this chromosome; it happened dozens of times! Clearly, this short DNA sequence is repeated many times along the chromosome. Probably there are other such noise sequences. I will have to avoid them from now on; they aren't specific enough for my needs.”
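In software terms, JRG's test amounts to counting how many times a candidate tag occurs in a sequence and rejecting any candidate that occurs more than once. The sketch below is a toy stand-in for the wet-lab experiment: the "chromosome" string is invented, the function names are hypothetical, and only one strand is searched, whereas a real probe could also pair with the opposite strand.

```python
def count_occurrences(tag: str, sequence: str) -> int:
    """Count occurrences of tag in sequence, including overlapping ones."""
    count = 0
    start = sequence.find(tag)
    while start != -1:
        count += 1
        start = sequence.find(tag, start + 1)
    return count


def is_specific(tag: str, sequence: str) -> bool:
    """A usable tag occurs exactly once in the sequence."""
    return count_occurrences(tag, sequence) == 1


# Invented sequence containing one nine-base stretch three times.
repeat = "GATTACAGG"
chromosome = "ACGT" + repeat + "TTTT" + repeat + "CCCC" + repeat + "AGCTTGCAA"

for candidate in (repeat, "AGCTTGCAA"):
    hits = count_occurrences(candidate, chromosome)
    verdict = "keep" if is_specific(candidate, chromosome) else "reject"
    print(candidate, hits, verdict)
```

The repeated candidate is rejected and the one that occurs a single time is kept, mirroring JRG's decision to avoid repeated sequences.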

So JRG mixes up new restriction enzymes and gets a new candidate STS. He prepares the chromosome, looks at it, and finds to his delight only one region is lit up. “Yay, I found the gene for insulin!” he says. Then he stains the chromosome with Giemsa dye, and compares where the glowing region is in relation to the G bands. He sticks a thumbtack in the striped cytogenetic ideogram on the wall. “And now I have it located on my cytogenetic map.”

Then JRG publishes his work, and many other genomicists perform similar experiments with their own proteins of interest. A huge body of ESTs and STSs is published. It's confusing because there are many ESTs per gene, and many STSs per EST; and so when two researchers compare STSs, they may not realize they're talking about the same gene. Finally, one group decides to create a repository of unique, nonredundant gene markers, and they call it UniGene.