How do Cells Read Genes?

Just as the letters in a sentence spell out its words, the DNA sequence of a gene spells out the amino acid sequence of the protein it encodes. In the protein-coding region of a gene, the DNA sequence is interpreted in groups of three nucleotide bases, called codons. Each codon specifies a single amino acid in a protein.

Learn about the other parts of a gene in Anatomy of a Gene.

DNA as a sentence

We can think about the protein-coding sequence of a gene as a sentence made up entirely of 3-letter words. In the sequence, each 3-letter word is a codon, specifying a single amino acid in a protein. Have a look at this sentence:

Thesunwashotbuttheoldmandidnotgethishat.

If you were to split this sentence into individual 3-letter words, you would probably read it like this:

The sun was hot but the old man did not get his hat.

This sentence represents a gene. Each letter corresponds to a nucleotide base, and each word represents a codon. What if you shifted the "reading frame"? You would end up with:

T hes unw ash otb utt heo ldm and idn otg eth ish at.

Or Th esu nwa sho tbu tth eol dma ndi dno tge thi sha t.

As you can see, only one of these reading frames translates into an understandable sentence. In the same way, only one reading frame within a gene codes for the correct protein.
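The frame-shifting exercise above can be sketched in a few lines of Python. This is only an illustration, not part of the original lesson, and the function name `split_into_frames` is made up for this example.

```python
def split_into_frames(text, word_size=3):
    """Split text into fixed-size words, once for each possible starting offset."""
    frames = []
    for offset in range(word_size):
        words = []
        if offset:
            # Letters before the first full word form a short leftover piece.
            words.append(text[:offset])
        words.extend(text[i:i + word_size] for i in range(offset, len(text), word_size))
        frames.append(" ".join(words))
    return frames


sentence = "Thesunwashotbuttheoldmandidnotgethishat"
frames = split_into_frames(sentence)
for frame in frames:
    print(frame)
# The sun was hot but the old man did not get his hat
# T hes unw ash otb utt heo ldm and idn otg eth ish at
# Th esu nwa sho tbu tth eol dma ndi dno tge thi sha t
```

Only the first offset produces readable words, which mirrors the idea that only one reading frame codes for the correct protein.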

Mutating a DNA sentence

Take this DNA sequence:

GCATGCTGCGAAACTTTGGCTGA

You can separate the sequence into 3-letter codons, in 3 different ways:

  1. GCA TGC TGC GAA ACT TTG GCT GA
  2. G CAT GCT GCG AAA CTT TGG CTG A
  3. GC ATG CTG CGA AAC TTT GGC TGA

How can you tell which reading frame is the correct one?

All protein-coding regions begin with the sequence "ATG," which encodes the amino acid methionine (Met). Therefore, the correct reading frame will contain the codon "ATG." In the example above, only the third reading frame does.
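As a rough sketch, the search for the reading frame can be written in Python. The helper name `codons_from_start` is hypothetical, and taking the first "ATG" found is a simplifying assumption that happens to work for this short example sequence.

```python
def codons_from_start(dna, start_codon="ATG"):
    """Split dna into codons, beginning at the first start codon found."""
    position = dna.find(start_codon)
    if position == -1:
        return []  # no start codon found, so there is no frame to report
    coding = dna[position:]
    usable = len(coding) - len(coding) % 3  # ignore an incomplete trailing codon
    return [coding[i:i + 3] for i in range(0, usable, 3)]


sequence = "GCATGCTGCGAAACTTTGGCTGA"

# Offsets 0, 1 and 2 correspond to reading frames 1, 2 and 3 in the list above.
frame_number = sequence.find("ATG") % 3 + 1
codons = codons_from_start(sequence)

print(frame_number)       # 3
print(" ".join(codons))   # ATG CTG CGA AAC TTT GGC TGA
```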

You can predict the amino acid sequence of the protein by using the Universal Genetic Code.

Why isn't the start codon in DNA complementary to AUG?

A DNA sequence is read by a sequencing machine.

Image courtesy CDC/James Gathany

The Universal Genetic Code

The Universal Genetic Code is the instruction manual that all cells use to read the DNA sequence of a gene and build a corresponding protein. Proteins are made of amino acids that are strung together in a chain. Each 3-letter DNA sequence, or codon, encodes a specific amino acid.

The code has several key features:

  • All protein-coding regions begin with the "start" codon, ATG.
  • There are three "stop" codons that mark the end of the protein-coding region.
  • Multiple codons can code for the same amino acid.
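The features listed above can be checked with a small Python sketch. It builds the standard codon table from a compact 64-letter string (one-letter amino acid codes, with `*` marking a stop codon) and follows this page's convention of writing codons as DNA. The names `CODON_TABLE` and `translate` are chosen for this example only.

```python
BASES = "TCAG"
# One-letter amino acid codes for the 64 codons, in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(coding_dna):
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


stop_codons = [codon for codon, aa in CODON_TABLE.items() if aa == "*"]
leucine_codons = [codon for codon, aa in CODON_TABLE.items() if aa == "L"]

print(CODON_TABLE["ATG"])                  # M (methionine, the start codon)
print(stop_codons)                         # ['TAA', 'TAG', 'TGA']
print(len(leucine_codons))                 # 6
print(translate("ATGCTGCGAAACTTTGGCTGA"))  # MLRNFG
```

The last line translates the third reading frame of the example sequence from earlier on this page, starting at ATG and ending at the stop codon TGA.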

Note: Protein-building machinery does not read DNA directly. Instead, it reads an intermediate molecule, called messenger RNA, that is copied from DNA. Learn more about this process in Transcribe and Translate a Gene.

This graphic is based on the codon look-up table in Miller and Levine's Biology textbook.

Mutation

Mutation is a process that makes a permanent change in a DNA sequence. Changing a gene's DNA sequence can change the amino acid sequence of the protein it codes for.

Point Mutations

Point mutations are single base changes in a gene's DNA sequence. They can be further categorized:

  • Missense mutations cause a single amino acid change within the protein.
  • Nonsense mutations create a premature "stop" codon, causing the protein to be shortened.
  • Silent mutations do not cause amino acid changes.
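The three categories above can be told apart by comparing what a codon encodes before and after the change. The following Python sketch does this for single codons. The function name `classify_point_mutation` is hypothetical, and the sketch assumes the original codon is not itself a stop codon.

```python
BASES = "TCAG"
# One-letter amino acid codes for the 64 codons, in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS))


def classify_point_mutation(original_codon, mutated_codon):
    """Classify a single-base change (assumes original_codon is not a stop codon)."""
    before = CODON_TABLE[original_codon]
    after = CODON_TABLE[mutated_codon]
    if before == after:
        return "silent"    # same amino acid, so the protein is unchanged
    if after == "*":
        return "nonsense"  # premature stop codon shortens the protein
    return "missense"      # a different amino acid is inserted


print(classify_point_mutation("CTG", "CTA"))  # silent   (Leu -> Leu)
print(classify_point_mutation("CGA", "TGA"))  # nonsense (Arg -> stop)
print(classify_point_mutation("AAC", "AAA"))  # missense (Asn -> Lys)
```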

Insertion and Deletion Mutations

Insertion mutations and deletion mutations add or remove one or more DNA bases. Insertions and deletions (unless they happen in multiples of 3) can shift the reading frame of a gene, changing the grouping of bases into codons. Also called frameshift mutations, these changes can greatly affect a protein's amino acid sequence.
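To see why multiples of 3 matter, here is a minimal Python sketch that inserts or deletes bases just after the start codon of this page's example sequence and translates the result. The `translate` helper and the chosen edits are illustrative assumptions, not data from a real gene.

```python
BASES = "TCAG"
# One-letter amino acid codes for the 64 codons, in TCAG order; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = dict(zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS))


def translate(coding_dna):
    """Translate codon by codon, stopping at a stop codon or an incomplete codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


original = "ATGCTGCGAAACTTTGGCTGA"
one_base_insertion = original[:3] + "A" + original[3:]      # shifts the frame
three_base_insertion = original[:3] + "AAA" + original[3:]  # keeps the frame
one_base_deletion = original[:3] + original[4:]             # shifts the frame

print(translate(original))              # MLRNFG
print(translate(one_base_insertion))    # MTAKLWL
print(translate(three_base_insertion))  # MKLRNFG
print(translate(one_base_deletion))     # MCETLA
```

The three-base insertion adds one amino acid and leaves the rest of the protein intact, while the single-base changes alter every codon after the edit and also lose the original stop codon.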

  • Funding

    Funding provided by grant 51006109 from the Howard Hughes Medical Institute, Precollege Science Education Initiative for Biomedical Research.

It's not a mistake when we say that ATG is a start codon. Scientists generally consider AUG to be a start codon in an mRNA sequence and ATG to be a start codon in a DNA sequence.

But...
If AUG on an mRNA molecule means "start,"
and mRNA is copied from a DNA template,
and the DNA template is complementary to the mRNA copy,
then why isn't a DNA start codon TAC?

The key thing to remember is that DNA is double stranded.

Here's a DNA sequence. The start codon, ATG, is the second group of letters:

GC ATG CTG CGA AAC TTT GGC TGA

We've shown the sequence of just one of the DNA strands. It's a shortcut, and it's tidier to look at, and it's how DNA sequences are typically written. If we wanted to, we could include the sequences of both strands:

GC ATG CTG CGA AAC TTT GGC TGA
CG TAC GAC GCT TTG AAA CCG ACT

While our shorthand version shows just the top strand, it's actually the bottom strand that RNA polymerase reads to build an mRNA molecule. And if we're being literal about the actual nucleotides in the DNA strand that are read to build the mRNA's AUG start codon, we might consider the start codon on a DNA molecule to be TAC.

But that's not quite right. The chemical structure of DNA gives it a polarity, and the two complementary DNA strands are anti-parallel. That is, the 5' (5-prime) and 3' (3-prime) ends of the two DNA strands face in opposite directions:

5' GC ATG CTG CGA AAC TTT GGC TGA 3'
3' CG TAC GAC GCT TTG AAA CCG ACT 5'

The scientific standard is to write a nucleotide sequence from 5' to 3'. That means we'd have to write the sequence of the bottom strand like this:

5' TCA GCC AAA GTT TCG CAG CAT GC 3'

It would be more accurate to say that the DNA sequence of the "start codon" on the bottom strand is CAT. But that's an inconvenient way to talk about a protein-coding DNA sequence: everything's not only complementary but also backwards.
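The two steps described above, pairing each base with its complement and then reversing the result so it reads 5' to 3', can be sketched in Python. The function names are made up for this illustration.

```python
# Watson-Crick pairing: A pairs with T, and C pairs with G.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Return the paired bases, in the same left-to-right order as the input."""
    return dna.translate(PAIRING)


def reverse_complement(dna):
    """Return the opposite strand written in the standard 5' to 3' direction."""
    return complement(dna)[::-1]


top_strand = "GCATGCTGCGAAACTTTGGCTGA"  # written 5' to 3'

print(complement(top_strand))          # CGTACGACGCTTTGAAACCGACT (bottom strand, 3' to 5')
print(reverse_complement(top_strand))  # TCAGCCAAAGTTTCGCAGCATGC (bottom strand, 5' to 3')
print("CAT" in reverse_complement(top_strand))  # True
```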

For the sake of ease and clarity, scientists tend to ignore the bottom strand (they call it the "non-coding" or "antisense" strand). Instead, they refer to the sequence of the "coding" or "sense" strand: the one that's almost identical to the mRNA. The only difference is that every T in DNA is replaced by a U in RNA. They know there's another strand, and they know how to figure out what its sequence is if they need to.
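A short Python sketch shows why the shortcut works: building the mRNA by pairing against the bottom strand gives the same result as swapping T for U in the top strand. The function names are hypothetical and the sequences are the ones used on this page.

```python
def transcribe_coding_strand(coding_dna):
    """Shortcut: copy the coding (top) strand, replacing every T with U."""
    return coding_dna.replace("T", "U")


def transcribe_template_strand(template_3_to_5):
    """Pair each base of the bottom strand, read 3' to 5', with its RNA partner."""
    rna_pairing = str.maketrans("ACGT", "UGCA")
    return template_3_to_5.translate(rna_pairing)


coding_strand = "GCATGCTGCGAAACTTTGGCTGA"    # top strand, 5' to 3'
template_strand = "CGTACGACGCTTTGAAACCGACT"  # bottom strand, 3' to 5'

mrna_from_coding = transcribe_coding_strand(coding_strand)
mrna_from_template = transcribe_template_strand(template_strand)

print(mrna_from_coding)                        # GCAUGCUGCGAAACUUUGGCUGA
print(mrna_from_coding == mrna_from_template)  # True
```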