Bioinformatics Toolbox
Determining the protein-coding sequence for a eukaryotic gene can be a difficult task because introns (noncoding sections) are mixed with exons. However, prokaryotic genes generally do not have introns and mRNA sequences have the introns removed. Identifying the start and stop codons for translation determines the protein-coding section or open reading frame (ORF) in a sequence. Once you know the ORF for a gene or mRNA, you can translate a nucleotide sequence to its corresponding amino acid sequence.
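The idea of bracketing an ORF between a start codon and a stop codon can be sketched in base MATLAB. This is only an illustration under stated assumptions: the toy sequence is made up, only ATG is treated as a start codon, only the standard stop codons are used, and a single reading frame is scanned. The stop index here marks the last base of the stop codon, a convention chosen for this sketch that may differ from what the toolbox functions report.

```matlab
% Toy sequence (hypothetical, not taken from the mitochondrial genome).
seq = 'ccaatggcatcttaagg';
frame = 1;                          % reading frame to scan (1, 2, or 3)
stopCodons = {'taa','tag','tga'};   % stop codons of the standard genetic code

orfStart = [];                      % first base of each start codon found
orfStop  = [];                      % last base of each matching stop codon
openAt   = 0;                       % 0 means no ORF is currently open

for k = frame:3:numel(seq)-2
    codon = lower(seq(k:k+2));
    if openAt == 0 && strcmp(codon,'atg')
        openAt = k;                 % remember where the ORF begins
    elseif openAt > 0 && any(strcmp(codon,stopCodons))
        orfStart(end+1) = openAt;   %#ok<SAGROW>
        orfStop(end+1)  = k + 2;    %#ok<SAGROW>
        openAt = 0;                 % close the ORF and keep scanning
    end
end

disp([orfStart; orfStop])           % one column per ORF: start over stop
```

For this toy sequence the scan finds one ORF, from the ATG at position 4 to the TAA ending at position 15.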
After you read a sequence into MATLAB, you can analyze the sequence for open reading frames. This procedure uses the human mitochondrial genome as an example. See Getting Sequence Information into MATLAB.
Display open reading frames (ORFs) in a nucleotide sequence. In the MATLAB Command Window, type
seqshoworfs(mitochondria);
If you compare this output to the genes shown on the NCBI page for NC_001807, there are fewer genes than expected. This is because vertebrate mitochondria use a genetic code slightly different from the standard genetic code. For a table of genetic codes, see Genetic Code.
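One difference between the two codes is the codon TGA: it is a stop codon in the standard code but encodes tryptophan in the Vertebrate Mitochondrial code, so a scan with the standard code cuts mitochondrial ORFs short. The following sketch, which assumes Bioinformatics Toolbox is installed and uses a made-up six-base sequence, translates the same bases under both codes with nt2aa.

```matlab
% Requires Bioinformatics Toolbox.
% Hypothetical example sequence: an ATG codon followed by TGA.
codingSeq = 'ATGTGA';

% Translate with the standard genetic code (the nt2aa default).
aaStandard = nt2aa(codingSeq);

% Translate with the Vertebrate Mitochondrial code.
aaMito = nt2aa(codingSeq,'GeneticCode','Vertebrate Mitochondrial');

fprintf('Standard code:                %s\n',aaStandard);
fprintf('Vertebrate Mitochondrial code: %s\n',aaMito);
```

The standard translation should end in the stop symbol (*), while the mitochondrial translation should read the second codon as W (tryptophan).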
Display ORFs using the Vertebrate Mitochondrial code.
orfs = seqshoworfs(mitochondria, ...
    'GeneticCode','Vertebrate Mitochondrial', ...
    'alternativestart',true);
Notice that there are now two large ORFs in the first reading frame. One starts at position 4471 and the other starts at position 5905. These correspond to the genes ND2 (NADH dehydrogenase subunit 2 [Homo sapiens]) and COX1 (cytochrome c oxidase subunit I).
Find the corresponding stop codon. The start and stop positions of each ORF are stored at matching indices in the fields Start and Stop, so the index of a start position in Start also locates its stop position in Stop.
ND2Start = 4471;
StartIndex = find(orfs(1).Start == ND2Start)
ND2Stop = orfs(1).Stop(StartIndex)
MATLAB displays the stop position.
ND2Stop =
5512
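The matching-index lookup can be tried without the genome data. The sketch below uses a hypothetical stand-in for the structure returned by seqshoworfs: only the 4471 and 5512 pair comes from this example, and the other positions are made up. It also guards against a start position that is not present, which the steps above do not check.

```matlab
% Hypothetical stand-in for the first element of the seqshoworfs output:
% Start and Stop are parallel vectors, one entry per ORF in the frame.
toyOrfs(1).Start = [100 4471 9000];   % 100 and 9000 are made-up values
toyOrfs(1).Stop  = [250 5512 9600];   % 250 and 9600 are made-up values

geneStart = 4471;

% Index of the ORF whose start position matches geneStart.
idx = find(toyOrfs(1).Start == geneStart);
if isempty(idx)
    error('No ORF starts at position %d in frame 1.',geneStart);
end

% The same index selects the matching stop position.
geneStop = toyOrfs(1).Stop(idx)
```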
Using the sequence indices for the start and stop of the gene, extract the subsequence from the sequence.
ND2Seq = mitochondria(ND2Start:ND2Stop)
The subsequence (protein-coding region) is stored in ND2Seq and displayed on the screen.
attaatcccctggcccaacccgtcatctactctaccatctttgcaggcac
actcatcacagcgctaagctcgcactgattttttacctgagtaggcctag
aaataaacatgctagcttttattccagttctaaccaaaaaaataaaccct
cgttccacagaagctgccatcaagtatttcctcacgcaagcaaccgcatc
cataatccttc . . .
Determine the codon distribution.
codoncount(ND2Seq)
The codon count shows high counts for CTA, ACC, ATC, and ATA.
AAA-10 AAC-14 AAG-2 AAT-6
ACA-11 ACC-24 ACG-3 ACT-5
AGA-0 AGC-4 AGG-0 AGT-1
ATA-22 ATC-24 ATG-2 ATT-8
CAA-8 CAC-3 CAG-2 CAT-1
CCA-4 CCC-12 CCG-2 CCT-5
CGA-0 CGC-3 CGG-0 CGT-1
CTA-26 CTC-18 CTG-4 CTT-7
GAA-5 GAC-0 GAG-1 GAT-0
GCA-8 GCC-7 GCG-1 GCT-4
GGA-5 GGC-7 GGG-0 GGT-1
GTA-3 GTC-2 GTG-0 GTT-3
TAA-0 TAC-8 TAG-0 TAT-2
TCA-7 TCC-11 TCG-1 TCT-4
TGA-10 TGC-0 TGG-1 TGT-0
TTA-8 TTC-7 TTG-1 TTT-8
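To see what a codon tally involves, the counting can be sketched in base MATLAB on the first 24 bases of the ND2 subsequence shown earlier. This is an illustrative sketch, not the toolbox implementation: it reads the first reading frame only, ignores any trailing partial codon, and lists only the codons that occur.

```matlab
% First 24 bases of the ND2 subsequence displayed earlier.
seq = upper('attaatcccctggcccaacccgtc');

% Drop any trailing bases that do not form a complete codon.
nCodons = floor(numel(seq)/3);

% Reshape into a 3-by-N array (one codon per column), then transpose
% so that each row holds one codon, and convert to a cell array.
codons = cellstr(reshape(seq(1:3*nCodons),3,[])');

% Group identical codons and count the members of each group.
[uniqueCodons,~,group] = unique(codons);
counts = accumarray(group,1);

for k = 1:numel(uniqueCodons)
    fprintf('%s-%d\n',uniqueCodons{k},counts(k));
end
```

In this short stretch CCC appears twice and every other codon appears once.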
Look up the amino acids for codons ATA, CTA, ACC, and ATC.
aminolookup('code',nt2aa('ATA'))
aminolookup('code',nt2aa('CTA'))
aminolookup('code',nt2aa('ACC'))
aminolookup('code',nt2aa('ATC'))
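MATLAB displays the following:
Ile isoleucine
Leu leucine
Thr threonine
Ile isoleucine
These lookups use the standard genetic code, which nt2aa applies by default. Under the Vertebrate Mitochondrial code, ATA encodes methionine rather than isoleucine.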
© 1994-2005 The MathWorks, Inc.