KEY TERMS:
- Synteny describes a relationship between chromosomal regions of different species where homologous genes occur in the same order.
- Algorithms for identifying genes are not perfect and many corrections must be made to the initial data set.
- Pseudogenes must be distinguished from active genes.
- Syntenic relationships are extensive between mouse and human genomes, and most active genes are in a syntenic region.
Once we have assembled the sequence of a genome, we still
have to identify the genes within it. Coding sequences represent a very small
fraction of the genome. Exons can be identified as uninterrupted open reading frames flanked
by appropriate sequences. What criteria need to be satisfied to identify an
active gene from a series of exons?
Figure 3.18 shows that an active
gene should consist of a series of exons where the first exon immediately
follows a promoter, the internal exons are flanked by appropriate splicing
junctions, the last exon is followed by 3' processing signals, and a single open reading frame
starting with an initiation codon and ending with a termination codon can be
deduced by joining the exons together. Internal exons can be identified as open
reading frames flanked by splicing junctions. In the simplest cases, the first
and last exons contain the start and end of the coding region, respectively (as
well as the 5' and 3' untranslated regions), but in more complex cases the first
or last exons may have only untranslated regions, and may therefore be more
difficult to identify.
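The last of these criteria can be checked mechanically. The following is a
minimal Python sketch, not the method of any particular gene-finding program.
It assumes the exons have already been trimmed to their coding portions (no
untranslated regions) and tests only whether joining them gives a single open
reading frame. The function name joined_exons_form_orf and the toy sequences
are hypothetical; promoters, splicing junctions, and 3' processing signals are
not examined.

```python
# Termination codons of the standard genetic code (DNA alphabet).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def joined_exons_form_orf(exons):
    """Return True if the concatenated exons read as one open reading frame.

    Assumption: each exon string holds only coding sequence, in order.
    """
    cds = "".join(exons).upper()
    # A reading frame needs a whole number of codons, at least start + stop.
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    starts_with_initiation = codons[0] == "ATG"
    ends_with_termination = codons[-1] in STOP_CODONS
    no_internal_stop = all(c not in STOP_CODONS for c in codons[1:-1])
    return starts_with_initiation and ends_with_termination and no_internal_stop


# Invented toy exons; the exon boundaries fall inside codons.
intact = ["ATGGC", "CAAAG", "GTTAA"]   # joins to ATG GCC AAA GGT TAA
disrupted = ["ATGTA", "ACCCTAA"]       # joins to ATG TAA CCC TAA
print(joined_exons_form_orf(intact))
print(joined_exons_form_orf(disrupted))
```

The second example fails because joining the exons creates a termination codon
in the middle of the frame, which is the kind of defect that would rule out a
proposed combination of exons.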
The algorithms that are used to connect exons are not
completely effective when the genome is very large and the exons may be
separated by very large distances. For example, the initial analysis of the
human genome mapped 170,000 exons into 32,000 genes. This is unlikely to be
correct, because it gives an average of 5.3 exons per gene, whereas the average
of individual genes that have been fully characterized is 10.2. Either we have
missed many exons, or they should be connected differently into a smaller number
of genes in the whole genome sequence.
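The arithmetic behind this argument can be made explicit. The short Python
sketch below uses only the figures quoted above and works out the two extreme
ways of reconciling them. It assumes, for illustration only, that the average
of 10.2 exons seen in fully characterized genes applies to the whole genome;
the variable names are hypothetical.

```python
def exons_per_gene(total_exons, total_genes):
    """Average number of exons assigned to each gene."""
    return total_exons / total_genes


TOTAL_EXONS = 170_000          # exons mapped in the initial analysis
PREDICTED_GENES = 32_000       # genes they were assembled into
CHARACTERIZED_AVERAGE = 10.2   # exons per fully characterized gene

# Average implied by the initial analysis.
predicted_average = exons_per_gene(TOTAL_EXONS, PREDICTED_GENES)

# Extreme 1: every exon was found, but they belong to fewer genes.
genes_if_exons_complete = TOTAL_EXONS / CHARACTERIZED_AVERAGE

# Extreme 2: the gene count is right, but many exons were missed.
exons_if_genes_correct = PREDICTED_GENES * CHARACTERIZED_AVERAGE

print(round(predicted_average, 1))
print(round(genes_if_exons_complete))
print(round(exons_if_genes_correct))
```

Under that assumption, the same 170,000 exons would make up only about 16,700
genes, or else 32,000 genes would require about 326,000 exons. The real
situation presumably lies somewhere between these extremes.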
Even when the organization of a gene is correctly
identified, there is the problem of distinguishing active genes from
pseudogenes. Many pseudogenes can be recognized by obvious defects in the form
of multiple mutations that create an inactive coding sequence. However,
pseudogenes that have arisen more recently, and which have not accumulated so
many mutations, may be more difficult to recognize. In an extreme example, the
mouse has only one active Gapdh gene (coding for glyceraldehyde-3-phosphate
dehydrogenase), but has ~400 pseudogenes. More than 100 of these pseudogenes nonetheless initially appeared to be
active in the mouse genome sequence. Individual examination was necessary to
exclude them from the list of active genes.
Confidence that a gene is active can be increased by
comparing regions of the genomes of different species. There has been extensive
overall reorganization of sequences between the mouse and human genomes, as seen
in the simple fact that there are 23 chromosomes in the human haploid genome and
20 chromosomes in the mouse haploid genome. However, at the local level, the
order of genes is generally the same: when pairs of human and mouse homologues
are compared, the genes located on either side also tend to be homologues. This
relationship is called synteny.
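This local test can be sketched in a few lines of Python. The example is a
simplified illustration, not the procedure used in the published genome
comparisons: it treats a gene as syntenic if at least one of its immediate
neighbours in the mouse has a homologue that is an immediate neighbour of its
own homologue in the human. The function names, gene labels (m1, h1, and so
on), and the homologue table are all invented.

```python
def neighbours(order, gene):
    """Return the genes immediately to either side of `gene` in `order`."""
    i = order.index(gene)
    return {order[j] for j in (i - 1, i + 1) if 0 <= j < len(order)}


def is_syntenic(mouse_gene, mouse_order, human_order, homologue):
    """True if a flanking mouse gene maps to a flanking human gene.

    `homologue` maps each mouse gene to its human homologue (if any).
    """
    human_gene = homologue.get(mouse_gene)
    if human_gene is None or human_gene not in human_order:
        return False
    # Human homologues of the genes flanking the mouse gene.
    mapped = {homologue.get(g) for g in neighbours(mouse_order, mouse_gene)}
    return bool(mapped & neighbours(human_order, human_gene))


# Hypothetical gene orders along one mouse and one human chromosomal region.
mouse_order = ["m1", "m2", "m3", "m4"]
human_order = ["h1", "h2", "h3", "h9"]
homologue = {"m1": "h1", "m2": "h2", "m3": "h3", "m4": "h7"}

print(is_syntenic("m2", mouse_order, human_order, homologue))
print(is_syntenic("m4", mouse_order, human_order, homologue))
```

Because the neighbours are compared as sets, the test is indifferent to the
orientation of the region, so a block that has been inverted in one species
still counts as syntenic. Here m2 passes, whereas m4 does not, because its
homologue lies outside the region being compared.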
Figure 3.19 shows the relationship
between mouse chromosome 1 and the human chromosomal set (Waterston et al., 2002). We can recognize 21 segments
in this mouse chromosome that have syntenic counterparts in human chromosomes.
The extent of reshuffling that has occurred between the genomes is shown by the
fact that the segments are spread among 6 different human chromosomes. The same
types of relationships are found in all mouse chromosomes, except for the X
chromosome, which is syntenic only with the human X chromosome. This is
explained by the fact that the X is a special case, subject to dosage
compensation to adjust for the difference between males (one copy) and females
(two copies) (see 23.17 X chromosomes undergo global changes). This may apply
selective pressure
against the translocation of genes to and from the X chromosome.
Comparison of the mouse and human genome sequences shows
that >90% of each genome lies in syntenic
blocks that range widely in size (from 300 kb to 65 Mb). There is a total of 342
syntenic segments, with an average length of 7 Mb (0.3% of the genome) (Waterston et al., 2002). 99% of mouse genes have a
homologue in the human genome, and for 96% that homologue is in a syntenic
region.
Comparing the genomes provides interesting information about
the evolution of species. The number of gene families in the mouse and human
genomes is the same, and a major difference between the species is the
differential expansion of particular families in one of the genomes. This is
especially noticeable in genes that affect phenotypic features that are unique
to the species. Of 25 families where the size has been expanded in mouse, 14
contain genes specifically involved in rodent reproduction, and 5 contain genes
specific to the immune system.
A validation of the importance of syntenic blocks comes from
pairwise comparisons of the genes within them. When likely pseudogenes are
sought on the basis of sequence comparisons, a gene that is not in a syntenic
location (that is, its context is different in the two species) is twice as
likely to be a pseudogene as a gene that is in a syntenic location. Put another way, translocation away from the original locus tends
to be associated with the creation of pseudogenes. The lack of a related gene in
a syntenic position is therefore grounds for suspecting that an apparent gene
may really be a pseudogene. Overall, >10% of
the genes that are initially identified by analysis of the genome are likely to
turn out to be pseudogenes.
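A statement such as "twice as likely" is a ratio of two rates, and it may help
to see how such a ratio would be computed. The Python sketch below is purely
illustrative: the counts are invented and were chosen to give a ratio of 2;
they are not data from the mouse and human comparison. The function names are
hypothetical.

```python
def pseudogene_rate(pseudogenes, total):
    """Fraction of the genes in a class that are judged to be pseudogenes."""
    if total <= 0:
        raise ValueError("total must be positive")
    return pseudogenes / total


def relative_likelihood(non_syntenic, syntenic):
    """Ratio of pseudogene rates; each argument is (pseudogenes, total)."""
    return pseudogene_rate(*non_syntenic) / pseudogene_rate(*syntenic)


# Invented counts, for illustration only.
non_syntenic = (40, 200)    # 40 likely pseudogenes among 200 genes (20%)
syntenic = (100, 1000)      # 100 likely pseudogenes among 1000 genes (10%)

print(relative_likelihood(non_syntenic, syntenic))
```

Note that a ratio of 2 does not mean that most genes outside syntenic blocks
are pseudogenes. It means only that the absence of a syntenic counterpart
raises the level of suspicion, so such genes are sensible candidates for
individual examination.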
As a general rule, comparisons between genomes add
significantly to the effectiveness of gene prediction. When sequence features
indicating active genes are conserved, for example, between human and mouse, there
is an increased probability that they identify active homologues.
Identifying genes coding for RNA is more difficult, because
we cannot use the criterion of the open reading frame. Here too,
comparative genome analysis increases the rigor of the analysis. For example,
analysis of either the human or mouse genome alone identifies ~500 genes coding
for tRNA in each case, but comparison of features suggests that <350 of these genes are in fact active in each
genome.