KEY CONCEPTS:
- There are 6000 genes in yeast, 18,500 in worm, 13,600 in fly, 25,000 in the small plant Arabidopsis, and probably 30,000 in mouse and <40,000 in Man.
As soon as we look at eukaryotic genomes, the relationship
between genome size and gene number is lost. The genomes of unicellular
eukaryotes fall in the same size range as the largest bacterial genomes. Higher
eukaryotes have more genes, but the number does not correlate with genome size,
as can be seen from Figure 3.12.
The most extensive data for lower eukaryotes are available
from the sequences of the genomes of the yeasts S. cerevisiae and
S. pombe. Figure 3.13 summarizes the most
important features. The yeast genomes of 13.5 Mb and 12.5 Mb have ~6000 and
~5000 genes, respectively. The average open reading frame is ~1.4 kb, so that
~70% of the genome is occupied by coding regions. The major difference between
them is that only 5% of S. cerevisiae genes have introns, compared to
43% in S. pombe. The density of genes is high; organization is
generally similar, although the spaces between genes are a bit shorter in S.
cerevisiae. About half of the genes identified by sequence were either
known previously or related to known genes. The remainder are new, which gives
some indication of the number of new types of genes that may be discovered (Oliver et al., 1992, Dujon et al., 1994, Johnston et al., 1994, Wood et al., 2002).
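As a rough check on the gene densities described above, the genome sizes and gene counts quoted in the text can be turned into an average genome span per gene. This is only a back-of-the-envelope sketch: the helper name `kb_per_gene` is illustrative, and the average lumps together coding sequence, introns, and intergenic spacers.

```python
# Genome sizes (Mb) and approximate gene counts quoted in the text.
yeasts = {
    "S. cerevisiae": {"genome_mb": 13.5, "genes": 6000},
    "S. pombe": {"genome_mb": 12.5, "genes": 5000},
}


def kb_per_gene(genome_mb, genes):
    """Average genome span per gene in kb (gene plus flanking spacer)."""
    return genome_mb * 1000 / genes


# Map each species to its average span per gene.
spacing = {name: kb_per_gene(**values) for name, values in yeasts.items()}

for name, kb in spacing.items():
    print(f"{name}: {kb:.2f} kb per gene")
```

With these inputs the average span is about 2.25 kb per gene in S. cerevisiae and 2.5 kb in S. pombe, which is in line with the slightly shorter spacing noted for S. cerevisiae.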
The identification of long reading frames on the basis of
sequence is quite accurate. However, ORFs coding for <100 amino acids cannot be identified solely by
sequence because of the high occurrence of false positives. Analysis of gene
expression suggests that ~300 of 600 such ORFs in S. cerevisiae are
likely to be genuine genes.
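To make the length threshold concrete, here is a minimal sketch of a forward-strand ORF scan. It assumes an ATG start codon and the three standard stop codons; the function name `find_orfs` and the `min_codons` parameter are illustrative and not taken from any particular tool. Real gene finders also scan the reverse strand and rely on statistical models, so this shows only the basic idea.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=100):
    """Return (start, end, n_codons) for ATG..stop ORFs on the forward strand.

    start is 0-based; end is exclusive and includes the stop codon.
    n_codons counts coding codons (ATG included, stop excluded).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None
    return orfs


# Synthetic sequences for illustration only (not real genes).
long_orf = "ATG" + "GCT" * 120 + "TAA"
short_orf = "ATG" + "GCT" * 50 + "TAA"
print(find_orfs(long_orf))   # one ORF of 121 codons
print(find_orfs(short_orf))  # nothing passes the 100-codon threshold
```

Lowering `min_codons` returns more candidates, and it is among these short candidates that sequence alone cannot separate genuine genes from chance ORFs.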
A powerful way to validate gene structure is to compare
sequences in closely related species: if a gene
is active, it is likely to be conserved. Comparisons between the sequences of
four closely related yeast species suggest that 503 of the genes originally
identified in S. cerevisiae do not have counterparts in the other
species, and therefore should be deleted from the catalog. This reduces the
total gene number for S. cerevisiae to 5726 (Kellis et al., 2003).
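The pruning step can be pictured as a simple set operation. The sketch below is a toy illustration, not the method used by Kellis et al.: the function name `prune_catalog`, the species labels, and the ORF identifiers are all hypothetical, and it assumes a candidate is kept if it has a counterpart in at least one related species.

```python
def prune_catalog(candidates, counterparts):
    """Split candidates into (kept, dropped).

    counterparts maps a species name to the set of candidate IDs that
    have a counterpart in that species. A candidate is kept if it has a
    counterpart in at least one species (an assumption of this sketch).
    """
    conserved = set().union(*counterparts.values())
    kept = [gene for gene in candidates if gene in conserved]
    dropped = [gene for gene in candidates if gene not in conserved]
    return kept, dropped


# Hypothetical identifiers, for illustration only.
candidates = ["orf1", "orf2", "orf3", "orf4"]
counterparts = {
    "species_A": {"orf1", "orf2"},
    "species_B": {"orf2", "orf4"},
}

kept, dropped = prune_catalog(candidates, counterparts)
print(kept)     # candidates with a counterpart somewhere
print(dropped)  # candidates that would be removed from the catalog
```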
The genome of C. elegans varies between regions
rich in genes and regions in which genes are more sparsely organized. The total
sequence contains ~18,500 genes. Only ~42% of the genes have putative
counterparts outside the Nematoda (Wilson et al., 1994, C. elegans sequencing consortium, 1998).
Although the fly genome is larger than the worm genome,
there are fewer genes (13,600) in D. melanogaster (Adams et al., 2000). The number of different
transcripts is slightly larger (14,100) as the result of alternative splicing.
We do not understand why the fly, a much more complex organism, has only ~70% of
the number of genes in the worm. This emphasizes forcefully the lack of an exact
relationship between gene number and complexity of the organism.
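The ratios in this comparison follow directly from the counts quoted in the text. A short check in Python (the variable names are illustrative):

```python
# Counts quoted in the text.
fly_genes = 13_600
worm_genes = 18_500
fly_transcripts = 14_100

# Fly gene number as a fraction of the worm gene number.
fly_to_worm = fly_genes / worm_genes

# Average number of distinct transcripts per fly gene.
transcripts_per_gene = fly_transcripts / fly_genes

print(f"fly/worm gene ratio: {fly_to_worm:.2f}")
print(f"transcripts per fly gene: {transcripts_per_gene:.3f}")
```

The first ratio comes out a little above 0.7, consistent with the rounded figure of 70%, and the second shows that alternative splicing adds only a few percent to the transcript count in this data set.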
The plant Arabidopsis thaliana has a genome size
intermediate between the worm and the fly, but has a larger gene number (25,000)
than either (The Arabidopsis Genome Initiative, 2000). This again
shows the lack of a clear relationship, and also emphasizes the special quality
of plants, which may have more genes (due to ancestral duplications) than animal
cells. A majority of the Arabidopsis genome is found in duplicated
segments, suggesting that there was an ancient doubling of the genome (to give a
tetraploid). Only 35% of Arabidopsis genes are present as single
copies.
The genome of rice (Oryza sativa) is ~4× larger than
Arabidopsis, but the number of genes is only ~50% larger, probably
~40,000 (Duffy and Grof, 2001, Goff et al., 2002). Repetitive DNA occupies 42-45% of
the genome. More than 80% of the genes found in Arabidopsis are
represented in rice. Of these common genes, ~8000 are found in
Arabidopsis and rice but not in any of the bacterial or animal genomes
that have been sequenced. These are probably the set of genes that code for
plant-specific functions, such as photosynthesis.
From the fly genome, we can form an impression of how many
genes are devoted to each type of function. Figure 3.14
breaks down the functions into different categories. Among the genes that are
identified, we find 2500 enzymes, ~750 transcription factors, ~700 transporters
and ion channels, and ~700 proteins involved with signal transduction. But just
over half the genes code for products of unknown function. About 20% of the proteins
reside in membranes.
Protein size increases from prokaryotes and archaea to
eukaryotes. The archaeon M. jannaschii and the bacterium E. coli
have average protein lengths of 287 and 317 amino acids, respectively, whereas
S. cerevisiae and C. elegans have average lengths of 484 and
442 amino acids, respectively. Large proteins (>500 amino acids) are rare in
bacteria, but comprise a significant component (~1/3) in eukaryotes. The
increase in length is due to the addition of extra domains, with each domain
typically constituting 100-300 amino acids. But the increase in protein size is
responsible for only a very small part of the increase in genome size.
Another insight into gene number is obtained by counting the
number of expressed genes. If we rely upon the estimates of the number of
different mRNA species that can be counted in a cell, we would conclude that the
average vertebrate cell expresses ~10,000-20,000 genes. The existence of
significant overlaps between the messenger populations in different cell types
would suggest that the total expressed gene number for the organism should be
within a few fold of this. The estimate for the total human gene number of
30,000-40,000 (see 3.11 The human
genome has fewer genes than expected) would imply that a significant
proportion of the total gene number is actually expressed in any given
cell.
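The size of that proportion can be bracketed from the two ranges given above. This is a rough sketch using only the numbers in the text; the variable names are illustrative.

```python
# Ranges quoted in the text.
expressed_per_cell = (10_000, 20_000)  # genes expressed in an average vertebrate cell
total_genes = (30_000, 40_000)         # estimated total human gene number

# Smallest fraction: fewest expressed genes over the largest total.
low = expressed_per_cell[0] / total_genes[1]
# Largest fraction: most expressed genes over the smallest total.
high = expressed_per_cell[1] / total_genes[0]

print(f"fraction of genes expressed per cell: {low:.2f} to {high:.2f}")
```

On these estimates, somewhere between a quarter and two thirds of all genes would be expressed in a given cell.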
Eukaryotic genes are transcribed individually, each gene
producing a monocistronic messenger. There is only one general exception to this
rule; in the genome of C. elegans, ~15% of the genes are organized into
polycistronic units (which is associated with the use of trans-splicing
to allow expression of the downstream genes in these units; see 24.13 trans-splicing reactions
use small RNAs).