- There are 6000 genes in yeast, 18,500 in worm, 13,600 in fly, 25,000 in the small plant Arabidopsis, and probably 30,000 in mouse and <40,000 in Man.
As soon as we look at eukaryotic genomes, the relationship between genome size and gene number is lost. The genomes of unicellular eukaryotes fall in the same size range as the largest bacterial genomes. Higher eukaryotes have more genes, but the number does not correlate with genome size, as can be seen from Figure 3.12.
The most extensive data for lower eukaryotes are available from the sequences of the genomes of the yeasts S. cerevisiae and S. pombe. Figure 3.13 summarizes the most important features. The yeast genomes of 13.5 Mb and 12.5 Mb have ~6000 and ~5000 genes, respectively. The average open reading frame is ~1.4 kb, so that ~70% of the genome is occupied by coding regions. The major difference between them is that only 5% of S. cerevisiae genes have introns, compared to 43% in S. pombe. The density of genes is high; organization is generally similar, although the spaces between genes are a bit shorter in S. cerevisiae. About half of the genes identified by sequence were either known previously or related to known genes. The remainder are new, which gives some indication of the number of new types of genes that may be discovered (Oliver et al., 1992, Dujon et al., 1994, Johnston et al., 1994, Wood et al., 2002).
The identification of long reading frames on the basis of sequence is quite accurate. However, ORFs coding for <100 amino acids cannot be identified solely by sequence because of the high occurrence of false positives. Analysis of gene expression suggests that ~300 of 600 such ORFs in S. cerevisiae are likely to be genuine genes.
A powerful way to validate gene structure is to compare sequences in closely related species: if a gene is active, it is likely to be conserved. Comparisons between the sequences of four closely related yeast species suggest that 503 of the genes originally identified in S. cerevisiae do not have counterparts in the other species, and therefore should be deleted from the catalog. This reduces the total gene number for S. cerevisiae to 5726 (Kellis et al., 2003).
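The arithmetic behind this revision can be sketched in a few lines. This is a minimal illustration, not part of the cited analysis: the function names are hypothetical, and the "implied original count" rests on the assumption that the 503 unconserved ORFs were simply subtracted from a single earlier catalog.

```python
# Figures quoted in the text (Kellis et al., 2003).
REVISED_GENE_COUNT = 5726          # S. cerevisiae genes after the comparison
ORFS_WITHOUT_COUNTERPARTS = 503    # ORFs with no counterpart in related yeasts


def implied_original_count(revised: int, removed: int) -> int:
    """Catalog size before removal, assuming a simple subtraction (assumption)."""
    return revised + removed


def fraction_removed(revised: int, removed: int) -> float:
    """Share of the earlier catalog that was removed as unconserved."""
    return removed / implied_original_count(revised, removed)


if __name__ == "__main__":
    original = implied_original_count(REVISED_GENE_COUNT, ORFS_WITHOUT_COUNTERPARTS)
    share = fraction_removed(REVISED_GENE_COUNT, ORFS_WITHOUT_COUNTERPARTS)
    print(f"Implied earlier catalog: {original} genes")
    print(f"Fraction removed: {share:.1%}")
```

Under that assumption, roughly one in twelve of the originally annotated ORFs would have been removed by the comparative test.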
The genome of C. elegans varies between regions rich in genes and regions in which genes are more sparsely organized. The total sequence contains ~18,500 genes. Only ~42% of the genes have putative counterparts outside the Nematoda (Wilson et al., 1994, C. elegans sequencing consortium, 1998).
Although the fly genome is larger than the worm genome, there are fewer genes (13,600) in D. melanogaster (Adams et al., 2000). The number of different transcripts is slightly larger (14,100) as a result of alternative splicing. We do not understand why the fly, a much more complex organism, has only ~70% of the number of genes in the worm. This emphasizes forcefully the lack of an exact relationship between gene number and complexity of the organism.
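The two ratios in this comparison follow directly from the counts quoted in the text. The sketch below is illustrative only; the constant and variable names are hypothetical, and the numbers are the approximate figures given above rather than current annotation counts.

```python
# Approximate counts quoted in the text.
WORM_GENES = 18_500        # C. elegans
FLY_GENES = 13_600         # D. melanogaster
FLY_TRANSCRIPTS = 14_100   # distinct fly transcripts, including splice variants

# Fly gene number as a fraction of worm gene number (the "~70%" in the text).
fly_to_worm = FLY_GENES / WORM_GENES

# Average number of distinct transcripts per fly gene.
transcripts_per_gene = FLY_TRANSCRIPTS / FLY_GENES

if __name__ == "__main__":
    print(f"Fly/worm gene ratio: {fly_to_worm:.2f}")
    print(f"Transcripts per fly gene: {transcripts_per_gene:.3f}")
```

On these figures, alternative splicing adds only a few percent to the number of fly transcripts, so it does not close the gap in gene number between fly and worm.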
The plant Arabidopsis thaliana has a genome size intermediate between the worm and the fly, but has a larger gene number (25,000) than either (The Arabidopsis Genome Initiative, 2000). This again shows the lack of a clear relationship, and also emphasizes the special quality of plants, which may have more genes (due to ancestral duplications) than animals. A majority of the Arabidopsis genome is found in duplicated segments, suggesting that there was an ancient doubling of the genome (to give a tetraploid). Only 35% of Arabidopsis genes are present as single copies.
The genome of rice (Oryza sativa) is ~4× larger than that of Arabidopsis, but the number of genes is only ~50% larger, probably ~40,000 (Duffy and Grof, 2001, Goff et al., 2002). Repetitive DNA occupies 42-45% of the genome. More than 80% of the genes found in Arabidopsis are represented in rice. Of these common genes, ~8000 are found in Arabidopsis and rice but not in any of the bacterial or animal genomes that have been sequenced. These are probably the set of genes that code for plant-specific functions, such as photosynthesis.
From the fly genome, we can form an impression of how many genes are devoted to each type of function. Figure 3.14 breaks down the functions into different categories. Among the genes that are identified, we find 2500 enzymes, ~750 transcription factors, ~700 transporters and ion channels, and ~700 proteins involved with signal transduction. But just over half of the genes code for products of unknown function. About 20% of the proteins reside in membranes.
Protein size increases from prokaryotes and archaea to eukaryotes. The archaeon M. jannaschii and the bacterium E. coli have average protein lengths of 287 and 317 amino acids, respectively, whereas S. cerevisiae and C. elegans have average lengths of 484 and 442 amino acids, respectively. Large proteins (>500 amino acids) are rare in bacteria, but comprise a significant component (~1/3) in eukaryotes. The increase in length is due to the addition of extra domains, with each domain typically constituting 100-300 amino acids. But the increase in protein size is responsible for only a very small part of the increase in genome size.
Another insight into gene number is obtained by counting the number of expressed genes. If we rely upon the estimates of the number of different mRNA species that can be counted in a cell, we would conclude that the average vertebrate cell expresses ~10,000-20,000 genes. The existence of significant overlaps between the messenger populations in different cell types would suggest that the total expressed gene number for the organism should be within a few fold of this. The estimate for the total human gene number of 30,000-40,000 (see 3.11 The human genome has fewer genes than expected) would imply that a significant proportion of the total gene number is actually expressed in any given cell.
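To see what "a significant proportion" means numerically, the two ranges quoted above can be combined into bounds on the fraction of genes expressed per cell. This is a rough sketch under the assumption that both ranges are independent estimates; the function name is hypothetical.

```python
# Ranges quoted in the text.
EXPRESSED_PER_CELL = (10_000, 20_000)   # genes expressed in an average vertebrate cell
TOTAL_HUMAN_GENES = (30_000, 40_000)    # estimated total human gene number


def expressed_fraction_bounds(expressed, total):
    """Return (lowest, highest) fraction of the gene set expressed in one cell.

    The lowest bound pairs the fewest expressed genes with the largest total;
    the highest bound pairs the most expressed genes with the smallest total.
    """
    lowest = expressed[0] / total[1]
    highest = expressed[1] / total[0]
    return lowest, highest


if __name__ == "__main__":
    low, high = expressed_fraction_bounds(EXPRESSED_PER_CELL, TOTAL_HUMAN_GENES)
    print(f"Fraction of genes expressed per cell: {low:.0%} to {high:.0%}")
```

On these estimates, a single cell would express somewhere between a quarter and two thirds of the total gene set.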
Eukaryotic genes are transcribed individually, each gene producing a monocistronic messenger. There is only one general exception to this rule; in the genome of C. elegans, ~15% of the genes are organized into polycistronic units (which is associated with the use of trans-splicing to allow expression of the downstream genes in these units; see 24.13 trans-splicing reactions use small RNAs).