- The proteome is the complete set of proteins that is expressed by the entire genome. Because some genes code for multiple proteins, the size of the proteome is greater than the number of genes. Sometimes the term is used to describe the complement of proteins expressed by a cell at any one time.
- Orthologs are corresponding proteins in two species as defined by sequence homologies.
Only some genes are unique; others belong to families where the other members are related (but not usually identical). The proportion of unique genes declines with genome size, and the proportion of genes in families increases. The minimum number of gene families required to code a bacterium is >1000, a yeast is >4000, and a higher eukaryote 11,000-14,000.
Because some genes are present in more than one copy or are related to one another, the number of different types of genes is less than the total number of genes. We can divide the total number of genes into sets that have related members, as defined by comparing their exons. (A family of related genes arises by duplication of an ancestral gene followed by accumulation of changes in sequence between the copies. Most often the members of a family are related but not identical.) The number of types of genes is calculated by adding the number of unique genes (where there is no other related gene at all) to the numbers of families that have 2 or more members.
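The counting rule described above can be sketched in a few lines of Python. This is only an illustration: it assumes that each gene has already been assigned to a family by some prior comparison of exons, and the function name `count_gene_types`, the gene IDs, and the family labels are all hypothetical.

```python
from collections import Counter


def count_gene_types(family_of):
    """Count gene types from a mapping of gene ID -> family label.

    Assumption: a gene with no related gene carries a family label
    that no other gene shares, so it forms a family of size 1.
    """
    # Number of member genes in each family.
    sizes = Counter(family_of.values())
    # Unique genes: families that contain exactly one gene.
    unique = sum(1 for n in sizes.values() if n == 1)
    # Families in the usual sense: 2 or more related members.
    families = sum(1 for n in sizes.values() if n >= 2)
    return {
        "total_genes": len(family_of),
        "unique_genes": unique,
        "multi_member_families": families,
        "gene_types": unique + families,
    }


# Toy data (hypothetical): six genes, one unique, two families.
example = {"g1": "A", "g2": "A", "g3": "B", "g4": "C", "g5": "C", "g6": "C"}
result = count_gene_types(example)
print(result)
```

In this toy case the six genes fall into three types (one unique gene plus two families), which shows how the number of gene types can be well below the total gene number.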
Figure 3.15 compares the total number of genes with the number of distinct families in each of six genomes (Rubin et al., 2000; The Arabidopsis Genome Initiative, 2000; Venter et al., 2001). In bacteria, most genes are unique, so the number of distinct families is close to the total gene number. The situation is different even in the lower eukaryote S. cerevisiae, where there is a significant proportion of repeated genes. The most striking effect is that the number of genes increases quite sharply in the higher eukaryotes, but the number of gene families does not change much.
Figure 3.16 shows that the proportion of unique genes drops sharply with genome size. When genes are present in families, the number of members in a family is small in bacteria and lower eukaryotes, but is large in higher eukaryotes. Much of the extra genome size of Arabidopsis is accounted for by families with >4 members (The Arabidopsis Genome Initiative, 2000).
If every gene is expressed, the total number of genes will account for the total number of proteins required to make the organism (the proteome). However, two effects mean that the proteome is different from the total gene number. Because genes are duplicated, some of them code for the same protein (although it may be expressed at a different time or place) and others may code for related proteins that again play the same role at different times or places. And because some genes can produce more than one protein by means of alternative splicing, the proteome can be larger than the number of genes.
What is the core proteome, that is, the basic number of different types of proteins in the organism? A minimum estimate is given by the number of gene families, ranging from 1400 in the bacterium, to >4000 in the yeast, to 11,000-14,000 for the fly and worm.
What is the distribution of the proteome among types of proteins? The 6000 proteins of the yeast proteome include 5000 soluble proteins and 1000 transmembrane proteins. About half of the proteins are cytoplasmic, a quarter are in the nucleolus, and the remainder are split between the mitochondrion and the ER/Golgi system (Agarwal et al., 2002).
How many genes are common to all organisms (or to groups such as bacteria or higher eukaryotes) and how many are specific for the individual type of organism? Figure 3.17 summarizes the comparison between yeast, worm, and fly (Rubin et al., 2000). Genes that code for corresponding proteins in different organisms are called orthologs. Operationally, we usually reckon that two genes in different organisms can be considered to provide corresponding functions if their sequences are similar over >80% of the length. By this criterion, ~20% of the fly genes have orthologs in both yeast and the worm. These genes are probably required by all eukaryotes. The proportion increases to 30% when fly and worm are compared, probably representing the addition of gene functions that are common to multicellular eukaryotes. This still leaves a major proportion of genes as coding for proteins that are required specifically by either flies or worms, respectively.
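The operational criterion above can be written down as a small Python sketch. It is an illustration under stated assumptions, not the procedure used in the cited comparison: the helper names `covers_enough` and `fraction_with_orthologs` are hypothetical, the alignment lengths are assumed to come from some earlier sequence comparison, and measuring coverage against the longer of the two proteins is a choice made here for the example.

```python
def covers_enough(aligned_len, len_a, len_b, min_fraction=0.8):
    """Return True if the similar region spans more than min_fraction
    of the length. Assumption: coverage is judged against the longer
    of the two proteins."""
    longer = max(len_a, len_b)
    return aligned_len / longer > min_fraction


def fraction_with_orthologs(hits, species_required):
    """hits maps each gene to the set of species in which a match
    passing the coverage test was found. Returns the fraction of
    genes with a match in every species in species_required."""
    if not hits:
        return 0.0
    shared = [g for g, found in hits.items() if species_required <= found]
    return len(shared) / len(hits)


# Toy data (hypothetical): five fly genes and where matches were found.
fly_hits = {
    "f1": {"yeast", "worm"},
    "f2": {"worm"},
    "f3": set(),
    "f4": {"worm"},
    "f5": set(),
}
in_both = fraction_with_orthologs(fly_hits, {"yeast", "worm"})
in_worm = fraction_with_orthologs(fly_hits, {"worm"})
print(in_both, in_worm)
```

The toy numbers are arbitrary; the point is only that the reported proportions depend directly on where the coverage threshold is set and on which length it is measured against.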
The proteome can be deduced from the number and structures of genes, and can also be directly measured by analyzing the total protein content of a cell or organism. By such approaches, some proteins have been identified that were not suspected on the basis of genome analysis, and that have therefore led to the identification of new genes. Several methods are used for large-scale analysis of proteins. Mass spectrometry can be used for separating and identifying proteins in a mixture obtained directly from cells or tissues (for review see Aebersold and Mann, 2003). Hybrid proteins bearing tags can be obtained by expression of cDNAs made by linking the sequences of open reading frames to appropriate expression vectors that incorporate the sequences for affinity tags. This allows array analysis to be used to analyze the products (for review see Phizicky et al., 2003). These methods can also be effective in comparing the proteins of two tissues, for example, a tissue from a normal individual and one from a patient with disease, to pinpoint the differences (for review see Hanash, 2003).
Once we know the total number of proteins, we can ask how they interact. By definition, proteins in structural multiprotein assemblies must form stable interactions with one another. Proteins in signaling pathways interact with one another transiently. In both cases, such interactions can be detected in test systems in which a readout essentially magnifies the effect of the interaction. One popular system of this kind is the two-hybrid assay discussed in Independent domains bind DNA and activate transcription. Such assays cannot detect all interactions: for example, if one enzyme in a metabolic pathway releases a soluble metabolite that then interacts with the next enzyme, the proteins may not interact directly.
As a practical matter, assays of pairwise interactions can give us an indication of the minimum number of independent structures or pathways. An analysis of the ability of all 6000 (predicted) yeast proteins to interact in pairwise combinations shows that ~1000 proteins can bind to at least one other protein (Uetz et al., 2000). Direct analyses of complex formation have identified 1440 different proteins in 232 multiprotein complexes (Gavin et al., 2002; Ho et al., 2002). This is the beginning of an analysis that will lead to definition of the number of functional assemblies or pathways (for review see Sali et al., 2003).
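One simple way to turn a list of pairwise interactions into a rough count of independent groupings is to treat the proteins as nodes of a graph and count its connected components. The Python sketch below does this; it is an assumption made for illustration, not the method of the cited studies, and the function name `interaction_summary` and the protein labels are hypothetical.

```python
from collections import defaultdict


def interaction_summary(pairs):
    """Given (protein_a, protein_b) interaction pairs, return the number
    of proteins with at least one partner and the list of connected
    groups (each group is a set of protein names)."""
    # Build an undirected adjacency map.
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    groups = []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk collecting everything reachable from start.
        stack = [start]
        seen.add(start)
        group = set()
        while stack:
            node = stack.pop()
            group.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(group)
    return len(neighbors), groups


# Toy data (hypothetical): five proteins forming two separate groups.
pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
n_interacting, groups = interaction_summary(pairs)
print(n_interacting, len(groups))
```

A connected group is only a crude proxy for a structure or pathway: a protein shared between two complexes merges them into one group, and interactions the assay misses split real assemblies apart.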
In addition to functional genes, there are also copies of genes that have become nonfunctional (identified as such by interruptions in their protein-coding sequences). These are called pseudogenes (see 4.6 Pseudogenes are dead ends of evolution). The number of pseudogenes can be large. In the mouse and human genomes, the number of pseudogenes is ~10% of the number of (potentially) active genes (see 3.10 The conservation of genome organization helps to identify genes).
Besides needing to know the density of genes to estimate the total gene number, we must also ask: is it important in itself? Are there structural constraints that make it necessary for genes to have a certain spacing, and does this contribute to the large size of eukaryotic genomes?