KEY TERMS:
- The proteome is the complete set of proteins that is expressed by the entire genome. Because some genes code for multiple proteins, the size of the proteome is greater than the number of genes. Sometimes the term is used to describe the complement of proteins expressed by a cell at any one time.
- Orthologs are corresponding proteins in two species as defined by sequence homologies.
Only some genes are unique; others belong to families where
the other members are related (but not usually identical). The proportion of
unique genes declines with genome size, and the proportion of genes in families
increases. The minimum number of gene families required to code a bacterium is
>1000, a yeast is >4000, and a higher eukaryote 11,000-14,000.
Because some genes are present in more than one copy or are
related to one another, the number of different types of genes is less than the
total number of genes. We can divide the total number of genes into sets that
have related members, as defined by comparing their exons. (A family of related
genes arises by duplication of an ancestral gene followed by accumulation of
changes in sequence between the copies. Most often the members of a family are
related but not identical.) The number of types of genes is calculated by adding
the number of unique genes (where there is no other related gene at all) to the
numbers of families that have 2 or more members.
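The counting rule above can be sketched in a few lines of Python. This is only an illustration: the function name `count_gene_types`, the gene names, and the family labels are hypothetical, and the sketch assumes each gene has already been assigned to a family (for example by exon comparison).

```python
from collections import Counter

def count_gene_types(family_of):
    """Count gene types from a mapping of gene name -> family label.

    A family with one member counts as a unique gene; a family with
    two or more members counts once, however many members it has.
    """
    sizes = Counter(family_of.values())
    unique = sum(1 for n in sizes.values() if n == 1)
    families = sum(1 for n in sizes.values() if n >= 2)
    return unique, families, unique + families

# Toy data (hypothetical): six genes falling into three families.
toy = {
    "geneA": "F1", "geneB": "F1",                  # family of 2
    "geneC": "F2",                                 # unique gene
    "geneD": "F3", "geneE": "F3", "geneF": "F3",   # family of 3
}
unique, families, types = count_gene_types(toy)
print(len(toy), "genes,", types, "types")
```

In this toy case the six genes reduce to three types, which shows why the number of types of genes is smaller than the total number of genes whenever families are present.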
Figure 3.15 compares the total number
of genes with the number of distinct families in each of six genomes (Rubin et al., 2000; The Arabidopsis Genome Initiative, 2000; Venter et al., 2001). In bacteria, most genes are
unique, so the number of distinct families is close to the total gene number.
The situation is different even in the lower eukaryote S. cerevisiae, where
there is a significant proportion of repeated genes. The most striking effect is
that the number of genes increases quite sharply in the higher eukaryotes, but
the number of gene families does not change much.
Figure 3.16 shows that the
proportion of unique genes drops sharply with genome size. When genes are
present in families, the number of members in a family is small in bacteria and
lower eukaryotes, but is large in higher eukaryotes. Much of the extra genome
size of Arabidopsis is accounted for by families with >4 members (The Arabidopsis Genome Initiative, 2000).
If every gene is expressed, the total number of genes will
account for the total number of proteins required to make the organism (the proteome). However, two effects mean that the proteome is
different from the total gene number. Because genes are duplicated, some of them
code for the same protein (although it may be expressed at a different time or
place) and others may code for related proteins that again play the same role at
different times or places. And because some genes can produce more than one
protein by means of alternative splicing, the proteome can be larger than the
number of genes.
What is the core proteome—the basic number of the different
types of proteins in the organism? A minimum estimate is given by the number of
gene families: 1400 in the bacterium, >4000 in the yeast, and 11,000-14,000 for
the fly and worm.
What is the distribution of the proteome among types of
proteins? The 6000 proteins of the yeast proteome include 5000 soluble proteins
and 1000 transmembrane proteins. About half of the proteins are cytoplasmic, a
quarter are in the nucleolus, and the remainder are split between the
mitochondrion and the ER/Golgi system (Agarwal et al., 2002).
How many genes are common to all organisms (or to groups
such as bacteria or higher eukaryotes) and how many are specific for the
individual type of organism? Figure 3.17 summarizes the
comparison between yeast, worm, and fly (Rubin et al., 2000). Genes that code for
corresponding proteins in different organisms are called orthologs. Operationally, we usually reckon that two
genes in different organisms can be considered to provide corresponding
functions if their sequences are similar over >80% of the length. By this criterion, ~20% of the
fly genes have orthologs in both yeast and the worm. These genes are probably
required by all eukaryotes. The proportion increases to 30% when fly and worm
are compared, probably representing the addition of gene functions that are
common to multicellular eukaryotes. This still leaves a major proportion of
genes as coding for proteins that are required specifically by either flies or
worms, respectively.
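The operational criterion above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say which protein's length the >80% refers to, so the sketch assumes the longer of the two; the function names and the toy gene names are hypothetical, and the alignment lengths would come from a separate sequence comparison step that is not shown.

```python
def similar_over_length(aligned_len, len_a, len_b, min_fraction=0.80):
    """True if the aligned region covers more than min_fraction of the
    longer protein (an assumption; the text does not specify which)."""
    return aligned_len / max(len_a, len_b) > min_fraction

def fraction_with_orthologs(hits, required_species):
    """Fraction of genes that have a qualifying match in every required species.

    hits maps gene name -> set of species with a qualifying match.
    """
    shared = [g for g, species in hits.items() if required_species <= species]
    return len(shared) / len(hits)

# Toy data (hypothetical): four fly genes and where each has a match.
fly_hits = {
    "fly1": {"yeast", "worm"},
    "fly2": {"worm"},
    "fly3": set(),
    "fly4": set(),
}
print(fraction_with_orthologs(fly_hits, {"yeast", "worm"}))  # shared with both
print(fraction_with_orthologs(fly_hits, {"worm"}))           # shared with worm
```

Requiring a match in fewer species can only raise the fraction, which is the pattern described above when the comparison is narrowed from yeast, worm, and fly to worm and fly alone.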
The proteome can be deduced from the number and structures
of genes, and can also be directly measured by analyzing the total protein
content of a cell or organism. By such approaches, some proteins have been
identified that were not suspected on the basis of genome analysis, and that
have therefore led to the identification of new genes. Several methods are used
for large-scale analysis of proteins. Mass spectrometry can be used for
separating and identifying proteins in a mixture obtained directly from cells or
tissues (for review see Aebersold and Mann, 2003). Hybrid proteins bearing
tags can be obtained by expression of cDNAs made by linking the sequences of
open reading frames to appropriate expression vectors that incorporate the
sequences for affinity tags. This allows array analysis to be used to analyze
the products (for review see Phizicky et al., 2003). These methods also can be
effective in comparing the proteins of two tissues, for example, a tissue from a
normal individual and one from a patient with disease, to pinpoint the
differences (for review see Hanash, 2003).
Once we know the total number of proteins, we can ask how
they interact. By definition, proteins in structural multiprotein assemblies
must form stable interactions with one another. Proteins in signaling pathways
interact with one another transiently. In both cases, such interactions can be
detected in test systems in which a readout system magnifies the effect
of the interaction. One popular system of this kind is the two-hybrid assay (see
Independent domains bind DNA and activate transcription). Such assays cannot
detect all interactions: for example, if one enzyme in a metabolic pathway
releases a soluble metabolite that then interacts with the next enzyme, the
proteins may not interact directly.
As a practical matter, assays of pairwise interactions can
give us an indication of the minimum number of independent structures or
pathways. An analysis of the ability of all 6000 (predicted) yeast proteins to
interact in pairwise combinations shows that ~1000 proteins can bind to at least
one other protein (Uetz et al., 2000). Direct analyses of complex
formation have identified 1440 different proteins in 232 multiprotein complexes
(Gavin et al., 2002; Ho et al., 2002). This is the beginning of an
analysis that will lead to definition of the number of functional assemblies or
pathways (for review see Sali et al., 2003).
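One way to see how pairwise data bound the number of independent structures or pathways is to treat proteins as nodes and interactions as edges, then count the connected groups. The sketch below is an illustration, not the method used in the cited studies; the function name `interaction_clusters` and the protein names are hypothetical.

```python
from collections import defaultdict

def interaction_clusters(pairs):
    """Group proteins into connected clusters from pairwise interactions.

    Each cluster is a set of proteins linked directly or indirectly.
    One cluster may still contain several distinct assemblies, so the
    cluster count is a lower bound, not an exact count.
    """
    neighbors = defaultdict(set)
    for a, b in pairs:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    clusters = []
    for start in neighbors:
        if start in seen:
            continue
        # Depth-first walk over everything reachable from this protein.
        stack = [start]
        seen.add(start)
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in neighbors[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        clusters.append(component)
    return clusters

# Toy data (hypothetical): five proteins, three detected interactions.
pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
clusters = interaction_clusters(pairs)
print(len(clusters), "clusters")
```

Proteins with no detected partner never appear in `pairs`, so they are absent from the result. The same is true of any interaction the assay cannot detect.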
In addition to functional genes, there are also copies of
genes that have become nonfunctional (identified as such by interruptions in
their protein-coding sequences). These are called pseudogenes (see 4.6 Pseudogenes are dead ends of
evolution). The number of pseudogenes can be large. In the mouse and human
genomes, the number of pseudogenes is ~10% of the number of (potentially) active
genes (see 3.10 The conservation of
genome organization helps to identify genes).
Besides needing to know the density of genes to estimate the
total gene number, we must also ask: is it important in itself? Are there
structural constraints that make it necessary for genes to have a certain
spacing, and does this contribute to the large size of eukaryotic genomes?