1. Introduction
Scientists routinely lecture and write about gene expression and the abundance of transcripts, but in reality, they extrapolate this information from a variety of measurements that different technologies may provide. Indeed, there are many reasons that applying different technologies to transcript abundance may give different results. This may result from an incomplete understanding of the gene in question or from shortcomings in the applications of the technologies.
The first key factor to appreciate in measuring gene expression is the way that genes are organized and how this influences the transcripts in a cell. Figure 1 depicts some of the scenarios that have been determined from sequence analyses of the human genome. Most genes are composed of multiple exons that are transcribed together with intron sequences and then spliced together. Some genes exist entirely between the exons of other genes, in either the forward or the reverse orientation. This poses a problem because it is possible to recover a fragment or clone that could belong to multiple genes, be derived from an unspliced transcript, or be the result of genomic DNA contaminating the RNA preparation. All of these events can create confusing and confounding results. Additionally, the gene duplication events that have occurred in more complex organisms have led to the existence of closely related gene families that coincidentally may lie near each other in the genome. Moreover, although there are probably fewer than 50,000 human genes, the exons within those genes can be spliced together in a variety of ways, with some genes documented to produce more than 100 different transcripts (1).
( Methods in Embryo Transplant Microscopes and Emyro Transplant Microscopy Vol. 258: Gene Expression Profiling: Methods and Protocols )
Therefore, there may be several hundred thousand distinct transcripts, with potentially many common sequences. Gene biology is even more interesting and complex, however, in that genetic variations in the form of single nucleotide polymorphisms (SNPs) frequently cause humans and diploid or polyploid model systems to have two (or more) distinct versions of the same transcript. This set of facts negates the possibility that a single, simple technology can accurately measure the abundance of a specific transcript. Most technologies probe for the presence of pieces of a transcript that can be confounded by closely related genes, overlapping genes, incomplete splicing, alternative splicing, genomic DNA contamination, and genetic polymorphisms. Thus, independent methods that verify the results in different ways to the exclusion of confounding variables are necessary, but frequently not employed, to gain a clear understanding of the expression data. The specific means to work around these confounding variables are mentioned here, but a blend of techniques will be necessary to achieve success.
2. Methods and Considerations
There are nine basic considerations for choosing a technology for quantitating gene expression: architecture, specificity, sensitivity, sample requirement, coverage, throughput, cost, reproducibility, and data management.
2.1. Architecture
We define the architecture of a gene-expression analysis system as either an open system, in which it is possible to discover novel genes, or a closed system, in which only a known gene or set of genes is queried. Depending on the application, there are numerous advantages to open systems. For example, an open system may detect a relevant biological event that affects splicing or genetic variation. In addition, the most innovative biological discovery processes have involved the discovery of novel genes. In an era in which multiple genome sequences have been determined, this may no longer be the case. Even so, the genomic sequence of an organism has not proven sufficient for the determination of all of the transcripts encoded by that genome, and thus there remain prospects for novelty regardless of the biological system. In model systems that are relatively uncharacterized at the genomic or transcript level, entire technology platforms may be excluded as possibilities. For example, if one is studying transcript levels in a rabbit, one cannot comprehensively apply a hybridization technology because there are not enough transcripts known for this to be of value. If one simply wants to know the levels of a set of known genes in an organism, a hybridization technology may be the most cost-effective choice, provided the number of genes is sufficient to warrant the cost of producing a gene array.
2.2. Specificity
The evolution of genomes through gene or chromosomal fragment duplications, and the subsequent selection for their retention, has resulted in many gene families, some of which share substantial conservation at the protein and nucleotide level. The ability of a technology to discriminate between closely related gene sequences must be evaluated in this context in order to determine whether one is measuring the level of a single transcript or the combined levels of multiple transcripts detected by the same probing means. This is a double-edged sword, because when confronted with a genetic polymorphism, technologies with high specificity may fail to identify one allele or may detect it to a different degree than another allele. This can lead to the false positive of an expression differential, or to the false negative of no apparent expression at all. Many methods address this by surveying multiple samples of the same class and by probing multiple points on the same gene. Methods that do this effectively are preferred to those that do not.
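The cross-detection concern can be illustrated with a minimal Python sketch. The transcript names, sequences, and probes below are invented for illustration and do not correspond to real genes, and exact substring matching is only a crude stand-in for hybridization, which in practice tolerates some mismatches. The point is simply that a probe drawn from a conserved region reports the combined level of every family member, whereas probes placed at several points on a gene can reveal which signals are specific.

```python
# Toy transcripts (hypothetical): two gene-family members that share a
# conserved region but differ further downstream.
TRANSCRIPTS = {
    "famA": "ATGGCCTTAGGCTACGATCGGATTACCGGTA",
    "famB": "ATGGCATTAGGCTACGATCGCTTGACCGGTA",
}


def transcripts_hit(probe, transcripts):
    """Return the sorted names of all transcripts that contain the probe exactly."""
    return sorted(name for name, seq in transcripts.items() if probe in seq)


def classify_probes(probes, transcripts):
    """Label each probe as specific to one transcript, shared, or unmatched."""
    result = {}
    for probe in probes:
        hits = transcripts_hit(probe, transcripts)
        if len(hits) == 1:
            result[probe] = "specific to " + hits[0]
        elif hits:
            result[probe] = "shared by " + ", ".join(hits)
        else:
            result[probe] = "no match"
    return result


# Three hypothetical probes placed at different points along the genes.
PROBES = ["TTAGGCTACGATCG", "GGATTACC", "CTTGACC"]

for probe, verdict in classify_probes(PROBES, TRANSCRIPTS).items():
    print(probe, "->", verdict)
```

In this toy case the first probe lies in the conserved region and would report the summed signal of both family members, while the other two each distinguish a single member.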
2.3. Sensitivity
The ability to detect low-abundance transcripts is an integral part of gene discovery programs. Low-abundance transcripts, in principle, have properties that are of particular importance to the study of complex organisms. Rare transcripts frequently encode proteins present at low physiologic concentrations, which in many cases makes them potent by their very nature. Erythropoietin is a classic example of such a rare transcript. Amgen scientists functionally cloned erythropoietin long before it appeared in the public expressed sequence tag (EST) database. Genes are frequently discovered in the order of transcript abundance, and a simple analysis of EST databases correctly reveals high-, medium-, and low-abundance transcripts by a direct correlation with the number of occurrences in that database (data not shown). Thus, using a technology that is more sensitive has the potential to identify novel transcripts even in a well-studied system. Sensitivity values are quoted in publications for available technologies at concentrations of 1 part in 50,000 to 1 part in 500,000. These figures should be interpreted cautiously, however, with attention both to the method by which the sensitivity was determined and to the sensitivity needed for the intended use. For example, if one intends to study appetite-signaling factors and uses an entire rat brain for expression analysis, the dilution of the target cells of anywhere from 1 part in 10,000 to 1 part in 100,000 allows for only the most abundant transcripts in the rare cells to be measured, even with the most sensitive technology available. Reliance on cell models to do the same type of analysis, where possible, suffers the confounding variable that isolated cells or cell lines may respond differently in culture at the level of gene expression.
An ideal scenario would be to carefully microdissect or sort the cells of interest and study them directly, provided enough samples can be obtained. In addition to the ability of a technology to measure rare transcripts, the sensitivity to discern small differentials between transcripts must be considered. The differential sensitivity limit has been reported for a variety of techniques as ranging from 1.5-fold to 5-fold, so the user must determine how important small modulations are to the overall project and choose the technology while taking this property into account as well.
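The arithmetic behind the rat brain example can be made explicit with a short Python sketch. It rests on two simplifying assumptions that are not stated in the text: the transcript of interest is made only in the target cells, and every cell contributes the same amount of RNA to the sample. The function name is illustrative.

```python
from fractions import Fraction


def min_within_cell_abundance(detection_limit, target_cell_fraction):
    """Smallest share of a target cell's transcripts that a message must
    occupy to reach the assay's detection limit in the whole-tissue sample.

    Assumes the transcript is made only in the target cells and that every
    cell contributes an equal amount of RNA (both are simplifications).
    """
    return detection_limit / target_cell_fraction


# Figures quoted in the text: the most sensitive assays detect 1 part in
# 500,000, and the target cells are diluted 1 in 10,000 to 1 in 100,000.
BEST_LIMIT = Fraction(1, 500_000)

for dilution in (10_000, 100_000):
    needed = min_within_cell_abundance(BEST_LIMIT, Fraction(1, dilution))
    print(f"target cells 1 in {dilution:,}: "
          f"transcript must be at least {needed} of that cell's transcripts")
```

Under these assumptions, a transcript would need to account for 1/50 to 1/5 of all transcripts in the rare cells to be seen at all, which is why only the most abundant messages in those cells can be measured.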
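The effect of the differential sensitivity limit can be sketched in a few lines of Python. The gene names and abundance values are hypothetical, chosen only to show how the set of reportable changes shrinks as the limit rises from 1.5-fold to 5-fold.

```python
def fold_change(a, b):
    """Magnitude of the differential between two positive abundances (always >= 1)."""
    if a <= 0 or b <= 0:
        raise ValueError("abundances must be positive")
    return max(a, b) / min(a, b)


def detectable_changes(pairs, limit):
    """Return the genes whose fold change meets the assay's differential limit."""
    return sorted(
        gene for gene, (control, treated) in pairs.items()
        if fold_change(control, treated) >= limit
    )


# Hypothetical (control, treated) abundances in arbitrary units.
MEASUREMENTS = {
    "geneA": (100, 180),  # 1.8-fold increase
    "geneB": (40, 10),    # 4-fold decrease
    "geneC": (55, 60),    # about 1.1-fold, below either limit
    "geneD": (5, 30),     # 6-fold increase
}

for limit in (1.5, 5.0):
    print(f"limit {limit}-fold:", detectable_changes(MEASUREMENTS, limit))
```

With these invented numbers, a technique with a 1.5-fold limit reports three of the four genes, whereas one with a 5-fold limit reports only the largest change.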
2.4. Sample Requirement
The requirement for studying transcript abundance levels is a cell or tissue substrate, and the amount of such material needed for analysis can be prohibitively high with many technologies in many model systems. To use the above example, dozens of dissected rat hypothalami may be required to perform a global gene expression study, depending on the quantitating technology chosen. Samples procured by laser-capture microdissection can only be used in the measuring of a small number of transcripts and only with some technologies, or must be subjected to amplification technologies, which risk artificially altering transcript ratios.
2.5. Coverage
For open architecture systems where the objective is to profile as many transcripts as possible and identify new genes, the number of independent transcripts being measured is an important metric. However, this is one of the most difficult parameters to measure, because determining what fraction of unknown transcripts is missing is not possible. Despite this difficulty, predictive models can be made to suggest coverage, and the intuitive understanding of the technology is a good gauge for the relevance and accuracy of the predictive model.
The problem of incomplete coverage is perhaps one of the most embarrassing reasons why hundreds of scientific publications produced in the 1970s and 1980s have relatively little value. Many of these papers reported the identification of a single differentially expressed gene in some model system and expounded upon the overwhelmingly important new biological pathway uncovered. Modern analysis has demonstrated that even in the most similar biological systems or states, finding differences in 1% of transcripts is common, with this number increasing to 20% of transcripts or more for systems in which major changes in growth or activation state are signaled. In fact, the activation of a single transcription factor can induce the expression of hundreds of genes. Considering any one altered transcript, however abundant, without an understanding of what other transcripts are altered is similar to independent observers each describing the small part of an elephant that they can see. The person looking at the trunk describes the elephant as long and thin, the person observing an ear believes it to be flat, soft, and furry, and the observer examining a foot describes the elephant as hard and wrinkly. Seeing the list of the majority of transcripts that are altered in a system is like looking at the entire elephant, and only then can it be accurately described. Separating the key regulatory genes on a gene list from the irrelevant changes remains one of the biggest challenges in the use of transcript profiling.
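To put the percentages above in perspective, the following Python sketch converts them into counts. The total of 200,000 distinct transcripts is an assumption chosen for illustration (the Introduction suggests only "several hundred thousand"), so the resulting numbers indicate scale rather than measured values.

```python
def expected_changes(total_transcripts, fraction_altered):
    """Approximate number of altered transcripts for a given altered fraction."""
    return round(total_transcripts * fraction_altered)


# Assumed total; the text says only "several hundred thousand" transcripts.
ASSUMED_TOTAL = 200_000

for label, fraction in (("similar states", 0.01), ("major state change", 0.20)):
    altered = expected_changes(ASSUMED_TOTAL, fraction)
    share = 1 / altered  # what one reported gene represents
    print(f"{label}: about {altered:,} altered transcripts; "
          f"a single reported gene is {share:.4%} of them")
```

Under this assumption, a paper reporting one differentially expressed gene would be describing roughly one change in two thousand, or one in forty thousand when a major change in state is involved.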
2.6. Throughput
The throughput of the technology, as defined by the number of transcript samples measured per unit time, is an important consideration for some projects. When quick turnaround is desired, it is impractical to print microarrays, but where large numbers of data points need to be generated, techniques where individual reactions are required are impractical. Where large experiments on new models generate significant expense, it may be practical to perform a higher throughput, lower quality assay as a control prior to a large investment. For example, prior to conducting a comprehensive gene profiling experiment in a drug dose-response model, it might be practical to first use a low throughput technique to determine the relevance of the samples prior to making the investment with the more comprehensive analysis.
2.7. Cost
Cost can be an important driver in the decision of which technologies to employ. For some methods, substantial capital investment is required to obtain the equipment needed to generate the data. Thus, one must determine whether a microarray scanner or a capillary electrophoresis machine is obtainable, or if X-ray film and a developer need to suffice. It should be noted that as large companies change platforms, used equipment becomes available at prices dramatically less than those for brand new models. In some cases, homemade equipment can serve the purpose as well as commercial apparatuses at a fraction of the price.
2.8. Reproducibility
It is desired to produce consistent data that can be trusted, but there is more value to highly reproducible data than merely the ability to feel confident about the conclusions one draws from them. The ability to forward integrate the findings of a project and to compare results achieved today with results achieved next year and last year, without having to repeat the experiments, is key to managing large projects successfully. Changing transcript-profiling technologies often results in datasets that are not directly comparable, so deciding upon and persevering with a particular technology has great value to the analysis of data in aggregate. An excellent example of this is with the serial analysis of gene expression (SAGE) technique, where directly comparable data have been generated by many investigators over the course of decades.
2.9. Data Management
Management and analysis of data are the natural continuation of the discussion of reproducibility and integration. Some techniques, like differential display, produce complex data sets that are neither reproducible enough for subsequent comparisons nor easily digitized. Microarray and GeneCalling data, however, can be obtained with software packages that determine the statistical significance of the findings and can even organize the findings by molecular function or biochemical pathway. Such tools offer a substantial advance in the generation of accretive data. The field of bioinformatics is flourishing as the number of data points generated by high-throughput technologies has rapidly exceeded the capacity of biologists to analyze the data.
Reference
1. Ushkaryov, Y. A. and Sudhof, T. C. (1993) Neurexin IIIa: extensive alternative splicing generates membrane-bound and soluble forms. Proc. Natl. Acad. Sci. USA 90, 6410–6414.
Gene Expression Quantitation Technology Summary
Bulaqueña, et al.
Summary
Scientists routinely talk and write about gene expression and the abundance of transcripts, but in reality they extrapolate this information from the various measurements that a variety of different technologies provide. Indeed, there are many reasons why applying different technologies to the problem of transcript abundance may give different results, owing either to an incomplete understanding of the gene in question or to shortcomings in the applications of the technologies. There are nine basic considerations for making a technology choice for quantitating gene expression that will impact the overall outcome: architecture, specificity, sensitivity, sample requirement, coverage, throughput, cost, reproducibility, and data management. These considerations will be discussed in the context of available technologies.
Key Words: Architecture, bioinformatics, coverage, quantitative, reproducibility, sensitivity, specificity, throughput
