There has been a recent set of papers regarding the proper reference setting, or normalization, of microarray gene expression data. The authors argue that the methodology commonly used to ascertain the effects of one transcription factor, c-Myc, may introduce substantial errors.
The c-Myc Analysis
A review of the papers and related discourse has been
provided in Science[1].
The author of the Science review states:
(The authors)…argue that Myc's cancerous effects are much broader than most
people have assumed, and that a flawed experimental method may have thrown off
a decade's worth of Myc
research…
The problem highlighted by the Myc studies may sound modest, but
after thinking about it for a year, Levens and Young don't see a simple fix.
The only way to update the older work on Myc
and gene expression, they say, is to go back to the lab and redo the
experiments. That view, Young says, causes “the most angst” among biologists.
He says he understands the “desperation” of bioinformaticists seeking a way to
tweak existing data into better shape, but he can't offer one that doesn't
require lab work.
Levens suggests that this problem arose partly
because scientists viewed Myc
as a “master regulator of master regulators,” one that sends a signal along
defined pathways to an array of specific targets that send out additional
signals, creating a dizzying pattern of interactions. In reality, Levens says,
“Myc is not a high
executive making lots of decisions but a dumb bureaucrat enforcing a rule.” And
the rule itself seems pretty simple: If a gene is expressed, increasing Myc in nearly all cases increases that
gene's level of expression. Genes that are already highly expressed are boosted
more, so the impact of Myc
is “exponential,” Levens says.
The broad effects of Myc were overlooked, according to
Levens and Young, because the standard procedure in gene-expression experiments
has been to use similar quantities of RNA from the samples being compared and
to normalize results to mean RNA. Young calculates that this practice
erroneously deflates Myc's
effects two- to threefold. He says it also creates the impression that some
genes are turned down when they are not.
The paper by Lovén et al states[2]:
Gene expression analysis is a widely used and powerful
method for investigating the transcriptional behavior of biological systems,
for classifying cell states in disease, and for many other purposes. Recent
studies indicate that common assumptions currently embedded in experimental and
analytical practices can lead to misinterpretation of global gene expression
data. We discuss these assumptions and describe solutions that should minimize
erroneous interpretation of gene expression data from multiple analysis
platforms.
Let us examine what they are saying, briefly reproducing their argument. Let us consider two cases.
First, we assume a limited transcriptional response. Namely, we assume that c-Myc activates and transcribes only genes B and E. We demonstrate this below, where we show what we believe nature is truly doing:
Now we examine, via say a microarray analysis, the expression of the transcribed RNA, and we plot the expression for each gene when we have the effector and when we do not; namely, when a condition such as a cancer is present and when we know it is not. The objective is to ascertain whether there is some gene expression which we can then putatively relate to this specific cancer; perhaps we can use it as a target for therapy. If we were to plot for each gene the intensity of expression as measured in a microarray analysis, we may get the following chart.
Note in the above that we have a cluster of unexpressed genes and just two expressed genes. The expressed genes have a higher count and thus stand out. Now the standard procedure, as the authors describe it, is to:
1) Normalize the chart above, namely scale each sample so that its total measured RNA is set to a common reference value, say 1.0, and
2) Then plot the log of what they call the fold changes, the ratio of normalized expression with the effector to that without it. Simply put, this plot accentuates large variations while de-emphasizing small ones. The resulting chart is as follows:
Thus we see that the two genes putatively expressed are identified as standing out in the process. This, the authors argue, is what researchers have assumed and have been doing. A small numerical sketch of this two-step procedure follows.
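To make the procedure concrete, here is a minimal numerical sketch in Python. The gene counts and the ten-fold boost are hypothetical illustrations chosen for this note, not data from the papers; only two genes, standing in for B and E, respond to the effector.

```python
import numpy as np

n_genes = 1000
control = np.full(n_genes, 10.0)   # low baseline counts for every gene, no effector
with_myc = control.copy()
with_myc[[1, 4]] = 100.0           # genes "B" and "E" boosted ten-fold

def normalize_to_total(x):
    """Scale a sample so its total signal is 1.0 -- the usual assumption
    that both samples contain the same amount of RNA."""
    return x / x.sum()

log2_fc = np.log2(normalize_to_total(with_myc) / normalize_to_total(control))
print(log2_fc[[1, 4]])   # ~ +3.3 : the two responding genes stand out
print(log2_fc[0])        # ~ -0.03: the unchanged genes sit near zero
# With only a limited response, total-RNA normalization barely distorts the
# picture, and the log fold-change plot highlights B and E as expected.
```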
Now the authors argue that a protein like c-Myc in fact amplifies transcription across the whole set of expressed genes. They call this transcriptional amplification. It results in all expressed genes being transcribed at higher levels.
In the case of transcriptional amplification we obtain a
plot as follows:
But now when we normalize it we obtain:
Note that now the two profiles align closely, and when we apply the log fold-change metric again most genes show little or no change, with some even appearing to be turned down, even though every expressed gene was in fact boosted. A sketch of this effect follows.
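Continuing the hypothetical numbers above, a minimal sketch of the amplification case shows how normalization to total RNA hides a uniform global boost:

```python
import numpy as np

n_genes = 1000
control = np.full(n_genes, 10.0)   # every expressed gene at a baseline level
amplified = control * 3.0          # c-Myc boosts every expressed gene three-fold

def normalize_to_total(x):
    """Scale a sample so its total signal is 1.0."""
    return x / x.sum()

log2_fc = np.log2(normalize_to_total(amplified) / normalize_to_total(control))
print(log2_fc.min(), log2_fc.max())   # both ~0.0: the global boost is invisible
# If the boost is uneven (stronger for highly expressed genes), the weakly
# boosted genes acquire negative fold changes and appear "turned down".
```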
We may not be seeing the correct answer, it is argued: the normalization fails to truly represent the expression involved. The proposed solution is to use “spiked-in” RNA standards added in fixed proportion to cell number. When that is done we have the following:
Note that now we have all positive expressions. This is the response that was anticipated, and it is obtained by using the reference set of spiked-in RNA to standardize the data; a sketch of this normalization follows.
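Under the same hypothetical numbers, a minimal sketch of spike-in normalization: the synthetic RNA is added at a fixed amount per cell, so its measured signal tracks cell number rather than total cellular RNA, and dividing by it recovers the true amplification.

```python
import numpy as np

n_genes = 1000
control = np.full(n_genes, 10.0)
amplified = control * 3.0          # true three-fold amplification of every gene

spike_control = 50.0               # spike-in signal, the same per cell in both samples
spike_amplified = 50.0

def normalize_to_spike(x, spike):
    """Scale gene signals by the spike-in signal instead of total RNA."""
    return x / spike

log2_fc = np.log2(normalize_to_spike(amplified, spike_amplified) /
                  normalize_to_spike(control, spike_control))
print(log2_fc.min(), log2_fc.max())   # both ~1.58 = log2(3): all genes correctly up
```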
The authors conclude:
Our results indicate that spike-in controls of the type
described here are a robust, cross-platform method to allow normalization to
cell number and thus enable more accurate detection of differential gene
expression and changes in gene expression programs. The clear implication is
that the use of spike-in controls normalized to cell number should become the
default standard for all expression experiments, as opposed to their more
limited use in experiments where gross changes in RNA levels are already
anticipated, as exemplified by transcription shutdown experiments …
When cell counting may be problematic, as for expression
experiments from solid tumors or tissues, DNA content may be used as a surrogate
if ploidy and DNA replication profiles are also characterized to prevent the introduction
of a DNA content-based artifact.
The discovery of transcriptional amplification and the
realization that common experimental methods may lead to erroneous interpretation
of gene expression experiments has implications for much current biological
research.
How prevalent is misinterpretation of genome-wide
expression data due to the assumption that cells produce similar levels of
total RNA? The answer is likely related to the prevalence of regulatory
mechanisms that globally amplify or suppress transcription. What are the implications
for classifying cell states in disease? Significant effort is being devoted to
expression profiling cancer cells and these studies use standard normalization
methods …
Because c-Myc expression occurs at widely varying levels
in various tumor cells, transcriptional amplification is likely having a
profound impact on cancer cell signatures. Where expression data are being used
to gain insights into cancer cell behavior and regulation, it should be
interpreted with added caution.
There is great validity to these conclusions, but as the Science article states, they may force a massive re-examination of many of the results from prior analyses.
Pathway System Models
c-Myc is a strong transcription factor and is implicated in many cancers. It is also examined as we look for other gene expression signatures which may be diagnostic, prognostic, or targets for therapeutics. Its significance is therefore undoubted. We briefly look at the system elements of c-Myc.
First, we display the generic Weinberg model below. This is
the ligand, receptor signalling and transcription factor model with the resulting
elements of growth, migration, apoptosis, adhesion and differentiation. When we
look at c-Myc we are looking at one element in a chain, with inputs and outputs
and resulting changes in state. We must always remember that we are examining a
system of interconnected elements.
Now we can take the detail a layer below. Here we show c-Myc as the dominant transcription factor in the changes in gene expression. It is a driver: in many cases it makes genes transcribe more than they usually would, and this in turn is one of many elements that result in the change of state in cells. Remember that it is one of many, and that it is a single element in a chain of related, and more importantly interrelated, genes which are expressed.
Let us examine two driving pathways. We have detailed these in our studies of melanoma and prostate cancer. We have seen in them drivers such as the PTEN and B-RAF genes and their protein products.
Finally the total complexity can be modeled as we show
below.
Ultimately we have multiple genes and gene products which are arranged in a systems manner, and we are seeking the driving models for these elements. We have discussed this previously; namely, we know that there are scores of interrelated genes which “drive” other genes, directly as transcription factors or indirectly as drivers of transcription factors.
In the final description of c-Myc below we show the actual transcription of c-Myc and then its use as a transcription factor and the resulting downstream products. This complete analysis will be critical when assembling a measurement methodology and validation.
Observations
As with many other results of this kind, these often raise more issues than they resolve. We will just comment on a few.
1. Normalization: Normalization and reference levels are always key to understanding the results. We have examined this factor in microarrays before. We have also examined it in our analysis of flower color tessellation, attempting to reconstruct pathways from spectroscopic analysis of cell-by-cell anthocyanin expression. It is a standard problem. However, we have argued that by having a system model and using identification techniques we gain two key factors: first, a model which links elements together; second, a model with endpoints which allows for consistency testing.
2. System Models: We have argued that having a system model
for gene and gene product interaction is essential to validate measurements. We
believe that system models allow for three key results:
- Linkages: The system models expressly stipulate linkages between products and in turn their activation capacities. These provide substantial added elements for normalization.
- Boundary Constraints: The system models allow for the checking of boundary values. Namely do the results make sense?
- Identification Metrics: The system models also provide a strong basis for identification methodologies to be used to identify new linkages.
3. Validation: Validation of data is key. This means more than just calibration; validation means establishing true causality. As we have noted, it seems that each and every day one sees more genes related to more diseases than ever before, but one must always question the issue of causality.
[1] See Marshall, Science, 2 Nov 2012, http://www.sciencemag.org/content/338/6107/593.full
[2] Lovén et al, Revisiting Global Gene Expression Analysis, Cell, 151, October 26, 2012.