Monday, November 12, 2012

c-Myc, Systems Models, and Calibration

A recent set of papers addresses the proper reference setting for microarray expression data. The authors argue that the methodology used to ascertain the effects of one transcription factor, c-Myc, may introduce substantial errors.
We briefly examine the result and then attempt to place it in the context of a full system model.

The c-Myc Analysis

A review of the papers and related discourse has been provided in Science[1].

The author of the Science review states:

(The authors)…argue that Myc's cancerous effects are much broader than most people have assumed, and that a flawed experimental method may have thrown off a decade's worth of Myc research…

The problem highlighted by the Myc studies may sound modest, but after thinking about it for a year, Levens and Young don't see a simple fix. The only way to update the older work on Myc and gene expression, they say, is to go back to the lab and redo the experiments. That view, Young says, causes “the most angst” among biologists. He says he understands the “desperation” of bioinformaticists seeking a way to tweak existing data into better shape, but he can't offer one that doesn't require lab work.

Levens suggests that this problem arose partly because scientists viewed Myc as a “master regulator of master regulators,” one that sends a signal along defined pathways to an array of specific targets that send out additional signals, creating a dizzying pattern of interactions. In reality, Levens says, “Myc is not a high executive making lots of decisions but a dumb bureaucrat enforcing a rule.” And the rule itself seems pretty simple: If a gene is expressed, increasing Myc in nearly all cases increases that gene's level of expression. Genes that are already highly expressed are boosted more, so the impact of Myc is “exponential,” Levens says.

The broad effects of Myc were overlooked, according to Levens and Young, because the standard procedure in gene-expression experiments has been to use similar quantities of RNA from the samples being compared and to normalize results to mean RNA. Young calculates that this practice erroneously deflates Myc's effects two- to threefold. He says it also creates the impression that some genes are turned down when they are not.

The paper by Lovén et al. states[2]:

Gene expression analysis is a widely used and powerful method for investigating the transcriptional behavior of biological systems, for classifying cell states in disease, and for many other purposes. Recent studies indicate that common assumptions currently embedded in experimental and analytical practices can lead to misinterpretation of global gene expression data. We discuss these assumptions and describe solutions that should minimize erroneous interpretation of gene expression data from multiple analysis platforms.

Let us examine what they are saying. We shall briefly reproduce their argument. Let us consider two cases.

First, we assume a limited transcriptional response. Namely, we assume that c-Myc activates the transcription of genes B and E. We demonstrate this below, where we show what we believe nature is truly doing:

Now we examine, via say a microarray analysis, the expression of the transcribed RNA, and we plot the expression for each gene with and without the effector, namely when a condition such as a cancer is present and when we know it is not. The objective is to ascertain whether there is some gene expression which we can then putatively relate to this specific cancer. Perhaps we can use it as a target for therapy. If we were to plot for each gene the intensity of expression as measured in a microarray analysis we may get the following chart.

Note in the above that we have a cluster of unexpressed genes and just two expressed genes. The expressed genes have a higher count and thus stand out. The conventional analysis, as the authors describe it, is then to:

1) Normalize the chart above, namely scale the data so that the mean expression is set to a reference value of, say, 1.0.

2) Then plot the log of what they call the fold changes, that is, the ratio of expression with the effector to expression without it. On a log scale unchanged genes sit at zero, while increases and decreases appear as positive and negative values. The resulting chart is as follows:
Thus we see that the two genes putatively expressed stand out in the process. This, the authors argue, is what researchers have assumed and have been doing.
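The two steps above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the gene counts, the fourfold response in B and E, and the helper names `normalize_to_mean` and `log2_fold_changes` are all assumptions made for the example.

```python
import math

# Hypothetical per-cell counts: 98 background genes plus genes B and E.
control = {f"bg{i}": 10.0 for i in range(98)}
control.update({"B": 10.0, "E": 10.0})
treated = dict(control)
treated.update({"B": 40.0, "E": 40.0})  # assumed 4x response in B and E only


def normalize_to_mean(counts):
    """Scale counts so that the mean expression in the sample is 1.0."""
    mean = sum(counts.values()) / len(counts)
    return {gene: value / mean for gene, value in counts.items()}


def log2_fold_changes(reference, sample):
    """Return log2(sample / reference) for every gene."""
    return {gene: math.log2(sample[gene] / reference[gene]) for gene in reference}


lfc = log2_fold_changes(normalize_to_mean(control), normalize_to_mean(treated))
standouts = [gene for gene, value in lfc.items() if abs(value) > 1.0]
print(standouts)  # genes whose apparent change exceeds twofold
```

With only two responders among a hundred genes the mean barely moves, so the background stays near zero on the log scale and B and E stand out. That is the situation in which mean normalization does little harm.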

Now the authors argue that a protein like c-Myc in fact boosts transcription across the whole set of expressed genes. They call this transcriptional amplification. It results in all expressed genes being transcribed at higher levels.

In the case of transcriptional amplification we obtain a plot as follows:

But now when we normalize it we obtain:


Note that the normalized values now align rather closely, but using the log metric again we can still make differences in expression appear.
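A small numerical sketch shows why normalization misleads in the amplification case. The counts and per-gene amplification factors below are hypothetical, chosen only so that every gene truly goes up and highly expressed genes go up more.

```python
import math

# Hypothetical per-cell counts; every gene is amplified, high expressers more.
control = [5.0, 10.0, 20.0, 40.0, 80.0]
factors = [1.5, 1.75, 2.0, 2.5, 3.0]  # assumed amplification per gene
treated = [c * f for c, f in zip(control, factors)]


def normalize_to_mean(counts):
    """Scale counts so that the mean expression in the sample is 1.0."""
    mean = sum(counts) / len(counts)
    return [c / mean for c in counts]


# True per-cell log2 fold changes: all positive by construction.
true_lfc = [math.log2(f) for f in factors]

# Apparent log2 fold changes after each sample is normalized to its own mean.
norm_c = normalize_to_mean(control)
norm_t = normalize_to_mean(treated)
apparent_lfc = [math.log2(t / c) for c, t in zip(norm_c, norm_t)]

# Growth in total RNA per cell, which mean normalization silently divides out.
deflation = sum(treated) / sum(control)

for true, seen in zip(true_lfc, apparent_lfc):
    print(f"true {true:+.2f}   apparent {seen:+.2f}")
```

Each apparent value equals the true value minus log2 of `deflation`, so every gene that was amplified less than the overall growth in RNA appears to be turned down even though it was not.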

It is argued that we may then not be seeing the correct answer. The normalization is the concern: it fails, the authors argue, to represent the true expression involved. The solution is to use “spiked-in” RNA standards, added in proportion to cell number. When that is done we have the following:

Note that now all the changes in expression are positive. This is the response that was anticipated, and it is obtained by using the reference set of RNA to standardize the data.
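The effect of the spike-in reference can be sketched as follows. This is a simplified model under stated assumptions: a fixed, hypothetical amount of spike-in RNA is added per cell, the platform reports each species as a fraction of the pooled RNA, and the names `measure` and `spike_in_normalize` are illustrative only.

```python
import math

SPIKE_PER_CELL = 50.0  # hypothetical amount of spike-in RNA added per cell

# Same hypothetical amplification scenario: every gene truly goes up.
control = [5.0, 10.0, 20.0, 40.0, 80.0]
factors = [1.5, 1.75, 2.0, 2.5, 3.0]
treated = [c * f for c, f in zip(control, factors)]


def measure(per_cell_counts, spike=SPIKE_PER_CELL):
    """Simulate loading equal total RNA: each signal is a fraction of the pool."""
    pool = sum(per_cell_counts) + spike
    gene_signals = [c / pool for c in per_cell_counts]
    spike_signal = spike / pool
    return gene_signals, spike_signal


def spike_in_normalize(gene_signals, spike_signal):
    """Express each gene relative to the spike-in, i.e. relative to cell number."""
    return [g / spike_signal for g in gene_signals]


norm_c = spike_in_normalize(*measure(control))
norm_t = spike_in_normalize(*measure(treated))
recovered_lfc = [math.log2(t / c) for c, t in zip(norm_c, norm_t)]
true_lfc = [math.log2(f) for f in factors]
print(all(value > 0 for value in recovered_lfc))
```

Because the spike-in is tied to cell number rather than to total RNA, dividing by its signal cancels the unknown pool size, and in this idealized model the true per-cell fold changes are recovered.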

The authors conclude:

Our results indicate that spike-in controls of the type described here are a robust, cross-platform method to allow normalization to cell number and thus enable more accurate detection of differential gene expression and changes in gene expression programs. The clear implication is that the use of spike-in controls normalized to cell number should become the default standard for all expression experiments, as opposed to their more limited use in experiments where gross changes in RNA levels are already anticipated, as exemplified by transcription shutdown experiments …

When cell counting may be problematic, as for expression experiments from solid tumors or tissues, DNA content may be used as a surrogate if ploidy and DNA replication profiles are also characterized to prevent the introduction of a DNA content-based artifact.

The discovery of transcriptional amplification and the realization that common experimental methods may lead to erroneous interpretation of gene expression experiments has implications for much current biological research.

How prevalent is misinterpretation of genome-wide expression data due to the assumption that cells produce similar levels of total RNA? The answer is likely related to the prevalence of regulatory mechanisms that globally amplify or suppress transcription. What are the implications for classifying cell states in disease? Significant effort is being devoted to expression profiling cancer cells and these studies use standard normalization methods …

Because c-Myc expression occurs at widely varying levels in various tumor cells, transcriptional amplification is likely having a profound impact on cancer cell signatures. Where expression data are being used to gain insights into cancer cell behavior and regulation, it should be interpreted with added caution.

There is great validity to these conclusions, but as the Science article states, they may force a massive re-examination of many of the results from prior analyses.

Pathway System Models

c-Myc is a strong transcription factor and is implicated in many cancers. It is also examined when we look for other expressed genes which may be diagnostic or prognostic, or which may serve as targets for therapeutics. Its significance is thus considerable. We briefly look at the system elements of c-Myc.

First, we display the generic Weinberg model below. This is the ligand, receptor signalling and transcription factor model with the resulting elements of growth, migration, apoptosis, adhesion and differentiation. When we look at c-Myc we are looking at one element in a chain, with inputs and outputs and resulting changes in state. We must always remember that we are examining a system of interconnected elements.

Now we can take the detail a layer below. Here we show c-Myc as the dominant transcription factor in the changes in gene expression. It is a driver: in many cases it makes genes transcribe more than they usually would, and this in turn is one of many elements that result in the change of state in cells. Remember it is one of many. Remember also that it is a single element in a chain of related and, more importantly, interrelated genes which are expressed.
 
Let us examine two driving pathways. We have detailed these in our studies of melanoma and prostate cancer. We have seen in them the drivers above, such as the PTEN and B-RAF genes and their protein products.

Finally the total complexity can be modeled as we show below.
 
Ultimately we have multiple genes and gene products which are arranged as a system, and we are seeking the models that drive these elements. We have discussed this previously: we know that there are scores of interrelated genes which “drive” other genes, directly as transcription factors or indirectly as drivers of transcription factors.

In the final description of c-Myc below we show the actual transcription of c-Myc and then its use as a transcription factor and the resulting down chain products. This complete analysis will be critical when assembling a measurement methodology and validation.

Observations

Results of this kind often raise more issues than they solve. We will comment on just a few.

1. Normalization: Normalization and reference levels are always key to understanding the results. We have examined this factor in microarrays before. We have also examined this in our analysis of flower color tessellation, attempting to reconstruct pathways from spectroscopic analysis of cell by cell anthocyanin expression. It is a standard problem. However we have argued that by having a system model and using identification techniques we have two key factors: first, we have a model which links elements together; second, we have a model with end points which allow for consistency testing.

2. System Models: We have argued that having a system model for gene and gene product interaction is essential to validate measurements. We believe that system models allow for three key results:

  • Linkages: The system models expressly stipulate linkages between products and in turn their activation capacities. These provide substantial added elements for normalization. 
  • Boundary Constraints: The system models allow for the checking of boundary values. Namely do the results make sense?
  • Identification Metrics: The system models also provide a strong basis for identification methodologies to be used to identify new linkages.
  
3. Validation: Validation of data is key. This means more than just calibration. Validation means true causality. As we have noted it seems that each and every day one sees more genes related to more diseases than ever before. But one must always question the issue of causality.
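The boundary-constraint idea in item 2 can be illustrated with a small consistency check. This is only a sketch: the linkage table, the gene names, the tolerance, and the function `inconsistent_links` are hypothetical and stand in for whatever a real system model would stipulate.

```python
# Hypothetical linkages: (driver, target, expected sign of the effect).
LINKS = [("MYC", "B", +1), ("MYC", "E", +1)]


def inconsistent_links(log2_fc, links=LINKS, tolerance=0.1):
    """Return links whose measured direction contradicts the model."""
    flagged = []
    for driver, target, sign in links:
        driver_change = log2_fc[driver]
        target_change = log2_fc[target]
        if abs(driver_change) <= tolerance:
            continue  # the driver did not move, so this link is not tested
        expected = sign * driver_change
        if expected * target_change < 0 and abs(target_change) > tolerance:
            flagged.append((driver, target))
    return flagged


# Hypothetical measured log2 fold changes for one experiment.
measured = {"MYC": 1.5, "B": 0.9, "E": -0.6}
print(inconsistent_links(measured))
```

A flagged link does not say which measurement is wrong. It says only that the data and the model disagree, which is the cue to revisit the normalization or the model itself.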


[1] See Marshall, in Science 2 Nov 2012, http://www.sciencemag.org/content/338/6107/593.full

[2] Lovén et al., Revisiting Global Gene Expression Analysis, Cell, 151, October 26, 2012.