Tuesday, December 20, 2011

Microarrays and Too Much Data


In a recent article, Spector at Stanford describes how, for little money, one can develop one's own tests for genetic markers. She states:

So it takes years of hard work and serious cash to create one of these “simple” tests, right? Not anymore. “All you really need is a computer browser and Excel,” says computer scientist Purvesh Khatri, PhD, who, working with Atul Butte, MD, PhD, associate professor of systems medicine in pediatrics, identified telltale chemicals (aka biomarkers) for three types of cancer all in the span of one year. How was this possible? By analyzing some of the vast amount of genetic information from tumor cell samples that has been amassed over the past decade in free, publicly accessible databases, and by outsourcing the lab work. “We say ‘outsource everything except the genius,’” says Butte. “You come up with the question and the target, and let everyone else do the work.”  As Khatri walked me through the discovery process, I learned there’s a little more to it than that. Some work and cash is involved, not to mention high-school level biology. And basic statistics will be a big help. But with those tools, skills and about five days’ work, plus $4,000 to confirm through blood tests, you’re on your way.
 
Yes, for just a few dollars and a few hours of time you too can develop a genetic profile. In contrast, a set of papers by Detours and colleagues raises doubts about this.

The problem is that it is all too easy to find correlations of almost anything with anything. Such correlates are not markers unless we have a system with verifiable causality. This was discussed in the work of Dougherty. What Dougherty has observed is that one must have a system underlying the process, with causality, and that what one then looks for are the coefficients which define that system. From that we can ascertain whether the result is true and consistent.
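To make the point about easy correlations concrete, here is a minimal sketch of my own, not taken from any of the papers discussed. It assumes a hypothetical study of 30 patients and 5000 candidate markers, where both the markers and the outcome are pure random noise. The function names, sample sizes and seed are illustrative assumptions.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_random_markers(n_patients=30, n_markers=5000, seed=0):
    """Correlate pure-noise markers with a pure-noise outcome.

    Returns (number of markers whose |r| exceeds the nominal 5% cutoff,
    largest |r| seen). Nothing here has any causal link to the outcome.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_patients)]
    # Approximate two-sided 5% critical value of r for n = 30 patients;
    # it would need to change if n_patients is changed.
    cutoff = 0.361
    hits = 0
    best = 0.0
    for _ in range(n_markers):
        marker = [rng.gauss(0, 1) for _ in range(n_patients)]
        r = abs(pearson(marker, outcome))
        if r > cutoff:
            hits += 1
        best = max(best, r)
    return hits, best


if __name__ == "__main__":
    hits, best = screen_random_markers()
    print(f"{hits} of 5000 noise markers pass the nominal 5% cutoff")
    print(f"strongest |r| found: {best:.2f}")
```

Roughly one marker in twenty should pass the nominal cutoff by chance alone, and the strongest of them can look quite convincing. Without a model of the underlying system there is nothing to tell such a marker apart from a real one.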

Recently Detours has addressed this issue in PLOS, and he and his co-authors have demonstrated that the plethora of markers for, say, breast cancer can be shown to be little more than almost random choices (my words, not theirs). Namely, one may be able to find correlates almost anywhere.

Detours writes:

Ethic guidelines drastically limit experiments on human subjects. Hence, the fundamental mechanisms of human diseases are mostly studied in vitro or in animal models. These are only substitutes for understanding human physiology and disease. Proving that a mechanism responsible for disease progression in a model system is also relevant to human diseases—not to mention then translating it into a new therapeutic—is a major bottleneck in biomedicine. In the end, only clinical interventions on human will bridge models and human disease.

One approach is to look for correlations. If you can show that patients with tumors expressing, for example, stem cell markers have a much worse prognosis than those without them, that would suggest that stem cells are involved in human disease progression. This line of thinking has long been popular in oncology because you need only access surgical specimens, some mRNA or protein marker, and a follow up of patients. And with the recent advent of efficient microarray screens, this approach has become all the rage, reducing the discovery of signatures, i.e. multi genes markers, to a nearly automatic procedure.

In their PLOS paper, Venet et al. state:

Hundreds of studies in oncology have suggested the biological relevance to human of putative cancer-driving mechanisms with the following three steps: 1) characterize the mechanism in a model system, 2) derive from the model system a marker whose expression changes when the mechanism is altered, and 3) show that marker expression correlates with disease outcome in patients—the last figure of such paper is typically a Kaplan-Meier plot illustrating this correlation.

Detours continues:

The signatures’ prognostic potential can then be tested instantly in genome-wide compendia of expression profiles for hundreds of human tumors, all available for free in the public domain. Besides stem cells markers, signatures linked to all sorts of biological mechanisms or states have been shown to be associated with human cancer outcome. Indeed, several new signatures are published every month in prominent journals.

But such correlations are not all that they seem. The accumulation of signatures with all sorts of biological meaning, but nearly identical prognostic values, already looked suspicious to us and others back in 2007. It seemed that every newly discovered signature was prognostic. We collected from the literature some signatures with as little connection to cancer as possible. We found, for example, a signature of the blood cells of Japanese patients who were told jokes after lunch, and a signature derived from the microarray analysis of the brains from mice that suffered social defeat. Both of these signatures were associated with breast cancer outcome by any statistical standards.

In PLOS they state:

Our study questions the biological interpretation of the prognostic value of published breast cancer signatures, but has no bearing on their usefulness in the clinic: a marker may be accurate without yielding interesting biological insight regarding the mechanism of disease progression. Nevertheless, the prominence of proliferation should be taken into account in future clinical research. Are there transcriptional signals in breast cancer that are prognostic, but independent of proliferation?
 
And they conclude:

In conclusion, we have shown that 1) random single- and multiple-genes expression markers have a high probability to be associated with breast cancer outcome; 2) most published signatures are not significantly more associated with outcome than random predictors; 3) the meta-PCNA metagene integrates most of the outcome-related information contained in the breast cancer transcriptome; 4) this information is present in over 50% of the transcriptome and cannot be removed by purging known cell-cycle genes from a signature.
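A toy simulation can illustrate why random signatures end up prognostic when one signal pervades the transcriptome. This is my own sketch and not the procedure used in the paper. It assumes a single hidden proliferation factor per patient, an outcome driven by that factor plus noise, and a fixed positive loading on about 60% of genes. A signature score is simply the mean expression of 50 randomly chosen genes. All names and parameter values are hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def random_signature_hit_rate(n_patients=100, n_genes=1000, sig_size=50,
                              n_signatures=200, loaded_fraction=0.6,
                              loading=0.5, threshold=0.3, seed=1):
    """Fraction of random gene signatures whose score tracks the outcome.

    Assumed toy model: a hidden proliferation factor drives the outcome
    and leaks into a fraction `loaded_fraction` of all genes.
    """
    rng = random.Random(seed)
    proliferation = [rng.gauss(0, 1) for _ in range(n_patients)]
    outcome = [p + rng.gauss(0, 1) for p in proliferation]

    # Each gene is noise, plus the hidden factor if the gene is "loaded".
    genes = []
    for _ in range(n_genes):
        w = loading if rng.random() < loaded_fraction else 0.0
        genes.append([w * p + rng.gauss(0, 1) for p in proliferation])

    hits = 0
    for _ in range(n_signatures):
        members = rng.sample(range(n_genes), sig_size)
        # Signature score per patient: mean expression of the chosen genes.
        score = [sum(genes[g][i] for g in members) / sig_size
                 for i in range(n_patients)]
        if abs(pearson(score, outcome)) > threshold:
            hits += 1
    return hits / n_signatures


if __name__ == "__main__":
    print("hit rate with pervasive factor:", random_signature_hit_rate())
    print("hit rate without it:",
          random_signature_hit_rate(loaded_fraction=0.0))
```

In this toy model nearly every random signature is associated with the outcome once the hidden factor is widespread, and almost none are when it is absent. The association comes from the shared factor, not from whatever biology a signature is named after.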

As Detours concludes in his short piece in The Scientist:

It took us four years and six rejections to get this work finally published in a computational biology journal (PLoS Comput Biol, 2011)—not the most efficient venue to reach the oncology community. Meanwhile, a steady stream of studies confounded by proliferation rates has appeared. This has to be said, one can no longer stay silent about the rather limited self-correction capability of the top tier publishing system (Cell, Nature Genetics, PNAS, etc.), which promoted these studies in the first place. The oncogenomic-based literature has forgotten the pitfalls of non-specific effects and the value of negative controls. It is not enough to show that a signature is prognostic; biological conclusions may be drawn only if its prognostic value is specifically driven by the mechanism/state under investigation. Importantly, we question prognostic signatures as specific research tools, not as clinical guides: smoke does not drive fire, yet it is powerful indicator of when and where a fire is burning.


His point is well taken. The challenge is to determine the intra-cellular and inter-cellular pathways, defined as dynamic distributed systems, and to do what Dougherty and others suggest, namely understand what is happening and why, and then seek to identify the system. Failure to have a viable and provable model of the system will lead to volumes of data which are far from prognostic. In fact, they may very well be deadly to the patient.