Saturday, June 1, 2013

Genomics: Statistics versus Systems

As NEJM recently noted[1]:

This is the age of massive genome surveys — at least for a little while longer. Sixty years after Watson and Crick's discovery, and a decade after the completion of the Human Genome Project, large-scale sequencing efforts directed at human disease abound, especially for cancer and rare congenital syndromes. International research teams supported by public funding agencies such as the National Institutes of Health and by private foundations such as the Wellcome Trust are rapidly enlarging the catalogue of genetic changes associated with neoplasia and other ailments, using ever faster, ever cheaper sequencing methods and heavy-duty bioinformatics.

Critics of big genomics projects have argued that such work is resource-intensive, is not hypothesis-driven, and amounts to little more than molecular philately. But as discoveries from these projects stack up, and as terabytes of observational data yield new insights into disease biology and prompt the development of pathway-driven targeted therapies, the usefulness of such approaches is becoming undeniable. When the Cancer Genome Atlas (TCGA) wraps up its 8-year effort next year, it will have provided detailed information on 10,000 cancer genomes for less than the cost of a trio of F-22 Raptor stealth fighters.

Let us examine where models do and do not function. To develop models and substantiated inferences, we need a well understood reality of cause and effect. Gravity causes a force between two masses, and that force is a measured effect. Charge causes a force between two particles, and that too is a measured effect. In gene structures PTEN modulates PI3K, BRAF activates MEK, and the pathways are well known. We know ligands, we know receptors, we know gene activators.

We also understand how these function: the forces, charges, conformations. We know how to inhibit and activate. We know these as facts, not as correlations. Thus when we know these facts we have a basis for, indeed a demand for, using these dynamic models as an integral part of our understanding. We cannot and must not resort to random correlations. At times everything can be correlated.

Let me first explain by examining what human intellect has developed using models like this, and then examine those things where our knowledge is severely lacking. The driver for this is that genomics is more akin to physical systems, with which we can do a great deal, and NOT akin to Economics and its correlations, which frankly led us nearly to an economic collapse.

Each of the areas where knowledge of relationships, and not correlations, is integral to the description also has a simple yet high-level descriptive: Schroedinger's Equation, the Navier Stokes Equation, the Fokker Planck Equation, the Kushner Stratonovich Equation. We have argued elsewhere that for genomic systems we are almost already there, just a few more steps away.

Where They Work:


Thermodynamics

Early 19th century thermodynamics did not understand the true nature of heat, namely the movement of molecules and their statistical behavior. From the gross concepts we arrived at such things as internal energy, heat, enthalpy and other gross measures of a system’s thermodynamic properties. With the development of statistical mechanics came the move from gross measurement to understanding the statistical distribution of the molecules. This was presented via the Fokker Planck equation, a means to examine the detailed statistical fine structure of complex mixtures with thermodynamic issues.
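For readers who have not seen it, the Fokker Planck equation can be written in a common textbook form as below. This is a sketch under stated assumptions: the symbols p (probability density of the state x at time t), f (drift) and D (diffusion matrix) are notation introduced here for illustration and do not come from the discussion above.

```latex
% Fokker Planck equation for the density p(x,t) of a diffusion process
% with drift f(x,t) and diffusion matrix D(x,t) (assumed notation).
\frac{\partial p(x,t)}{\partial t}
  = -\sum_{i} \frac{\partial}{\partial x_i}\bigl[ f_i(x,t)\, p(x,t) \bigr]
  + \frac{1}{2} \sum_{i,j} \frac{\partial^2}{\partial x_i\, \partial x_j}\bigl[ D_{ij}(x,t)\, p(x,t) \bigr]
```

The first term moves the density along the drift and the second spreads it out, which is the sense in which the equation describes statistical fine structure rather than gross averages.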

Fluid Flow

Understanding fluid dynamics began as a study with many tables and data taken from past experiments. Slowly, as the theory evolved, the Navier Stokes theory came forth, constructs such as flow fields evolved, and then the treatment of random changes in turbulence theory was developed as well.

Stochastic Dynamic Systems

Complex systems, namely entities which are based on physical realities such as aircraft control systems, chemical process controls, and the like, can be modeled by a complex multidimensional spatio-temporal model. Applying statistical methods developed by Wiener and Kolmogorov, one saw the development of the Kushner Stratonovich equations and then saw them extended to distributed systems as well. This analysis allows one to analyze highly complex multidimensional stochastic systems in time and space.
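As a reminder of what such an equation looks like, here is one standard form of the Kushner Stratonovich equation. It is a sketch under assumptions that are ours, not part of the text above: a signal X with drift f and noise gain sigma, an observation Y with measurement function h and unit-intensity noise, independent Brownian motions W and V, and a test function phi.

```latex
% Assumed signal and observation model
dX_t = f(X_t)\,dt + \sigma(X_t)\,dW_t, \qquad dY_t = h(X_t)\,dt + dV_t

% Conditional expectation given the observations up to time t
\pi_t(\varphi) = E\bigl[\varphi(X_t) \mid Y_s,\ 0 \le s \le t\bigr]

% Kushner Stratonovich equation
d\pi_t(\varphi) = \pi_t(\mathcal{L}\varphi)\,dt
  + \bigl[\pi_t(h\varphi) - \pi_t(h)\,\pi_t(\varphi)\bigr]^{\top}
    \bigl(dY_t - \pi_t(h)\,dt\bigr)

% Generator of the signal process
\mathcal{L}\varphi = f \cdot \nabla\varphi
  + \tfrac{1}{2}\,\mathrm{tr}\bigl(\sigma\sigma^{\top}\nabla^{2}\varphi\bigr)
```

The first term propagates the estimate through the known dynamics; the second corrects it with each new measurement.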

Wiener considered these in his studies of Cybernetics, and furthermore it was Wiener who also started the development of understanding complex organic systems.

The most important elements in using system models are our ability to ascertain Observability and Controllability: the ability to reconstruct the states of the model from measurements and the ability to send the states in the model to a desired end point. We also need the ability to identify the coefficients in the model. We often have good initial guesses, but having measurements means that we can continually refine them.
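For a linear, time-invariant model these two properties reduce to rank tests, which can be sketched as follows. This is a minimal illustration assuming a model of the form dx/dt = A x + B u with measurement y = C x; the matrices A, B, C and the function names are hypothetical and chosen only for the example.

```python
import numpy as np

def controllability_matrix(A, B):
    """Stack [B, AB, ..., A^(n-1) B] side by side."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])
    return np.hstack(blocks)

def observability_matrix(A, C):
    """Stack [C; CA; ...; C A^(n-1)] on top of each other."""
    n = A.shape[0]
    blocks = [C]
    for _ in range(n - 1):
        blocks.append(blocks[-1] @ A)
    return np.vstack(blocks)

def is_controllable(A, B):
    """True when the input can steer every state (full-rank test)."""
    return bool(np.linalg.matrix_rank(controllability_matrix(A, B)) == A.shape[0])

def is_observable(A, C):
    """True when every state can be reconstructed from the measurements."""
    return bool(np.linalg.matrix_rank(observability_matrix(A, C)) == A.shape[0])

# Hypothetical two-state system: state 1 drives state 2,
# the input acts on state 1, and only state 2 is measured.
A = np.array([[-1.0, 0.0],
              [2.0, -0.5]])
B = np.array([[1.0],
              [0.0]])
C = np.array([[0.0, 1.0]])

print("controllable:", is_controllable(A, B))
print("observable:", is_observable(A, C))
```

If only state 1 were measured in this example, state 2 would leave no trace in the measurements and the observability test would fail, which is the kind of question one must ask before trusting any state estimate.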


Electronics

When the transistor was invented, the manager of the people who did it promptly published a book on solid state theory. Very few had a clue as to what Shockley was saying, and frankly, among those who used the device, no one really cared. The electronics designer knew the linkages and how to modify them. A good electronics designer knew that if this voltage went up the other went down, or whatever.

One knew the equations of the voltages and currents; one understood the complexities of the time-varying circuits. There was a substantial amount of well-proved physics behind all of this. However, a good engineer also understood the ebb and flow, just as a good neurologist can, by examining the patient, understand where the lesion is and then find out what the lesion is.

We are starting to get to that point with genomics. We know if we activate a kinase receptor then we activate a certain set of pathways which activate a certain set of transcription factors. A good Genomics Engineer does not need to “know” the protein structures, just what they do, at a very high level, yet detailed enough to catch the unique events which may occur.

Quantum Mechanics and Electrodynamics

Erwin Schroedinger came up with a simple manifestation of electrons in a quantum world. Feynman came up with simple diagrams to show what subatomic particles are doing when they interact. While solving the Schroedinger equation for a complex organic molecule is not readily achieved, it can be attacked using sophisticated computers.

Where The Models do NOT Work:


Economics

Economics pretends to be mathematical. In reality, other than the tautologies of balancing financial sheets, the demand and supply theories are pure abject speculation. Econometrics is merely a fallacious use of correlation theory. There is no fundamental cause and effect, no demonstrable relationship between input and output. This should be a warning for those working in Genomics. Just having a correlation is not causality, and furthermore there is an underlying reality that is being ignored.

Social Sciences

The social sciences have tried to ascertain human responses. Approaches such as those used in election prediction may function in the short term but humans are all too often less than a herd and change opinions all too frequently.


Psychology

Psychology is strewn with the dead bodies of mathematical approaches to understanding human behavior. The problem is that we do not understand the brain, yet, and thus models of thinking, such as those in artificial intelligence, are at best primitive guesses.

Fundamental Requirements

To have a Genomic Model or something useful we must have the following:

1. A Verifiable Realization of How the System Works. This entails the understanding of pathways and their effects on cell proliferation and movement.

2. Some Understanding of What Causes Changes in Pathways. Here we have a difficulty. We not only have somatic changes, but we also have epigenetic changes such as micro RNAs and even methylation and the like. We also must understand germ line predispositions. These are inherited gene predispositions as well as SNP predispositions, which can affect the subsequent translation of proteins.

3. Environmental Understandings: This would include the extracellular matrix issues and its environment as well as the ability of the invading malignant cells to activate surrounding benign cells to assist in proliferation.

4. An Acceptable Measure of the Malignant Cell in Space and Time. As with the Fokker Planck equation or the Kushner Stratonovich model this may mean a measure such as the average number of a specific type of malignant cell per volume at each spatio-temporal location.

5. A Realization for the Progression of Somatic Changes: This means an understanding of the statistical nature of the somatic genetic changes as cells progress in time. For example what happens when we go from MDS to AML. This is not AML de novo. We do not need the details but the transition rates and the possible states. This issue is akin to the electronics world where we need higher level understanding and not the details.

6. We Need the Ability to Integrate Parameter Identification with Stochastic Models: Clearly if we know the models and if we know that certain factors are the drivers of these models, we may use this to identify the parameters at the same time we are estimating the states.

7. We Need a Methodology to Quantify Our Representations and to Validate Them: This is akin to having a Kushner Stratonovich equation. It is what we have developed by using average number of cells by type at specific spatio-temporal sites. I believe we have the techniques; they are built on the many other approaches.
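The fifth requirement, transition rates and possible states rather than molecular details, can be pictured as a small continuous-time Markov chain. The sketch below is illustrative only: the three states, the two rates, and the function names are hypothetical assumptions, not measured values or a model taken from the text.

```python
import numpy as np

# Hypothetical states and yearly transition rates, chosen only for illustration.
STATES = ["normal", "MDS", "AML"]
RATE_NORMAL_TO_MDS = 0.2
RATE_MDS_TO_AML = 0.5

def generator_matrix(rate_normal_to_mds, rate_mds_to_aml):
    """Generator Q of the chain; each row sums to zero and AML is absorbing."""
    return np.array([
        [-rate_normal_to_mds, rate_normal_to_mds, 0.0],
        [0.0, -rate_mds_to_aml, rate_mds_to_aml],
        [0.0, 0.0, 0.0],
    ])

def state_probabilities(p0, Q, t_end, dt=1e-3):
    """Integrate dp/dt = p Q with forward Euler steps from time 0 to t_end."""
    p = np.asarray(p0, dtype=float).copy()
    for _ in range(int(round(t_end / dt))):
        p = p + dt * (p @ Q)
    return p

Q = generator_matrix(RATE_NORMAL_TO_MDS, RATE_MDS_TO_AML)
# Start with all probability in the first state and look five years ahead.
p_five_years = state_probabilities([1.0, 0.0, 0.0], Q, t_end=5.0)
for name, prob in zip(STATES, p_five_years):
    print(f"{name}: {prob:.3f}")
```

Only the states and the rates enter the calculation, which is the electronics-style, higher-level view argued for in the list: if the rates can be identified from data, the progression statistics follow.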

The Genomic Model

The Genomic Model is a systematizable model. It is a system wherein we have well known causes and effects. We know that if a ligand attaches to a receptor then one has an activated pathway that induces a transcription factor, which results in cellular proliferation. We know cause and effect. We know the rates of these factors. We can also develop models wherein we can estimate the average number of cells of a particular genetic structure at a specific time and at a specific place in the body.
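One way to picture such a space and time estimate is a sketch in which the average cell density spreads by diffusion and grows at a fixed rate along a single spatial axis. The form of the equation, the function name simulate_cell_density, and every parameter value below are assumptions made for illustration; they are not the model referred to above.

```python
import numpy as np

def simulate_cell_density(n0, diffusion, growth_rate, dx, dt, steps):
    """Explicit finite-difference update of dn/dt = D d2n/dx2 + r n in 1D.

    The ends of the domain are treated as zero-flux boundaries.
    """
    if diffusion * dt / dx**2 > 0.5:
        raise ValueError("time step too large for a stable explicit scheme")
    n = np.asarray(n0, dtype=float).copy()
    for _ in range(steps):
        # Ghost cells copy the edge values, so nothing flows out of the domain.
        padded = np.pad(n, 1, mode="edge")
        laplacian = (padded[2:] - 2.0 * n + padded[:-2]) / dx**2
        n = n + dt * (diffusion * laplacian + growth_rate * n)
    return n

# Hypothetical setup: 100 cells seeded at the middle of a 51-site line.
initial = np.zeros(51)
initial[25] = 100.0
density = simulate_cell_density(initial, diffusion=0.1, growth_rate=0.05,
                                dx=1.0, dt=0.1, steps=200)
print("total cells:", density.sum())
print("density at the seed site:", density[25])
```

The output is the average number of cells at each site after the simulated interval, which is the kind of measure proposed in the fourth requirement.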

The NEJM article concludes:

In 1803, a few years before the inaugural issue of the Journal, Thomas Jefferson commissioned Meriwether Lewis and William Clark to survey the vast unknown American frontier. Lewis and Clark departed from St. Louis, where Ley et al. initiated the AML genome survey. Less than a century later, the western frontier was declared “closed,” but land surveyors did not disappear; today, they focus on construction projects and property boundaries. Likewise, although the initial epic AML genomic survey that began in St. Louis is now largely complete and surveys of other neoplasms will soon conclude, the use of genomics in quotidian practice is just beginning.

Now if I interpret this correctly, it sends just the wrong message. The developments in Genomics are not Lewis and Clark like; they require Newtonian and Maxwellian insight: fundamental laws and relationships, causality, albeit stochastic, and determining the right measures.

Thus in a sense one could imagine the Genomics Engineer being akin to the electronics engineer, or even to the neurologist. As many a medical student would recall from anatomy, the tracing of the cranial nerves is a critical skill, but one enhanced by seeing how they migrate from back to front. Diagnosis of cranial nerve issues is resolved by understanding the network. In a similar manner we would hope the same is doable with Genomics pathways.