Tuesday, June 11, 2013

Data, Data, Data

This flap about the monitoring of personal data reminds me of the general problem of data. Some decades ago I did some work tracking Soviet nuclear subs, a summer job type thing. My task was to try various schemes on the massive data files to discriminate between whales and subs. Specifically, my job was to determine how sensitive the discrimination was to a single parameter, using a stored database of alleged whales and submarines. It was not clear whether this was ever used or even if it was important.

Now I worked this data to death. There were tons of data. It was the worst job I ever did in my life. Working for the New York Sanitation Department in January 1960 shoveling snow was better. At least there was an achievement: no snow at the crosswalks. I ran every possible variation, with little understanding of what needle in this haystack I was looking for.

But then I stepped back, re-framed the question, and asked whether the process I had in mind could ever achieve what I had set out to do. I did a detailed analysis of the situation, and when it was complete, the data notwithstanding, it was clear you could not do what I had been set to do, at least not the way I had been set to do it. Then again this may have been a real challenge or it may have just been one of those management games that large companies played, and after all it was just for the summer. What it did teach me was that one just does not wander aimlessly around data; one needs a theory, a model, a physical realization and embodiment. Only once I had created the model and done the analysis could I tell whether the data was meaningful or meaningless. Sometimes data helps, often it can confuse. In fact one may have captured the wrong data.

Now what does this have to do with this current issue? Before answering that let me give two other examples. First is Excel. I would argue that the market collapse of the dot-com bubble was as much the fault of Excel as it was of the hype; indeed the hype was Excel. For by then any moron could gin up data by the truckload, put it in an Excel spreadsheet and come up with a trillion dollar business worth billions. And since it was based upon data and done with Excel it must be true. Nonsense!

The next example is the microarray and cancer genes. We have enabled folks to run arrays on hundreds of genes, and from that, again using their Excel spreadsheets, we have almost daily announcements of new genes causing some form of cancer. In other words, some loose correlation is taken to be causation.
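To see how easily this happens, here is a minimal sketch in Python, not anything taken from a real microarray study. The numbers (500 genes, 20 samples), the seed, and the function names are all my assumptions for illustration. Every "gene" is pure random noise, and so is the "outcome", yet the best of the lot still looks like a finding.

```python
import random
import statistics


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def strongest_spurious_correlation(n_genes=500, n_samples=20, seed=0):
    """Largest |correlation| between a noise outcome and n_genes noise 'genes'.

    Nothing here is related to anything else by construction, so any
    correlation that turns up is an artifact of looking at many candidates.
    All parameter values are hypothetical.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_genes):
        gene = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(gene, outcome)))
    return best


if __name__ == "__main__":
    for n in (10, 100, 500):
        print(n, "genes -> strongest |r| =",
              round(strongest_spurious_correlation(n_genes=n), 2))
```

The more genes one screens, the stronger the best accidental correlation gets, which is the point: without a model saying which gene to look at and why, the winner of the search means very little.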

Now to massive data. One needs discriminant functions; namely, one must have some idea as to what to look for. Frankly, given no initial idea of what to look for, one can find anything, and anything can be big, real big. Data supports theories; it is not the theory. Data can often be wrong, wrong by interpretation or by collection.
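A toy version of what a model buys you, sketched in Python. This is an illustration under assumptions of my own, not the analysis from the summer job: suppose the two classes differ in a single measured parameter, each class is Gaussian with the same spread, both are equally likely, and the discriminant is simply a threshold halfway between the two means. Then the error rate follows from the model alone, before touching a single data file.

```python
import math


def midpoint_threshold_error(mean_a, mean_b, sigma):
    """Error rate of a midpoint threshold between two equal-variance Gaussians.

    Assumes (hypothetically) equal class priors and a common standard
    deviation sigma. The error is Phi(-d/2), where d is the separation of
    the means in units of sigma and Phi is the standard normal CDF.
    """
    d = abs(mean_a - mean_b) / sigma
    return 0.5 * math.erfc(d / (2.0 * math.sqrt(2.0)))


if __name__ == "__main__":
    for separation in (0.0, 0.5, 1.0, 2.0, 4.0):
        err = midpoint_threshold_error(0.0, separation, 1.0)
        print(f"separation {separation:.1f} sigma -> error rate {err:.3f}")
```

If the parameter barely separates the two classes, the error rate sits near a coin flip, and no amount of extra data run through the same threshold will change that. That is the kind of answer one wants before running every possible variation.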

Now how does good Intel really work? The same old way it always has: feet on the ground, snippets at a bar in Athens, a coffee shop in Tangiers, a small bistro in Marseille. It is listening and gathering and having a team of dedicated, smart, loyal people, not Government employees. It used to work that way for a while. Today it has become all politics.

Besides, the current problem is what has been going on for a long time now: sloppy control of data. Solve that problem and you solve everything. We did that once, and it worked somewhat better than this mess. Perhaps we just kill the computers and reintroduce the typewriter.