In 1987 or so, I was working for the Welch Medical Library in the early stages of the human gene-mapping project. We had funding from the Howard Hughes Medical Institute and the National Library of Medicine, and a small lab of programmers and researchers thinking hard about medical informatics and retrieval systems.
When the gene-mapping project kicked off, and there was funding to be had, a lot of people came out of the woodwork with various ideas. One of them was particularly memorable – a fellow by the name of Hunkapillar, who had a brother who worked for TRW in Redondo Beach. I hadn’t read “The Falcon and the Snowman” yet, or I would have recognized the name of the facility where Chris Boyce worked. One of the problems we were expecting to have was searching for gene patterns in the genome databases, so we knew we were going to have to find arbitrary patterns in huge amounts of data. In 1987, a 400MB hard drive cost $25,000 and weighed 50-75lbs, and the estimate for the genome’s storage requirement was about 1,000,000MB – it was going to be an interesting technical problem!
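A quick back-of-the-envelope using the figures above shows just how interesting:

```python
# Back-of-the-envelope: holding a ~1,000,000MB genome database
# on 1987-era 400MB drives at $25,000 and up to 75lbs apiece.
genome_mb = 1_000_000
drive_mb, drive_cost, drive_lbs = 400, 25_000, 75  # worst-case weight

drives = genome_mb // drive_mb           # how many drives
cost = drives * drive_cost               # total dollars
weight_tons = drives * drive_lbs / 2000  # total weight, US tons

print(drives, cost, weight_tons)  # 2500 62500000 93.75
```

Twenty-five hundred drives, sixty-odd million dollars, and the better part of a hundred tons of spinning iron.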
According to Hunkapillar, TRW had a device that did searching very fast, and he wanted us to see it. So my boss and I flew out to LA and spent a day on the freeway to get to TRW for a demo of the technology. It was called the “FDF 2000” for “Fast Data Finder” and – at the time – it was pretty darned impressive. What they showed us was a Sun workstation with an SMD disk controller and a couple of 300MB drives, and the FDF array board, which could search the drives at the speed of the drive – in other words, it just gobbled the bits as fast as the drive could toss them to the FDF, and anything that matched the patterns the FDF was looking for would come through the other side. We’re talking a whopping 200MB/second search rate; it was very exciting. It was also very expensive.
During the demo, the guy from TRW explained the architecture of the FDF, as a series of parallelized processors that partitioned a search-space so that each one generated partial matches, which fed into a second tier of processors, or a third and a fourth, so a large number of search items could be looked for in parallel across a large amount of data. Nowadays I’d understand that it was a hardware implementation of an NFA (Nondeterministic Finite Automaton) matching engine similar to what you can get on a modern network processor.*
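A software analogue of that idea is keeping every partial match alive as a set of active automaton states and advancing all of them in lockstep, one pass over the stream, never rewinding. This is a minimal sketch of that technique, not the FDF’s actual design (the patterns and stream are invented; a production engine like Aho-Corasick adds failure links to do this far more efficiently):

```python
def stream_match(patterns, stream):
    """Multi-pattern streaming matcher: each active state is a
    (pattern_index, chars_matched) pair; every incoming character
    advances all states at once -- one pass, no rewinding."""
    active = set()
    hits = []
    for offset, ch in enumerate(stream):
        # any character may begin a new match of any pattern
        for i, _ in enumerate(patterns):
            active.add((i, 0))
        nxt = set()
        for i, pos in active:
            if patterns[i][pos] == ch:
                if pos + 1 == len(patterns[i]):
                    hits.append((patterns[i], offset - pos))  # full match
                else:
                    nxt.add((i, pos + 1))                     # partial match
        active = nxt
    return hits

# e.g. watching a DNA stream for two motifs at once:
print(stream_match(["GATTACA", "TATA"], "CCGATTACATATACC"))
# → [('GATTACA', 2), ('TATA', 9)]
```

The point is the shape of the work: the data flows through once, and the cost of looking for more patterns is more parallel state, not more passes – which is exactly what tiers of matching hardware buy you.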
The TRW demo and documents emphasized the ability of the chip to apply some basic language structures to its default expression-set, i.e., you could look for “‘bomb’ within a paragraph of ‘president’.” I remember thinking, at the time, “what the heck is that for?” We went back to Baltimore, talked about the TRW system, and decided it was too expensive and too specialized, and that (my argument at the time) we could get most of the work done by partitioning the genome into searchable zones, parallel-searching across zones, and indexing within zones.
I didn’t hear about ECHELON until the first shots were fired during the crypto-wars in 1992 or so. That was when the NSA tried to promote “Clipper” (a Mykotronx chip, sibling to the Fortezza card) – an encryption system with a government-mandated backdoor for “law enforcement access.” By that time I was working at Trusted Information Systems and was starting to get a better understanding of how the spy agencies (and the FBI) were concerned with being able to continue to monitor the people’s communications.
Wait: “continue?” In 1992 it looked like the government was monitoring a lot. A great deal. That was when I learned about the “5 eyes” treaty group, and ECHELON. That was when I realized that the government was already cheerfully ignoring all of its own platitudes about privacy, the rule of law, and the 4th Amendment. I read Yardley’s “American Black Chamber” and discovered that the US Government had been reading its citizens’ mail and telegram traffic since around World War I. It was a bit of a rude awakening for me, and, as I got sucked deeper and deeper into computer security, I kept meeting people like Duncan Campbell and others who had either helped build the intelligence apparatus of the police state, or had tried to expose it.
I’ve met a lot of the spies who watch us all, and they’re generally decent, well-meaning, deeply deluded people who believe in the nationalist agenda. A few years ago I started asking some of them “why do you believe that the US is ‘right’ and why do you believe that it must ‘defend’ itself?” And that was when a lot of my contacts in the intelligence community started to dry up. I realized that these were people who had whole-heartedly adopted the nationalist agenda: the idea that these lines on the map mean something so important that they’re willing to “other” someone who was born outside of those lines, simply because of those lines.
I was fortunate to grow up without religion; I never absorbed that idea. But letting go of nationalism was hard. When I talk to someone who’s letting go of religion and they say, “I know it’s B.S. but I am still afraid of hell,” I think, “I used to believe that there was such a thing as ‘my country’” – and I understand and sympathize. There are totalitarian liars who tell themselves that the authority they want to enforce is for their victims’ own good. Religious or nationalist, it’s the same lies, the same agenda: social control and enforcement of the status quo.
The TRW FDF chip probably sold like hotcakes in Ft Meade, Maryland, and otherwise was not even a blip in the history of computing. Eventually, TRW spun the search-board technology off to a company called Paracel, which offered it in a few form-factors (suitable for monitoring internet traffic). Now it’s not even a footnote on Google. But when I encounter people who are skeptical of Edward Snowden’s disclosures – “Oh, surely that would be hard. They wouldn’t go to that kind of trouble!” – I remember my weird trip to TRW in 1987. I probably would have felt the same way, before I saw the elephant.
One of my projects at Welch was recoding “Principles of Ambulatory Medicine” into a retrieval/full-text search system called IRx (Information Retrieval eXperiment). My boss at the time was interested in mark-up languages and suggested SGML (Standard Generalized Markup Language), which later became the basis for HTML. At the time there was a research group at Brown University working on using SGML tags to build a retrieval hierarchy, and I proposed a system I called “Tocs and Docs” (“Toc” being a Table of Contents) that I implemented as a prototype using a bunch of shell scripts – something much like what we’d now call a “browser.” Unfortunately, when I went to explain my system to our grant officer at NLM, I hadn’t yet learned how to pitch technology, so he told me to sit down, shut up, and go back to coding on IRx. (sigh) I was only 5 years ahead of CERN.
I realized years later that the TRW FDF chip only made sense if you were searching through streams of data – you know, like you might collect off satellite downlinks or network connections – because your problem at that point is match->classify->collect. Searching FDF-style on a hard drive is stupid and nobody does it: what you do is ingest the data, classify and index it, then search your index. At no time would anyone be so goofy as to do a brute-force end-to-end search of their entire data-set over and over. While I was still working at Welch, I studied Volume 3 of Knuth, discovered B-trees and hash tables, then implemented my first variable-key B+tree library and started thinking about how to index a genome. Hunkapillar’s heart was maybe in the right place, but he was trying to sell us a hammer when we needed a screwdriver.
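The ingest-then-index approach can be sketched with a k-mer index – the dictionary-based cousin of what I was doing with B+trees; the sequence and the choice of k=4 here are invented for illustration. You chop the data into fixed-length substrings once, record where each occurs, and every later query becomes a lookup instead of a rescan:

```python
from collections import defaultdict

def build_kmer_index(sequence, k=4):
    """Ingest/classify pass: map every k-length substring to the
    list of positions where it occurs. Done once, up front."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index

def lookup(index, sequence, pattern, k=4):
    """Search the index, not the data: seed from the pattern's
    first k-mer, then verify the full match at each candidate."""
    return [p for p in index.get(pattern[:k], ())
            if sequence[p:p + len(pattern)] == pattern]

seq = "ACGTGATTACAGGATTACA"
idx = build_kmer_index(seq)
print(lookup(idx, seq, "GATTACA"))  # → [4, 12]
```

The brute-force cost is paid exactly once, at ingest; after that, every query touches only the handful of positions the index hands back – which is why nobody FDF-scans a disk when they could do this instead.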
(* Take a look at the Cavium Octeon: it’s got a regular-expression NFA processor on the same die as a bunch of MIPS cores, packet-shuffling hardware, and content-addressable memory. An Octeon can do pattern-matching at 200Gb/s.)