La Chose Du Jour


Each RSA conference seems to have its hyped ${thing} – the buzzword that everyone is selling (usually in the form of “next-generation ${buzzword}”). A few years ago it was Big Data. Then it was Machine Intelligence. Now it’s…

… Data Lake.

There’s a relationship between all three of those things, it’s just not obvious. Big Data is: when you collect lots and lots of stuff about things and then use clustering algorithms and business intelligence to search around in it and uncover unexpected relationships between things that suddenly make you realize that you could make millions of dollars if you did something just a little bit different. Like, one of the things you could do differently is not buy a gigantic Big Data solution, and save a couple of million dollars; a penny saved is a penny earned and all that.

This is a problem I’ve worked on and with for almost 30 years, in the form of system/event log analysis. For decades, system log analysts assumed that their task was to get all the logs to one place (“aggregation”) and then store them. Ta da! Of course it turned out that it was also important to look through the logs and try to find indications of things going wrong. So, what do we do? Now that we have aggregated all the logs together, we parse them apart again – searching and separating that which was already separate to begin with. It turns out that what we wanted to do was aggregate some logs and leave others at the endpoints, we just didn’t know which ones belonged where and now that we’d spent the time aggregating them it was easier to just keep on doing it that way. It’s actually even funnier than this because the system logs already aggregate all the event logs in the system – so you’ve jammed the web software logs in with the system boot reports and if you want to look at one without the other, you have to process the logs to re-separate them. Sometimes that processing is time-consuming or expensive.

[mjr, 1997, from a class on system log analysis I used to teach at USENIX]
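The re-separation step described above can be sketched in a few lines. This is only an illustration, assuming BSD-syslog-style lines of the form "timestamp host program[pid]: message"; the regular expression, the route function, and the sample lines are all hypothetical and not taken from any particular log tool.

```python
import re
from collections import defaultdict

# Assumed line layout: "<Mon> <day> <HH:MM:SS> <host> <program>[pid]: <message>"
LINE_RE = re.compile(
    r"^(\w{3}\s+\d+\s[\d:]{8})\s(\S+)\s([\w\-/.]+)(?:\[\d+\])?:\s(.*)$"
)

def route(lines):
    """Group log lines by the program that emitted them."""
    streams = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line)
        # Lines that do not fit the assumed layout are kept, not dropped.
        key = m.group(3) if m else "unparsed"
        streams[key].append(line)
    return dict(streams)

# Hypothetical sample of an aggregated system log.
sample = [
    "Mar  3 10:15:01 web1 httpd[812]: GET /index.html 200",
    "Mar  3 10:15:02 web1 kernel: eth0: link up",
    "Mar  3 10:15:03 web1 httpd[812]: GET /missing 404",
    "garbage that matches nothing",
]
streams = route(sample)
for program, entries in sorted(streams.items()):
    print(program, len(entries))
```

Run at the endpoint, before anything is shipped, the same routing decides which streams are worth aggregating at all. Run after aggregation, it is the extra processing pass the paragraph above complains about.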

Big Data is basically the same idea, only more so. One of the things that makes system log data difficult is that it’s free-form text; each message has its own structure, but someone had the brilliant idea of letting software engineers define what system diagnostics should look like – so now they’re garbage. With Big Data, the idea is to push everything into some kind of server cluster, running business analytics and searching software and then you can use much better and fancier tools than the system log guys have, to re-extract data and turn it into business charts. In the mind of a CEO who buys a Big Data “solution” (we don’t know what the problem is, yet, so it cannot be a solution) there is this pile of data and someone can ask a smart question and get a fascinating answer. For example: “What is the most popular, up-trending brand of gym socks in our inventory?” That’s a fair example, except that anyone who was working for a retail company and had access to that kind of information would expect that question and pre-compute the answer (by categories) as the data was ingested in the first place. The reason WALMART is WALMART is because they figured that out decades ago and built automated supply-chain analysis routines that monitored their sales pipeline and did just-in-time order placement and shipping. When you talk to the Big Data salespeople they’ll cite WALMART but they don’t understand that, even in a best case where the potential customer is also a massive retailer, this is the kind of business analytics that are much easier to compute on ingestion, rather than in a query loop. Unless you sell storage and Big Data solutions, that is.
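To make "compute on ingestion" concrete, here is a minimal sketch of the gym-socks question answered from running totals instead of a query over stored history. Everything in it is an assumption made for illustration: the IngestStats class, the record fields (category, brand, week, qty), and the brand names are hypothetical.

```python
from collections import Counter, defaultdict

class IngestStats:
    """Per-category, per-week unit counts by brand, updated as sales arrive."""

    def __init__(self):
        # units[category][week][brand] -> units sold
        self.units = defaultdict(lambda: defaultdict(Counter))

    def ingest(self, sale):
        """Fold one sale record into the running totals."""
        self.units[sale["category"]][sale["week"]][sale["brand"]] += sale["qty"]

    def up_trending(self, category, week):
        """Brand with the largest week-over-week gain in units, or None."""
        now = self.units[category][week]
        prev = self.units[category][week - 1]
        gains = {brand: now[brand] - prev[brand] for brand in now}
        return max(gains, key=gains.get) if gains else None

# Hypothetical sales feed.
sales = [
    {"category": "gym socks", "brand": "Acme", "week": 1, "qty": 10},
    {"category": "gym socks", "brand": "Zoom", "week": 1, "qty": 4},
    {"category": "gym socks", "brand": "Acme", "week": 2, "qty": 11},
    {"category": "gym socks", "brand": "Zoom", "week": 2, "qty": 9},
]
stats = IngestStats()
for sale in sales:
    stats.ingest(sale)

print(stats.up_trending("gym socks", 2))  # Zoom: +5 units versus +1 for Acme
```

The answer is a dictionary lookup and a subtraction per brand, and the raw sales records never have to be kept around just to answer it.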

A few years after Big Data it was Machine Learning and AI. What was the Machine Learning going to help us do? It was going to help us sort out our Big Data, naturally!

I’m not kidding. The Machine Learning and AI are going to sort through the Big Data and tell you what it thinks is important. AIs are good at clustering data to find similarities, and they are good at classifying things based on prior probability. I’m being a bit facetious here, but there’s a high probability that a classifier will conclude that “that stuff you didn’t look at before is not something you’ll want to look at again later.” Sure, if you update your expert systems’ knowledge-base and you want it to scrub through the entire Big Data all over again, based on the new knowledge, you’ve got all the data in one place. The Big Data proponent will say that there might be interesting previously undetected relationships in the data, which will be revealed, which is true, but it will be a human discovering that relationship and figuring out what it means – and if it’s interesting it’ll be pre-computed on ingestion.

Conversations I’ve had with Big Data proponents get surreal – these subtle interrelationships in the data only appear when all the data are in one place. “But if that’s true, you ought to be able to take a random sub-sample of your data and search for relationships there, where you can do it fast and efficiently, then you’ll know what to pre-compute on ingestion.” Let the AIs search an unbiased subset of the data and if the interesting relationships are there, the AI will find them. There are Big Data success stories, of course, but they’re mostly organizations that didn’t have any useful business data collection at all until they put the Big Data in place and discovered a bunch of stuff that they would have known already if they had only gone to the trouble to look for it. “See? Big Data is a huge win!” say the Big Data salesmen.
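One standard way to take an unbiased sub-sample of a stream without first storing all of it is reservoir sampling. The sketch below is offered as an illustration of the idea, not as what any Big Data product does; the function name and the stand-in stream are hypothetical.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = item
    return sample

# Stand-in for a large event stream; a seeded generator keeps the run repeatable.
events = range(1_000_000)
subset = reservoir_sample(events, 1000, random.Random(42))
print(len(subset))
```

Memory use is fixed at k items no matter how long the stream runs, so the exploratory clustering can be done on a laptop-sized subset before anyone decides what to pre-compute on ingestion.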

The AIs are trawling through the data using prior probabilities to bring things to people’s attention because of textual/keyword resemblances to other things that were interesting. It’s a step forward but it’s hardly ground-breaking.

Data Lakes are this year’s thing. A Data Lake appears to be: stick all your data at ${cloud service provider} and there will be indexers and classifiers and search engines and they will allow you to find your stuff again! Note that: you already had the stuff. Then you threw it in the Data Lake and it sank without a trace into everyone else’s stuff. But fear not: you can ask the Data Lake “where is my stuff?” and the Lady Of The Lake, her hand clad in the purest shimmering samite, will hold aloft your stuff from the bosom of the Data Lake.

Joking aside, it appears that the premise of the Data Lake is that it will use all the classifiers and rules and keywords in context and figure out what your stuff is. “Oho!” mutters your AI, “that file name ends in .PDF! I’ll mark that as being a PDF file!” So you can ask for “All the PDFs that have been created by marketing” and it will return a list of files! Note that, if your data had been organized to begin with, you could have just looked for PDF files in the marketing area on your storage area network.
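For comparison, this is roughly what "just look in the marketing area" amounts to when the storage is organized by department. It is a sketch under assumptions: the directory layout and file names are made up, and a temporary directory stands in for the storage area network.

```python
import tempfile
from pathlib import Path

def pdfs_under(area):
    """List every PDF below a directory, matching the extension case-insensitively."""
    return sorted(
        p for p in Path(area).rglob("*")
        if p.is_file() and p.suffix.lower() == ".pdf"
    )

# Build a tiny hypothetical "marketing" area to search.
with tempfile.TemporaryDirectory() as root:
    marketing = Path(root) / "marketing"
    (marketing / "2019").mkdir(parents=True)
    (marketing / "2019" / "brochure.PDF").write_bytes(b"")
    (marketing / "notes.txt").write_bytes(b"")
    found = [p.name for p in pdfs_under(marketing)]

print(found)  # ['brochure.PDF']
```

No classifier is involved: the directory a file lives in already says who made it, and the extension says what it is.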

Not an actual human’s sock drawer

I have an idea for a new computing product. It’s called a “Sock Drawer.” It’s where you put all the files that you have and it will organize them for you. Actually, that feature doesn’t work yet; we’ve only got the first part done. For now, you just open the sock drawer, hammer new things in (warning: if you put underwear in, it will become permanently un-findable) and close the drawer. From the outside, it looks very neat. Now, you need more socks.

Joking aside: there is a knowledge problem in computing today. Tremendous amounts of data are collected, but they are not understood. Systems are lashed together in increasingly complex meta-systems, yet the sub-components they depend on are not understood. I have talked to CTOs from FORTUNE500 companies and asked them “what’s on your network” and they unhesitatingly say “I don’t know.” Everything in computing is expanding vastly and rapidly and getting more complex, as well; the point where anyone understood what they are doing was passed back in the mid 1990s. Unfettered growth has been the mandate since then – growth without plan (because you cannot plan if you do not comprehend) – and hackers and governments have been embedding themselves deeply in that reality.

------ divider ------

Over the years I have come to see a lot of computer security problems as simply knowledge-management problems. People do not want to go to the extreme length of actually understanding what they have – whether it’s security alerts, firewall events, system errors, software errors, assets under management, data in databases; we have built this massive amount of stuff and nobody knows what any of it is, so they keep trying to fail it upward into something bigger. Throw your Big Data into your Data Lake, let it sink without a trace, and you’ve solved your Big Data deployment failure.

Sockdreams has great socks. [sock]

“Unfettered growth has been the mandate” – you know, like a carcinoma.

“A Data Lake appears to be: stick all your data at ${cloud service provider} ” – apparently today’s CTOs have never heard of “lock in”… Oh, and by the way if you’re thinking “this sounds like an idea that was cooked up by cloud storage salespeople” you win a valuable pat on the back from me.

Comments

  1. says

    How bad was the RSA conference? I assume there’s no point in me asking whether you enjoyed it, right? So, instead, I’ll just ask about how bad it was. I can only assume that you didn’t enjoy listening to everybody else talking about Data Lake as a great thing.

    “Unfettered growth has been the mandate” – you know, like a carcinoma.

    There’s a German band Saltatio Mortis who have made a song “Wachstum über alles.” The German title translates as “Growth Over Everything Else.” In this song there’s a line that means in English “wild growth is also called ‘cancer.’”
    Here is the song: https://www.youtube.com/watch?v=MTSitlFXEX8
    And here is an English translation of the lyrics: https://lyricstranslate.com/en/wachstum-%C3%BCber-alles-growth-over-all.html

    I really like it when artists find cool ways how to incorporate political arguments in their works. In this case, I can agree with the lyrics of this song about how our modern desire for more and more growth is harmful.

  2. Jazzlet says

    My sock drawer does look rather like that if you imagine that as a rainbow, with wool at the front and cotton at the back in winter, vice versa in summer, plus the long socks and short socks separated too. Yes I am particular about my socks and their arrangement so I can find the colour, fabric and length I desire easily. But I lose track of things easily if I don’t have them organised to start with, which is I guess also what you are saying about data.

  3. says

    Jazzlet@#2:
    But I lose track of things easily if I don’t have them organised to start with, which is I guess also what you are saying about data.

    Exactly!
    Companies have been piling on the technology and many have made ineffective efforts to organize it. Things like Big Data and Cloud and Data Lake and AI are organizational principles they are trying to add onto something that they no longer understand. The AI will understand it for them. How’s that going to work? Invest in some AI and find out!

  4. says

    Ieva Skrebele@#1:
    How bad was the RSA conference? I assume there’s no point in me asking whether you enjoyed it, right? So, instead, I’ll just ask about how bad it was. I can only assume that you didn’t enjoy listening to everybody else talking about Data Lake as a great thing.

    I wandered through the show floor and saw nothing interesting. At least there were no “booth babes” and only one vendor that had a sports car (“look! we can buy a car with the money we got from you!”) on display. The topics were predictable.

    My panel went OK but I sort of went dark and cynical and told a few unpleasant truths. Not surprisingly, that went over fairly well. I think that a little truth offsetting all the false optimism becomes a glaring beacon of reality.

  5. says

    Marcus @#4

    At least there were no “booth babes”

    As I was reading this line, my first thought was, “Great, that’s awesome.” Then, a second later, my mind shifted to, “Wait a minute, what am I thinking. How could my standards for ‘great’ drop so low? ‘No booth babes’ should be the norm and not an accomplishment. Having low expectations is one thing, but I shouldn’t allow my low expectations and cynicism to translate into low standards.”

    Jazzlet @#2

    Yes I am particular about my socks and their arrangement so I can find the colour, fabric and length I desire easily.

    I also dislike not being able to immediately find whatever I’m looking for. Yet my approach to this problem is the exact opposite of yours. I don’t even have a sock drawer; instead I just throw all my socks in a cardboard box. Instead of spending time organizing mountains of stuff, I choose to reduce the amount of things I own and simplify their variety. If you are a guy, you can comfortably live with a minimal wardrobe (this one’s much harder for women). Whenever I buy new socks, I buy at least 5 identical pairs. All my socks are the same length. My socks only differ in color—I have beige, light grey, and dark grey socks (I always try to wear a pair of socks that are the same color as my pants). At any given time, in my sock box there are several pairs of identical socks. Thus, finding two socks that are the same requires no time at all even though I never bother to fold or arrange my socks.

    Of course, my attitude remains the same for most of the stuff I own—the fewer possessions I have, the simpler it will be to organize them. Thus I try not to buy stuff that I can survive without. A large variety of different socks is an example of something a guy can live without.

  6. cvoinescu says

    Marcus, your cynicism lifts the spirit. (I’m saying this mostly without sarcasm. It feels good to hear someone else call the same bullshit.)

    I have mixed feelings about the Data Lake. Not that it’s both good and bad, no. Merely that it’s bad in more than one way.

    1. As you point out, it’s a trick to sell more “cloud” storage. How can you sell storage to people who don’t want to invest in software that would do something with their data, nor have the resources to at least keep their data organized? Tell them the benefit comes from storing it. Which happens to be what they pay you for.

    2. It’s a bit like paying a company to “preserve” your brain after you die, so that you can be revived when the technology gets there. They basically dunk your brain in antifreeze and dump it in a dewar of liquid nitrogen. We have no idea how to retrieve that data. While at least we know how to get stuff out of a Data Lake if we know it’s there, it’s the same principle: they don’t sell storage, or a solution. They sell hope.

    3. There has been a long trend of adding layers of indirection and abstraction. Often this is good for productivity (C is easier than assembly is easier than typing octal is easier than flipping toggle switches); it can do wonders for maintenance and portability; sometimes it’s even better for performance (OpenGL is better than poking pixels in the frame buffer). It can also be better for safety. But sometimes this trend veers deep into the weeds. “If only the programming language was more like English, the accountants could write the programs.” Services are popular today — just buy the right number of subscriptions, get the pieces to talk to each other, and your problem is solved. We’re constantly sweeping complexity under the rug. Sometimes this is fine: as long as there’s still a handful of people in the world who understand what goes on under the hood of my compiler, I’m fine not knowing the details. But often this is not fine, and the complexity bites you in the ass (or slowly drains your will to live, or both). We’ve swept whole rugs under the rug, and our teetering piles of rugs are now at the point where they collapse more often than not.

    4. Data Lake is to data processing as Buy n Large is to environmental cleanup. Make it into unidentifiable commingled cubes and stack it somewhere. We’ll lease a large flat area and a fleet of Wall-Es to you. Fuck up your virtual environment too.

  7. brucegee1962 says

    A few years ago it was Big Data. Then it was Machine Intelligence. Now it’s…

    … Data Lake.

    Wait — I thought that it was blockchain that was the big thing nowadays.

    Are you telling me blockchain isn’t a thing anymore, before I even figured out what it meant???

  8. jrkrideau says

    Just where are these clouds?

    Unless we have a totally new data storage mechanism that will read and write to literal cumulus or cirrus clouds, the data must be physically somewhere or somewheres.

    Somewhere subject to political upheaval or political whim? Physical disasters?

  9. Ketil Tveiten says

    Data Lakes are essentially the enterprise-scale version of putting all your files, regardless of type or content, in the same folder. Or maybe, putting all the stuff in your house in that one drawer full of random junk where you put all the stuff you don’t know where else to put.

  10. says

    Ieva Skrebele @ 5:

    ‘No booth babes’ should be the norm and not an accomplishment.

    Damn right. Especially for a conference like RSA. I would have expected better of that. Something like E3, you would expect booth babes and other pandering to the lowest common denominator, but RSA has at least pretensions to professionalism. Shows how ingrained these attitudes were.

  11. says

    jrkrideau@#8:
    Unless we have a totally new data storage mechanism that will read and write to literal cumulus or cirrus clouds, the data must be physically somewhere or somewheres.

    “Cloud computing” is loosely defined as “computing done on someone else’s machine.” It’s out there, somewhere but it’s not the user’s problem where it is. Unless it becomes their problem, that is.

    I am still old school and crusty enough to believe that if it’s not your machine, it’s not your data. It’s a copy of your data that someone else has, that they are letting you access. If you throw away your original copy, then it’s their data they’re letting you access.

    The whole idea of “Cloud computing” depends on people not understanding that, and believing strongly that the service provider will never conspire with a government to grant them access to the data, or would never hold the data hostage over an unpaid invoice, or would never raise their prices once they have control of the data.

  12. says

    Marcus @#12

    The whole idea of “Cloud computing” depends on people not understanding that, and believing strongly that the service provider will never conspire with a government to grant them access to the data, or would never hold the data hostage over an unpaid invoice, or would never raise their prices once they have control of the data.

    I find it frustrating how I have to routinely defend my refusal to use cloud services. I don’t want to upload any of my data to a cloud, and I feel like everybody else is attempting to push their damn clouds on me: “You want a microSD card slot on your e-reader or your mobile phone? Why would you even need one? After all, you can use cloud storage instead.” It’s frustrating how often I have had this argument in online discussions. Maybe I just don’t know how to best argue for this particular position, because my argument just boils down to “I don’t trust cloud service providers.” And at that point my opponents put on me the burden of proof to show that some business that’s offering (often free) cloud storage is untrustworthy. At this point I usually don’t know how to proceed with the discussion. How do I prove that some company that’s offering me free cloud storage for my e-books is highly likely to abuse my data? When everybody seems to be using cloud storage, arguing against it makes me sound like a conspiracy theorist.

    It’s actually simpler to argue that I cannot use cloud storage, because I don’t have internet access at all times. This argument also is sufficient to justify my demand to have microSD slots on my electronics.

  13. says

    Ieva Skrebele@#16:
    “I don’t trust cloud service providers.”

    I usually say “I don’t like to be locked in to any one company’s system because then I lose control of pricing, accessibility, and recovery.”
    If that doesn’t convince them, then I realize they don’t understand IT at all, and stop wasting my time.

  14. Sunday Afternoon says

    Our IT seems to be doing the cloud thing while thinking about some of Marcus’ concerns. IT has moved (forced) almost all of our high performance computing simulation work from on-site computing clusters that we own and maintain to AWS.

    The change is/was disruptive. We get the benefits of AWS’s scale as we can run clusters that are far larger than anything we could hope to have on-site. Dealing with the data becomes an exercise in logistics: eg: what’s the best way to process 22 million files that are sitting in what is delightfully called a ‘bucket’ on AWS S3 (Amazon’s data lake)?

    IT does provide our own data archive – when we are finished with tasks on AWS, we move stuff to our own cheaper long-term storage in our own datacenter. Given the work we had to do to get our computations into AWS, we’re now in a position that our work is relatively portable. With some fairly straightforward changes, we could move to another cloud computation platform if required (but don’t give IT any ideas).

  15. Dunc says

    OK, devil’s advocate here… I’d argue that for a lot of people, relying on a third-party cloud storage provider is every bit as sensible as relying on third-parties for web hosting or email. It’s a generic, commodity service, where a specialist can use efficiencies of scale to provide a far better service than the average punter or SME, at a much lower cost. It should be treated as a utility, like electricity or your internet connection.

    As a punter, cloud storage gives me a zero-effort, completely transparent, multi-site backup solution, with full versioning, that also gives me seamless access to my data across multiple devices and operating systems. If I should decide that I don’t like the terms or the service I’m getting from cloud storage provider A, then moving to cloud storage providers B through Z is simply a matter of setting up a new account, installing a different sync client, and copying some files – and all of them offer a better, more reliable solution than I could reasonably implement myself. This business of having physical drives in safety deposit boxes belongs in a previous century (if I even had a safety deposit box, which I don’t). And I don’t worry about cloud storage providers “abusing my data” because none of my data is of any use or interest to anybody else whatsoever. (Except my password database, which is robustly encrypted.)

    Similarly, for an SME, operating your own NAS is an unnecessary pain in the backside, which most people make a horrible mess of. Even large enterprises with dedicated IT departments make a horrible mess of backup. Migrating from one provider to another is potentially a little bit more hassle, but only because there’s more data involved. You don’t need to worry about being locked-in (any more than you need to worry about being locked-in to your on-prem hardware) because it is literally just a bucket for bits.

    Obviously this is all based on the assumption that you live in a modern, developed nation, where bandwidth is as ubiquitous and reliable as electricity… If you live in some bandwidth-capped throwback to the late 20th century, things might look different.

  16. says

    Dunc@#19:
    If you live in some bandwidth-capped throwback to the late 20th century, things might look different.

    Oooh, late 20th century would be a nice upgrade!

    Seriously, though, your points are good. But should we worry that history may repeat itself? If so, we need to look carefully at what happened when IBM dominated the data center in the 1970s: the prices went sky high and that triggered the minicomputer revolution and desktop revolution. In one sense cloud computing is just DataCenter 2.0, and it’s got the same potential problems – a lot of companies that went into the “glass house” de-skilled their staff to the point where they were helpless. I know because I got my first couple of consulting gigs, when I was in college, recovering captured databases from mainframes where my clients weren’t being allowed to migrate without ruinous migration fees. (And it turned out that a well-configured 286 with DbaseIII could out-perform an IBM Series 1, easily…) Watching those organizations getting royally screwed (they were paying nearly $250,000/year for one database!) – it really made me see the consequences of losing power over your critical resources.

    Organizations that are careful about lock-in can ride the tiger successfully but to avoid lock-in you have to retain some basic understanding of IT. Before I stopped doing any government work, I witnessed firsthand what happens when you have a huge footprint, no skills, a lot of money, and lock-in. In the case of the government, it’s been especially bad since the first thing that all the contractors did when they were locked in was make the lock-in worse. There are government systems and networks that are entirely in the control of contractors and any attempt to move them or renegotiate terms of service would amount to a complete re-build. Next up: JEDI. Amazon is doing a very good job with Lambda, making sure that a proprietary backbone is a justification for sole-source contracts and a lock-in. This stuff does not simply happen, it’s deliberate. It’s discussed quietly in the go-to-market meetings at vendors.

    For individuals or organizations that have data of such relative unimportance that it’s not a threat, by all means, cloud is the way to go. Otherwise it’s the usual tradeoff between convenience, cost, and security/future/governance. Since I’m sort of a long-term internet mortician, I see the worse side of that equation so I tend to hedge my risk against that direction.

  17. Dunc says

    Certainly large organisations (and particularly large public sector organisations) have all sorts of problems to consider here, but I’d argue that in many of those cases, the problems are more to do with the nature and scale of the organisation than with the specifics of the technology. At that scale the potential for lock-in exists everywhere, from the hardware in your data centre to the office software on the desktop, simply because changing anything in an organisation of that sort is an absolutely bloody nightmare. Remember, we’re talking about the sort of people who need two years and a team of procurement specialists to decide which brand of paper clips to buy.

  18. Jazzlet says

    Ieva @#5
    I have lots of colours of socks because I enjoy the colours. I have cotton and wool because I don’t get athlete’s foot if I wear them, which I do if there is much man-made fibre in the mix. If I had enough money the wool would be merino and/or cashmere, and the cotton bamboo. I do buy multiple identical pairs of each colour, fabric and length so as individual socks die I can still have pairs that work, but also because for any given colour that I like it may be years before they come round into fashion again. *fashion – spit*. So practically my drawer is actually like a cross between the picture and your box, there are pools of socks of each colour, fabric and length. It’s a small pleasure, but opening up my sock drawer makes me smile, I’ve been poor enough to not be able to buy underwear – you can’t see it so it doesn’t matter what state it’s in, and all the rest of my clothes I bought from charity shops. I never forget that and am always cheered by evidence that I am not that poor anymore, in fact I’m very comfortably off.

  19. says

    Jazzlet@#22:
    The scene in Gone With The Wind where Scarlett swears, “I will never go without woolen thigh highs again!” – was the best scene in the whole movie.

  20. dangerousbeans says

    damn am i feeling this. if you can’t sort the data (or anything) as it goes in it will just become a mess.
    I’m sitting here looking at 300k records (so small fry really) that are almost useless because there is no standardisation on what got recorded. this is costing millions a year, and it’s mostly useless because someone took the data lake approach.
    if i had any experience with machine learning i would talk them into buying some machine learning capacity, not that i think it would work.
    too much excitement about being able to collect data, not enough thought into what data you want to collect and how to define it.

  21. Dunc says

    too much excitement about being able to collect data, not enough thought into what data you want to collect and how to define it.

    Exactly. Since people realised that they could collect vast amounts of data, they’ve become obsessed with what you might call “data fetishism” – the idea that if you can only pile up enough straw (user data), somebody will figure out a way to spin it into gold (“actionable insights”).

    I’d love to see what happens if you run enough ML on a bunch of physiognomic measurements and online shopping data… We could re-invent phrenology for the 21st century.
