Scientific information must be free! Now where to put it…


Do you read the ‘supplementary information’ in science articles? If you’re familiar with the way journal articles work, you know that journals publish a traditional, formally formatted article in the print version, but now they often also have a supplementary information section, stored in an online database, that contains material that would be impractical or impossible to cram into print: raw data, spreadsheets, multimedia such as movie files. This is important stuff, especially if you want to dig deeper, re-analyze, or otherwise rework the information.

Another important function is, I think, preserving data. In a previous life, I moved into an old lab that was piled high with the cluttered debris of the previous tenant’s scientific career; some of it we boxed up and moved to a storage space (where it is probably moldering, untouched since the early 90s), and the rest found a resting place in a dumpster. I felt terrible about that, but it was a necessity.

Maybe it won’t always be such a necessity, though. The ‘supplemental information’ section of science papers represents a way to archive data that would otherwise lie in heaps at the bottom of file cabinets until lost. Those sections have their own problems—‘supplemental information’ is an amorphous category that can contain anything, it is clearly going to require some kind of formal metadata support, and it is going to be a storage headache for publishers. We might also wonder whether the big publishing companies are the appropriate repositories for what ought to be publicly accessible data.

One other possibility is storing raw data in these growing free databases. YouTube is essentially a free database specifically for storing (almost) any video data — I’ve seen some scientific work tucked away there, although it also creates new concerns: resolution is limited, you never know when the YouTube management might decide they dislike you and throw away your work, and a lot of raw scientific data isn’t going to have a large audience and therefore isn’t going to draw in much ad revenue. Another interesting possibility is Google Base. Did you know that Google is providing a free online database in which you can store just about anything? Free storage, public access and searching, a reliable host — it’s a wonderful idea, as long as you don’t mind Google owning all of the information in the world.

Comments

  1. Caledonian says

    Did you know that Google is providing a free online database in which you can store just about anything? Free storage, public access and searching, a reliable host — it’s a wonderful idea, as long as you don’t mind Google owning all of the information in the world.

    “Of course we’ll bundle our MorganNet software with the new network nodes; our customers expect no less of us. We have never sought to become a monopoly. Our products are simply so good that no one feels the need to compete with us.”

    – CEO Nwabudike Morgan
    Morgan Data Systems press release
    “The Network Backbone”

  2. says

    One of the enormous advantages of the internet is that it provides a quick, convenient way to reference supporting material. In the ideal case, all raw data, bibliography and footnotes would be found on clickable links to sources which themselves were supported with clickable links, so that you could go as deeply into the interesting or problematic parts of the paper as you wished. It isn’t at all impractical or prohibitively expensive to make this possible; it just hasn’t been done yet.

    In print, tracking an idea to its origin minimally takes a trip to the library and a trip to the stacks — but often you have to use ILL, not all libraries are open-stack, and sometimes the thing you want is unavailable. So we’re talking about reducing a lag time of a couple of weeks to a couple of hours.

    It’s amazing, the disproportion between the negative and the positive things people (especially academic people) say about the internet. The downside is pretty minimal and the positive benefits are enormous, but you wouldn’t know that from listening to people. A lot of it, as with journalism, is that some of the gateways have been broken open, reducing the exclusivity of the scholarly world.

  3. Flex says

    I know that some people think it’s a heretical thought, but intellectual property laws do not have to be as restrictive as they are today.

    Based on Google’s track record, they may not be concerned if the information they store is in the public domain, meaning that they couldn’t/wouldn’t prosecute anyone for using the information they store.

    If people are interested in ideas about intellectual property, I’ll give a plug for Lessig’s book, Free Culture. It is available without charge under a creative commons license here, http://free-culture.cc/freecontent/

    It’s about a 2.5K .pdf file. Great for a PDA download for a plane trip.

  4. says

    A lot of people have been clamoring for access to scientific papers recently, so perhaps a Google database where scientists can store their data/papers would be a good idea. Give them a way to make ad revenue off of it, and I’m sure they’d do it.

  5. Russell says

    If a company like Google did not host it, what organization would? Library of Congress? The problem is that someone has to own the servers and routers and wires and things. One might argue that a long-lived institution like the Library of Congress would be preferable; corporations tend to have shorter lifespans.

  6. Johnny Vector says

    Well, this time Phil Plait is way ahead of you, PZ. Or anyway, the astronomical community is ahead of other sciences. The AAS was the first to get electronic publishing right, like 10 years ago, and is still one of the better scientific publishers. For instance, they chose SGML (you can call their implementation XML if you like, though it’s not quite exactly that) for archival storage, to increase the likelihood of recovering the data should all the reader software go extinct.

    That possibility, by the way, is one of the huge downsides of electronic publishing. Adobe has very little incentive to keep their formats forever backward-compatible, so it’s quite likely that a PDF from 1999 will eventually be readable only with a very old version of Adobe Reader, which will run only on a very old version of Windows, which itself won’t load on any computers available even on eBay.

    You got a piece of paper, you can always (assuming it hasn’t turned to dust) read it, with just your eyes. You got a CD-ROM, even if the bits are 100% intact (which is unlikely after 100 years for many values of “CD-ROM”), you still need a specialized reader to make any sense of it.

    Mind you, I think electronic publishing is great, and putting the actual data online as well is a huge step forward, but we do have to be aware of the downstream consequences.

  7. Graculus says

    “The Network Backbone”

    Posted by: Caledonian

    “Beware of he who would deny you access to information, for in his heart he dreams himself your master.”

    Commissioner Pravin Lal

  8. Crow says

    This has been a major concern for academic librarians for better than a decade, and there are some elegant solutions that have been developed and implemented by a number of the major research libraries.

    To get you started: http://www.dspace.org/

    -Crow

  9. ColinB says

    scholar.google.com will FIND the papers, although they don’t store them. It’s a bit like peersearch I guess.

  10. Andy Groves says

    One of the dirty little secrets of supplementary data is how quickly it can disappear from the websites of journals. I’m not naming names, but trying to access supplementary data from certain journals gets harder and harder the older the article gets…

  11. John Emerson says

    Anglo-Saxon and medieval studies have also been using the internet effectively for a long time. Like astronomy, these are low-profit specialties.

    Presumably sites will have to host downloadable copies of obsolete reader software at some point.

  12. says

    The dumpster part is the worst of the old paradigm.

    My wife, before she became a physics professor, worked as a PI (principal investigator) on an ion-sputtered gallium arsenide thin-film research project, where she’d written the proposal, and also did the administration. Her team spent years perfecting the apparatus, and got results that the US government ranked #3 in the world, just behind IBM and Bell Labs.

    Then the little (~220 person) company cut back (it was Dick Cheney personally who screwed them on a contract). Eventually, the company went bankrupt, and dragged its president and vice president (now married) into personal bankruptcy.

    In the sale of assets, my wife bid on her lab notebooks. She was outbid by someone who just liked the loose-leaf covers, but refused to give the contents to my wife. We presume that all the hard-won data went into a dumpster.

    Your tax dollars at work.

    I may have other things to say later on this thread — I publish a LOT of these on-line appendices — but I wanted to get that horror story off my chest, first.

  13. says

    We forget that not all raw data is of value. I publish the interesting parts of my data. But I would never want to inflict big chunks of my notebook on anyone. Not all data needs to be saved.

  14. says

    Jonathan Vos Post, that is indeed a horror story. “Ouch” doesn’t cover it. Now I feel bad.

    When I was a Master’s student (only a few years ago), we moved labs. The rooms we moved into were formerly occupied by a semi-retired professor who eventually decided to make it official and stopped showing up on campus. He didn’t do much in terms of cleaning out, and among the jars of specimens and a file cabinet full of empty hangers and decaying scrap paper were the theses of every graduate student that professor had served as primary advisor for during his career – a stack of books about 2.5 feet tall.

    I couldn’t bear to throw away books, especially theses, so I cleared a shelf and hung on to them. I was very glad I did when a former student of that professor, himself now a professor, came to our university and gave a (quite good) presentation. He very gladly took his thesis with him, saving it from whatever whims or fate may be in store for the rest now that I’m no longer there to look after them.

    As for supplemental information, I’ve been trying to get the family-level phylogeny for actinopterygians from a couple of papers of 2005 and 2006 vintage from Springer. They’ve been very polite in their emails, and have assured me they’re working on this, but it’s been almost a year and still no luck. The reference to “supplemental materials available at http://www.springerlink.com” in the papers is way too vague to be useful – Springer is a big publisher, with tons of material. Finding one figure from one article from one journal on a big site like that is basically not possible.

  15. Bob O'H says

    What, nobody mentioned GenBank yet? It’s a database of DNA sequences that now has all sorts of uses. Access is free, and it’s the norm to put your sequences in GenBank: I think some journals insist on it.

    And once the sequences are there, you get to BLAST them.
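
    As a rough sketch (assuming Biopython is installed, and using a placeholder accession number rather than any real deposit), pulling a record back out of GenBank and BLASTing it might look something like this:

    ```python
    # Sketch only: fetch a GenBank record and BLAST it with Biopython.
    # The accession ID below is a placeholder, not a real deposit.
    from Bio import Entrez, SeqIO
    from Bio.Blast import NCBIWWW, NCBIXML

    Entrez.email = "you@example.org"  # NCBI asks for a contact address

    # Fetch one nucleotide record in GenBank format
    handle = Entrez.efetch(db="nucleotide", id="XX000000",
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    # BLAST the sequence against the nt database and print the top hits
    result = NCBIWWW.qblast("blastn", "nt", str(record.seq))
    blast_record = NCBIXML.read(result)
    for alignment in blast_record.alignments[:5]:
        print(alignment.title)
    ```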

    Bob

  16. Kagehi says

    Hmm. Data redundancy! Never, ***ever*** store something in only one place. It’s not safe. Just look at what happened to the original non-blurry digital tape from the moon landing: first off, even NASA didn’t (and sadly still doesn’t) give a shit whether they kept it or can find it, and second, there were no copies. One has to be slightly horrified at what has gone missing in some place like the Pentagon, where not only are no copies often kept, but half the damn pages have big black marks through them to hide names, pieces of data they don’t want people to know, even after it’s released for public examination, etc.

  17. Col Bat Guano says

    I’ve worked in biotech for most of the past 25 years and I always wonder what has become of all those old lab notebooks archived at now defunct companies. Initialing and dating each entry and then getting it co-signed took a bit of work and I’m sure all that data (and my illegible notes) is filling a dump somewhere.

  18. says

    Addendum to my small company horror story: I have observed much worse in government agencies and Fortune 500 corporations. My CV — the non-teaching version is at
    http://www.magicdragon.com/SherlockHolmes/resumes/JonathanPost.html

    shows that I’ve worked at, as employee, contractor, or consultant:

    Boeing, Burroughs, European Space Agency, Federal Aviation Administration, Ford, General Motors, Hughes, JPL, Lear Astronics, NASA, Systems Development Corporation, U.S. Army, U.S. Navy, U.S. Air Force, Venture Technologies, Yamaha, and others.

    Again and again I’ve seen this (generic version conflates several incidents):

    Some assistant VP with time on his/her hands dredges up the urban myth that Filing Cabinets are Bad for Business. A specious old management myth holds that the facility cost per square foot of filing cabinets exceeds the value of the paperwork kept in filing folders in those filing cabinets.

    A company-wide memorandum goes out, offering a prize to the Department Manager who throws out the greatest weight of filing cabinet contents. A firestorm ensues. A manager gets the award — typically a cash bonus, paid vacation, and letter of commendation.

    A few weeks later someone requests vitally important data from the department that I worked in. I have to say, sadly, that my boss threw it out.

    I’ve seen even more extreme versions of this. In one case, when I was a highly paid very senior manager at Rockwell International’s Space Transportation Systems Division, there was a small section of the technical library that held the company’s own films and videos of Saturn V Apollo launches. These were NOT the official NASA records, but used different cameras. There was a $2,000 per year line-item for keeping these archived. A manager decided to save $2,000 per year and announced that he was saving $10,000 per year. The budget line item was cancelled. The films and videos were literally thrown into a dumpster.

    I watched someone go into the dumpster to retrieve them, and donate them to a nonprofit space organization. I watched Security order the rescuer to throw the material back into the dumpster, or be immediately fired for insubordination. Back into the dumpster they went. Security accompanied the dumpster to ensure that the records were actually destroyed.

    The same then happened to Rockwell’s own videos of Space Shuttle launches. I reported this to the Columbia Accident Investigation Board, and the NASA Inspector General.

    The NASA Inspector General refused, in writing, to gather some 6,000 pages of data that I had kept. The only Nobel Laureate on the Columbia Accident Investigation Board resigned, and gave a press conference. The Nobelist (stand-in for my mentor Feynman on the Challenger investigation board) said that NASA’s management defects had been identified in copious detail, but that it was now evident that they did not have the will or ability to follow the recommendations.

    Rockwell International’s Space Transportation Systems Division has since been acquired by Boeing. The bad managers at Rockwell are now bad managers at Boeing — and some have been promoted.

    As I say, I’ve seen this sort of thing done at ALL large companies for which I’ve worked.

    My salary was reduced over 75% when I went from the corporate world to college and university teaching. But at least I don’t wake up screaming from nightmares now about the Space Shuttle blowing up and killing people because I could not get management to heed my thoroughly documented concerns.

    The corporate-university connection is changing. See the editorials today warning of problems if BP’s $50,000,000 per year deal with the University of California and the University of Illinois is finalized. Can a multibillion-dollar company de facto “buy” a major research university, and skew its science research and ethical conduct forever? Too soon to tell.

    But the issue is much much bigger than lab notebooks and on-line publications.

    By the way, I’m looking for a job right now. But my blogging suggests that I put the interests of the truth ahead of what pointy-headed bosses order. And that seems to be a problem.

  19. Torbjörn Larsson says

    Jonathan Vos Post, that is indeed a horror story.

    I have to second that.

    While my own material was never of that caliber, I have lost raw data twice inside large corporations. In the days of paper, and of the Mac/Windows war, copying was often out of the question.

    Once I lost original raw data (also on sputtering, btw) while stationed in the US, because of mistakes while the main company moved into new buildings. Luckily, the results were already extracted.

    The other time, it was their own rather expensive data that disappeared from a “safe” place. (There was a wave of internal and surprisingly easy theft at the time, so no one was certain if it was intentional or not. Maybe that was the intention.) :-( And again, ‘luckily’, the original project was already abandoned.

    I do hope the web will help, by keeping duplicate, cheap and accessible copies. The copyright problem is solvable, and I would like a solution along the lines of Flex’s suggestions. I may be culturally biased, since for example Swedish governmental tradition has been openness outside primary executive functions (which need to be fast and not second-guessed, to guard against unavoidable mistakes), but it has worked well.

    The compatibility problem is perhaps a matter of viewpoint. To me it doesn’t seem quite analogous to an old language needing interpretation to be understandable; it is rather like refusing to accept old (and sometimes local) currencies (markup standards).

    What we perhaps need is ‘banks’ that will cover the transactions, like GenBank/BLAST interfaces or the astronomical equivalents. Perhaps they could perform the same functions with data ‘hidden under the pillow’ by changing outdated markups during a notice period.

  21. MpM says

    First: Public storage by Google or any other proprietary entity is no storage at all. You are totally at their mercy. If in 5 years that entity is bought out, or their offerings change, you could see valuable data get shit-canned faster than an Irish priest on St. Patty’s Day.

    Second: What is important? Some folk feel any and all “information” is valuable, but is it valuable enough to invest in archiving? Not to trivialize Jonathan’s eloquently stated concerns, but do we need to keep every camera angle of every Saturn V launch? If the answer is yes, then who foots the bill? Information storage is cheap, but ARCHIVING is not. Searchable databases are expensive to maintain.

    Lastly: I do believe we need a national protocol for storing knowledge. I feel that universities should work together, just as they did at the birth of the Internet, to devise a unified database scheme with agreed-upon platforms, security, permissions and access. National labs, and repositories such as the Smithsonian, the National Parks, and the CDC, should participate. It would be a single system, with distributed storage.

    Just my 2 cents.

  22. CalGeorge says

    There are lots of non-commercial alternatives, such as DSpace (developed by MIT), that could do the job too.

    http://www.dspace.org/

    Keep the stuff in academia, where it won’t be dumped at some point in the future, and give librarians the opportunity to help with the metadata so that the stuff can be retrieved easily.
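
    For the retrieval side, DSpace-style repositories generally expose their metadata over OAI-PMH; a rough sketch of harvesting Dublin Core records (the repository URL below is hypothetical) might look like this:

    ```python
    # Sketch only: harvest Dublin Core metadata from a repository's
    # OAI-PMH endpoint. The base URL is a hypothetical placeholder.
    import urllib.request
    import xml.etree.ElementTree as ET

    BASE = "https://repository.example.edu/oai/request"  # hypothetical
    url = BASE + "?verb=ListRecords&metadataPrefix=oai_dc"

    with urllib.request.urlopen(url) as resp:
        tree = ET.parse(resp)

    ns = {
        "oai": "http://www.openarchives.org/OAI/2.0/",
        "dc": "http://purl.org/dc/elements/1.1/",
    }
    # Print a title and identifier for each harvested record
    for rec in tree.findall(".//oai:record", ns):
        title = rec.find(".//dc:title", ns)
        ident = rec.find(".//dc:identifier", ns)
        print(title.text if title is not None else "(untitled)",
              "->", ident.text if ident is not None else "")
    ```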

  23. says

    Proceedings of international conferences often have a page limit for papers. Many, maybe most, of my dozens of conference papers are trimmed down from 100-page drafts to, say, 12-page proceedings versions.

    But I hate for the trimmed-away parts to go unpublished. So I tend now to almost always write drafts that are 2 or 3 times too long, and then carve 2 or 3 different papers from them, for different conferences. However, nobody but me and my coauthors sees the connections between these fragments. There’s no cost-effective way to publish monographs (which is what a 100-page paper really is). Except online…

    In essence, each author (perhaps with the assistance of their department or corporation or government agency or whatever) should have a master metadata map of ALL their publications, in full and updated form, showing how the pieces fit together in semantic space.

    My annotated list of publications had exceeded 250 paper pages before I gave up including it in job applications. Instead, I have to say “my 2,400 publications, presentations, and broadcasts are too numerous for this application…”, so I list four of the most recent only, and then give the URL of the best and hardcopy of the cover/abstract pages of 3 others.

    I understand that the average application to Harvard Med School faculty has over 100 publications listed, so they ask for the top 10 only.

    Also, for every publication I have, there is at least one half-baked draft not quite ready for publication. This is in every category I publish: Math, Physics, Biology, Astronomy, Computers, Science Fiction, Poetry, …

    So with over 1,000,000 words in hardcopy print (word in the old magazine/newspaper sense of number of alphanumeric characters including spaces, divided by 6) and closer to 5,000,000 words online, I still have that much again in versions unsearchable and invisible. And there are only 24 hours in a day. Lawyers and consultants usually add: “… and I bill for 36 of them.”

    [insert reference here to the asymptotic approach to “LPU” — the Least Publishable Unit, as short as possible, with as many coauthors as possible, as a fragment of as large as possible a series of papers].

  24. says

    By the way, I don’t get paid for 99% of these publications. In fact, I have to spend about $20,000 per year out of pocket to go to the conferences, fly, stay in hotels, and eat overpriced food. When I’m employed, that’s a useful tax loss on my Schedule C to reduce my taxable income and thus my taxes on the salary. But when between jobs (as I have been for 2 years) it is extremely rough.

    In essence, the Web has made it possible to publish an order of magnitude more, and reach a wider audience, but there is no micropayment infrastructure to reward online publication. In essence, one theoretically gains reputation points, but they are not monetized.

    Hence the vast bulk of worthwhile research is done with institutional affiliation: corporate, university, government grants picking up the journal page charges and conference travel budget. But this goes against the tide of the globally distributed world of virtual companies and online co-authors who rarely or never see each other face-to-face (or f2f in netspeak).

    What is the role of the unaffiliated independent researcher in Web 2.0, Web 3.0, and the bulk of the 21st century?

    Oh yeah, blogging. Right. How many people ever got tenure for blogging? Or won a research grant from blogging?

  25. says

    MPM reasonably commented: “…Not to trivialize Jonathan’s eloquently stated concerns, but do we need to keep every camera angle of every Saturn V launch?…”

    (1) Archives of film and video are relatively cheap compared to the cost of digitizing them.

    (2) Photographs and films are usually admissible evidence in legal cases (think: investigation of scientific fraud), whereas digital files require additional proof of authenticity and chain-of-evidence.

    (3) As it turned out, for the question “do we need to keep every camera angle of every Space Shuttle launch” the answer, in the context of foam breaking off on launch and making holes in wings, was a matter of life or death.

    (4) In the latter case, the NASA Inspector General, in dismissing my complaints (which were filtered through a non-engineer investigator!) reportedly said “throwing out archived information is not illegal.” I contend that it IS if it is covering up knowledge of a problem, and thus turning negligence into intentional risk, i.e. the difference between manslaughter and murder.

    Oh yeah — corporations cannot be charged with murder. Google the phrase “mens rea” in the legal context.

    (5) Granted, most science data is never going to end up in court. But archiving all emails and raw science data CAN be important in theft of intellectual property cases (i.e. plagiarism) and patent disputes (BIG money at stake). In the trial on who owned the patent on Stored Program Digital Electronic Computers, the key evidence was a notarized cocktail napkin.

  26. says

    I’m surprised that there has been no mention of the Semantic Web. There is a lot of work on getting medicine and the biosciences hooked into the Semantic Web effort. The storage problem would basically be addressed by a distributed effort, since the data could be linked in much the same way between databases: journals, publishers, universities, individual blogs and corporate websites. It’s all exciting stuff and it’s developing on both the commercial and academic fronts. This is why I’m planning to do my Master’s degree in the area of semantic web technology.
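
    To make that a little more concrete, here is a rough and entirely hypothetical sketch in Python (using rdflib; every URI below is made up) of describing a raw dataset and linking it to the paper it supports:

    ```python
    # Sketch only: describe a dataset and link it to a publication in RDF.
    # All URIs below are hypothetical placeholders.
    from rdflib import Graph, URIRef, Literal, Namespace
    from rdflib.namespace import DCTERMS, RDF

    DCAT = Namespace("http://www.w3.org/ns/dcat#")

    g = Graph()
    dataset = URIRef("https://data.example.edu/datasets/raw-imaging-001")
    paper = URIRef("https://doi.org/10.9999/example.12345")  # hypothetical DOI

    g.add((dataset, RDF.type, DCAT.Dataset))
    g.add((dataset, DCTERMS.title, Literal("Raw imaging data for example study")))
    g.add((dataset, DCTERMS.isReferencedBy, paper))  # dataset <-> publication link

    # Serialize as Turtle so other databases can link to the same URIs
    print(g.serialize(format="turtle"))
    ```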

  27. says

    Forrest Bishop sent me this email to post:

    Knowledge is Property

    For the scientist, as for everyone, it requires expending resources in the present (deferring consumption) in order to produce something for future consumption, scientific knowledge in this case. If the future product has an *anticipated* sale price of zero, there is no economic reason to produce the good. The reward then becomes purely psychic, a valid reason to pursue the endeavor, but it does not put food on the table.

    If the purchase price were zero, the demand would tend to infinity, as the supply approaches zero (for manmade goods). A great example is government health insurance: the perceived “free” medical care drives the demand upwards, while the supply of efficacious medical care tends to zero. This is most clearly seen in the more socialistic countries, like Canada and England. Here in America, we have reached the absurd situation whereby folks without health insurance live longer and enjoy better health. Death by AMA doctor has become the second or third leading cause.

    In more civilized times (17th-19th Centuries), most of the valid scientific or technological knowledge was produced either by ascetics like Oliver Heaviside, by independently wealthy investigators like René Descartes, by patronage, or by private enterprise. University was a secondary player/cheerleader, and the State almost a non-player. (An interesting exception was the French military invention of orthographic projection in the 17th Century. This remained a military secret, holding up progress, for 30 years, IIRC.) Hence the rate of progress was far higher than it is today. For example, the transition from canal to railroad in the early 19th Century was far more revolutionary and beneficial than the transition from the 707 to the 7X7, or from the Benz to the Maserati, or from the V2 to the Saturn V to the lately-recycled Saturn V.01.

    Sifting Signal from Age of Noise

    Next, we have to look at what constitutes “valid scientific knowledge” and what is only noise and more noise. A couple of examples of non-scientific beliefs, or hypotheses:
    a) There exists a fluid-like material that flows inside of wires, called “electric current”.
    b) There exists a continuous, thermonuclear reaction inside of the Sun.

    Each of these constitutes an hypothesis. The ramifications, i.e. deductions, of the hypothesis are what is tested, not the hypothesis itself. These examples were chosen because both have had their deductions falsified. Both rely on a hypothesized unseen entity, or process. When the deductions from an hypothesis, which form the scientific theory, are shown to be false, everything built upon that hypothesis is questionable as to being valid knowledge. (etc.)

    Forrest

    –Forrest Bishop

    Institute of Atomic-Scale Engineering
    http://www.iase.cc

    [there were 2 more URLs as originally submitted, but they might have been causing the scienceblogs server to assume spam, and give a vague error message]

  28. Gregory Mayer says

    “Supplementary information” is not a way to archive data, it’s a way to disappear it. Journals are declining to publish necessary information (often the data itself) on the grounds that they can put it online. But what’s still going to be around in 50 or 100 years– the Library of Congress, or some file posted on a website? At a workshop, Edward Tufte held up a first edition Galileo, noted it had lasted 400 years, and then asked, “Is your website going to be up next week?”

  29. Neil Beagrie says

    We have recently completed reports for the UK Office for Science and Innovation (part of the Department of Trade and Industry) looking at the information infrastructure needed to support the UK’s Science and Innovation Investment Framework 2004-2014.

    The sub-group report on digital preservation and curation is available from http://www.nesc.ac.uk/documents/OSI/preservation.pdf

    The report covers many of the key issues raised in the discussion above, such as scientific data management and preservation, selection, data services, scientific e-journals and persistent linking to supplementary data sources. Hence it should be of considerable interest to those following this discussion thread.

    There is also some interesting work under discussion in the US Office of Cyberinfrastructure (part of NSF) considering a massive plan to store almost all scientific data generated by federal agencies in publicly accessible digital repositories. See the recent coverage in the March 22 issue of Nature at http://ealerts.nature.com/cgi-bin24/DM/y/hc530SpivX0HjB0BOpY0EA (subscription required).