Steal this post


There’s a minor contretemps going on at scienceblogs — a few of our Original Content Providers are a bit peeved at certain abysmally uncreative sites that think they can get rich by collecting rss feeds and putting them on a site with google ads, while adding no original content of their own. I don’t mind the rss parasites trying that at all — if it’s in my syndication, it’s out there and you can jiggle it around however you want — but it’s such a stupid, mindless strategy. Who’s going to regularly read a site that just repackages other people’s work, when the originals are easily and freely available? And if they stumble across something I wrote on another site, as long as it has a link back here, I really don’t care; it just means they’re doing some half-assed, clumsy advertising for me, for free.

So I’m not joining in the complaints. However, I do think this great comment from a Technorati rep at Bora’s shows the optimal way to handle it.

Thanks for bringing that to our attention; nyarticles won’t be getting indexed by us. We try to index only original sources and to avoid aggregator/planet sites; we definitely don’t want mechanical feedscrape-and-adsense sites.
cheers,
-Ian (from Technorati)

A parasitic clone-dump does only minor damage to the internet, in the form of inefficiency; it’s nice to see that at least one of the network aggregating services sees it that way too, and that they are going to detour around it.

Now for the philosophical dilemma: are Technorati and Google also mechanical scrapers of the original content on the web? Should we be irritated that Google keeps copies of our web pages on its servers?

Comments

  1. llewelly says

    This argument is older than search engines. Google hadn’t been imagined, much less named, when it started.
    At the end of the day the community tends to judge the scrapers by the services (if any) they provide. In Google’s case, their scraping enables a much wider audience to find and view a web page.

  2. says

    My take on people archiving and redistributing my blog, photography, etc. is that as long as I’ve already made it publicly available and they’re not charging a fee for access without my permission and a royalty, it’s fine. Anything placed in an open forum of any sort anywhere on the net should be considered to have passed beyond the author’s control the moment he/she hits the “submit” button, given the previously listed conditions.

  3. inkadu says

    I’m assuming you’re not talking about edited compendium sites like A&L Daily, which filters a lot of web content and picks out some interesting articles (to an elitist libertarian). That is actually providing a bona fide service.

    But again, I think PZ, and most people on the internet, are hip to the internet’s economic philosophy:
    1) If people are willing to do it for free (and better), there’s no point in charging for it.
    2) If a lot of people are reading it, you’ll find some way for your internet hobby to pay for your laundry.
    3) Jealously guarding content is a great way to be irrelevant.

  4. Abc says

    I copy and paste some of your posts and put them on my blog. I give full credit to the authors. I think it’s a form of advertisement.

  5. says

    I, too, have had people aggregate my feeds, among others, without asking me, which is certainly annoying. That’s why, when I created my own aggregator, Planet Atheism, I made it my policy never to include a blog without express permission from the author. So far, the experience has been quite positive.

    I very much doubt that the actual content stealers get a lot of money from AdSense, since Google and others are very good nowadays at detecting which is the original version of a post, and sending people to that version, not to the copy. Planet Atheism, for instance, gets almost no traffic from search engines (except when people actually search for “planet atheism” and such); when people google for something, they get directed to the actual blog post, which is the way it should be.

  7. says

    I’m certainly not annoyed at Google caching pages. When my web space provider “lost everything” while still having quota problems and, because of them, no space for backups, the Google cache saved my backside :)

  8. says

    I don’t understand the POINT of sites like the one described. Why would anyone else want to view willy-nilly stuff taken from elsewhere? Now I do use Google Reader to keep all my favorite reads in one place, just as a matter of convenience, but I don’t think a large number of individuals are particularly interested in EVERYTHING I READ… and I would never make an entire blog of it! I do mark some entries (via my Google Reader) as shared and have a “picks” box, which directs you back to the original source blog.

    …an all-cut-and-paste blog? That’s just lazy.

  9. says

    While I’m sure Technorati tries to eliminate the non-value-added scraping sites and those that just regurgitate public domain articles to generate ad revenue, I would think it’s a losing battle. Just visit Technorati and do a search for “dental insurance” – I don’t see a lot of “real” blogging there.

    I have a WordPress widget that identifies when a post has been scraped and linked back to (or at least it catches some of them). I can then add the offending IP or application to a blacklist, and the next time they scrape the site it (supposedly) sends them either gibberish or custom text of my choosing.
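
    A rough sketch of that blacklist idea, boiled down to a tiny Python WSGI app (this is not the actual WordPress widget; the addresses and feed content are made up):

        # Requests for the feed from blacklisted addresses get junk (or any
        # custom text) instead of the real content.
        import random
        import string

        BLACKLIST = {"192.0.2.10", "192.0.2.11"}   # example scraper addresses (hypothetical)
        REAL_FEED = b"<rss>...your real feed goes here...</rss>"   # placeholder

        def gibberish(n=2000):
            junk = "".join(random.choice(string.ascii_lowercase + " ") for _ in range(n))
            return junk.encode()

        def app(environ, start_response):
            if environ.get("REMOTE_ADDR") in BLACKLIST:
                body = gibberish()                 # or swap in whatever custom text you like
            else:
                body = REAL_FEED
            start_response("200 OK", [("Content-Type", "application/rss+xml")])
            return [body]

        if __name__ == "__main__":
            from wsgiref.simple_server import make_server
            make_server("", 8080, app).serve_forever()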

  10. says

    I don’t understand the POINT of sites like the one described. Why would anyone else want to view willy-nilly stuff taken from elsewhere?

    The point isn’t to provide an actual useful service for readers, it’s to catch google searches and generate ad revenue for the sites’ operators without said operators actually having to spend any time writing content of their own.

    If you are a content provider interested in countering this, it should be possible to check your logs to see if the IP address of the offending aggregator site can be determined (especially if it is the same as the server hosting the website). It would then be a fairly routine matter to set up a rule in your webserver configuration that either denies the aggregator’s specific IP access to your feed or, more amusingly, redirects it to an alternate feed made especially for them (ideally with your own Google ads embedded).
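
    As a sketch of the log-checking step (assuming an Apache/nginx-style access log at a hypothetical path, and a hypothetical feed URL of /index.xml), something like this would surface the heaviest feed fetchers:

        # Count feed requests per client IP in a combined-format access log.
        # The log location and feed path below are hypothetical; adjust to your setup.
        from collections import Counter

        LOG_FILE = "/var/log/apache2/access.log"
        FEED_PATH = "/index.xml"

        hits = Counter()
        with open(LOG_FILE) as log:
            for line in log:
                parts = line.split()
                if len(parts) < 7:
                    continue
                ip, path = parts[0], parts[6]      # client address and requested path
                if path.startswith(FEED_PATH):
                    hits[ip] += 1

        # The most frequent fetchers are the candidates for a deny or redirect
        # rule in the webserver configuration.
        for ip, count in hits.most_common(10):
            print(f"{count:6d}  {ip}")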

  11. says

    Well, the problem I have with NYarticles and “articles” – which are both the same website, based on whois – is that they do nothing for me.

    It’s not like fair use. There is no new product or interesting service. Google is reproducing content, but as part of a service – they catalogue the whole freaking internet, save pages for posterity, give you free tools, help you find what you want, etc.

    NYarticles is just stealing content whole-hog and adding nothing. Screw those guys. If you’re going to steal my stuff, do something interesting with it.

  12. says

    Very interesting questions.

    But maybe the premise could be reconsidered. The web without search engines and such would be like a cell without enzymes. Or maybe a cell without ER. Whatever. There would be 3 billion or so sites, but you would only know about the ones you got links to from people you know, or that were linked from sites you visit. That would be a lot, but really, just an electronic version of the mundane normal human network. The search engines are what make the network new and different.

    A search engine without caching (internally) would be very inefficient. Persistence in the cache may simply be a side effect.

    In other words, having our pages cached IS being on the internet. These are our footprints, our tire tracks, our shed skin cells.

    (Or am I just trying to talk myself into something)

    One good thing about caching is that it annoys those news outlets that give you open access to content for a few days and then switch to requiring you to register. Screw them.

  13. fatsparcheesi says

    Blake, OM: It’s uncanny. Is it supposed to be some sort of ironic meta-commentary? Or have you unwittingly stumbled across something far, far more sinister?

  14. DM says

    Re: Google as an aggregator

    That’s why you use a robots.txt file and meta tags to prevent the robots from archiving your pages and/or indexing them at all and/or putting a snippet beside them…

    It’s all very customizable, for the “good” robots, anyway.
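
    For what it’s worth, a minimal sketch of that in Python (the bot name and disallowed path here are hypothetical; the real directives depend on what you want to block):

        # Write a robots.txt that shuts out one (made-up) abusive crawler entirely
        # and keeps everyone else away from a hypothetical /private/ path.
        # The per-page equivalent for blocking cached copies and snippets is a
        # meta tag in each page's <head>:
        #   <meta name="robots" content="noarchive, nosnippet">
        robots_lines = [
            "User-agent: BadScraperBot",   # hypothetical name for an abusive crawler
            "Disallow: /",
            "",
            "User-agent: *",
            "Disallow: /private/",         # hypothetical path to keep out of indexes
            "",
        ]
        with open("robots.txt", "w") as fh:
            fh.write("\n".join(robots_lines))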

  15. says

    Considering you can stop Google (and others) from caching your page, I don’t think there’s a real comparison to be made. With that out of the way, all they store is an excerpt, so no harm and no foul – and no wholesale redistribution with intent to make money from your original content.

  16. says

    yes, this very much predates Google as an issue. indeed, it’s the core of Ted Nelson’s transclusion idea in his Xanadu project. there, whenever someone read or cited a contribution, the originator got a bit of cash credited to them. i believe citing demanded more cash than a mere read. thus, Xanadu was as much an economic model as it was a technical one.

    but, the world being as it is, things did not work out that way. it was a Very Tall Order, after all, creating such a system. the flip side of that is that the Web and Internet are far less predictable than a system like Xanadu would be. so, if Xanadu sounds crazy or revolutionary, that may only be because people can’t imagine what non-linear perturbations may come from the Internet. still.

    folks, it’s all going to change in the next five years, for exogenous as well as endogenous reasons. petrol will go to $6+ per gallon in 3 years, and there’ll be all kinds of new datasets coming online, in either direct or processed forms. the IP address space will enlarge to include all kinds of things, from the clothes you wear to the TV dinners you might consume. datasets and data streams giving the geographic locations of all cell phones in major metro areas are available now. no one knows what to do with this spatio-temporal data yet, but there are folks who’ll devise ways to use it.

    y’want nonlinearities? y’gonna get ’em!

  17. Azkyroth says

    But again, I think PZ, and most people on the internet, are hip to the internet’s economic philosophy:
    1) If people are willing to do it for free (and better), there’s no point in charging for it.
    2) If a lot of people are reading it, you’ll find some way for your internet hobby to pay for your laundry.
    3) Jealously guarding content is a great way to be irrelevant.

    4) Jealous guarding of content, coupled with self-righteousness and a palpable sense of entitlement, is the internet equivalent of a “kick me” sign.

    This is as it should be.

  18. Caledonian says

    y’want nonlinearities? y’gonna get ’em!

    You’ll get nonlinearities even if you don’t want them.

    That’s when the burning starts.

  19. Caledonian says

    More along the lines of Armageddon.

    (The end of the world, not the movie.)

    [Although, come to think of it, the movie was like a foretaste of the end of the world, only less entertaining.]

  20. says

    Although, come to think of it, the movie was like a foretaste of the end of the world, only less entertaining.

    but the original content PZ is defending, even if not his, tells us “Armageddon” was scientifically silly.

    naw, the Apocalypse model is tiresome. i’m sure it will turn out far more interesting, even if that involves a major migration of scientists and engineers from the USA to places closer to their Asian benefactors. a new exodus, if you like. it may take 50 years, and be as much the result of economic catastrophe in the USA as anything else, but i think the dice have been cast.

  21. Crudely Wrott says

    If you remember, or have learned about, some of the following transitions of the means of communication, you may begin to see a familiar pattern emerge –

    From spoken to written communication.

    From written to telegraphic (dit – dah).

    From telegraphic to telephonic (Hello, Mabel? Gimme Geneva 6, 4989. Thanks).

    From rotary dials and party lines to direct long distance dialing.

    Speaking of radio, from Marconi to home consoles.

    From broadcast audio to broadcast video.

    From AM to FM.

    How about Echo I, the first communication satellite? A mylar balloon in low earth orbit that reflected a signal when it happened to be above the horizon.

    And don’t forget the ferment that greeted Gutenberg’s ability to make the Bible common currency. What a flustercluck.

    It shouldn’t be necessary to mention the vacuum tube, the transistor, or the astounding presence of the Net. The Tubes.

    At each transition, two competing opportunities presented themselves. 1) The means for the speaker (or sender, or network affiliate) to communicate more efficiently and copiously. 2) The opportunity for the listener (or recipient, or media consumer) to have access to more information. In each case there was a great stirring over the notion that less restrictive access to more (read “privileged”) information would somehow gum up the works and ruin everything. This was countered by the popular notion that more people knowing more stuff would leave everyone the wiser.

    These transitions also presented another pair of conflicting potentials. 1) Reserving the new means of communication to a select few, accompanied by the worry that the listener might be too well informed to be “manageable,” and 2) the growing confidence of the listener to make independent and personal decisions more reliably.

    Implicit in all of this is the concept of “public.” In simple terms, what is commonly known or communicated is public. It is of common ownership.

    Given that there is considerable disagreement about the propriety of making any and all communication (like speech, writing, dit-dah, AM & FM, et cetera) public, and given that people are insatiably curious about and suspicious of new information (clues, gossip, the news; you know), there has always been tension over what should properly be allowed to be public.

    It is instructive to observe that the result of more communication (and more information) becoming public, in any historic or current transition, is often an embarrassment to those who seek to control public knowledge, and a consequent boost to the listener’s confidence and willingness to question the assumed veracity of any and all claims of authority. This is the essence of “politics as usual” and is really how we conduct ourselves publicly and privately.

    This is like Phinnegan’s Phinagaling Phactor. (That quantity which, when added to, subtracted from, multiplied by or divided into the answer you got, gives you the answer you should have gotten.)

    In sum, ownership of information (and, I think, of any content that is dumped into the Tubes, or on the airwaves, or in a letter, or in speech, for the consumption of listeners) should be given up as the poor expectation that it is. The insight I have is that none of this should be surprising or considered a “new” threat. SOP is what I see.

    Three people can keep a secret. As long as two of them are dead.

  22. says

    The fact that there are copies of sites and articles is a good thing; I think of it as analogous to replicators in biology.

    When you read a copy, you have an approximate copy in your memory; when Google caches it, it has a more exact copy.
    This benefits the survival of the information.