Dealing with web content scrapers

Two websites recently started plagiarizing full-length articles from Slight Future, republishing them under different author names with publication times that appeared to predate the originals by a day. Plagiarism is supposedly a great compliment, but I’m not good at accepting compliments and wanted these clearly automated copycats off the web. So how do you deal with websites stealing your content?

There is such a thing as international copyright law, but I didn’t want to spend unproductive time and money on lawyers trying to get something removed from the web. I took this on like any engineering challenge and sought technological means to take down the copycats.

Both of the offending websites were set up in a similar fashion: they pulled my syndication feeds every few hours, found new articles, visited my site to grab the full text, and then published it on their own websites. To be clear, the news feeds are indeed intended for syndication, but for personal use. The feeds don’t include full-length articles; they link back to this site for the full article text. To clarify, these were not websites providing end-user functionality like reading lists or alternative reading experiences. I don’t mind those in the slightest, and this website is optimized to work with as many of them as possible.

One of the copycat websites replaced my name as the author with what appeared to be a randomly generated name and set the publication time to one day before my original publication time, while the other omitted the author name and publication time entirely. Both websites were also filled to the brim with advertisements from multiple ad networks.

The scam allowed these sites to siphon some of the traffic from my website and monetize it with their own ads. Who knew there was still a business in doing such things? Then again, the number of websites scraping and republishing StackOverflow questions and answers, and ranking high in Google for technical search queries, is frankly quite astounding. The difference is that I don’t allow republishing, as it only leads to garbage copies appearing on trash sites, whereas StackOverflow allows republishing under a Creative Commons Attribution-ShareAlike 3.0 (CC BY-SA 3.0) license.

Poisoning the well

Did you spot the flaw in how the copycat websites were set up? These sites came back to my feeds for updates every few hours. All I had to do was identify which IPs they used (the same as their public web servers) and then employ the ancient tactic of poisoning the well when they returned for a fresh supply of articles.
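The core trick is just conditional serving keyed on the requester’s IP address. Here’s a minimal sketch of that idea, emphatically not the author’s actual implementation (which was deliberately never shared); the IP addresses are documentation-range placeholders and `feed_for` is an invented name:

```python
# Illustrative sketch only: serve a different feed to known scraper IPs.
# The addresses below are reserved documentation IPs, not real scrapers.
SCRAPER_IPS = {"203.0.113.10", "198.51.100.7"}

def feed_for(remote_ip: str, real_feed: str, poisoned_feed: str) -> str:
    """Return the poisoned feed to known scrapers, the real feed to everyone else."""
    return poisoned_feed if remote_ip in SCRAPER_IPS else real_feed
```

In practice the same effect could also be achieved at the web-server layer (for example, a conditional rewrite rule) without touching application code.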

By manipulating the news feeds, I automatically created thousands of short garbage articles full of repeating phrases and nonsense, which only appeared when the copycats’ IPs requested the feeds. The article links sent the scrapers to a dynamic page with more randomly generated garbage (plus some extra rude words, I must admit). The scrapers gobbled up these garbage articles unquestioningly and republished them like any other article.
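A toy illustration of what such a garbage generator might look like; again, this is not the real code (which the author declined to publish), and the word list, function names, and junk URL are all invented for the sketch:

```python
import random

WORDS = ["drivel", "filler", "noise", "blather", "waffle", "nonsense"]

def garbage_article(rng: random.Random) -> dict:
    """Build one junk feed entry: a nonsense title, a repetitive body,
    and a link pointing at a hypothetical dynamic junk page."""
    phrase = " ".join(rng.choice(WORDS) for _ in range(5))
    return {
        "title": phrase.title(),
        "body": (phrase + ". ") * 20,  # repeating phrases, as described
        "link": f"https://example.com/junk/{rng.randrange(10**6)}",
    }

def garbage_feed(n: int, seed: int = 0) -> list:
    """Generate n junk entries, seeded so output is reproducible."""
    rng = random.Random(seed)
    return [garbage_article(rng) for _ in range(n)]
```

Entries like these would then be templated into the RSS/Atom feed served only to the scraper IPs.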

Do you know what happens when you publish thousands upon thousands of automated junk pages? In addition to defacing the websites, the junk got them delisted from all search engines within a few days. Without traffic from search engines, the revenue opportunities quickly ran out.

Both of the offending websites have since been replaced by domain-parking placeholders, meaning their operators have effectively given up on reviving the domain names for this purpose. I’ll not mention the domain names, as I don’t wish them ever to make a dime off me again. Overall, I consider this a success, but I really wish the web were in a place where actions like this were never necessary.

What could the scrapers have done differently? They should have rate-limited how many articles they stole per day. They should also have picked a website not belonging to a highly vengeful and technical author. I considered wrapping this tool up as a WordPress plugin or otherwise sharing the code, but I don’t believe such a tool ought to exist in the first place. It would be too easy to misuse.

I’ve never had any requests to republish articles from Slight Future elsewhere, and I’m not using Creative Commons. Creative Commons works okay for media such as images, audio, and video. Because of the way content is monetized on the web, it’s really not a workable license for text and only results in rubbish sites being set up to mirror your own. I’ve seen little evidence of Creative Commons–licensed text being used in any way other than as filler content on pages that only wish to serve ads.

That said, a certain tech news site once ran big with a story, cited everywhere, that was peculiarly similar to one of my own and published just ten days after mine. They could have duplicated my research, but I consider it highly unlikely that they’d do the same research, find the same previously unpublished technical details, draw almost identical conclusions, and structure their story exactly the way I did just days after me. Luckily, I’m not one to hold a grudge.

3 thoughts on “Dealing with web content scrapers”

  1. hm, interesting 🙂 But how did you find out which particular IP querying your newsfeed was responsible for which website contents? I’m also querying your newsfeed once… Correction: every ten minutes.

    1. As I say in the article, they queried the feeds from the same public IP addresses as their public web servers.

      As for querying the feeds every ten minutes: a lot of people do that, but you only need to fetch them once a day or so. I don’t publish more often than that.
