How to Deter Scrapers and Hotlinkers

Written By: BloggerSavvy

When I launched my first blog (a Linux-based niche blog) at Ubuntu Linux Help, I did not have much content at first. As content accumulated and traffic grew, I ended up writing some posts that went viral. Three of them were:

I enjoyed the traffic those posts provided (and still provide to this day) as they helped my blog grow. I think they were popular because they provided a valuable resource and elicited a fair bit of discussion, as some of the posts and resulting comments were very outspoken and opinionated. One of the above posts (Why I Quit Windows and Switched to Linux) was a very personal story describing some of my career experiences and how they affected my professional life and thoughts.

Imagine my surprise when I found another blog with identical content on it, where the author who scraped my content even claimed it as his own! At that time I had more than enough technical knowledge to catch out scrapers, but I’d not yet fully experienced the nitty-gritty administrative side, such as contacting hosting providers and filing DMCA notices.

Why is this an important issue? Why should bloggers (or any web site owner, for that matter) take action to mitigate it? Well, a few good reasons immediately come to mind:

  1. It’s annoying. A blogger puts a fair bit of work into his or her post, only to find it copied elsewhere and used to earn revenue (usually via advertising) for the content thief.
  2. It can impact your web hosting cost if you have a busy site. Remember, most hosting accounts have a monthly bandwidth allowance. Exceed the monthly bandwidth and the blog owner incurs extra cost. But wait!… How did scrapers cause my cost to increase? Simply put, they copied the text content onto their blogs and linked the images (in that content) from my blog. This resulted in the text being duplicated on their site and the images being stored on my hosting account. When a web browser viewed the content on their page, it was pulling the images (for that content) from my hosting account, for which I had to pay the bandwidth.
  3. It can reduce your Google (SEO) ranking. How does that happen? When Google finds content, it tries to determine if the content is original (not copied from another site) and proceeds to provide it with page ranking data. It is conceivable that copied content can receive a page rank (and inclusion in Google search results) before Google has found the original blog article. If the original gets found by Google later, how can you ensure that it is recognized as the original post? Don’t misunderstand me, search engines like Google do try to remove duplicate content, but it becomes difficult when your original content becomes listed as the duplicate.

One important issue I realized is that you cannot be overly emotional about such actions. When trying to fix such issues, you need to work with various parties such as Google DMCA, Hosting Providers, etc. Sending them flippant or angry letters is not going to get you the help you need. Remember, large organizations and businesses deal with such issues daily and they are not impressed with theatrics. You need to have these people on your side. Remaining calm, assertive and professional goes a long way to getting support (as does courtesy).

Let’s define a couple things before we move on…

  • A scraper is someone who copies your content and places it on another site, without your permission. In essence this is theft of your intellectual property.
  • A hotlinker is someone who displays your images on another site by coding their page in such a way that it pulls the image stored on your server (hosting account) for display on their page. As I mentioned earlier, this is tantamount to bandwidth theft.

I’m often asked how I discover that my content has been scraped or hotlinked. There are several tools that, when used on a regular basis, can help you reduce the number of content thieves.

Review your web statistics.


All good hosting accounts have built-in web statistics. In my opinion AWStats ranks highly. AWStats has a feature that displays “Links from an external page (other web sites except search engines)”.

The display will show you the URL of the page linking to you as well as the number of hits to your page. It will also tell you how many page loads (of your pages) that URL initiates. For example, if another site’s URL causes 10 hits on my site, then there should also be 10 page loads. If not, something’s up. Take a look at the image below (clicking on the image will show you the large version).

AWStats hotlinker example


What you’re seeing in the edited image above is that another site has displayed something (from my site) 1645 times but never actually sent a visitor to my site. In other words, my content (an image in this case) was shown on another site, but no page visits (referrals) were ever recorded coming from that site. I visited the URL in question, and sure enough, the site was hotlinking to one of my images.
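If your host also gives you raw access logs, the same "hits but no page loads" pattern can be looked for directly. The following is only a rough sketch, not how AWStats itself works: it assumes logs in the common "combined" format, and the function name `suspected_hotlinkers`, the image extension list, and the `own_host` parameter are my own illustrative choices.

```python
import re
from collections import defaultdict
from urllib.parse import urlparse

# Combined log format: ... "GET /path HTTP/1.1" status bytes "referrer" "user-agent"
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "(?P<referrer>[^"]*)"'
)
# Assumption: these extensions cover the images served by the blog.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif")


def suspected_hotlinkers(log_lines, own_host):
    """Return {referrer_host: image_hits} for external hosts whose visitors
    requested images but never loaded a page."""
    image_hits = defaultdict(int)
    page_hits = defaultdict(int)
    for line in log_lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # line is not in the expected format
        host = urlparse(match.group("referrer")).netloc.lower()
        if not host or host == own_host:
            continue  # no referrer, or a request made from our own pages
        path = match.group("path").split("?", 1)[0].lower()
        if path.endswith(IMAGE_EXTENSIONS):
            image_hits[host] += 1
        else:
            page_hits[host] += 1
    return {h: n for h, n in image_hits.items() if page_hits[h] == 0}
```

Calling it with the lines of a log file and your own host name returns the external sites that pulled images without ever sending a page visit. Treat the result as a list of candidates to check by hand, just as I did by visiting the URL in question.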

Use online services.


One service I’ve experimented with is Copyscape. It scans other web sites and gives you the URLs of pages that contain copies of your content. In my case I found scraped content (stolen from my other blog) during the writing of this very post. That is, one of my original blog posts was found via Copyscape to be duplicated verbatim on another web site in another country. They were using it to sell their advertising space and had also hotlinked to all my images. While they did include a link to my original post and did list my URL as the “Original link”, they did so without permission and were using my complete work for their own profit.

A Google search will provide you with a plethora of other sites that provide such services; I’m only mentioning Copyscape as one good example. To get more out of such tools, it’s a good idea to include one distinctive sentence in any given post and search for it in Google (that’s often a very quick method to catch sites that copy your valuable content).
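Once a search turns up a suspicious page, the distinctive-sentence check can also be done on the page text itself. This is a minimal sketch under my own assumptions: you have already saved the suspect page's text, and the helper names `normalize` and `contains_fingerprint` are hypothetical. Normalizing both sides means a scraper's changes to punctuation, case or line breaks should not hide the match.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace so that minor
    reformatting by a scraper does not hide a match."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def contains_fingerprint(page_text, fingerprint):
    """True if the fingerprint sentence appears in the page text."""
    return normalize(fingerprint) in normalize(page_text)
```

The fingerprint should be a sentence that is unlikely to occur anywhere else, so that a hit is strong evidence of copying rather than coincidence.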

Once you’ve been able to isolate a specific URL that has hotlinked and/or scraped content, what can you do about it?

Here’s how the process should work:

  1. Find and isolate a specific URL that is using your content.
  2. Evaluate the need to take action. Are they still sending you traffic or referrals in some fashion? The bottom line question you could ask yourself is “Does their copied content really do my blog enough harm that I have no option but to follow through?” If you’ve answered yes, action is the next step.
  3. Take action.

But wait! What actions are there? Can I really protect myself from a scraper or hotlinker in another country?

Before moving into the types of actions (tools) you can use to protect your content, it’s important to keep one salient point in mind:

If you have some content that is so important, private or valuable that you do not want other people to copy it, then DO NOT POST IT ON THE INTERNET. If you post something of great value on the Internet, no measure of copyright protection is going to prevent an individual from accessing it and copying it. But (there’s always a “but”), there are ways you can impede the profitability and earning power of copied content, and in some cases inject your own revenue generating systems into content copied from your site.

So what kinds of actions can we take?

Make sure you provide appropriate copyright notifications within every page of your blog. If you intend to permit your content to be shared, Creative Commons provides a great copyright tool wherein you can specify how your content is shared. You’ll find that tool at Creative Commons. You can select your jurisdiction in a drop down menu on the upper right side of the home page.

Is the copied content being served alongside Google AdSense ads? If so, you can issue an infringement notice at Digital Millennium Copyright Act – Google AdSense. I’ve found the best method is to fax the notice to the number they provide. In practice, I’ve found that it takes a few days for this to work through their system. However, they have always acted professionally and responsibly, and have indeed taken action.

As a side note: If you are looking for notification templates you can use, take a look at Copyright Law and SEO Part 3 (Sample DMCA Notifications, in HTML and MS Word Format), found on the McAnerin International web site.

Personally, I’ve experienced an excessive number of blogs copying content to Blogger.com based blogs. If you review their Blogger Content Policy, you’ll also find that notifications are to be sent to Google at Digital Millennium Copyright Act – Blogger. Again, I’ve found the response to be incredibly fast and Blogger.com is very quick (in my experience) to remove the violating content.

Many scrapers capture and repost your content by tapping into the RSS feed of your blog.

Side note: If you’re not very familiar with RSS feeds, Commoncraft has a great video, RSS in Plain English.

The fact that they are simply capturing and re-posting your RSS feeds may be indicative of an automated system (with little human intervention), where the scraper may not necessarily read your copied content. If that’s the case, you could try embedding a link back to your blog. There’s a great WordPress plugin that will automatically do just that: RSS Footer. (Another helpful tool is the RSS Link Tagger for Google Analytics, which helps with the tracking of non-AdWords advertising campaigns. If you log into Google Analytics, go to your created campaign and view the traffic sources information, you’ll garner more information.)
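To show the idea behind embedding a link back, here is a rough Python sketch that appends an attribution line to every item in an RSS 2.0 feed. It is not the RSS Footer plugin's code (that plugin does the job inside WordPress); the function name `add_footer_to_feed`, the footer wording, and the assumption that each item carries `link` and `description` elements are mine.

```python
import xml.etree.ElementTree as ET
from html import escape


def add_footer_to_feed(rss_xml, blog_name):
    """Append an attribution link to each item's description in an RSS 2.0 feed."""
    root = ET.fromstring(rss_xml)
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        description = item.find("description")
        if description is None or not link:
            continue  # nothing to attach the footer to, or nowhere to link
        footer = '<p>Originally published at <a href="{0}">{1}</a>.</p>'.format(
            escape(link, quote=True), escape(blog_name)
        )
        # ElementTree escapes the markup on output, as RSS descriptions expect.
        description.text = (description.text or "") + footer
    return ET.tostring(root, encoding="unicode")
```

An automated scraper that republishes the feed verbatim then carries the link to the original post along with the copied text.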

Try earning a bit of revenue from your stolen feed content. If you’re a Google AdSense publisher, try the “AdSense for Feeds” option within your “AdSense Setup”. Additionally, I think the RSS Footer plugin above can also be used to embed advertising content from any other affiliates or advertisers you subscribe to.

For those who are a little stronger at coding, you may want to try out a great little WordPress plugin called “From RSS?”, where, as their site says: “…do something extra for your RSS subscribers, you might want to give them a little bit of extra content, or simply leave out some annoying footer about subscribing to the RSS feed. This plugin facilitates that…”

You can of course Report a Spam Result to Google which they use to help “…maintain the quality of Google search results.”

Finally another tactic that’s available is to report the site to the hosting company. There must be a ton of resources to help isolate who is the ultimate network provider (for the server that hosts the offending web site), of which I primarily use two:

Using the domaintools.com site, I can see which DNS servers are managing the domain (ostensibly indicating who’s hosting it). And with Netcraft’s tools, I can determine who owns the IP address block that the website and server are on, thereby opening another avenue of recourse.
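Both of those lookups start from the offending site's host name or IP address. As a small sketch using only Python's standard library (the helper names `host_from_url` and `server_ip` are hypothetical, and this does not query domaintools.com or Netcraft), you can pull the host name out of the offending URL and resolve it to the address you would then look up:

```python
import socket
from urllib.parse import urlparse


def host_from_url(url):
    """Extract the bare, lowercased host name from a page URL."""
    return urlparse(url).hostname or ""


def server_ip(url):
    """Resolve the URL's host name to an IPv4 address (needs network access)."""
    return socket.gethostbyname(host_from_url(url))
```

The IP address is only a starting point: finding out who owns the address block still requires a lookup with one of the tools mentioned above.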

In conclusion, I’d like to remind everyone of one salient issue. There may always be hotlinkers and scrapers of our content. I don’t think anyone can stop all of them, as doing so would in all likelihood prevent legitimate visitors from viewing your content. Instead, the primary objective of this post was simply to impart the concept of DETERRENCE, so as to reduce such activities.
