The other day, while working on my wife’s website, I realized that archive.org was making snapshots of the site even though I had set up the robots.txt file to block that. Then I took a look at some of my other sites and noticed all of them were being scraped.
I’m not sure when it happened, but it seems that at some point “The Wayback Machine” decided to ignore website owners’ wishes in their robots.txt files and copy content anyway.
I had been following the steps from a page on their own site that seems to have been removed sometime after October 2015: https://web.archive.org/web/20151031123632/https://archive.org/about/exclude.php
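If you want to double-check what your own robots.txt is asking for, here is a rough Python sketch using the standard library’s robots.txt parser. The `ia_archiver` user agent and the two-line rule set are my assumption of what that old exclusion page described, and `is_blocked` and `example.com` are just illustrative names. Keep in mind this only tells you what the file requests. It says nothing about whether a crawler actually honors it, which is the whole problem.

```python
from urllib.robotparser import RobotFileParser

# Assumed rules, in the style the old exclusion page described.
# Swap in the contents of your own robots.txt here.
ROBOTS_TXT = """\
User-agent: ia_archiver
Disallow: /
"""


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text disallows user_agent from url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# example.com is a placeholder domain.
print(is_blocked(ROBOTS_TXT, "ia_archiver", "https://example.com/"))
print(is_blocked(ROBOTS_TXT, "Googlebot", "https://example.com/"))
```

With these rules the first check reports the archive crawler as blocked, while the second shows other crawlers are unaffected, since the rule only names one user agent.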
How to remove your site from Archive.org
So I wanted to throw this post out there as a warning. I wonder how many other website owners did not know about this change.
But you can still remove your site. It just takes a little extra legwork now. According to this page, you can e-mail them and ask for your site to be removed. I was able to do this for mine, but it took about a week and they asked me questions about how long I had owned the domain.
How can I exclude or remove my site’s pages from the Wayback Machine? https://help.archive.org/
You can send an email request for us to review to email@example.com with the URL (web address) in the text of your message.
Why would I want to remove my site from The Wayback Machine?
Well I have a number of reasons.
- I hate the idea of my spelling and grammar mistakes being saved online forever. I’m dyslexic and I find and correct things in old posts all the time.
- I have good backups. I hear lots of stories where people have restored lost or hacked sites from the Wayback Machine. That’s really cool. But that is not a service I need.
- It’s my content, made with hard work over years. I would rather people come to my site to see it instead of going somewhere else. I’m kind of surprised there are no copyright problems with the way they scrape. I don’t think it would be looked at too favorably if I went to a public library and started making copies of books. /shrug
I don’t hate Archive.org
Archive.org is a cool site. I don’t hate it or anything. I guess I just feel a little betrayed that they are ignoring robots.txt. Now people have to take extra steps to get their content removed.
Anyway, that’s my warning to other site owners who might not know about the change.