
Should Archive.org Ignore Robots.txt Directives And Cache Everything?

Archive.org argues robots.txt files are geared toward search engines, and now plans instead to represent the web “as it really was, and is, from a user’s perspective.”
“We have also seen an upsurge of the use of robots.txt files to remove entire domains from search engines when they transition from a live web site into a parked domain, which has historically also removed the entire domain from view in the Wayback Machine… We receive inquiries and complaints on these ‘disappeared’ sites almost daily.”
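The scenario Archive.org describes is a site that, once parked, publishes a blanket robots.txt directive like the sketch below. Under the Wayback Machine's historical policy, this retroactively hid every archived snapshot of the domain, not just future crawls.

```
# A minimal robots.txt of the kind at issue: a parked domain
# telling every crawler to stay out of the entire site.
User-agent: *
Disallow: /
```

A single wildcard rule like this is why domain parking has such an outsized effect: the directive says nothing about the site's history, yet it erased years of prior snapshots from public view.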
In response, Slashdot reader Lauren Weinstein writes:
We can stipulate at the outset that the venerable Internet Archive and its associated systems like the Wayback Machine have done a lot of good for many years — for example by providing chronological archives of websites that have chosen to participate in their efforts. But now, it appears that the Internet Archive has joined the dark side of the Internet, by announcing that they will no longer honor the access control requests of any websites.
He’s wondering what will happen when “a flood of other players decide that they must emulate the Internet Archive’s dismal reasoning to remain competitive,” adding that if sys-admins start blocking spiders with web server configuration directives, other unrelated sites could become “collateral damage.” But BoingBoing is calling it “an excellent decision… a splendid reminder that nothing published on the web is ever meaningfully private, and will always go on your permanent record.” So what do Slashdot’s readers think? Should Archive.org ignore robots.txt directives and cache everything?
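The blocking Weinstein anticipates would move from the advisory robots.txt file to hard server-level rules. A hypothetical nginx snippet of that kind might look like the following (the user-agent patterns are illustrative; `ia_archiver` has historically been associated with Internet Archive/Alexa crawling, but any real deployment should verify current crawler names):

```
# Hypothetical nginx config: refuse requests whose User-Agent
# matches suspected archiver bots, instead of relying on robots.txt.
if ($http_user_agent ~* "ia_archiver|archive\.org_bot") {
    return 403;
}
```

Unlike robots.txt, which a crawler can simply choose to ignore, this rule is enforced by the server itself — and, as the submitter notes, coarse rules like user-agent or IP-range blocks can catch unrelated, well-behaved crawlers as collateral damage.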

Read more of this story at Slashdot.



Copyright © 2017 by Tom Connelly | All Rights Reserved