I do not think that means what you think it means
Tuesday, October 21 2008
There has been some kerfuffle in the past day or two about the National Library of New Zealand crawling New Zealand websites. Their crawler doesn’t honour robots.txt except by prior arrangement. (They have their reasons, but let’s leave that for now.)
This has caused some angst for people who do things with GET. Here is an edited snippet from the NZNOG list:
A: The uncertainty principle begins to apply: by crawling entire sites they may begin to interact with the content on the sites inadvertently. For example, there can be links to flag content as inappropriate. We use robots.txt to prevent crawlers from hitting this kind of link as well as indexing our APIs (which return XML | JSON and are no use to a crawler, but which they seem to love indexing).
B: Seeing as HTTP requires GET to be idempotent, and not take any action other than retrieval, crawlers won’t “interact” with well-designed websites if by “interact” you mean “change stuff”.
A: As far as the GET requests to links such as flagging content being idempotent, no one has said that they aren’t. In the context of section 9.1.2 of the RFC, idempotent means that multiple identical requests have no greater side effect than the original request.
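A has the right of it as far as idempotency goes. “Idempotent” is one of those words that is so frequently misunderstood that it would be better to paraphrase it: repeating the request has no more effect than making it once. It does not mean that the request has no effect at all.
But B is spot on about GET being for retrieval only. Section 9.1.1 of RFC 2616 says: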
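To make the distinction concrete, here is a minimal sketch of a "flag as inappropriate" link modelled in memory. The names `flagged`, `flag_content` and `view_content` are made up for illustration and do not come from any real site. Flagging is idempotent (a second identical call changes nothing further) but it is not safe, because the first call does change state.

```python
# Hypothetical in-memory model of a site with a "flag as inappropriate" link.
flagged = set()


def flag_content(item_id):
    """Idempotent but NOT safe: the first call changes state, repeats add nothing."""
    flagged.add(item_id)
    return sorted(flagged)


def view_content(item_id):
    """Safe, and therefore also idempotent: pure retrieval, no state change."""
    return {"id": item_id, "flagged": item_id in flagged}


before = view_content(42)   # nothing has been flagged yet
first = flag_content(42)    # state changes here
second = flag_content(42)   # identical request, no further change
after = view_content(42)    # the item is now flagged
```

A crawler that follows the flag link once has already "interacted" with the site, even though every repeat visit is harmless.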
Implementors should be aware that the software represents the user in their interactions over the Internet, and should be careful to allow the user to be aware of any actions they might take which may have an unexpected significance to themselves or others.
In particular, the convention has been established that the GET and HEAD methods SHOULD NOT have the significance of taking an action other than retrieval. These methods ought to be considered “safe”. This allows user agents to represent other methods, such as POST, PUT and DELETE, in a special way, so that the user is made aware of the fact that a possibly unsafe action is being requested.
Naturally, it is not possible to ensure that the server does not generate side-effects as a result of performing a GET request; in fact, some dynamic resources consider that a feature. The important distinction here is that the user did not request the side-effects, so therefore cannot be held accountable for them.
In any event, given the number of broken or malicious agents out there that ignore robots.txt for far worse reasons than the National Library of New Zealand’s harvester, it’s probably still not smart for GET to do anything other than retrieve a resource.
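One way to follow that advice is sketched below, assuming a hypothetical handler for a flag URL. `handle_flag`, its arguments and the choice of status codes are illustrative assumptions, not taken from the sites discussed on the list. GET and HEAD only report state; the change happens on POST.

```python
from http import HTTPStatus

SAFE_METHODS = {"GET", "HEAD"}


def handle_flag(method, item_id, flagged):
    """Hypothetical handler for a flag URL; returns (status, is_flagged)."""
    if method in SAFE_METHODS:
        # Retrieval only: report the current state and change nothing.
        return HTTPStatus.OK, item_id in flagged
    if method == "POST":
        # The state change is reserved for an explicitly unsafe method.
        flagged.add(item_id)
        return HTTPStatus.NO_CONTENT, True
    # Anything else is rejected without touching the state.
    return HTTPStatus.METHOD_NOT_ALLOWED, item_id in flagged
```

With this shape, a crawler that ignores robots.txt and issues a GET for every link it finds cannot flag anything; only a deliberate POST, such as a form submission, does.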
Tags: inconceivable ~ idempotent ~ http