Errors in IIS's custom 404 error handling

The customised 404 error message feature of IIS is pretty cool, and if you never want to do anything really interesting with it then it works fine.

The problem with the feature is slightly esoteric and has to do with loss of information. The feature doesn't preserve what the web site visitor actually asked for.

What IIS does

IIS allows us a couple of ways to customise the page that is displayed when a file is not found on the server:

Supply an HTML file that is used for all 404 (file not found) errors.
Execute some code on the server to help users find the information they want.

Clearly the second of these options is better in terms of delivering a good site to your users. It is fairly easy to configure IIS to execute an ASP file or ISAPI extension when it fails to find a page to deliver and I won't go into the configuration here.

The issue is that IIS rebuilds the URL that the browser requested, but IIS builds it based on its (possibly faulty) interpretation of the URL and then uses this to perform an internal redirect. What this means is that most of the information in the original request is lost.

When executing the custom error IIS arranges for a query string to be made available to the page which gives the URL for the page that couldn't be found. For example, if you were to type in:

http://www.kirit.com/missing.html

on your browser then IIS will make the following query available to my custom 404 handler:

404;http://www.kirit.com:80/missing.html

which gives everything that was input. If there was a query string at the end then the query string is preserved too. For example:

http://www.kirit.com/missing.asp?id=34

will end up as

404;http://www.kirit.com:80/missing.asp?id=34
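As a rough illustration of what a handler has to do with this string, here is a minimal Python sketch that splits it into the status code and the rebuilt URL. The function name parse_iis_404_query is made up for this example, and the sketch assumes the query string always has the "status;url" shape shown above.

```python
from urllib.parse import urlsplit


def parse_iis_404_query(query_string):
    """Split a string like '404;http://host:80/path' into (status, url).

    Hypothetical helper: assumes the 'status;url' shape shown in the text.
    """
    status, sep, url = query_string.partition(";")
    if not sep or not status.isdigit():
        raise ValueError("not in the expected 'status;url' form")
    return int(status), url


status, url = parse_iis_404_query("404;http://www.kirit.com:80/missing.asp?id=34")
parts = urlsplit(url)
print(status, parts.hostname, parts.port, parts.path, parts.query)
```

Only the first semicolon is treated as the separator, so a semicolon later in the URL is left alone.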

So far so good. But what if there are some funny characters in the URL? What if there's a space? Now it gets interesting:

http://www.kirit.com/missing file.html

This is not really a well formed URL. The browser will see the space and actually send something slightly different to the server. It will replace the space with %20 and the file it asks for will therefore be missing%20file.html (the link already does this otherwise this page would not validate).

The server understands what's happened and knows that any '%20' it sees is really a space and changes it back again. The query that it passes to the 404 handler is now this:

404;http://www.kirit.com:80/missing file.html

And this looks fine, but wait a moment. The query sent to the 404 handler isn't actually what the browser sent to the server, but rather what the server thought the browser meant. This behaviour works well in most situations, but it means that there are some situations where the server doesn't allow the script to work out what is actually meant to be going on.
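The round trip can be imitated in Python. This is only a sketch: quote and unquote from the standard library stand in for what the browser and the server do, and the second request path below is an invented example, not something from this site. It shows that decoding is many-to-one, so the decoded text alone cannot tell you which request was really made.

```python
from urllib.parse import quote, unquote

typed = "/missing file.html"
sent = quote(typed)      # stand-in for the browser escaping the space
seen = unquote(sent)     # stand-in for the server decoding it again

# An invented second request: %66 is just another way to write 'f'.
other_sent = "/missing%20%66ile.html"
other_seen = unquote(other_sent)

print(sent)
print(seen)
print(sent != other_sent and seen == other_seen)
```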

On this site the custom error is used to not only display those 'File not found' messages that have already been linked to, but also things like this page (when they're not served by an ISAPI filter, but that's another story). [1: After going through how URLs are handled by ISAPI filters, all the articles are now served by the filter and the 404 handler only needs to do an article search.] What happens is that when IIS can't find the file, the name used is checked against a database of articles. For our missing.html example the order is something like this:

IIS checks the directory where the web site lives on the server for the file. If found then this file is returned to the browser.
If there is no file then IIS executes the 404 handler. This looks in the database for an article with that name. If found then it returns that article.
If the handler can't find an article either, we have to return a 404 error to the browser.
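The three steps above can be sketched in Python. This is not the code that runs on this site: the function resolve, the plain dictionary standing in for the article database, and the site root path are all assumptions made for the example.

```python
from pathlib import Path


def resolve(name, site_root, articles):
    """Return (status, body) using the three-step order described above.

    Hypothetical sketch: 'articles' is a dict standing in for the database.
    """
    # Step 1: a real file in the web site's directory wins.
    candidate = Path(site_root) / name
    if candidate.is_file():
        return 200, candidate.read_text()
    # Step 2: otherwise look for an article with that name.
    if name in articles:
        return 200, articles[name]
    # Step 3: nothing found anywhere, so it really is a 404.
    return 404, "File not found"


articles = {"Errors in IIS's custom 404 error handling": "(article text)"}
print(resolve("missing.html", "/no/such/site/root", articles))
print(resolve("Errors in IIS's custom 404 error handling", "/no/such/site/root", articles))
```

A real handler would also have to guard against names that escape the site root; that is left out here to keep the order of the steps visible.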

So given that the name of this article is "Errors in IIS's custom 404 error handling", IIS fails to find a file with that name, but the custom 404 handler does find one in the database and you get to read this.

So far it doesn't seem too bad, but there's a problem with all of this and like the best problems it isn't obvious and nobody would ever think about it until it rears its ugly head and bites somebody.

The ugly problem

There are all sorts of complex issues to do with language and alphabets on the web. The web server has to make a best guess about what is asked of it and how to respond. For most purposes those best guesses are fine and everybody's happy. Occasionally the server gets it wrong and you get a 404 error when you're not expecting it.

Here's a simple example of the browser making a mistake. Suppose I wrote an article called "Really? How Odd…". Nothing too contentious about that. What would the URL for the article be? Surprisingly the following is not right (we know that spaces need to be %20) [2: These examples don't behave in exactly this way on this site. Because it is impossible for a 404 handler to tell if a question mark is part of the path specification or not, this site strips everything after the first question mark when it does an article search.]:

http://www.kirit.com/Really?%20How%20Odd..

The reason is that the question mark serves a special purpose in URLs and that is to separate the file name from the information given to the file to do its job. The correct URL for the article is actually:

http://www.kirit.com/Really%3F%20How%20odd..

This is because, in the same way that %20 is always translated to a space, %3F is always translated to a question mark.

The special thing that the question mark does is separate the file name from extra information that the file may use. The first example will therefore look for a file called "Really" and if it finds one it will send the information "%20How%20Odd…" to it. The second URL will look for a file called "Really? How Odd…" and send no extra information to it.
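The difference between the two URLs is easy to see with a standard URL parser. The sketch below uses Python's urlsplit, which splits on the literal question mark before any decoding happens; split_request is a name invented for this example.

```python
from urllib.parse import unquote, urlsplit


def split_request(url):
    """Return (file name, extra information), decoding each part separately."""
    parts = urlsplit(url)  # splits on the first literal '?' only
    return unquote(parts.path), unquote(parts.query)


first = split_request("http://www.kirit.com/Really?%20How%20Odd..")
second = split_request("http://www.kirit.com/Really%3F%20How%20odd..")
print(first)
print(second)
```

Because the split is done on the encoded text, an escaped %3F stays part of the file name and only turns into a question mark afterwards.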

Clicking on the two links gives us a hint at the problem with doing anything interesting with custom errors. Although there are two types of question mark in the original information sent to the server, the server just sends the same question mark to the 404 handler. For both of the above the first part of the URL is "Really?" and only changes after that.

What if I wrote an article called "Really? 2+2=5"? This is where the information loss hurts. Consider two URLs for it: one with the question mark left as it is, and one with it escaped as %3F.

The first URL means a file called "Really" which is then given the information " 2+2=5" (this actually means that the parameter " 2 2" is given the value "5"). The second will look for a file called "Really? 2+2=5" and gives that file no parameters.

The first version is not the correct way to encode the article name. Because the question mark is used to separate the path specification from the query it must be escaped, i.e. changed to %3F. By the time that IIS has had its way with them though there isn't much to choose between them. If we parse the URL sent to our 404 handler according to the web standards we will look for the wrong article.
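To make the loss concrete, here is a hedged Python sketch. The two URLs are my reconstruction of the unescaped and escaped forms, so treat them as assumptions. Python's unquote stands in for the server's decoding, and parse_qs shows how a standards-following parser reads the query.

```python
from urllib.parse import parse_qs, unquote, urlsplit

# Assumed forms of the two URLs (reconstructed for this example).
unescaped = "http://www.kirit.com/Really?%202+2=5"
escaped = "http://www.kirit.com/Really%3F%202+2=5"


def as_sent(url):
    """What the request means when split before decoding."""
    parts = urlsplit(url)
    return unquote(parts.path), parse_qs(parts.query)


def after_decoding(url):
    """What is left once everything after the host has been decoded."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return unquote(target)


print(as_sent(unescaped))
print(as_sent(escaped))
print(after_decoding(unescaped) == after_decoding(escaped))
```

Split before decoding, the two requests are clearly different. Decoded first, they become the same text, and nothing downstream can recover which one was asked for.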

Say I'm thinking of adding a feature to the article to do a simplified version if you wanted to print it. I might then use the "?print" convention at the end of the article name to generate this file (actually I use CSS, but you never know what might be needed in the future).

So let's revisit the articles and look at the URLs for the normal and print versions.

Both of these end up at the same place (this article) and both versions happen to look the same because I haven't done anything with the extra parameter.

But what about this article?

Using ?print to print articles

I might have an article like this to describe some issues with the print version of this main article. Given the URL that the browser actually requests, it is easy to tell all of these apart. Alternatively, if IIS built the query string from the encoded URL as it first sees it, we could decode it ourselves (and of course IIS provides some handy routines for doing just that).
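If the handler were given the encoded request, telling these apart would take only a few lines. The following Python sketch assumes a raw request path is available (which is exactly what IIS does not provide here). The function interpret and the example paths are hypothetical.

```python
from urllib.parse import unquote


def interpret(raw_target):
    """Return (article name, wants_print) from a still-encoded request path.

    Hypothetical sketch: split on the first literal '?', then decode.
    """
    raw_path, _, raw_query = raw_target.partition("?")
    article = unquote(raw_path).lstrip("/")
    return article, raw_query == "print"


# Assumed encodings of the article name, with and without the print flag.
normal = interpret("/Using%20%3Fprint%20to%20print%20articles")
printed = interpret("/Using%20%3Fprint%20to%20print%20articles?print")
print(normal)
print(printed)
```

The escaped %3F inside the name never collides with the literal question mark that introduces the print flag, because the split happens before anything is decoded.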

Categories: Bugs ● Internet