Getting the correct Unicode path within an ISAPI filter

Created 17th February, 2006 07:34 (UTC), last edited 13th May, 2007 06:44 (UTC)

In early February 2006 I'd promised David Wang that I'd look into a problem that I'd been having with fetching the actual URL that was requested by a browser from within the ISAPI filter that runs this site (and others).

It seemed that the IIS server was failing to decode the URL correctly for some UTF-8 Unicode URLs, but not all. The two examples that I cited were a link to the dedication page from Niccolò Machiavelli's The Prince and another to the category Microsoft Windows™.

The problem I was having was that although the ISAPI filter was given the correct URL for the dedication page, it was not getting the correct URL for the category. It seemed that the URL available to the filter was not decoded by IIS in the right way. In order to get the pages to work I needed to redo the page look-ups in an ISAPI extension that was triggered on 404 errors. The extension API is much clearer on how to fetch Unicode values from the server variables and I got this working first time. Unfortunately the ISAPI extension leads to its own problems in working out the actual URL requested.

When I had time to go back through this and add in the diagnostics detailed below I was not able to recreate my original problem. I guess that in some ways this is good as it means I can remove most of the extra work the extension was doing which will help the systems perform better. On the other hand of course it would have been much better to have found the problem and properly understood it. In any case, I've left the diagnostics in the filter and outlined what I've done. Hopefully the more detailed information is still useful.

My best guess for what was happening was that I was expecting the wrong things from the Unicode and non-Unicode interfaces. I'm not going to spend time on going through how these values could have been encoded, but simply go on to analyse how they are encoded.

All of this has been done on IIS 6 running on Windows 2003.

Diagnostics

There are a couple of diagnostics that should help us work out what is going on. The first is to find out what file specification the filter sees (the file specification is the part of the URL between the server address and the query string). We also want to be able to inspect the contents of any server variable to see what the filter is actually given by IIS.

It's worth remembering that these sorts of diagnostics can constitute a security hazard and as such, although I have left them turned on for my personal web site (so that the examples here can actually be used), they should not be turned on for any production server. There is a server configuration option that allows these diagnostics to be turned on and off for any virtual site run on a server.

Seeing the file specification

The first thing I needed to do was to get the filter to report exactly which URL was being handed to it. This can be done by forcing the filter to output this information and then conclude the request:

http://www.kirit.com/Wet%20buttercup?ISAPI.filter.stop=yes

As on other pages discussing URL encoding I'm using Wet buttercup in order to make the URLs less confusing and to avoid any self-references.

How the filter finds the path specification

It's also useful to be able to try to fetch the path specification part of the URL using either the Unicode or ANSI interfaces. There are two ways of reading it:

  1. Fetching the server variable URL
  2. Fetching the server variable UNICODE_URL

The first method returns an eight bit string (which should be in ANSI format, whatever that means) and the second returns a UTF-16 Unicode string.

Again, using these on the Wet buttercup example gives these URLs:

http://www.kirit.com/Wet%20buttercup?ISAPI.filter.url=URL&ISAPI.filter.stop=yes
http://www.kirit.com/Wet%20buttercup?ISAPI.filter.url=UNICODE_URL&ISAPI.filter.stop=yes

The documentation that describes the HTTP_FILTER_CONTEXT call GetServerVariable() isn't totally clear on how it is going to pass the values back. As with many of the Microsoft APIs it uses a buffer to return its results and provides a way for a calling program to determine the correct buffer length.

In both cases (URL and UNICODE_URL) the length is returned as the number of bytes required to store the character sequence (the buffer length returned includes the final NIL character). For the eight bit version the number of characters needed by the buffer is the same as the number of bytes. For the Unicode version the buffer holds UTF-16 code units, each two bytes wide, so the byte count must be halved to find the number of code units. Don't confuse this with the number of Unicode characters either: code points outside the Basic Multilingual Plane occupy two code units.
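Because the returned length counts bytes rather than characters, code that walks a UNICODE_URL value needs to be aware of surrogate pairs when it wants a character count. Here is a minimal sketch of counting code points in a UTF-16 sequence; the helper is my own (using C++11's char16_t for portability), not part of the ISAPI API:

```cpp
#include <cstddef>

// Count Unicode code points in a NIL-terminated UTF-16 sequence.
// A high surrogate (0xD800-0xDBFF) followed by a low surrogate
// (0xDC00-0xDFFF) encodes a single code point, so the code point
// count can be smaller than the code unit count.
std::size_t code_points( const char16_t *s ) {
    std::size_t count = 0;
    while ( *s ) {
        if ( *s >= 0xD800 && *s <= 0xDBFF && s[1] >= 0xDC00 && s[1] <= 0xDFFF )
            ++s; // skip the low surrogate of the pair
        ++count;
        ++s;
    }
    return count;
}
```

For the treble clef URL below, the path /𝄞 is three UTF-16 code units (slash plus a surrogate pair) but only two code points.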

Seeing the server variables

By the time we see them the file specification diagnostics have gone through many of the layers of library code that make up FOST.3™. Although these library calls work in all other places we cannot rule out the possibility that the problem we see is caused by a bug in them. In order to try to rule this out we're going to show the server variables as the raw data that is passed back directly from the GetServerVariable() call. This should give us confidence that the libraries aren't corrupting anything.

In order to do this we are going to add one last query option:

http://www.kirit.com/Wet%20buttercup?ISAPI.filter.show=QUERY_STRING

This is going to use a new function to return these values and send them straight to the browser.

Analysing the encoding

I'm going to use five different URLs as tests (they're not all actual pages on this site):

  • /Wet%20buttercup — Wet buttercup — A simple ASCII path specification.
  • /Niccol%C3%B2%20Machiavelli — Niccolò Machiavelli — The latin small letter O with grave has the same ISO 8859-1 entry number as its Unicode code point. Note that the UTF-8 encoding is different though (and occupies two bytes).
  • /Microsoft%20Windows%E2%84%A2 — Microsoft Windows™ — The trade mark symbol. A pretty common symbol that people are likely to want in URLs.
  • /%E5%AD%AB%E5%AD%90 — 孫子 — The name Sun Tzu in Chinese.
  • /The%20treble%20cleff%2c%20%F0%9D%84%9E — The treble clef, 𝄞 — This clef character (𝄞) is at Unicode code point 1D11E. If you can see this then you have a proper Unicode implementation (and a suitable font).
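The percent-encoded bytes in these URLs are simply the UTF-8 encodings of the code points concerned. A minimal encoder sketch (my own helper, not anything IIS provides) reproduces them:

```cpp
#include <string>

// Minimal UTF-8 encoder for a single code point. A sketch only: it
// does no error handling for surrogates or out-of-range values.
std::string utf8_encode( unsigned long cp ) {
    std::string out;
    if ( cp < 0x80 )
        out += char( cp );
    else if ( cp < 0x800 ) {
        out += char( 0xC0 | ( cp >> 6 ) );
        out += char( 0x80 | ( cp & 0x3F ) );
    } else if ( cp < 0x10000 ) {
        out += char( 0xE0 | ( cp >> 12 ) );
        out += char( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += char( 0x80 | ( cp & 0x3F ) );
    } else {
        out += char( 0xF0 | ( cp >> 18 ) );
        out += char( 0x80 | ( ( cp >> 12 ) & 0x3F ) );
        out += char( 0x80 | ( ( cp >> 6 ) & 0x3F ) );
        out += char( 0x80 | ( cp & 0x3F ) );
    }
    return out;
}
```

U+00F2 (ò) comes out as c3 b2, U+2122 (™) as e2 84 a2 and U+1D11E (𝄞) as f0 9d 84 9e, matching the %-sequences in the list above.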

First of all we're just going to see what happens with the default behaviour for an ISAPI filter.

http://www.kirit.com/Wet%20buttercup?ISAPI.filter.url=URL&ISAPI.filter.stop=yes
http://www.kirit.com/Niccol%C3%B2%20Machiavelli?ISAPI.filter.url=URL&ISAPI.filter.stop=yes
http://www.kirit.com/Microsoft%20Windows%E2%84%A2?ISAPI.filter.url=URL&ISAPI.filter.stop=yes
http://www.kirit.com/%E5%AD%AB%E5%AD%90?ISAPI.filter.url=URL&ISAPI.filter.stop=yes
http://www.kirit.com/The%20treble%20cleff%2c%20%F0%9D%84%9E?ISAPI.filter.url=URL&ISAPI.filter.stop=yes

These URLs give the following results:

/Wet buttercup
/Niccolò Machiavelli
/Microsoft Windows™
/??
/The treble clef, ??

That the last two of these URLs don't work shouldn't really surprise us at all as the characters they use cannot be encoded in any single byte encoding. We don't actually know which encoding the API is using, but one presumes it should be the encoding that is configured for non-Unicode programs in the regional settings.

We need to be a little careful though before we congratulate ourselves that at least most of these work correctly. There's a lot of code within the ISAPI filter which is returning these results so to be sure we understand what is happening we have to look at the numeric sequences that are returned.

Looking at the character sequences for the server variable URL:

http://www.kirit.com/Wet%20buttercup?ISAPI.filter.show=URL
http://www.kirit.com/Niccol%C3%B2%20Machiavelli?ISAPI.filter.show=URL
http://www.kirit.com/Microsoft%20Windows%E2%84%A2?ISAPI.filter.show=URL
http://www.kirit.com/%E5%AD%AB%E5%AD%90?ISAPI.filter.show=URL
http://www.kirit.com/The%20treble%20cleff%2c%20%F0%9D%84%9E?ISAPI.filter.show=URL

These checks give us the following sequences:

2f 57 65 74 20 62 75 74 74 65 72 63 75 70 0
2f 4e 69 63 63 6f 6c f2 20 4d 61 63 68 69 61 76 65 6c 6c 69 0
2f 4d 69 63 72 6f 73 6f 66 74 20 57 69 6e 64 6f 77 73 99 0
2f 3f 3f 0
2f 54 68 65 20 74 72 65 62 6c 65 20 63 6c 65 66 66 2c 20 3f 3f 0

These show that the eight bit interface is interpreting the UTF-8 sequences and sending them to the filter with valid character codes where they can be found. I'm not sure which code page is being used here, but knowing the processing that the ISAPI filter does to display the URLs earlier I can say that the encoding scheme is reversible back to UTF-16 (and then back out to UTF-8 to send to the browser). Internally the URLs go through a VARIANT (they are put into the variant as an eight bit sequence and then pulled back out as a UTF-16 sequence). I'm presuming that it is this that allows the URLs to be correctly reported above.

The Chinese characters have to be thrown out because they cannot be represented in whatever code page IIS is using for the eight bit interface. Interestingly though, the treble clef is converted not to one question mark but to two. The reason for this will become evident when we look at the Unicode interface next.

Using the Unicode interface

If we now try the same thing using the Unicode interface we can see if that gives us better results or not. FOST.3™ uses the Unicode interface by default, but to keep everything in the open I'm going to explicitly tell it to use UNICODE_URL anyway.

http://www.kirit.com/Wet%20buttercup?ISAPI.filter.url=UNICODE_URL&ISAPI.filter.stop=yes
http://www.kirit.com/Niccol%C3%B2%20Machiavelli?ISAPI.filter.url=UNICODE_URL&ISAPI.filter.stop=yes
http://www.kirit.com/Microsoft%20Windows%E2%84%A2?ISAPI.filter.url=UNICODE_URL&ISAPI.filter.stop=yes
http://www.kirit.com/%E5%AD%AB%E5%AD%90?ISAPI.filter.url=UNICODE_URL&ISAPI.filter.stop=yes
http://www.kirit.com/The%20treble%20cleff%2c%20%F0%9D%84%9E?ISAPI.filter.url=UNICODE_URL&ISAPI.filter.stop=yes

Trying these gives us the following URLs:

/Wet buttercup
/Niccolò Machiavelli
/Microsoft Windows™
/孫子
/The treble clef, 𝄞

This looks better in so much as we now have the Chinese characters passed through the interface correctly.

http://www.kirit.com/Wet%20buttercup?ISAPI.filter.show=UNICODE_URL
http://www.kirit.com/Niccol%C3%B2%20Machiavelli?ISAPI.filter.show=UNICODE_URL
http://www.kirit.com/Microsoft%20Windows%E2%84%A2?ISAPI.filter.show=UNICODE_URL
http://www.kirit.com/%E5%AD%AB%E5%AD%90?ISAPI.filter.show=UNICODE_URL
http://www.kirit.com/The%20treble%20cleff%2c%20%F0%9D%84%9E?ISAPI.filter.show=UNICODE_URL

These will show us the following UTF-16 sequences:

2f 57 65 74 20 62 75 74 74 65 72 63 75 70 0
2f 4e 69 63 63 6f 6c f2 20 4d 61 63 68 69 61 76 65 6c 6c 69 0
2f 4d 69 63 72 6f 73 6f 66 74 20 57 69 6e 64 6f 77 73 2122 0
2f 5b6b 5b50 0
2f 54 68 65 20 74 72 65 62 6c 65 20 63 6c 65 66 66 2c 20 d834 dd1e 0

The last one shows us why we got two question marks for the treble clef. Although it is a single Unicode code point it is beyond the range that fits into a single UTF-16 code unit, so it must use two (a surrogate pair). It seems that the character substitution fails to notice that this is one character rather than two and so substitutes each UTF-16 code unit rather than each code point as it should do. Although this should probably be regarded as a bug, in most cases it is not a serious one unless it implies that the look-up tables used to do the substitution are keyed on UTF-16 code units and not code points (as this would be a more serious design flaw). The possibility that it is because the extended Unicode range is disabled by default on a Windows 2003 server shouldn't be ruled out either.
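The arithmetic for splitting a supplementary-plane code point into a surrogate pair is mechanical. A sketch (my own helper, not something IIS exposes):

```cpp
#include <utility>

// Split a supplementary-plane code point (>= 0x10000) into its
// UTF-16 surrogate pair: subtract 0x10000, then put the top ten
// bits into the high surrogate and the bottom ten into the low one.
std::pair< unsigned, unsigned > surrogate_pair( unsigned long cp ) {
    unsigned long v = cp - 0x10000;
    return std::make_pair(
        unsigned( 0xD800 + ( v >> 10 ) ),    // high (lead) surrogate
        unsigned( 0xDC00 + ( v & 0x3FF ) ) );// low (trail) surrogate
}
```

For U+1D11E this yields d834 dd1e, exactly the pair that appears in the sequence for the treble clef URL.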

Unicode with UTF-8, UTF-16 and UTF-32

Although it seems at first cut that IIS is happy to decode the UTF-8 for us, I was curious as to what it would do with path specifications that were not valid UTF-8.

The first thing to try is to encode the URL as ISO 8859-1 (the standard Latin 1 character set commonly used for Western HTML pages).

http://www.kirit.com/Niccol%F2%20Machiavelli?ISAPI.filter.show=URL
http://www.kirit.com/Niccol%F2%20Machiavelli?ISAPI.filter.show=UNICODE_URL

These give the following sequences (the eight bit one first, then the UTF-16 one):

2f 4e 69 63 63 6f 6c f2 20 4d 61 63 68 69 61 76 65 6c 6c 69 0
2f 4e 69 63 63 6f 6c f2 20 4d 61 63 68 69 61 76 65 6c 6c 69 0

Because this works it seems that IIS uses some heuristics to determine which encoding is being used. This also means that we need to know which encoding takes precedence. The following URL has some interesting properties:

/2×½=1

The ISO 8859-1 sequence is as follows:

2f 32 d7 bd 3d 31

This also corresponds to a valid UTF-8 encoding for the code point 5fd which, although part of the Hebrew block, is not assigned. Trying this ISO 8859-1 encoding yields the following UTF-16 sequence:

2f 32 5fd 3d 31 0

This clearly shows that when given the choice IIS interprets a sequence as UTF-8 before it tries any other encoding, even if the code point is not strictly valid. Another similar example is:

/Niccolò«¹»

This can be encoded as ISO 8859-1:

2f 4e 69 63 63 6f 6c f2 ab b9 bb

Trying this gives the following:

2f 4e 69 63 63 6f 6c da6f de7b 0

These results are probably also reasonable: if IIS rejected unassigned code points we would need to update the Unicode configuration data on the server every time new characters were added to the standard, and it might also interfere with legitimate use of the private use areas.
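We can check the da6f de7b result by decoding the four byte sequence and doing the surrogate conversion by hand. A sketch (my own helper; it assumes the four byte sequence is well formed):

```cpp
// Decode a four byte UTF-8 sequence into its code point: three bits
// from the lead byte, then six bits from each continuation byte.
unsigned long decode_utf8_4( const unsigned char *b ) {
    return ( ( b[0] & 0x07ul ) << 18 )
         | ( ( b[1] & 0x3Ful ) << 12 )
         | ( ( b[2] & 0x3Ful ) << 6 )
         |   ( b[3] & 0x3Ful );
}
```

The bytes f2 ab b9 bb decode to code point abe7b, which after the usual surrogate arithmetic (subtract 10000 hex, split into two ten bit halves) gives the pair da6f de7b shown above.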

One final thing we should test is what happens with a UTF-8 sequence that decodes to give the correct character but is in itself invalid. This can be done by encoding the character using a longer sequence than is required. For example, all of the following sequences could be used to generate the small letter O with grave if the UTF-8 decoder is naïvely written, but only the first is actually valid UTF-8.

c3 b2
e0 83 b2
f0 80 83 b2
f8 80 80 83 b2
fc 80 80 80 83 b2

Note that the five and six byte encodings should never be allowed under any circumstances (although some older UTF-8 documentation does describe how to encode and decode them). Here is what IIS gives us:

2f f2 0
2f e0 192 b2 0
2f f0 20ac 192 b2 0
2f f8 20ac 20ac 192 b2 0
2f fc 20ac 20ac 20ac 192 b2 0

Here IIS does correctly reject all but the first as UTF-8 sequences and instead seems to be decoding them based on whatever code page it is using.
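A strict decoder can reject overlong forms by applying the shortest-form rule after decoding, which appears to be what IIS is doing. A sketch of such a check (my own code; for brevity it does not also reject surrogate code points or values above 10ffff, as a full validator should):

```cpp
#include <cstddef>

// Strict check for a single UTF-8 sequence: the lead byte must match
// the claimed length, every continuation byte must be 10xxxxxx, and
// the decoded code point must actually need that many bytes (so the
// overlong e0 83 b2 for U+00F2 is rejected). Five and six byte lead
// bytes are rejected outright.
bool valid_utf8_sequence( const unsigned char *b, std::size_t len ) {
    if ( len == 1 )
        return b[0] < 0x80;
    unsigned long cp = 0;
    if ( len == 2 && ( b[0] & 0xE0 ) == 0xC0 ) cp = b[0] & 0x1F;
    else if ( len == 3 && ( b[0] & 0xF0 ) == 0xE0 ) cp = b[0] & 0x0F;
    else if ( len == 4 && ( b[0] & 0xF8 ) == 0xF0 ) cp = b[0] & 0x07;
    else
        return false;
    for ( std::size_t i = 1; i < len; ++i ) {
        if ( ( b[i] & 0xC0 ) != 0x80 )
            return false;
        cp = ( cp << 6 ) | ( b[i] & 0x3F );
    }
    // Shortest-form rule: minimum code point for each sequence length.
    static const unsigned long min_cp[] = { 0, 0, 0x80, 0x800, 0x10000 };
    return cp >= min_cp[ len ];
}
```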

What you can and cannot do

It is clearly possible to get a much wider range of character sequences through the ISAPI filter API than most web sites use, but we have to be careful how we interpret the results and we also need to be careful about how we encode the URL so that we can be sure the browser request will be correct.

It looks like IIS will correctly decode a URL sent as a UTF-8 sequence and this will appear correctly through the filter's Unicode API. Although some default code page could be used to encode URLs this is clearly not recommended as IIS will by default assume that the request from the browser uses UTF-8.

The biggest thing that we cannot do with an ISAPI filter though is to use a custom encoding format. If you take a look at Wikipedia you will notice that article URLs have the spaces replaced with an underscore character. This has the immediate effect of making the URLs much easier to read as there aren't %20s everywhere. It means of course that an alternative for the underscore must be used where the URL would genuinely include one. The most obvious replacement would be %5F, but this will get decoded to an underscore before we get to see the path specification. There appears to be no way to get the actual request as passed by the browser.

On a final note here is the code that implements the character sequence listings. filter is a class which handles some basic properties of the ISAPI filter such as fetching values from the query string (the filter.query.value() function). widen() converts from UTF-8 to UTF-16 and narrow() converts in the opposite direction. Finally variant_cast manages conversions from VARIANT and I think the rest should be fairly obvious.

string name( narrow( variant_cast< wstring >( filter.query.value( L"ISAPI.filter.show" ) ) ) );
DWORD len( 0 );
if ( name.substr( 0, 8 ) == "UNICODE_" ) {
    // First call with a NULL buffer to find the required length (in bytes,
    // including the terminating NIL). The call fails with
    // ERROR_INSUFFICIENT_BUFFER and sets len, which is why a zero return
    // together with a non-zero len is the expected outcome here.
    if ( 0 == filter.m_pfc->GetServerVariable( filter.m_pfc, const_cast< char * >( name.c_str() ), NULL, &len ) && len != 0 ) {
        // Allocate one extra wchar_t over the reported requirement.
        boost::scoped_array< wchar_t > value( new wchar_t[ len / sizeof( wchar_t ) + 1 ] );
        if ( 0 == filter.m_pfc->GetServerVariable( filter.m_pfc, const_cast< char * >( name.c_str() ), value.get(), &len ) )
            throw FSLib::Exceptions::Field( widen( name ) + L" not available (" + toString( len ) + L" (includes NIL terminator)) - " + ISAPI::formatLastError() );
        // Dump each UTF-16 code unit (including the final NIL) as hex.
        string digits;
        for ( size_t p( 0 ); p < len / sizeof( wchar_t ); ++p ) {
            char num[ 5 ];
            _itoa( value[ p ], num, 16 );
            digits += string( num ) + " ";
        }
        showPage( filter, L"User requested server variable\n\n" + widen( name ) + L": " + widen( digits ) );
    } else
        showPage( filter, widen( name ) + L" not found" );
} else {
    if ( 0 == filter.m_pfc->GetServerVariable( filter.m_pfc, const_cast< char * >( name.c_str() ), NULL, &len ) && len != 0 ) {
        // Allocate one extra byte over the reported requirement.
        boost::scoped_array< char > value( new char[ ++len ] );
        if ( 0 == filter.m_pfc->GetServerVariable( filter.m_pfc, const_cast< char * >( name.c_str() ), value.get(), &len ) )
            throw FSLib::Exceptions::Field( widen( name ) + L" not available (" + toString( len ) + L" (includes NIL terminator)) - " + ISAPI::formatLastError() );
        // Dump each byte (including the final NIL) as hex; utf8() maps the
        // possibly signed char to its unsigned value first.
        string digits;
        for ( size_t p( 0 ); p < len; ++p ) {
            char num[ 3 ];
            _itoa( utf8( value[ p ] ), num, 16 );
            digits += string( num ) + " ";
        }
        showPage( filter, L"User requested server variable\n\n" + widen( name ) + L": " + widen( digits ) );
    } else
        showPage( filter, widen( name ) + L" not found" );
}

Note that the arrays are slightly overallocated (by a byte or a wchar_t, depending on whether it is a Unicode or an ANSI variable being asked for). The system documentation says that the returned length includes the final NIL, but a little overallocation is always better than risking a buffer overrun.

Discussion for this page