Annoying HTML

Created 18th December, 2006 15:01 (UTC), last edited 15th July, 2007 12:46 (UTC)

Writing the parser for the content management module for FOST.3™ was a very complex affair, and it's fair to say that it still isn't without its rough edges. Its best feature, though, is that it can turn fairly unstructured Mediawiki-style mark-up into correct, validating XHTML that is also semantic.

Here are just three of the more complex issues to do with generating correct HTML from the perspective of writing a general content management system like FOST.3™.

I'm going to list some ideas for making these easier to deal with, in no particular order. They're numbered not because I think some solutions are better than others, but to make them easier to refer back to.

You can't embed block level elements within <p> elements

The humble paragraph marker must be one of the most used elements around. Much early HTML used the tag as a paragraph separator rather than the more correct paragraph surround, but all of the paragraphs on this page are correctly surrounded by <p> elements, and this should be the case for every web site that cares about semantic mark up.

The difficulty comes when we allow arbitrary inclusion of other content. The footnotes used on this page¹ [1: This is just an example footnote. And they can span several paragraphs themselves.] are placed in line within the paragraph where they occur, but because the paragraph is surrounded by <p> tags we cannot use any block level tags within the footnote text, at least if we want the page to validate.
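To make the constraint concrete, here is a minimal sketch in Python of a check for block level tags opened inside a <p>. It is not how FOST.3™ works; the class name and the (incomplete) list of block level tags are assumptions chosen for illustration, and the check is deliberately simplistic.

```python
from html.parser import HTMLParser

# A partial list of block level tags, assumed here for illustration only.
BLOCK_TAGS = {"p", "div", "ul", "ol", "table", "blockquote", "pre"}


class BlockInParagraphFinder(HTMLParser):
    """Record every block level tag that is opened inside a <p>."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # Any block level start tag seen while a <p> is open is a problem.
        if self.in_paragraph and tag in BLOCK_TAGS:
            self.problems.append(tag)
        if tag == "p":
            self.in_paragraph = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False


def find_blocks_in_paragraphs(markup):
    finder = BlockInParagraphFinder()
    finder.feed(markup)
    finder.close()
    return finder.problems


bad = find_blocks_in_paragraphs("<p>Text <div>a footnote</div> more text</p>")
good = find_blocks_in_paragraphs("<p>Text <span>a footnote</span> more text</p>")
```

With these inputs, `bad` records the offending `div` while `good` stays empty, which is exactly the distinction a validator draws.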

FOST.3™ uses a complicated system that relies on co-operation between in-line elements and the CSS to style them as block level elements.
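The general shape of that approach can be sketched as follows. This is an illustration only, not the FOST.3™ implementation: the function name, the class names and the CSS rule are all hypothetical. Each paragraph of the footnote becomes a <span>, which is legal inside a <p>, and the style sheet is then responsible for displaying those spans as blocks.

```python
from html import escape


def render_inline_footnote(number, paragraphs):
    """Render a multi-paragraph footnote using only in-line elements.

    Each footnote paragraph becomes a <span>, so the result can sit
    inside a <p> and still validate. Class names are hypothetical.
    """
    inner = "".join(
        '<span class="footnote-para">%s</span>' % escape(text)
        for text in paragraphs
    )
    return '<sup>%d</sup><span class="footnote">%s</span>' % (number, inner)


# CSS the page would also need so that the spans look like paragraphs
# (an assumed rule, shown here as a plain string).
FOOTNOTE_CSS = ".footnote-para { display: block; margin: 0.5em 0; }"

footnote = render_inline_footnote(
    1,
    [
        "This is just an example footnote.",
        "And they can span several paragraphs themselves.",
    ],
)
paragraph = "<p>The footnotes used on this page%s are in line.</p>" % footnote
```

The cost of this design is that the mark-up no longer says "paragraph" anywhere inside the footnote; only the CSS does.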


  1. The most obvious solution is to allow a <p> tag to include block level tags. There are probably a lot of good reasons for assuming this to be a bad idea.
  2. The simplest way to solve this with the existing HTML standard is to not use <p> tags, but to style, for example, a <div class="paragraph"> tag to look like the <p> tag. It isn't semantic though.
  3. Another way to handle this would be to have a new HTML tag, maybe called <aside>. This would allow out-of-band HTML to be embedded at any location. <aside> would be an in-line tag that would be able to contain block-level tags² [2: The HTML 5 proposed specification does contain an aside tag, but it is a block level tag. Although useful, it would be even more useful as an in-line tag that could contain block-level tags.]. User agents could either:
    • float the content off to one side; or
    • draw it in a pop-up window; or
    • overlay it on the content (like the way many user agents handle title attributes).

Nested lists must be within the <li> tag

Lists are extremely hard to generate properly. FOST.3™ translates the Mediawiki mark up into an internal representation which is then translated back out to Mediawiki when a page is edited³ [3: You can see this in action if you use the forum/discussion system on this site. The generated Mediawiki is from the Mediawiki generator, not a copy of what was entered when writing the post.].

This is the legal way to create a nested bullet list:

<ol><li>First bullet
    <ol><li>Nested bullet</li></ol></li>
<li>Second bullet</li></ol>

Note that the nested list must be contained within the <li> element. This means that different logic must be used to generate the outermost list and the nested lists because the <p> elements cannot contain the outermost <ul>/<ol> element.
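A minimal sketch of a generator that respects this rule, written in Python rather than in whatever FOST.3™ uses internally. The function name and the (text, children) data shape are assumptions made for the example. It handles only the list itself; closing the enclosing <p> before the outermost list is the separate piece of logic mentioned above and is not shown.

```python
from html import escape


def render_list(items, tag="ol", indent=0):
    """Render a nested list with each child list inside its parent <li>.

    `items` is a list of (text, children) pairs, where `children` is a
    list of the same shape (possibly empty).
    """
    pad = "  " * indent
    lines = [pad + "<%s>" % tag]
    for text, children in items:
        if children:
            # Keep the <li> open, emit the nested list, then close the <li>.
            lines.append(pad + "  <li>" + escape(text))
            lines.append(render_list(children, tag, indent + 2))
            lines.append(pad + "  </li>")
        else:
            lines.append(pad + "  <li>%s</li>" % escape(text))
    lines.append(pad + "</%s>" % tag)
    return "\n".join(lines)


items = [("First bullet", [("Nested bullet", [])]), ("Second bullet", [])]
markup = render_list(items)
print(markup)
```

The important detail is that the closing </li> is deferred until after the recursive call, so the nested <ol> is always a child of the item it belongs to.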


  1. Allow the <ul> and <ol> tags to be included within the previous <p> tag. This has the advantage that now the list always starts within the logically outer content carrying tag.
  2. Simply allow <ul> and <ol> tags to contain child <ul> and <ol> tags as well as <li> tags. Most user agents already deal properly with this situation because so many web sites contain it already. It should be a relatively simple affair to codify this in the standard:
<ol><li>First bullet</li>
  <ol><li>Nested bullet</li></ol>
<li>Second bullet</li></ol>

Placement of <input type="hidden" … /> is the same as other <input> sub-types

One of the advantages of using a framework is that it should be able to handle a lot of the common functions needed in the interaction between HTTP and HTML. One of these is preserving state across requests and making sure that interactions that are idempotent from the user's point of view actually are idempotent. This is most often done through hidden fields within forms.

The problem is of course that the part of the code that deals with automating the output of these hidden fields is rarely in the same place as the code that deals with the other fields that go into the form. For example, hidden fields used to guard against double submission will often be generated by the framework at the point where the <form> element is generated.

The locations where it is legal to have a <form> element aren't the same as the locations where <input type="hidden" … /> elements are allowed. This means that the framework must remember the hidden fields and place them when it places other <input> elements. This in turn makes the location of all of the <input> elements unpredictable, which means that using <input> nested inside <label> elements, for example, is liable to break.
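The bookkeeping this forces on a framework might look something like the following Python sketch. It is not FOST.3™ code: the FormBuilder class, its method names and the field names are hypothetical, and it assumes a strict doctype in which a <form> may directly contain only block level elements, so the hidden inputs have to wait for a block level container.

```python
from html import escape


class FormBuilder:
    """Hypothetical form helper that defers its hidden fields."""

    def __init__(self, action):
        self.action = action
        self.hidden = []  # (name, value) pairs remembered for later
        self.rows = []    # already rendered visible fields

    def add_hidden(self, name, value):
        # Nothing is written yet; the field is only remembered.
        self.hidden.append((name, value))

    def add_text(self, name, label):
        self.rows.append(
            '<label>%s <input type="text" name="%s" /></label>'
            % (escape(label), escape(name))
        )

    def render(self):
        hidden = "".join(
            '<input type="hidden" name="%s" value="%s" />'
            % (escape(name), escape(value))
            for name, value in self.hidden
        )
        # The hidden inputs share a block level <div> with the visible
        # fields instead of sitting directly inside the <form>.
        return '<form action="%s" method="post"><div>%s%s</div></form>' % (
            escape(self.action),
            hidden,
            "".join(self.rows),
        )


form = FormBuilder("/post")
form.add_hidden("token", "abc123")  # e.g. a double submission guard
form.add_text("title", "Title")
markup = form.render()
```

Even in this small sketch the hidden field is registered when the form is created but only written once the visible fields are known, which is the coupling described above.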


  1. Provide a special container tag which is only allowed to contain <input type="hidden" … /> elements and is not rendered.
  2. Allow <input type="hidden" … /> elements to appear anywhere in the HTML stream.
  3. Allow <input type="hidden" … /> elements to appear immediately after a <form> tag or before a </form> tag.