Divya Manian


Content Models in HTML

So, in the spirit of being the ferret for jargon I don’t understand, I went on a quest to understand what the phrase “Content Models” means.

WTF are “Content Models”?

You will regret asking that question, as it is a rabbit hole. But fear not, as this post can be your summary of the excitement that awaits you. In short, Content Models define the kind of content that can be found within an element in HTML. Historically, this is defined in the DTD for an SGML-derived language (like XML and HTML 4.01).

The Story so Far

In the HTML 4.01 specification, every element is defined together with a content model. For example, a ul element can only contain one or more li elements. An img cannot contain any element within it, so its content model is empty.

HTML 4.01 sorts many elements into two categories, block and inline. This is declared in the HTML 4.01 DTD:

<!ENTITY % block "P | %heading; | %list; | %preformatted; | DL | DIV | NOSCRIPT | BLOCKQUOTE | FORM | HR | TABLE | FIELDSET | ADDRESS">

and

<!ENTITY % inline "#PCDATA | %fontstyle; | %phrase; | %special; | %formctrl;">

Each element is then defined in the DTD along with the kind of content it can contain. For example, while p is a block element, it can contain only inline elements according to the DTD:

<!ELEMENT P - O (%inline;)*            -- paragraph -->
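To make the idea concrete, here is a minimal Python sketch that treats a content model as a set of allowed children. The INLINE and BLOCK sets are a hand-picked, heavily simplified subset of the DTD entities, and the names CONTENT_MODELS and invalid_children are made up for this illustration; they are not part of any specification or library. It also ignores occurrence rules such as the "one or more" in the ul content model.

```python
# Hypothetical, heavily simplified subset of the HTML 4.01 DTD entities.
INLINE = {"#PCDATA", "EM", "STRONG", "A", "IMG", "SPAN"}
BLOCK = {"P", "DIV", "UL", "BLOCKQUOTE", "TABLE"}

# Allowed children per element; an empty set stands for an EMPTY content model.
CONTENT_MODELS = {
    "P": INLINE,           # (%inline;)*
    "UL": {"LI"},          # (LI)+  (the "at least one" part is not checked here)
    "LI": INLINE | BLOCK,  # (%flow;)*, where flow is block or inline
    "IMG": set(),          # EMPTY
}


def invalid_children(parent, children):
    """Return the children that the parent's content model does not allow."""
    allowed = CONTENT_MODELS[parent.upper()]
    return [child for child in children if child.upper() not in allowed]


print(invalid_children("p", ["#PCDATA", "em"]))  # []
print(invalid_children("p", ["div"]))            # ['div']
print(invalid_children("ul", ["li", "p"]))       # ['p']
print(invalid_children("img", ["span"]))         # ['span']
```

A real validator would read these rules from the DTD itself rather than from a hand-written table.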

Some elements (like ins or del) can function either as block level or inline elements depending on how they are used.

HTML 4.01 defines the content models within a DTD, which is referenced by the Doctype declaration at the beginning of an HTML page. In HTML5, the content models are defined within the spec and not in a DTD (this also means the Doctype no longer has to reference a DTD, but leaving it out altogether will trigger quirks mode in browsers, so you are recommended to use the HTML5 Doctype <!doctype html>).
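If you want to check whether a page uses the short HTML5 Doctype, a rough sketch with Python's standard html.parser module could look like this. The class DoctypeFinder and the function has_html5_doctype are hypothetical names for this example, and the check is deliberately strict: only the bare "doctype html" form counts, in any letter case.

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remember the first doctype declaration seen in the markup."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "doctype html".
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def has_html5_doctype(markup):
    """Return True only for the short HTML5 form of the Doctype."""
    finder = DoctypeFinder()
    finder.feed(markup)
    finder.close()
    if finder.doctype is None:
        return False
    return finder.doctype.lower().split() == ["doctype", "html"]


html5_page = "<!doctype html><title>Hello</title>"
html4_page = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd"><title>Hello</title>'
)
print(has_html5_doctype(html5_page))             # True
print(has_html5_doctype(html4_page))             # False
print(has_html5_doctype("<p>No doctype here</p>"))  # False
```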

What’s new in HTML5?

HTML5 does away with the block and inline categorization of content. There are now several categories:

  • Metadata content
  • Flow content
  • Sectioning content
  • Heading content
  • Phrasing content
  • Embedded content
  • Interactive content

Each element can belong to one or more categories, and can behave one way or the other depending on the context. Each category has an expected list of contents. Here is a summary of which elements belong to which categories.
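As a rough illustration of "one or more categories", the Python sketch below maps a handful of elements to the categories listed above. The table covers only a few elements, leaves out context-dependent cases (for example, img also counts as interactive content when it has a usemap attribute), and the names CATEGORIES and elements_in are invented for this example.

```python
# A few elements and their HTML5 content categories (not exhaustive,
# and ignoring categories that depend on attributes or context).
CATEGORIES = {
    "title": {"metadata"},
    "p": {"flow"},
    "em": {"flow", "phrasing"},
    "h1": {"flow", "heading"},
    "section": {"flow", "sectioning"},
    "img": {"flow", "phrasing", "embedded"},
    "a": {"flow", "phrasing", "interactive"},
    "script": {"metadata", "flow", "phrasing"},
}


def elements_in(category):
    """Return the element names in the table that belong to a category."""
    return sorted(name for name, cats in CATEGORIES.items() if category in cats)


print(elements_in("phrasing"))  # ['a', 'em', 'img', 'script']
print(elements_in("metadata"))  # ['script', 'title']
print(elements_in("heading"))   # ['h1']
```

Note how script shows up under both metadata and phrasing, which is exactly the kind of overlap the old block/inline split could not express.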

Why is this good to know?

There is an experimental HTML5 parser available from Firefox 3.6 onwards, and basic HTML5 parsing is available in Opera's Presto 2.5 rendering engine. Suffice it to say, when all browsers start parsing HTML5, it will be relevant to know which content is being rendered in what context in order to write more semantic code.

Browsers have long worked around markup that does not follow the HTML specification, so the more rigorous and semantic we are with our markup, the easier it is for the browser to render!

Or, you can use this knowledge to wisely nod your head when you are trying to parse the messages on the WHATWG mailing list!
