WD-doctypes-960302

HTML Dialects: Internet Media and SGML Document Types

W3C Working Draft 06-Mar-96

This version: http://www.w3.org/pub/WWW/TR/WD-doctypes-960302
$Id: WD-doctypes.html,v 1.11 1996/03/05 17:33:40 connolly Exp $
Latest version: http://www.w3.org/pub/WWW/TR/WD-doctypes
Authors: Daniel W. Connolly <connolly@w3.org>

Status of this document

This is [not yet] a W3C Working Draft for review by W3C members and other interested parties. It is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress". A list of current W3C working drafts can be found at: http://www.w3.org/pub/WWW/TR

Note: since working drafts are subject to frequent change, you are advised to reference the above address, rather than the addresses of working drafts themselves.

Abstract

The HTML 2.0 specification, RFC1866, defines an SGML application and an Internet media type. The specification notes that extensions are planned, but only the text/html; level=2 internet media type and the "-//IETF//DTD HTML 2.0//EN" document type are defined. This document suggests the use of URIs as system identifiers for document type definitions, allowing decentralized evolution of the language. The use of marked sections as a transition technique and the continued use of the level mechanism for standardized points in the evolution path are discussed.

Introduction
Problem Statement
References

Introduction

The goal of any HTML specification should be to promote confidence in the fidelity of communications using HTML. This means:

making it clear to authors what idioms are available
making it clear to implementors how to interpret those idioms
keeping HTML simple enough that it can be implemented
making HTML expressive enough that it can represent a useful majority of the contemporary communications idioms in this community
making some allowance for expressing idioms not captured by the specification
addressing relevant interoperability issues with other applications and technologies

HTML 2.0 specifies a set of idioms widely used and supported as of June of 1994. But HTML and the web are still in a stage of rapid innovation and evolution, and will be for the foreseeable future. The HTML 2.0 specification fails to accommodate this evolution--it fails to meet goal #5, and goal #4 cannot be met by any frozen document, as "contemporary communications idioms" evolve over time.

Examples of this evolution include the introduction of forms and tables. In each case, information providers suddenly had two kinds of clients: those with support for the new feature, and those without. They were faced with the following choices:

Stick to the lowest common denominator: This sacrifices rich information delivery for ubiquitous access
Exploit the new feature: Some clients will fail to support the new feature, and instead see "noise." Some information providers employ a "You must have a forms-capable browser to access this page" disclaimer.
Make the choice explicit: This is the "click here if your browser supports forms" phenomenon. The information provider maintains two representations: feature-rich and feature-poor. The consumer's reading experience is disrupted to make an irrelevant technical decision that they may not be equipped to make.

Optimally, the system should obviate the need for information providers and consumers to deal with this issue explicitly. Interoperability between new and old components should be automatic.

This document proposes a mechanism that obviates the need for consumers to explicitly deal with the issue. The mechanism does not alleviate the information provider's burden, but it does increase reliability even in the case that information providers are unwilling to invest the effort necessary to support old clients.

Problem Statement

Consider the following documents:

Level 0: Simple HTML

<title>Example: Simple HTML</title>
<p>A paragraph with a <a href="#dest">link</a>.
<ul>
<li>a list
<li>of <a href="dest">items</a>
</ul>

Level 1: Phrase Markup, Nested Lists, and Images

<title>Example: Phrase Markup, Nested Lists, and Images</title>
<p>A paragraph with <em>emphasis</em> and an <img alt="image"
src="foo.png">.
<ol>
<li>Section 1
<li>Section 2
 <ol>
 <li>Section 2.1
 <li>Section 2.2
 </ol>
<li>Section 3
</ol>

Level 2: Forms

<title>Example: Forms</title>
<h1>Forms</h1>
<form action="/cgi-bin/test" method=POST>
<p><input name=x>
<p><input name=y>
<p><input name=z>
</form>

Level 3: Tables, Objects, and Figures

<title>Example: Tables, Inserts, and Figures</title>
<table>
<tr><th>Col 1 <th>Col 2 <th>Col 3
<tr><td>A     <td>B     <td>C
<tr><td>1     <td>2     <td>3
</table>
<fig>
<caption>Figure 1: A Movie</caption>
<object data="movie.mpg">
[Movie elided]
</object>
</fig>

There is a convention among HTML user agents to ignore unrecognized markup. Given the above documents, HTML user agents will behave reliably for documents containing only markup they support. In the face of unrecognized markup, the reliability varies:

HTML Document vs. User Agent Features
Document:	Level 0	Level 1	Level 2	Level 3
Level 0 User Agent	100% fidelity	phrase markup and images lost	forms shown as noise	tables and figure captions shown as noise
Level 1 User Agent	100% fidelity	100% fidelity	forms shown as noise	tables and figure captions shown as noise
Level 2 User Agent	100% fidelity	100% fidelity	100% fidelity	tables and figure captions shown as noise
Level 3 User Agent	100% fidelity	100% fidelity	100% fidelity	100% fidelity
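
The table can be read as a simple rule: a user agent reproduces a document faithfully when its level is at least the level of the document, and otherwise the damage depends on which feature the document uses. A minimal sketch in Python follows; the names `LOSS` and `fidelity` are illustrative and not part of any specification.

```python
# Illustrative restatement of the table above. Levels are the integers
# 0-3 used in the example documents; the names here are hypothetical.
LOSS = {
    1: "phrase markup and images lost",
    2: "forms shown as noise",
    3: "tables and figure captions shown as noise",
}

def fidelity(ua_level: int, doc_level: int) -> str:
    """Return the table entry for a user agent level and a document level."""
    if ua_level >= doc_level:
        return "100% fidelity"
    # The loss is determined by the document's level, not the user agent's.
    return LOSS[doc_level]

if __name__ == "__main__":
    for ua in range(4):
        print(ua, [fidelity(ua, doc) for doc in range(4)])
```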

A Robust Definition of the `text/html` Internet Media Type

Actually, none of the above documents conforms to the specification for the text/html media type given in [RFC1866] -- they are missing a document type declaration, e.g.:

<!doctype html public "-//IETF//DTD HTML 2.0//EN">

The HTML 2.0 specification advises implementors to infer the above declaration if none is given. This is poor advice since, in practice, the chance that such a document conforms to the HTML 2.0 DTD is very small [Adams95]. (cite Tim Bray at opentext, regarding %age of valid HTML docs?)

Rather than binding text/html to any particular DTD, we define it to be an SGML document type that includes HTML level 1, as defined by [RFC1866]. (An SGML document type t1 includes t2 if every document conforming to t2 also conforms to t1.)

We define a text/html body to be an SGML document entity whose DTD is externally referenced; i.e. the body begins with one of

<!doctype html public "..." system "...">
<!doctype html public "...">
<!doctype html system "...">
<!doctype html>
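
As a rough illustration of how a user agent might pick the external identifiers out of these four forms, here is a sketch in Python. It assumes the declaration is written exactly as in the forms above (keywords `public` and `system`, double-quoted literals, no `>` inside an identifier); it is not an SGML parser, and the function name `external_ids` is hypothetical.

```python
import re

# Simplifying assumptions: double-quoted identifiers that do not
# contain ">", and the keyword forms shown in the text above.
DOCTYPE_RE = re.compile(r'<!doctype\s+html\b(?P<rest>[^>]*)>', re.IGNORECASE)
PUBLIC_RE = re.compile(r'\bpublic\s+"(?P<id>[^"]*)"', re.IGNORECASE)
SYSTEM_RE = re.compile(r'\bsystem\s+"(?P<id>[^"]*)"', re.IGNORECASE)

def external_ids(body: str):
    """Return (public_id, system_id) for a body that begins with a doctype.

    Either identifier may be None. Returns None when the body does not
    begin with a document type declaration for html.
    """
    match = DOCTYPE_RE.match(body.lstrip())
    if match is None:
        return None
    rest = match.group("rest")
    public = PUBLIC_RE.search(rest)
    system = SYSTEM_RE.search(rest)
    return (
        public.group("id") if public else None,
        system.group("id") if system else None,
    )

if __name__ == "__main__":
    print(external_ids('<!doctype html public "-//IETF//DTD HTML 2.0//EN">'))
    print(external_ids("<!doctype html>"))
```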

And we remove the default from the level parameter:

Media Type name: text
Media subtype name: html
Required parameters: none
Optional parameters: level, charset
Encoding considerations: any encoding is allowed
Security considerations: Anchors, embedded images, and all other elements which contain URIs as parameters may cause the URI to be dereferenced in response to user input. In this case, the security considerations of [URL@@] apply.
The widely deployed methods for submitting forms requests -- HTTP and SMTP -- provide little assurance of confidentiality. Information providers who request sensitive information via forms -- especially by way of the `PASSWORD' type input field -- should be aware and make their users aware of the lack of confidentiality.

The optional parameters are defined as follows:

Level: The level parameter specifies the feature set used in the document. The level is an integer number, implying that any features of same or lower level may be present in the document. Level 1 is all features defined in [RFC1866] except those that require the FORM element. Level 2 includes form processing. There is no default. In the absence of a level parameter, the <!doctype ...> in the body determines the level.
Charset: The charset parameter (as defined in section 7.1.1 of RFC 1521[MIME]) may be given to specify the character encoding scheme used to represent the HTML document as a sequence of octets. The default value is outside the scope of this specification; but for example, the default is `US-ASCII' in the context of MIME mail, and `ISO-8859-1' in the context of HTTP [HTTP].
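
The fallback rules for the two parameters can be sketched as follows. This is an illustration only: the function names, the context labels `"mime-mail"` and `"http"`, and the lookup table from public identifier to level are assumptions made for the example. The single table entry pairs the HTML 2.0 public identifier with level 2, as the abstract does.

```python
from email.message import Message

# Assumed lookup table: only the identifier named in this document.
LEVEL_BY_PUBLIC_ID = {"-//IETF//DTD HTML 2.0//EN": 2}

# Defaults quoted in the text above; the context labels are hypothetical.
DEFAULT_CHARSET = {"mime-mail": "US-ASCII", "http": "ISO-8859-1"}

def _param(content_type: str, name: str):
    """Read one parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_param(name)

def effective_level(content_type: str, public_id):
    """Level from the media type parameter, else from the doctype."""
    level = _param(content_type, "level")
    if level is not None:
        return int(level)  # raises ValueError if the level is not an integer
    return LEVEL_BY_PUBLIC_ID.get(public_id)  # None when the DTD is unknown

def effective_charset(content_type: str, context: str):
    """Charset from the media type parameter, else the context default."""
    charset = _param(content_type, "charset")
    if charset is not None:
        return charset
    return DEFAULT_CHARSET.get(context)

if __name__ == "__main__":
    print(effective_level("text/html", "-//IETF//DTD HTML 2.0//EN"))
    print(effective_charset("text/html", "http"))
```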

Decentralized Definition of the HTML Document Type

The expectation is that in addition to the standard DTDs, the HTML processing capabilities of a user agent are described by some DTD, and that this DTD has a formal public identifier, a Uniform Resource Identifier (URI or URL), or both.

Most documents will be prepared for standard HTML user agents, and their document type will be declared ala:

<!doctype html public "-//IETF//DTD HTML 2.0//EN">

A document prepared for a user agent with support for some other HTML dialect would have its document type declared using one of the following:

<!doctype html public "-//VendorCo Inc.//DTD HTML v1.4//EN"
	system "http://www.vendor.com/html-public-text/v1.4.dtd">
<!doctype html system "http://www.vendor.com/html-public-text/v1.4.dtd">

All user agents would have built-in support for the standard DTDs, plus a few popular du jour DTDs. Some user agents would be able to accommodate new DTDs at runtime by fetching them from the network. User agents without this capability, on encountering an unknown DTD identifier, could warn that the document might not be processed as intended by the information provider.
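
The three behaviours described here (built-in DTD, DTD fetched at runtime, warning) amount to a small decision procedure. The sketch below is one possible reading; the names `Outcome`, `BUILT_IN_DTDS` and `dtd_outcome` are hypothetical, no network access is performed, and the treatment of a declaration with no identifiers at all is an assumption, since the text does not cover that case.

```python
from enum import Enum

class Outcome(Enum):
    BUILT_IN = "process with a built-in DTD"
    FETCH = "fetch the DTD from the network, then process"
    WARN = "warn that the document might not be processed as intended"

# Assumed set of identifiers this user agent knows natively.
BUILT_IN_DTDS = {"-//IETF//DTD HTML 2.0//EN"}

def dtd_outcome(public_id, system_id, can_fetch: bool) -> Outcome:
    """Decide how a user agent handles the declared document type."""
    if public_id in BUILT_IN_DTDS or system_id in BUILT_IN_DTDS:
        return Outcome.BUILT_IN
    # Only a system identifier (a URI) gives the agent something to fetch.
    if can_fetch and system_id is not None:
        return Outcome.FETCH
    # Assumption: a bare declaration with no identifiers also lands here.
    return Outcome.WARN

if __name__ == "__main__":
    vendor = "http://www.vendor.com/html-public-text/v1.4.dtd"
    print(dtd_outcome(None, vendor, can_fetch=True).value)
    print(dtd_outcome(None, vendor, can_fetch=False).value)
```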

Marked Sections for Robust Handling of Unknown Markup

The "ignore unrecognized markup" convention is unacceptably unreliable in cases such as forms and tables.

The improved convention is that marked sections are processed as per [ISO8879] (see @@marked sections primer). Additionally, parameter entity references of the form %if-xxx are presumed to resolve to IGNORE, and those of the form %no-xxx are presumed to resolve to INCLUDE, unless the DTD in effect has a declaration for those names.
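
The defaulting rule for these parameter entities is small enough to state as code. In the sketch below, `declared` stands for whatever parameter entity declarations the DTD in effect provides; the function name and the choice to raise an error for other undeclared names are assumptions of the example, not part of the convention.

```python
def resolve_keyword(name: str, declared: dict) -> str:
    """Resolve a parameter entity name to a marked section keyword.

    `declared` maps entity names to "INCLUDE" or "IGNORE" as given by
    the DTD in effect. Undeclared if-xxx names default to IGNORE and
    undeclared no-xxx names default to INCLUDE.
    """
    if name in declared:
        return declared[name]
    if name.startswith("if-"):
        return "IGNORE"
    if name.startswith("no-"):
        return "INCLUDE"
    # Assumption: any other undeclared entity is treated as an error.
    raise KeyError(f"undeclared parameter entity: %{name}")

if __name__ == "__main__":
    print(resolve_keyword("if-table", {}))
    print(resolve_keyword("if-table", {"if-table": "INCLUDE"}))
```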

Using this convention, consider the following enhanced document:

Level 3/1: Conditional Table

<!doctype html system "http://www.w3.org/html-pubtext/960212/html.dtd">
<title>Example: Conditional Table</title>
<![ %if-table [
<table>
<tr><th>Col 1 <th>Col 2 <th>Col 3
<tr><td>A     <td>B     <td>C
<tr><td>1     <td>2     <td>3
</table>
]]>
<![ %no-table [
<pre>
Col 1     Col 2   Col 3
A         B       C
1         2       3
</pre>
]]>

Assuming support for marked sections, an HTML 2.0 user agent will process the table marked up using <pre>, whereas a user agent that supports the 960212 DTD will process the <table> markup. A user agent that does not support the 960212 DTD, but does support tables, is likely to process the <table> markup reliably, since its DTD is likely to have declarations ala:

<!entity % if-table "INCLUDE">
<!entity % no-table "IGNORE">

and declarations for <table>, <tr>, <td>, etc. that match the 960212 DTD.
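
To make the two outcomes concrete, the following sketch applies the convention to a shortened copy of the conditional table document, using the entity names from that example. It is a regular-expression approximation that assumes sections are not nested and are written as `<![ %name [ ... ]]>`; a real SGML parser follows [ISO8879]. The function and variable names are hypothetical.

```python
import re

# One marked section of the form <![ %name [ ... ]]> (no nesting).
MARKED_SECTION = re.compile(
    r"<!\[\s*%(?P<name>[A-Za-z][\w.-]*);?\s*\[(?P<body>.*?)\]\]>",
    re.DOTALL,
)

def apply_marked_sections(text: str, declared: dict) -> str:
    """Keep INCLUDE sections, drop IGNORE sections."""
    def replace(match):
        name = match.group("name")
        if name in declared:
            keyword = declared[name]
        elif name.startswith("if-"):
            keyword = "IGNORE"      # default for undeclared if-xxx
        elif name.startswith("no-"):
            keyword = "INCLUDE"     # default for undeclared no-xxx
        else:
            return match.group(0)   # leave anything else untouched
        return match.group("body") if keyword == "INCLUDE" else ""
    return MARKED_SECTION.sub(replace, text)

DOCUMENT = """<title>Example: Conditional Table</title>
<![ %if-table [
<table>
<tr><th>Col 1 <th>Col 2 <th>Col 3
</table>
]]>
<![ %no-table [
<pre>
Col 1     Col 2   Col 3
</pre>
]]>
"""

# HTML 2.0 user agent: its DTD declares neither entity.
html2_view = apply_marked_sections(DOCUMENT, {})
# Table-capable user agent: its DTD declares both entities.
table_view = apply_marked_sections(
    DOCUMENT, {"if-table": "INCLUDE", "no-table": "IGNORE"}
)

if __name__ == "__main__":
    print(html2_view)
    print(table_view)
```

In this sketch the first view retains only the <pre> rendering and the second only the <table> markup.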

This convention would have dealt gracefully with FORM and TABLE. It has the potential to deal gracefully with SCRIPT, MATH, APPLET, etc.

While the marked section markup may seem unwieldy, it is necessary only when both of the following conditions hold:

a feature hasn't been fully deployed, i.e. there is still a significant installed base that doesn't support it and
the information provider needs "forwards compatibility" -- i.e. they're willing to put more stuff in the document to be sure that old browsers behave nicely.

Here are some cases to mull over, in roughly historical order:

DOCTYPE	Features Used in Doc	Features in Marked Section?	Browser Capabilities	Result
1.0	1.0	no	1.0	100% reliable *1
1.x	1.0+phrase markup	no	1.0	some signal loss *2
2.0	2.0lev1 (no forms)	no	2.0lev1	100% reliable *1
2.0	2.0 incl forms	no	2.0lev1	some form noise *3
2.0	2.0 incl forms	no	2.0	100% reliable *1
3.x (tables)	2.0+tables	no (tables)	2.0	some table noise *3
3.x (tables)	2.0+tables	no (tables)	3.x (tables)	100% reliable *1
3.x (tables)	2.0+tables	yes, incl apology	2.0+marked sections	100% reliable *4 (apology shown)
3.x (tables)	2.0+tables	yes, incl apology	2.0	some table noise, apology *5
3.x (tables)	2.0+tables	yes, incl apology	2.0+tables	98% reliable, apology (unneeded) *6
3.x (tables)	2.0+tables	yes, incl apology	3.x (tables)+marked sections	100% reliable *1 (table shown)

*1: Standard features
*2: Unrecognized markup ignored without much disruption
*3: Unrecognized markup causes disruption
*4: Apology for lack of support shown
*5: Apology shown along with goofed up table
*6: Apology shown along with correctly processed table

In the table above, substitute any of script, style, math, embed, etc. for forms/tables with the same result.

The HTML 2.0 "ignore unknown tags" convention absorbs changes along the lines of phrase markup and new IMG attributes ala *2. But for novel features like forms and tables, we see *3. Note that without marked sections, each non-trivial feature introduced causes a transitional period involving lots of interactions ala *3, with most things settling down ala *1, but an indefinite burden of *3 style interactions due to outdated software.

Until marked sections are supported, providers who use marked sections are rewarded ala *5, but penalized ala *6. (They are apparently already willing to live with this, as evidenced by the "if your browser doesn't support forms, ..." apologies we see, even on forms-capable browsers.)

With marked sections, non-trivial new features can be introduced with interactions ala *4, with graceful transition back to style *1.

Format Negotiation Using Links and Resource Information

@@information provider maintains several variants; one corresponds to the capabilities of most of his/her readership, and that's the one that's shipped by default. It has links to the other variants, so that remedial clients can downgrade at runtime.

Format Negotiation Using HTTP

@@see: tables deployment document

The combination of relying on internal labelling (with external labelling in the content type as an optimization) and marked sections is a viable medium-to-long term solution.

The internal labelling/marked section strategy is the equivalent of the color TV solution: send the color signal to everybody, and the folks that can't show the color just throw it away.

The external labelling/format negotiation strategy is like having the broadcasters send black-and-white signal to folks that request it, and color to the rest. In some cases (like inline graphics formats), this is the right thing to do. But it appears that in the vast majority of cases involving new HTML features, it's just not worth the trouble.

@@discuss negotiation based on user-agent, caching, etc.

Appendix: Marked Sections Primer

See: "Marked Sections" in TEI Gentle Intro to SGML

References

Adams, Nov 95

Date: Thu, 9 Nov 95 13:03:39 EST
Message-Id:<9511091801.AA04679@trubetzkoy.stonehand.com>
From: Glenn Adams<glenn@stonehand.com>
To: Multiple recipients of list<html-wg@oclc.org>

T. Berners-Lee & D. Connolly, November 1995.

"Hypertext Markup Language - 2.0" RFC 1866 ftp://ds.internic.net/rfc/rfc1866.txt

Altheim, Murray, Jan 1996

A Modular DTD Approach for HTML Specification National Technology Transfer Center, work in progress

Connolly, Jan 1996

W3C HTML Public Text Repository work in progress

Connolly

Toward Graceful Deployment of Tables

Connolly, XXX

To: mwm@contessa.phone.net
cc: Multiple recipients of list <html-wg@oclc.org>
Subject: Reliable Interoperability [was: LiveScript and HTML ]
In-reply-to: Your message of "Mon, 16 Oct 1995 23:00:26 EDT."
             <19951016.75EF780.11F50@contessa.phone.net> 
Date: Tue, 17 Oct 1995 00:32:12 -0400
From: "Daniel W. Connolly" <connolly@beach.w3.org>

Clark, James

nsgmls -- a new SGML parser

Behlendorf, Jan 1996

Date: Sun, 7 Jan 1996 23:45:23 -0800 (PST)
From: Brian Behlendorf <brian@organic.com>
To: www-talk@w3.org
Subject: HTML variants and content negotiation
Message-Id: <Pine.SGI.3.91.960107232733.10147O-100000@fully.organic.com>

The World Wide Web Consortium: http://www.w3.org/