W3C WD-htmllink-970328


Hypertext Links in HTML

W3C Working Draft 28-Mar-97


This version:
http://www.w3.org/pub/WWW/TR/WD-htmllink-970328
Latest Version:
http://www.w3.org/pub/WWW/TR/WD-htmllink
Authors:

Status of This Document

This draft is work under review by the W3C HTML Working Group, for potential incorporation in an upcoming version of the HTML specification, code named Cougar. Please remember this is subject to change at any time, and may be updated, replaced or obsoleted by other documents. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress".

A list of current W3C Working Drafts can be found at http://www.w3.org/pub/WWW/TR. This is work in progress and does not imply endorsement by, or the consensus of, either W3C or members of the HTML working group. Further information about Cougar is available at http://www.w3.org/pub/WWW/MarkUp/Cougar/.

Please send detailed comments to www-html-editor@w3.org. We cannot guarantee a personal response, but summaries will be maintained off the Cougar page. Public discussion on HTML features takes place on www-html@w3.org. To subscribe, send a message to www-html-request@w3.org with subscribe in the subject.

Abstract

This introduces the mechanisms used by HTML for hypertext links and resource descriptions. The formal definitions of elements and attributes are left to other specifications.

Note: W3C is developing a more sophisticated approach for representing meta-data than is currently feasible with the LINK and META elements. This new mechanism is intended to include the ability to treat meta-data as first-class objects, to include information describing the properties and relationships of other entities, and to reliably authenticate meta-data using digital signatures. This activity is being coordinated between the PICS "NG" Working Group and the DSIG "Collections" Working Group.

Foundations

The World Wide Web can be considered as the set of network accessible information resources. It is founded on three basic ideas:

  1. A global naming scheme for resources - e.g. URLs
  2. Protocols for accessing named resources - e.g. HTTP
  3. Hypertext - the ability to embed links to other resources - e.g. HTML

The hypertext markup language (HTML) plays a key role as a popular hypertext document format.

Source and Destination Anchors

The basic linking idiom in HTML is that of a unidirectional pointer from the source of the link to its destination. The end points are referred to as "anchors" in the hypertext literature. This motivates the choice of 'a' for the tag name for HTML elements acting as hypertext links ('a' is an abbreviation of 'anchor'). Many hypertext links in HTML documents name other HTML documents, e.g.

  See <a href="http://www.w3.org/">W3C</a> for further information.

The A element (delimited by <a> and </a> tags) represents an anchor, one endpoint of a link. Each link has a source anchor and a target anchor. The HREF attribute refers to the target anchor by its URL.

The content between the start and end tag is used as the label for the link, e.g:

  <a href=destination>link label</a>

URLs can also be used to name locations within HTML documents using the fragment identifier syntax, e.g.

  See: <a href="#s3.2">section 3.2</a> for details.
        ...
  <a name="s3.2">Section 3.2</a>The Bigger Picture

Here the second A element acts as the destination for the first. The presence of the NAME attribute causes the A element to define a destination anchor within the document.

When an HTML user agent presents this idiom to the user, the source anchor represents an opportunity for the user to traverse the link and visit the target anchor, and the user agent should select an appropriate user interface idiom to stylize the anchor and allow navigation. GUI applications typically provide visual cues such as text color or underlining with a different mouse cursor, and point-and-click navigation. Keyboard and cursor key selection is another mechanism, which is suitable for text-based applications as well.

Note that the target anchor will not necessarily be another HTML document. Traversing a link may invoke a software program, or present audio, graphics, video, print, speech synthesis or braille.

A single A element can serve simultaneously as the source of one link and the target of another:

  In <a href="#jones93">[Jones93]</a>, we see...

  <dl>
  <dt><a name="jones93"
          href="http://www.newu.edu/Jones93/">Jones93</a>
  <dd>Jones, Fred. "Strange Amphibian Behaviours", May 1993
    ...
  </dl>

Any element with an ID attribute denotes a target anchor:

  See <a href="#sec3.2">section 3.2</a> for details.
    ...
  <h2 id=sec3.2>Section 3.2 The Bigger Picture</h2>

An ID attribute value is an SGML NAME token. NAME tokens are formed by an initial letter followed by letters, digits, "-" and "." characters. The letters are restricted to A-Z and a-z. ID values are not case sensitive. No two elements in a document may have the same ID value. No two A elements may have the same NAME.

Anchor NAME values are case insensitive: Characters with multiple possible representations in ISO 10646 (e.g. both precomposed and base+diacritic forms) match only if they have the same representation, except for case differences, in both strings. Case folding must be performed as specified in the Unicode Standard, Version 2.0, section 4.1; in particular it is recommended that case-insensitive matching be performed by folding uppercase letters to lowercase, not vice versa. This is the same definition as used by XML.
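
The naming and matching rules above can be sketched in Python. This is an illustration only, not part of the draft: the helper names `is_name_token` and `names_match` are hypothetical, and Python's `str.lower()` is assumed to be an adequate stand-in for the Unicode case folding the text refers to.

```python
import re

# NAME token shape as described above: an initial letter, then letters,
# digits, "-" or "." (letters limited to A-Z and a-z).
NAME_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9.-]*")

def is_name_token(value):
    """Return True if value has the NAME token shape described above."""
    return NAME_TOKEN.fullmatch(value) is not None

def names_match(a, b):
    """Compare two anchor names, ignoring case differences only.

    Uppercase is folded to lowercase. No Unicode normalization is
    applied, so precomposed and base+diacritic forms stay distinct.
    """
    return a.lower() == b.lower()

print(is_name_token("sec3.2"))
print(names_match("Jones93", "jones93"))
```

Folding both strings in the same direction before comparing keeps the check symmetric; skipping normalization is what makes differently encoded accented characters fail to match.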

Absolute and Relative Addresses

The URL syntax [RFC1808] is used for anchor addresses. In the example above, the URL contains only a fragment identifier, since it refers to an anchor within the same resource. In general, a URL can be absolute:

  The work by <a href="http://www.newu.edu/faculty/Jones.html">Jones</a> ...

or relative:

  <a href="Jones.html">...</a>
  <a href="faculty/Jones.html">...</a>
  <a href="faculty/Jones/Bio.html">...</a>
  <a href="/people/faculty/Jones.html">...</a>
  <a href="../faculty/Jones.html">...</a>

In any case, it can also have a fragment identifier, delimited by "#":

  See: <a href="http://www.newu.edu/faculty/Jones.html#pubs">publications
      by Jones</a> ...
  <a href="Jones.html#pubs">...</a>
  <a href="faculty/Jones.html#pubs">...</a>
  <a href="faculty/Jones/Bio.html#1990">...</a>
  <a href="/people/faculty/Jones.html#pubs">...</a>
  <a href="../faculty/Jones.html#pubs">...</a>
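
As a rough illustration of how the relative forms above resolve, the following Python sketch uses the standard `urllib.parse` module. The base URL is a hypothetical location for the linking document, and `urllib.parse` follows later URL specifications than RFC 1808, which is assumed to make no difference for these simple cases.

```python
from urllib.parse import urljoin, urldefrag

# Hypothetical URL of the document that contains the links.
base = "http://www.newu.edu/people/staff/index.html"

hrefs = [
    "Jones.html#pubs",
    "faculty/Jones.html#pubs",
    "/people/faculty/Jones.html#pubs",
    "../faculty/Jones.html#pubs",
]

for href in hrefs:
    absolute = urljoin(base, href)                 # resolve against the base
    resource, fragment = urldefrag(absolute)       # split off the "#..." part
    print(href, "->", resource, "fragment:", fragment)
```

The fragment identifier is carried through resolution unchanged and is only interpreted once the resource itself has been retrieved.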

Link Semantics: REL and REV

The REL attribute describes the role that the destination plays with respect to the source (the current document). The REV attribute can be used to define the reverse relationship. A link from document A to document B with REV=relation expresses the same relationship as a link from B to A with REL=relation. Note that relationship names are case insensitive, using the same definition as given above for anchor name values.

    Thanks to <A REL="sponsor">Acme Inc</A> for support ...

Following the precedent set by HTML 2.0, REL and REV can take a space separated list of relationship values. Note that REL and REV are also used with the LINK element. Relationship values can be defined in profiles, as explained in a later section.
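
A user agent might split such a value as in the Python sketch below. The function names are hypothetical, and lowercasing with `str.lower()` is assumed to be a sufficient approximation of the case-insensitive comparison described above.

```python
def relationship_values(attr_value):
    """Split a REL or REV attribute value into lowercase relationship names."""
    return attr_value.lower().split()

def has_relationship(attr_value, name):
    """Return True if name is one of the relationships listed in attr_value."""
    return name.lower() in relationship_values(attr_value)

# Hypothetical attribute value carrying two relationships.
print(relationship_values("Contents  Chapter"))
print(has_relationship("Contents Chapter", "chapter"))
```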

The Title Attribute

This is used to provide an advisory title for the linked resource. User agents can use this to display balloon help (aka tool tip) for the link. The TITLE attribute can be used with anchor <A> and LINK elements.

    <A TITLE="Ski Conditions for New Hampshire">ski conditions</A>

The BASE Element

The BASE element gives the base URL for dereferencing relative URLs, using the rules given by the URL specification. For example:

  <head>
  <BASE HREF="http://www.acme.com/intro.html" >
  </head>
     ...
  <IMG SRC="icons/logo.gif">

The image URL is dereferenced to

  http://www.acme.com/icons/logo.gif

In the absence of a BASE element the document URL should be used. Note that this is not necessarily the same as the URL used to request the document, as the base URL may be overridden by an HTTP header accompanying the document.
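
The choice of base URL can be sketched in Python as follows. The function `resolve` and the document URL are hypothetical, and the sketch ignores the case where an HTTP header overrides the base.

```python
from urllib.parse import urljoin

def resolve(relative_url, document_url, base_href=None):
    """Resolve relative_url against BASE HREF if given, else the document URL."""
    base = base_href if base_href is not None else document_url
    return urljoin(base, relative_url)

# Hypothetical document URL; the BASE value is the one from the example above.
document_url = "http://www.acme.com/docs/page.html"

print(resolve("icons/logo.gif", document_url,
              base_href="http://www.acme.com/intro.html"))
print(resolve("icons/logo.gif", document_url))
```

With the BASE value supplied, the image resolves under the site root as in the example above; without it, the same relative URL resolves under the document's own directory.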

The LINK Element

The A element is used to define source and destination anchors for hypertext links that users can choose to follow as they wish. In contrast, the LINK element can be used to bind HTML elements to various kinds of resources, e.g. style sheets, optimal color palettes, scripts, alternative forms of the document, and navigation links (tables of contents, document index, previous and next pages, copyright notices).

The LINK element denotes a semantic link whose source anchor is the entire containing document or resource. The role of the link is expressed using the REL and/or REV attributes as described above for the anchor element.

The following example might appear in a section of chapter 2 of a book:

  <head>
    <link rel=parent href="chapter2.html">
  </head>

Example LINK elements:

    <LINK REL=Contents HREF=toc.html>
    <LINK REV=Contents REL=Chapter HREF=chap2.html>
    <LINK REL=Index HREF=index.html>
    <LINK REL=Previous HREF=doc31.html>
    <LINK REL=Next HREF=doc33.html>

A range of relationship values have been proposed for a variety of applications. To avoid the difficulties in maintaining a centralized registry for relationship names, you can name a profile with a URL. This is explained below.

LINK can be used to specify linked style sheets, e.g.

    <LINK REL=stylesheet MEDIA=print HREF="corporate-print.css">
    <LINK REL=stylesheet MEDIA=screen HREF="corporate-screen.css">
    <LINK REL=stylesheet HREF="techreport.css">
    <STYLE TYPE="text/css">
        p.special { color: rgb(230, 100, 180) }
    </STYLE>

LINK can be used to specify a language variant of the current document, e.g.

    <LINK REL=alternate LANG=fr HREF="mydoc-fr.html">

You can also use LINK to specify alternative versions of the current document for use when printing, e.g.

  <LINK REL=ALTERNATE MEDIA=PRINT
        HREF="mydoc.ps"
        TYPE=application/postscript>

The MEDIA attribute is used to indicate that the resource pointed to by a LINK element is designed for a particular medium. The TYPE attribute can be used to specify the Internet Media type and associated parameters for the linked resource. This allows the user agent to disregard linked style sheets etc. in unsupported notations, without the need to first make a remote query across the network.

The META Element

The META element can be used to include name/value pairs describing properties of the document, such as author, expiry date, a list of key words etc. The NAME attribute specifies the property name while the CONTENT attribute specifies the property value, e.g.

  <META NAME="Author" CONTENT="Dave Raggett">

The HTTP-EQUIV attribute can be used in place of the NAME attribute and has a special significance when documents are retrieved via the Hypertext Transfer Protocol (HTTP). HTTP servers may use the property name specified by the HTTP-EQUIV attribute to create an RFC 822 style header in the HTTP response. This cannot be used to set certain HTTP headers, though; see the HTTP specification for details.

  <META HTTP-EQUIV="Expires" CONTENT="Tue, 20 Aug 1996 14:25:27 GMT">

will result in the HTTP header:

    Expires: Tue, 20 Aug 1996 14:25:27 GMT

This can be used by caches to determine when to fetch a fresh copy of the associated document.
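
One way a server or authoring tool might extract such headers is sketched below in Python, using the standard `html.parser` module. The class and function names are hypothetical, and the sketch makes no attempt to filter out headers that HTTP does not allow to be set this way.

```python
from html.parser import HTMLParser

class HttpEquivCollector(HTMLParser):
    """Collect (header name, value) pairs from META HTTP-EQUIV elements."""

    def __init__(self):
        super().__init__()
        self.headers = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("http-equiv")
        content = attributes.get("content")
        if name is not None and content is not None:
            self.headers.append((name, content))

def http_equiv_headers(html_text):
    """Return the HTTP-EQUIV name/value pairs found in html_text."""
    collector = HttpEquivCollector()
    collector.feed(html_text)
    return collector.headers

doc = '<head><META HTTP-EQUIV="Expires" CONTENT="Tue, 20 Aug 1996 14:25:27 GMT"></head>'
for name, value in http_equiv_headers(doc):
    print(f"{name}: {value}")
```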

A common use for META is to specify a comma separated list of keywords that can be exploited by search engines to improve the specificity of search results:

  <META NAME="keywords" CONTENT="vacation,Greece,sunshine">

Some user agents support the use of META to refresh the current page after a few seconds, perhaps replacing it with another page, e.g.

  <META NAME="refresh" CONTENT="3,http://www.acme.com/intro.html">

The content is a number specifying the delay in seconds, followed by the URL to load when the time is up. This mechanism is generally used to show people a fleeting greetings page. You can think of it as ushering you through a door into a room.
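
A minimal Python sketch of reading such a value follows. It assumes the "delay,URL" form shown above; the function name `parse_refresh` is hypothetical, and treating a value with no comma as a plain reload is also an assumption.

```python
def parse_refresh(content):
    """Split a refresh CONTENT value into (delay_seconds, url).

    Assumes the "delay,URL" form; url is None when only a delay is given.
    """
    delay_text, separator, url = content.partition(",")
    delay = int(delay_text.strip())
    return delay, (url.strip() if separator else None)

print(parse_refresh("3,http://www.acme.com/intro.html"))
```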

PICS is an infrastructure for associating labels (metadata) with Internet content. It was originally designed to help parents and teachers control what children access on the Internet, but it also facilitates other uses for labels, including code signing, privacy, and intellectual property rights management. The following shows how you can use META to include a PICS label:

    <head>
    <META http-equiv="PICS-Label" content='
    (PICS-1.1 "http://www.gcf.org/v2.5"
       labels on "1994.11.05T08:15-0500"
          until "1995.12.31T23:59-0000"
          for "http://w3.org/PICS/Overview.html"
       ratings (suds 0.5 density 0 color/hue 1))
    '>
    <title>..title goes here..</title>
    </head>
   ...contents of document here...

META can be used to specify the default scripting and style sheet languages. You can also use it to set the default style when you have provided a range of alternative styles using LINK and STYLE elements. The use of the HTTP-EQUIV attribute allows these properties to be set by HTTP headers, making it easy for site managers to impose a standard style.

The LANG attribute can be used with META to specify the language for the value of the CONTENT attribute, e.g.

  <META NAME=author LANG=fr CONTENT="Arnaud Le Hors">

This enables speech synthesisers to apply language dependent pronunciation rules. If you provide descriptions in multiple languages, this allows search engines to display search results using the language preferences of the user.

When a property may be described in any of a number of externally defined ways, the SCHEME attribute should be used to indicate which such scheme is used for the value of that property. An example would be the use of a scheme value with NAME=description to indicate the CONTENT is a Library of Congress classification number, a Dewey Decimal System number, a Medical Subject Heading, or an Art and Architecture Thesaurus descriptor.

As an example, here is a Dewey Decimal System subject (dds):

  <META NAME=description SCHEME=dds
    CONTENT="04.251 Supercomputers systems design ">

Another example, this time for an identifier property using the ISBN scheme:

  <META NAME=identifier SCHEME=ISBN CONTENT="0-8230-2355-9">

SCHEME is defined as CDATA. The permitted values and their interpretation for each property name are defined by the profile.

Profiles for meta-data

Very common names for link relationships etc. can be standardized. This is not so easy for applications which, while popular, cannot be considered wholly pervasive. As a result, relying on a centralized registry for names is a poor option. This specification proposes a means to define named registries using URLs as globally unique names.

The PROFILE attribute is used with the HEAD element to provide a URL that acts as a globally unique name for a profile (basically a dictionary) of names for link relationships (LINK & A), property names (META) and classes (the CLASS attribute), e.g.

  <HEAD PROFILE="http://www.acme.com/profiles/core">
  <META NAME="author" CONTENT="John Doe">
  <META NAME="copyright" CONTENT="&copy; 1997 Acme Corp.">
  <META NAME="keywords" CONTENT="corporate,guidelines,cataloging">
  <META NAME="date" CONTENT="23 Jan 1997 16:05:31 GMT">
    ...

The example above uses a hypothetical profile, which covers a range of common terms for document attribution and indexing. It is easy to imagine a wide range of profiles for different purposes. User agents which are able to act on meta info using these names can match on the URL given by the PROFILE attribute, without needing to download the profile itself. This allows the question of formats for profiles to be deferred.

An example of a profile is the Dublin Core. This defines a set of recommended properties for electronic bibliographic descriptions, and is intended to promote interoperability among disparate description models.

To allow the possibility of extending the PROFILE attribute to provide a list of profiles, user agents should consider the value as a white space separated list of URLs. For the moment, only the first item is significant.

Note: space makes more sense than comma as the delimiter since white space is not allowed within URLs.
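
The handling described above could look like the following Python sketch. The function names and the set of known profiles are hypothetical, and comparing profile URLs as exact strings is an assumption.

```python
def significant_profile(profile_value):
    """Return the first URL of a white space separated PROFILE value, or None."""
    urls = profile_value.split()
    return urls[0] if urls else None

# Hypothetical set of profiles this user agent knows how to act on.
KNOWN_PROFILES = {"http://www.acme.com/profiles/core"}

def understands(profile_value):
    """Return True if the significant profile is one the agent recognizes."""
    return significant_profile(profile_value) in KNOWN_PROFILES

print(understands("http://www.acme.com/profiles/core"))
```

Because the profile URL is used only as a name, the check needs no network access.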

Recommendations for Authors

This section provides some simple suggestions that will help indexing engines to manage your Web pages effectively.

Define the Document Language

In the global context of the Web it is important to know which language a page was written in, e.g. French, German, Spanish etc. You can do this in several ways:

The LANG attribute should be used to specify the language for the content of an element when this differs from the parent element. The attribute can be used on most elements, including empty elements such as LINK, META and IMG.

    <BODY LANG=fr>
        ... en français ...
        <SPAN LANG=de>...auf deutsch...</SPAN>
        ... en français ...
    </BODY>

The language is specified as per RFC 1766 "Tags for the Identification of Languages". For META the language refers to the language used within the CONTENT attribute. For LINK the language refers to the resource specified by the HREF attribute, and to the language used in the TITLE attribute (if present).

Specify Language Variants of this Document

If you have prepared translations of this document into other languages, you should use the LINK element to reference these. This allows an indexing engine to offer users search results in the users' preferred language, regardless of how the query was written.

    <LINK REL=alternate HREF=mydoc-fr.html
        LANG=fr TITLE="La vie souterraine">
    <LINK REL=alternate HREF=mydoc-de.html
        LANG=de TITLE="Unterirdisches Leben">

Can anyone help me get the translations right?

Keywords and Description

Some indexing engines look for META elements that define a comma separated list of keywords/phrases, or which give a short description. At the very least, these can be used when presenting the search results to help users pick the most promising match.

    <META NAME=keywords CONTENT="vacation,Greece,sunshine">
    <META NAME=description CONTENT="Idyllic European vacations">

The Beginning of a Collection

When word processing documents or presentations are automatically converted into HTML, this generally results in a collection of HTML pages. It is helpful for search results to reference the beginning of the collection in addition to the page hit by the search. Use LINK with REL=begin along with a TITLE, as in:

 <LINK REL=begin HREF=page1.html TITLE="General Theory of Relativity">

Specify which parts of your Web site get Indexed

Sometimes people find they have been indexed by an indexing robot, or that a resource discovery robot has visited part of a site that for some reason shouldn't be visited by robots. In recognition of this problem, many Web Robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms: a "robots.txt" file and the use of META tags in specific Web pages.

In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents to see if it is allowed to retrieve the document. You can customize the robots.txt file to apply only to specific robots, and to disallow access to specific directories or files.

Here is a sample robots.txt file that excludes all robots from the entire site:

        User-agent: *    # applies to all robots
        Disallow: /      # disallow indexing of all pages

Where to create the robots.txt file

The Robot will simply look for a "/robots.txt" URL on your site, where a site is defined as an HTTP server running on a particular host and port number. For example:

  Site URL                  URL for robots.txt
  http://www.w3.org/        http://www.w3.org/robots.txt
  http://www.w3.org:80/     http://www.w3.org:80/robots.txt
  http://www.w3.org:1234/   http://www.w3.org:1234/robots.txt
  http://w3.org/            http://w3.org/robots.txt
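
The mapping from a site to its robots.txt URL can be sketched in Python with the standard `urllib.parse` module. The function name `robots_txt_url` is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

def robots_txt_url(url):
    """Return the "/robots.txt" URL for the site (scheme, host, port) of url."""
    parts = urlsplit(url)
    # Keep the scheme and host[:port]; drop any path, query or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_txt_url("http://www.w3.org:1234/"))
```

The host and port are kept exactly as written, so http://www.w3.org/ and http://www.w3.org:80/ yield different strings, as in the examples above.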

There can only be a single "/robots.txt" on a site. Specifically, you should not put "robots.txt" files in user directories, because a robot will never look at them. If you want your users to be able to create their own "robots.txt", you will need to merge them all into a single "/robots.txt". If you don't want to do this your users might want to use the Robots META Tag instead.

Some tips: URLs are case sensitive, and the "/robots.txt" string must be all lower-case. Blank lines are not permitted within a single record.

There must be exactly one User-agent field per record. The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. Multiple such records are not allowed in the "/robots.txt" file.
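
A robot's record selection might be sketched in Python as follows. The function names, the dictionary layout and the robot names are hypothetical. Two further assumptions: version information is whatever follows a "/" in the robot's name, and the User-agent value is looked for as a substring of the robot's name.

```python
def record_applies(record_agent, robot_name):
    """Case insensitive substring match, ignoring the robot's version suffix."""
    name_without_version = robot_name.split("/")[0]
    return record_agent.lower() in name_without_version.lower()

def select_record(records, robot_name):
    """Pick the Disallow list that applies to robot_name.

    records maps each User-agent value to its list of Disallow values.
    A specific match wins; otherwise the '*' record is the default.
    """
    for agent, disallows in records.items():
        if agent != "*" and record_applies(agent, robot_name):
            return disallows
    return records.get("*", [])

# Hypothetical parsed robots.txt with one specific record and a default.
records = {"ExampleBot": ["/private/"], "*": ["/"]}
print(select_record(records, "examplebot/1.0"))
print(select_record(records, "OtherBot/2.1"))
```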

The Disallow field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example,

    Disallow: /help disallows both /help.html and /help/index.html, whereas
    Disallow: /help/ would disallow /help/index.html but allow /help.html. 

An empty value for Disallow indicates that all URLs can be retrieved. At least one Disallow field must be present in the robots.txt file.
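
The prefix rule for Disallow can be sketched in Python as follows. The function name `is_allowed` is hypothetical, and the sketch works on URL paths only.

```python
def is_allowed(path, disallow_values):
    """Return False if path starts with any non-empty Disallow value."""
    for value in disallow_values:
        # An empty Disallow value excludes nothing.
        if value and path.startswith(value):
            return False
    return True

print(is_allowed("/help.html", ["/help"]))
print(is_allowed("/help.html", ["/help/"]))
```

The trailing slash matters because the comparison is a plain string prefix test, not a directory match.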

The Robots META tag

The Robots META tag allows HTML authors to indicate to visiting robots whether a document may be indexed, or used to harvest more links. No server administrator action is required.

In the following example a robot should neither index this document, nor analyse it for links.

    <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">

The list of terms in the content is ALL, INDEX, NOFOLLOW, NOINDEX. The name and the content attribute values are case insensitive.
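
A robot could interpret the content value along these lines. The function name `robots_policy` is hypothetical, and the sketch assumes that indexing and link following are permitted unless NOINDEX or NOFOLLOW is present.

```python
def robots_policy(content):
    """Return (may_index, may_follow) for a Robots META content value."""
    # Terms are comma separated and case insensitive.
    terms = {term.strip().upper() for term in content.split(",")}
    may_index = "NOINDEX" not in terms
    may_follow = "NOFOLLOW" not in terms
    return may_index, may_follow

print(robots_policy("NOINDEX, NOFOLLOW"))
```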

Note: in early 1997 only a few robots implement this, but this is expected to change as more public attention is given to controlling indexing robots.

References

The Dublin Core
For more information see http://purl.org/metadata/dublin_core
Platform for Internet Content (PICS)
For more information see http://www.w3.org/pub/WWW/PICS/