The W3C MarkUp Validation Service consists of an SGML parser, an SGML catalog, a CGI program and its configuration files. In addition, it relies on a moderately large set of Perl modules for its operation.

This document tries to draw a road map of the prerequisites and of what the different parts of the system do. It is intended for system administrators and people interested in helping to develop the validator. This is not end user documentation. See the User Manual for usage instructions.

Prerequisites

Apart from a properly configured web server, the Validator needs an SGML parser -- that does all the hard work -- and several Perl modules used by the "check" CGI script.

The SGML parser we're currently using is OpenSP 1.5, which can be found on the OpenJade home page.

The canonical list of Perl modules we use can be found in the source for the "check" CGI script. It contains a number of lines of the form "use Foo::Bar", where each "Foo::Bar" represents a module. Most modules can be found on CPAN (minimum versions in parentheses after the name). The following list was complete when CVS spit out: $Date: 2002-12-08 01:46:44 $. :-)

CGI (2.81)
The all-singing, all-dancing, everything-and-the-kitchen-sink Perl CGI library. This takes care of all those niggly little bits of CGI for us and makes option parsing and file upload a breeze.
CGI::Carp
CGI-aware warn()/die()
File::Spec
Portable filespecs.
HTML::Parser (3.25)
Minimal HTML parser used for pre-parsing and finding metadata.
LWP::UserAgent (1.90)
Gisle Aas' most excellent WWW library for Perl. This is where our support for downloading pages off the net comes from.
Set::IntSpan
Efficient Set operations.
Text::Iconv
Perl-native interface to the (g)libc iconv(3) library. Handles charset conversion issues.
Text::Wrap
Wrap text to a sane width. Needed for source output in results.
URI::Escape
Module to handle escaping special characters in URIs.
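
Because the "check" script is the canonical source, the list above can be compared against it mechanically. The following is a minimal sketch in Python (not part of the validator, which is written in Perl) that pulls module names and optional minimum versions out of "use" lines. The list_modules helper and the SAMPLE text are hypothetical; the sample lines only imitate the "use Foo::Bar" form described above and are not copied from the real script.

```python
import re

# Matches lines such as "use CGI 2.81 qw(...);" or "use File::Spec;".
# Names starting with a lowercase letter (pragmas like "strict") are skipped.
USE_LINE = re.compile(r'^\s*use\s+([A-Z][\w:]*)(?:\s+([\d._]+))?')


def list_modules(source):
    """Return (module, minimum_version_or_None) pairs in order of appearance."""
    modules = []
    for line in source.splitlines():
        match = USE_LINE.match(line)
        if match:
            modules.append((match.group(1), match.group(2)))
    return modules


# Made-up input for illustration only.
SAMPLE = """\
use strict;
use CGI 2.81 qw(:standard);
use File::Spec;
use LWP::UserAgent 1.90;
"""

for module, version in list_modules(SAMPLE):
    print(module, version or "(any version)")
```

A regular expression is enough here because only simple one-line "use" statements are of interest; anything more elaborate would need a real Perl parser.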

Configuration Files

The validator uses a number of configuration files -- most of which are really mapping tables of some form -- to avoid having to check in a new version of the code every time a new version of HTML comes out. All configuration files can be found in $CVSROOT/validator/htdocs/config/.

To really understand what each does you should read the source, but here is a short description to get you started.

validator.conf
Main configuration file. Gives various parameters (such as the address of the maintainer and the URL for the "Home Page") and the locations of the other configuration files and mapping tables.
types.conf

The main document type database for the Validator. This file contains information on all the document types we know of. It lets us map from a Public Identifier to a plain text version string, look up a URL for more information on a DOCTYPE, and check which Content-Types and Namespaces are legal for this particular DOCTYPE.

An entry in this file looks like this:

<XHTML_1_1>
  Name       = html
  Display    = XHTML 1.1
  Info_URL   = http://www.w3.org/TR/xhtml11/
  PubID      = -//W3C//DTD XHTML 1.1//EN
  SysID      = http://www.w3.org/TR/2001/REC-xhtml11-20010531/DTD/xhtml11-flat.dtd
  Parse_Mode = XML
  <Content_Types>
    Allowed   = application/xhtml+xml
    Forbidden = text/html
    Preferred = application/xhtml+xml
  </Content_Types>
  <Namespaces>
    Allowed   = http://www.w3.org/1999/xhtml
    Required  = 1
  </Namespaces>
  <Badge>
    URI    = http://www.w3.org/Icons/valid-xhtml11
    Height = 31
    Width  = 88
  </Badge>
</XHTML_1_1>

The name used for each section (e.g. "XHTML_1_1") is arbitrary. The file will be turned inside out and will end up indexed by the "PubID". This means that you cannot have two entries with the same PubID. The rest of the parameters are:

Name
The "Document Type Name" for this document type.
Display
The pretty text version for the PubID.
Info_URL
URL for more information on the PubID.
PubID
The Formal Public Identifier for this document type.
SysID
A System Identifier for the DTD.
Parse_Mode
Boolean describing whether to treat this as XML or SGML.
Content_Types
  Allowed
  Allowed Content-Types
  Forbidden
  Forbidden Content-Types
  Preferred
  Preferred Content-Types
Namespaces
  Allowed
  Allowed Namespaces
  Required
  Boolean describing whether a Namespace is required in this document type.
Badge
  URI
  URI for a "Valid Foo" badge.
  Height
  Height of this image.
  Width
  Width of this image.
eref.cfg
Contains the mappings from element names to a URI fragment (relative to a configurable URI) for their definitions. Used in output when the "Show Source Input" option is enabled.
frag.cfg
Maps error messages to a URI fragment identifier where an explanation of that error can be found.
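
The way types.conf is "turned inside out" and indexed by PubID can be illustrated with a small sketch. This is Python rather than the validator's own Perl, and it is not the validator's actual loading code: parse_blocks and index_by_pubid are hypothetical helpers, the parser handles only the simple "<Section>" and "key = value" lines shown in the XHTML 1.1 entry, and raising an error on a duplicate PubID is a choice made for the sketch (how the validator itself reacts to duplicates is not stated here).

```python
def parse_blocks(text):
    """Parse '<Section> ... </Section>' and 'key = value' lines into nested dicts."""
    root = {}
    stack = [root]
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("</"):
            stack.pop()                        # close the current section
        elif line.startswith("<"):
            section = {}
            stack[-1][line[1:-1]] = section    # open a nested section
            stack.append(section)
        else:
            key, _, value = line.partition("=")
            stack[-1][key.strip()] = value.strip()
    return root


def index_by_pubid(sections):
    """Turn {section_name: entry} into {PubID: entry}, rejecting duplicates."""
    index = {}
    for name, entry in sections.items():
        pubid = entry["PubID"]
        if pubid in index:
            raise ValueError("duplicate PubID %r in section %s" % (pubid, name))
        index[pubid] = entry
    return index


# Shortened copy of the XHTML 1.1 entry, for illustration.
SAMPLE = """\
<XHTML_1_1>
  Name       = html
  Display    = XHTML 1.1
  PubID      = -//W3C//DTD XHTML 1.1//EN
  Parse_Mode = XML
  <Content_Types>
    Allowed   = application/xhtml+xml
    Forbidden = text/html
  </Content_Types>
</XHTML_1_1>
"""

types = index_by_pubid(parse_blocks(SAMPLE))
entry = types["-//W3C//DTD XHTML 1.1//EN"]
print(entry["Display"], entry["Content_Types"]["Allowed"])
```

After re-indexing, the arbitrary section name ("XHTML_1_1") no longer matters: everything is looked up by the Public Identifier found in the document's DOCTYPE, which is why two entries cannot share a PubID.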

TODO

The TODO list for the Validator is online at <http://validator.w3.org/todo.html>. This is probably the best place to start.

However, this list is by no means comprehensive. Feel free to suggest other features that should be on this list or send patches for your favourite feature.

Keep in mind that features should be of general utility and that the point of the validator is that it does an objective validation instead of just what some random developer happens to think is a Good Idea®. While extra features are nice, they shouldn't dilute the value of the validator as an objective check.