All You Need to Know about XML in One Page

20th February 2004

The world of XML is drowning in alphabet soup! There's just too much to know these days, what with CSS, the DOM, DTDs, HTML, HTTP, Resources, SAX, SOAP, URIs, URLs, URNs, WSDL, XHTML, XLink, XML, XML Namespaces, XML Schema, XPath, XPointer and XSLT.

Here, at last, is a one-page guide that tells you in the simplest terms what all these things are and how they fit together. Enjoy!

A Resource is a thing somewhere in the world, often but not always on the Internet. Common resources include things like documents (plain text, HTML, XML, PDF, Microsoft Word), images, video clips, database server endpoints, etc.
A URI is a Uniform Resource Identifier - a string identifying a resource. It always begins with a short string called the Scheme, followed by a colon and then some scheme-specific identifying information. Broadly speaking, each URI is either a URN or a URL (see below).
A URN is a Uniform Resource Name - a URI that names a resource but does not give a resolvable address for it. For example, a book's ISBN can be specified as a URN: urn:isbn:0-395-36341-1
A URL is a Uniform Resource Locator - a URI that gives the location of a resource so that it can be obtained directly. Popular URL schemes include HTTP (see below), HTTPS and FTP. For example, the URL of the page you're reading now, which is accessible using HTTP, is http://miketaylor.org.uk/tech/xml.html
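
To make the scheme-then-colon structure concrete, here is a minimal sketch using Python's standard urllib.parse module on the two example URIs quoted above. The helper name scheme_of is made up for this illustration.

```python
from urllib.parse import urlsplit

# The two example URIs quoted in the text above.
uris = [
    "urn:isbn:0-395-36341-1",
    "http://miketaylor.org.uk/tech/xml.html",
]

def scheme_of(uri: str) -> str:
    """Return the scheme: the short string before the first colon."""
    return urlsplit(uri).scheme

for uri in uris:
    parts = urlsplit(uri)
    # netloc is the host part of an http URL; it is empty for a URN.
    print(f"{uri}: scheme={parts.scheme!r} host={parts.netloc!r} path={parts.path!r}")
```

The URN has a scheme and some scheme-specific text but no host to contact, which is exactly the "name, not address" distinction.
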
HTML is HyperText Markup Language, the language used for expressing what web pages should look like. This page is written in HTML.
HTTP is HyperText Transfer Protocol, the protocol used to retrieve web pages and many other kinds of resource. http is the most widely used URL scheme, providing URLs that specify how to retrieve a resource using HTTP.
XML is eXtensible Markup Language, a syntax for expressing structured data, used in many different applications from document markup to the representation of arbitrary data. XML documents consist of elements which may have attributes, and which contain other elements and/or text.
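
As a rough illustration of elements, attributes and text, the sketch below parses a tiny invented document with Python's standard xml.etree.ElementTree module. The library/book/title vocabulary is an assumption made up for the example.

```python
import xml.etree.ElementTree as ET

# A made-up document: the element and attribute names are illustrative only.
doc = """<library>
  <book year="2004">
    <title>XML on One Page</title>
    <author>A. Writer</author>
  </book>
</library>"""

root = ET.fromstring(doc)
book = root[0]                    # first child element of <library>

print(root.tag)                   # an element's name
print(book.attrib["year"])        # an attribute's value
print(book.find("title").text)    # the text contained in an element
```
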
XHTML is a more rigorously specified version of HTML which conforms to the XML specifications - so XHTML documents are also XML documents. They are painful to write, but easier to process automatically than old-style HTML.
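
One way to see why XHTML is easier to process: any ordinary XML parser can read it, while old-style HTML with unclosed tags is rejected. The sketch below assumes Python's standard ElementTree parser; the page content and the helper is_well_formed are invented for the example.

```python
import xml.etree.ElementTree as ET

XHTML = "http://www.w3.org/1999/xhtml"  # the XHTML namespace URI

page = f"""<html xmlns="{XHTML}">
  <head><title>Tiny page</title></head>
  <body><p>Hello<br/>world</p></body>
</html>"""

root = ET.fromstring(page)
# ElementTree spells a namespaced name as {namespace-uri}local-name.
page_title = root.find(f"{{{XHTML}}}head/{{{XHTML}}}title").text

def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False

print(page_title)
print(is_well_formed("<p>Hello<br/>world</p>"))  # XHTML style: <br/> is closed
print(is_well_formed("<p>Hello<br>world</p>"))   # old HTML style: <br> is never closed
```
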
A DTD is a Document Type Definition. It is associated with an XML document to constrain what kinds of things can occur in it - e.g. ``There may be zero or more author elements inside each book element, and each one contains text''.
XML Schema is a replacement for DTDs: a newer, more complex, cumbersome and powerful notation for constraining XML document contents. XML Schemas are themselves XML documents (conforming to the Schema schema).
An XML Namespace is an abstract place where XML elements ``come from'', so that you can have multiple different elements with the same name in the same XML document: the title element from the library namespace, and the completely unrelated title element from the heraldry namespace. Namespaces are identified by URIs.
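
The two-titles situation can be sketched as follows, again with Python's standard ElementTree. Both namespace URIs and the record element are invented for this example; real namespaces would use URIs chosen by whoever defines the vocabulary.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are made up for this sketch.
LIB = "http://example.org/library"
HER = "http://example.org/heraldry"

doc = f"""<record xmlns:lib="{LIB}" xmlns:her="{HER}">
  <lib:title>Burke's Peerage</lib:title>
  <her:title>Baron</her:title>
</record>"""

root = ET.fromstring(doc)

# ElementTree spells a namespaced name as {namespace-uri}local-name,
# so the two title elements are told apart by URI, not by prefix.
lib_title = root.find(f"{{{LIB}}}title").text
her_title = root.find(f"{{{HER}}}title").text
print(lib_title, "/", her_title)
```
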
The DOM is the Document Object Model, a way of thinking about, and an API for manipulating, XML documents as trees of objects representing elements, attributes and text.
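
Here is a small sketch of the tree-of-objects idea using xml.dom.minidom, the lightweight DOM implementation in Python's standard library. The document content is invented for the example.

```python
from xml.dom.minidom import parseString

dom = parseString('<book year="2004"><title>XML on One Page</title></book>')

book = dom.documentElement                      # the root element node
title = book.getElementsByTagName("title")[0]   # an element node
text_node = title.firstChild                    # text is a node in the tree too

# Manipulate the tree through the API, then serialise it again.
book.setAttribute("year", "2005")
author = dom.createElement("author")
author.appendChild(dom.createTextNode("A. Writer"))
book.appendChild(author)

print(book.toxml())
```
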
SAX is the Simple API for XML, an event-based API for XML parsing, often used when the DOM approach of reading the whole parsed document into memory at once would be too resource-heavy.
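
For contrast with the DOM, the following sketch uses Python's standard xml.sax module: the parser calls back into a handler as it meets each start tag, run of text and end tag, and nothing is kept in memory unless the handler chooses to keep it. The class name TitleCollector and the sample document are assumptions for illustration.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collect the text of every <title> element as parse events arrive."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None   # None means "not currently inside a <title>"

    def startElement(self, name, attrs):
        if name == "title":
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title" and self._buffer is not None:
            self.titles.append("".join(self._buffer))
            self._buffer = None

doc = (b"<library>"
       b"<book><title>First</title></book>"
       b"<book><title>Second</title></book>"
       b"</library>")

handler = TitleCollector()
xml.sax.parseString(doc, handler)
print(handler.titles)
```
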
XPath is a little language for writing expressions that pluck out parts of an XML document.
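
To give a flavour of the plucking, the sketch below uses the path expressions accepted by Python's standard ElementTree. Note the assumption: ElementTree implements only a limited subset of XPath, so this shows the general idea, not the full language. The document is invented.

```python
import xml.etree.ElementTree as ET

doc = """<library>
  <book year="2003"><title>Old Book</title></book>
  <book year="2004"><title>New Book</title></book>
</library>"""

root = ET.fromstring(doc)

# Every <title> element anywhere below the root.
all_titles = [t.text for t in root.findall(".//title")]

# Only titles of books whose year attribute is 2004.
titles_2004 = [t.text for t in root.findall("./book[@year='2004']/title")]

print(all_titles)
print(titles_2004)
```
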
XLink is a specification for a set of XML elements and attributes used to express links from one XML document to another, and also some more complex link topologies. It is broadly equivalent to, but much more complex and powerful than, HTML anchor (<A href="...">) tags.
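
A simple XLink is just a pair of attributes from the XLink namespace on an ordinary element. The sketch below pulls the link targets out of an invented document with Python's standard ElementTree; the reading-list and item elements and the example.org addresses are made up for the illustration.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"  # the XLink namespace URI

doc = f"""<reading-list xmlns:xlink="{XLINK}">
  <item xlink:type="simple" xlink:href="http://example.org/a.xml">First</item>
  <item xlink:type="simple" xlink:href="http://example.org/b.xml">Second</item>
  <item>No link here</item>
</reading-list>"""

root = ET.fromstring(doc)

# Namespaced attributes use the same {namespace-uri}name spelling as elements.
href_key = f"{{{XLINK}}}href"
links = [el.get(href_key) for el in root.iter() if href_key in el.attrib]
print(links)
```
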
XPointer is an XPath-based little language for referring to locations and regions within an XML document, and may be used in conjunction with XLink. It is broadly equivalent to, but much more complex and powerful than, the fragment identifiers used by the URLs in HTML anchor tags.
XSLT is XSL Transformations, where XSL is the Extensible Stylesheet Language: a so-called ``stylesheet language''. It is actually a rich and complex language for specifying arbitrary transformations on XML documents, often but not always resulting in more XML: some common alternative uses include generating user-facing HTML and even plain text.
CSS is Cascading Style Sheets - a much simpler stylesheet language widely used with HTML to specify display aspects such as the fonts to be used for various-level headings, the spacing to use between paragraphs, etc.
SOAP is the Simple Object Access Protocol, a way for a client process to ask a server to do something for it. This is done by expressing the request in XML and sending it to the server, typically via HTTP; the server then returns the results of the requested operation to the client, again expressed in XML. The server is said to provide a Web Service.
SOAP is a form of RPC (Remote Procedure Call) using XML, and indeed SOAP supersedes an earlier, similar specification simply called XML-RPC.
It is not clear whether the name Simple Object Access Protocol means a simple protocol for accessing objects, a protocol for accessing simple objects or a protocol for simple access to objects.
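
The request half of a SOAP exchange might be sketched like this, using Python's standard ElementTree to build a SOAP 1.1 envelope. The GetPrice operation, its Symbol parameter and the example.org service namespace are all hypothetical; a real service defines its own. Sending the bytes (typically as an HTTP POST) is left out so the sketch stays self-contained.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "http://example.org/stock"                  # hypothetical service namespace

def build_request(symbol: str) -> bytes:
    """Build a SOAP 1.1 request calling a hypothetical GetPrice operation."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}GetPrice")
    ET.SubElement(call, f"{{{SERVICE_NS}}}Symbol").text = symbol
    return ET.tostring(envelope, encoding="utf-8")

request = build_request("XYZ")
print(request.decode("utf-8"))

# A server would parse the body in the same way to find the requested operation.
parsed = ET.fromstring(request)
operation = parsed.find(f"{{{SOAP_ENV}}}Body")[0]
print(operation.tag, operation[0].text)
```

The response travels back in the same kind of envelope, with the result in the body in place of the call.
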
WSDL is the Web Services Description Language, a formal notation for describing the requests that can be made of a SOAP server and what responses can be expected. WSDL documents are expressed in XML, which makes them very hard to read.
UDDI is an evil and incomprehensible monster from outer space. It stands for Universal Description, Discovery and Integration.

And that's all you need to know!

[Confession: when I print this with Mozilla 1.5 on Red Hat 9, using my browser-default font (12-pixel Sans Serif), it actually uses one and a third pages. But that would have made a terrible title.]

Feedback to <mike@miketaylor.org.uk> is welcome!