All You Need to Know about XML in One Page
20th February 2004
The world of XML is drowning in alphabet soup! There's just too much
to know these days, what with CSS, the DOM, DTDs, HTML, HTTP,
Resources, SAX, SOAP, URIs, URLs, URNs, WSDL, XHTML, XLink, XML,
XML Namespaces, XML Schema, XPath, XPointer and XSLT.
Here, at last, is a one-page guide that tells you in the simplest
terms what all these things are and how they fit together. Enjoy!
-
A Resource
is a thing somewhere in the world, often but not always
on the Internet. Common resources include things like documents
(plain text, HTML, XML, PDF, Microsoft Word), images, video clips,
database server endpoints, etc.
-
A URI
is a Uniform Resource Identifier - a string uniquely
identifying a resource. It always begins with a short string
called the Scheme, followed by a colon and then some
scheme-specific identifying information. Each URI is either a URN
or a URL (see below).
-
A URN
is a Uniform Resource Name - a URI that names a resource but
does not give a resolvable address for it. For example, a book's
ISBN can be specified as a URN, urn:isbn:0-395-36341-1
-
A URL
is a Uniform Resource Locator - a URI that gives the
location of a resource so that it can be obtained directly.
Popular URL schemes include HTTP (see below), HTTPS and
FTP. For example, the URL of the page you're reading now, which
is accessible using HTTP, is
http://miketaylor.org.uk/tech/xml.html
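
The scheme-then-colon shape shared by URNs and URLs is easy to see by
splitting the two example URIs above. This is only a minimal sketch
using Python's standard urllib.parse module; the variable names are
mine, not part of any specification.

```python
from urllib.parse import urlsplit

# The two example URIs quoted in the entries above.
urn = urlsplit("urn:isbn:0-395-36341-1")
url = urlsplit("http://miketaylor.org.uk/tech/xml.html")

# Every URI starts with a scheme, followed by a colon.
print(urn.scheme)   # urn
print(url.scheme)   # http

# What follows the colon is scheme-specific: a URN carries a name,
# whereas an http URL carries a host and a path to fetch from it.
print(urn.path)     # isbn:0-395-36341-1
print(url.netloc)   # miketaylor.org.uk
print(url.path)     # /tech/xml.html
```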
-
HTML
is HyperText Markup Language, the language used for
expressing what web pages should look like.
This page is written in HTML.
-
HTTP
is HyperText Transfer Protocol, the protocol used to retrieve
web pages and many other kinds of resource. http is the
most widely used URL scheme, providing URLs that specify how to
retrieve a resource using HTTP.
-
XML
is eXtensible Markup Language, a syntax for expressing
structured data that is used in many different applications,
from marked-up documents to arbitrary data records. XML documents
consist of elements which may have attributes, and
which contain other elements and/or text.
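
To make "elements, attributes and text" concrete, here is a small
sketch that parses a tiny XML document with Python's standard
xml.etree.ElementTree module. The document itself (a book with an id
attribute, a title and two authors) is invented for illustration.

```python
import xml.etree.ElementTree as ET

# An invented sample document: one element with an attribute,
# containing three child elements that each contain text.
doc = """<book id="b1">
  <title>An Example Title</title>
  <author>First Author</author>
  <author>Second Author</author>
</book>"""

root = ET.fromstring(doc)

print(root.tag)            # the element's name
print(root.attrib["id"])   # the value of its id attribute

# Iterating over an element yields its child elements in order.
for child in root:
    print(child.tag, "->", child.text)
```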
-
XHTML
is a more rigorously specified version of HTML which
conforms to the XML specifications - so XHTML documents are also
XML documents. They are painful to write, but easier to process
automatically than old-style HTML.
-
A DTD
is a Document Type Definition. It is associated with an XML
document to constrain what kinds of things can occur in it -
e.g. ``There may be zero or more author elements inside
each book element, and each one contains text''.
-
XML Schema
is a replacement for DTDs: a newer, more complex,
cumbersome and powerful notation for constraining XML document
contents. XML Schemas are themselves XML documents (conforming to
the Schema schema).
-
An XML Namespace
is an abstract place where XML elements ``come
from'', so that you can have multiple different elements with the
same name in the same XML document: the title element from
the library namespace, and the completely unrelated
title element from the heraldry namespace. Namespaces
are identified by URIs.
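
The two-titles situation can be sketched as follows, again with
xml.etree.ElementTree. The two namespace URIs and the lib/her
prefixes are hypothetical, made up purely for this example.

```python
import xml.etree.ElementTree as ET

LIBRARY = "http://example.org/ns/library"    # hypothetical namespace URI
HERALDRY = "http://example.org/ns/heraldry"  # hypothetical namespace URI

# Two unrelated "title" elements living in the same document.
doc = f"""<record xmlns:lib="{LIBRARY}" xmlns:her="{HERALDRY}">
  <lib:title>A Book Title</lib:title>
  <her:title>Baron</her:title>
</record>"""

root = ET.fromstring(doc)

# Map prefixes to namespace URIs for lookups; the prefixes are only
# shorthand, and it is the URI that actually identifies the namespace.
ns = {"lib": LIBRARY, "her": HERALDRY}
book_title = root.find("lib:title", ns).text
noble_title = root.find("her:title", ns).text

# ElementTree stores each name in expanded form: {namespace-uri}localname
tags = [child.tag for child in root]
print(book_title, noble_title, tags)
```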
-
The DOM
is the Document Object Model, a way of thinking about, and
an API for manipulating, XML documents as trees of objects
representing elements, attributes and text.
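
As a rough illustration of the tree-of-objects idea, this sketch uses
xml.dom.minidom, the small DOM implementation in Python's standard
library. The sample document and the author's name are invented.

```python
from xml.dom.minidom import parseString

# Parse an invented one-line document into a tree of node objects.
dom = parseString('<book id="b1"><title>An Example Title</title></book>')
book = dom.documentElement

# Elements, attributes and text are all reachable as objects.
title = book.getElementsByTagName("title")[0]
title_text = title.firstChild.data     # the text is itself a node
book_id = book.getAttribute("id")

# The same API manipulates the tree: add a new author element.
author = dom.createElement("author")
author.appendChild(dom.createTextNode("A. N. Author"))
book.appendChild(author)

print(book.toxml())
```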
-
SAX
is the Simple API for XML, an event-based API for XML parsing,
often used when the DOM approach of reading the whole parsed
document into memory at once would be too resource-heavy.
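
By way of contrast with the DOM, here is a minimal event-based sketch
using Python's standard xml.sax module. The AuthorCounter handler
class and the sample document are hypothetical; the point is only
that the parser calls you back as it meets each tag, and no tree is
ever built.

```python
import xml.sax

class AuthorCounter(xml.sax.ContentHandler):
    """Hypothetical handler: records events and counts author elements."""

    def __init__(self):
        super().__init__()
        self.count = 0
        self.events = []

    def startElement(self, name, attrs):
        # Called by the parser each time an opening tag is read.
        self.events.append(("start", name))
        if name == "author":
            self.count += 1

    def endElement(self, name):
        # Called by the parser each time a closing tag is read.
        self.events.append(("end", name))

handler = AuthorCounter()
xml.sax.parseString(
    b"<book><author>A</author><author>B</author></book>", handler
)
print(handler.count)
print(handler.events)
```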
-
XPath
is a little language for writing expressions that pluck out
parts of an XML document.
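
A couple of plucking expressions, sketched with
xml.etree.ElementTree. Note that ElementTree implements only a
limited subset of XPath, so this shows the flavour rather than the
whole language; the library document is invented.

```python
import xml.etree.ElementTree as ET

# An invented document to pluck things out of.
library = ET.fromstring("""<library>
  <book year="1999"><title>First</title><author>A</author></book>
  <book year="2003"><title>Second</title><author>B</author><author>C</author></book>
</library>""")

# ".//author" selects every author element at any depth.
all_authors = [a.text for a in library.findall(".//author")]

# A predicate in square brackets filters on an attribute value.
titles_2003 = [t.text for t in library.findall("./book[@year='2003']/title")]

print(all_authors)   # ['A', 'B', 'C']
print(titles_2003)   # ['Second']
```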
-
XLink
is a specification for a set of XML elements and attributes
used to express links from one XML document to another, and also
some more complex link topologies. It is broadly equivalent to,
but much more complex and powerful than, HTML anchor
(<A href="...">)
tags.
-
XPointer
is an XPath-based little language for referring to
locations and regions within an XML document, and may be used in
conjunction with XLink. It is broadly equivalent to, but much
more complex and powerful than, the fragment identifiers used by
the URLs in HTML anchor tags.
-
XSLT
is Extensible Stylesheet Language Transformations: a so-called
``stylesheet language''. It is actually a rich and complex
language for specifying arbitrary transformations on XML
documents, often but not always resulting in more XML: some common
alternative uses include generating user-facing HTML and even
plain text.
-
CSS
is Cascading Style Sheets - a much simpler stylesheet language
widely used with HTML to specify display aspects such as the fonts
to be used for various-level headings, the spacing to use between
paragraphs, etc.
-
SOAP
is the Simple Object Access Protocol, a way for a client
process to ask a server to do something for it. This is done by
expressing the request in XML and sending it to the server,
typically via HTTP; the server then returns the results of the
requested operation to the client, again expressed in XML. The
server is said to provide a Web Service.
SOAP is a form of RPC (Remote Procedure Call) using XML, and
indeed SOAP supersedes an earlier, similar specification simply
called XML-RPC.
It is not clear whether the name Simple Object Access Protocol
means a simple protocol for accessing objects, a protocol for
accessing simple objects or a protocol for simple access to
objects.
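
To show what "expressing the request in XML" might look like, this
sketch builds a SOAP 1.1 request envelope with
xml.etree.ElementTree. The envelope namespace is the standard SOAP
1.1 one; the service namespace, the GetTitle operation and its isbn
parameter are all hypothetical, and nothing is actually sent to a
server.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "http://example.org/ns/bookservice"        # hypothetical service

def build_request(isbn):
    """Build a request for a hypothetical GetTitle operation."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The element inside the Body names the operation being requested;
    # its children carry the parameters.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}GetTitle")
    ET.SubElement(call, f"{{{SERVICE_NS}}}isbn").text = isbn
    return ET.tostring(envelope, encoding="unicode")

request_xml = build_request("0-395-36341-1")
print(request_xml)
```

A real client would typically send this string to the server in the
body of an HTTP POST and parse the XML that comes back.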
-
WSDL
is the Web Services Description Language, a formal notation
for describing the requests that can be made of a SOAP server and
what responses can be expected. WSDL documents are expressed in
XML, which makes them very hard to read.
-
UDDI
is an evil and incomprehensible monster from outer space.
It stands for Universal Description, Discovery and Integration.
And that's all you need to know!
[Confession: when I print this with Mozilla 1.5 on Red Hat 9, using
my browser-default font (12-pixel Sans Serif), it actually uses one
and a third pages. But that would have made a terrible title.]