Alvis::Pipeline - Perl extension for passing XML documents along the Alvis pipeline
    use Alvis::Pipeline;

    $in  = new Alvis::Pipeline::Read(host => "harvester.alvis.info", port => 16716);
    $out = new Alvis::Pipeline::Write(port => 29168, spooldir => "/home/alvis/spool");
    while ($xmlDOM = $in->read(1)) {
        $transformed = process($xmlDOM);
        $out->write($transformed);
    }
This module provides a simple means for components in the Alvis
pipeline to pass documents between themselves without needing to know
about the underlying transfer protocol. Pipe objects may be created
either for reading or writing; components in the middle of the
pipeline will create one of each. Each pipe supports exactly one
data-transfer method, either read() or write(), depending on the type
of the pipe. The granularity of reading and writing is
the XML document; neither smaller fragments nor larger aggregates can
be transferred.
Underneath this interface, the Open Archives Initiative's Protocol for
Metadata Harvesting (OAI-PMH) is used. Although the Alvis pipeline is
often presented in terms of documents being pushed down the pipeline
from above, the implementation in fact pulls documents down from
below. This is achieved by having the write() method simply store
the XML document in a spooling area, where it awaits a request from a
reader that will take it from that area. Therefore, write() will
never block, but code that writes documents down the pipeline may
not assume that a document, once successfully written, has necessarily
been successfully read by the downstream component.
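The pull-from-below arrangement can be illustrated with a minimal sketch that uses only core Perl. The helpers spool_write() and spool_read(), and the file-naming scheme, are assumptions made for illustration; they are not the functions this module uses internally, and the real transfer goes over OAI-PMH rather than a shared directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helpers illustrating the spooling idea only; these are
# not part of Alvis::Pipeline.

my $seq = 0;    # sequence number used to keep spooled files in order

# "Write": drop the document into the spool directory and return at once.
sub spool_write {
    my ($dir, $xml) = @_;
    my $path = File::Spec->catfile($dir, sprintf("%08d.xml", ++$seq));
    open(my $fh, '>', $path) or die "can't write $path: $!";
    print {$fh} $xml;
    close($fh) or die "can't close $path: $!";
    return $path;
}

# "Read": take the oldest spooled document and remove it from the spool.
# Returns undef when the spool is empty.
sub spool_read {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "can't open $dir: $!";
    my @files = sort grep { /\.xml\z/ } readdir($dh);
    closedir($dh);
    return unless @files;
    my $path = File::Spec->catfile($dir, $files[0]);
    open(my $fh, '<', $path) or die "can't read $path: $!";
    my $xml = do { local $/; <$fh> };
    close($fh);
    unlink($path) or die "can't remove $path: $!";
    return $xml;
}

my $spool = tempdir(CLEANUP => 1);
spool_write($spool, "<doc id='1'/>");   # returns immediately; nobody has read it yet
spool_write($spool, "<doc id='2'/>");
my $first = spool_read($spool);         # the reader pulls when it is ready
```

The writer never waits for the reader, which is why a successful write says nothing about whether the document has been consumed downstream.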
The adoption of OAI-PMH as the pipeline's document-passing protocol gives us a pre-made way to express the various necessary operations, and will make it easier in future for new Alvis components to be written that can participate in the pipeline.
In general, though, document producers, filters and consumers in the Alvis pipeline that use this module need not be concerned with OAI-PMH, and can simply use the API described herein.
The documents expected to pass through this pipeline are those representing documents acquired for, and being analysed by, Alvis. These documents are expressed as XML constructed according to the specifications described in the Metadata Format for Enriched Documents. However, while this is the motivating example pipeline that led to the creation of this module, there is no reason why other kinds of documents should not also be passed through a pipeline using this software.
new()
    $in  = new Alvis::Pipeline::Read(host => "harvester.alvis.info", port => 16716);
    $out = new Alvis::Pipeline::Write(port => 29168, spooldir => "/home/alvis/spool");
Creates a new pipeline, either for reading or for writing. Any number of name-value pairs may be passed as parameters. Among these, most are optional but some are mandatory:
Read-pipes must specify the host and port of the component
that they will read from.
Write-pipes must specify both spooldir, a directory that is writable
by the user the process is running as, and the port that the OAI-PMH
server should listen on. (Documents that are written to a write-pipe
are in fact stored in the specified spool directory until picked up
by a reader.)
Write-pipes may specify loglevel [default 0], which provokes the
under-the-hood OAI server into providing some commentary on its
behaviour.
read()
    # Read-pipes only
    $xmlDOM = $in->read($block);
Reads an XML document from the specified inbound pipe, and returns a
DOM tree representing it. If there is no document ready to read, it
either returns an undefined value (if no argument is provided, or if
the argument is false) or blocks if the argument is provided and true.
read() throws an exception if an error occurs.
Once a document has been read in this way, it will no longer be
available for subsequent read()s, so a sequence of read() calls
will read all the available records one at a time. This is unusual
behaviour for an OAI-PMH repository, but then what we are doing here
is an unusual deployment of OAI-PMH.
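Because each document is delivered only once, a component can empty a pipe by polling with a false argument until read() returns an undefined value. The following is a minimal sketch of that loop. drain_pipe() is a hypothetical helper, not part of this module; it relies only on the behaviour described above and works with any object that provides a read() method.

```perl
use strict;
use warnings;

# drain_pipe() is a hypothetical helper, not part of Alvis::Pipeline.
# It relies only on the documented behaviour of read(): an undefined
# return when nothing is ready, and an exception on error.
sub drain_pipe {
    my ($pipe, $handler) = @_;
    my $count = 0;
    while (1) {
        my $doc;
        my $ok = eval { $doc = $pipe->read(0); 1 };   # non-blocking read
        die "read failed: $@" unless $ok;
        last unless defined $doc;                     # nothing ready: stop polling
        $handler->($doc);
        $count++;
    }
    return $count;    # number of documents handled
}

# With a real read-pipe (host and port as in the synopsis):
#   my $in = new Alvis::Pipeline::Read(host => "harvester.alvis.info", port => 16716);
#   my $n  = drain_pipe($in, sub { process($_[0]) });
```

Passing a true argument to read() instead would make the loop wait for the next document rather than stop when the pipe is momentarily empty.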
write()
    # Write-pipes only
    $out->write($xmlDocument);
Writes an XML document to the specified outbound pipe. The document
may be passed in either as a DOM tree (XML::LibXML::Element) or a
string containing the text of the document. Throws an exception if an
error occurs.
(In reality, all this does is place the document in a spooling area, whence it will subsequently be picked up when the downstream component asks to read a record. But that implementation detail can be ignored.)
close()
    $pipe->close();
Closes a pipe, after which no further reading or writing may be done on it. This is important for write-pipes, as it frees up the Internet port that the under-the-hood OAI server is listening on. Tidy reading clients will also call this when they're done.
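Since read() and write() throw exceptions, it can be useful to guarantee that close() is still called when processing fails part-way. The following is a minimal sketch of one way to do that. with_pipe() is a hypothetical helper, not part of this module; it assumes only that the object passed in provides the close() method described above.

```perl
use strict;
use warnings;

# with_pipe() is a hypothetical helper, not part of Alvis::Pipeline.
# It runs $code with the pipe and closes the pipe afterwards, whether
# or not $code threw an exception.
sub with_pipe {
    my ($pipe, $code) = @_;
    my $ok  = eval { $code->($pipe); 1 };
    my $err = $@;
    $pipe->close();          # for a write-pipe, this releases the listening port
    die $err unless $ok;     # re-throw the original error after cleanup
    return;
}

# With a real read-pipe (host and port as in the synopsis):
#   my $in = new Alvis::Pipeline::Read(host => "harvester.alvis.info", port => 16716);
#   with_pipe($in, sub { my $doc = $_[0]->read(1); process($doc) });
```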
Alvis Task T3.2 - Metadata Format for Enriched Documents. Milestone M3.2 - Month 12 (December 2004). Includes a useful overview of the Alvis processing pipeline. http://www.miketaylor.org.uk/alvis/t3-2/m3-2.html
The Open Archives Initiative. http://www.openarchives.org/
The Open Archives Initiative Protocol for Metadata Harvesting Version 2.0. http://www.openarchives.org/OAI/openarchivesprotocol.html
Mike Taylor, <mike@indexdata.com>
Copyright (C) 2005 by Index Data ApS.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.