Building an Apache XML/HTML Rewriting Stack
URL rewriting is not enough
Apache has bucket brigades, which are essentially lists of output buffers, at the heart of its filtering architecture. The buckets are moved, one brigade at a time, through the output filters that manipulate them. The first idea is to put SAX events into those buckets. This is possible because buckets morph into plain text output when their read function is called. So there is a SAX filter that turns the outgoing bucket stream into a stream of SAX events. These can be rewritten by subsequent filters. Whatever happens, they morph into text before they finally reach the network.
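The idea can be sketched outside of Apache. The following is a minimal Python model, not the mod_xml2 implementation: all class and function names are hypothetical, and Python's standard SAX parser stands in for the real parser. It shows a SAX filter turning text buckets into event buckets, a later filter rewriting those events, and every bucket morphing back into text through its read() method.

```python
import xml.sax
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr


@dataclass
class TextBucket:
    """A plain output buffer (hypothetical stand-in for a normal bucket)."""
    data: str

    def read(self) -> str:
        return self.data


@dataclass
class StartBucket:
    """Hypothetical SAX bucket holding a start-element event."""
    name: str
    attrs: dict = field(default_factory=dict)

    def read(self) -> str:
        # "Morphing": serialize the event back into markup text.
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in self.attrs.items())
        return f"<{self.name}{attrs}>"


@dataclass
class EndBucket:
    name: str

    def read(self) -> str:
        return f"</{self.name}>"


@dataclass
class CharsBucket:
    text: str

    def read(self) -> str:
        return escape(self.text)


class _Collector(xml.sax.ContentHandler):
    """Collects SAX events as buckets."""

    def __init__(self):
        super().__init__()
        self.buckets = []

    def startElement(self, name, attrs):
        self.buckets.append(StartBucket(name, dict(attrs.items())))

    def endElement(self, name):
        self.buckets.append(EndBucket(name))

    def characters(self, content):
        self.buckets.append(CharsBucket(content))


def sax_filter(brigade):
    """Turn a brigade of text buckets into a brigade of SAX event buckets."""
    text = "".join(b.read() for b in brigade)
    handler = _Collector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.buckets


def rename_filter(brigade, old, new):
    """A subsequent filter: rewrites element names, passes the rest through."""
    for bucket in brigade:
        if isinstance(bucket, (StartBucket, EndBucket)) and bucket.name == old:
            bucket.name = new
        yield bucket


def to_network(brigade):
    """Whatever the buckets hold, read() yields plain text at the end."""
    return "".join(bucket.read() for bucket in brigade)


brigade = [TextBucket('<b class="x">1 &lt; '), TextBucket("2</b>")]
output = to_network(rename_filter(sax_filter(brigade), "b", "strong"))
```

The model parses the whole brigade at once for brevity. A real filter would feed the parser incrementally, brigade by brigade, so that the stack stays streaming.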
This has been implemented in mod_xml2. The current problem is that modules that manipulate SAX buckets need to be written in C. The existing modules mod_xi and mod_i18n were too hard to write (and are currently not sufficiently maintained).
Plans for the Retry
My current plan (which is already work in progress) is therefore to
make SAX buckets available to higher-level languages with access to
the Apache API, namely Perl and Lua. Since this implies wrapping the
SAX events with an API that then must be made available to said
languages, I use libxml2 DOM nodes for this. These are already
wrapped. Even more important, they have a well-documented API for
both languages.
The SAX buckets have been renamed to node buckets since their
binary format is completely different and since they hold
nodes. The switch to node buckets also saves a lot of code in
mod_xml2. Functionality already provided by libxml2 does not need to
be reimplemented.
Parsing the outgoing XML runs the
tree builder with hooked SAX handlers. Element nodes are removed from
the tree in the end handler; all other nodes are removed
immediately. Node buckets are shared buckets with reference counting.
This is used to have the start and end element buckets hold the same
node. As a result it is easy to rebuild the tree from the bucket
stream, since the start bucket already knows the end bucket.
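To illustrate why sharing one node between the start and end bucket makes rebuilding easy, here is a small Python model. It is only a sketch: Node, NodeBucket, element_buckets and rebuild are hypothetical names, and Python object identity stands in for the reference-counted shared bucket.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a libxml2 DOM node detached from its tree."""
    name: str
    children: list = field(default_factory=list)
    text: str = ""


@dataclass
class NodeBucket:
    kind: str   # "start", "end" or "text"
    node: Node  # the start and end bucket of one element share this object


def element_buckets(name):
    """Create the start/end bucket pair for one element around a shared node."""
    node = Node(name)
    return NodeBucket("start", node), NodeBucket("end", node)


def rebuild(buckets):
    """Re-link the nodes of a bucket stream into a tree and return its root."""
    root = None
    stack = []  # nodes of the currently open elements
    for bucket in buckets:
        if bucket.kind == "end":
            closed = stack.pop()
            # The end bucket carries the very node its start bucket opened.
            assert closed is bucket.node
            continue
        if stack:
            stack[-1].children.append(bucket.node)
        elif bucket.kind == "start" and root is None:
            root = bucket.node
        if bucket.kind == "start":
            stack.append(bucket.node)
    return root


html_start, html_end = element_buckets("html")
p_start, p_end = element_buckets("p")
text = NodeBucket("text", Node("#text", text="hi"))

tree = rebuild([html_start, p_start, text, p_end, html_end])
```

Because both buckets of an element point at the same node, a filter that sees a start bucket can already attach data that will still be there when the matching end bucket arrives.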
libxml2 implements streaming XPath,
which allows matching a very restricted subset of XPath
expressions while parsing. Using these it should be easy
to implement filters that call a given callback, passing the matched
subtree as a parameter. The point is that only these
subtrees need to be built.
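A rough Python sketch of such a callback filter follows. It does not use libxml2's streaming matcher: as an assumption for illustration, the "pattern" is just an absolute path of element names, and SubtreeMatcher is a hypothetical name. What it does show is the key property: a tree is only built while the parser is inside a match.

```python
import xml.sax
import xml.etree.ElementTree as ET


class SubtreeMatcher(xml.sax.ContentHandler):
    """Calls callback(subtree) for every element matching a simple path.

    Only absolute paths of element names such as "/doc/item" are supported.
    """

    def __init__(self, pattern, callback):
        super().__init__()
        self.pattern = pattern.strip("/").split("/")
        self.callback = callback
        self.path = []       # names of the currently open elements
        self.builder = None  # tree builder, active only inside a match

    def startElement(self, name, attrs):
        self.path.append(name)
        if self.builder is None and self.path == self.pattern:
            self.builder = ET.TreeBuilder()
        if self.builder is not None:
            self.builder.start(name, dict(attrs.items()))

    def characters(self, content):
        if self.builder is not None:
            self.builder.data(content)

    def endElement(self, name):
        if self.builder is not None:
            self.builder.end(name)
            if len(self.path) == len(self.pattern):
                # The matched element itself was closed: hand over the subtree.
                self.callback(self.builder.close())
                self.builder = None
        self.path.pop()


doc = b"<doc><skip><x/></skip><item id='1'>a<b>c</b></item><item id='2'/></doc>"
matches = []
xml.sax.parseString(doc, SubtreeMatcher("/doc/item", matches.append))
```

Here the skip element and its children never become tree nodes; only the two item subtrees are built and passed to the callback.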
KID-like template engines that execute
processing instructions should also be doable.
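As a toy illustration of the processing-instruction idea, the following Python sketch replaces every PI with a given target by the result of evaluating its content. The target name eval, the PIExecutor class and the render function are assumptions made up for this example, and evaluating template code like this is only acceptable for trusted templates.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class PIExecutor(XMLGenerator):
    """Serializes the event stream, executing <?eval ...?> instructions."""

    def __init__(self, out, namespace):
        super().__init__(out, encoding="utf-8")
        self.namespace = namespace  # variables visible to the template

    def processingInstruction(self, target, data):
        if target == "eval":
            # Replace the instruction with the (escaped) value of the expression.
            self.characters(str(eval(data, {}, self.namespace)))
        else:
            # Unknown instructions pass through untouched.
            super().processingInstruction(target, data)


def render(template: bytes, namespace: dict) -> str:
    out = io.StringIO()
    xml.sax.parseString(template, PIExecutor(out, namespace))
    return out.getvalue()


page = render(b"<p>Hello <?eval answer + 1?></p>", {"answer": 41})
```

In a bucket-based filter the same logic would run on the node bucket holding the processing instruction and replace it with buckets for the generated content.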
My current project goals are to
stay streaming and
The last one is because I like
It is highly useful for web stuff because it can also parse HTML. It
is also to justify the name.