Building an Apache XML/HTML Rewriting Stack

URL rewriting is not enough

Bucket Abuse

Apache has bucket brigades, which are essentially lists of output buffers, at the heart of its filtering architecture. The buckets are moved, one brigade at a time, through the output filters that manipulate them. The first idea is to put SAX events into those buckets. This is possible because a bucket turns into plain text output when its read function is called. So there is a SAX filter that turns the outgoing bucket stream into a stream of SAX events. Subsequent filters can rewrite these events. Whatever happens, the buckets morph into text before they finally reach the network.
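The idea can be modelled outside Apache. The following is a minimal sketch in Python, not the mod_xml2 code: the class and function names (StartBucket, sax_filter, rewrite_links, to_network) are invented for illustration, a plain list stands in for a brigade, and Python's xml.sax stands in for the real parser. It only shows the principle that every bucket carries an event and can still turn itself into text through read().

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr

# Hypothetical bucket types: each holds one SAX event and can still
# "morph" into plain text through read().
class StartBucket:
    def __init__(self, name, attrs):
        self.name, self.attrs = name, dict(attrs)

    def read(self):
        attrs = "".join(f" {k}={quoteattr(v)}" for k, v in self.attrs.items())
        return f"<{self.name}{attrs}>"

class EndBucket:
    def __init__(self, name):
        self.name = name

    def read(self):
        return f"</{self.name}>"

class TextBucket:
    def __init__(self, text):
        self.text = text

    def read(self):
        return escape(self.text)

class _Collector(xml.sax.ContentHandler):
    """Appends one bucket per SAX event to a list (the 'brigade')."""
    def __init__(self):
        super().__init__()
        self.brigade = []

    def startElement(self, name, attrs):
        self.brigade.append(StartBucket(name, attrs.items()))

    def endElement(self, name):
        self.brigade.append(EndBucket(name))

    def characters(self, content):
        self.brigade.append(TextBucket(content))

def sax_filter(text):
    """Turn outgoing text into a brigade of event buckets."""
    handler = _Collector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.brigade

def rewrite_links(brigade):
    """A downstream filter: rewrite href attributes, pass everything else on."""
    for bucket in brigade:
        if isinstance(bucket, StartBucket) and bucket.name == "a" and "href" in bucket.attrs:
            bucket.attrs["href"] = "/proxy" + bucket.attrs["href"]
        yield bucket

def to_network(brigade):
    """The last step only calls read(); it does not care what the buckets were."""
    return "".join(bucket.read() for bucket in brigade)

doc = '<p>See <a href="/docs">the docs</a></p>'
out = to_network(rewrite_links(sax_filter(doc)))
print(out)  # <p>See <a href="/proxy/docs">the docs</a></p>
```

The rewriting filter never touches text; it only sees events. Serialization stays in the buckets themselves.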

First Try

This has been implemented in mod_xml2. The current problem is that modules that manipulate SAX buckets need to be written in C. The existing modules mod_xi and mod_i18n were too hard to write (and are currently not sufficiently maintained).

Plans for the Retry

My current plan (already work in progress) is therefore to make SAX buckets available to higher-level languages that have access to the Apache API, namely Perl and Lua. This implies wrapping the SAX events in an API that must then be made available to those languages, so I use libxml2 DOM nodes for this. These are already wrapped. Even more important, they have a well-documented API in both languages.

The SAX buckets have been renamed to node buckets, since their binary format is completely different and since they hold libxml2 nodes. The switch to node buckets also saves a lot of code in mod_xml2: functionality already implemented in libxml2 does not need to be reimplemented.

Parsing the outgoing XML runs the libxml2 tree builder with hooked SAX handlers. Element nodes are removed from the tree in the end handler; all other nodes are removed immediately. Node buckets are shared buckets with reference counting. This is used to have the start and end element buckets hold the same node. As a result, it is easy to rebuild the tree from the bucket stream, since the start bucket already knows the end bucket.
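Why the shared node makes rebuilding easy can be shown with a small model. This is a sketch under assumptions, not the mod_xml2 data structure: NodeBucket, to_buckets and rebuild are hypothetical names, Python's xml.etree.ElementTree stands in for libxml2 nodes, and ordinary object identity stands in for reference counting.

```python
import xml.etree.ElementTree as ET

class NodeBucket:
    """Hypothetical node bucket: kind is 'start', 'end' or 'text'."""
    def __init__(self, kind, node):
        self.kind = kind
        self.node = node  # an Element for start/end, a string for text

def to_buckets(elem):
    """Stand-in for the parser side: flatten a tree into a bucket stream.

    The start and end bucket of an element hold the very same detached node.
    """
    shell = ET.Element(elem.tag, dict(elem.attrib))  # node without children
    yield NodeBucket("start", shell)
    if elem.text:
        yield NodeBucket("text", elem.text)
    for child in elem:
        yield from to_buckets(child)
        if child.tail:
            yield NodeBucket("text", child.tail)
    yield NodeBucket("end", shell)

def rebuild(buckets):
    """Rebuild the tree from the stream by relinking the shared nodes."""
    stack, root = [], None
    for bucket in buckets:
        if bucket.kind == "start":
            if stack:
                stack[-1].append(bucket.node)
            else:
                root = bucket.node
            stack.append(bucket.node)
        elif bucket.kind == "end":
            # The end bucket carries the node its start bucket pushed.
            assert stack.pop() is bucket.node
        else:
            parent = stack[-1]
            if len(parent):  # text after a child element is that child's tail
                last = parent[-1]
                last.tail = (last.tail or "") + bucket.node
            else:
                parent.text = (parent.text or "") + bucket.node
    return root

src = "<ul><li>one</li>and<li>two</li></ul>"
buckets = list(to_buckets(ET.fromstring(src)))
tree = rebuild(buckets)
print(ET.tostring(tree, encoding="unicode"))  # <ul><li>one</li>and<li>two</li></ul>
```

Because both buckets point at one node, a filter that sees the start bucket can already attach children to the node that the end bucket will close.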

Further Plans

libxml2 implements streaming XPath expressions, which allow matching a very restricted subset of XPath while parsing. Using these, it should be easy to implement filters that call a given callback, passing the matched subtree as a parameter. The point is that only these subtrees need to be built.
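What such a filter could look like is sketched below. The sketch does not use libxml2's streaming matcher. It swaps in xml.etree.ElementTree.iterparse from the Python standard library and a hand-written matcher for absolute paths such as /feed/entry. The function name stream_match and the path syntax are assumptions made for the example.

```python
import io
import xml.etree.ElementTree as ET

def stream_match(source, path, callback):
    """Call callback(subtree) for every element whose absolute path equals path.

    Only a simple /a/b/c path is supported. Elements outside a match are
    cleared as soon as they end, so only matched subtrees are kept intact.
    """
    target = path.strip("/").split("/")
    stack = []  # tag names from the root down to the current element
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            continue
        if stack == target:
            callback(elem)  # the complete subtree is available here
            elem.clear()    # then drop it
        elif not (len(stack) > len(target) and stack[:len(target)] == target):
            elem.clear()    # not part of any match: discard its content
        stack.pop()

doc = (
    "<feed><meta>x</meta>"
    "<entry><title>A</title></entry>"
    "<entry><title>B</title></entry></feed>"
)
titles = []
stream_match(io.BytesIO(doc.encode("utf-8")), "/feed/entry",
             lambda entry: titles.append(entry.findtext("title")))
print(titles)  # ['A', 'B']
```

The callback receives a small, fully built subtree, while the rest of the document is only streamed past.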

Implementing Kid-like template engines that execute <?perl and <?lua processing instructions should also be doable.

Goals

My current project goals are to

  1. become usable,

  2. stay streaming and

  3. be libxml2ish.

The last one is because I like libxml2. It is highly useful for web stuff because it can also parse HTML. It is also to justify the name.