.. -*- mode: rst; encoding: utf-8 -*-
==============
Markup Streams
==============

A stream is the common representation of markup as a *stream of events*.

.. contents:: Contents
   :depth: 2
.. sectnum::

Basics
======
A stream can be obtained in a number of ways. It can be:

* the result of parsing XML or HTML text, or
* the result of selecting a subset of another stream using XPath, or
* programmatically generated.

For example, the functions ``XML()`` and ``HTML()`` can be used to convert
literal XML or HTML text to a markup stream:

.. code-block:: pycon

  >>> from genshi import XML
  >>> stream = XML('<p class="intro">Some text and '
  ...              '<a href="http://example.org/">a link</a>.'
  ...              '<br/></p>')
  >>> stream
  <genshi.core.Stream object at ...>

The stream is the result of parsing the text into events. Each event is a tuple
of the form ``(kind, data, pos)``, where:

* ``kind`` defines what kind of event it is (such as the start of an element,
  text, a comment, etc.).
* ``data`` is the actual data associated with the event. How this looks depends
  on the event kind (see `event kinds`_).
* ``pos`` is a ``(filename, lineno, column)`` tuple that describes where the
  event “comes from”.


.. code-block:: pycon

  >>> for kind, data, pos in stream:
  ...     print kind, `data`, pos
  ...
  START (QName(u'p'), Attrs([(QName(u'class'), u'intro')])) (None, 1, 0)
  TEXT u'Some text and ' (None, 1, 17)
  START (QName(u'a'), Attrs([(QName(u'href'), u'http://example.org/')])) (None, 1, 31)
  TEXT u'a link' (None, 1, 61)
  END QName(u'a') (None, 1, 67)
  TEXT u'.' (None, 1, 71)
  START (QName(u'br'), Attrs()) (None, 1, 72)
  END QName(u'br') (None, 1, 77)
  END QName(u'p') (None, 1, 77)

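
Because every event is a plain ``(kind, data, pos)`` tuple, a stream can be
inspected with ordinary iteration. The following is a minimal sketch rather
than Genshi code: the hand-written ``events`` list and the ``collect_text()``
and ``element_names()`` helpers are hypothetical, and plain strings and tuples
stand in for Genshi's event kinds and its ``QName`` and ``Attrs`` objects.

.. code-block:: python

  # Hand-written stand-in for the events of the example paragraph.
  # Assumption: plain strings and tuples replace Genshi's event kinds,
  # QName and Attrs objects.
  events = [
      ('START', ('p', [('class', 'intro')]), (None, 1, 0)),
      ('TEXT', 'Some text and ', (None, 1, 17)),
      ('START', ('a', [('href', 'http://example.org/')]), (None, 1, 31)),
      ('TEXT', 'a link', (None, 1, 61)),
      ('END', 'a', (None, 1, 67)),
      ('TEXT', '.', (None, 1, 71)),
      ('START', ('br', []), (None, 1, 72)),
      ('END', 'br', (None, 1, 77)),
      ('END', 'p', (None, 1, 77)),
  ]

  def collect_text(stream):
      """Concatenate the data of all TEXT events."""
      return ''.join(data for kind, data, pos in stream if kind == 'TEXT')

  def element_names(stream):
      """List element names in the order their START events occur."""
      return [data[0] for kind, data, pos in stream if kind == 'START']

  text = collect_text(events)      # 'Some text and a link.'
  names = element_names(events)    # ['p', 'a', 'br']
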
Filtering
=========
One important feature of markup streams is that you can apply *filters* to the
stream, either filters that come with Genshi, or your own custom filters.

A filter is simply a callable that accepts the stream as its parameter and
returns the filtered stream:

.. code-block:: python

  def noop(stream):
      """A filter that doesn't actually do anything with the stream."""
      for kind, data, pos in stream:
          yield kind, data, pos

Filters can be applied in a number of ways. The simplest is to just call the
filter directly:

.. code-block:: python

  stream = noop(stream)

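
A filter that actually changes the stream follows the same pattern. As a
minimal sketch, the hypothetical ``upper_text()`` filter below upper-cases text
events and passes everything else through. It is called directly on a
hand-written list of tuples in which the string ``'TEXT'`` stands in for
Genshi's text event kind, so it runs without Genshi installed.

.. code-block:: python

  def upper_text(stream):
      """Hypothetical filter: upper-case TEXT events, pass others through."""
      for kind, data, pos in stream:
          if kind == 'TEXT':
              data = data.upper()
          yield kind, data, pos

  # Assumption: a simplified stand-in for a parsed stream.
  events = [
      ('START', ('em', []), (None, 1, 0)),
      ('TEXT', 'hello', (None, 1, 4)),
      ('END', 'em', (None, 1, 9)),
  ]

  # Call the filter directly and materialize the resulting generator.
  filtered = list(upper_text(events))
  # filtered[1] == ('TEXT', 'HELLO', (None, 1, 4))

Because the filter is a generator, events are processed lazily, one at a time,
as the consumer iterates.
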
The ``Stream`` class also provides a ``filter()`` method, which takes an
arbitrary number of filter callables and applies them all:

.. code-block:: python

  stream = stream.filter(noop)

Finally, filters can also be applied using the *bitwise or* operator (``|``),
which allows a syntax similar to pipes on Unix shells:

.. code-block:: python

  stream = stream | noop

One example of a filter included with Genshi is the ``HTMLSanitizer`` in
``genshi.filters``. It processes a stream of HTML markup, and strips out any
potentially dangerous constructs, such as JavaScript event handlers.
``HTMLSanitizer`` is not a function, but rather a class that implements
``__call__``, which means instances of the class are callable:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream | HTMLSanitizer()

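
Writing a filter as a class is useful when the filter needs configuration. The
sketch below uses a hypothetical ``ElementRemover`` class (not part of Genshi,
and far simpler than ``HTMLSanitizer``) whose instances drop the named elements
together with their content. Plain strings again stand in for Genshi's event
kinds and qualified names, and ``POS`` is a placeholder position.

.. code-block:: python

  class ElementRemover(object):
      """Hypothetical configurable filter: drop the named elements entirely."""

      def __init__(self, names):
          self.names = set(names)

      def __call__(self, stream):
          depth = 0  # nesting depth inside an element that is being removed
          for kind, data, pos in stream:
              if kind == 'START':
                  if depth or data[0] in self.names:
                      depth += 1
                      continue
              elif kind == 'END':
                  if depth:
                      depth -= 1
                      continue
              elif depth:
                  continue  # text or other events inside a removed element
              yield kind, data, pos

  POS = (None, 1, 0)  # placeholder position for the hand-written events
  events = [
      ('START', ('p', []), POS),
      ('TEXT', 'safe', POS),
      ('START', ('script', []), POS),
      ('TEXT', 'alert(1)', POS),
      ('END', 'script', POS),
      ('END', 'p', POS),
  ]

  remove_scripts = ElementRemover(['script'])  # the instance is the filter
  kept = list(remove_scripts(events))
  # kept holds only the START, TEXT and END events of the <p> element
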
Both the ``filter()`` method and the pipe operator allow easy chaining of
filters:

.. code-block:: python

  from genshi.filters import HTMLSanitizer
  stream = stream.filter(noop, HTMLSanitizer())

That is equivalent to:

.. code-block:: python

  stream = stream | noop | HTMLSanitizer()

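
The equivalence of the two spellings can be pictured with a small stand-in
class. ``MiniStream`` below is hypothetical and is not presented as Genshi's
``Stream`` implementation. It only assumes, as described above, that ``|``
applies a single callable and that ``filter()`` applies its arguments from
left to right. The ``drop_comments()`` filter and the string event kinds are
likewise made up for the example.

.. code-block:: python

  from functools import reduce

  class MiniStream(object):
      """Hypothetical stand-in for a stream that supports filter() and |."""

      def __init__(self, events):
          self.events = events

      def __iter__(self):
          return iter(self.events)

      def __or__(self, function):
          # stream | f  applies f to the stream and wraps the result
          return MiniStream(function(self))

      def filter(self, *filters):
          # stream.filter(f, g)  is the same as  stream | f | g
          return reduce(lambda stream, f: stream | f, filters, self)

  def noop(stream):
      for event in stream:
          yield event

  def drop_comments(stream):
      for kind, data, pos in stream:
          if kind != 'COMMENT':
              yield kind, data, pos

  POS = (None, 1, 0)  # placeholder position
  events = [('TEXT', 'a', POS), ('COMMENT', 'x', POS), ('TEXT', 'b', POS)]

  via_method = list(MiniStream(events).filter(noop, drop_comments))
  via_pipe = list(MiniStream(events) | noop | drop_comments)
  # Both lists contain only the two TEXT events.
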

For more information about the built-in filters, see `Stream Filters`_.

.. _`Stream Filters`: filters.html

Serialization
=============
Serialization means producing some kind of textual output from a stream of
events, which you'll need when you want to transmit or store the results of
generating or otherwise processing markup.

The ``Stream`` class provides two methods for serialization: ``serialize()``
and ``render()``. The former is a generator that yields chunks of ``Markup``
objects (which are basically unicode strings that are considered safe for
output on the web). The latter returns a single string, by default UTF-8
encoded.

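Here's the output from ``serialize()``:

.. code-block:: pycon

  >>> for output in stream.serialize():
  ...     print `output`
  ...
  <Markup u'<p class="intro">'>
  <Markup u'Some text and '>
  <Markup u'<a href="http://example.org/">'>
  <Markup u'a link'>
  <Markup u'</a>'>
  <Markup u'.'>
  <Markup u'<br/>'>
  <Markup u'</p>'>
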
And here's the output from ``render()``:

.. code-block:: pycon

  >>> print stream.render()
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br/></p>

Both methods can be passed a ``method`` parameter that determines how exactly
the events are serialized to text. This parameter can be either a string or a
custom serializer class:

.. code-block:: pycon

  >>> print stream.render('html')
  <p class="intro">Some text and <a href="http://example.org/">a link</a>.<br></p>

Note how the ``<br>`` element isn't closed, which is the right thing to do for
HTML. See `serialization methods`_ for more details.

In addition, the ``render()`` method takes an ``encoding`` parameter, which
defaults to “UTF-8”. If set to ``None``, the result will be a unicode string.
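
The relationship between the two methods can be sketched in a few lines. The
``serialize()`` and ``render()`` functions below are hypothetical
simplifications written for this example (no escaping, no namespaces, no
``method`` parameter) and operate on hand-written tuples rather than a Genshi
stream. They only illustrate that one yields chunks while the other joins them
and optionally encodes the result. The sketch targets Python 3, where text and
bytes are distinct types.

.. code-block:: python

  def serialize(events):
      """Yield one text chunk per event (simplified: no escaping)."""
      for kind, data, pos in events:
          if kind == 'START':
              name, attrs = data
              yield '<%s%s>' % (name, ''.join(' %s="%s"' % a for a in attrs))
          elif kind == 'END':
              yield '</%s>' % data
          elif kind == 'TEXT':
              yield data

  def render(events, encoding='utf-8'):
      """Join the chunks; encode the result unless encoding is None."""
      text = ''.join(serialize(events))
      if encoding is None:
          return text
      return text.encode(encoding)

  POS = (None, 1, 0)  # placeholder position
  events = [
      ('START', ('a', [('href', 'http://example.org/')]), POS),
      ('TEXT', 'caf\u00e9', POS),
      ('END', 'a', POS),
  ]

  chunks = list(serialize(events))   # three chunks: start tag, text, end tag
  as_bytes = render(events)          # UTF-8 encoded bytes
  as_text = render(events, None)     # unencoded text
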

The different serializer classes in ``genshi.output`` can also be used
directly:

.. code-block:: pycon

  >>> from genshi.filters import HTMLSanitizer
  >>> from genshi.output import TextSerializer
  >>> print ''.join(TextSerializer()(HTMLSanitizer()(stream)))
  Some text and a link.

The pipe operator allows a nicer syntax:

.. code-block:: pycon

  >>> print stream | HTMLSanitizer() | TextSerializer()
  Some text and a link.


.. _`serialization methods`:

Serialization Methods
---------------------

Genshi supports several serialization methods for creating a text
representation of a markup stream.

``xml``
  The ``XMLSerializer`` is the default serialization method and results in
  proper XML output including namespace support, the XML declaration, CDATA
  sections, and so on. It is generally not suitable for serving HTML or
  XHTML web pages (unless you want to use true XHTML 1.1), for which the
  ``xhtml`` and ``html`` serializers described below should be preferred.

``xhtml``
  The ``XHTMLSerializer`` is a specialization of the generic ``XMLSerializer``
  that understands the peculiarities of producing XML-compliant output that can
  also be parsed without problems by the HTML parsers found in modern web
  browsers. Thus, the output by this serializer should be usable whether sent
  as "text/html" or "application/xhtml+xml" (although there are a lot of
  subtle issues to pay attention to when switching between the two, in
  particular with respect to differences in the DOM and CSS).

  For example, instead of rendering a script tag as ``<script/>`` (which
  confuses the HTML parser in many browsers), it will produce
  ``<script></script>``. Also, it will normalize any boolean attribute values
  that are minimized in HTML, so that for example ``<hr noshade="1"/>``
  becomes ``<hr noshade="noshade" />``.

  This serializer supports the use of namespaces for compound documents, for
  example to use inline SVG inside an XHTML document.

``html``
  The ``HTMLSerializer`` produces proper HTML markup. The main differences
  compared to ``xhtml`` serialization are that boolean attributes are
  minimized, empty tags are not self-closing (so it's ``<br>`` instead of
  ``<br />``), and that the contents of ``