# -*- coding: utf-8 -*-
<%inherit file="content_layout.html"/>
<%page args="toc, extension, paged"/>
<%namespace name="formatting" file="formatting.html"/>
<%namespace name="nav" file="nav.html"/>
<%def name="title()">Mako Documentation - The Unicode Chapter</%def>
<%!
    filename = 'unicode'
%>
## This file is generated. Edit the .txt files instead of this one.
<%call expr="formatting.section(path='unicode',paged=paged,extension=extension,toc=toc)">

The Python language supports two ways of representing what we know as "strings", i.e. series of characters. In Python 2, the two types are <%text filter='h'>string and <%text filter='h'>unicode, and in Python 3 they are <%text filter='h'>bytes and <%text filter='h'>string. A key aspect of the Python 2 <%text filter='h'>string and Python 3 <%text filter='h'>bytes types is that they contain no information regarding what encoding the data is stored in. For this reason they were commonly referred to as byte strings in Python 2, and Python 3 makes this name explicit. The origins of this come from Python's background of being developed before the Unicode standard was even available, back when strings were C-style strings and were just that, a series of bytes. Strings that had only values below 128 just happened to be ascii strings and were printable on the console, whereas strings with values above 128 would produce all kinds of graphical characters and bells.
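
To make the distinction concrete, here's how those literals line up at the interpreter prompt; this is standard Python behaviour, shown for both versions:

<%call expr="formatting.code(syntaxtype='python')"><%text>
    # Python 2: plain literals are bytestrings, u"" literals are unicode
    type("hello world")     # <type 'str'>      - a series of bytes
    type(u"hello world")    # <type 'unicode'>  - decoded characters

    # Python 3: plain literals are unicode, b"" literals are bytes
    type("hello world")     # <class 'str'>     - decoded characters
    type(b"hello world")    # <class 'bytes'>   - a series of bytes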

Contrast the "bytestring" types with the "unicode/string" type. Objects of this type are created whenever you say something like <%text filter='h'>u"hello world" (or in Python 3, just <%text filter='h'>"hello world"). In this case, Python represents each character in the string internally using multiple bytes per character (something similar to UTF-16). Whats important is that when using the <%text filter='h'>unicode/<%text filter='h'>string type to store strings, Python knows the data's encoding; its in its own internal format. Whereas when using the <%text filter='h'>string/<%text filter='h'>bytes type, it does not.

When Python 2 attempts to treat a byte-string as a string, which means it's attempting to compare/parse its characters, to coerce it into another encoding, or to decode it to a unicode object, it has to guess what the encoding is. In this case, it will pretty much always guess the encoding as <%text filter='h'>ascii... and if the bytestring contains bytes above value 128, you'll get an error. Python 3 eliminates much of this confusion by simply raising an error unconditionally if a bytestring is used in a character-aware context.
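
As a quick illustration (standard interpreter behaviour, not anything specific to Mako), mixing the two types shows both failure modes:

<%call expr="formatting.code(syntaxtype='python')"><%text>
    # Python 2: the bytestring is implicitly decoded as ascii, and the
    # byte 0xe2 (above 128) makes that guess fail with a UnicodeDecodeError
    u"voix " + "m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9"

    # Python 3: bytes are never silently treated as characters; the same
    # kind of mixing raises a TypeError unconditionally
    "voix " + b"m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9"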

There is one operation that Python can do with a non-ascii bytestring, and it's a great source of confusion: it can dump the bytestring straight out to a stream or a file, with nary a care what the encoding is. To Python, this is pretty much like dumping any other kind of binary data (like an image) to a stream somewhere. In Python 2, it is common to see programs that embed all kinds of international characters and encodings into plain byte-strings (i.e. using <%text filter='h'>"hello world" style literals) fly right through their run, sending reams of strings out to wherever they are going, and the programmer, seeing the same output as was expressed in the input, is now under the illusion that his or her program is Unicode-compliant. In fact, such a program has no unicode awareness whatsoever, and similarly has no ability to interact with libraries that are unicode aware. Python 3 makes this much less likely by defaulting to unicode as the storage format for strings.
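
For example, a utf-8 encoded bytestring can be written to a binary stream without Python ever consulting an encoding; the file name below is only for illustration:

<%call expr="formatting.code(syntaxtype='python')"><%text>
    # Python 2: the bytes pass straight through, exactly as given;
    # Python never needs to know (or check) that they happen to be utf-8
    data = "dr\xc3\xb4le de petit voix"
    with open('output.txt', 'wb') as f:
        f.write(data)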

The "pass through encoded data" scheme is what template languages like Cheetah and earlier versions of Myghty do by default. Mako as of version 0.2 also supports this mode of operation when using Python 2, using the "disable_unicode=True" flag. However, when using Mako in its default mode of unicode-aware, it requires explicitness when dealing with non-ascii encodings. Additionally, if you ever need to handle unicode strings and other kinds of encoding conversions more intelligently, the usage of raw bytestrings quickly becomes a nightmare, since you are sending the Python interpreter collections of bytes for which it can make no intelligent decisions with regards to encoding. In Python 3 Mako only allows usage of native, unicode strings.

In normal Mako operation, all parsed template constructs and output streams are handled internally as Python <%text filter='h'>unicode objects. It's only at the point of <%text filter='h'>render() that this unicode stream may be rendered into whatever the desired output encoding is. The implication here is that the template developer must ensure that the encoding of all non-ascii templates is explicit (still required in Python 3), that all non-ascii-encoded expressions are in one way or another converted to unicode (not much of a burden in Python 3), and that the output stream of the template is handled as a unicode stream being encoded to some encoding (still required in Python 3).

<%call expr="formatting.section(path='unicode_specifying',paged=paged,extension=extension,toc=toc)">

This is the most basic encoding-related setting, and it is equivalent to Python's "magic encoding comment", as described in pep-0263. Any template that contains non-ascii characters requires that this comment be present so that Mako can decode to unicode (and also make usage of Python's AST parsing services). Mako's lexer will use this encoding in order to convert the template source into a <%text filter='h'>unicode object before continuing its parsing:

<%call expr="formatting.code()"><%text>## -*- coding: utf-8 -*- Alors vous imaginez ma surprise, au lever du jour, quand une drôle de petit voix m’a réveillé. Elle disait: « S’il vous plaît… dessine-moi un mouton! »

For the picky, the regular expression used is derived from that of the above-mentioned pep:

<%call expr="formatting.code(syntaxtype='python')"><%text> #.*coding[:=]\s*([-\w.]+).*\n

The lexer will convert to unicode in all cases, so that if any characters exist in the template that are outside of the specified encoding (or the default of <%text filter='h'>ascii), the error will be immediate.

As an alternative, the template encoding can be specified programmatically to either <%text filter='h'>Template or <%text filter='h'>TemplateLookup via the <%text filter='h'>input_encoding parameter:

<%call expr="formatting.code(syntaxtype='python')"><%text> t = TemplateLookup(directories=['./'], input_encoding='utf-8')

The above will assume all located templates specify <%text filter='h'>utf-8 encoding, unless the template itself contains its own magic encoding comment, which takes precedence.
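
So a template loaded through the lookup above could still declare a different encoding on its first line, and that declaration wins; hypothetical template contents shown:

<%call expr="formatting.code()"><%text>
    ## -*- coding: iso-8859-1 -*-
    ## this template is decoded as iso-8859-1, even though the
    ## TemplateLookup above was constructed with input_encoding='utf-8'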

<%call expr="formatting.section(path='unicode_handling',paged=paged,extension=extension,toc=toc)">

The next area that encoding comes into play is in expression constructs. By default, Mako's treatment of an expression like this:

<%call expr="formatting.code()"><%text>${"hello world"}

looks something like this:

<%call expr="formatting.code(syntaxtype='python')"><%text> context.write(unicode("hello world"))

In Python 3, it's just:

<%call expr="formatting.code(syntaxtype='python')"><%text> context.write(str("hello world"))

That is, the output of all expressions is run through the <%text filter='h'>unicode builtin. This is the default setting, and can be modified to expect various encodings. The <%text filter='h'>unicode step serves both the purpose of rendering non-string expressions into strings (such as integers or objects which contain <%text filter='h'>__str__() methods), and to ensure that the final output stream is constructed as a unicode object. The main implication of this is that any raw bytestrings that contain an encoding other than ascii must first be decoded to a Python unicode object. It means you can't say this in Python 2:

<%call expr="formatting.code()"><%text>${"voix m’a réveillé."} ## error in Python 2!

You must instead say this:

<%call expr="formatting.code()"><%text>${u"voix m’a réveillé."} ## OK !

Similarly, if you are reading data from a file that is streaming bytes, or returning data from some object that is returning a Python bytestring containing a non-ascii encoding, you have to explicitly decode to unicode first, such as:

<%call expr="formatting.code()"><%text>${call_my_object().decode('utf-8')}

Note that filehandles acquired by <%text filter='h'>open() in Python 3 default to returning "text", that is, the decoding is done for you. See Python 3's documentation for the <%text filter='h'>open() builtin for details on this.
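
For example, assuming a utf-8 encoded file named chapter.txt (the name is only for illustration), the decode step looks like this in each version:

<%call expr="formatting.code(syntaxtype='python')"><%text>
    # Python 3: text mode decodes for you; f.read() is already a str
    with open('chapter.txt', encoding='utf-8') as f:
        data = f.read()

    # Python 2: use codecs.open() (or decode manually) to get unicode
    import codecs
    with codecs.open('chapter.txt', encoding='utf-8') as f:
        data = f.read()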

If you want a certain encoding applied to all expressions, override the <%text filter='h'>unicode builtin with the <%text filter='h'>decode builtin at the <%text filter='h'>Template or <%text filter='h'>TemplateLookup level:

<%call expr="formatting.code(syntaxtype='python')"><%text> t = Template(templatetext, default_filters=['decode.utf8'])

Note that the built-in <%text filter='h'>decode object is slower than the <%text filter='h'>unicode function, since unlike <%text filter='h'>unicode it's not a Python builtin, and it also checks the type of the incoming data to determine if string conversion is needed first.

The <%text filter='h'>default_filters argument can be used to entirely customize the filtering process of expressions. This argument is described in <%call expr="nav.toclink(path='filtering_expression_defaultfilters',paged=paged,extension=extension,toc=toc)">.

<%call expr="formatting.section(path='unicode_defining',paged=paged,extension=extension,toc=toc)">

Now that we have a template which produces a pure unicode output stream, all the hard work is done. We can take the output and do anything with it.

As stated in the "Usage" chapter, both <%text filter='h'>Template and <%text filter='h'>TemplateLookup accept <%text filter='h'>output_encoding and <%text filter='h'>encoding_errors parameters which can be used to encode the output in any Python supported codec:

<%call expr="formatting.code(syntaxtype='python')"><%text> from mako.template import Template from mako.lookup import TemplateLookup mylookup = TemplateLookup(directories=['/docs'], output_encoding='utf-8', encoding_errors='replace') mytemplate = mylookup.get_template("foo.txt") print mytemplate.render()

<%text filter='h'>render() will return a <%text filter='h'>bytes object in Python 3 if an output encoding is specified. By default it performs no encoding and returns a native string.
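
A minimal sketch of that behaviour under Python 3, using a trivial inline template purely for illustration:

<%call expr="formatting.code(syntaxtype='python')"><%text>
    from mako.template import Template

    # output_encoding given: render() returns encoded bytes
    t = Template(u"drôle de petit voix", output_encoding='utf-8')
    assert isinstance(t.render(), bytes)

    # no output_encoding: render() returns a native (unicode) string
    t = Template(u"drôle de petit voix")
    assert isinstance(t.render(), str)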

<%text filter='h'>render_unicode() will return the template output as a Python <%text filter='h'>unicode object (or <%text filter='h'>string in Python 3):

<%call expr="formatting.code(syntaxtype='python')"><%text> print mytemplate.render_unicode()

The above method discards the output encoding keyword argument; you can encode yourself by saying:

<%call expr="formatting.code(syntaxtype='python')"><%text> print mytemplate.render_unicode().encode('utf-8', 'replace') <%call expr="formatting.section(path='unicode_defining_buffer',paged=paged,extension=extension,toc=toc)">

Mako does play some games with the style of buffering used internally, to maximize performance. Since the buffer is by far the most heavily used object in a render operation, it's important!

When calling <%text filter='h'>render() on a template that does not specify any output encoding (i.e. it's <%text filter='h'>ascii), Python's <%text filter='h'>cStringIO module, which cannot handle encoding of non-ascii <%text filter='h'>unicode objects (even though it can send raw bytestrings through), is used for buffering. Otherwise, a custom Mako class called <%text filter='h'>FastEncodingBuffer is used, which essentially is a super dumbed-down version of <%text filter='h'>StringIO that gathers all strings into a list and uses <%text filter='h'>u''.join(elements) to produce the final output - it's markedly faster than <%text filter='h'>StringIO.
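
The idea can be sketched in a few lines; note this is only an illustrative toy, not Mako's actual FastEncodingBuffer:

<%call expr="formatting.code(syntaxtype='python')"><%text>
    # Toy sketch of the concept only - not Mako's real implementation.
    # Fragments are appended to a list, and the join/encode happens
    # exactly once, when the final value is requested.
    class ToyEncodingBuffer(object):
        def __init__(self, encoding=None, errors='strict'):
            self.data = []
            self.encoding = encoding
            self.errors = errors

        def write(self, text):
            self.data.append(text)

        def getvalue(self):
            output = u''.join(self.data)
            if self.encoding:
                return output.encode(self.encoding, self.errors)
            return output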

<%call expr="formatting.section(path='unicode_saying',paged=paged,extension=extension,toc=toc)">

Some segments of Mako's userbase choose to make no usage of Unicode whatsoever, and instead would prefer the "passthru" approach; all string expressions in their templates return encoded bytestrings, and they would like these strings to pass right through. The only advantage to this approach is that templates need not use <%text filter='h'>u"" for literal strings; there's an arguable speed improvement as well, since raw bytestrings generally perform slightly faster than unicode objects in Python. For these users, assuming they're sticking with Python 2, they can hit the <%text filter='h'>disable_unicode=True flag like so:

<%call expr="formatting.code(syntaxtype=u'python')"><%text> # -*- encoding:utf-8 -*- from mako.template import Template t = Template("drôle de petit voix m’a réveillé.", disable_unicode=True, input_encoding='utf-8') print t.code

The <%text filter='h'>disable_unicode mode is strictly a Python 2 thing. It is not supported at all in Python 3.

The generated module source code will contain elements like these:

<%call expr="formatting.code(syntaxtype='python')"><%text> # -*- encoding:utf-8 -*- # ...more generated code ... def render_body(context,**pageargs): context.caller_stack.push_frame() try: __M_locals = dict(pageargs=pageargs) # SOURCE LINE 1 context.write('dr\xc3\xb4le de petit voix m\xe2\x80\x99a r\xc3\xa9veill\xc3\xa9.') return '' finally: context.caller_stack.pop_frame()

Above, the string literal used within <%text filter='h'>context.write is a regular bytestring.

When <%text filter='h'>disable_unicode=True is turned on, the <%text filter='h'>default_filters argument which normally defaults to <%text filter='h'>["unicode"] now defaults to <%text filter='h'>["str"] instead. Setting <%text filter='h'>default_filters to the empty list <%text filter='h'>[] can remove the overhead of the <%text filter='h'>str call. Also, in this mode you cannot safely call <%text filter='h'>render_unicode() - you'll get unicode/decode errors.
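
For example (Python 2 only; templatetext stands in for your pre-encoded template source):

<%call expr="formatting.code(syntaxtype='python')"><%text>
    from mako.template import Template

    # Python 2 only: bytestrings pass through with no str() call at all
    t = Template(templatetext,
                 disable_unicode=True,
                 input_encoding='utf-8',
                 default_filters=[])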

The <%text filter='h'>h filter (html escape) uses a less performant pure Python escape function in non-unicode mode (note that in versions prior to 0.3.4, it used cgi.escape(), which has been replaced with a function that also escapes single quotes). This is because MarkupSafe only supports Python unicode objects for non-ascii strings.

Rules for using disable_unicode=True