=head1 NAME
perluniintro - Perl Unicode introduction
=head1 DESCRIPTION
This document gives a general idea of Unicode and how to use Unicode
in Perl.
=head2 Unicode
Unicode is a character set standard which plans to codify all of the
writing systems of the world, plus many other symbols.
Unicode and ISO/IEC 10646 are coordinated standards that provide code
points for characters in almost all modern character set standards,
covering more than 30 writing systems and hundreds of languages,
including all commercially-important modern languages. All characters
in the largest Chinese, Japanese, and Korean dictionaries are also
encoded. The standards will eventually cover almost all characters in
more than 250 writing systems and thousands of languages.
Unicode 1.0 was released in October 1991, and 4.0 in April 2003.
A Unicode I<character> is an abstract entity. It is not bound to any
particular integer width, especially not to the C language C<char>.
Unicode is language-neutral and display-neutral: it does not encode the
language of the text and it does not generally define fonts or other graphical
layout details. Unicode operates on characters and on text built from
those characters.
Unicode defines characters like C<LATIN CAPITAL LETTER A> or C<GREEK
SMALL LETTER ALPHA> and unique numbers for the characters, in this
case 0x0041 and 0x03B1, respectively. These unique numbers are called
I<code points>.
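As a minimal sketch of the character/code point relationship, Perl's built-in C<ord> and C<chr> convert between the two. The variable names are only illustrative.

```perl
use strict;
use warnings;

# ord() maps a character to its code point; chr() goes the other way.
my $cp_a     = ord("A");        # code point of LATIN CAPITAL LETTER A
my $cp_alpha = 0x03B1;          # code point of GREEK SMALL LETTER ALPHA
my $alpha    = chr($cp_alpha);  # a one-character string holding that code point

printf "A is 0x%04X\n", $cp_a;
printf "alpha round-trips to 0x%04X\n", ord($alpha);
```

Note that C<$alpha> is a single character, even though it cannot fit in a single byte.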
The Unicode standard prefers using hexadecimal notation for the code
points. If numbers like C<0x0041> are unfamiliar to you, take a peek
at a later section, L</"Hexadecimal Notation">. The Unicode standard
uses the notation C<U+0041 LATIN CAPITAL LETTER A>, to give the
hexadecimal code point and the normative name of the character.
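The notation can be reproduced from Perl, assuming the core C<charnames> module is available. C<describe_code_point> is a hypothetical helper written for this sketch, not part of any module.

```perl
use strict;
use warnings;
use charnames ();   # load the name tables; nothing is imported

# Hypothetical helper: render a code point in the "U+XXXX NAME" style.
sub describe_code_point {
    my ($cp) = @_;
    my $name = charnames::viacode($cp);          # undef if the code point has no name
    $name = '<no name>' unless defined $name;
    return sprintf("U+%04X %s", $cp, $name);
}

print describe_code_point(0x0041), "\n";
print describe_code_point(0x03B1), "\n";
```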
Unicode also defines various I<properties> for the characters, like
"uppercase" or "lowercase", "decimal digit", or "punctuation";
these properties are independent of the names of the characters.
Furthermore, various operations on the characters like uppercasing,
lowercasing, and collating (sorting) are defined.
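Properties and case operations can be probed from Perl roughly as follows. This is a sketch using the C<\p{...}> regular expression construct with the general category abbreviations C<Ll>, C<Lu>, and C<Nd>; the variable names are illustrative.

```perl
use strict;
use warnings;

my $alpha = chr(0x03B1);   # GREEK SMALL LETTER ALPHA
my $upper = uc($alpha);    # case mapping comes from Unicode data, not from the name

my $is_lower = $alpha       =~ /^\p{Ll}$/ ? 1 : 0;   # Ll = lowercase letter
my $is_upper = $upper       =~ /^\p{Lu}$/ ? 1 : 0;   # Lu = uppercase letter
my $is_digit = chr(0x0663)  =~ /^\p{Nd}$/ ? 1 : 0;   # ARABIC-INDIC DIGIT THREE

printf "uc(U+%04X) = U+%04X\n", ord($alpha), ord($upper);
```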
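A Unicode character consists either of a single code point, or a
I<base character> (like C<LATIN CAPITAL LETTER A>), followed by one or
more I<modifiers> (like C<COMBINING ACUTE ACCENT>). This sequence of
base character and modifiers is called a I<combining character
sequence>.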
Whether to call these combining character sequences "characters"
depends on your point of view. If you are a programmer, you probably
would tend towards seeing each element in the sequences as one unit,
or "character". The whole sequence could be seen as one "character",
however, from the user's point of view, since that's probably what it
looks like in the context of the user's language.
With this "whole sequence" view of characters, the total number of
characters is open-ended. But in the programmer's "one unit is one
character" point of view, the concept of "characters" is more
deterministic. In this document, we take that second point of view:
one "character" is one Unicode code point, be it a base character or
a combining character.
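The two points of view can be compared in Perl. In this sketch, C<length> counts code points (the programmer's view), while the C<\X> regular expression construct matches a whole base-plus-modifiers cluster (closer to the user's view).

```perl
use strict;
use warnings;

# LATIN CAPITAL LETTER A followed by COMBINING ACUTE ACCENT
my $seq = "A" . chr(0x0301);

my $code_points = length($seq);      # counts code points
my @clusters    = $seq =~ /(\X)/g;   # each \X match is one user-visible unit

printf "%d code points, %d cluster(s)\n", $code_points, scalar @clusters;
```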
For some combinations, there are I<precomposed> characters.
C<LATIN CAPITAL LETTER A WITH ACUTE>, for example, is defined as
a single code point. These precomposed characters are, however,
only available for some combinations, and are mainly
meant to support round-trip conversions between Unicode and legacy
standards (like the ISO 8859). In the general case, the composing
method is more extensible. To support conversion between
different compositions of the characters, various I<normalization
forms> to standardize representations are also defined.
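A short sketch of converting between the composed and decomposed representations, assuming the core C<Unicode::Normalize> module and its C<NFC> (compose) and C<NFD> (decompose) functions:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $decomposed  = "A" . chr(0x0301);   # base character + COMBINING ACUTE ACCENT
my $precomposed = chr(0x00C1);         # LATIN CAPITAL LETTER A WITH ACUTE

my $composed_form   = NFC($decomposed);    # compose where a precomposed form exists
my $decomposed_form = NFD($precomposed);   # split back into base + modifier

# Without normalization the two strings differ code point by code point.
my $same_raw       = ($decomposed      eq $precomposed) ? 1 : 0;
my $same_after_nfc = ($composed_form   eq $precomposed) ? 1 : 0;
my $same_after_nfd = ($decomposed_form eq $decomposed)  ? 1 : 0;

printf "raw: %d, after NFC: %d, after NFD: %d\n",
    $same_raw, $same_after_nfc, $same_after_nfd;
```

Normalizing both sides to the same form before comparing is the usual way to treat such strings as equal.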
Because of backward compatibility with legacy encodings, the "a unique
number for every character" idea breaks down a bit: instead, there is
"at least one number for every character". The same character could
be represented differently in several legacy encodings. The converse
does not hold either: some code points do not have an assigned
character. Firstly, there are unallocated code points within
otherwise used blocks. Secondly, there are special Unicode control
characters that do not represent true characters.
A common myth about Unicode is that it would be "16-bit", that is,
Unicode is only represented as C<0x10000> (or 65536) characters from
C<0x0000> to C<0xFFFF>. B<This is untrue.> Since Unicode 2.0 (July
1996), Unicode has been defined all the way up to 21 bits (C<0x10FFFF>),
and since Unicode 3.1 (March 2001), characters have been defined
beyond C<0xFFFF>. The first C<0x10000> characters are called
I<Plane 0>, or the I<Basic Multilingual Plane> (BMP). With Unicode
3.1, 17 (yes, seventeen) planes in all were defined--but they are
nowhere near full of defined characters, yet.
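Since each plane holds C<0x10000> code points, the plane number is simply the code point shifted right by 16 bits. C<plane_of> below is a hypothetical helper for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: planes are numbered 0 (the BMP) through 16.
sub plane_of {
    my ($cp) = @_;
    die "not a Unicode code point\n" if $cp < 0 || $cp > 0x10FFFF;
    return $cp >> 16;
}

my $clef = chr(0x1D11E);   # MUSICAL SYMBOL G CLEF, beyond 0xFFFF

printf "U+%04X is in plane %d\n", 0x0041,   plane_of(0x0041);
printf "U+%04X is in plane %d\n", 0x1D11E,  plane_of(0x1D11E);
printf "U+%04X is in plane %d\n", 0x10FFFF, plane_of(0x10FFFF);
```

A character outside the BMP is still one character to Perl: C<length($clef)> is 1.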
Another myth is that the 256-character blocks have something to
do with languages--that each block would define the characters used
by a language or a set of languages. B<This is also untrue.>
The division into blocks exists, but it is almost completely
accidental--an artifact of how the characters have been and
still are allocated. Instead, there is a concept called I<scripts>,
which is more useful: there is C<Latin> script, C<Greek> script, and
so on. Scripts usually span varied parts of several blocks.
For further information see L<Unicode::UCD>.
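Script membership can be tested with C<\p{...}> in a regular expression. The sketch below picks two characters that live in different blocks (Greek and Coptic, and Greek Extended) but belong to the same script; the variable names are illustrative.

```perl
use strict;
use warnings;

my $alpha       = chr(0x03B1);   # GREEK SMALL LETTER ALPHA (block: Greek and Coptic)
my $alpha_psili = chr(0x1F00);   # GREEK SMALL LETTER ALPHA WITH PSILI (block: Greek Extended)

# Both characters match the Greek script although their blocks differ.
my $greek_count = grep { /^\p{Greek}$/ } ($alpha, $alpha_psili);

my $a_is_latin = "A" =~ /^\p{Latin}$/ ? 1 : 0;
my $a_is_greek = "A" =~ /^\p{Greek}$/ ? 1 : 0;

printf "Greek matches: %d, A is Latin: %d, A is Greek: %d\n",
    $greek_count, $a_is_latin, $a_is_greek;
```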
The Unicode code points are just abstract numbers. To input and
output these abstract numbers, the numbers must be I<encoded> or
I<serialised> somehow. Unicode defines several I<character encoding
forms>, of which I<UTF-8> is perhaps the most popular. UTF-8 is a
variable length encoding that encodes Unicode characters as 1 to 4
bytes (the original UTF-8 design allowed up to 6, but every code point
up to C<0x10FFFF> fits in 4). Other encodings
include UTF-16 and UTF-32 and their big- and little-endian variants
(UTF-8 is byte-order independent). ISO/IEC 10646 defines the UCS-2
and UCS-4 encoding forms.
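To make the variable length concrete, here is a sketch that encodes a few code points with the core C<Encode> module and counts the resulting bytes. The sample characters are arbitrary choices for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my @samples = (
    [ 0x0041,  'LATIN CAPITAL LETTER A'   ],
    [ 0x03B1,  'GREEK SMALL LETTER ALPHA' ],
    [ 0x20AC,  'EURO SIGN'                ],
    [ 0x1D11E, 'MUSICAL SYMBOL G CLEF'    ],
);

my %utf8_length;   # code point => number of UTF-8 bytes
for my $sample (@samples) {
    my ($cp, $name) = @$sample;
    my $bytes = encode('UTF-8', chr($cp));   # a string of bytes, not characters
    $utf8_length{$cp} = length($bytes);
    printf "U+%04X %-26s %d byte(s) in UTF-8\n", $cp, $name, length($bytes);
}

# UTF-16 depends on byte order; UTF-8 does not.
my $utf16be = encode('UTF-16BE', 'A');
my $utf16le = encode('UTF-16LE', 'A');
```

The two UTF-16 results hold the same two bytes in opposite order, which is why that encoding needs explicit big- and little-endian variants.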
For more information about encodings--for instance, to learn what
I<surrogates> and I<byte order marks> (BOMs) are--see L<perlunicode>.
=head2 Perl's Unicode Support
Starting from Perl 5.6.0, Perl has had the capacity to handle Unicode
natively. Perl 5.8.0, however, is the first recommended release for
serious Unicode work. The maintenance release 5.6.1 fixed many of the
problems of the initial Unicode implementation, but for example
regular expressions still do not work with Unicode in 5.6.1.
B<Starting from Perl 5.8.0, the use of C<use utf8> is needed only in
much more restricted circumstances.> In earlier releases the C<utf8>
pragma was used to declare
that operations in the current block or file would be Unicode-aware.
This model was found to be wrong, or at least clumsy: the "Unicodeness"
is now carried with the data, instead of being attached to the
operations. Only one case remains where an explicit C<use utf8>