, C and C.
Perl recognizes the following POSIX character classes:
 alpha   Any alphabetical character.
 alnum   Any alphanumerical character.
 ascii   Any ASCII character.
 blank   A GNU extension, equal to a space or a horizontal tab ("\t").
 cntrl   Any control character.
 digit   Any digit, equivalent to "\d".
 graph   Any printable character, excluding a space.
 lower   Any lowercase character.
 print   Any printable character, including a space.
 punct   Any punctuation character.
 space   Any white space character. "\s" plus the vertical tab ("\cK").
 upper   Any uppercase character.
 word    Any "word" character, equivalent to "\w".
 xdigit  Any hexadecimal digit, '0' - '9', 'a' - 'f', 'A' - 'F'.
The exact set of characters matched depends on whether the source string
is internally in UTF-8 format or not. See L</Locale, Unicode and UTF-8>.
Most POSIX character classes have C<\p> counterparts. The difference
is that the C<\p> classes always match according to the Unicode
properties, regardless of whether the string is in UTF-8 format.
The following table shows the relation between POSIX character classes
and the Unicode properties:
 [[:...:]]   \p{...}      backslash
 alpha       IsAlpha
 alnum       IsAlnum
 ascii       IsASCII
 blank
 cntrl       IsCntrl
 digit       IsDigit      \d
 graph       IsGraph
 lower       IsLower
 print       IsPrint
 punct       IsPunct
 space       IsSpace
             IsSpacePerl  \s
 upper       IsUpper
 word        IsWord
 xdigit      IsXDigit
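As a rough illustration of one row of the table, the following sketch checks that the three notations for digits agree on plain ASCII input, where neither the internal encoding nor the locale matters. The sample string and variable names are invented for this example.

```perl
use strict;
use warnings;

# ASCII-only sample (an assumption of this sketch), so the result
# does not depend on the string's internal encoding or the locale.
my $sample = "Room 42b";

# The "digit" row of the table, written in all three notations.
my @posix     = $sample =~ /[[:digit:]]/g;
my @property  = $sample =~ /\p{IsDigit}/g;
my @backslash = $sample =~ /\d/g;

print "POSIX:     @posix\n";       # 4 2
print "property:  @property\n";    # 4 2
print "backslash: @backslash\n";   # 4 2
```

Note that the POSIX class has to sit inside a bracketed character class, hence the doubled brackets.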
Some character classes may have a non-obvious name:
=over 4
=item cntrl
Any control character. Usually, control characters don't produce output
as such, but instead control the terminal somehow: for example, newline
and backspace are control characters. All characters with an C<ord()>
value less than 32 are usually classified as control characters (in
ASCII, the ISO Latin character sets, and Unicode), as is the character
with an C<ord()> value of 127 (C<DEL>).
=item graph
Any character that is I<graphical>, that is, visible. This class consists
of all the alphanumerical characters and all punctuation characters.
=item print
All printable characters, which is the set of all the graphical characters
plus the space.
=item punct
Any punctuation (special) character.
=back
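The relationship between C<cntrl>, C<graph> and C<print> can be checked on a few ASCII characters. This is a minimal sketch; C<classes_of> is a hypothetical helper written for this example, not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the classes (out of
# cntrl, graph and print) that match a single character.
sub classes_of {
    my ($char) = @_;
    my @names;
    push @names, 'cntrl' if $char =~ /\A[[:cntrl:]]\z/;
    push @names, 'graph' if $char =~ /\A[[:graph:]]\z/;
    push @names, 'print' if $char =~ /\A[[:print:]]\z/;
    return "@names";
}

print classes_of("\n"),     "\n";   # cntrl
print classes_of(chr(127)), "\n";   # cntrl        (DEL)
print classes_of("a"),      "\n";   # graph print
print classes_of("!"),      "\n";   # graph print
print classes_of(" "),      "\n";   # print        (visible width, no glyph)
```

The space is the only character in this sample that is printable without being graphical.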
=head4 Negation
A Perl extension to the POSIX character class is the ability to
negate it. This is done by prefixing the class name with a caret (C<^>).
Some examples:
 POSIX         Unicode       Backslash
 [[:^digit:]]  \P{IsDigit}   \D
 [[:^space:]]  \P{IsSpace}   \S
 [[:^word:]]   \P{IsWord}    \W
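A short sketch of the negated forms in use, again on ASCII-only input so that all three notations agree. The sample string and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "a1 b2";

# Collect every character that is NOT a digit, in three notations.
my $posix     = join '', $text =~ /[[:^digit:]]/g;
my $property  = join '', $text =~ /\P{IsDigit}/g;
my $backslash = join '', $text =~ /\D/g;

print "[$posix] [$property] [$backslash]\n";   # [a b] [a b] [a b]
```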
=head4 [= =] and [. .]
Perl recognizes the syntax of the POSIX equivalence class C<[=class=]>
and the POSIX collating element C<[.class.]>, but does not (yet?) support
either construct. Using one of them leads to an error.
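To see the error without aborting a program, the pattern can be compiled at run time inside C<eval>. This is only a sketch: the two sample patterns are arbitrary, and the exact wording of the error message is deliberately not relied upon.

```perl
use strict;
use warnings;

# Compile each pattern at run time so that the compile error
# can be trapped by eval instead of ending the program.
my %error;
for my $pattern ('[[=a=]]', '[[.ch.]]') {
    my $ok = eval { my $re = qr/$pattern/; 1 };
    $error{$pattern} = $ok ? '' : $@;
}

for my $pattern (sort keys %error) {
    print "$pattern: rejected\n" if $error{$pattern};
}
```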
=head4 Examples
 /[[:digit:]]/            # Matches a character that is a digit.
 /[01[:lower:]]/          # Matches a character that is either a
                          # lowercase letter, or '0' or '1'.
 /[[:digit:][:^xdigit:]]/ # Matches any character except the letters
                          # 'a' to 'f' in either case. The class
                          # contains all digits plus anything that
                          # isn't a hex digit, which leaves out only
                          # the letters 'a' to 'f' and 'A' to 'F'.
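The claim about the last pattern can be checked by brute force over the ASCII range. The following is a small sketch; restricting the check to code points 0 to 127 is an assumption made to keep the result independent of encoding and locale.

```perl
use strict;
use warnings;

my $re = qr/[[:digit:][:^xdigit:]]/;

# Every ASCII character that the class does NOT match.
my @rejected = grep { $_ !~ $re } map { chr($_) } 0 .. 127;

print join('', @rejected), "\n";   # ABCDEFabcdef
```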
=head2 Locale, Unicode and UTF-8
Some of the character classes have a somewhat different behaviour depending
on the internal encoding of the source string, and the locale that is
in effect.
C<\w>, C<\d>, C<\s> and the POSIX character classes (and their negations,
including C<\W>, C<\D>, C<\S>) suffer from this behaviour.
The rule is that if the source string is in UTF-8 format, the character
classes match according to the Unicode properties. If the source string
isn't, then the character classes match according to whatever locale is
in effect. If there is no locale, they match the ASCII defaults
(52 letters, 10 digits and underscore for C<\w>, 0 to 9 for C<\d>, etc).
This usually means that if you are matching against characters whose
C<ord()> values are between 128 and 255 inclusive, your character class
may or may not match, depending on the current locale and on whether the
source string is in UTF-8 format. The string will be in UTF-8 format if
it contains characters whose C<ord()> value exceeds 255. But a string may
also be in UTF-8 format without containing any such characters.
For portability reasons, it may be better to not use C<\w>, C<\d>, C<\s>
or the POSIX character classes, and use the Unicode properties instead.
=head4 Examples
 $str =  "\xDF";      # $str is not in UTF-8 format.
 $str =~ /^\w/;       # No match, as $str isn't in UTF-8 format.
 $str .= "\x{0e0b}";  # Now $str is in UTF-8 format.
 $str =~ /^\w/;       # Match! $str is now in UTF-8 format.
 chop $str;
 $str =~ /^\w/;       # Still a match! $str remains in UTF-8 format.
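The internal format of a string can be inspected with C<utf8::is_utf8()> and changed with C<utf8::upgrade()>. The sketch below uses them to contrast C<\w> with a Unicode property on the same character. It assumes that no locale is in effect and that the default matching rules described above apply. The variable names are invented for this example.

```perl
use strict;
use warnings;

my $str  = "\xDF";          # LATIN SMALL LETTER SHARP S, not UTF-8 format
my $copy = $str;
utf8::upgrade($copy);       # same character, now in UTF-8 format

my $flag_before = utf8::is_utf8($str)  ? 1 : 0;   # 0
my $flag_after  = utf8::is_utf8($copy) ? 1 : 0;   # 1

# \p{...} matches by Unicode property in both formats.
my $prop_before = $str  =~ /^\p{IsAlpha}/ ? 1 : 0;   # 1
my $prop_after  = $copy =~ /^\p{IsAlpha}/ ? 1 : 0;   # 1

# \w follows the rule above, so its result may differ between the
# two formats (and between locales); no fixed output is claimed here.
my $w_before = $str  =~ /^\w/ ? 1 : 0;
my $w_after  = $copy =~ /^\w/ ? 1 : 0;

print "UTF-8 flag: $flag_before -> $flag_after\n";
print "\\p{IsAlpha}: $prop_before -> $prop_after\n";
print "\\w:          $w_before -> $w_after\n";
```

The two strings compare equal with C<eq>; only their internal representation differs, which is why the property-based match is the more portable choice.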
=cut