Root Zanli
Home
Console
Upload
information
Create File
Create Folder
About
Tools
:
/
usr
/
share
/
doc
/
re2c
/
manual
/
encodings
/
Filename :
encodings.rst_
back
Copy
Speaking of encodings, it is necessary to understand the difference between code points and code units. Code point is an abstract symbol. Code unit is the smallest atomic unit of storage in the encoded text. A single code point may be represented with one or more code units. In a fixed-length encoding all code points are represented with the same number of code units. In a variable-length encoding code points may be represented with a different number of code units. Note that the "any" rule ``[^]`` matches any code point, but not necessarily any code unit. The only way to match any code unit regardless of the encoding it the default rule ``*``. ``YYCTYPE`` size should be equal to the size of code unit. Re2c supports the following encodings: ASCII, EBCDIC, UCS2, UTF8, UTF16 and UTF32. * ASCII is enabled by default. It is a fixed-length encoding with code space [0-255] and 1-byte code points and code units. * EBCDIC is enabled with ``-e, --ecb`` option. It a fixed-length encoding with code space [0-255] and 1-byte code points and code units. * UCS2 is enabled with ``-w, --wide-chars`` option. It is a fixed-length encoding with code space [0-0xFFFF] and 2-byte code points and code units. * UTF8 is enabled with ``-8, --utf-8`` option. It is a variable-length Unicode encoding with code space [0-0x10FFFF]. Code points are represented with one, two, three or four 1-byte code units. * UTF16 is enabled with ``-x, --utf-16`` option. It is a variable-length Unicode encoding with code space [0-0x10FFFF]. Code points are represented with one or two 2-byte code units. * UTF32 is enabled with ``-u, --unicode`` option. It is a fixed-length Unicode encoding with code space [0-0x10FFFF] and 4-byte code points and code units. Encodings can also be set or unset using ``re2c:flags`` configuration, for example ``re2c:flags:8 = 1;`` enables UTF8. Include file ``include/unicode_categories.re`` provides re2c definitions for the standard Unicode categories. Option ``--input-encoding utf8`` enables Unicode literals in regular expressions. Option ``--encoding-policy <fail | substitute | ignore>`` specifies the way re2c handles Unicode surrogates: code points in the range [0xD800-0xDFFF].