Re2c has two options for submatch extraction.

The first option is ``-T --tags``. With this option one can use standalone tags
of the form ``@stag`` and ``#mtag``, where ``stag`` and ``mtag`` are arbitrary
user-defined names. Tags can be used anywhere inside of a regular expression;
semantically they are just position markers. Tags of the form ``@stag`` are
called s-tags: they denote a single submatch value (the last input position
where this tag matched). Tags of the form ``#mtag`` are called m-tags: they
denote multiple submatch values (the whole history of repetitions of this tag).
All tags should be defined by the user as variables with the corresponding
names. With standalone tags re2c uses leftmost greedy disambiguation: submatch
positions correspond to the leftmost matching path through the regular
expression.

The second option is ``-P --posix-captures``: it enables POSIX-compliant
capturing groups. In this mode parentheses in regular expressions denote the
beginning and the end of capturing groups; the whole regular expression is
group number zero. The number of groups for the matching rule is stored in the
variable ``yynmatch``, and submatch results are stored in the ``yypmatch``
array. Both ``yynmatch`` and ``yypmatch`` should be defined by the user, and
the size of ``yypmatch`` must be at least ``[yynmatch * 2]``. Re2c provides a
directive ``/*!maxnmatch:re2c*/`` that defines ``YYMAXNMATCH``: a constant equal
to the maximal value of ``yynmatch`` among all rules. Note that re2c implements
POSIX-compliant disambiguation: each subexpression matches as long as possible,
and subexpressions that start earlier in the regular expression have priority
over those starting later. Capturing groups are translated into s-tags under
the hood, therefore we use the word "tag" to describe them as well.

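
The ``[yynmatch * 2]`` size requirement can be illustrated with a small helper
that reads one capturing group out of ``yypmatch``. This is only a sketch: the
names ``group_span`` and ``get_group`` are hypothetical and not part of re2c.
It assumes that group ``i`` occupies the pair ``yypmatch[2 * i]`` (start) and
``yypmatch[2 * i + 1]`` (end), and that a group which did not take part in the
match holds ``NULL``, the s-tag default value with the default API.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open range [begin, end) of input covered by one capturing group. */
typedef struct {
    const char *begin;
    const char *end;
} group_span;

/* Hypothetical helper (not provided by re2c): fetch capturing group `i`.
 * Assumes group i is stored at yypmatch[2 * i] and yypmatch[2 * i + 1].
 * Returns 1 and fills `out` if the group exists and matched, 0 otherwise. */
static int get_group(const char *const *yypmatch, size_t yynmatch,
                     size_t i, group_span *out)
{
    assert(yypmatch != NULL && out != NULL);
    if (i >= yynmatch) {
        return 0; /* the matching rule has no such group */
    }
    if (yypmatch[2 * i] == NULL || yypmatch[2 * i + 1] == NULL) {
        return 0; /* the group did not participate in the match */
    }
    out->begin = yypmatch[2 * i];
    out->end = yypmatch[2 * i + 1];
    return 1;
}
```

In a lexer, ``yypmatch`` would typically be declared with ``YYMAXNMATCH * 2``
elements so that it is large enough for every rule; group zero then always
describes the whole match.
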
With both ``-P --posix-captures`` and ``-T --tags`` options re2c uses the
efficient submatch extraction algorithm described in the
`Tagged Deterministic Finite Automata with Lookahead <https://arxiv.org/abs/1907.08837>`_
paper. The overhead on submatch extraction in the generated lexer grows with the
number of tags; if this number is moderate, the overhead is barely noticeable.

In the lexer tags are implemented using a number of tag variables generated by
re2c. There is no one-to-one correspondence between tag variables and tags: a
single variable may be reused for different tags, and one tag may require
multiple variables to hold all its ambiguous values. Eventually ambiguity is
resolved, and only one final variable per tag survives. When a rule matches, all
its tags are set to the values of the corresponding tag variables. The exact
number of tag variables is unknown to the user; this number is determined by
re2c. However, tag variables should be defined by the user as a part of the
lexer state and updated by ``YYFILL``, therefore re2c provides directives
``/*!stags:re2c*/`` and ``/*!mtags:re2c*/`` that can be used to declare,
initialize and manipulate tag variables. These directives have two optional
configurations: ``format = "@@";`` (specifies the template where ``@@`` is
substituted with the name of each tag variable), and ``separator = "";``
(specifies the piece of code used to join the generated pieces for different
tag variables).

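
The effect of ``format`` and ``separator`` can be modelled in a few lines of C.
The function below is not re2c code: ``expand_tags`` is a hypothetical name, and
the tag variable names ``yyt1`` and ``yyt2`` are assumed for illustration only,
since the real names and their number are chosen by re2c. The sketch substitutes
every ``@@`` in the template with a variable name and joins the results with the
separator.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Illustrative model of the directive expansion (not re2c's implementation):
 * expand `format` once per name, replacing each "@@" with that name, and join
 * the pieces with `separator`. Returns 0 on success, -1 if `out` is too small. */
static int expand_tags(char *out, size_t cap, const char *format,
                       const char *separator,
                       const char *const *names, size_t count)
{
    size_t len = 0, i;
    assert(out != NULL && format != NULL && separator != NULL);
    if (cap == 0) {
        return -1;
    }
    out[0] = '\0';
    for (i = 0; i < count; ++i) {
        const char *p = format;
        if (i > 0) {
            /* the separator goes between pieces, not after the last one */
            size_t n = strlen(separator);
            if (len + n >= cap) return -1;
            memcpy(out + len, separator, n);
            len += n;
        }
        while (*p != '\0') {
            if (p[0] == '@' && p[1] == '@') {
                /* substitute the tag variable name for "@@" */
                size_t n = strlen(names[i]);
                if (len + n >= cap) return -1;
                memcpy(out + len, names[i], n);
                len += n;
                p += 2;
            } else {
                if (len + 1 >= cap) return -1;
                out[len++] = *p++;
            }
        }
    }
    out[len] = '\0';
    return 0;
}
```

Under these assumptions, the template ``const char *@@;`` with a newline
separator yields one declaration per tag variable, and a template such as
``if (@@) @@ -= shift;`` (where ``shift`` is a hypothetical variable in
``YYFILL``) yields one adjustment statement per tag variable.
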
S-tags support the following operations:

* save input position to an s-tag: ``t = YYCURSOR`` with default API or a
  user-defined operation ``YYSTAGP(t)`` with generic API
* save default value to an s-tag: ``t = NULL`` with default API or a
  user-defined operation ``YYSTAGN(t)`` with generic API
* copy one s-tag to another: ``t1 = t2``

M-tags support the following operations:

* append input position to an m-tag: a user-defined operation ``YYMTAGP(t)``
  with both default and generic API
* append default value to an m-tag: a user-defined operation ``YYMTAGN(t)``
  with both default and generic API
* copy one m-tag to another: ``t1 = t2``

S-tags can be implemented as scalar values (pointers or offsets). M-tags need a
more complex representation, as they need to store a sequence of tag values. The
most naive and inefficient representation of an m-tag is a list (array, vector)
of tag values; a more efficient representation is to store all m-tags in a
prefix-tree represented as an array of nodes ``(v, p)``, where ``v`` is the tag
value and ``p`` is a pointer to the parent node.
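
The prefix-tree representation can be sketched as follows. This is one possible
layout under stated assumptions, not the only way to implement m-tags: the
names ``mtag_node``, ``mtag_pool``, ``mtag_append`` and ``mtag_unwind`` are
hypothetical, the pool has a fixed capacity for brevity, and the parent
"pointer" ``p`` is stored as an array index. An m-tag variable is then just an
``int`` that indexes the last node of its history, so copying one m-tag to
another stays a plain assignment ``t1 = t2``.

```c
#include <assert.h>
#include <stddef.h>

enum { MTAG_ROOT = -1, MTAG_POOL_MAX = 1024 };

/* One tree node (v, p). */
typedef struct {
    const char *value; /* v: tag value; NULL stands for the default value */
    int parent;        /* p: index of the parent node, MTAG_ROOT if none */
} mtag_node;

/* All m-tags share one pool of nodes; histories with a common prefix
 * share the nodes of that prefix. */
typedef struct {
    mtag_node nodes[MTAG_POOL_MAX];
    int count;
} mtag_pool;

/* Append `value` to the history referenced by `*tag`.
 * Returns 0 on success, -1 if the pool is full. */
static int mtag_append(mtag_pool *pool, int *tag, const char *value)
{
    assert(pool != NULL && tag != NULL);
    if (pool->count >= MTAG_POOL_MAX) {
        return -1;
    }
    pool->nodes[pool->count].value = value;
    pool->nodes[pool->count].parent = *tag;
    *tag = pool->count++;
    return 0;
}

/* Copy the history of `tag` into `out`, oldest value first, writing at most
 * `cap` entries. Returns the total number of values in the history. */
static size_t mtag_unwind(const mtag_pool *pool, int tag,
                          const char **out, size_t cap)
{
    size_t n = 0, i;
    int t;
    for (t = tag; t != MTAG_ROOT; t = pool->nodes[t].parent) {
        ++n;
    }
    i = n;
    for (t = tag; t != MTAG_ROOT; t = pool->nodes[t].parent) {
        --i;
        if (i < cap) {
            out[i] = pool->nodes[t].value;
        }
    }
    return n;
}
```

With such a pool in the lexer state, ``YYMTAGP(t)`` could be defined to call
``mtag_append`` with the current input position and ``YYMTAGN(t)`` to call it
with ``NULL``. Appending allocates one node and never copies the existing
history, which is what makes this cheaper than keeping a separate list per
m-tag.
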