Re2c has two options for submatch extraction.

The first option is ``-T --tags``. With this option one can use standalone tags
of the form ``@stag`` and ``#mtag``, where ``stag`` and ``mtag`` are arbitrary
user-defined names. Tags can be used anywhere inside of a regular expression;
semantically they are just position markers. Tags of the form ``@stag`` are
called s-tags: they denote a single submatch value (the last input position
where this tag matched). Tags of the form ``#mtag`` are called m-tags: they
denote multiple submatch values (the whole history of repetitions of this tag).
All tags should be defined by the user as variables with the corresponding
names. With standalone tags re2c uses leftmost greedy disambiguation: submatch
positions correspond to the leftmost matching path through the regular
expression.
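
To make the difference between the two kinds of tags concrete, here is a minimal hand-written sketch in plain C (not re2c output). It imitates what a conceptual rule such as ``(#m @s [a-z]+ " ")*`` would record: the s-tag keeps only the last position, while the m-tag keeps every position. The names ``struct tags``, ``scan_words`` and ``MAX_HISTORY`` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_HISTORY 16  /* arbitrary capacity for this sketch */

/* Hypothetical result holder: one s-tag and one m-tag. */
struct tags {
    const char *s;               /* s-tag: last matched position only */
    const char *m[MAX_HISTORY];  /* m-tag: every matched position, in order */
    size_t m_count;
};

/* Hand-written stand-in for the conceptual rule (#m @s [a-z]+ " ")*
 * applied to a NUL-terminated input. Both tags sit at the start of each
 * repetition, so they record where each word begins. */
void scan_words(const char *input, struct tags *out)
{
    const char *p = input;
    out->s = NULL;   /* default value: the tag did not match */
    out->m_count = 0;
    for (;;) {
        const char *start = p;
        const char *q = p;
        while (*q >= 'a' && *q <= 'z') ++q;
        if (q == start || *q != ' ') break;  /* no further complete repetition */
        out->s = start;                      /* s-tag: overwrite */
        assert(out->m_count < MAX_HISTORY);
        out->m[out->m_count++] = start;      /* m-tag: append */
        p = q + 1;
    }
}
```

For the input ``"ab cd ef "`` this sketch leaves the s-tag at offset 6 and the m-tag history at offsets 0, 3 and 6.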

The second option is ``-P --posix-captures``: it enables POSIX-compliant
capturing groups. In this mode parentheses in regular expressions denote the
beginning and the end of capturing groups; the whole regular expression is group
number zero. The number of groups for the matching rule is stored in a variable
``yynmatch``, and submatch results are stored in the ``yypmatch`` array. Both
``yynmatch`` and ``yypmatch`` should be defined by the user, and ``yypmatch``
size must be at least ``[yynmatch * 2]``. Re2c provides a directive
``/*!maxnmatch:re2c*/`` that defines ``YYMAXNMATCH``: a constant equal to the
maximal value of ``yynmatch`` among all rules. Note that re2c implements
POSIX-compliant disambiguation: each subexpression matches as long as possible,
and subexpressions that start earlier in the regular expression have priority over
those starting later. Capturing groups are translated into s-tags under the
hood, therefore we use the word "tag" to describe them as well.
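
The layout of ``yypmatch`` can be illustrated with a small C helper. This is a sketch under the assumption that group ``k`` occupies the two slots ``yypmatch[2 * k]`` (start) and ``yypmatch[2 * k + 1]`` (end, exclusive), and that an unmatched group holds ``NULL``. The helper name ``group_length`` is hypothetical, and in a real lexer the array would be filled by the generated code rather than by hand.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length of capturing group k, or -1 if the group did not match.
 * Assumed layout: yypmatch[2 * k] is the start of group k and
 * yypmatch[2 * k + 1] is its end (exclusive); group 0 is the whole match. */
ptrdiff_t group_length(const char *const *yypmatch, size_t yynmatch, size_t k)
{
    const char *start;
    const char *end;
    assert(k < yynmatch);
    start = yypmatch[2 * k];
    end = yypmatch[2 * k + 1];
    if (start == NULL || end == NULL) {
        return -1;  /* assumed default value for a group that did not match */
    }
    return end - start;
}
```

With a rule that has two parenthesized groups, ``yynmatch`` is 3, so the array needs at least 6 elements.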

With both ``-P --posix-captures`` and ``-T --tags`` options re2c uses the efficient
submatch extraction algorithm described in the
`Tagged Deterministic Finite Automata with Lookahead <https://arxiv.org/abs/1907.08837>`_
paper. The overhead of submatch extraction in the generated lexer grows with the
number of tags; if this number is moderate, the overhead is barely
noticeable. In the lexer, tags are implemented using a number of tag variables
generated by re2c. There is no one-to-one correspondence between tag variables
and tags: a single variable may be reused for different tags, and one tag may
require multiple variables to hold all its ambiguous values. Eventually
ambiguity is resolved, and only one final variable per tag survives. When a rule
matches, all its tags are set to the values of the corresponding tag variables.
The exact number of tag variables is unknown to the user; this number is
determined by re2c. However, tag variables should be defined by the user as a
part of the lexer state and updated by ``YYFILL``, therefore re2c provides
directives ``/*!stags:re2c*/`` and ``/*!mtags:re2c*/`` that can be used to
declare, initialize and manipulate tag variables. These directives have two
optional configurations: ``format = "@@";`` (specifies the template where ``@@``
is substituted with the name of each tag variable), and ``separator = "";``
(specifies the piece of code used to join the generated pieces for different
tag variables).
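
As a rough illustration of why tag variables belong to the lexer state, the sketch below shows by hand what the two directives might expand to. It assumes two s-tag variables named ``yyt1`` and ``yyt2``; the real names and their number are chosen by re2c. The struct corresponds to a declaration template such as ``format = "const char *@@;";`` and the function body to an update template such as ``format = "if (st->@@) st->@@ -= shift;";``. The names ``struct lexer_state`` and ``shift_tags`` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical lexer state. The two tag fields stand in for what the stags
 * directive might generate from a declaration template; the actual set of
 * tag variables is decided by re2c. */
struct lexer_state {
    const char *cursor;
    const char *yyt1;
    const char *yyt2;
};

/* What a buffer refill might do after moving the unread data `shift` bytes
 * toward the start of the buffer: every pointer into the buffer, including
 * each tag variable that is set, has to move by the same amount. */
void shift_tags(struct lexer_state *st, ptrdiff_t shift)
{
    assert(st != NULL);
    st->cursor -= shift;
    if (st->yyt1) st->yyt1 -= shift;  /* one generated piece per tag variable */
    if (st->yyt2) st->yyt2 -= shift;
}
```

The ``NULL`` checks keep unset tag variables at their default value instead of turning them into invalid pointers.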

S-tags support the following operations:

* save input position to an s-tag: ``t = YYCURSOR`` with the default API or a
  user-defined operation ``YYSTAGP(t)`` with the generic API
* save default value to an s-tag: ``t = NULL`` with the default API or a
  user-defined operation ``YYSTAGN(t)`` with the generic API
* copy one s-tag to another: ``t1 = t2``

M-tags support the following operations:

* append input position to an m-tag: a user-defined operation ``YYMTAGP(t)``
  with both the default and the generic API
* append default value to an m-tag: a user-defined operation ``YYMTAGN(t)``
  with both the default and the generic API
* copy one m-tag to another: ``t1 = t2``
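
The s-tag operations listed above could be defined for the generic API roughly as follows. This is only a sketch: it assumes that s-tags are stored as offsets, that the input state is reachable through a pointer named ``in``, and that ``-1`` serves as the default value. ``struct input``, ``NO_TAG`` and ``demo_stag_ops`` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical input state for a lexer that tracks offsets, not pointers. */
struct input {
    const char *data;
    size_t pos;  /* current input position as an offset into data */
};

#define NO_TAG ((ptrdiff_t)-1)  /* assumed default ("did not match") value */

/* One possible definition of the user-defined s-tag operations.
 * Both macros expect a variable `in` of type struct input * in scope. */
#define YYSTAGP(t) ((t) = (ptrdiff_t)in->pos)
#define YYSTAGN(t) ((t) = NO_TAG)

/* Exercises the three s-tag operations by hand: save position or default
 * value into t2, then copy t2 into t1 with a plain assignment. */
ptrdiff_t demo_stag_ops(struct input *in, int matched)
{
    ptrdiff_t t1, t2;
    assert(in != NULL);
    if (matched) YYSTAGP(t2); else YYSTAGN(t2);
    t1 = t2;  /* copy one s-tag to another */
    return t1;
}
```

Because s-tags are scalars here, copying needs no user-defined operation.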

S-tags can be implemented as scalar values (pointers or offsets). M-tags need a
more complex representation, as they need to store a sequence of tag values. The
most naive and inefficient representation of an m-tag is a list (array, vector)
of tag values; a more efficient representation is to store all m-tags in a
prefix tree represented as an array of nodes ``(v, p)``, where ``v`` is the tag value
and ``p`` is a pointer to the parent node.
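
The prefix-tree representation can be sketched in C as follows. Each m-tag is just the index of its most recent node, so copying an m-tag is a plain assignment and histories that share a prefix share nodes. This sketch assumes a fixed-size pool and uses array indices in place of parent pointers; all names (``mtag_node``, ``mtag_pool``, ``mtag_append``, ``mtag_unwind``) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MTAG_ROOT (-1)     /* index meaning "empty history" */
#define MTAG_POOL_SIZE 64  /* arbitrary capacity for this sketch */

struct mtag_node {
    const char *v;  /* tag value (NULL for the default value) */
    int p;          /* index of the parent node, or MTAG_ROOT */
};

struct mtag_pool {
    struct mtag_node nodes[MTAG_POOL_SIZE];
    int size;       /* number of nodes in use */
};

/* Append value v to the history of m-tag *t: allocate a node whose parent
 * is the old head, then make *t point at the new node. */
void mtag_append(struct mtag_pool *pool, int *t, const char *v)
{
    assert(pool->size < MTAG_POOL_SIZE);
    pool->nodes[pool->size].v = v;
    pool->nodes[pool->size].p = *t;
    *t = pool->size++;
}

/* Walk from the head of m-tag t back to the root and write the values into
 * out in chronological order. Returns the length of the history. */
size_t mtag_unwind(const struct mtag_pool *pool, int t,
                   const char **out, size_t cap)
{
    size_t n = 0, i;
    int x;
    for (x = t; x != MTAG_ROOT; x = pool->nodes[x].p) ++n;
    assert(n <= cap);
    i = n;
    for (x = t; x != MTAG_ROOT; x = pool->nodes[x].p) out[--i] = pool->nodes[x].v;
    return n;
}
```

With such a pool in scope, ``YYMTAGP(t)`` might be defined as ``mtag_append(&pool, &t, YYCURSOR)`` and ``YYMTAGN(t)`` as ``mtag_append(&pool, &t, NULL)``; both definitions are assumptions about how the user-defined operations could be wired up.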