This tool generates a C API for SpecRegex with specific regular expressions. The configuration format is YAML. Each regular expression can be generated in different “modes” (for example, for searching, prefix matching, or replacement). For each regular expression (or, where applicable, regex-replacement pair), each mode that is specified for that regular expression generates one or more functions to implement it. The set of equivalent functions for different regular expressions is called a “function kind”.
Example configuration:
name: feline_regexes
visible: true
regexes:
  "caat":
    regex: "c(a+)t"
    modes: [ cpu_search, cpu_static_replace ]
    replacements:
      "cat": "cat"
      "yay": "y$1y"
  "feline":
    regex: "cat|kitten|lion|tiger"
    modes: [ cpu_static_replace ]
    replacement: "meow"
This tool comes with a CMake wrapper to generate a library based on a configuration file. It is expected that the Regex2TU tool will be used via this wrapper.
Use include(regex2tu.cmake) to include the CMake wrapper. Use add_specregex_capi_library(TARGET CONFIG) to add a library called TARGET with sources generated in the build tree and YAML configuration located in CONFIG. The NOINSTALL flag can be specified to prevent any files related to the library from being installed. Any other arguments get passed to add_library.
The library header is installed to include/spec/regex/${name}.h. Other language bindings are installed in share/SpecRegex/bindings.
Note: The library sources are generated at configure time because the names of the generated files depend on the configuration YAML. If the configuration file (or the tool’s Python script) is changed, CMake will rerun itself.
Example:
cmake_minimum_required(VERSION 3.16)
include(xcmake/scripts/Init.cmake)
project(FelineRegexes)
include(XCMake)
find_package(SpecRegex)
add_specregex_capi_library(feline_regexes "${CMAKE_CURRENT_SOURCE_DIR}/feline_regexes.yaml")
The top level of the configuration format is a YAML associative array. Its keys are as follows:
name: The name to use for the library. This name is used in header names (${name}.h) and as a prefix for the generated functions and types. The value of this field is written as ${name} in code examples and specifications in this documentation. This key is mandatory.
visible: If true, symbol visibility is explicitly set to visible. If false, symbol visibility is explicitly set to hidden. If not specified, no visibility attributes will be generated unless -s is given to regex2tu.py. Note that the add_specregex_capi_library CMake function passes -s for shared libraries.
lambdaReplacementBufferSize: If set, the lambda replacement APIs will offer a buffer of this size to lambdas. If not specified, the buffer is nullptr. The buffer is allocated on the stack, although it may in future be allocated as a thread_local buffer if large. See the cpu_lambda_replace subsection for more details.
extraBindingLanguages: A list of language bindings to generate in addition to C. Supported elements are: C#.
regexes: Specifies which modes to generate for which regular expression patterns and replacements. It is an associative array from the name by which a regular expression’s API is referred to, to a specification for that regular expression’s API. A key in this associative array is written as ${regex} in code examples and specifications in this documentation. The format of that specification is explained in the “Regular expression specification” section. This key is mandatory.
Regular expression specification
Each regular expression is specified by an associative array, which is itself a value in the regexes associative array, keyed by the name used to refer to that regular expression. The following keys are recognised:
regex: The regular expression pattern in SpecRegex’s regular expression pattern language. This key is mandatory.
modes: A list of modes for which to generate APIs for this regular expression. See the Modes subsection for a description of each permitted mode. This key is mandatory and must not be empty.
replacement: If only a single static replacement is ever used, this key may be used to specify it. The value of this key must be a string in SpecRegex’s replacement specification language. If this key is used, static replacement APIs are named using only ${regex}. This key must not be used with replacements.
replacements: An associative array from replacement name to replacement specification in SpecRegex’s replacement specification language. This allows multiple replacements to be used. A key in this associative array is written as ${replacement} in code examples and specifications in this documentation. This key must not be used with replacement.
Modes
This subsection describes the functions that each mode generates.
The functions and types in this sub-subsection can be generated or used by multiple modes.
match_range
This type stores a single capture group. It is a range of character positions in the input for a given (sub-)match. It has the format:
typedef struct ${name}_match_range {
    int start;
    int end;
} ${name}_match_range;
The fields have the following meaning:
start: The position of the first character of the (sub-)match.
end: The position after the last character of the (sub-)match.
Some functions require a buffer holding an array of this type to be specially aligned. Where this is the case for a given function argument, it is noted.
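Because end is the position after the last character, the length of a (sub-)match is end - start. The following is a minimal sketch of working with a range, assuming the library name feline_regexes from the example configuration; range_length and copy_range are hypothetical helpers, not part of the generated API.

```c
#include <stddef.h>
#include <string.h>

/* The documented struct, written out for the example library name
 * `feline_regexes`. */
typedef struct feline_regexes_match_range {
    int start;
    int end;
} feline_regexes_match_range;

/* Hypothetical helper: number of characters covered by a range.
 * `end` is one past the last character, so no +1 is needed. */
static int range_length(feline_regexes_match_range r) {
    return r.end - r.start;
}

/* Hypothetical helper: copy the matched text into `out` as a NUL-terminated
 * string. Returns the number of characters copied, or -1 if the range is
 * malformed or `out` is too small. */
static int copy_range(const char* input, feline_regexes_match_range r,
                      char* out, size_t out_size) {
    int len = range_length(r);
    if (len < 0 || (size_t)len + 1 > out_size) {
        return -1;
    }
    memcpy(out, input + r.start, (size_t)len);
    out[len] = '\0';
    return len;
}
```

For the input "caaat" and the range { 1, 4 }, copy_range would produce the three characters between the c and the t.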
group_count
This function has the signature:
int ${name}_${regex}_group_count();
This function returns the number of capture groups (including group 0) that a given regular expression has. This is useful for interacting with the C API programmatically (especially with the switch API), because the number of capture groups to allocate per match, in a buffer passed to a function that returns matches, does not have to be hard-coded. This function is always generated.
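As a rough sketch of sizing a match buffer without hard-coding the group count, the code below allocates room for max_matches matches and indexes it using the layout this documentation gives for cpu_search (group g of match n at index n * group_count + g). The library name feline_regexes and regex name caat come from the example configuration. The group_count function here is a stand-in so the sketch compiles on its own, and match_buffer, alloc_match_buffer, group_of_match and is_16_byte_aligned are hypothetical helpers.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct feline_regexes_match_range {
    int start;
    int end;
} feline_regexes_match_range;

/* Stand-in for the generated function. "c(a+)t" has group 0 (the whole
 * match) and group 1, hence 2. */
static int feline_regexes_caat_group_count(void) { return 2; }

/* Hypothetical holder: `base` is what must be passed to free(), `matches`
 * is a 16-byte-aligned view into the same allocation. */
typedef struct match_buffer {
    void* base;
    feline_regexes_match_range* matches;
} match_buffer;

/* Allocate room for `max_matches` matches. Over-allocates by 15 bytes and
 * rounds the address up, which works in plain C99. */
static match_buffer alloc_match_buffer(int max_matches) {
    match_buffer b = { NULL, NULL };
    if (max_matches < 0) {
        return b;
    }
    size_t count = (size_t)feline_regexes_caat_group_count() * (size_t)max_matches;
    b.base = malloc(count * sizeof(feline_regexes_match_range) + 15u);
    if (b.base != NULL) {
        uintptr_t addr = ((uintptr_t)b.base + 15u) & ~(uintptr_t)15u;
        b.matches = (feline_regexes_match_range*)addr;
    }
    return b;
}

/* Group g of match n, following the documented buffer layout. */
static feline_regexes_match_range* group_of_match(
        feline_regexes_match_range* matches, int n, int g) {
    return &matches[n * feline_regexes_caat_group_count() + g];
}

static int is_16_byte_aligned(const void* p) {
    return ((uintptr_t)p % 16u) == 0;
}
```

A caller would pass b.matches to the match-returning function and release the memory with free(b.base).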
group_name
This function has the signature:
const char* ${name}_${regex}_group_name(int group_num);
This function gets the capture group name corresponding to a given numbered capture group.
Return value: The name of the capture group if it is named, and nullptr otherwise.
group_num: The number of the capture group to get the name of.
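A minimal sketch of listing capture groups by name follows. It assumes a hypothetical library named demo with a regex named date whose group 1 is named "year" and whose other groups are unnamed. Both demo_date_* functions are stand-ins for generated code, and describe_groups is a hypothetical helper.

```c
#include <stdio.h>

/* Stand-ins for the generated functions of the hypothetical `demo` library. */
static int demo_date_group_count(void) { return 3; }

static const char* demo_date_group_name(int group_num) {
    return group_num == 1 ? "year" : NULL;
}

/* Hypothetical helper: write a space-separated "number:name" list into
 * `out`, using "(unnamed)" where group_name returns a null pointer. */
static const char* describe_groups(char* out, size_t out_size) {
    size_t used = 0;
    int count = demo_date_group_count();
    if (out_size == 0) {
        return out;
    }
    out[0] = '\0';
    for (int g = 0; g < count && used < out_size; ++g) {
        const char* name = demo_date_group_name(g);
        int n = snprintf(out + used, out_size - used, "%s%d:%s",
                         g == 0 ? "" : " ", g,
                         name != NULL ? name : "(unnamed)");
        if (n < 0) {
            break;
        }
        used += (size_t)n;
    }
    return out;
}
```

The null check matters because printing a null pointer with %s is undefined behaviour in C.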
cpu_is_complete_match
This function has the signature:
int ${name}_${regex}_cpu_is_complete_match(const char* input, int input_length);
This function tests to see if the whole of a given input string can match the regular expression. It has the following arguments and return value:
Return value: 1 if the whole input string matches the regular expression, and 0 otherwise.
input: A pointer to the input string.
input_length: The length of the input string.
cpu_get_complete_match
This function has the signature:
int ${name}_${regex}_cpu_get_complete_match(const char* input, int input_length, const ${name}_match_range* match);
This function tests to see if the whole of a given input string can match the regular expression. If it can, that match’s capture groups (including group 0) are extracted. It has the following arguments and return value:
Return value: 1 if the whole input string matches the regular expression, and 0 otherwise.
input: A pointer to the input string.
input_length: The length of the input string.
match: A pointer to a buffer where the match is to be written. The gth capture group of the match is written to match[g]. The buffer must be at least ${name}_${regex}_group_count() elements in size.
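The following sketch shows how a caller might read capture group 1 after a complete match, using the caat regex ("c(a+)t") from the example configuration. The two feline_regexes_caat_* functions are hand-written stand-ins for generated code, included only so the example runs by itself. The stand-in takes a non-const match pointer so that it can write the result, which differs from the const in the signature above. count_a_run is a hypothetical caller.

```c
typedef struct feline_regexes_match_range {
    int start;
    int end;
} feline_regexes_match_range;

/* Stand-ins for generated code: a hand-written matcher for "c(a+)t". */
static int feline_regexes_caat_group_count(void) { return 2; }

static int feline_regexes_caat_cpu_get_complete_match(
        const char* input, int input_length,
        feline_regexes_match_range* match) {
    if (input_length < 3 || input[0] != 'c' || input[input_length - 1] != 't') {
        return 0;
    }
    for (int i = 1; i < input_length - 1; ++i) {
        if (input[i] != 'a') {
            return 0;
        }
    }
    match[0].start = 0;               /* group 0: the whole match */
    match[0].end = input_length;
    match[1].start = 1;               /* group 1: the run of 'a's */
    match[1].end = input_length - 1;
    return 1;
}

enum { MAX_GROUPS = 8 };

/* Hypothetical caller: length of the 'a' run, or -1 if the whole input does
 * not match (or the regex has more groups than this buffer can hold). */
static int count_a_run(const char* input, int input_length) {
    feline_regexes_match_range match[MAX_GROUPS];
    if (feline_regexes_caat_group_count() > MAX_GROUPS) {
        return -1;
    }
    if (!feline_regexes_caat_cpu_get_complete_match(input, input_length, match)) {
        return -1;
    }
    return match[1].end - match[1].start;
}
```

Note that "xcat" is rejected: a complete match must cover the whole input, not just part of it.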
cpu_has_prefix_match
This function has the signature:
int ${name}_${regex}_cpu_has_prefix_match(const char* input, int input_length);
This function tests to see if the start of a given input string matches the regular expression. It has the following arguments and return value:
Return value: 1 if the start of the input string matches the regular expression, and 0 otherwise.
input: A pointer to the input string.
input_length: The length of the input string.
cpu_get_prefix_match
This function has the signature:
int ${name}_${regex}_cpu_get_prefix_match(const char* input, int input_length, const ${name}_match_range* match);
This function tests to see if the start of a given input string matches the regular expression. If it does, that match’s capture groups (including group 0) are extracted. It has the following arguments and return value:
Return value: 1 if the start of the input string matches the regular expression, and 0 otherwise.
input: A pointer to the input string.
input_length: The length of the input string.
match: A pointer to a buffer where the match is to be written. The gth capture group of the match is written to match[g]. The buffer must be at least ${name}_${regex}_group_count() elements in size.
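One use of a prefix match is to consume tokens from the front of a string, advancing by the end position of group 0 each time. The sketch below does this for the caat regex ("c(a+)t"). The matcher is a hand-written stand-in for generated code (with a non-const match pointer so it can write the result), and count_leading_tokens is a hypothetical caller. Positions are relative to the pointer passed in, which is how the stand-in behaves and is assumed here for the generated function as well.

```c
typedef struct feline_regexes_match_range {
    int start;
    int end;
} feline_regexes_match_range;

/* Stand-in for generated code: matches "c(a+)t" at the start of the input. */
static int feline_regexes_caat_cpu_get_prefix_match(
        const char* input, int input_length,
        feline_regexes_match_range* match) {
    int i = 1;
    if (input_length < 3 || input[0] != 'c') {
        return 0;
    }
    while (i < input_length && input[i] == 'a') {
        ++i;
    }
    if (i == 1 || i >= input_length || input[i] != 't') {
        return 0;
    }
    match[0].start = 0;      /* group 0: the whole prefix match */
    match[0].end = i + 1;
    match[1].start = 1;      /* group 1: the run of 'a's */
    match[1].end = i;
    return 1;
}

/* Hypothetical caller: count how many consecutive "c(a+)t" tokens the input
 * starts with. */
static int count_leading_tokens(const char* input, int input_length) {
    feline_regexes_match_range match[2]; /* 2 == group count of "c(a+)t" */
    int pos = 0;
    int count = 0;
    while (pos < input_length &&
           feline_regexes_caat_cpu_get_prefix_match(input + pos,
                                                    input_length - pos, match)) {
        pos += match[0].end; /* skip past the token just matched */
        ++count;
    }
    return count;
}
```

Every successful match of this pattern is at least three characters long, so the loop always makes progress. A pattern that can match the empty string would need an extra guard.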
cpu_has_match
This function has the signature:
int ${name}_${regex}_cpu_has_match(const char* input, int input_length);
This function tests to see if any substring of a given input string matches the regular expression. It has the following arguments and return value:
Return value: 1 if any substring of the input string matches the regular expression, and 0 otherwise.
input: A pointer to the input string.
input_length: The length of the input string.
cpu_search
This function has the signature:
int ${name}_${regex}_cpu_search(const char* input,
                                int input_length,
                                const ${name}_match_range* matches,
                                int max_matches);
This function performs a search operation on the given input string using the given regular expression. It finds the first max_matches non-overlapping matches (fewer if there are not that many matches) and returns them in the matches buffer. It has the following arguments and return value:
Return value: The number of matches written to matches.
input: A pointer to the input string.
input_length: The length of the input string.
matches: A pointer to a buffer where the matches are to be written. The gth group of the nth match is stored at matches[n * ${name}_${regex}_group_count() + g]. The buffer must be at least ${name}_${regex}_group_count() * max_matches elements in size. It must be aligned to a 16-byte boundary.
max_matches: The maximum number of matches to return.
cpu_dynamic_literal_replace
These functions replace regex matches with a string literal given at runtime.
cpu_dlr
This function has the signature:
int ${name}_${regex}_cpu_dlr(const char* input,
                             int input_length,
                             const char* replacement,
                             int replacement_length,
                             char* output,
                             int output_length);
This function performs an out-of-place replacement of regex matches with a string literal given at runtime.
Return value: ... output_length).
input: A pointer to the input string.
input_length: The length of the input string.
replacement: A pointer to the replacement string.
replacement_length: The length of the replacement string.
output: A pointer to the buffer in which the output string is to be written.
output_length: The length of the buffer pointed to by output.
cpu_inplace_dlr
This function has the signature:
int ${name}_${regex}_cpu_inplace_dlr(const char* replacement, int replacement_length, char* buffer, int input_length);
This function performs an in-place replacement of regex matches with a string literal given at runtime.
replacement: A pointer to the replacement string.
replacement_length: The length of the replacement string. It is undefined behaviour for this to be longer than the shortest match that is to be replaced.
buffer: A pointer to the buffer that contains the input string and to which the output string is written.
input_length: The length of the input string in buffer.
cpu_inplace_allocating_dlr
This function has the signature:
int ${name}_${regex}_cpu_inplace_allocating_dlr
    (char** buffer, int* buffer_length, int input_length, const char* replacement,
     int replacement_length, char* (*alloc)(int), void (*dealloc)(char*));
This function tries to perform an in-place replacement of regex matches with a string literal given at runtime, but allocates more memory if needed.
buffer: A pointer to a pointer to the initial buffer. A pointer to the output string (which may be the same as the input string pointer) will be written here.
buffer_length: A pointer to the length of the initial buffer. The length of the final buffer will be written here.
input_length: The length of the input string (which must be no longer than the length of the buffer in which it is stored).
replacement: A pointer to the replacement string.
replacement_length: The length of the replacement string.
alloc: A pointer to a function that allocates memory. If null, then the C++ new char[] will be used.
dealloc: A pointer to a function that deallocates memory. If null, then the C++ delete[] will be used.
cpu_static_replace
These functions use exactly one of the replacement or replacements regular expression specification fields. One of these fields is required to use this mode. These functions (and the corresponding enum entries) omit the _${replacement} suffix if replacement is used rather than replacements.
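To make the naming rule concrete, the sketch below builds the name of the out-of-place static replacement function for a given library, regex and (optional) replacement name. static_replace_function_name is a hypothetical helper for illustration only. The tool does not provide it, and the pattern it encodes is taken from the cpu_replace signature in this documentation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes the cpu_replace function name into `out`.
 * Pass a null `replacement` for a regex configured with the single
 * `replacement` key, in which case the suffix is omitted.
 * Returns what snprintf returns (the length of the full name). */
static int static_replace_function_name(char* out, size_t out_size,
                                        const char* name, const char* regex,
                                        const char* replacement) {
    if (replacement != NULL) {
        return snprintf(out, out_size, "%s_%s_cpu_replace_%s",
                        name, regex, replacement);
    }
    return snprintf(out, out_size, "%s_%s_cpu_replace", name, regex);
}
```

With the example configuration, the caat regex (which uses replacements) would yield feline_regexes_caat_cpu_replace_cat and feline_regexes_caat_cpu_replace_yay, while the feline regex (which uses replacement) would yield feline_regexes_feline_cpu_replace.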
cpu_replace
This function has the signature:
int ${name}_${regex}_cpu_replace_${replacement}
    (const char* input, int input_length, char* output, int output_length);
This function performs an out-of-place replacement of matches of the regular expression with the replacement.
Return value: ... output_length).
input: A pointer to the input string.
input_length: The length of the input string.
output: A pointer to the buffer in which the output string is to be written.
output_length: The length of the buffer pointed to by output.
cpu_inplace_replace
This function has the signature:
int ${name}_${regex}_cpu_inplace_replace_${replacement}
    (char* buffer, int input_length);
This function performs an in-place replacement of matches of the regular expression with the replacement. It is undefined behaviour for a replacement to be longer than the match it replaces.
buffer: A pointer to the buffer that contains the input string and to which the output string is written.
input_length: The length of the input string in buffer.
cpu_lambda_replace
This mode generates the lambda-replacement API. The lambda replacement APIs take a callback function (called the lambda), which is called for each match in the input string. The lambda supplies a replacement string for the match based on the input string and the match.
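A lambda with the callback signature used by this mode might look like the sketch below, written for the library name feline_regexes. It points *out at a string literal instead of writing into the offered buffer, and uses user as a call counter. One thing is assumed rather than documented: the int return value is treated as the length of the replacement string, which the text here does not state. shout_cat is a hypothetical name.

```c
#include <string.h>

typedef struct feline_regexes_match_range {
    int start;
    int end;
} feline_regexes_match_range;

/* Hypothetical lambda: replaces a match with "CAT" if its group 0 is exactly
 * "cat", and with "meow" otherwise.
 * ASSUMPTION: the return value is the replacement length. */
static int shout_cat(char** out, const char* input,
                     const feline_regexes_match_range* match, void* user) {
    int* calls = (int*)user;
    int len = match[0].end - match[0].start;
    if (calls != NULL) {
        ++*calls; /* `user` is passed through untouched, so it can carry state */
    }
    if (len == 3 && memcmp(input + match[0].start, "cat", 3) == 0) {
        *out = (char*)"CAT";  /* point at a literal rather than copying */
    } else {
        *out = (char*)"meow";
    }
    return (int)strlen(*out);
}
```

Because the lambda never writes through the pointer it initially receives in *out, it behaves the same whether or not lambdaReplacementBufferSize is set.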
cpu_inplace_expanding_lambda_replace
This function has the signature:
int ${name}_${regex}_cpu_inplace_expanding_lambda_replace
    (char* buffer, int buffer_length, int input_length,
     int (*lambda)(char** out, const char* input, const ${name}_match_range* match,
                   void* user),
     void* user);
buffer: A pointer to the buffer containing the input, and where the output will be written.
buffer_length: The length of the buffer.
input_length: The length of the input string (which must be no longer than the length of the buffer in which it is stored).
lambda: A function that generates a replacement string. Its arguments are:
    out: A pointer to a pointer to the replacement string. It is initially a pointer to a pointer to a buffer (owned by cpu_inplace_expanding_lambda_replace, of size lambdaReplacementBufferSize) that lambda may write to. If lambdaReplacementBufferSize is not given, the buffer is initially nullptr. Alternatively, lambda may set this to a pointer to a different buffer managed by the user. For example, a string literal could be conditionally returned, which would save a copy.
    input: A pointer to the input string containing the match. Since this is an in-place replace, only the region specified by match is guaranteed to have the same contents as the corresponding match in the input string given to cpu_inplace_expanding_lambda_replace.
    match: An array of match ranges comprising the match currently being replaced. The number of elements of the array is the same as the number of groups in the regular expression (including group zero). Note that, although the match refers to the same contents as in the original string, those contents may have been moved, so this match gives its location within the input string passed to the lambda, which might not be its original location.
    user: The user argument given to cpu_inplace_expanding_lambda_replace.
user: A pointer that is passed to the replacement generator with no further processing.
cpu_inplace_allocating_lambda_replace
This function has the signature:
int ${name}_${regex}_cpu_inplace_allocating_lambda_replace
    (char** buffer, int* buffer_length, int input_length,
     int (*lambda)(char** out, const char* input, const ${name}_match_range* match,
                   void* user),
     void* user, char* (*alloc)(int), void (*dealloc)(char*));
buffer: A pointer to a pointer to the initial buffer. A pointer to the output string (which may be the same as the input string pointer) will be written here.
buffer_length: A pointer to the length of the initial buffer. The length of the final buffer will be written here.
input_length: The length of the input string (which must be no longer than the length of the buffer in which it is stored).
lambda: A function that generates a replacement string. Its arguments are:
    out: A pointer to a pointer to the replacement string. It is initially a pointer to a pointer to a buffer (owned by cpu_inplace_allocating_lambda_replace, of size lambdaReplacementBufferSize) that lambda may write to. If lambdaReplacementBufferSize is not given, the buffer is initially nullptr. Alternatively, lambda may set this to a pointer to a different buffer managed by the user. For example, a string literal could be conditionally returned, which would save a copy.
    input: A pointer to the input string containing the match. Since this is an in-place replace, only the region specified by match is guaranteed to have the same contents as the corresponding match in the input string given to cpu_inplace_allocating_lambda_replace.
    match: An array of match ranges comprising the match currently being replaced. The number of elements of the array is the same as the number of groups in the regular expression (including group zero).
    user: The user argument given to cpu_inplace_allocating_lambda_replace.
user: A pointer that is passed to the replacement generator with no further processing.
alloc: A pointer to a function that allocates memory. If null, then the C++ new char[] will be used.
dealloc: A pointer to a function that deallocates memory. If null, then the C++ delete[] will be used.
Switch API
To make it easier to select between regular expressions (and, where applicable, replacements) at runtime, a function can be generated for each function kind that makes this selection based on an enum parameter. The signature of each switch API function is the same as that of the corresponding function kind, except that it has an extra first parameter of type ${name}_option and its name contains neither ${regex} nor ${replacement}.
The enum has an entry for each regular expression, and each regular expression-replacement pair for which any function kind has been generated. The keys of the enum are named ${name}_${regex} or ${name}_${regex}_${replacement}.
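As an illustration of the shape described above, the sketch below hand-writes a switch function for the cpu_has_match function kind. It assumes a hypothetical library named demo with two regexes named cat and dog (literal patterns). The per-regex functions are stand-ins implemented with a plain substring search, and the enum and dispatch function follow the naming rules given here. Returning 0 for an unknown option is an assumption of this sketch, not documented behaviour.

```c
#include <string.h>

/* Helper for the stand-ins below: plain substring search. */
static int contains(const char* input, int input_length, const char* needle) {
    int n = (int)strlen(needle);
    for (int i = 0; i + n <= input_length; ++i) {
        if (memcmp(input + i, needle, (size_t)n) == 0) {
            return 1;
        }
    }
    return 0;
}

/* Stand-ins for the generated per-regex functions. */
static int demo_cat_cpu_has_match(const char* input, int input_length) {
    return contains(input, input_length, "cat");
}

static int demo_dog_cpu_has_match(const char* input, int input_length) {
    return contains(input, input_length, "dog");
}

/* One entry per regular expression, named ${name}_${regex}. */
typedef enum demo_option {
    demo_cat,
    demo_dog
} demo_option;

/* Same signature as the function kind, plus a leading option parameter and
 * no regex name in the function name. */
static int demo_cpu_has_match(demo_option option,
                              const char* input, int input_length) {
    switch (option) {
    case demo_cat: return demo_cat_cpu_has_match(input, input_length);
    case demo_dog: return demo_dog_cpu_has_match(input, input_length);
    }
    return 0; /* ASSUMPTION: unknown options report no match */
}
```

A caller can then store a demo_option value in its own configuration and choose the regular expression at runtime without a table of function pointers.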