Other Language Bindings

This tool generates a C API for SpecRegex with specific regular expressions. The configuration format is YAML. Each regular expression can be generated in different “modes” (for example, for searching, prefix matching, or replacement). For each regular expression (or, where applicable, each regex-replacement pair), every mode specified for that regular expression generates one or more functions that implement it. The set of equivalent functions for different regular expressions is called a “function kind”.

Example configuration:

name: feline_regexes
visible: true
regexes:
  "caat":
    regex: "c(a+)t"
    modes: [ cpu_search, cpu_static_replace ]
    replacements:
      "cat": "cat"
      "yay": "y$1y"
  "feline":
    regex: "cat|kitten|lion|tiger"
    modes: [ cpu_static_replace ]
    replacement: "meow"

CMake

This tool comes with a CMake wrapper to generate a library based on a configuration file. It is expected that the Regex2TU tool will be used via this wrapper.

Use include(regex2tu.cmake) to include the CMake wrapper. Use add_specregex_capi_library(TARGET CONFIG) to add a library called TARGET with sources generated in the build tree and YAML configuration located in CONFIG. The NOINSTALL flag can be specified to prevent any files related to the library from being installed. Any other arguments get passed to add_library.

The library header is installed to include/spec/regex/${name}.h. Other language bindings are installed in share/SpecRegex/bindings.

Note: The library sources are generated at configure time because the names of the generated files depend on the configuration YAML. If the configuration file (or the tool’s Python script) is changed, CMake will rerun itself.

Example:

cmake_minimum_required(VERSION 3.16)
include(xcmake/scripts/Init.cmake)
project(FelineRegexes)
include(XCMake)
find_package(SpecRegex)
add_specregex_capi_library(feline_regexes "${CMAKE_CURRENT_SOURCE_DIR}/feline_regexes.yaml")

Top level YAML keys

The top level of the configuration format is a YAML associative array. Its keys are as follows:

Regular expression specification

Each regular expression is specified by an associative array. The associative array is itself a value in an associative array whose keys are names used to refer to each regular expression. The following keys are recognised:

Modes

This subsection describes the functions that each mode generates.

Common functions and types

The functions and types in this sub-subsection can be generated or used by multiple modes.

Type match_range

This type stores a single capture group: the range of character positions in the input covered by a given (sub-)match. It has the format:

typedef struct ${name}_match_range {
    int start;
    int end;
} ${name}_match_range;

The fields have the following meaning:

Some functions require a buffer to an array of this type that is specially aligned. Where this is the case for a given function argument, it is noted.
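As an illustration of how a match_range might be consumed, the sketch below copies the text covered by a range into a separate string. It assumes the library is named feline_regexes (as in the example configuration), that start is inclusive and end is exclusive, and that both are byte offsets into the input. None of these assumptions is confirmed by this section, and copy_range is a hypothetical helper rather than part of the generated API.

```c
#include <string.h>

/* Mirrors the generated struct for a library named "feline_regexes". */
typedef struct feline_regexes_match_range {
    int start;
    int end;
} feline_regexes_match_range;

/* Hypothetical helper (not generated by the tool): copies the matched text
 * into out as a NUL-terminated string.
 * ASSUMPTION: the range is half-open, i.e. [start, end).
 * Returns the number of characters copied, or -1 if the range is invalid
 * or out is too small. */
int copy_range(const char* input, int input_length,
               const feline_regexes_match_range* range,
               char* out, int out_size)
{
    if (range->start < 0 || range->end < range->start ||
        range->end > input_length)
        return -1;
    int length = range->end - range->start;
    if (length + 1 > out_size)
        return -1;
    memcpy(out, input + range->start, (size_t)length);
    out[length] = '\0';
    return length;
}
```

If the real ranges turn out to be inclusive at both ends, only the length computation needs to change.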

Function group_count

This function has the signature:

int ${name}_${regex}_group_count();

This function returns the number of capture groups (including group 0) that a given regular expression has. This is useful when interacting with the C API programmatically (especially through the switch API): the caller can size the buffer for each requested match without hard-coding the number of capture groups. This function is always generated.
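A minimal sketch of that buffer-sizing use case follows. The group_count function here is a hand-written stand-in for the generated one (for the "caat" regex c(a+)t from the example configuration, group 0 plus one capture group), and alloc_match_buffer is a hypothetical helper. The layout of group_count consecutive entries per match is an assumption, not something this section states.

```c
#include <stdlib.h>

typedef struct feline_regexes_match_range {
    int start;
    int end;
} feline_regexes_match_range;

/* Stand-in for the generated function; in real use it comes from the
 * generated library. "c(a+)t" has group 0 plus one capture group. */
int feline_regexes_caat_group_count(void) { return 2; }

/* Hypothetical helper: allocates zeroed room for max_matches matches.
 * ASSUMPTION: each match occupies group_count consecutive entries.
 * Stores the total entry count in *entries_out; returns NULL on failure. */
feline_regexes_match_range* alloc_match_buffer(int max_matches,
                                               int* entries_out)
{
    int groups = feline_regexes_caat_group_count();
    if (max_matches <= 0 || groups <= 0)
        return NULL;
    int entries = max_matches * groups;
    *entries_out = entries;
    return (feline_regexes_match_range*)
        calloc((size_t)entries, sizeof(feline_regexes_match_range));
}
```

calloc only guarantees fundamental alignment, so a function that requires a specially aligned buffer may need an aligned allocator instead.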

Function group_name

This function has the signature:

const char* ${name}_${regex}_group_name(int group_num);

This function gets the capture group name corresponding to a given numbered capture group.

Mode cpu_is_complete_match

Function cpu_is_complete_match

This function has the signature:

int ${name}_${regex}_cpu_is_complete_match(const char* input, int input_length);

This function tests to see if the whole of a given input string can match the regular expression. It has the following arguments and return value:

Mode cpu_get_complete_match

Function cpu_get_complete_match

This function has the signature:

int ${name}_${regex}_cpu_get_complete_match(const char* input, int input_length, const ${name}_match_range* match);

This function tests to see if the whole of a given input string can match the regular expression. If it can, that match’s capture groups (including group 0) are extracted. It has the following arguments and return value:

Mode cpu_has_prefix_match

Function cpu_has_prefix_match

This function has the signature:

int ${name}_${regex}_cpu_has_prefix_match(const char* input, int input_length);

This function tests to see if the start of a given input string matches the regular expression. It has the following arguments and return value:

Mode cpu_get_prefix_match

Function cpu_get_prefix_match

This function has the signature:

int ${name}_${regex}_cpu_get_prefix_match(const char* input, int input_length, const ${name}_match_range* match);

This function tests to see if the start of a given input string matches the regular expression. If it does, that match’s capture groups (including group 0) are extracted. It has the following arguments and return value:

Mode cpu_has_match

Function cpu_has_match

This function has the signature:

int ${name}_${regex}_cpu_has_match(const char* input, int input_length);

This function tests to see if any substring of a given input string matches the regular expression. It has the following arguments and return value:
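To make the difference between complete, prefix, and substring matching concrete, here is a hand-written stand-in for the regex c(a+)t. It is not the generated code: the demo_ names are hypothetical, and the convention of returning nonzero for a match and zero otherwise is an assumption.

```c
/* Hand-written stand-in for "c(a+)t": returns the length of the match
 * starting at pos, or -1 if there is no match at that position. */
static int caat_match_at(const char* input, int input_length, int pos)
{
    int i = pos;
    if (i >= input_length || input[i] != 'c')
        return -1;
    i++;
    int a_start = i;
    while (i < input_length && input[i] == 'a')
        i++;
    if (i == a_start)                 /* a+ needs at least one 'a' */
        return -1;
    if (i >= input_length || input[i] != 't')
        return -1;
    return i + 1 - pos;
}

/* The whole input must be consumed by one match. */
int demo_is_complete_match(const char* input, int input_length)
{
    return caat_match_at(input, input_length, 0) == input_length;
}

/* A match must start at position 0, but may end early. */
int demo_has_prefix_match(const char* input, int input_length)
{
    return caat_match_at(input, input_length, 0) >= 0;
}

/* A match may start anywhere in the input. */
int demo_has_match(const char* input, int input_length)
{
    for (int pos = 0; pos < input_length; pos++)
        if (caat_match_at(input, input_length, pos) >= 0)
            return 1;
    return 0;
}
```

With these stand-ins, "caat" satisfies all three tests, "caats" satisfies only the prefix and substring tests, and "a caat" satisfies only the substring test.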

Mode cpu_search

Function cpu_search

This function has the signature:

int ${name}_${regex}_cpu_search(const char* input,
                                int input_length,
                                const ${name}_match_range* matches,
                                int max_matches);

This function performs a search operation on the given input string using the given regular expression. It finds the first max_matches non-overlapping matches (fewer if there are not that many matches) and returns them in the matches buffer. It has the following arguments and return value:
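The search behaviour described above can be sketched with a hand-written stand-in for c(a+)t. Several details are assumptions rather than documented facts: that the return value is the number of matches stored, that each match occupies one entry per capture group (group 0 first), and that ranges are half-open. demo_caat_search is a hypothetical name, and it takes a non-const buffer because it writes the results itself.

```c
typedef struct feline_regexes_match_range {
    int start;
    int end;
} feline_regexes_match_range;

enum { CAAT_GROUPS = 2 };  /* group 0 and the (a+) group */

/* Hand-written stand-in, not the generated code.
 * ASSUMPTIONS: returns the number of matches stored; each match uses
 * CAAT_GROUPS consecutive entries; ranges are half-open [start, end). */
int demo_caat_search(const char* input, int input_length,
                     feline_regexes_match_range* matches, int max_matches)
{
    int found = 0;
    int pos = 0;
    while (pos < input_length && found < max_matches) {
        if (input[pos] != 'c') {
            pos++;
            continue;
        }
        int i = pos + 1;
        while (i < input_length && input[i] == 'a')
            i++;
        if (i == pos + 1 || i >= input_length || input[i] != 't') {
            pos++;              /* no match starts here; try the next one */
            continue;
        }
        feline_regexes_match_range* m = &matches[found * CAAT_GROUPS];
        m[0].start = pos;     m[0].end = i + 1;   /* group 0: whole match */
        m[1].start = pos + 1; m[1].end = i;       /* group 1: the a's */
        found++;
        pos = i + 1;            /* resume after the match: no overlaps */
    }
    return found;
}
```

Searching "cat caat ct caaat" with max_matches set to 2 stores only the first two matches under these assumptions; a larger limit also picks up "caaat", while "ct" never matches because a+ needs at least one a.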

Mode cpu_dynamic_literal_replace

These functions replace regex matches with a string literal given at runtime.

Function cpu_dlr

This function has the signature:

int ${name}_${regex}_cpu_dlr(const char* input,
                             int input_length,
                             const char* replacement,
                             int replacement_length,
                             char* output,
                             int output_length);

This function performs an out-of-place replacement of regex matches with a string literal given at runtime.

Function cpu_inplace_dlr

This function has the signature:

int ${name}_${regex}_cpu_inplace_dlr(const char* replacement, int replacement_length, char* buffer, int input_length);

This function performs an in-place replacement of regex matches with a string literal given at runtime.

Function cpu_inplace_allocating_dlr

This function has the signature:

int ${name}_${regex}_cpu_inplace_allocating_dlr
(char** buffer, int* buffer_length, int input_length, const char* replacement,
 int replacement_length, char* (*alloc)(int), void (*dealloc)(char*));

This function tries to perform an in-place replacement of regex matches with a string literal given at runtime, but allocates more memory if needed.
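The alloc and dealloc parameters have the types char* (*)(int) and void (*)(char*), which do not match malloc and free directly, so small adapters are needed. The sketch below shows one way to write them. The adapter names are hypothetical, and how the generated function reacts to a NULL result from alloc is not stated in this section.

```c
#include <stdlib.h>

/* Adapter matching the char* (*alloc)(int) parameter.
 * Returns NULL for a negative size or when malloc fails. */
char* malloc_adapter(int size)
{
    if (size < 0)
        return NULL;
    return (char*)malloc((size_t)size);
}

/* Adapter matching the void (*dealloc)(char*) parameter. */
void free_adapter(char* ptr)
{
    free(ptr);
}
```

The two adapters can then be passed as the last two arguments of the allocating function. Custom arena or pool allocators can be plugged in the same way.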

Mode cpu_static_replace

These functions use exactly one of the replacement or replacements regular expression specification fields. One of these fields is required to use this mode. These functions (and the corresponding enum entries) omit the _${replacement} suffix if replacement is used rather than replacements.

Function cpu_replace

This function has the signature:

int ${name}_${regex}_cpu_replace_${replacement}
(const char* input, int input_length, char* output, int output_length);

This function performs an out-of-place replacement of matches of the regular expression with the replacement.

Function cpu_inplace_replace

This function has the signature:

int ${name}_${regex}_cpu_inplace_replace_${replacement}
(char* buffer, int input_length);

This function performs an in-place replacement of matches of the regular expression with the replacement. It is undefined behaviour for a replacement to be longer than the match it replaces.
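The length restriction follows from how in-place replacement typically works: a write cursor trails a read cursor through the same buffer, and a replacement longer than its match would overwrite input that has not been read yet. The sketch below illustrates this with a literal needle instead of a regular expression. It is not the generated code; the function name and the convention of returning the new length are assumptions.

```c
#include <string.h>

/* Illustrative only: replaces each occurrence of a literal needle in
 * buffer, in place. Requires replacement_length <= needle_length so the
 * write cursor never overtakes the read cursor.
 * Returns the new length, or -1 if the precondition is violated. */
int demo_inplace_literal_replace(char* buffer, int input_length,
                                 const char* needle, int needle_length,
                                 const char* replacement,
                                 int replacement_length)
{
    if (needle_length <= 0 || replacement_length > needle_length)
        return -1;
    int read = 0;
    int write = 0;
    while (read < input_length) {
        if (read + needle_length <= input_length &&
            memcmp(buffer + read, needle, (size_t)needle_length) == 0) {
            /* Match: emit the (shorter or equal) replacement. */
            memcpy(buffer + write, replacement, (size_t)replacement_length);
            write += replacement_length;
            read += needle_length;
        } else {
            /* No match here: shift the character down. */
            buffer[write++] = buffer[read++];
        }
    }
    return write;
}
```

The sketch rejects a longer replacement explicitly, whereas the generated function leaves that case undefined.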

Mode cpu_lambda_replace

This mode generates the lambda-replacement API. The lambda-replacement functions take a callback function (called the lambda), which is called for each match in the input string. The lambda supplies a replacement string for the match based on the input string and the match.

Function cpu_inplace_expanding_lambda_replace

This function has the signature:

int ${name}_${regex}_cpu_inplace_expanding_lambda_replace
(char* buffer, int buffer_length, int input_length,
 int (*lambda)(char** out, const char* input, const ${name}_match_range* match,
               void* user),
 void* user);

Function cpu_inplace_allocating_lambda_replace

This function has the signature:

int ${name}_${regex}_cpu_inplace_allocating_lambda_replace
(char** buffer, int* buffer_length, int input_length,
 int (*lambda)(char** out, const char* input, const ${name}_match_range* match,
               void* user),
 void* user, char* (*alloc)(int), void (*dealloc)(char*));
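A lambda with the signature above might look like the following sketch, which replaces each match with an upper-cased copy of itself. The callback contract is assumed rather than documented here: the sketch supposes that match points at group 0 with a half-open range, that the lambda stores a pointer to the replacement text in *out and returns its length, and that a negative return value signals an error. shout_state and shout_lambda are hypothetical names.

```c
#include <ctype.h>

typedef struct feline_regexes_match_range {
    int start;
    int end;
} feline_regexes_match_range;

/* Hypothetical per-call state, passed through the user pointer. */
typedef struct shout_state {
    char scratch[64];   /* holds the most recent replacement text */
    int calls;          /* number of times the lambda has run */
} shout_state;

/* Example lambda: upper-cases the matched text.
 * ASSUMPTIONS: *out receives a pointer to the replacement, the return
 * value is the replacement length, and a negative value means failure. */
int shout_lambda(char** out, const char* input,
                 const feline_regexes_match_range* match, void* user)
{
    shout_state* state = (shout_state*)user;
    int length = match->end - match->start;
    if (length < 0 || length > (int)sizeof(state->scratch))
        return -1;
    for (int i = 0; i < length; i++)
        state->scratch[i] =
            (char)toupper((unsigned char)input[match->start + i]);
    state->calls++;
    *out = state->scratch;
    return length;
}
```

Routing state through the user pointer keeps the lambda free of globals, which matters if several replacements run concurrently.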

Switch API

To make it easier to select between regular expressions (and, where applicable, replacements) at runtime, a function can be generated for each function kind that dispatches on an enum parameter. The signature of each switch API function is the same as that of the corresponding function kind, except that it has an extra first parameter of type ${name}_option and does not have ${regex} or ${replacement} in the function name.

The enum has an entry for each regular expression, and each regular expression-replacement pair for which any function kind has been generated. The keys of the enum are named ${name}_${regex} or ${name}_${regex}_${replacement}.
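The shape of a switch function can be sketched for the example configuration, using group_count as the function kind. The per-regex functions below are hand-written stand-ins, the enum lists only the two regular expressions (the real enum would also carry regex-replacement entries), and returning -1 for an unknown option is an assumption.

```c
/* Sketch of the option enum for a library named "feline_regexes". */
typedef enum feline_regexes_option {
    feline_regexes_caat,
    feline_regexes_feline
} feline_regexes_option;

/* Stand-ins for the generated per-regex functions.
 * "c(a+)t" has two groups; "cat|kitten|lion|tiger" has only group 0. */
int feline_regexes_caat_group_count(void) { return 2; }
int feline_regexes_feline_group_count(void) { return 1; }

/* Switch function: same name without ${regex}, plus a leading option. */
int feline_regexes_group_count(feline_regexes_option option)
{
    switch (option) {
    case feline_regexes_caat:
        return feline_regexes_caat_group_count();
    case feline_regexes_feline:
        return feline_regexes_feline_group_count();
    default:
        return -1;  /* ASSUMPTION: error value for an unknown option */
    }
}
```

A caller can then choose the regular expression from configuration or user input and still go through a single entry point.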