Building custom regex entities

Please note that this feature is only available to certain users with a technical background. Please contact support to discuss access.




A Custom Regex Entity can be used to extract and format spans of text that have a known repetitive structure, such as IDs or reference numbers.


Custom Regex Template


A Custom Regex Entity is made up of one or more Custom Regex Templates. Each template expresses one way to extract (and format) the entity. Combined together, these templates offer a flexible and powerful way to cover multiple representations of the same entity type.

 

A template is made of two parts:

  • The regex (regular expression), which describes the constraints that need to be met by a span of text to be extracted as an entity
  • The formatting, which expresses how to normalise the extracted string into a more standard format

 

For instance, if your customer IDs can be either the word “ID” followed by 7 digits, or an alphanumeric string of 9 characters, here is what your two templates will look like:
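As a rough illustration of the two-template idea, here is a minimal Python sketch. The template names, the dictionary layout, the `extract` helper and the exact patterns (including the assumption that "ID" is followed directly by the digits, with no separator) are hypothetical and only mirror the description above; they are not the platform's own configuration format.

```python
import re

# Hypothetical templates: names and layout are illustrative only.
TEMPLATES = [
    # The word "ID" followed by exactly 7 digits (assumed: no separator).
    {"name": "id_prefix", "regex": r"\bID\d{7}\b"},
    # An alphanumeric string of exactly 9 characters.
    {"name": "alnum9", "regex": r"\b[A-Za-z0-9]{9}\b"},
]

def extract(text):
    """Return (template name, matched span) for the first template that matches."""
    for template in TEMPLATES:
        match = re.search(template["regex"], text)
        if match:
            return template["name"], match.group(0)
    return None

print(extract("Customer ID1234567 called"))
print(extract("Ref AB12CD34E attached"))
```

Note that a string such as ID1234567 satisfies both patterns, which is why this sketch tries the templates in a fixed order.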

 


Type-ahead validation


TBC



Regex preview


TBC



Regex


The regex is the pattern used to extract entities in the text. See here for the syntax documentation.


Named capture groups can be used to identify a specific section of the extracted string for subsequent formatting. The names of the capture groups should be unique across all templates, and should only contain lowercase letters or digits.
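To show how named capture groups behave, the sketch below uses Python's `re` module, which happens to support the same `(?P<name>...)` syntax that appears in the examples in this article. The pattern, the group names `prefix` and `digits`, and the sample text are hypothetical; the regex flavour used by the platform is the one described in the linked syntax documentation and may differ in details.

```python
import re

# Hypothetical pattern: group names use only lowercase letters and digits.
pattern = re.compile(r"(?P<prefix>[A-Z]{2})-(?P<digits>\d{4})")

match = pattern.search("Your reference is AB-1234.")

# groupdict() maps each capture group name to the text it captured.
groups = match.groupdict()
print(groups["prefix"])
print(groups["digits"])
```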

 


Formatting


Formatting can be provided to post-process the extracted entity.


By default, no formatting is applied and the string returned by Re:infer will be the string extracted by the regex. However, if needed, more complex transformations can be defined, using the following rules.

  

Variables


Any named capture group defined in the regex will be available to use in the formatting logic as a variable, prefixed with the $ symbol. Note that the $ symbol by itself represents the full regex match.


Variables can then be used in the formatting string to insert the corresponding extracted span into the value returned by Re:infer; the variable name needs to be surrounded by { and } braces.


For instance, if we want to extract seven digits as an ID, and return these seven digits prefixed with ID- then the regex and the formatting would be:

 

Or, using a named capture group:

 


Later on, if Re:infer is given the text: My identification number is 1234567, it will return one entity: ID-1234567.
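The behaviour described above can be approximated in Python as follows. This is only a sketch: the `extract_ids` helper, the group name `id` and the use of word boundaries are assumptions made for illustration, and the string concatenation stands in for the platform's formatting step.

```python
import re

def extract_ids(text):
    """Find runs of exactly seven digits and return each prefixed with 'ID-'."""
    pattern = re.compile(r"\b(?P<id>\d{7})\b")
    return ["ID-" + match.group("id") for match in pattern.finditer(text)]

print(extract_ids("My identification number is 1234567"))
```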

 

String Operations


Raw strings can be used, and strings can be concatenated using the & symbol.


Regex: (?P<id1>\b\d{3}\b)|(?P<id2>\b\d{4}\b)
Formatting: {$id1 & "-" & $id2}
Text: The first id is 123 and the second one is 4567
Entity returned by Re:infer: 123-4567
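For readers more familiar with Python, the `&` operator corresponds to string concatenation with `+`. The sketch below is an approximation under stated assumptions: in Python's `re`, an alternation fills only the branch that matched, so it uses a variant pattern (not the template above) that captures both groups in a single match. The `format_match` helper is hypothetical.

```python
import re

# Variant pattern (an assumption): both groups are captured in one match,
# separated by any run of non-digit characters.
pattern = re.compile(r"(?P<id1>\b\d{3}\b)\D+(?P<id2>\b\d{4}\b)")

def format_match(match):
    """Python equivalent of the formatting {$id1 & "-" & $id2}."""
    return match.group("id1") + "-" + match.group("id2")

match = pattern.search("The first id is 123 and the second one is 4567")
print(format_match(match))
```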


Functions


Some functions can also be used in the formatting to transform the extracted string. The names of the functions and their signatures are inspired by Excel.


Upper

Converts all characters in the extracted span to uppercase:

Regex: \w{3}
Formatting: {upper($)}
Text: abc
Entity returned by Re:infer: ABC


Lower

Converts all characters in the extracted span to lowercase:

 

Regex: \w{3}
Formatting: {lower($)}
Text: AbC
Entity returned by Re:infer: abc


Proper

Capitalises the first letter of each word in the extracted span and converts the remaining letters to lowercase:

 

Regex: \w+\s\w+
Formatting: {proper($)}
Text: albert EINSTEIN
Entity returned by Re:infer: Albert Einstein
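The three case functions above have close counterparts among Python's built-in string methods. The sketch below reproduces the examples from this article; the wrapper functions are hypothetical, and Python's `str.title()` may treat edge cases such as apostrophes differently from the platform's `proper`.

```python
def upper(text):
    """Convert every character to uppercase."""
    return text.upper()

def lower(text):
    """Convert every character to lowercase."""
    return text.lower()

def proper(text):
    """Capitalise the first letter of each word and lowercase the rest."""
    return text.title()

print(upper("abc"))
print(lower("AbC"))
print(proper("albert EINSTEIN"))
```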


Pad

Pads the extracted span on the left, up to a given size, with a given character.

Function arguments:

  1. The text containing the characters to be padded
  2. Size of the padded string
  3. Character to be used for padding

 

Regex: \d{2,5}
Formatting: {pad($, 5, "0")}
Text: 123
Entity returned by Re:infer: 00123
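A minimal Python sketch of the same padding, assuming (as the example suggests) that the padding character is added on the left. The `pad` wrapper is hypothetical, and the behaviour for spans already longer than the requested size is an assumption: Python's `str.rjust` simply returns them unchanged.

```python
def pad(text, size, char):
    """Left-pad text with char until it is size characters long."""
    # str.rjust requires char to be exactly one character.
    return text.rjust(size, char)

print(pad("123", 5, "0"))
```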


Substitute

Replaces a substring of the extracted span with another string.

Function arguments:

  1. The text containing the characters to be substituted
  2. What characters to replace
  3. What the old characters should be replaced with

 

Regex: ab
Formatting: {substitute($, "a", "12")}
Text: ab
Entity returned by Re:infer: 12b
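In Python the closest equivalent is `str.replace`. The sketch below assumes that every occurrence of the old substring is replaced, which is an assumption about the platform's behaviour rather than something stated above; the `substitute` wrapper is hypothetical.

```python
def substitute(text, old, new):
    """Replace every occurrence of old in text with new."""
    return text.replace(old, new)

print(substitute("ab", "a", "12"))
```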


Left

Returns the first n characters from the span.

Function arguments:

  1. The text containing the characters to be extracted
  2. The number of characters to return

 

Regex: \w{4}
Formatting: {left($, 2)}
Text: ABCD
Entity returned by Re:infer: AB


Right

Returns the last n characters from the span.

Function arguments:

  1. The text containing the characters to be extracted
  2. The number of characters to return

 

Regex: \w{4}
Formatting: {right($, 2)}
Text: ABCD
Entity returned by Re:infer: CD


Mid

Returns n characters from the span, starting at the specified position. Positions are counted from 1, so position 2 is the second character.

Function arguments:

  1. The text containing the characters to be extracted
  2. The position of the first character to return
  3. The number of characters to return

 

Regex: \w{5}
Formatting: {mid($, 2, 3)}
Text: ABCDE
Entity returned by Re:infer: BCD
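The substring functions `left`, `right` and `mid` map naturally onto Python slicing. The sketch below reproduces the examples from this article; the wrapper functions are hypothetical, and the 1-based start position in `mid` is inferred from the example above, where position 2 of ABCDE yields BCD.

```python
def left(text, n):
    """Return the first n characters of text."""
    return text[:n]

def right(text, n):
    """Return the last n characters of text."""
    # text[-0:] would return the whole string, so handle n <= 0 explicitly.
    return text[-n:] if n > 0 else ""

def mid(text, start, n):
    """Return n characters of text beginning at 1-based position start."""
    return text[start - 1:start - 1 + n]

print(left("ABCD", 2))
print(right("ABCD", 2))
print(mid("ABCDE", 2, 3))
```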

 

