Regex Guide for Masking with Regex-based Custom Entity
Regular expressions (regex) are patterns used to find and manipulate specific text segments.
In the Redaction service, regex allows you to define which parts of the text should be detected and masked — such as numbers, words, symbols, or specific formats (e.g., emails or IDs).
This guide summarizes the regex syntax supported by the Masking feature and gives examples showing how each pattern behaves when used for masking.
Use this reference to design precise, valid patterns that match your masking needs without producing syntax errors or unintended results.
Partial masking of a token or string is not supported for regex-based custom entities. Even if your regex matches only part of a token or string, the entire token or string containing the match is masked.
| Example Input | Example Expression | Matching | Masking |
|---|---|---|---|
| educated | ^ed | "educated" | "educated" |
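The whole-token rule can be pictured with a minimal Python sketch. This is not the Redaction service's implementation: the `mask_tokens` helper, splitting on single spaces, and the `*` mask character are all assumptions made for illustration.

```python
import re

def mask_tokens(text: str, pattern: str, mask: str = "*") -> str:
    """Mask every space-delimited token that contains a regex match (hypothetical helper)."""
    regex = re.compile(pattern)
    masked = []
    for token in text.split(" "):
        # A partial match is enough: the whole token is replaced.
        if token and regex.search(token):
            masked.append(mask * len(token))
        else:
            masked.append(token)
    return " ".join(masked)

print(mask_tokens("educated people", "^ed"))
```

Here `^ed` matches only the first two letters of "educated", yet all eight characters are masked, which mirrors the behavior described in the table.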
Allowed Characters
1. Letters and Numbers
A–Z, a–z, 0–9, Ç, Ö, Ü, ß, é, ı, أ, Ж, あ, 你, etc.
2. Common Symbols (Literal)
These are safe to use directly as normal characters — they have no special regex meaning unless used inside certain contexts like []:
- _ @ ! # % & : , ; ' " ` / < = > ~ space tab
(Inside a character class [ ... ], the hyphen - defines a range.
If you mean a literal dash, place it at the start or end: [-abc] or [abc-].)
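The difference between a range hyphen and a literal hyphen can be checked with a short sketch. It uses Python's `re` module as a stand-in; the engine behind the Masking feature may differ in other respects, but this character-class rule is common to mainstream regex engines.

```python
import re

sample = "a-b-z"

# [a-c] is a range: it matches "a", "b", or "c", but not the hyphen.
range_hits = re.findall(r"[a-c]", sample)

# With the hyphen first or last, it is a literal character in the set.
literal_start = re.findall(r"[-ac]", sample)
literal_end = re.findall(r"[ac-]", sample)

print(range_hits, literal_start, literal_end)
```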
3. Regex Metacharacters (Special Meaning)
These control how matching works:
. ^ $ * + ? ( ) [ ] { } | \
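To match one of these characters literally, escape it with a backslash. The following sketch, written with Python's `re` module as an assumed stand-in for the service's engine, contrasts an unescaped dot with an escaped one.

```python
import re

sample = "3.14 3x14"

# Unescaped, the dot is a wildcard and also matches the "x".
wildcard = re.findall(r"3.14", sample)

# Escaped, the dot matches only a literal ".".
literal = re.findall(r"3\.14", sample)

print(wildcard, literal)
```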
Basics of Regular Expression Structure
Anchors & Character Classes
| Character | Name | Description | Example Expression | Example Match |
|---|---|---|---|---|
| . | Dot (wildcard) | Matches any character except a newline. | edu.ated | "educated", "edu_ated", "edu4ated" |
| ^ | Beginning | Matches the beginning of the string. This matches a position, not a character. | ^ed | the leading "ed" in "educated" |
| $ | End | Matches the end of the string. This matches a position, not a character. | ed$ | the trailing "ed" in "educated" |
| \b | Word boundary | Matches a position between a word character and a non-word character, or between a word character and the start or end of the string. This matches a position, not a character. | ed\b | the "ed" that ends "educated" and the "ed" that ends "qualified" in "educated and qualified" |
| \B | Not word boundary | Matches any position that is not a word boundary. This matches a position, not a character. | ed\B | the leading "ed" of "educated" in "educated and qualified" |
| \d | Digit | Matches any digit character (0-9). Equivalent to [0-9]. | \d | "2" and "5" in "file_25" |
| \D | Not digit | Matches any character that is not a digit (0-9). Equivalent to [^0-9]. | \D | "f", "i", "l", "e", and "_" in "file_25" |
| \w | Word | Matches any word character (alphanumeric or underscore). Only matches low-ASCII characters (no accented or non-Roman characters). Equivalent to [A-Za-z0-9_]. | \w | "2" and "5" in "25 $" |
| \W | Not word | Matches any character that is not a word character (alphanumeric or underscore). Equivalent to [^A-Za-z0-9_]. | \W | the space and "$" in "25 $" |
| \s | Whitespace | Matches any whitespace character (space, tab, line break). | a\sb\s | "a b " (including the trailing space) in "a b c" |
| \S | Not whitespace | Matches any character that is not a whitespace character (space, tab, line break). | \S | every character in "anything-without-whitespace" |
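The anchors and character classes can be tried out with a small sketch. It uses Python's `re` module, which is an assumption: the engine behind the Masking feature is not named in this guide. The `re.ASCII` flag is passed so that `\w`, `\d`, and `\b` follow the ASCII-only definitions given in the table, and `spans` is a hypothetical helper.

```python
import re

def spans(pattern: str, text: str) -> list[tuple[int, int]]:
    """Return the (start, end) offsets of every match."""
    return [m.span() for m in re.finditer(pattern, text, re.ASCII)]

text = "educated and qualified"
start_anchor = spans(r"^ed", text)    # only the "ed" at offset 0
end_boundary = spans(r"ed\b", text)   # the "ed" closing each of the two words
inner = spans(r"ed\B", text)          # an "ed" followed by another word character

digits = re.findall(r"\d", "file_25", re.ASCII)
non_word = re.findall(r"\W", "25 $", re.ASCII)
accented = re.findall(r"\w", "é", re.ASCII)  # empty: "é" is outside [A-Za-z0-9_]

print(start_anchor, end_boundary, inner, digits, non_word, accented)
```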
Quantifiers & Alternation
| Character | Name | Description | Example Expression | Example Match |
|---|---|---|---|---|
| + | Plus | Matches 1 or more of the preceding token. | go+ | "go", "gooo" (but not "g") |
| * | Star | Matches 0 or more of the preceding token. | go* | "g", "go", "gooo" |
| ? | Optional / Lazy | Matches 0 or 1 of the preceding token. Placed after another quantifier (as in +? or *?), it makes that quantifier lazy, so it matches as few characters as possible. By default, quantifiers are greedy and match as many characters as possible. | colou?r | "color", "colour" |
| {} | Quantifier bounds | Matches the specified quantity of the previous token. For example: {1,3} matches 1 to 3, {3} matches exactly 3, and {3,} matches 3 or more. | b\w{2,3} | "bee", "beer", and the "beer" in "beers" within "b be bee beer beers" |
| \ | Escape | Escapes a metacharacter so that it is matched as a literal character. For example, \. matches a literal dot. | \. | each "." in "abc...abcd...." |
| \1 | Backreference | Matches the same text as an earlier capture group. For example, \1 matches the text captured by the first group. | (\w)a\1 | "hah" and "dad" in "hah dad bad" |
| | | Alternation (OR) | Acts like a boolean OR. Matches the expression before or after the |. It can operate within a group or on a whole expression. Alternatives are tried from left to right. | b(a|e|i)d | "bad", "bed", "bid" (but not "bud" or "bod") |
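The quantifier, backreference, and alternation rows can be reproduced with a short sketch. It is written against Python's `re` module as an assumed stand-in for the service's engine, and `hits` is a hypothetical helper that returns the full text of each match.

```python
import re

def hits(pattern: str, text: str) -> list[str]:
    """Return the full text of every match, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

plus = hits(r"go+", "g go gooo")              # needs at least one "o"
star = hits(r"go*", "g go gooo")              # a bare "g" also matches
optional = hits(r"colou?r", "color colour")   # the "u" is optional
bounds = hits(r"b\w{2,3}", "b be bee beer beers")
backref = hits(r"(\w)a\1", "hah dad bad")     # same character on both sides of "a"
either = hits(r"b(a|e|i)d", "bad bud bod bed bid")

# Greedy vs. lazy: "?" after "+" stops at the first possible ">".
greedy = hits(r"<.+>", "<a><b>")
lazy = hits(r"<.+?>", "<a><b>")

print(plus, star, optional, bounds, backref, either, greedy, lazy)
```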
Ranges & Groups & Lookaround
| Character | Name | Description | Example Expression | Example Match |
|---|---|---|---|---|
| [] | Character list | Matches any single character in the set. | [abc] | "a", "b", or "c" |
| [] | Character range | Matches any single character whose character code lies between the two specified characters, inclusive. | [A-Z] | any character from "A" to "Z" |
| () | Capturing group | Groups multiple tokens together and creates a capture group for extracting a substring or using a backreference. | (ha)+ | "ha"; "hahaha" in "hahahah"; "ha" in "haa"; "ha" in "hah!" |
| (?= ... ) | Positive lookahead | Requires the group to match immediately after the main expression, without including it in the result. | \d(?=px) | "2" and "4" in "1pt 2px 3em 4px" |
| (?! ... ) | Negative lookahead | Requires that the group does not match immediately after the main expression. If it does match, the result is discarded. | \d(?!px) | "1" and "3" in "1pt 2px 3em 4px" |
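Groups and lookaheads are easiest to understand by comparing what is checked with what is returned. The sketch below assumes Python's `re` module, which may not be the engine the service runs on, and uses a hypothetical `hits` helper.

```python
import re

def hits(pattern: str, text: str) -> list[str]:
    """Return the full text of every match, left to right."""
    return [m.group(0) for m in re.finditer(pattern, text)]

units = "1pt 2px 3em 4px"
followed_by_px = hits(r"\d(?=px)", units)      # "px" is checked but not returned
not_followed_by_px = hits(r"\d(?!px)", units)

# (ha)+ repeats the whole group, not just the last letter.
repeated = hits(r"(ha)+", "ha hahahah haa hah!")

uppercase = hits(r"[A-Z]", "aBcD")

print(followed_by_px, not_followed_by_px, repeated, uppercase)
```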
⚠️Characters to Avoid (Not Allowed)
An entity cannot be created with an invalid regex.
| Character Type | Why to Avoid |
|---|---|
| Unescaped double quote " | Causes syntax error |
| Incomplete or dangling escapes, such as \ or \x or \u | Invalid syntax |
| Invalid quantifier bounds, such as a{,3} or a{5,2} | Invalid bounds |
| Unpaired surrogate code points like [\uD800-\uDFFF], \uD800 | Invalid unicode |
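A pattern can be given a rough syntax check before it is submitted. The sketch below is only an approximation: it relies on Python's `re.compile`, and the hypothetical `is_valid_regex` helper reflects Python's rules rather than the service's own validator. Engines disagree on some cases. For example, Python accepts `a{,3}` (reading it as 0 to 3 repetitions), although the table above lists it as invalid.

```python
import re

def is_valid_regex(pattern: str) -> bool:
    """Return True if Python's re module can compile the pattern (hypothetical helper)."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

candidates = [r"\d{3}", "a{5,2}", "\\", r"\x", "(abc"]
report = {p: is_valid_regex(p) for p in candidates}

for pattern, ok in report.items():
    print(repr(pattern), "valid" if ok else "invalid")
```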
Common Regex Examples
| Type of Input | Pattern | Example Match |
|---|---|---|
| Date format | ^(0?[1-9]|1[0-2])[\/](0?[1-9]|[12]\d|3[01])[\/](19|20)\d{2}$ | 10/2/2019 01/02/2019 |
| USD currency | ^(\$)(\d)+ | $10 $100000 |
| IPv4 address | \b(?:(?:2(?:[0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9])\.){3}(?:(?:2([0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9]))\b | 192.168.0.1 255.255.255.255 |
| Phone number | ^\s*(?:\+?(\d{1,3}))?([-. (]*(\d{3})[-. )]*)?((\d{3})[-. ]*(\d{2,4})(?:[-.x ]*(\d+))?)\s*$ | 0400123000 +61 0400000000 |
| Alphanumeric. Starts with the letter "C", followed by exactly 7 digits | ^C\d{7}$ | C0000001 |
| Alphanumeric with spaces allowed. Max length: 20 characters | ^[\w\d\s]{1,20}$ | 12345678912345678912 abcdefghijklmnopqrst |
| Alphanumeric. Starts with 2 letters, then an underscore, then 3 digits | ^[A-Za-z]{2}_[0-9]{3}$ | DK_003 |
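Patterns like these are worth checking against sample inputs before they are used for masking. The sketch below does so with Python's `re` module, which is an assumption about tooling and not the service's engine. The patterns are written as raw strings, with `$` and `.` escaped where they are meant literally, and the `SAMPLES` table and `check` helper are hypothetical. The last lines show why a letter range is better spelled `[A-Za-z]` than `[A-z]`: the latter also covers the punctuation characters that sit between "Z" and "a" in ASCII.

```python
import re

# name -> (pattern, inputs that are expected to match)
SAMPLES = {
    "date": (r"^(0?[1-9]|1[0-2])[\/](0?[1-9]|[12]\d|3[01])[\/](19|20)\d{2}$",
             ["10/2/2019", "01/02/2019"]),
    "usd": (r"^(\$)(\d)+", ["$10", "$100000"]),
    "ipv4": (r"\b(?:(?:2(?:[0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9])\.){3}"
             r"(?:(?:2([0-4][0-9]|5[0-5])|[0-1]?[0-9]?[0-9]))\b",
             ["192.168.0.1", "255.255.255.255"]),
    "c_id": (r"^C\d{7}$", ["C0000001"]),
    "code": (r"^[A-Za-z]{2}_[0-9]{3}$", ["DK_003"]),
}

def check(samples: dict) -> dict:
    """Map each pattern name to True when all of its sample inputs match."""
    return {name: all(re.search(pattern, s) for s in inputs)
            for name, (pattern, inputs) in samples.items()}

results = check(SAMPLES)

# [A-z] also accepts [ \ ] ^ _ and `, so "^__003" slips through.
loose = bool(re.search(r"^[A-z]{2}_[0-9]{3}$", "^__003"))
strict = bool(re.search(r"^[A-Za-z]{2}_[0-9]{3}$", "^__003"))

print(results, loose, strict)
```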