Regex Field Extraction Index Stage
The Regex Field Extraction stage (called the Regular Expression Extractor stage in versions earlier than 3.0) extracts entities from documents by matching regular expressions. Text in the source field that matches the regular expression is copied to the target field. The regular expression, the source fields, and the target field are all configured as properties of this stage.
If using the REST API, this stage type is named "regex-extractor".
For examples of how to use this stage in the Fusion UI, see Part 2 of the Getting Started tutorial.
Example Stage Specification
Define a regex-field-extraction stage that applies a regular expression to find product storage capacities in the product 'name' field and stores each match in the 'storage_size_ss' field:
{
  "type" : "regex-field-extraction",
  "id" : "storagesize-regex-extraction",
  "rules" : [ {
    "source" : [ "name" ],
    "target" : "storage_size_ss",
    "pattern" : "(\\d{1,20}\\s{0,3}(GB|MB|TB|KB|mb|gb|tb|kb))",
    "annotateAs" : "storage_size"
  } ],
  "skip" : false
}
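To see what this rule would pick up, the pattern can be tried outside Fusion. The following Python sketch is only an approximation: it uses Python's re module rather than Fusion's own regex engine, the extract helper and the product name are hypothetical, and it assumes that every match is kept as a separate value in the target field.

```python
import json
import re

# Rule fragment copied from the stage definition above. The raw string keeps
# the doubled backslashes exactly as they appear in the JSON text.
stage_json = r'''
{
  "rules": [ {
    "source": [ "name" ],
    "target": "storage_size_ss",
    "pattern": "(\\d{1,20}\\s{0,3}(GB|MB|TB|KB|mb|gb|tb|kb))"
  } ]
}
'''

rule = json.loads(stage_json)["rules"][0]
pattern = re.compile(rule["pattern"])  # after JSON decoding: (\d{1,20}\s{0,3}(GB|...))


def extract(doc):
    """Hypothetical stand-in for the stage: copy matches from source to target."""
    matches = []
    for field in rule["source"]:
        text = doc.get(field, "")
        matches += [m.group(1) for m in pattern.finditer(text)]
    if matches:
        # Assumption: all matches are stored, one value per match.
        doc[rule["target"]] = matches
    return doc


doc = extract({"name": "Acme Drive 500 GB with 64MB cache"})  # made-up product name
print(doc["storage_size_ss"])  # ['500 GB', '64MB']
```

Because the pattern lists only all-uppercase and all-lowercase units, a mixed-case value such as "500 Gb" would not match.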
Configuration
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
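The difference comes from JSON string escaping: the API receives a JSON document, and JSON decoding consumes one level of backslashes before the value reaches the stage. A minimal Python sketch of that decoding step, using the standard json module as a stand-in for the API's JSON parser (the "pattern" key is reused here purely for illustration):

```python
import json
import re

ui_value = r"\t"                    # typed in the UI: a backslash followed by t
api_body = r'{"pattern": "\\t"}'    # the same value inside a JSON request body

# JSON decoding turns the doubled backslash into a single one.
decoded = json.loads(api_body)["pattern"]
assert decoded == ui_value

# The decoded value is the regex escape for a tab character.
print(re.split(decoded, "a\tb"))    # ['a', 'b']

# A single backslash in the JSON body is interpreted by the JSON parser
# itself: "\t" becomes a literal tab character rather than the two-character
# sequence, and an escape that JSON does not define, such as "\d", is rejected.
literal_tab = json.loads(r'{"pattern": "\t"}')["pattern"]
assert literal_tab == "\t" and len(literal_tab) == 1

try:
    json.loads(r'{"pattern": "\d"}')
    single_backslash_ok = True
except json.JSONDecodeError:
    single_backslash_ok = False
assert single_backslash_ok is False
```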