Legacy Product

Fusion 5.10

    OpenNLP NER Extraction Index Stage

    Named Entity Recognition (NER) is the task of finding the names of persons, organizations, locations, and/or things in a passage of free text. The OpenNLP NER Extraction index stage (previously called the OpenNLP NER Extractor stage) uses a set of rules to find named entities in a field in the Pipeline Document (the "source") and populates a new field (the "target") with these entities.

    This stage uses the Apache OpenNLP project's Named Entity Recognition tool (the Name Finder tool). The OpenNLP documentation states:

    The Name Finder tool can detect named entities and numbers in text. To be able to detect entities the Name Finder needs a model. The model is dependent on the language and entity type it was trained for. The OpenNLP projects offers a number of pre-trained name finder models which are trained on various freely available corpora. They can be downloaded at our model download page. To find names in raw text the text must be segmented into tokens and sentences.
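    The flow described in the quotation (split into sentences, split each sentence into tokens, then look for names in the tokens) can be sketched in a few lines of Python. This is only an illustration of the order of operations: the regular expressions and the KNOWN_NAMES lookup table are hypothetical stand-ins for the trained OpenNLP sentence, tokenizer, and name finder models, and none of the function names come from OpenNLP or Fusion.

```python
import re

# Hypothetical stand-in for a trained name finder model: a tiny lookup table.
KNOWN_NAMES = {
    ("Ada", "Lovelace"): "person",
    ("London",): "location",
}

def split_sentences(text):
    # Naive sentence detector: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    # Naive tokenizer: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

def find_names(tokens):
    # Scan left to right, preferring the longest known name at each position.
    found = []
    i = 0
    while i < len(tokens):
        match = None
        for name, entity_type in KNOWN_NAMES.items():
            if tuple(tokens[i:i + len(name)]) == name:
                if match is None or len(name) > len(match[0]):
                    match = (name, entity_type)
        if match:
            found.append((" ".join(match[0]), match[1]))
            i += len(match[0])
        else:
            i += 1
    return found

def extract_entities(text):
    # Sentences first, then tokens, then name finding on each token list.
    entities = []
    for sentence in split_sentences(text):
        entities.extend(find_names(tokenize(sentence)))
    return entities
```

    For example, extract_entities("Ada Lovelace was born in London. She wrote about machines.") returns one person entity and one location entity from the first sentence and nothing from the second.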

    Fusion 4.x.x contains a common set of NER models for English that include sentence, token, and part-of-speech models. These models are:

    Model                                   Purpose
    nlp/models/en-sent.bin                  Sentence model to detect sentences
    nlp/models/en-token.bin                 Tokenizer model for tokenization of sentences
    nlp/models/en-ner-date.bin              Date name finder model
    nlp/models/en-ner-location.bin          Location name finder model
    nlp/models/en-ner-money.bin             Money name finder model
    nlp/models/en-ner-organization.bin      Organization name finder model
    nlp/models/en-ner-percentage.bin        Percentage name finder model
    nlp/models/en-ner-person.bin            Person name finder model
    nlp/models/en-ner-time.bin              Time name finder model

    See OpenNLP 1.5 series for additional pre-trained OpenNLP models.

    To use these models, upload them to Fusion using the Fusion Blob Store service. Here is an example of how to upload the sentence model file using the curl command-line utility, where USERNAME is the name of a user with admin privileges and PASSWORD is that user's password:

    curl -u USERNAME:PASSWORD -X PUT --data-binary @data/nlp/models/en-sent.bin -H 'Content-type: application/octet-stream' http://localhost:8764/api/blobs/en-sent.bin
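    Several model files usually need to be uploaded, so the same PUT request can be scripted. The following is a minimal Python sketch that mirrors the curl command using only the standard library. It assumes the same host, port, and endpoint as the curl example, HTTP basic authentication, and a blob ID equal to the file name. The MODEL_FILES list and all function names are illustrative and are not part of Fusion.

```python
import base64
import os
import urllib.request

# Assumed to match the curl example; adjust for your deployment.
BLOB_API = "http://localhost:8764/api/blobs"

# Illustrative list of local model files to upload.
MODEL_FILES = [
    "data/nlp/models/en-sent.bin",
    "data/nlp/models/en-token.bin",
    "data/nlp/models/en-ner-organization.bin",
    "data/nlp/models/en-ner-person.bin",
]

def blob_url(local_path, base=BLOB_API):
    # As in the curl example, the blob ID is the file name without its directory.
    return base + "/" + os.path.basename(local_path)

def build_upload_request(local_path, username, password):
    # Read the binary model and wrap it in a PUT request with basic auth.
    with open(local_path, "rb") as f:
        payload = f.read()
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(
        blob_url(local_path),
        data=payload,
        method="PUT",
        headers={
            "Content-Type": "application/octet-stream",
            "Authorization": "Basic " + token,
        },
    )

def upload_models(paths, username, password):
    # Send one request per file and report the HTTP status of each upload.
    for path in paths:
        request = build_upload_request(path, username, password)
        with urllib.request.urlopen(request) as response:
            print(path, "->", response.status)
```

    Calling upload_models(MODEL_FILES, "USERNAME", "PASSWORD") would attempt the uploads against a running Fusion instance; build_upload_request can be inspected on its own without any network access.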

    Example Specification

    Specification of a stage that extracts names of organizations and people from the field named 'body_t':

    {
       "type":"nlp-extractor",
       "id":"iqtr",
       "rules":[
          {
             "source":[
                "body_t"
             ],
             "target":"organizations",
             "writeMode":"append",
             "sentenceModelLocation":"nlp/models/en-sent.bin",
             "tokenizerModelLocation":"nlp/models/en-token.bin",
             "entityTypes":[
                {
                   "name":"organization",
                   "definition":"nlp/models/en-ner-organization.bin"
                }
             ]
          },
          {
             "source":[
                "body_t"
             ],
             "target":"persons",
             "writeMode":"append",
             "sentenceModelLocation":"nlp/models/en-sent.bin",
             "tokenizerModelLocation":"nlp/models/en-token.bin",
             "entityTypes":[
                {
                   "name":"person",
                   "definition":"nlp/models/en-ner-person.bin"
                }
             ]
          }
       ],
       "skip":false,
       "label":"Extract Entities",
       "licensed":true,
       "secretSourceStageId":"iqtr"
    }
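    The two rules in the specification differ only in their target field, entity name, and name finder model, so a specification like this can be generated rather than written by hand. The sketch below builds an equivalent structure in Python and serializes it to JSON. The helper names make_rule and make_stage are hypothetical, and the sketch only covers the properties documented on this page.

```python
import json

def make_rule(source_fields, target, entity_name, model_location,
              sentence_model="nlp/models/en-sent.bin",
              tokenizer_model="nlp/models/en-token.bin"):
    # One rule: read the source fields, write entities of one type to the target.
    return {
        "source": list(source_fields),
        "target": target,
        "writeMode": "append",
        "sentenceModelLocation": sentence_model,
        "tokenizerModelLocation": tokenizer_model,
        "entityTypes": [
            {"name": entity_name, "definition": model_location},
        ],
    }

def make_stage(stage_id, label, rules):
    # Stage wrapper using the type, id, skip, and label keys from the example.
    return {
        "type": "nlp-extractor",
        "id": stage_id,
        "skip": False,
        "label": label,
        "rules": rules,
    }

stage = make_stage("iqtr", "Extract Entities", [
    make_rule(["body_t"], "organizations", "organization",
              "nlp/models/en-ner-organization.bin"),
    make_rule(["body_t"], "persons", "person",
              "nlp/models/en-ner-person.bin"),
])

stage_json = json.dumps(stage, indent=3)
print(stage_json)
```

    Building the structure as a dictionary also avoids mistakes such as repeating a key, because a Python dictionary holds each key only once.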

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
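    One way to see the difference is to serialize a value with a JSON library, which adds the extra backslash automatically. In this small Python sketch, the property name "delimiter" is hypothetical and used only for illustration; it is not a property of this stage.

```python
import json

# What you would type in a UI text box to mean "tab": a backslash followed by t.
ui_value = "\\t"  # Python literal for the two characters \ and t

# "delimiter" is a hypothetical property name used only for illustration.
api_body = json.dumps({"delimiter": ui_value})

print(ui_value)   # \t
print(api_body)   # {"delimiter": "\\t"}
```

    The serialized request body contains \\t, which is the escaped form to send through the API.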

    This stage allows you to extract named entities using natural language processing models.

    skip - boolean

    Set to true to skip this stage.

    Default: false

    label - string

    A unique label for this stage.

    <= 255 characters

    condition - string

    Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

    rules - array[object], required

    object attributes:

    source (required)
        display name: Source Fields
        type: array

    target (required)
        display name: Target Field
        type: string

    writeMode
        display name: Write Mode
        type: string

    sentenceModelLocation (required)
        display name: Sentence Model
        type: string

    tokenizerModelLocation (required)
        display name: Tokenizer Model
        type: string

    entityTypes
        display name: Entity Types
        type: array
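
    The attribute list above can be turned into a quick client-side check before a configuration is sent to Fusion. The following Python sketch tests only what this page documents (which attributes are required and whether each is an array or a string). The function name validate_rule is hypothetical, and Fusion may apply further validation of its own.

```python
# Attribute names and types taken from the list above.
REQUIRED_ATTRIBUTES = {
    "source": list,
    "target": str,
    "sentenceModelLocation": str,
    "tokenizerModelLocation": str,
}
OPTIONAL_ATTRIBUTES = {
    "writeMode": str,
    "entityTypes": list,
}

def validate_rule(rule):
    # Return a list of human-readable problems; an empty list means none found.
    problems = []
    for name, expected in REQUIRED_ATTRIBUTES.items():
        if name not in rule:
            problems.append(f"missing required attribute: {name}")
        elif not isinstance(rule[name], expected):
            problems.append(f"{name} must be of type {expected.__name__}")
    for name, expected in OPTIONAL_ATTRIBUTES.items():
        if name in rule and not isinstance(rule[name], expected):
            problems.append(f"{name} must be of type {expected.__name__}")
    return problems
```

    A rule such as {"source": "body_t"} would be reported as having a source of the wrong type and as missing target, sentenceModelLocation, and tokenizerModelLocation.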