Legacy Product

Fusion 5.10

    Gazetteer Lookup Extraction Index Stage

    The Gazetteer Lookup Extraction index stage (called the Gazetteer Lookup Extractor stage in versions earlier than 3.0) uses predefined lists of words and phrases to process specified text fields in a document. A gazetteer is a set of lookup lists over names of people, places, or things. These lookup lists are used to find occurrences of these names in text. The matched items are saved into separate fields on the document for downstream processing.
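The lookup idea described above can be illustrated with a small sketch. This is not Fusion's implementation; the function name `gazetteer_lookup` and the whole-word, longest-phrase-first matching rules are assumptions chosen only to show how a list of names can be matched against text.

```python
import re

def gazetteer_lookup(text, entries, case_sensitive=False):
    """Return the gazetteer entries found in text, in order of appearance."""
    # Try longer entries first so a phrase such as "dark red" is preferred
    # over the shorter entry "red" at the same position.
    ordered = sorted((e for e in entries if e), key=len, reverse=True)
    if not ordered:
        return []
    # \b restricts matches to whole words (an assumption of this sketch).
    pattern = r"\b(?:" + "|".join(re.escape(e) for e in ordered) + r")\b"
    flags = 0 if case_sensitive else re.IGNORECASE
    return [m.group(0) for m in re.finditer(pattern, text, flags)]

colors = ["red", "dark red", "blue"]
matches = gazetteer_lookup("A Dark Red door and a blue window", colors)
print(matches)
```

In a real pipeline the matched items would be written to a target field on the document rather than printed.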

    Gazetteers and OpenNLP Tools

    The following video shows how to configure a Gazetteer Lookup Extraction stage in combination with OpenNLP:

    Uploading Lookup Lists to Fusion Blob Store

    Fusion includes a number of lookup lists in the directory https://FUSION_HOST:FUSION_PORT/data/nlp/gazetteer. To use the supplied lists or a list of your own data, each list must be uploaded to Fusion using the Blob Store API to make its contents available to the Gazetteer Lookup Extraction stage.

    For example, to identify color names, first compile a list of color terms, one entry per line, in a text file with the suffix .lst. Then upload that file using the Fusion REST API endpoint api/blobs/<listfilename>. The following example uses the curl command-line utility, where USERNAME is the name of a user with admin privileges and PASSWORD is that user’s password:

    curl -u USERNAME:PASSWORD -X PUT --data-binary @data/nlp/gazetteer/colors.lst -H 'Content-type: text/plain' http://localhost:8764/api/blobs/colors.lst
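The same upload could be prepared from a script. The sketch below builds an equivalent PUT request with Python's standard library; the helper name `build_blob_upload_request` is hypothetical, and the host, port, and credentials are the placeholders from the curl example. It only constructs the request and does not contact a server.

```python
import base64
import urllib.request

def build_blob_upload_request(base_url, list_name, content, username, password):
    """Build (but do not send) a PUT request mirroring the curl example."""
    # HTTP Basic authentication, the equivalent of curl's -u option.
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/blobs/{list_name}",
        data=content,
        method="PUT",
        headers={"Content-Type": "text/plain", "Authorization": f"Basic {token}"},
    )

# One entry per line, as the lookup list format requires.
entries = ["red", "green", "blue"]
content = ("\n".join(entries) + "\n").encode("utf-8")

request = build_blob_upload_request(
    "http://localhost:8764", "colors.lst", content, "USERNAME", "PASSWORD"
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen(request)` would send it to a running Fusion instance; that call is left out here so the sketch runs without one.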

    Name Lookup Example

    Define a lookup-extractor to identify mentions of certain celebrities in text field description_t:

    {
      "type" : "lookup-extractor",
      "id" : "peopleLookup",
      "rules" : [ {
        "source" : [ "description_t" ],
        "target" : "celebrities_ss",
        "entityTypes" : [ {
          "name" : "person_female",
          "definitions" : [ "person_female.lst" ]
        } ],
        "additionalEntities" : [ {
          "name" : "players",
          "definitions" : [ "sharapova", "murray" ]
        }, {
          "name" : "actors",
          "definitions" : [ "pitt", "jolie" ]
        } ],
        "caseSensitive" : false
      } ],
      "skip" : false
    }

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
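One way to read the escaping rule is in terms of JSON string encoding: a backslash inside a JSON string must itself be escaped, so the two characters \t typed in the UI are written as \\t in a JSON request body. The sketch below demonstrates only that JSON behavior; the key name `delimiter` is hypothetical and is not a property of this stage.

```python
import json

# Raw JSON text as it would appear in an API request body.
api_payload = r'{"delimiter": "\\t"}'

decoded = json.loads(api_payload)
# After decoding, the value is the two characters backslash and "t",
# which is what would be typed directly into the UI.
print(decoded["delimiter"], len(decoded["delimiter"]))

# Encoding the UI-style value doubles the backslash again.
encoded = json.dumps({"delimiter": "\\t"})
print(encoded)
```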

    This stage allows you to extract entities using predefined gazetteers.

    skip - boolean

    Set to true to skip this stage.

    Default: false

    label - string

    A unique label for this stage.

    <= 255 characters

    condition - string

    Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

    rules - array[object]

    object attributes:

    source - array (required)
     display name: Source Fields

    target - string (required)
     display name: Target Field

    writeMode - string
     display name: Write Mode

    entityTypes - array (required)
     display name: Entity Types

    additionalEntities - array
     display name: Additional Entities

    caseSensitive - boolean
     display name: Case Sensitive
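Because source, target, and entityTypes are marked as required, a stage definition can be checked before it is submitted. The following is a minimal sketch of such a check; the helper `missing_rule_keys` is hypothetical and is not part of Fusion, and it tests only for the presence of the required attributes, not their types.

```python
# Required rule attributes, taken from the schema listing above.
REQUIRED_RULE_KEYS = ("source", "target", "entityTypes")

def missing_rule_keys(stage):
    """Map each rule index to the required attributes that rule lacks."""
    problems = {}
    for index, rule in enumerate(stage.get("rules", [])):
        missing = [key for key in REQUIRED_RULE_KEYS if key not in rule]
        if missing:
            problems[index] = missing
    return problems

stage = {
    "type": "lookup-extractor",
    "id": "peopleLookup",
    "rules": [
        {
            "source": ["description_t"],
            "target": "celebrities_ss",
            "entityTypes": [
                {"name": "person_female", "definitions": ["person_female.lst"]}
            ],
            "caseSensitive": False,
        },
        # Deliberately incomplete rule to show what the check reports.
        {"source": ["description_t"]},
    ],
    "skip": False,
}

problems = missing_rule_keys(stage)
print(problems)
```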