Legacy Product

Fusion 5.10

    Tag Part-of-Speech Index Stage

    The Tag Part-of-Speech Index stage (previously called the Part of Speech stage) operates over one or more fields in the Pipeline Document. It marks sentences with part-of-speech information as annotations that downstream indexing stages can use. This stage therefore requires a Detect Sentences stage, defined over the same fields, earlier in the pipeline.
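
    The ordering requirement can be checked before a pipeline is saved. The following is a minimal Python sketch, not Fusion code. It assumes a pipeline is represented as a list of stage dicts, that the Detect Sentences stage has the type "detect-sentences", and that it lists its fields under "source" as this stage does. Those are assumptions for illustration, as is the helper name.

```python
# Assumed type name for the Detect Sentences stage (not confirmed by this page).
SENTENCE_STAGE_TYPE = "detect-sentences"
POS_STAGE_TYPE = "tag-part-of-speech"


def fields_missing_sentence_detection(stages):
    """Return fields tagged for part of speech without earlier sentence detection."""
    detected = set()  # fields covered by a sentence stage seen so far
    missing = []
    for stage in stages:
        if stage.get("type") == SENTENCE_STAGE_TYPE:
            detected.update(stage.get("source", []))
        elif stage.get("type") == POS_STAGE_TYPE:
            missing.extend(f for f in stage.get("source", []) if f not in detected)
    return missing


pipeline = [
    {"type": SENTENCE_STAGE_TYPE, "source": ["body"]},
    {"type": POS_STAGE_TYPE, "source": ["body", "title"]},
]
print(fields_missing_sentence_detection(pipeline))  # ['title']
```

    An empty result means every field handled by the part-of-speech stage was covered by a sentence detection stage placed before it.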

    This stage uses the Apache OpenNLP project's Part of Speech Tagger to mark each token with its word type, based on the token itself and its context. The OpenNLP documentation states:

    "A token might have multiple pos tags depending on the token and the context. The OpenNLP POS Tagger uses a probability model to predict the correct pos tag out of the tag set. To limit the possible tags for a token a tag dictionary can be used which increases the tagging and runtime performance of the tagger."

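    To make the role of the tag dictionary concrete, here is a toy Python sketch. It is not OpenNLP's model or API: the scores and dictionary entries are made-up illustrative values, and choose_tag is a hypothetical helper. It only shows how restricting the candidate tags for a known token narrows the choice the probability model has to make.

```python
def choose_tag(token, scores, tag_dictionary=None):
    """Pick the highest-scoring tag, limited to the dictionary's tags when the token is listed."""
    candidates = scores
    if tag_dictionary and token in tag_dictionary:
        allowed = tag_dictionary[token]
        # Fall back to all scores if the dictionary filter leaves nothing.
        candidates = {tag: p for tag, p in scores.items() if tag in allowed} or scores
    return max(candidates, key=candidates.get)


# Made-up model scores for the token "run" in some context.
scores = {"JJ": 0.45, "NN": 0.35, "VB": 0.20}
# Made-up dictionary entry: "run" may only be a verb (VB) or a noun (NN).
tag_dictionary = {"run": {"VB", "NN"}}

print(choose_tag("run", scores))                  # JJ
print(choose_tag("run", scores, tag_dictionary))  # NN
```
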
    Fusion comes with a set of OpenNLP language models for English. These data files are found in the directory: https://FUSION_HOST:FUSION_PORT/data/nlp/models.

    Additional models are available from the OpenNLP models SourceForge repository. Model files must be uploaded to Fusion through the Blob Store service using the REST API.

    Part-of-Speech Tagging in an NLP Pipeline

    The following video shows how to use a part-of-speech indexing stage as part of an NLP pipeline:

    Stage Setup

    Here is an example of how to upload a part-of-speech model file to the Fusion blob store:

    INPUT

    curl -u USERNAME:PASSWORD -X PUT --data-binary @en-pos-maxent.bin -H 'Content-type: text/plain' http://localhost:8764/api/blobs/en-pos-maxent.bin

    OUTPUT

    {
      "name" : "en-pos-maxent.bin",
      "contentType" : "text/plain",
      "size" : 5696197,
      "modifiedTime" : "2015-07-15T06:57:48.636Z",
      "version" : 0,
      "md5" : "db2cd70395b9e2e4c6b9957015a10607"
    }

    This is an example setup of this stage. The tokenizerModel and posModel values name model files that must already be uploaded to the blob store:

    INPUT

    curl -u USERNAME:PASSWORD -X POST -H 'Content-type: application/json' -d '{"id":"TagPartofSpeech1", "type": "tag-part-of-speech","tokenizerModel":"en-sent.bin","posModel":"en-pos-perceptron.bin","source": ["sample","text","for","NLP"]}' http://localhost:8764/api/index-stages/instances

    OUTPUT

    {
      "type" : "tag-part-of-speech",
      "id" : "TagPartofSpeech1",
      "posModel" : "en-pos-perceptron.bin",
      "tokenizerModel" : "en-sent.bin",
      "source" : [ "sample", "text", "for", "NLP" ],
      "skip" : false,
      "label" : "tag-part-of-speech"
    }
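
    The same kind of request can be prepared from Python. This sketch only builds the request object and does not send it. The host, port, and credentials are the placeholders from the curl example, the field values are taken from the response shown above, and build_stage_request is a hypothetical helper name. It uses only the standard library.

```python
import base64
import json
import urllib.request


def build_stage_request(base_url, username, password, stage):
    """Build (but do not send) a POST request that creates an index stage instance."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/api/index-stages/instances",
        data=json.dumps(stage).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


stage = {
    "id": "TagPartofSpeech1",
    "type": "tag-part-of-speech",
    "tokenizerModel": "en-sent.bin",
    "posModel": "en-pos-perceptron.bin",
    "source": ["sample", "text", "for", "NLP"],
}
request = build_stage_request("http://localhost:8764", "USERNAME", "PASSWORD", stage)
# To send it against a running Fusion instance: urllib.request.urlopen(request)
```

    Building the body with json.dumps avoids hand-quoting the JSON inside a shell string, which is where mismatches between the submitted and stored configuration tend to creep in.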

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
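
    When the request body is generated with a JSON library, this escaping happens automatically. A short Python sketch follows. The field name "delimiter" is hypothetical and is not a parameter of this stage.

```python
import json

# The value as it would be typed in the UI: a backslash followed by "t".
ui_value = r"\t"

# json.dumps escapes the backslash, producing \\t in the API request body.
body = json.dumps({"delimiter": ui_value})
print(body)  # {"delimiter": "\\t"}

# Parsing the body gives back the original two-character value.
assert json.loads(body)["delimiter"] == ui_value
```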

    Tags sentences with part-of-speech information. Requires Sentence Detection on the same fields earlier in the pipeline.

    skip - boolean

    Set to true to skip this stage.

    Default: false

    label - string

    A unique label for this stage.

    <= 255 characters

    condition - string

    Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

    tokenizerModel - string, required

    posModel - string, required

    source - array[string], required
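
    The constraints listed above can be checked on the client before a configuration is submitted. The following Python sketch is illustrative only: validate_stage_config is a hypothetical helper, and Fusion performs its own validation, whose exact behavior is not described here.

```python
REQUIRED_FIELDS = ("tokenizerModel", "posModel", "source")


def validate_stage_config(config):
    """Return a list of problems found in a tag-part-of-speech stage config dict."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if name not in config]
    # skip is optional and defaults to false.
    if not isinstance(config.get("skip", False), bool):
        problems.append("skip must be a boolean")
    label = config.get("label", "")
    if not isinstance(label, str) or len(label) > 255:
        problems.append("label must be a string of at most 255 characters")
    source = config.get("source", [])
    if not (isinstance(source, list) and all(isinstance(f, str) for f in source)):
        problems.append("source must be an array of strings")
    return problems


config = {"tokenizerModel": "en-sent.bin", "source": ["body"]}
print(validate_stage_config(config))  # ['missing required field: posModel']
```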