Legacy Product

Fusion 5.10

    Detect Sentences Index Stage

    The Detect Sentences index stage (called the Sentence Detection stage in versions earlier than 3.0) operates over one or more fields in the Pipeline Document and annotates each field with sentence boundary information. Downstream indexing stages can use these annotations. For example, a Detect Sentences stage can be used in tandem with a Tag Part-of-Speech Index Stage to provide part-of-speech annotations for the individual tokens in the field.

    This stage uses the Apache OpenNLP project's Sentence Detection tool. The OpenNLP documentation states:


    The OpenNLP Sentence Detector can detect that a punctuation character marks the end of a sentence or not. In this sense a sentence is defined as the longest white space trimmed character sequence between two punctuation marks. The first and last sentence make an exception to this rule. The first non whitespace character is assumed to be the begin of a sentence, and the last non whitespace character is assumed to be a sentence end.
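The definition quoted above can be illustrated with a deliberately naive splitter. This is only a sketch and is not how OpenNLP works: it treats every `.`, `!`, or `?` followed by whitespace as a sentence end, with no model deciding whether the punctuation really closes a sentence. The function name `naive_sentences` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: split text on ., ! or ? followed by whitespace.
# This is NOT OpenNLP; it only illustrates the quoted definition.
naive_sentences() {
  printf '%s\n' "$1" | awk '{
    text = $0 " "                      # trailing space so the last sentence matches too
    while (match(text, /[.!?]+[ \t]+/)) {
      s = substr(text, 1, RSTART + RLENGTH - 1)
      sub(/^[ \t]+/, "", s)            # trim leading whitespace
      sub(/[ \t]+$/, "", s)            # trim trailing whitespace
      if (s != "") print s
      text = substr(text, RSTART + RLENGTH)
    }
    sub(/^[ \t]+/, "", text)
    sub(/[ \t]+$/, "", text)
    if (text != "") print text         # text with no closing punctuation
  }'
}
```

Calling `naive_sentences 'Hello world. How are you? Fine.'` prints three lines, one per sentence. Calling it with `'Dr. Smith arrived.'` prints two lines, because the period after the abbreviation is wrongly taken as a sentence end. Deciding such cases is the reason the stage relies on a trained sentence model rather than a fixed rule.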

    Fusion comes with a set of OpenNLP language models for English. These data files are found in the directory: https://FUSION_HOST:FUSION_PORT/data/nlp/models. The included sentence model is en-sent.bin.

    Models are available from the OpenNLP models SourceForge repository.

    Model files must be uploaded to Fusion using the Fusion Blob Store service via the REST API (see examples below).

    Sentence Detection in an NLP Pipeline

    The following video shows how to use a Sentence Detection index stage as part of an NLP pipeline:

    Stage Setup

    This is an example of how to upload a sentence model file to the Fusion blob store:

    INPUT

    curl -u USERNAME:PASSWORD -X PUT --data-binary @en-sent.bin -H 'Content-type: text/plain' http://localhost:8764/api/blobs/en-sent.bin

    OUTPUT

    {
      "name" : "en-sent.bin",
      "contentType" : "text/plain",
      "size" : 5696197,
      "modifiedTime" : "2015-07-15T06:57:48.636Z",
      "version" : 0,
      "md5" : "db2cd70395b9e2e4c6b9957015a10607"
    }
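To confirm that the model arrived intact, the `md5` value in the response can be compared with a checksum computed on the local file. The following is a minimal sketch, assuming that the `md5` field is the MD5 digest of the uploaded bytes and that GNU coreutils `md5sum` is installed. The helper name `verify_md5` is hypothetical and not part of Fusion.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a local file's MD5 digest with an expected
# value, such as the "md5" field returned by the blob upload.
verify_md5() {
  local file="$1" expected="$2" actual
  # md5sum prints "<digest>  <filename>"; keep only the digest.
  actual="$(md5sum "$file" | cut -d' ' -f1)"
  if [ "$actual" = "$expected" ]; then
    echo "match"
  else
    echo "mismatch: local=$actual expected=$expected"
    return 1
  fi
}
```

For example, `verify_md5 en-sent.bin MD5_FROM_RESPONSE` prints `match` when the two digests agree and returns a non-zero status otherwise.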

    This is an example setup of this stage using the previously loaded .bin file:

    INPUT

    curl -u USERNAME:PASSWORD -X POST -H 'Content-type: application/json' -d '{"id":"DetectSentences1", "type": "detect-sentences","sentenceModel":"en-sent.bin","source": ["A test sentence"]}' http://localhost:8764/api/index-stages/instances

    OUTPUT

    {
      "type" : "detect-sentences",
      "id" : "DetectSentences1",
      "sentenceModel" : "en-sent.bin",
      "source" : [ "A test sentence" ],
      "skip" : false,
      "label" : "detect-sentences"
    }

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
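When the API request is sent with curl, shell quoting adds one more layer to the escaping rule above. The sketch below shows how the same escaped value survives single quotes but is altered by double quotes. The property name `exampleField` is hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# "exampleField" is a hypothetical property name used only for illustration.

# Single quotes: the shell passes both backslashes through untouched,
# so the JSON text contains the escaped form \\t.
single_quoted='{"exampleField":"\\t"}'

# Double quotes: the shell collapses \\ to \ before curl ever sees it,
# so the JSON text contains \t, which a JSON parser reads as a tab.
double_quoted="{\"exampleField\":\"\\t\"}"

printf 'single-quoted payload: %s\n' "$single_quoted"
printf 'double-quoted payload: %s\n' "$double_quoted"
```

Wrapping the `-d` argument in single quotes, as the curl examples in this section do, keeps the escaped form intact.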

    Tag sentences with part-of-speech information. Requires Sentence Detection on the same fields earlier in the pipeline.

    skip - boolean

    Set to true to skip this stage.

    Default: false

    label - string

    A unique label for this stage.

    <= 255 characters

    condition - string

    Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

    tokenizerModel - string, required

    posModel - string, required

    source - array[string], required