Legacy Product

Fusion 5.10
    NLP Annotator Index Stage

    The NLP Annotator index stage performs Natural Language Processing tasks.

    This stage is deprecated as of Fusion 5.2.0 in favor of spaCy and Seldon Core functionality. It is best practice to use the Machine Learning Index Stage instead.

    You can choose between two NLP libraries: OpenNLP or the John Snow Labs Spark NLP library, which runs on Spark.

    Only the pre-trained NER model is supported. If choosing an NER model, download NerDLModel instead of NerCRFModel.

    The NLP Annotator supports the following tasks:

    • If choosing John Snow Labs (recommended for large dataset processing):

      • NER (Named Entity Recognition)

        Fusion uses the pre-trained deep learning NER model that John Snow Labs provides. Currently, the pre-trained extraction model covers the following named entities:

        • ORG (organization)

        • PER (person)

        • LOC (location)

        These are the only three entity types that Fusion recognizes in the source field.

      • Sentence detection

      • POS (Part of Speech) tagging

    • If choosing OpenNLP:

      • NER

      • Sentence detection

      • POS Tagging

      • Shallow Parsing (Chunking)

    To use the NLP Annotator index stage:
    1. Add the NLP Annotator index stage.

    2. Choose the annotator type (OpenNLP or SparkNLP).

      If you select the SparkNLP annotator, you need to download and install one or more models:

        1. Download the models at https://github.com/JohnSnowLabs/spark-nlp#models.

        2. Rename the downloaded models to something easy to identify, then upload them to Fusion’s blob store.

    3. Configure the index pipeline stage:

      1. Specify the model to use by entering the model's blob store ID in the Model ID field.

      2. Specify the source, label pattern, and target (destination) fields:

        • source field: the field containing the raw text from which named entities are extracted.

        • label pattern: a regex pattern that matches the NER or POS labels. For example, PER. matches extracted named entities with the label PERSON, while NN. matches tagged nouns.

        • target field: the field that receives the extracted entities or tags.
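How a label pattern selects annotations can be sketched in Python. This is an illustration only, not the stage's implementation: the (text, label) pairs, the NNS and VB labels, and the choice to match from the start of the label with `re.match` are all assumptions, since the stage's exact matching semantics are not described here.

```python
import re

# Hypothetical (text, label) annotations, as an NER or POS model might emit them.
annotations = [
    ("Ada Lovelace", "PERSON"),
    ("London", "LOCATION"),
    ("engines", "NNS"),
    ("run", "VB"),
]


def select(annotations, label_pattern):
    """Return the texts whose label matches the pattern, starting at its first character."""
    pattern = re.compile(label_pattern)
    return [text for text, label in annotations if pattern.match(label)]


people = select(annotations, "PER.")  # "PER." matches the start of "PERSON"
nouns = select(annotations, "NN.")    # "NN." matches "NNS"

print(people)
print(nouns)
```

In this sketch each selected list would be what ends up in the rule's target field, one rule per label pattern.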

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
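One way to read the difference between the two forms is ordinary JSON string escaping: a backslash typed in the UI has to be doubled when the same value is placed in a JSON request body. The sketch below assumes that interpretation and uses only Python's standard `json` module.

```python
import json

# The value as typed in the UI: a backslash followed by the letter t (two characters).
ui_value = r"\t"

# Serializing for a JSON request body escapes the backslash.
api_value = json.dumps(ui_value)

print(api_value)  # prints "\\t" (including the surrounding double quotes)

# Decoding the JSON gives back exactly what was typed in the UI.
round_trip = json.loads(api_value)
```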

    Annotate text using NLP.

    skip - boolean

    Set to true to skip this stage.

    Default: false

    label - string

    A unique label for this stage.

    <= 255 characters

    condition - string

    Define a conditional script that must evaluate to true or false. The result determines whether the stage processes the document.

    modelId - string, required

    Model ID

    failOnError - boolean

    Flag to indicate if this stage should throw an exception if an error occurs while generating a prediction for a document.

    Default: false

    source - array[string], required

    Input fields to annotate

    extractorRules - array[object]

    Define rules to extract annotated text into separate fields

    object attributes:{
     sourceFieldName required : {
      display name: Source Field Name
      type: string
     }
     extractedAnnotationType required : {
      display name: Annotation Type to Extract
      type: string
     }
     labelPattern required : {
      display name: Label Pattern
      type: string
     }
     targetFieldName required : {
      display name: Target Field Name
      type: string
     }
    }
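The properties above can be assembled into a stage configuration as in the following sketch. The property names are taken from this reference. Every value (the model ID, field names, and annotation type) is a hypothetical placeholder, and the `missing_properties` helper is illustrative rather than part of Fusion.

```python
import json

# Property names come from the configuration reference above;
# every value below is a hypothetical placeholder.
stage = {
    "skip": False,
    "label": "annotate-people",
    "modelId": "my-ner-model",        # hypothetical blob store ID
    "failOnError": False,
    "source": ["body_t"],             # hypothetical input field
    "extractorRules": [
        {
            "sourceFieldName": "body_t",
            "extractedAnnotationType": "ner",  # placeholder; valid values are not listed here
            "labelPattern": "PER.",
            "targetFieldName": "person_ss",    # hypothetical output field
        }
    ],
}

REQUIRED_STAGE = ("modelId", "source")
REQUIRED_RULE = (
    "sourceFieldName",
    "extractedAnnotationType",
    "labelPattern",
    "targetFieldName",
)


def missing_properties(stage):
    """List required properties that are absent from the stage or its extractor rules."""
    missing = [name for name in REQUIRED_STAGE if name not in stage]
    for index, rule in enumerate(stage.get("extractorRules", [])):
        missing += [
            f"extractorRules[{index}].{name}"
            for name in REQUIRED_RULE
            if name not in rule
        ]
    return missing


problems = missing_properties(stage)
payload = json.dumps(stage, indent=2)

print(problems)
print(payload)
```

Checking for required properties before sending the configuration catches an incomplete extractor rule early, instead of at indexing time.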