Fusion 5.10

    Statistically Interesting Phrases Jobs

    Use this job when you want to identify phrases in your content.

    In Fusion 4.1+, this job becomes the Phrase Extraction job.

    Compute a mutual-information item similarity model

    id - string, required

    The ID for this Spark job. Used in the API to reference this job.

    <= 128 characters

    Match pattern: ^[A-Za-z0-9_\-]+$
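The length limit and match pattern above can be checked before a job is created. The following is a minimal sketch; the function name `is_valid_job_id` is hypothetical and not part of any Fusion API.

```python
import re

# Constraints copied from the schema above: at most 128 characters,
# consisting only of letters, digits, underscores, and hyphens.
ID_PATTERN = re.compile(r"^[A-Za-z0-9_\-]+$")
MAX_ID_LENGTH = 128


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id satisfies the documented length and pattern.

    Hypothetical helper for illustration only.
    """
    # fullmatch rejects a trailing newline, which "$" alone would tolerate.
    return len(job_id) <= MAX_ID_LENGTH and ID_PATTERN.fullmatch(job_id) is not None
```

For example, `is_valid_job_id("sip-job_01")` passes, while an ID containing a space or longer than 128 characters does not.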

    trainingCollection - string, required

    Solr Collection containing labeled training data

    >= 1 characters

    fieldToVectorize - string, required

    Solr field containing text training data for prediction/clustering instances. To analyze multiple fields with different weights, use the format field1:weight1,field2:weight2.

    >= 1 characters
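To make the field1:weight1,field2:weight2 format concrete, here is a small sketch that splits such a string into field names and numeric weights. The helper name `parse_weighted_fields` is hypothetical, and treating a field without an explicit weight as weight 1.0 is an assumption, not documented behavior.

```python
from typing import Dict


def parse_weighted_fields(spec: str, default_weight: float = 1.0) -> Dict[str, float]:
    """Split 'field1:weight1,field2:weight2' into {field: weight}.

    Hypothetical helper; the 1.0 default for unweighted fields is an assumption.
    """
    fields: Dict[str, float] = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # skip empty entries such as a trailing comma
        name, sep, weight = part.partition(":")
        fields[name.strip()] = float(weight) if sep else default_weight
    return fields
```

With this sketch, `parse_weighted_fields("title_t:2.0,body_t:1.0")` yields one entry per field, where `title_t` and `body_t` are placeholder field names.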

    dataFormat - string

    Spark-compatible format of the training data (such as 'solr', 'hdfs', 'file', or 'parquet')

    Default: solr

    Allowed values: solr, hdfs, file, parquet

    trainingDataFrameConfigOptions - object

    Additional Spark DataFrame loading configuration options

    trainingDataFilterQuery - string

    Solr query to use when loading training data

    >= 3 characters

    Default: *:*

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    Seed for any deterministic pseudorandom number generation

    Default: 1234
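The interaction between trainingDataSamplingFraction and randomSeed can be illustrated with a plain-Python sketch: a fixed seed makes the sampled subset repeatable across runs. This is not how the Spark job itself is implemented; the per-row coin flip and the function name `sample_rows` are assumptions for illustration.

```python
import random
from typing import List, Sequence


def sample_rows(rows: Sequence[str], fraction: float = 1.0, seed: int = 1234) -> List[str]:
    """Keep each row with probability `fraction`, reproducibly for a given seed.

    Hypothetical helper; per-row sampling is an assumption about the behavior.
    """
    if fraction > 1:
        raise ValueError("fraction must be <= 1, as in the schema")
    rng = random.Random(seed)  # private generator, so the global state is untouched
    # random() returns values in [0, 1), so fraction == 1 keeps every row.
    return [row for row in rows if rng.random() < fraction]
```

Calling the function twice with the same seed and fraction returns the same subset, which is the property the randomSeed setting is meant to provide.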

    outputCollection - string

    Solr Collection to store model-labeled data to

    sourceFields - string

    Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

    ngramSize - integer

    The number of words in each n-gram to consider when identifying statistically interesting phrases (SIPs).

    >= 2

    <= 5

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 2

    minmatch - integer

    The number of times a phrase must exist to be considered.

    >= 1

    exclusiveMinimum: false

    Default: 2
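As a rough illustration of how ngramSize and minmatch narrow the candidate phrases, the sketch below counts word n-grams and discards those seen fewer than a minimum number of times. It is not Fusion's implementation: it uses whitespace tokenization instead of the configured analyzer, assumes ngramSize means exactly that many words, and does no statistical scoring. The name `candidate_phrases` is hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def candidate_phrases(
    docs: Iterable[str], ngram_size: int = 2, min_match: int = 2
) -> Dict[Tuple[str, ...], int]:
    """Count n-grams of exactly ngram_size words and keep the frequent ones.

    Hypothetical helper; tokenization and counting rules are assumptions.
    """
    if not 2 <= ngram_size <= 5:
        raise ValueError("ngram_size must be between 2 and 5, as in the schema")
    if min_match < 1:
        raise ValueError("min_match must be at least 1, as in the schema")
    counts: Counter = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # simplistic stand-in for the analyzer
        for i in range(len(tokens) - ngram_size + 1):
            counts[tuple(tokens[i:i + ngram_size])] += 1
    # Drop phrases that occur fewer than min_match times across all documents.
    return {gram: n for gram, n in counts.items() if n >= min_match}
```

Raising ngram_size produces longer and usually rarer phrases, so a lower min_match may be needed to keep any candidates at all.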

    analyzerConfig - string, required

    The style of text analyzer you would like to use.

    Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "stop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}

    type - string, required

    Default: sip

    Allowed values: sip
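Putting the properties together, a job definition might look like the sketch below, built as a Python dictionary and serialized to JSON. The property names and defaults come from the schema above; the values for id, trainingCollection, fieldToVectorize, and outputCollection are placeholders, and the exact payload a given Fusion deployment expects is an assumption.

```python
import json

# analyzerConfig is a string property, so the analyzer definition is
# serialized to a JSON string. The content mirrors the documented default.
analyzer_config = {
    "analyzers": [{
        "name": "StdTokLowerStop",
        "charFilters": [{"type": "htmlstrip"}],
        "tokenizer": {"type": "standard"},
        "filters": [
            {"type": "lowercase"},
            {"type": "stop", "ignoreCase": "true", "format": "snowball",
             "words": "org/apache/lucene/analysis/snowball/english_stop.txt"},
        ],
    }],
    "fields": [{"regex": ".+", "analyzer": "StdTokLowerStop"}],
}

job_config = {
    "type": "sip",                        # only allowed value
    "id": "sip-example",                  # placeholder job ID
    "trainingCollection": "my_docs",      # placeholder collection name
    "fieldToVectorize": "body_t",         # placeholder field name
    "dataFormat": "solr",
    "trainingDataFilterQuery": "*:*",
    "trainingDataSamplingFraction": 1,
    "randomSeed": 1234,
    "outputCollection": "my_docs_sips",   # placeholder collection name
    "ngramSize": 2,
    "minmatch": 2,
    "analyzerConfig": json.dumps(analyzer_config),
}

job_json = json.dumps(job_config, indent=2)
```

Only the four required properties (id, trainingCollection, fieldToVectorize, analyzerConfig) plus type need to be set explicitly; the remaining entries restate the documented defaults for readability.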