Statistically Interesting Phrases Jobs

Use this job when you want to identify phrases in your content.

In Fusion 4.1+, this job becomes the Phrase Extraction job.

Compute a mutual-information item similarity model

id - string, required

The ID for this Spark job. Used in the API to reference this job.

<= 128 characters

Match pattern: ^[A-Za-z0-9_\-]+$
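
The length limit and match pattern for `id` can be checked before a configuration is submitted. The following is a minimal sketch in Python; `is_valid_job_id` is a hypothetical helper written for illustration and is not part of Fusion.

```python
import re

# Pattern and length limit copied from the `id` constraints above.
JOB_ID_PATTERN = re.compile(r"[A-Za-z0-9_\-]+")
MAX_JOB_ID_LENGTH = 128


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id meets the documented length and pattern limits."""
    # fullmatch anchors the pattern at both ends, like ^...$ in the schema.
    return (
        len(job_id) <= MAX_JOB_ID_LENGTH
        and JOB_ID_PATTERN.fullmatch(job_id) is not None
    )


print(is_valid_job_id("sip_job-01"))  # True
print(is_valid_job_id("sip job"))     # False (contains a space)
```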

trainingCollection - string, required

Solr Collection containing labeled training data

>= 1 characters

fieldToVectorize - string, required

Solr field containing text training data for prediction/clustering instances. To analyze multiple fields with different weights, use the format field1:weight1,field2:weight2

>= 1 characters
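
To make the `field1:weight1,field2:weight2` format concrete, here is a small Python sketch that splits such a value into field names and weights. The function name `parse_field_weights`, the example field names, and the choice to give an unweighted field a weight of 1.0 are all assumptions for illustration; the job's own parsing may differ.

```python
from typing import Dict


def parse_field_weights(spec: str) -> Dict[str, float]:
    """Parse 'field1:weight1,field2:weight2' into a {field: weight} mapping.

    Assumption: a field listed without a weight is given weight 1.0.
    """
    weights: Dict[str, float] = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # skip empty entries such as a trailing comma
        name, sep, weight = part.partition(":")
        weights[name.strip()] = float(weight) if sep else 1.0
    return weights


# Hypothetical field names, used only for this example.
print(parse_field_weights("title_t:2.0,body_t:1.0"))
# {'title_t': 2.0, 'body_t': 1.0}
```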

dataFormat - string

Spark-compatible format in which the training data is stored (such as 'solr', 'hdfs', 'file', or 'parquet')

Default: solr

Allowed values: solr, hdfs, file, parquet

trainingDataFrameConfigOptions - object

Additional Spark DataFrame loading configuration options

trainingDataFilterQuery - string

Solr query to use when loading training data

>= 3 characters

Default: *:*

trainingDataSamplingFraction - number

Fraction of the training data to use

<= 1

exclusiveMaximum: false

Default: 1

randomSeed - integer

Seed for any deterministic pseudorandom number generation

Default: 1234

outputCollection - string

Solr Collection to store model-labeled data to

sourceFields - string

Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

ngramSize - integer

The number of words in each n-gram to consider when identifying statistically interesting phrases (SIPs).

>= 2

<= 5

exclusiveMinimum: false

exclusiveMaximum: false

Default: 2

minmatch - integer

The minimum number of times a phrase must occur to be considered.

>= 1

exclusiveMinimum: false

Default: 2
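
The interaction between `ngramSize` and `minmatch` can be illustrated with a short Python sketch that counts n-grams over tokenized text and drops those seen fewer than `minmatch` times. This is not the job's implementation: it covers only candidate counting, omits any statistical scoring, and assumes `ngramSize` means an exact phrase length. The function name `count_ngrams` and the sample documents are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_ngrams(
    docs: List[List[str]], ngram_size: int = 2, minmatch: int = 2
) -> Dict[Tuple[str, ...], int]:
    """Count n-grams of exactly ngram_size words and keep frequent ones."""
    # Enforce the documented bounds: 2 <= ngramSize <= 5 and minmatch >= 1.
    if not 2 <= ngram_size <= 5:
        raise ValueError("ngram_size must be between 2 and 5")
    if minmatch < 1:
        raise ValueError("minmatch must be at least 1")

    counts: Counter = Counter()
    for tokens in docs:
        # Slide a window of ngram_size tokens across each document.
        for i in range(len(tokens) - ngram_size + 1):
            counts[tuple(tokens[i : i + ngram_size])] += 1

    # Keep only phrases that occur at least minmatch times.
    return {gram: n for gram, n in counts.items() if n >= minmatch}


docs = [
    ["solr", "query", "parser"],
    ["solr", "query", "syntax"],
    ["spark", "job"],
]
print(count_ngrams(docs))  # {('solr', 'query'): 2}
```

With the defaults of 2 for both parameters, only the bigram that appears in two documents survives; raising `minmatch` to 3 would leave no candidates in this sample.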

analyzerConfig - string, required

The style of text analyzer you would like to use.

Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "stop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}

type - string, required

Default: sip

Allowed values: sip
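
Putting the parameters together, the Python sketch below assembles a job configuration from the required properties and documented defaults listed on this page, then serializes it to JSON. The job ID, collection names, and field name are hypothetical placeholders, and how the resulting JSON is submitted to Fusion is outside the scope of this sketch.

```python
import json

# Default analyzer from this page, built as a dict and then serialized,
# because analyzerConfig is documented as a string property.
default_analyzer = {
    "analyzers": [{
        "name": "StdTokLowerStop",
        "charFilters": [{"type": "htmlstrip"}],
        "tokenizer": {"type": "standard"},
        "filters": [
            {"type": "lowercase"},
            {"type": "stop", "ignoreCase": "true", "format": "snowball",
             "words": "org/apache/lucene/analysis/snowball/english_stop.txt"},
        ],
    }],
    "fields": [{"regex": ".+", "analyzer": "StdTokLowerStop"}],
}

config = {
    "type": "sip",                            # only allowed value
    "id": "sip_example_job",                  # hypothetical job ID
    "trainingCollection": "my_docs",          # hypothetical collection name
    "fieldToVectorize": "body_t",             # hypothetical field name
    "dataFormat": "solr",                     # documented default
    "trainingDataFilterQuery": "*:*",         # documented default
    "trainingDataSamplingFraction": 1,        # documented default
    "randomSeed": 1234,                       # documented default
    "outputCollection": "my_docs_phrases",    # hypothetical collection name
    "ngramSize": 2,                           # documented default
    "minmatch": 2,                            # documented default
    "analyzerConfig": json.dumps(default_analyzer),
}

payload = json.dumps(config, indent=2)
print(payload)
```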