Statistically Interesting Phrases Jobs
Use this job when you want to identify phrases in your content.
In Fusion 4.1+, this job becomes the Phrase Extraction job.
Legacy Product
Compute a mutual-information item similarity model
The ID for this Spark job. Used in the API to reference this job.
<= 128 characters
Match pattern: ^[A-Za-z0-9_\-]+$
Solr Collection containing labeled training data
>= 1 characters
Solr field containing text training data for prediction/clustering instances. To analyze multiple fields with different weights, use the format field1:weight1,field2:weight2
>= 1 characters
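The field1:weight1,field2:weight2 format above can be illustrated with a small parser. This is a sketch for illustration only, not the job's actual parsing code; the field names in the example and the default weight of 1.0 for a bare field name are assumptions.

```python
def parse_field_weights(spec: str) -> dict:
    """Parse a 'field1:weight1,field2:weight2' spec into {field: weight}.

    A field listed without ':weight' is assumed (for illustration) to
    default to weight 1.0.
    """
    weights = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if ":" in part:
            field, weight = part.rsplit(":", 1)
            weights[field] = float(weight)
        else:
            weights[part] = 1.0
    return weights

# Hypothetical example fields:
# parse_field_weights("title_t:3,body_t:1") -> {"title_t": 3.0, "body_t": 1.0}
```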
Spark-compatible format that the training data comes in (such as 'solr', 'hdfs', 'file', or 'parquet')
Default: solr
Allowed values: solr, hdfs, file, parquet
Additional Spark dataframe loading configuration options
Solr query to use when loading training data
>= 3 characters
Default: *:*
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
Random seed used for any deterministic pseudorandom number generation
Default: 1234
Solr Collection in which to store model-labeled data
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
The number of words in the n-gram to consider for the SIPs.
>= 2
<= 5
exclusiveMinimum: false
exclusiveMaximum: false
Default: 2
The minimum number of times a phrase must occur to be considered.
>= 1
exclusiveMinimum: false
Default: 2
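The two parameters above (n-gram size and minimum phrase count) feed a mutual-information style scoring of candidate phrases. The sketch below shows the general idea for bigrams using pointwise mutual information (PMI); it is a minimal illustration, not the job's implementation, and it tokenizes naively on whitespace rather than with the configured analyzer.

```python
import math
from collections import Counter

def sip_bigrams(docs, min_count=2):
    """Score two-word phrases by pointwise mutual information (PMI).

    Phrases whose words co-occur far more often than chance get high
    scores. `min_count` plays the role of the minimum-occurrence
    threshold: rarer phrases are dropped before scoring.
    """
    word_counts = Counter()
    bigram_counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()  # naive tokenization for illustration
        word_counts.update(tokens)
        bigram_counts.update(zip(tokens, tokens[1:]))

    total_words = sum(word_counts.values())
    total_bigrams = sum(bigram_counts.values())

    scores = {}
    for (w1, w2), n in bigram_counts.items():
        if n < min_count:  # drop phrases seen too rarely
            continue
        p_xy = n / total_bigrams
        p_x = word_counts[w1] / total_words
        p_y = word_counts[w2] / total_words
        scores[(w1, w2)] = math.log(p_xy / (p_x * p_y))
    return scores
```

With documents like `["new york city", "new york state", "the city of new york"]` and `min_count=2`, only the phrase ("new", "york") survives the threshold, and its positive PMI marks it as statistically interesting.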
The style of text analyzer you would like to use.
Default:
{
  "analyzers": [
    {
      "name": "StdTokLowerStop",
      "charFilters": [ { "type": "htmlstrip" } ],
      "tokenizer": { "type": "standard" },
      "filters": [
        { "type": "lowercase" },
        { "type": "stop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }
      ]
    }
  ],
  "fields": [
    { "regex": ".+", "analyzer": "StdTokLowerStop" }
  ]
}
Default: sip
Allowed values: sip