Cluster Labeling Jobs
Use this job when you already have clusters or well-defined document categories and you want to discover and attach keywords that show the representative terms within those existing clusters. (To create new clusters instead, use the Document Clustering job.)
Attach labels to document clusters.
id - string (required)
The ID for this Spark job, used in the API to reference this job.
<= 128 characters
Match pattern: ^[A-Za-z0-9_\-]+$
trainingCollection - string (required)
Solr collection containing the documents to be labeled.
>= 1 characters
fieldToVectorize - string (required)
Solr field containing the text training data. To analyze multiple fields with different weights, use the format field1:weight1,field2:weight2 (for example, title_t:2.0,body_t:1.0).
>= 1 characters
dataFormat - string
Spark-compatible format of the training data, such as 'solr', 'hdfs', 'file', or 'parquet'.
Default: solr
Allowed values: solr, hdfs, file, parquet
trainingDataFrameConfigOptions - object
Additional Spark DataFrame loading configuration options.
trainingDataFilterQuery - string
Solr query to use when loading training data
>= 3 characters
Default: *:*
trainingDataSamplingFraction - number
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
randomSeed - integer
Seed for deterministic pseudorandom number generation.
Default: 1234
outputCollection - string (required)
Solr Collection to store output data to
>= 1 characters
sourceFields - string
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
modelId - string
Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.
>= 1 characters
clusterIdField - string (required)
Name of the input field containing the cluster ID that identifies each cluster.
>= 1 characters
analyzerConfig - string
LuceneTextAnalyzer schema for tokenization (JSON-encoded)
>= 1 characters
Default:

  {
    "analyzers": [
      {
        "name": "StdTokLowerStop",
        "charFilters": [{ "type": "htmlstrip" }],
        "tokenizer": { "type": "standard" },
        "filters": [
          { "type": "lowercase" },
          {
            "type": "stop",
            "ignoreCase": "true",
            "format": "snowball",
            "words": "org/apache/lucene/analysis/snowball/english_stop.txt"
          }
        ]
      }
    ],
    "fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" }]
  }
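For illustration, a minimal custom value (a sketch following the same schema; it keeps the standard tokenizer and lowercasing but drops the default's HTML stripping and stopword removal) could be:

  {
    "analyzers": [
      {
        "name": "StdTokLower",
        "tokenizer": { "type": "standard" },
        "filters": [{ "type": "lowercase" }]
      }
    ],
    "fields": [{ "regex": ".+", "analyzer": "StdTokLower" }]
  }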
clusterLabelField - string
Output field name for top frequent terms that are (mostly) unique for each cluster.
Default: cluster_label
freqTermField - string
Output field name for top frequent terms in each cluster. These may overlap with other clusters.
Default: freq_terms
minDF - number
Minimum number of documents a term must appear in to be considered. A value < 1.0 is treated as a fraction of the corpus, 1.0 means 100%, and a value > 1.0 is an absolute document count. For example, 0.01 means at least 1% of documents, while the default of 5 means at least 5 documents.
Default: 5
maxDF - number
Maximum number of documents a term may appear in. A value < 1.0 is treated as a fraction of the corpus, 1.0 means 100%, and a value > 1.0 is an absolute document count. The default of 0.75 excludes terms that appear in more than 75% of documents.
Default: 0.75
norm - integer
p-norm to normalize vectors with (choose -1 to turn normalization off)
Default: 2
Allowed values: -1, 0, 1, 2
numKeywordsPerLabel - integer
Number of keywords to use when labeling each cluster.
Default: 5
type - string (required)
Default: cluster_labeling
Allowed values: cluster_labeling
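Putting it all together, a complete job configuration might look like the sketch below. The collection and field names (product_docs, body_t, cluster_id, product_docs_labeled) are illustrative assumptions, not defaults; the remaining values are the documented defaults.

  {
    "id": "label_product_clusters",
    "type": "cluster_labeling",
    "trainingCollection": "product_docs",
    "fieldToVectorize": "body_t",
    "outputCollection": "product_docs_labeled",
    "clusterIdField": "cluster_id",
    "clusterLabelField": "cluster_label",
    "freqTermField": "freq_terms",
    "minDF": 5,
    "maxDF": 0.75,
    "norm": 2,
    "numKeywordsPerLabel": 5
  }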