Cluster Labeling Jobs
Use this job when you already have clusters or well-defined document categories and you want to discover and attach keywords that show the most representative terms within each existing cluster. (If you want to create new clusters, use the Document Clustering job.)
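For orientation, here is a minimal sketch of a cluster labeling job configuration that supplies only the required parameters described below. The keys correspond to the parameter names in this reference; the collection and field names are hypothetical placeholders.

# Minimal cluster labeling job configuration (sketch; names are placeholders).
minimal_config = {
    "id": "label-my-clusters",              # job ID (letters, digits, dash, underscore)
    "type": "cluster_labeling",             # fixed job type
    "trainingCollection": "my_docs",        # collection holding the already-clustered documents
    "fieldToVectorize": "body_t",           # text field to mine for keywords
    "clusterIdField": "cluster_id",         # field holding existing cluster IDs or categories
    "outputCollection": "my_docs_labeled",  # collection that receives the labeled output
}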
id - stringrequired
The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-) and underscore (_)
<= 128 characters
Match pattern: ^[A-Za-z0-9_\-]+$
trainingCollection - stringrequired
Solr Collection containing documents with defined categories or clusters
>= 1 characters
fieldToVectorize - stringrequired
Field containing data from which to discover keywords for the cluster
>= 1 characters
dataFormat - string
Spark-compatible format of the training data, such as 'solr', 'hdfs', 'file', or 'parquet'
Default: solr
Allowed values: solr, hdfs, file, parquet
trainingDataFrameConfigOptions - object
Additional spark dataframe loading configuration options
trainingDataFilterQuery - string
Solr query to use when loading training data
>= 3 characters
Default: *:*
trainingDataSamplingFraction - number
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
randomSeed - integer
Seed for deterministic pseudorandom number generation, for example when sampling the training data (see the sketch below)
Default: 1234
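The data-loading options above can be combined. For example, to derive labels from a reproducible 20% sample of documents matching a Solr filter query, the relevant fragment of the configuration might look like the following sketch (the query and fraction are illustrative).

# Illustrative data-loading options; the filter query and fraction are hypothetical.
loading_options = {
    "dataFormat": "solr",                    # read training data from Solr (the default)
    "trainingDataFilterQuery": "lang_s:en",  # only load documents matching this Solr query
    "trainingDataSamplingFraction": 0.2,     # use a 20% sample of the matching documents
    "randomSeed": 1234,                      # fixed seed so the sample is reproducible
}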
outputCollection - stringrequired
Solr Collection to store output data to
>= 1 characters
sourceFields - string
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
modelId - string
Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.
>= 1 characters
clusterIdField - stringrequired
Field that contains your existing cluster IDs or document categories.
>= 1 characters
analyzerConfig - string
LuceneTextAnalyzer schema for tokenization (JSON-encoded)
>= 1 characters
Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
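The default above is easier to follow when expanded. The sketch below reproduces the same analyzer schema as a Python dictionary, with comments describing each stage; json.dumps(analyzer_config) produces the JSON-encoded string that the analyzerConfig parameter expects.

import json

# Default LuceneTextAnalyzer schema, annotated.
analyzer_config = {
    "analyzers": [{
        "name": "StdTokLowerStop",
        "charFilters": [{"type": "htmlstrip"}],   # strip HTML markup before tokenizing
        "tokenizer": {"type": "standard"},        # standard Lucene tokenizer
        "filters": [
            {"type": "lowercase"},                # lowercase all tokens
            {"type": "KStem"},                    # light English stemming
            {"type": "length", "min": "2", "max": "32767"},  # drop single-character tokens
            {"type": "fusionstop", "ignoreCase": "true", "format": "snowball",
             "words": "org/apache/lucene/analysis/snowball/english_stop.txt"},  # remove English stopwords
        ],
    }],
    "fields": [{"regex": ".+", "analyzer": "StdTokLowerStop"}],  # apply this analyzer to every field
}

analyzer_config_param = json.dumps(analyzer_config)  # value to pass as analyzerConfig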
clusterLabelField - string
Output field name for top frequent terms that are (mostly) unique to each cluster.
Default: cluster_label
freqTermField - string
Output field name for top frequent terms in each cluster. These may overlap with other clusters.
Default: freq_terms
minDF - number
Minimum number of documents a term must appear in to be considered. A value less than 1.0 is interpreted as a fraction of the documents, exactly 1.0 as 100%, and greater than 1.0 as an exact document count.
Default: 5
maxDF - number
Maximum number of documents a term may appear in before it is excluded. A value less than 1.0 is interpreted as a fraction of the documents, exactly 1.0 as 100%, and greater than 1.0 as an exact document count (see the sketch after this entry).
Default: 0.75
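As a concrete illustration of this convention, the sketch below resolves the minDF and maxDF defaults against a hypothetical corpus of 10,000 documents.

# Resolve a minDF/maxDF value to an absolute document count.
# The corpus size is hypothetical; the thresholds are the documented defaults.
def resolve_df_threshold(value, corpus_size):
    if value < 1.0:           # below 1.0: a fraction of the corpus
        return int(value * corpus_size)
    if value == 1.0:          # exactly 1.0: 100% of the corpus
        return corpus_size
    return int(value)         # above 1.0: an exact document count

corpus_size = 10_000
print(resolve_df_threshold(5, corpus_size))     # minDF default 5    -> at least 5 documents
print(resolve_df_threshold(0.75, corpus_size))  # maxDF default 0.75 -> at most 7,500 documents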
norm - integer
p-norm to normalize vectors with (choose -1 to turn normalization off)
Default: 2
Allowed values: -1, 0, 1, 2
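For context, p-norm normalization rescales each document's term vector so that its p-norm equals 1. The sketch below shows L2 normalization (p = 2, the default) of a small, made-up term-frequency vector.

import math

tf = [3.0, 4.0, 0.0]                      # hypothetical term-frequency vector
l2 = math.sqrt(sum(x * x for x in tf))    # L2 norm = 5.0
normalized = [x / l2 for x in tf]         # [0.6, 0.8, 0.0]; the vector now has unit length
print(normalized)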
numKeywordsPerLabel - integer
Number of keywords to use for labeling each cluster.
Default: 5
type - stringrequired
Default: cluster_labeling
Allowed values: cluster_labeling
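Finally, the labeling-specific parameters can be tuned alongside the required fields. The fragment below is a sketch that simply restates the documented defaults; adjust the values as needed.

# Labeling-related options shown with their documented defaults (a sketch).
labeling_options = {
    "numKeywordsPerLabel": 5,              # keywords used to label each cluster
    "clusterLabelField": "cluster_label",  # output field for terms (mostly) unique to each cluster
    "freqTermField": "freq_terms",         # output field for frequent terms; may overlap across clusters
    "minDF": 5,                            # term must appear in at least 5 documents
    "maxDF": 0.75,                         # term may appear in at most 75% of documents
    "norm": 2,                             # L2-normalize vectors (-1 disables normalization)
}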