Cluster Labeling Jobs
Use this job when you already have clusters or well-defined document categories and you want to discover and attach keywords that represent the documents within each existing cluster. (If you want to create new clusters, use the Document Clustering job.)
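For orientation, a minimal job configuration using only the required parameters might look like the following sketch; the collection and field names are hypothetical:

{
  "id": "label-support-tickets",
  "type": "cluster_labeling",
  "trainingCollection": "support_tickets",
  "outputCollection": "support_tickets_labeled",
  "fieldToVectorize": "body_t",
  "clusterIdField": "cluster_id_s"
}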
analyzerConfig - string
LuceneTextAnalyzer schema for tokenization (JSON-encoded)
>= 1 characters
Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
clusterIdField - stringrequired
Field that contains your existing cluster IDs or document categories.
>= 1 characters
clusterLabelField - string
Output field name for the top frequent terms that are (mostly) unique to each cluster.
Default: cluster_label
dataFormat - string
Spark-compatible format of the training data, such as 'solr', 'hdfs', 'file', or 'parquet'.
Default: solr
Allowed values: solr, hdfs, file, parquet
fieldToVectorize - stringrequired
Field containing the data from which to discover keywords for each cluster.
>= 1 characters
freqTermField - string
Output field name for top frequent terms in each cluster. These may overlap with other clusters.
Default: freq_terms
id - stringrequired
The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.
<= 63 characters
Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
maxDF - number
Maximum number of documents a term may appear in. A value < 1.0 denotes a fraction of documents, 1.0 denotes 100%, and a value > 1.0 denotes an exact document count.
Default: 0.75
minDF - number
Minimum number of documents a term must appear in. A value < 1.0 denotes a fraction of documents, 1.0 denotes 100%, and a value > 1.0 denotes an exact document count.
Default: 5
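For example, the following settings (values are illustrative, not recommendations) restrict keyword candidates to terms that appear in at least 10 documents but in no more than half of the documents:

{
  "minDF": 10,
  "maxDF": 0.5
}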
modelId - string
Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.
>= 1 characters
norm - integer
p-norm to normalize vectors with (choose -1 to turn normalization off)
Default: 2
Allowed values: -1, 0, 1, 2
numKeywordsPerLabel - integer
Number of keywords to use when labeling each cluster.
Default: 5
outputCollection - stringrequired
Solr collection in which to store the output data.
>= 1 characters
randomSeed - integer
Seed used for deterministic pseudorandom number generation.
Default: 1234
sourceFields - string
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
sparkConfig - array[object]
Spark configuration settings.
object attributes:
  key (required):
    display name: Parameter Name
    type: string
  value:
    display name: Parameter Value
    type: string
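For example, sparkConfig entries take the key/value shape below; the specific Spark properties shown are illustrative, not required:

"sparkConfig": [
  { "key": "spark.executor.memory", "value": "4g" },
  { "key": "spark.sql.shuffle.partitions", "value": "200" }
]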
trainingCollection - stringrequired
Solr Collection containing documents with defined categories or clusters
>= 1 characters
trainingDataFilterQuery - string
Solr query to use when loading training data
>= 3 characters
Default: *:*
trainingDataFrameConfigOptions - object
Additional Spark DataFrame loading configuration options.
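The valid keys here depend on the dataFormat being read. As a hedged example, when loading Parquet data you might pass the standard Spark reader option below:

{
  "mergeSchema": "true"
}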
trainingDataSamplingFraction - number
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
type - stringrequired
Default: cluster_labeling
Allowed values: cluster_labeling
writeOptions - array[object]
Options used when writing output to Solr.
object attributes:
  key (required):
    display name: Parameter Name
    type: string
  value:
    display name: Parameter Value
    type: string
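As with sparkConfig, writeOptions entries are key/value pairs. The option shown below is an illustrative example of tuning how documents are batched when written to Solr, not a required setting:

"writeOptions": [
  { "key": "batch_size", "value": "1000" }
]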