Use this job when you already have clusters or well-defined document categories and want to discover and attach keywords that show the representative terms within each existing cluster. (If you want to create new clusters, use the Document Clustering job.)
id - stringrequired
The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.
<= 63 characters
Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
sparkConfig - array[object]
Spark configuration settings.
object attributes:
  key (required) - string
    Display name: Parameter Name
  value - string
    Display name: Parameter Value
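As a sketch, a sparkConfig entry is an array of key/value objects; the Spark property names below (e.g. spark.executor.memory) are ordinary Spark settings used here only as illustrative values:

```json
"sparkConfig": [
  { "key": "spark.executor.memory", "value": "4g" },
  { "key": "spark.sql.shuffle.partitions", "value": "8" }
]
```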
trainingCollection - stringrequired
Solr Collection containing documents with defined categories or clusters
>= 1 characters
fieldToVectorize - stringrequired
Field containing data from which to discover keywords for the cluster
>= 1 characters
dataFormat - stringrequired
Spark-compatible format that contains training data (e.g. 'solr', 'parquet', 'orc')
>= 1 characters
Default: solr
trainingDataFrameConfigOptions - object
Additional Spark DataFrame loading configuration options
trainingDataFilterQuery - string
Solr query to use when loading training data if using Solr
Default: *:*
sparkSQL - string
Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input
Default: SELECT * from spark_input
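Putting the input-side fields together, a minimal sketch might look like the following; the collection, field, and column names (my_docs, body_t, cluster_id) are placeholders, not values mandated by the job:

```json
{
  "trainingCollection": "my_docs",
  "fieldToVectorize": "body_t",
  "dataFormat": "solr",
  "trainingDataFilterQuery": "*:*",
  "sparkSQL": "SELECT * from spark_input WHERE cluster_id IS NOT NULL"
}
```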
trainingDataSamplingFraction - number
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
randomSeed - integer
For any deterministic pseudorandom number generation
Default: 1234
outputCollection - stringrequired
Solr Collection to store output data to
>= 1 characters
dataOutputFormat - string
Spark-compatible output format (like 'solr', 'parquet', etc)
>= 1 characters
Default: solr
sourceFields - string
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
partitionCols - string
If writing to non-Solr sources, this field accepts a comma-delimited list of column names for partitioning the dataframe before writing to the external output
writeOptions - array[object]
Options used when writing output to Solr or other sources
object attributes:
  key (required) - string
    Display name: Parameter Name
  value - string
    Display name: Parameter Value
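For example, writeOptions uses the same key/value shape as sparkConfig; the option names below (commit_within, batch_size) are assumptions based on common spark-solr write options, not values guaranteed by this job:

```json
"writeOptions": [
  { "key": "commit_within", "value": "5000" },
  { "key": "batch_size", "value": "1000" }
]
```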
readOptions - array[object]
Options used when reading input from Solr or other sources.
object attributes:
  key (required) - string
    Display name: Parameter Name
  value - string
    Display name: Parameter Value
modelId - string
Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.
>= 1 characters
clusterIdField - stringrequired
Field that contains your existing cluster IDs or document categories.
>= 1 characters
analyzerConfig - string
LuceneTextAnalyzer schema for tokenization (JSON-encoded)
>= 1 characters
Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
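The default analyzerConfig above, reformatted for readability (same content, pretty-printed):

```json
{
  "analyzers": [
    {
      "name": "StdTokLowerStop",
      "charFilters": [ { "type": "htmlstrip" } ],
      "tokenizer": { "type": "standard" },
      "filters": [
        { "type": "lowercase" },
        { "type": "KStem" },
        { "type": "length", "min": "2", "max": "32767" },
        { "type": "fusionstop", "ignoreCase": "true", "format": "snowball",
          "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }
      ]
    }
  ],
  "fields": [ { "regex": ".+", "analyzer": "StdTokLowerStop" } ]
}
```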
clusterLabelField - string
Output field name for top frequent terms that are (mostly) unique for each cluster.
Default: cluster_label
freqTermField - string
Output field name for top frequent terms in each cluster. These may overlap with other clusters.
Default: freq_terms
minDF - number
Minimum number of documents a term must appear in. A value < 1.0 is treated as a fraction of the documents, a value of 1.0 as 100%, and a value > 1.0 as an absolute document count.
Default: 5
maxDF - number
Maximum number of documents a term may appear in. A value < 1.0 is treated as a fraction of the documents, a value of 1.0 as 100%, and a value > 1.0 as an absolute document count.
Default: 0.75
norm - integer
p-norm to normalize vectors with (choose -1 to turn normalization off)
Default: 2
Allowed values: -1, 0, 1, 2
numKeywordsPerLabel - integer
Number of keywords used to label each cluster.
Default: 5
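As an illustration, the labeling-specific parameters might be set as follows; the field names cluster_id, cluster_label, and freq_terms are examples (cluster_label and freq_terms happen to match the documented defaults):

```json
{
  "clusterIdField": "cluster_id",
  "clusterLabelField": "cluster_label",
  "freqTermField": "freq_terms",
  "minDF": 5,
  "maxDF": 0.75,
  "norm": 2,
  "numKeywordsPerLabel": 5
}
```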
type - stringrequired
Default: cluster_labeling
Allowed values: cluster_labeling
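Tying the required fields together, a minimal sketch of a complete job configuration (relying on documented defaults for everything else; the id, collection, field, and cluster-ID names are placeholders):

```json
{
  "id": "label-my-clusters",
  "type": "cluster_labeling",
  "trainingCollection": "my_docs",
  "fieldToVectorize": "body_t",
  "outputCollection": "my_docs_labeled",
  "clusterIdField": "cluster_id"
}
```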