Use this job when you already have clusters or well-defined document categories and want to discover and attach keywords that represent each existing cluster. (If you want to create new clusters, use the Document Clustering job.)
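For orientation, here is a minimal sketch of a cluster labeling job configuration built from the parameters documented below; the job ID, collection names, and field values are hypothetical:

```json
{
  "id": "label-product-clusters",
  "type": "cluster_labeling",
  "trainingCollection": "products",
  "outputCollection": "products_labeled",
  "dataFormat": "solr",
  "fieldToVectorize": "description_t",
  "clusterIdField": "cluster_id",
  "clusterLabelField": "cluster_label",
  "freqTermField": "freq_terms",
  "numKeywordsPerLabel": 5
}
```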
analyzerConfig - string
LuceneTextAnalyzer schema for tokenization (JSON-encoded)
>= 1 characters
Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
clusterIdField - string (required)
Field that contains your existing cluster IDs or document categories.
>= 1 characters
clusterLabelField - string
Output field name for the top frequent terms that are (mostly) unique to each cluster.
Default: cluster_label
dataFormat - string (required)
Spark-compatible format of the training data (e.g., 'solr', 'parquet', 'orc')
>= 1 characters
Default: solr
dataOutputFormat - string
Spark-compatible output format (e.g., 'solr', 'parquet')
>= 1 characters
Default: solr
fieldToVectorize - string (required)
Field containing the data from which to discover keywords for each cluster
>= 1 characters
freqTermField - string
Output field name for the top frequent terms in each cluster. These terms may overlap with those of other clusters.
Default: freq_terms
id - string (required)
The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.
<= 63 characters
Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
maxDF - number
Maximum number of documents in which a term can appear. A value < 1.0 denotes a percentage, a value of 1.0 denotes 100%, and a value > 1.0 denotes an exact document count.
Default: 0.75
minDF - number
Minimum number of documents in which a term must appear. A value < 1.0 denotes a percentage, a value of 1.0 denotes 100%, and a value > 1.0 denotes an exact document count.
Default: 5
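For example, with the default values and a training set of 10,000 documents, a term is kept only if it appears in at least 5 documents (minDF is an exact count because it is > 1.0) and in at most 7,500 documents (maxDF: 0.75 is 75%):

```json
{ "minDF": 5, "maxDF": 0.75 }
```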
modelId - string
Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.
>= 1 characters
norm - integer
The p-norm with which to normalize vectors (use -1 to turn normalization off)
Default: 2
Allowed values: -1, 0, 1, 2
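For reference, p-norm normalization divides each document vector by its p-norm,

$$\lVert x \rVert_p = \Big(\sum_i \lvert x_i \rvert^p\Big)^{1/p},$$

so the default of 2 rescales each vector to unit Euclidean length, while 1 divides by the sum of absolute values.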
numKeywordsPerLabel - integer
Number of keywords to use when labeling each cluster.
Default: 5
outputCollection - string (required)
Solr Collection in which to store the output data
>= 1 characters
partitionCols - string
When writing to non-Solr sources, a comma-delimited list of column names by which to partition the dataframe before writing to the external output
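A sketch with hypothetical column names, partitioning a non-Solr (e.g., parquet) output by year and month:

```json
{ "partitionCols": "year,month" }
```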
randomSeed - integer
Seed for deterministic pseudorandom number generation
Default: 1234
readOptions - array[object]
Options used when reading input from Solr or other sources.
object attributes:
key - string (required): Parameter Name
value - string: Parameter Value
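A sketch of readOptions entries; the option keys shown are assumptions modeled on common spark-solr read options, not a definitive list:

```json
{
  "readOptions": [
    { "key": "query", "value": "*:*" },
    { "key": "rows", "value": "10000" }
  ]
}
```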
sourceFields - string
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
sparkConfig - array[object]
Spark configuration settings.
object attributes:
key - string (required): Parameter Name
value - string: Parameter Value
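For example, standard Spark properties can be passed through sparkConfig as key/value pairs:

```json
{
  "sparkConfig": [
    { "key": "spark.executor.memory", "value": "4g" },
    { "key": "spark.executor.cores", "value": "2" }
  ]
}
```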
sparkSQL - string
Use this field to create a Spark SQL query for filtering your input data. The input data is registered as the table spark_input.
Default: SELECT * from spark_input
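For example, to restrict the input to documents that already have a cluster assignment (the cluster_id field name is hypothetical):

```json
{ "sparkSQL": "SELECT * FROM spark_input WHERE cluster_id IS NOT NULL" }
```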
trainingCollection - string (required)
Solr Collection containing documents with defined categories or clusters
>= 1 characters
trainingDataFilterQuery - string
Solr query to use when loading training data, if reading from Solr
Default: *:*
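For example, to train only on documents matching a hypothetical language field:

```json
{ "trainingDataFilterQuery": "language_s:en" }
```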
trainingDataFrameConfigOptions - object
Additional Spark dataframe loading configuration options
trainingDataSamplingFraction - number
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
type - string (required)
Default: cluster_labeling
Allowed values: cluster_labeling
writeOptions - array[object]
Options used when writing output to Solr or other sources
object attributes:
key - string (required): Parameter Name
value - string: Parameter Value
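A sketch of writeOptions entries; the batch_size key is an assumption modeled on common spark-solr write options:

```json
{
  "writeOptions": [
    { "key": "batch_size", "value": "500" }
  ]
}
```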