Document Clustering Jobs

Cluster a set of documents and attach cluster labels.

Use this job when you want to cluster a set of documents and attach cluster labels based on topics.

id - stringrequired

The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_)

<= 128 characters

Match pattern: ^[A-Za-z0-9_\-]+$

trainingCollection - stringrequired

Solr Collection containing documents to be clustered

>= 1 characters

fieldToVectorize - stringrequired

Solr field containing text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2 etc.

>= 1 characters

dataFormat - string

Spark-compatible format which training data comes in (like 'solr', 'hdfs', 'file', 'parquet' etc)

Default: solr

Allowed values: solrhdfsfileparquet

trainingDataFrameConfigOptions - object

Additional spark dataframe loading configuration options

trainingDataFilterQuery - string

Solr query to use when loading training data

>= 3 characters

Default: *:*

trainingDataSamplingFraction - number

Fraction of the training data to use

<= 1

exclusiveMaximum: false

Default: 1

randomSeed - integer

For any deterministic pseudorandom number generation

Default: 1234

outputCollection - stringrequired

Solr Collection to store model-labeled data to

>= 1 characters

sourceFields - string

Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

uidField - stringrequired

Field containing the unique ID for each document.

>= 1 characters

Default: id

clusterIdField - string

Output field name for unique cluster id.

Default: cluster_id

clusterLabelField - string

Output field name for top frequent terms that are (mostly) unique for each cluster.

Default: cluster_label

freqTermField - string

Output field name for top frequent terms in each cluster. These may overlap with other clusters.

Default: freq_terms

distToCenterField - string

Output field name for doc distance to its corresponding cluster center (measure how representative the doc is).

Default: dist_to_center

minDF - number

Min number of documents the term has to show up. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 5

maxDF - number

Max number of documents the term can show up. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 0.5

kExact - integer

Exact number of clusters.

Default: 0

kMax - integer

Max possible number of clusters.

Default: 20

kMin - integer

Min possible number of clusters.

Default: 2

docLenTrim - boolean

Whether to separate out docs with extreme lengths.

Default: true

outlierTrim - boolean

Whether to perform outlier detection.

Default: true

shortLen - number

Length threshold to define short document. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 5

longLen - number

Length threshold to define long document. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 0.99

numKeywordsPerLabel - integer

Number of Keywords needed for labeling each cluster.

Default: 5

modelId - string

Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.

>= 1 characters

w2vDimension - integer

Word-vector dimensionality to represent text (choose > 0 to use, suggested dimension ranges: 100~150)

exclusiveMinimum: false

Default: 0

w2vWindowSize - integer

The window size (context words from [-window, window]) for word2vec

>= 3

exclusiveMinimum: false

Default: 8

norm - integer

p-norm to normalize vectors with (choose -1 to turn normalization off)

Default: 2

Allowed values: -1012

analyzerConfig - string

LuceneTextAnalyzer schema for tokenization (JSON-encoded)

>= 1 characters

Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}

clusteringMethod - string

Choose between hierarchical vs kmeans clustering.

Default: hierarchical

outlierK - integer

Number of clusters to help find outliers.

Default: 10

outlierThreshold - number

Identify as outlier group if less than this percent of total documents. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 0.01

minDivisibleSize - number

Clusters must have at least this many documents to be split further. value<1.0 denotes a percentage, value=1.0 denotes 100%, value>1.0 denotes the exact number.

Default: 0

kDiscount - number

Applies a discount to help favor large/small K (number of clusters). A smaller value pushes K to assume a higher value within the [min, max] K range.

Default: 0.7

type - stringrequired

Default: doc_clustering

Allowed values: doc_clustering