Legacy Product

Fusion 5.10

    Cluster Labeling Jobs

    Use this job when you already have clusters or well-defined document categories and want to discover and attach keywords that represent each of those existing clusters. (If you want to create new clusters, use the Document Clustering job.)

    id - string, required

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_).

    <= 128 characters

    Match pattern: ^[A-Za-z0-9_\-]+$
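
    The pattern and length limit above can be checked before a job is submitted. The following is a minimal sketch, not part of Fusion; the helper name `is_valid_job_id` is hypothetical, and only the regular expression and the 128-character limit come from the field description.

```python
import re

# Pattern and length limit copied from the field description above.
ID_PATTERN = re.compile(r"^[A-Za-z0-9_\-]+$")
MAX_ID_LENGTH = 128


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id satisfies the documented pattern and length limit.

    Hypothetical helper for illustration; it is not a Fusion API.
    """
    return len(job_id) <= MAX_ID_LENGTH and ID_PATTERN.fullmatch(job_id) is not None


if __name__ == "__main__":
    for candidate in ("cluster-labeling_01", "bad id", ""):
        print(repr(candidate), is_valid_job_id(candidate))
```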

    trainingCollection - string, required

    Solr Collection containing documents with defined categories or clusters

    >= 1 characters

    fieldToVectorize - string, required

    Field containing data from which to discover keywords for the cluster

    >= 1 characters

    dataFormat - string

    Spark-compatible format of the training data (for example, 'solr', 'hdfs', 'file', or 'parquet')

    Default: solr

    Allowed values: solr, hdfs, file, parquet

    trainingDataFrameConfigOptions - object

    Additional Spark DataFrame loading configuration options

    trainingDataFilterQuery - string

    Solr query to use when loading training data

    >= 3 characters

    Default: *:*

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    Seed used for any deterministic pseudorandom number generation

    Default: 1234

    outputCollection - string, required

    Solr Collection in which to store the output data

    >= 1 characters

    sourceFields - string

    Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

    modelId - string

    Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.

    >= 1 characters

    clusterIdField - string, required

    Field that contains your existing cluster IDs or document categories.

    >= 1 characters

    analyzerConfig - string

    LuceneTextAnalyzer schema for tokenization (JSON-encoded)

    >= 1 characters

    Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
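
    Because analyzerConfig is a string field, the analyzer schema has to be supplied as JSON text rather than as a nested object. The sketch below rebuilds the default shown above as a Python dict and serializes it with the standard `json` module; the variable names are illustrative only.

```python
import json

# Default analyzer schema as listed above, written as a Python dict.
default_analyzer = {
    "analyzers": [{
        "name": "StdTokLowerStop",
        "charFilters": [{"type": "htmlstrip"}],
        "tokenizer": {"type": "standard"},
        "filters": [
            {"type": "lowercase"},
            {"type": "KStem"},
            {"type": "length", "min": "2", "max": "32767"},
            {"type": "fusionstop", "ignoreCase": "true", "format": "snowball",
             "words": "org/apache/lucene/analysis/snowball/english_stop.txt"},
        ],
    }],
    "fields": [{"regex": ".+", "analyzer": "StdTokLowerStop"}],
}

# analyzerConfig is a string field, so the schema is serialized to JSON text.
analyzer_config = json.dumps(default_analyzer)

# Round-trip check: the string parses back to the same structure.
assert json.loads(analyzer_config) == default_analyzer
```

    Building the schema as a dict first and encoding it last avoids hand-escaping quotes when the value is embedded in a larger job definition.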

    clusterLabelField - string

    Output field name for top frequent terms that are (mostly) unique for each cluster.

    Default: cluster_label

    freqTermField - string

    Output field name for top frequent terms in each cluster. These may overlap with other clusters.

    Default: freq_terms

    minDF - number

    Minimum number of documents in which a term must appear. A value < 1.0 denotes a fraction of all documents, a value of 1.0 denotes 100%, and a value > 1.0 denotes an exact number of documents.

    Default: 5

    maxDF - number

    Maximum number of documents in which a term can appear. A value < 1.0 denotes a fraction of all documents, a value of 1.0 denotes 100%, and a value > 1.0 denotes an exact number of documents.

    Default: 0.75
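
    The minDF and maxDF values switch meaning at 1.0, which is easy to misread. The sketch below shows one way to interpret the rule as described; it is not Fusion's implementation. The function names are hypothetical, and two details are assumptions not stated in the source: the bounds are treated as inclusive, and non-positive values are rejected.

```python
def resolve_df_threshold(value: float, num_docs: int) -> float:
    """Convert a minDF/maxDF setting into an absolute document count.

    Hypothetical helper: values up to 1.0 are treated as a fraction of the
    corpus, values above 1.0 as an exact number of documents.
    """
    if value <= 0:
        # Assumption: the source does not say how non-positive values behave.
        raise ValueError("threshold must be positive")
    if value <= 1.0:
        # Fractions, including 1.0 (100%), scale with the corpus size.
        return value * num_docs
    # Values above 1.0 are taken as an exact number of documents.
    return value


def term_is_kept(doc_freq: int, num_docs: int,
                 min_df: float = 5, max_df: float = 0.75) -> bool:
    """Return True if a term's document frequency lies within both bounds.

    Defaults mirror the documented defaults; inclusive bounds are an assumption.
    """
    low = resolve_df_threshold(min_df, num_docs)
    high = resolve_df_threshold(max_df, num_docs)
    return low <= doc_freq <= high


if __name__ == "__main__":
    # With 1000 documents, the defaults keep terms found in 5 to 750 documents.
    print(resolve_df_threshold(5, 1000), resolve_df_threshold(0.75, 1000))
```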

    norm - integer

    p-norm to normalize vectors with (choose -1 to turn normalization off)

    Default: 2

    Allowed values: -1, 0, 1, 2
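
    As an illustration of what p-norm normalization does to a term vector, the sketch below divides each component by the vector's p-norm for p = 1 and p = 2, and returns the vector unchanged for -1. It is a generic sketch, not Fusion's code; the function name is hypothetical, and the value 0 is deliberately left out because the source does not describe how it is handled.

```python
from typing import List


def normalize(vector: List[float], p: int) -> List[float]:
    """Scale a vector to unit p-norm; p = -1 leaves it unchanged.

    Hypothetical helper covering only p in {-1, 1, 2}.
    """
    if p == -1:
        # Normalization turned off.
        return list(vector)
    if p not in (1, 2):
        raise ValueError("this sketch only covers p = -1, 1, or 2")
    # p-norm: (sum of |x|^p) raised to the power 1/p.
    length = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if length == 0:
        # An all-zero vector cannot be scaled; return it as-is.
        return list(vector)
    return [x / length for x in vector]


if __name__ == "__main__":
    print(normalize([3.0, 4.0], 2))
    print(normalize([1.0, 3.0], 1))
```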

    numKeywordsPerLabel - integer

    Number of keywords used to label each cluster.

    Default: 5

    type - string, required

    Default: cluster_labeling

    Allowed values: cluster_labeling
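
    Putting the fields together, a job definition might look like the sketch below. The property names, defaults, and the list of required fields come from the reference above; the ID, collection names, and Solr field names (`label-support-tickets`, `support_tickets`, `body_t`, `category_s`, `support_tickets_labeled`) are hypothetical placeholders. How the resulting JSON is submitted to Fusion is outside the scope of this sketch.

```python
import json

job_config = {
    # Required fields. All values except "type" are hypothetical placeholders.
    "type": "cluster_labeling",
    "id": "label-support-tickets",
    "trainingCollection": "support_tickets",
    "fieldToVectorize": "body_t",
    "clusterIdField": "category_s",
    "outputCollection": "support_tickets_labeled",
    # Optional fields, shown here with their documented defaults.
    "dataFormat": "solr",
    "trainingDataFilterQuery": "*:*",
    "trainingDataSamplingFraction": 1,
    "randomSeed": 1234,
    "clusterLabelField": "cluster_label",
    "freqTermField": "freq_terms",
    "minDF": 5,
    "maxDF": 0.75,
    "norm": 2,
    "numKeywordsPerLabel": 5,
}

# Fields marked "required" in the reference above.
REQUIRED_FIELDS = (
    "id", "trainingCollection", "fieldToVectorize",
    "outputCollection", "clusterIdField", "type",
)

# Collect any required field that is absent or empty.
missing = [name for name in REQUIRED_FIELDS if not job_config.get(name)]

if __name__ == "__main__":
    if missing:
        raise SystemExit(f"missing required fields: {missing}")
    print(json.dumps(job_config, indent=2))
```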