Legacy Product

Fusion 5.10

    Use this job when you already have clusters or well-defined document categories and you want to discover and attach keywords that show the representative terms within those existing clusters. (If you want to create new clusters, use the Document Clustering job.)
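
    For reference, a minimal job configuration touching only the required properties might look like the sketch below. The collection and field names (products, description_t, cluster_id, products_labeled) are hypothetical placeholders; each property is described later in this section.

        {
          "id": "label-existing-clusters",
          "type": "cluster_labeling",
          "trainingCollection": "products",
          "dataFormat": "solr",
          "fieldToVectorize": "description_t",
          "clusterIdField": "cluster_id",
          "outputCollection": "products_labeled"
        }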

    analyzerConfig - string

    LuceneTextAnalyzer schema for tokenization (JSON-encoded)

    >= 1 characters

    Default:

        {
          "analyzers": [
            {
              "name": "StdTokLowerStop",
              "charFilters": [ { "type": "htmlstrip" } ],
              "tokenizer": { "type": "standard" },
              "filters": [
                { "type": "lowercase" },
                { "type": "KStem" },
                { "type": "length", "min": "2", "max": "32767" },
                { "type": "fusionstop", "ignoreCase": "true", "format": "snowball",
                  "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }
              ]
            }
          ],
          "fields": [ { "regex": ".+", "analyzer": "StdTokLowerStop" } ]
        }
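
    As an illustration of overriding this default — a sketch only, reusing filter types already present in the default above — an analyzer that skips HTML stripping, stemming, and stopword removal, and drops tokens shorter than three characters, could look like:

        {
          "analyzers": [
            {
              "name": "TokLowerMin3",
              "tokenizer": { "type": "standard" },
              "filters": [
                { "type": "lowercase" },
                { "type": "length", "min": "3", "max": "32767" }
              ]
            }
          ],
          "fields": [ { "regex": ".+", "analyzer": "TokLowerMin3" } ]
        }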

    clusterIdField - string (required)

    Field that contains your existing cluster IDs or document categories.

    >= 1 characters

    clusterLabelField - string

    Output field name for top frequent terms that are (mostly) unique to each cluster.

    Default: cluster_label

    dataFormat - string (required)

    Spark-compatible format of the training data (such as 'solr', 'parquet', or 'orc').

    >= 1 characters

    Default: solr

    dataOutputFormat - string

    Spark-compatible output format (such as 'solr' or 'parquet').

    >= 1 characters

    Default: solr

    fieldToVectorize - string (required)

    Field containing the data from which to discover keywords for each cluster.

    >= 1 characters

    freqTermField - string

    Output field name for top frequent terms in each cluster. These may overlap with other clusters.

    Default: freq_terms
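
    To make the two output fields concrete, a labeled document written to the output collection might carry values like the sketch below (all field values are invented for illustration; freq_terms may repeat across clusters, while cluster_label aims to be distinctive):

        {
          "id": "doc-123",
          "cluster_id": "7",
          "freq_terms": "battery power charger cable usb",
          "cluster_label": "battery charger power"
        }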

    id - string (required)

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.

    <= 63 characters

    Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?

    maxDF - number

    Maximum number of documents a term may appear in. A value < 1.0 denotes a percentage, 1.0 denotes 100%, and a value > 1.0 denotes an exact document count.

    Default: 0.75

    minDF - number

    Minimum number of documents a term must appear in. A value < 1.0 denotes a percentage, 1.0 denotes 100%, and a value > 1.0 denotes an exact document count.

    Default: 5
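
    As a worked example of the two thresholds, assume a hypothetical collection of 10,000 documents and the defaults above:

        maxDF = 0.75  ->  terms appearing in more than 7,500 documents (75%) are excluded
        minDF = 5     ->  terms appearing in fewer than 5 documents are excluded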

    modelId - string

    Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.

    >= 1 characters

    norm - integer

    The p-norm used to normalize vectors (choose -1 to turn normalization off).

    Default: 2

    Allowed values: -1, 0, 1, 2
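
    For intuition, with the default p = 2 (the Euclidean norm), the vector (3, 4) normalizes as:

        ||(3, 4)||_2 = sqrt(3^2 + 4^2) = 5   ->   (3/5, 4/5) = (0.6, 0.8)

    With p = 1 the divisor would be |3| + |4| = 7 instead.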

    numKeywordsPerLabel - integer

    Number of keywords to use when labeling each cluster.

    Default: 5

    outputCollection - string (required)

    Solr Collection in which to store the output data.

    >= 1 characters

    partitionCols - string

    When writing to non-Solr sources, this field accepts a comma-delimited list of column names used to partition the dataframe before writing to the external output.

    randomSeed - integer

    Seed used for deterministic pseudorandom number generation.

    Default: 1234

    readOptions - array[object]

    Options used when reading input from Solr or other sources.

    object attributes:
        key (required) - string
            display name: Parameter Name
        value - string
            display name: Parameter Value
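
    For example, when reading from Solr, a readOptions value might pass paging and split hints through to the reader. The option names below assume spark-solr's 'rows' and 'splits' parameters (an assumption — verify against your spark-solr version):

        [
          { "key": "rows", "value": "10000" },
          { "key": "splits", "value": "true" }
        ]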

    sourceFields - string

    Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

    sparkConfig - array[object]

    Spark configuration settings.

    object attributes:
        key (required) - string
            display name: Parameter Name
        value - string
            display name: Parameter Value
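
    For instance, standard Spark properties can be passed through here; the values below are illustrative only:

        [
          { "key": "spark.executor.memory", "value": "4g" },
          { "key": "spark.sql.shuffle.partitions", "value": "200" }
        ]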

    sparkSQL - string

    Use this field to create a Spark SQL query for filtering your input data. The input data is registered as the table spark_input.

    Default: SELECT * from spark_input
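
    For example, to restrict the job to English documents (the description_t and language_s field names are hypothetical):

        SELECT id, description_t FROM spark_input WHERE language_s = 'en'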

    trainingCollection - string (required)

    Solr Collection containing documents with defined categories or clusters

    >= 1 characters

    trainingDataFilterQuery - string

    Solr query to use when loading training data, if loading from Solr.

    Default: *:*
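
    For example, to load only documents that already have a cluster assigned (assuming the hypothetical cluster ID field cluster_id):

        cluster_id:[* TO *]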

    trainingDataFrameConfigOptions - object

    Additional Spark dataframe loading configuration options.

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    type - string (required)

    Default: cluster_labeling

    Allowed values: cluster_labeling

    writeOptions - array[object]

    Options used when writing output to Solr or other sources.

    object attributes:
        key (required) - string
            display name: Parameter Name
        value - string
            display name: Parameter Value
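
    For example, when writing to Solr, batching and commit behavior can be tuned here. The option names below assume spark-solr's 'batch_size' and 'commit_within' parameters (an assumption — verify against your spark-solr version):

        [
          { "key": "batch_size", "value": "1000" },
          { "key": "commit_within", "value": "10000" }
        ]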