Legacy Product

Fusion 5.10

    Use this job when you have training data and you want to train a random forest model to classify text into groups.

    id - string, required

    The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_).

    <= 128 characters

    Match pattern: ^[A-Za-z0-9_\-]+$
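
    The length limit and match pattern above can be checked before a job is created. The following is a minimal sketch; the function name `is_valid_job_id` is an illustration and not part of Fusion.

```python
import re

# Pattern and length limit copied from the field description above.
JOB_ID_PATTERN = re.compile(r"^[A-Za-z0-9_\-]+$")
MAX_JOB_ID_LENGTH = 128


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id satisfies the documented pattern and length limit."""
    if len(job_id) > MAX_JOB_ID_LENGTH:
        return False
    # fullmatch rejects a trailing newline, which "$" alone would tolerate.
    return JOB_ID_PATTERN.fullmatch(job_id) is not None
```

    For example, `is_valid_job_id("rf-classifier_01")` is accepted, while an ID containing a space is rejected.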

    trainingCollection - string, required

    Solr Collection containing labeled training data

    >= 1 characters

    fieldToVectorize - string, required

    Solr field containing text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2 etc.

    >= 1 characters
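
    To make the `field1:weight1,field2:weight2` syntax concrete, the sketch below parses such a value into a mapping of field names to weights. It illustrates the format only and is not Fusion's parser. The default weight of 1.0 for a field given without a weight is an assumption, and the field names in the usage note are hypothetical.

```python
def parse_field_weights(spec: str) -> dict[str, float]:
    """Parse 'field1:weight1,field2:weight2' into {field: weight}."""
    weights: dict[str, float] = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        name, sep, weight = part.partition(":")
        # Assumption: a field listed without an explicit weight counts as 1.0.
        weights[name.strip()] = float(weight) if sep else 1.0
    return weights
```

    With hypothetical fields, `parse_field_weights("title_t:2.0,body_t:1.0")` yields a mapping in which `title_t` carries twice the weight of `body_t`.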

    dataFormat - string

    Spark-compatible format of the training data (for example, 'solr', 'hdfs', 'file', or 'parquet')

    Default: solr

    Allowed values: solr, hdfs, file, parquet

    trainingDataFrameConfigOptions - object

    Additional Spark DataFrame loading configuration options

    trainingDataFilterQuery - string

    Solr query to use when loading training data

    >= 3 characters

    Default: *:*

    trainingDataSamplingFraction - number

    Fraction of the training data to use

    <= 1

    exclusiveMaximum: false

    Default: 1

    randomSeed - integer

    Seed for deterministic pseudorandom number generation

    Default: 1234

    outputCollection - string

    Solr collection in which to store model-labeled data

    sourceFields - string

    Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

    modelId - string

    Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.

    >= 1 characters

    analyzerConfig - string

    LuceneTextAnalyzer schema for tokenization (JSON-encoded)

    Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
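
    Because `analyzerConfig` is a JSON-encoded string rather than a nested object, one way to avoid quoting mistakes is to build the schema as a data structure and serialize it. The sketch below reproduces the default schema shown above; the variable names are illustrative.

```python
import json

# The default LuceneTextAnalyzer schema from above, written as a Python dict.
analyzer_schema = {
    "analyzers": [
        {
            "name": "StdTokLowerStop",
            "charFilters": [{"type": "htmlstrip"}],
            "tokenizer": {"type": "standard"},
            "filters": [
                {"type": "lowercase"},
                {"type": "KStem"},
                {"type": "length", "min": "2", "max": "32767"},
                {
                    "type": "fusionstop",
                    "ignoreCase": "true",
                    "format": "snowball",
                    "words": "org/apache/lucene/analysis/snowball/english_stop.txt",
                },
            ],
        }
    ],
    "fields": [{"regex": ".+", "analyzer": "StdTokLowerStop"}],
}

# The job expects the schema as a string, so serialize it before use.
analyzer_config = json.dumps(analyzer_schema)

# List the token filters in the order they are declared.
filter_types = [f["type"] for f in analyzer_schema["analyzers"][0]["filters"]]
```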

    withIdf - boolean

    Weight vector components based on inverse document frequency

    Default: true

    w2vDimension - integer

    Word-vector dimensionality used to represent text. Set to a value greater than 0 to enable word2vec.

    exclusiveMinimum: false

    Default: 0

    w2vWindowSize - integer

    The window size (context words from [-window, window]) for word2vec

    >= 3

    exclusiveMinimum: false

    Default: 5
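
    The window notation `[-window, window]` means that up to `window` words on each side of a target word count as its context. The sketch below illustrates only that idea; it is not the word2vec implementation used by the job, and actual implementations may sample smaller windows during training.

```python
def context_words(tokens: list[str], position: int, window: int = 5) -> list[str]:
    """Return the words within `window` positions of tokens[position]."""
    start = max(0, position - window)
    end = min(len(tokens), position + window + 1)
    # Everything in the window except the target word itself.
    return tokens[start:position] + tokens[position + 1:end]
```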

    w2vMaxSentenceLength - integer

    Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to `maxSentenceLength` size.

    >= 3

    exclusiveMinimum: false

    Default: 1000
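
    The chunking behavior described above can be sketched as follows. This is a minimal illustration that assumes a sentence is already tokenized into a list of words; the function name is hypothetical.

```python
def chunk_sentence(words: list, max_sentence_length: int = 1000) -> list[list]:
    """Split a tokenized sentence into chunks of at most max_sentence_length words."""
    return [
        words[i:i + max_sentence_length]
        for i in range(0, len(words), max_sentence_length)
    ]
```

    A sentence at or below the limit comes back as a single chunk, and only the last chunk of a longer sentence may be shorter than the limit.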

    w2vMaxIter - integer

    Maximum number of iterations of the word2vec training

    Default: 1

    w2vStepSize - number

    Training parameter for word2vec convergence (change at your own peril)

    >= 0.005

    exclusiveMinimum: false

    Default: 0.025

    minDF - number

    To be kept, terms must occur in at least this number of documents (if > 1.0), or at least this fraction of documents (if <= 1.0)

    Default: 0

    maxDF - number

    To be kept, terms must occur in no more than this number of documents (if > 1.0), or no more than this fraction of documents (if <= 1.0)

    Default: 1
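
    The `minDF` and `maxDF` values switch meaning at 1.0: larger values are absolute document counts, and values up to 1.0 are fractions of the corpus. The sketch below shows one reading of that rule, under the assumption that both bounds are inclusive; it is not the job's actual filtering code.

```python
def df_threshold(value: float, num_docs: int) -> float:
    """Convert a minDF/maxDF setting into an absolute document count."""
    # Values above 1.0 are document counts; values up to 1.0 are fractions.
    return value if value > 1.0 else value * num_docs


def keep_term(doc_freq: int, num_docs: int, min_df: float = 0.0, max_df: float = 1.0) -> bool:
    """Return True if a term's document frequency lies within both bounds."""
    lower = df_threshold(min_df, num_docs)
    upper = df_threshold(max_df, num_docs)
    return lower <= doc_freq <= upper
```

    With the defaults (`minDF` of 0 and `maxDF` of 1), every term is kept, because the bounds cover 0% to 100% of documents.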

    norm - integer

    p-norm to normalize vectors with (choose -1 to turn normalization off)

    Default: 2

    Allowed values: -1, 0, 1, 2
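
    For reference, p-norm normalization divides each vector component by the vector's p-norm, so that the normalized vector has p-norm 1. The sketch below covers the values 1, 2, and -1 (normalization off). The source does not describe how a value of 0 is interpreted, so this illustration deliberately rejects it rather than guess.

```python
def normalize(vector: list[float], norm: int = 2) -> list[float]:
    """Scale vector to unit p-norm; norm == -1 returns it unchanged."""
    if norm == -1:
        return list(vector)
    if norm < 1:
        # Assumption: behavior for norm == 0 is not documented, so it is not modeled.
        raise ValueError(f"norm={norm} is not covered by this sketch")
    length = sum(abs(x) ** norm for x in vector) ** (1.0 / norm)
    if length == 0:
        return list(vector)  # an all-zero vector cannot be scaled
    return [x / length for x in vector]
```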

    predictedLabelField - string

    Solr field which will contain labels when classifier is applied to documents

    Default: labelPredictedByFusionModel

    minSparkPartitions - integer

    Minimum number of Spark partitions for training job.

    >= 1

    exclusiveMinimum: false

    Default: 200

    trainingLabelField - string, required

    Solr field containing labels for training instances (should be single-valued strings)

    gridSearch - boolean

    Perform grid search to optimize hyperparameters

    Default: false

    evaluationMetricType - string

    Optimize hyperparameter search over one of [binary, multiclass, regression] metrics, or 'none'

    Default: none

    Allowed values: binary, multiclass, regression, none

    autoBalanceClasses - boolean

    Ensure that all classes of training data have the same size

    Default: true

    minTrainingSamplesPerClass - integer

    Ensure that all classes of training data have at least this many examples

    >= 1

    exclusiveMinimum: false

    Default: 100

    makeOtherClass - boolean

    Create a label class 'Other' which contains all examples not in a class large enough to train on

    Default: true

    otherClassName - string

    Label class name for the catch-all 'Other' class

    >= 1 characters

    Default: Other
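
    Taken together, `minTrainingSamplesPerClass`, `makeOtherClass`, and `otherClassName` suggest that examples from undersized classes are folded into a single catch-all class. The sketch below shows that reading on a plain list of labels; it is an assumption about the behavior, not the job's implementation.

```python
from collections import Counter


def relabel_small_classes(
    labels: list[str], min_samples: int = 100, other_name: str = "Other"
) -> list[str]:
    """Replace labels of classes with fewer than min_samples examples by other_name."""
    counts = Counter(labels)
    return [
        label if counts[label] >= min_samples else other_name
        for label in labels
    ]
```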

    maxDepth - integer

    Maximum depth of the tree (>= 0). E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.

    >= 1

    <= 20

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 5
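
    The depth examples in the description follow from the shape of a full binary tree: at depth d it has at most 2^d leaf nodes and 2^d - 1 internal nodes. A short sketch of that arithmetic, with an illustrative function name:

```python
def max_tree_nodes(depth: int) -> tuple[int, int]:
    """Return (internal_nodes, leaf_nodes) for a full binary tree of the given depth."""
    leaves = 2 ** depth
    internal = leaves - 1
    return internal, leaves
```

    Depth 0 gives 0 internal nodes and 1 leaf, and depth 1 gives 1 internal node and 2 leaves, matching the description. At the default depth of 5, a single tree can have up to 32 leaves.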

    maxBins - integer

    Max number of bins for discretizing continuous features. Must be >=2 and >= number of categories for any categorical feature.

    <= 128

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 32

    numTrees - integer

    Number of trees to train (>= 1)

    >= 1

    <= 1000

    exclusiveMinimum: false

    exclusiveMaximum: false

    Default: 20

    type - string, required

    Default: random_forests_classifier

    Allowed values: random_forests_classifier
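
    As a summary, the sketch below assembles a minimal job configuration containing the five required fields plus a few optional ones at their documented defaults. The ID, collection name, and Solr field names are hypothetical placeholders, and how the resulting JSON is submitted to Fusion is outside the scope of this example.

```python
import json

job_config = {
    # Required fields
    "type": "random_forests_classifier",
    "id": "rf-text-classifier",            # hypothetical job ID
    "trainingCollection": "labeled_docs",  # hypothetical collection name
    "fieldToVectorize": "body_t",          # hypothetical Solr field
    "trainingLabelField": "category_s",    # hypothetical Solr field
    # Optional fields, shown with their documented defaults
    "dataFormat": "solr",
    "numTrees": 20,
    "maxDepth": 5,
    "maxBins": 32,
}

REQUIRED_FIELDS = ("id", "trainingCollection", "fieldToVectorize", "trainingLabelField", "type")

# Fail early if a required field was left out.
missing = [name for name in REQUIRED_FIELDS if name not in job_config]
if missing:
    raise ValueError(f"Missing required fields: {missing}")

job_json = json.dumps(job_config, indent=2)
```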