Logistic Regression Classifier Training Jobs

Train a regularized logistic regression model for text classification.

This job is deprecated in Fusion 5.3.x. The Classification job, introduced in Fusion 5.2.0, provides more options and better logging.

Configuration properties

Use this job when you have training data and you want to train a logistic regression model to classify text into groups.

analyzerConfig - string

LuceneTextAnalyzer schema for tokenization (JSON-encoded)

Default: { "analyzers": [ { "name": "StdTokLowerStop", "charFilters": [ { "type": "htmlstrip" } ], "tokenizer": { "type": "standard" }, "filters": [ { "type": "lowercase" }, { "type": "KStem" }, { "type": "length", "min": "2", "max": "32767" }, { "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" } ] } ], "fields": [ { "regex": ".+", "analyzer": "StdTokLowerStop" } ] }

autoBalanceClasses - boolean

Ensure that all classes of training data have the same size

Default: true

dataFormat - string, required

Spark-compatible format of the training data (for example 'solr', 'hdfs', 'file', or 'parquet')

>= 1 characters

Default: solr

elasticNetWeight - number

Value between 0 and 1 to interpolate between ridge (0.0) and lasso (1.0) regression

<= 1

exclusiveMaximum: false

Default: 0

evaluationMetricType - string

Optimize hyperparameter search over one of [binary, multiclass, regression] metrics, or 'none'

Default: none

Allowed values: binary, multiclass, regression, none

fieldToVectorize - string, required

Solr field containing text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2 etc.

>= 1 characters
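As an illustration of the field1:weight1,field2:weight2 syntax described above, the following sketch splits such a value into field/weight pairs. The field names are hypothetical, and treating an unweighted field as weight 1.0 is an assumption made for this example, not documented job behavior.

```python
def parse_weighted_fields(spec, default_weight=1.0):
    """Split 'field1:weight1,field2:weight2' into (field, weight) pairs.

    default_weight is an assumption for fields listed without a weight.
    """
    pairs = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        name, sep, weight = part.partition(":")
        pairs.append((name.strip(), float(weight) if sep else default_weight))
    return pairs


# Hypothetical field names, for illustration only.
print(parse_weighted_fields("title_t:2.0,body_t:1.0"))
```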

gridSearch - boolean

Perform grid search to optimize hyperparameters

Default: false

id - string, required

The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_). The ID must start with a letter. Maximum length: 63 characters.

<= 63 characters

Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
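A candidate ID can be checked locally against the constraints above before the job is created. This is a minimal sketch that assumes the match pattern is applied to the whole string; the helper name is hypothetical.

```python
import re

# Pattern and length limit copied from the property description above.
ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_ID_LENGTH = 63


def is_valid_job_id(job_id):
    """Return True if job_id satisfies the length limit and the match pattern.

    Assumption: the pattern must match the entire ID, not just a prefix.
    """
    return len(job_id) <= MAX_ID_LENGTH and ID_PATTERN.fullmatch(job_id) is not None


for candidate in ["lr-classifier_1", "1job", "my job"]:
    print(candidate, is_valid_job_id(candidate))
```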

makeOtherClass - boolean

Create a label class 'Other' which contains all examples not in a class large enough to train on

Default: true

maxDF - number

To be kept, terms must occur in no more than this number of documents (if > 1.0), or no more than this fraction of documents (if <= 1.0)

Default: 1

maxIters - integer

Maximum number of iterations to perform before halting, even if the convergence criterion has not been met.

Default: 10

minDF - number

To be kept, terms must occur in at least this number of documents (if > 1.0), or at least this fraction of documents (if <= 1.0)

Default: 0
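The dual interpretation of minDF and maxDF (absolute count above 1.0, fraction otherwise) can be sketched as follows. The function names are hypothetical and this only illustrates the rule as described, not the job's internal code.

```python
def df_threshold(value, num_docs):
    """Convert a minDF/maxDF setting into an absolute document count.

    Values > 1.0 are taken as document counts; values <= 1.0 as fractions.
    """
    return value if value > 1.0 else value * num_docs


def keep_term(doc_freq, num_docs, min_df=0.0, max_df=1.0):
    """Return True if a term's document frequency lies within both bounds."""
    lower = df_threshold(min_df, num_docs)
    upper = df_threshold(max_df, num_docs)
    return lower <= doc_freq <= upper


# A term found in 50 of 1000 documents, with minDF=0.01 and maxDF=0.5.
print(keep_term(50, 1000, min_df=0.01, max_df=0.5))
```

With the defaults (minDF 0, maxDF 1), every term is kept.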

minSparkPartitions - integer

Minimum number of Spark partitions for training job.

>= 1

exclusiveMinimum: false

Default: 200

minTrainingSamplesPerClass - integer

Ensure that all classes of training data have at least this many examples

>= 1

exclusiveMinimum: false

Default: 100

modelId - string

Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.

>= 1 characters

norm - integer

p-norm to normalize vectors with (choose -1 to turn normalization off)

Default: 2

Allowed values: -1, 0, 1, 2
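For reference, p-norm normalization for the values 1 and 2, with -1 leaving the vector unchanged, might look like the sketch below. How the job treats a norm of 0 is not described here, so the example deliberately does not cover it.

```python
def normalize(vector, norm=2):
    """Scale a vector to unit p-norm for p in {1, 2}; -1 disables scaling."""
    if norm == -1:
        return list(vector)
    if norm not in (1, 2):
        # norm=0 is an allowed setting, but its semantics are not covered here.
        raise ValueError("this sketch only handles norm values -1, 1 and 2")
    length = sum(abs(x) ** norm for x in vector) ** (1.0 / norm)
    if length == 0:
        return list(vector)  # avoid dividing an all-zero vector by zero
    return [x / length for x in vector]


print(normalize([3.0, 4.0], norm=2))
print(normalize([1.0, 3.0], norm=1))
```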

otherClassName - string

Label class name for the catch-all 'Other' class

>= 1 characters

Default: Other
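The interaction between makeOtherClass, minTrainingSamplesPerClass, and otherClassName can be pictured with the sketch below, which relabels every example whose class is too small. It is an assumption about the general idea only; the job's actual handling of small classes may differ.

```python
from collections import Counter


def relabel_small_classes(labels, min_samples=100, other_name="Other"):
    """Replace labels of classes with fewer than min_samples examples.

    Illustrative only: mirrors the described intent of makeOtherClass.
    """
    counts = Counter(labels)
    return [
        label if counts[label] >= min_samples else other_name
        for label in labels
    ]


# Tiny example with a threshold of 2 instead of the default 100.
print(relabel_small_classes(["sports", "sports", "chess"], min_samples=2))
```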

outputCollection - string

Solr collection in which to store model-labeled data

overwriteExistingModel - boolean

If a model already exists in the model store, overwrite it when this job runs

Default: true

predictedLabelField - string

Solr field that will contain labels when the classifier is applied to documents

Default: labelPredictedByFusionModel

randomSeed - integer

Seed for any deterministic pseudorandom number generation

Default: 1234

readOptions - array[object]

Options used when reading input from Solr or other sources.

object attributes:
key (required): display name "Parameter Name", type string
value: display name "Parameter Value", type string

regularizationWeight - number

Degree of regularization to use when training (L2 lambda parameter if elasticNetWeight = 0)

>= 0.000001

<= 1

exclusiveMinimum: false

exclusiveMaximum: false

Default: 0.01
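To show how regularizationWeight and elasticNetWeight combine, the sketch below computes an elastic net penalty using the Spark ML convention (lambda times a mix of the L1 norm and half the squared L2 norm). That the job uses exactly this convention is an assumption; the function name is hypothetical.

```python
def elastic_net_penalty(weights, regularization_weight=0.01, elastic_net_weight=0.0):
    """Penalty = lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).

    lambda is regularizationWeight and alpha is elasticNetWeight.
    alpha = 0 gives pure ridge (L2); alpha = 1 gives pure lasso (L1).
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    mix = elastic_net_weight * l1 + (1.0 - elastic_net_weight) * 0.5 * l2_squared
    return regularization_weight * mix


w = [1.0, -2.0]
print(elastic_net_penalty(w, regularization_weight=1.0, elastic_net_weight=0.0))
print(elastic_net_penalty(w, regularization_weight=1.0, elastic_net_weight=1.0))
```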

sourceFields - string

Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

sparkConfig - array[object]

Spark configuration settings.

object attributes:
key (required): display name "Parameter Name", type string
value: display name "Parameter Value", type string

trainingCollection - string, required

Solr Collection containing labeled training data

>= 1 characters

trainingDataFilterQuery - string

Solr query to use when loading training data from Solr; for all other data sources, a Spark SQL expression

Default: *:*

trainingDataFrameConfigOptions - object

Additional Spark DataFrame loading configuration options

trainingDataSamplingFraction - number

Fraction of the training data to use

<= 1

exclusiveMaximum: false

Default: 1

trainingLabelField - string, required

Solr field containing labels for training instances (should be single-valued strings)

type - string, required

Default: logistic_regression_classifier_trainer

Allowed values: logistic_regression_classifier_trainer

w2vDimension - integer

Word-vector dimensionality used to represent text (choose a value > 0 to use word vectors)

exclusiveMinimum: false

Default: 0

w2vMaxIter - integer

Maximum number of iterations of the word2vec training

Default: 1

w2vMaxSentenceLength - integer

Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to `maxSentenceLength` size.

>= 3

exclusiveMinimum: false

Default: 1000
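The chunking rule described for w2vMaxSentenceLength can be illustrated as follows. This is a plain-Python sketch of the splitting behavior, not the word2vec implementation itself.

```python
def chunk_sentence(words, max_sentence_length=1000):
    """Split a list of words into chunks of at most max_sentence_length words."""
    return [
        words[i:i + max_sentence_length]
        for i in range(0, len(words), max_sentence_length)
    ]


# Seven words with a limit of three produce chunks of 3, 3 and 1 words.
print(chunk_sentence(["a", "b", "c", "d", "e", "f", "g"], max_sentence_length=3))
```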

w2vStepSize - number

Training parameter for word2vec convergence (change at your own peril)

>= 0.005

exclusiveMinimum: false

Default: 0.025

w2vWindowSize - integer

The window size (context words from [-window, window]) for word2vec

>= 3

exclusiveMinimum: false

Default: 5
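The [-window, window] range can be read as "up to window words on each side of the current word". The sketch below, with a hypothetical helper name, extracts that context for one position in a token list.

```python
def context_words(tokens, position, window=5):
    """Return the tokens within `window` positions on either side of `position`."""
    start = max(0, position - window)
    left = tokens[start:position]
    right = tokens[position + 1:position + window + 1]
    return left + right


tokens = ["a", "b", "c", "d", "e", "f", "g"]
print(context_words(tokens, position=3, window=2))
```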

withIdf - boolean

Weight vector components based on inverse document frequency

Default: true

writeOptions - array[object]

Options used when writing output to Solr.

object attributes:
key (required): display name "Parameter Name", type string
value: display name "Parameter Value", type string
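Putting the properties together, a job definition might be assembled as in the sketch below. Property names and defaults come from this page; the job ID, collection name, and field names are hypothetical placeholders, and the sparkConfig entry is only an example of the key/value shape.

```python
import json

# Properties marked "required" on this page.
REQUIRED = [
    "dataFormat",
    "fieldToVectorize",
    "id",
    "trainingCollection",
    "trainingLabelField",
    "type",
]

job_config = {
    "type": "logistic_regression_classifier_trainer",
    "id": "lr-classifier-example",         # hypothetical job ID
    "trainingCollection": "training_data",  # hypothetical collection name
    "fieldToVectorize": "body_t",           # hypothetical text field
    "trainingLabelField": "category_s",     # hypothetical label field
    "dataFormat": "solr",
    "regularizationWeight": 0.01,
    "elasticNetWeight": 0,
    "maxIters": 10,
    # Key/value pairs follow the array[object] shape described above.
    "sparkConfig": [{"key": "spark.executor.memory", "value": "4g"}],
}

missing = [name for name in REQUIRED if name not in job_config]
if missing:
    raise ValueError("missing required properties: " + ", ".join(missing))

print(json.dumps(job_config, indent=2))
```

Properties that are omitted fall back to the defaults listed on this page.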