Logistic Regression Classifier Training Jobs

Train a regularized logistic regression model for text classification.

Configuration properties

Use this job when you have training data and you want to train a logistic regression model to classify text into groups.

analyzerConfig - string

LuceneTextAnalyzer schema for tokenization (JSON-encoded)

Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
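Because the property is typed as a string, the analyzer schema has to be JSON-encoded before it is placed in the job configuration. The following is a minimal sketch that rebuilds the documented default as a Python dictionary and serializes it; the variable names are illustrative only.

```python
import json

# Analyzer schema as a plain dict; the values mirror the documented default.
analyzer_schema = {
    "analyzers": [{
        "name": "StdTokLowerStop",
        "charFilters": [{"type": "htmlstrip"}],
        "tokenizer": {"type": "standard"},
        "filters": [
            {"type": "lowercase"},
            {"type": "KStem"},
            {"type": "length", "min": "2", "max": "32767"},
            {"type": "fusionstop", "ignoreCase": "true", "format": "snowball",
             "words": "org/apache/lucene/analysis/snowball/english_stop.txt"},
        ],
    }],
    "fields": [{"regex": ".+", "analyzer": "StdTokLowerStop"}],
}

# analyzerConfig expects a string, so the schema is JSON-encoded first.
analyzer_config = json.dumps(analyzer_schema)
```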

autoBalanceClasses - boolean

Ensure that all classes of training data have the same size

Default: true

dataFormat - string

Spark-compatible format of the training data (such as 'solr', 'hdfs', 'file', or 'parquet')

Default: solr

Allowed values: solr, hdfs, file, parquet

elasticNetWeight - number

Value between 0 and 1 to interpolate between ridge (0.0) and lasso (1.0) regression

<= 1

exclusiveMaximum: false

Default: 0
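To make the interpolation concrete, the sketch below computes an elastic-net penalty for a weight vector. It assumes the convention used by Spark ML, in which regularizationWeight plays the role of lambda and elasticNetWeight the role of alpha; the source does not spell out the exact formula, so treat this as an illustration rather than the job's implementation.

```python
def elastic_net_penalty(weights, regularization_weight, elastic_net_weight):
    """Penalty = lambda * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2).

    Assumption: this follows Spark ML's elastic-net convention.
    alpha = 0.0 gives a pure ridge (L2) penalty, alpha = 1.0 a pure lasso (L1) penalty.
    """
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return regularization_weight * (
        elastic_net_weight * l1 + (1.0 - elastic_net_weight) * 0.5 * l2_squared
    )
```

With elasticNetWeight left at its default of 0, only the L2 term contributes.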

evaluationMetricType - string

Optimize hyperparameter search over one of [binary, multiclass, regression] metrics, or 'none'

Default: none

Allowed values: binary, multiclass, regression, none

fieldToVectorize - string, required

Solr field containing text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2 etc.

>= 1 characters
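The field1:weight1,field2:weight2 syntax can be illustrated with a small parser. This is a sketch only: the function name is hypothetical, and giving an unweighted field a weight of 1.0 is an assumption, not documented behavior.

```python
def parse_weighted_fields(spec, default_weight=1.0):
    """Split 'field1:weight1,field2:weight2' into a {field: weight} dict.

    Assumption: a field listed without ':weight' gets default_weight.
    """
    fields = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        name, sep, weight = part.partition(":")
        fields[name.strip()] = float(weight) if sep else default_weight
    return fields
```

For example, parse_weighted_fields("title_t:2.0,body_t:1.0") yields a dictionary mapping title_t to 2.0 and body_t to 1.0 (both field names are made up for the example).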

gridSearch - boolean

Perform grid search to optimize hyperparameters

Default: false

id - string, required

The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.

<= 63 characters

Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
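A candidate ID can be checked against the length limit and match pattern before the job is submitted. The sketch below assumes the pattern must match the whole ID; the helper name is hypothetical.

```python
import re

# Pattern and length limit copied from the property description above.
JOB_ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_JOB_ID_LENGTH = 63

def is_valid_job_id(job_id):
    """Return True if job_id fits the length limit and the whole string matches the pattern."""
    return (
        len(job_id) <= MAX_JOB_ID_LENGTH
        and JOB_ID_PATTERN.fullmatch(job_id) is not None
    )
```

Note that the pattern as written also accepts digits after the first character, and that an ID must start with a letter.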

makeOtherClass - boolean

Create a label class 'Other' which contains all examples not in a class large enough to train on

Default: true
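The effect of a catch-all class can be sketched as a relabeling step. The code below assumes that "large enough to train on" means having at least minTrainingSamplesPerClass examples; the source does not state this link explicitly, and the function name is hypothetical.

```python
from collections import Counter

def fold_small_classes(labels, min_samples=100, other_name="Other"):
    """Relabel every example whose class has fewer than min_samples members.

    Assumption: the size cutoff corresponds to minTrainingSamplesPerClass,
    and other_name corresponds to otherClassName.
    """
    counts = Counter(labels)
    return [label if counts[label] >= min_samples else other_name for label in labels]
```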

maxDF - number

To be kept, terms must occur in no more than this number of documents (if > 1.0), or no more than this fraction of documents (if <= 1.0)

Default: 1

maxIters - integer

Maximum number of iterations to perform before halting, even if the convergence criterion has not been met.

Default: 10

minDF - number

To be kept, terms must occur in at least this number of documents (if > 1.0), or at least this fraction of documents (if <= 1.0)

Default: 0
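The minDF and maxDF thresholds switch between an absolute document count and a fraction depending on whether the value exceeds 1.0. A minimal sketch of that rule follows; the function names and the sample document frequencies are illustrative, not part of the job.

```python
def df_bounds(min_df, max_df, num_docs):
    """Convert minDF/maxDF into absolute document-count bounds.

    Values > 1.0 are treated as document counts; values <= 1.0 as fractions of num_docs.
    """
    low = min_df if min_df > 1.0 else min_df * num_docs
    high = max_df if max_df > 1.0 else max_df * num_docs
    return low, high

def filter_terms(doc_freqs, num_docs, min_df=0.0, max_df=1.0):
    """Keep the terms whose document frequency lies within [low, high]."""
    low, high = df_bounds(min_df, max_df, num_docs)
    return {term for term, df in doc_freqs.items() if low <= df <= high}
```

With the defaults (minDF 0, maxDF 1) both bounds are fractions covering the full range, so no term is dropped.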

minSparkPartitions - integer

Minimum number of Spark partitions for training job.

>= 1

exclusiveMinimum: false

Default: 200

minTrainingSamplesPerClass - integer

Ensure that all classes of training data have at least this many examples

>= 1

exclusiveMinimum: false

Default: 100

modelId - string

Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.

>= 1 characters

norm - integer

p-norm to normalize vectors with (choose -1 to turn normalization off)

Default: 2

Allowed values: -1, 0, 1, 2
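As an illustration of p-norm normalization, the sketch below handles -1 (normalization off) and the ordinary p = 1 and p = 2 cases. How the job treats a value of 0 is not described here, so the sketch deliberately rejects it rather than guess.

```python
def normalize(vector, p=2):
    """Scale vector so that its p-norm equals 1 (p = 1 or 2); p = -1 leaves it unchanged."""
    if p == -1:
        return list(vector)  # normalization turned off
    if p not in (1, 2):
        raise ValueError("this sketch only covers p = -1, 1, or 2")
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)  # avoid dividing an all-zero vector by zero
    return [x / norm for x in vector]
```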

otherClassName - string

Label class name for the catch-all 'Other' class

>= 1 characters

Default: Other

outputCollection - string

Solr collection in which to store model-labeled data

overwriteExistingModel - boolean

If a model already exists in the model store, overwrite it when this job runs

Default: true

predictedLabelField - string

Solr field that will contain labels when the classifier is applied to documents

Default: labelPredictedByFusionModel

randomSeed - integer

Seed for deterministic pseudorandom number generation

Default: 1234

regularizationWeight - number

Degree of regularization to use when training (L2 lambda parameter if elasticNetWeight = 0)

>= 0.000001

<= 1

exclusiveMinimum: false

exclusiveMaximum: false

Default: 0.01

sourceFields - string

Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

sparkConfig - array[object]

Spark configuration settings.

object attributes: {
key (required): {
display name: Parameter Name
type: string
}
value: {
display name: Parameter Value
type: string
}
}
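Since sparkConfig is an array of key/value objects with string values, a plain dictionary of settings has to be reshaped before use. The helper below is a hypothetical convenience, and the two Spark settings and their values are only examples.

```python
def to_key_value_list(settings):
    """Turn {name: value} into the [{'key': ..., 'value': ...}] shape, with string values."""
    return [{"key": str(key), "value": str(value)} for key, value in settings.items()]

# Example values only; choose settings that suit your cluster.
spark_config = to_key_value_list({
    "spark.executor.memory": "4g",
    "spark.cores.max": 8,
})
```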

trainingCollection - string, required

Solr Collection containing labeled training data

>= 1 characters

trainingDataFilterQuery - string

Solr query to use when loading training data

>= 3 characters

Default: *:*

trainingDataFrameConfigOptions - object

Additional Spark DataFrame loading configuration options

trainingDataSamplingFraction - number

Fraction of the training data to use

<= 1

exclusiveMaximum: false

Default: 1

trainingLabelField - string, required

Solr field containing labels for training instances (should be single-valued strings)

type - string, required

Default: logistic_regression_classifier_trainer

Allowed values: logistic_regression_classifier_trainer

w2vDimension - integer

Word-vector dimensionality to represent text (choose > 0 to use)

exclusiveMinimum: false

Default: 0

w2vMaxIter - integer

Maximum number of iterations of the word2vec training

Default: 1

w2vMaxSentenceLength - integer

Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to `maxSentenceLength` size.

>= 3

exclusiveMinimum: false

Default: 1000
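The chunking behavior described for long sentences can be sketched in a few lines. This is a simplified illustration with a hypothetical function name, not the word2vec implementation itself.

```python
def chunk_sentence(words, max_sentence_length=1000):
    """Split a list of words into consecutive chunks of at most max_sentence_length words."""
    return [
        words[start:start + max_sentence_length]
        for start in range(0, len(words), max_sentence_length)
    ]
```

A sentence of 2,500 words with the default limit would be split into chunks of 1,000, 1,000, and 500 words.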

w2vStepSize - number

Training parameter for word2vec convergence (change at your own peril)

>= 0.005

exclusiveMinimum: false

Default: 0.025

w2vWindowSize - integer

The window size (context words from [-window, window]) for word2vec

>= 3

exclusiveMinimum: false

Default: 5
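The [-window, window] notation means that each word is paired with up to window neighbors on either side. The sketch below shows that selection in its simplest form; actual word2vec implementations may vary the effective window during training, so this is an illustration only.

```python
def context_words(words, position, window=5):
    """Return the words within `window` positions of words[position], excluding the word itself."""
    start = max(0, position - window)
    end = min(len(words), position + window + 1)
    return [
        word
        for index, word in enumerate(words[start:end], start)
        if index != position
    ]
```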

withIdf - boolean

Weight vector components based on inverse document frequency

Default: true

writeOptions - array[object]

Options used when writing output to Solr.

object attributes: {
key (required): {
display name: Parameter Name
type: string
}
value: {
display name: Parameter Value
type: string
}
}
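Pulling the required properties together, a minimal job configuration might look like the sketch below. The ID, collection name, and field names are hypothetical placeholders; the optional values shown are the documented defaults. How the configuration is submitted is outside the scope of this page, so the sketch stops at producing the JSON.

```python
import json

# Properties marked "required" in the listing above.
REQUIRED_PROPERTIES = ("id", "type", "trainingCollection", "fieldToVectorize", "trainingLabelField")

job_config = {
    "type": "logistic_regression_classifier_trainer",
    "id": "example-lr-trainer",            # hypothetical job ID
    "trainingCollection": "labeled_docs",  # hypothetical collection name
    "fieldToVectorize": "body_t",          # hypothetical text field
    "trainingLabelField": "category_s",    # hypothetical label field
    "regularizationWeight": 0.01,          # documented default
    "elasticNetWeight": 0,                 # documented default
    "maxIters": 10,                        # documented default
}

# Basic sanity check before serializing.
missing = [name for name in REQUIRED_PROPERTIES if name not in job_config]
payload = json.dumps(job_config, indent=2)
```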