Use this job when you have labeled training data and want to train a logistic regression model that classifies text into groups.
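A minimal job configuration supplies only the required parameters; the collection and field names below are hypothetical placeholders:

{
  "id": "classify-news",
  "type": "logistic_regression_classifier_trainer",
  "dataFormat": "solr",
  "trainingCollection": "news_training",
  "fieldToVectorize": "body_t",
  "trainingLabelField": "category_s"
}

All other parameters fall back to the defaults documented below.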
analyzerConfig - string
LuceneTextAnalyzer schema for tokenization (JSON-encoded)
Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
autoBalanceClasses - boolean
Ensure that all classes of training data have the same size
Default: true
dataFormat - string (required)
Spark-compatible format in which the training data is stored (e.g. 'solr', 'hdfs', 'file', 'parquet')
>= 1 characters
Default: solr
elasticNetWeight - number
Value between 0 and 1 to interpolate between ridge (0.0) and lasso (1.0) regression
<= 1
exclusiveMaximum: false
Default: 0
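Assuming Spark MLlib's standard elastic net formulation, the penalty applied during training is regularizationWeight * (elasticNetWeight * ||w||_1 + (1 - elasticNetWeight) / 2 * ||w||_2^2), so 0.0 yields pure L2 (ridge) and 1.0 pure L1 (lasso). A sketch combining it with regularizationWeight (defined below):

{
  "elasticNetWeight": 0.5,
  "regularizationWeight": 0.01
}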
evaluationMetricType - string
Optimize hyperparameter search over one of [binary, multiclass, regression] metrics, or 'none'
Default: none
Allowed values: binary, multiclass, regression, none
fieldToVectorize - string (required)
Solr field containing the text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2, etc. (see the example below).
>= 1 characters
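For example, to weight a title field twice as heavily as the body field (title_t and body_t are hypothetical field names):

"fieldToVectorize": "title_t:2.0,body_t:1.0"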
gridSearch - boolean
Perform grid search to optimize hyperparameters
Default: false
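Grid search is typically paired with a metric to optimize (evaluationMetricType, above); a minimal sketch:

{
  "gridSearch": true,
  "evaluationMetricType": "multiclass"
}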
id - string (required)
The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.
<= 63 characters
Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
makeOtherClass - boolean
Create a label class 'Other' that contains all examples not in a class large enough to train on
Default: true
maxDF - number
To be kept, terms must occur in no more than this number of documents (if > 1.0), or no more than this fraction of documents (if <= 1.0)
Default: 1
maxIters - integer
Maximum number of iterations to perform before halting, even if the convergence criterion has not been met.
Default: 10
minDF - number
To be kept, terms must occur in at least this number of documents (if > 1.0), or at least this fraction of documents (if <= 1.0)
Default: 0
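The count and fraction forms of minDF and maxDF can be mixed. For example, to keep only terms that occur in at least 5 documents but in no more than half of the corpus:

{
  "minDF": 5,
  "maxDF": 0.5
}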
minSparkPartitions - integer
Minimum number of Spark partitions for the training job.
>= 1
exclusiveMinimum: false
Default: 200
minTrainingSamplesPerClass - integer
Ensure that all classes of training data have at least this many examples
>= 1
exclusiveMinimum: false
Default: 100
modelId - string
Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.
>= 1 characters
norm - integer
p-norm to normalize vectors with (choose -1 to turn normalization off)
Default: 2
Allowed values: -1, 0, 1, 2
otherClassName - string
Label class name for the catch-all 'Other' class
>= 1 characters
Default: Other
outputCollection - string
Solr Collection in which to store model-labeled data
overwriteExistingModel - boolean
If a model already exists in the model store, overwrite it when this job runs
Default: true
predictedLabelField - string
Solr field that will contain the predicted labels when the classifier is applied to documents
Default: labelPredictedByFusionModel
randomSeed - integer
Seed used for deterministic pseudorandom number generation
Default: 1234
readOptions - array[object]
Options used when reading input from Solr or other sources.
object attributes:
  key (string, required) - Parameter Name
  value (string) - Parameter Value
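A sketch of readOptions; the option names shown (spark-solr's rows and splits settings) are assumptions about the underlying reader, not required values:

"readOptions": [
  { "key": "rows", "value": "10000" },
  { "key": "splits", "value": "true" }
]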
regularizationWeight - number
Degree of regularization to use when training (L2 lambda parameter if elasticNetWeight = 0)
>= 0.000001
<= 1
exclusiveMinimum: false
exclusiveMaximum: false
Default: 0.01
sourceFields - string
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
sparkConfig - array[object]
Spark configuration settings.
object attributes:
  key (string, required) - Parameter Name
  value (string) - Parameter Value
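Entries are standard Spark properties. For example, to give each executor more memory and two cores:

"sparkConfig": [
  { "key": "spark.executor.memory", "value": "4g" },
  { "key": "spark.executor.cores", "value": "2" }
]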
trainingCollection - string (required)
Solr Collection containing labeled training data
>= 1 characters
trainingDataFilterQuery - string
Solr query to use when loading training data (if using Solr); a Spark SQL expression for all other data sources
Default: *:*
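For example, to train only on English documents, assuming a hypothetical lang_s field:

"trainingDataFilterQuery": "lang_s:en"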
trainingDataFrameConfigOptions - object
Additional Spark DataFrame loading configuration options
trainingDataSamplingFraction - number
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
trainingLabelField - string (required)
Solr field containing labels for training instances (should be single-valued strings)
type - string (required)
Default: logistic_regression_classifier_trainer
Allowed values: logistic_regression_classifier_trainer
w2vDimension - integer
Word-vector dimensionality used to represent text (set > 0 to enable word2vec)
exclusiveMinimum: false
Default: 0
w2vMaxIter - integer
Maximum number of iterations of the word2vec training
Default: 1
w2vMaxSentenceLength - integer
Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to `maxSentenceLength` size.
>= 3
exclusiveMinimum: false
Default: 1000
w2vStepSize - number
Training parameter for word2vec convergence (change at your own peril)
>= 0.005
exclusiveMinimum: false
Default: 0.025
w2vWindowSize - integer
The window size (context words from [-window, window]) for word2vec
>= 3
exclusiveMinimum: false
Default: 5
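The w2v* parameters take effect only when w2vDimension is greater than 0 (see w2vDimension above). A sketch enabling 100-dimensional word vectors with the default window:

{
  "w2vDimension": 100,
  "w2vWindowSize": 5,
  "w2vMaxIter": 1
}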
withIdf - boolean
Weight vector components based on inverse document frequency
Default: true
writeOptions - array[object]
Options used when writing output to Solr.
object attributes:
  key (string, required) - Parameter Name
  value (string) - Parameter Value
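writeOptions uses the same key/value shape as readOptions; the commit setting shown is an assumption about spark-solr's write options, not a required value:

"writeOptions": [
  { "key": "commit_within", "value": "5000" }
]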