Word2Vec Model Training Jobs
Train a shallow neural model and project each document onto this vector embedding space.
The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-) and underscore (_)
<= 128 characters
Match pattern: ^[A-Za-z0-9_\-]+$
Solr Collection containing labeled training data
>= 1 characters
Solr field containing the text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2, etc.
>= 1 characters
Spark-compatible format in which the training data is stored (e.g. 'solr', 'hdfs', 'file', 'parquet')
Default: solr
Allowed values: solr, hdfs, file, parquet
Additional Spark DataFrame loading configuration options
Solr query to use when loading training data
>= 3 characters
Default: *:*
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
Seed used for any deterministic pseudorandom number generation
Default: 1234
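For illustration, with the default 'solr' format the loading options above correspond roughly to a spark-solr read like the following sketch. The zkhost value and collection name are placeholders, and a SparkSession named spark is assumed to be in scope:

    val training = spark.read.format("solr")
      .option("zkhost", "localhost:9983")    // placeholder ZooKeeper address
      .option("collection", "training_data") // Solr collection with labeled training data
      .option("query", "*:*")                // training-data filter query
      .load()
      .sample(withReplacement = false, fraction = 1.0, seed = 1234L) // sampling fraction + seed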
Solr Collection in which to store the model-labeled data
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.
>= 1 characters
LuceneTextAnalyzer schema for tokenization (JSON-encoded)
Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
Weight vector components based on inverse document frequency
Default: true
Word-vector dimensionality to represent text
>= 2
exclusiveMinimum: false
Default: 50
The window size (context words from [-window, window]) for word2vec
>= 3
exclusiveMinimum: false
Default: 5
Sets the maximum length (in words) of each sentence in the input data. Any sentence longer than this threshold will be divided into chunks of up to `maxSentenceLength` size.
>= 3
exclusiveMinimum: false
Default: 1000
Maximum number of iterations of the word2vec training
Default: 1
Step size (learning rate) governing word2vec convergence (change at your own peril)
>= 0.005
exclusiveMinimum: false
Default: 0.025
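Taken together, the word2vec parameters above (dimensionality, window size, maximum sentence length, iterations, and step size) correspond closely to Spark ML's org.apache.spark.ml.feature.Word2Vec. The sketch below wires in the documented defaults; the tokenized DataFrame and the "tokens"/"vector" column names are hypothetical, not names used by the job:

    import org.apache.spark.ml.feature.Word2Vec

    val word2vec = new Word2Vec()
      .setInputCol("tokens")      // analyzed text: one Seq[String] per document
      .setOutputCol("vector")
      .setVectorSize(50)          // word-vector dimensionality
      .setWindowSize(5)           // context words from [-window, window]
      .setMaxSentenceLength(1000) // longer sentences are chunked
      .setMaxIter(1)              // maximum training iterations
      .setStepSize(0.025)         // convergence parameter
      .setSeed(1234L)             // deterministic pseudorandom seed

    val model = word2vec.fit(tokenized)
    // "Project each document onto this vector embedding space":
    // transform averages the vectors of a document's tokens into one vector
    val projected = model.transform(tokenized)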
To be kept, terms must occur in at least this number of documents (if > 1.0), or at least this fraction of documents (if <= 1.0)
Default: 0
To be kept, terms must occur in no more than this number of documents (if > 1.0), or no more than this fraction of documents (if <= 1.0)
Default: 1
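The count-versus-fraction rule shared by these two settings can be summarized in a small helper; this is an illustration of the documented semantics, not the job's actual implementation:

    // Values > 1.0 are absolute document counts; values <= 1.0 are
    // fractions of the corpus size. The defaults (0 and 1) keep every term.
    def docFreqBounds(minDF: Double, maxDF: Double, numDocs: Long): (Double, Double) = {
      val lower = if (minDF > 1.0) minDF else minDF * numDocs
      val upper = if (maxDF > 1.0) maxDF else maxDF * numDocs
      (lower, upper) // keep terms whose document frequency falls in [lower, upper]
    }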
p-norm to normalize vectors with (choose -1 to turn normalization off)
Default: 2
Allowed values: -1, 0, 1, 2
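As a sketch of this normalization rule (for p >= 1 the same effect is available from Spark's org.apache.spark.ml.feature.Normalizer; the function below is only an illustration):

    def normalize(v: Array[Double], p: Double): Array[Double] =
      if (p == -1) v // -1 turns normalization off
      else {
        val norm =
          if (p == 0) v.count(_ != 0.0).toDouble // "0-norm": count of nonzero components
          else math.pow(v.map(x => math.pow(math.abs(x), p)).sum, 1.0 / p)
        if (norm == 0.0) v else v.map(_ / norm)
      }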
Minimum number of Spark partitions for the training job.
>= 1
exclusiveMinimum: false
Default: 200
Solr field that will contain the terms the word2vec model considers related to the input
Default: related_terms_txt
Field containing the unique ID for each document
>= 1 characters
For each collection of input words, find this many word2vec-related words
>= 1
exclusiveMinimum: false
Default: 10
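For illustration, related terms like these can be read off a fitted Spark ML Word2VecModel with findSynonyms, which returns the top-N words by cosine similarity. Here model is the fitted model from the training sketch above, and "search" is an arbitrary example word:

    // Returns a DataFrame with columns "word" and "similarity"
    val related = model.findSynonyms("search", 10)
    related.show()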
The type of this Spark job.
Default: word2vec
Allowed values: word2vec