Build Training Data Jobs

Use this job to build training data for query classification by joining signals data with catalog data. The output of this job can be used as input for the Classification job.

For detailed configuration steps, see Classify New Queries.

Configuration properties

id - string, required

The ID for this Spark job, used in the API to reference the job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_). The ID must start with a letter. Maximum length: 63 characters.

<= 63 characters

Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
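The pattern and length limit above can be checked before submitting a job. The following is a minimal sketch, assuming the pattern must match the entire ID; the helper name is_valid_job_id is hypothetical and not part of the product API.

```python
import re

# Pattern and length limit copied from the property reference above.
ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_ID_LENGTH = 63


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id fits the documented pattern and length limit.

    Assumption: the pattern is applied to the whole string (full match).
    """
    if len(job_id) > MAX_ID_LENGTH:
        return False
    return ID_PATTERN.fullmatch(job_id) is not None


if __name__ == "__main__":
    for candidate in ["build-training_1", "1abc", "has space"]:
        print(candidate, is_valid_job_id(candidate))
```

Under this reading, an ID that starts with a digit or contains a space is rejected, while letters, digits, dashes, and underscores after the first letter are accepted.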

sparkConfig - array[object]

Spark configuration settings.

object attributes:
key (required) - string; display name: Parameter Name
value - string; display name: Parameter Value
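Each entry in sparkConfig is an object with a key and a value, both strings. As a rough illustration, the sketch below converts a plain dictionary of Spark settings into that shape. The helper name to_key_value_list and the chosen settings are illustrative assumptions; the same key/value shape applies to writeOptions and readOptions.

```python
import json


def to_key_value_list(settings: dict) -> list:
    """Convert {"name": value} pairs into the documented key/value objects.

    Values are stringified because the schema declares both attributes
    as strings.
    """
    return [{"key": name, "value": str(value)} for name, value in settings.items()]


# Example Spark properties, chosen only for illustration.
spark_config = to_key_value_list(
    {
        "spark.executor.memory": "4g",
        "spark.sql.shuffle.partitions": 200,
    }
)

if __name__ == "__main__":
    print(json.dumps(spark_config, indent=2))
```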

fieldToVectorize - string, required

Field containing query strings.

>= 1 characters

Default: query_s

dataFormat - string

Spark-compatible format of the training data (for example, 'solr', 'parquet', or 'orc').

>= 1 characters

Default: solr

trainingDataFrameConfigOptions - object

Additional Spark DataFrame loading configuration options.

trainingDataFilterQuery - string

Solr query used to further filter signals. For non-Solr data sources, use the Spark SQL filter query field under Advanced to filter results.

Default: *:*

sparkSQL - string

Use this field to create a Spark SQL query for filtering your input data. The input data is registered as spark_input.

Default: SELECT * from spark_input

trainingDataSamplingFraction - number

Fraction of the training data to use

<= 1

exclusiveMaximum: false

Default: 1

randomSeed - integer

Seed used for any pseudorandom number generation, which makes the results reproducible.

Default: 1234
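To illustrate how trainingDataSamplingFraction and randomSeed interact, here is a minimal sketch of seeded sampling in plain Python. It is not the job's implementation (the job runs on Spark), and the function name sample_rows is hypothetical. It only shows that a fixed seed yields the same sample on every run and that a fraction of 1 keeps every row.

```python
import random


def sample_rows(rows, fraction=1.0, seed=1234):
    """Keep each row independently with probability `fraction`.

    Assumption: this mimics per-row sampling without replacement.
    The defaults mirror the documented defaults (1 and 1234).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # fixed seed gives a reproducible sample
    return [row for row in rows if rng.random() < fraction]


if __name__ == "__main__":
    rows = list(range(20))
    first = sample_rows(rows, fraction=0.5)
    second = sample_rows(rows, fraction=0.5)
    print(first == second)  # same seed, same sample
```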

dataOutputFormat - string

Spark-compatible output format (for example, 'solr' or 'parquet').

>= 1 characters

Default: solr

partitionCols - string

When writing to non-Solr sources, a comma-delimited list of column names used to partition the DataFrame before writing to the external output.

writeOptions - array[object]

Options used when writing output to Solr or other sources

object attributes:
key (required) - string; display name: Parameter Name
value - string; display name: Parameter Value

readOptions - array[object]

Options used when reading input from Solr or other sources.

object attributes:
key (required) - string; display name: Parameter Name
value - string; display name: Parameter Value

catalogPath - string, required

Catalog collection or cloud storage path which contains item categories.

catalogFormat - string, required

Spark-compatible format of the catalog data (for example, 'solr', 'parquet', or 'orc').

signalsPath - string, required

Signals collection or cloud storage path that contains the signals data.

outputPath - string, required

Output collection or cloud storage path where the training data is written.

categoryField - string, required

Item category field in catalog.

catalogIdField - string, required

Item ID field in the catalog, used to join with signals.

itemIdField - string, required

Item ID field in signals, used to join with the catalog.

Default: doc_id_s

countField - string, required

Count field in raw or aggregated signals.

Default: aggr_count_i
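The join described by catalogIdField, itemIdField, categoryField, and countField can be pictured with a small in-memory sketch. This is an illustration under assumptions, not the job's Spark implementation: it assumes an inner join, a single category per catalog item, and hypothetical catalog field names id and category_s. The function name join_and_count is also hypothetical.

```python
from collections import Counter


def join_and_count(
    signals,
    catalog,
    item_id_field="doc_id_s",      # documented default
    count_field="aggr_count_i",    # documented default
    query_field="query_s",         # documented default of fieldToVectorize
    catalog_id_field="id",         # hypothetical catalog field name
    category_field="category_s",   # hypothetical catalog field name
):
    """Join signals to catalog items and sum counts per (query, category)."""
    category_by_id = {doc[catalog_id_field]: doc[category_field] for doc in catalog}
    pair_counts = Counter()
    for signal in signals:
        category = category_by_id.get(signal[item_id_field])
        if category is None:
            continue  # inner join: skip signals with no matching catalog item
        pair_counts[(signal[query_field], category)] += signal[count_field]
    return pair_counts


signals = [
    {"query_s": "laptop", "doc_id_s": "p1", "aggr_count_i": 8},
    {"query_s": "laptop", "doc_id_s": "p2", "aggr_count_i": 2},
    {"query_s": "laptop", "doc_id_s": "missing", "aggr_count_i": 5},
]
catalog = [
    {"id": "p1", "category_s": "computers"},
    {"id": "p2", "category_s": "bags"},
]

if __name__ == "__main__":
    print(dict(join_and_count(signals, catalog)))
```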

topCategoryProportion - number

Minimum proportion that the top category must account for among all categories.

Default: 0.5

topCategoryThreshold - integer

Minimum count required for a (query, category) pair.

>= 1

exclusiveMinimum: false

Default: 1
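One plausible reading of topCategoryProportion and topCategoryThreshold is sketched below: for each query, the most frequent category is kept only if its count reaches the threshold and its share of that query's total count reaches the proportion. Whether the job uses inclusive comparisons, and how it breaks ties, is not stated in this reference, so treat those details as assumptions. The function name select_top_categories is hypothetical.

```python
from collections import defaultdict


def select_top_categories(pair_counts, top_category_proportion=0.5, top_category_threshold=1):
    """Map each query to its top category when both limits are met.

    pair_counts maps (query, category) to a count.
    Assumption: both comparisons are inclusive (>=).
    """
    by_query = defaultdict(dict)
    for (query, category), count in pair_counts.items():
        by_query[query][category] = count

    selected = {}
    for query, categories in by_query.items():
        total = sum(categories.values())
        if total == 0:
            continue  # nothing to compute a proportion from
        top_category, top_count = max(categories.items(), key=lambda item: item[1])
        if top_count >= top_category_threshold and top_count / total >= top_category_proportion:
            selected[query] = top_category
    return selected


example_counts = {
    ("laptop", "computers"): 8,
    ("laptop", "bags"): 2,
    ("case", "bags"): 3,
    ("case", "phones"): 3,
    ("case", "computers"): 3,
}

if __name__ == "__main__":
    print(select_top_categories(example_counts))
```

In this example, "laptop" keeps "computers" because that category holds 8 of 10 counts, while "case" is dropped because no category reaches half of its 9 counts.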

analyzerConfig - string, required

The text analyzer configuration to use, given as a JSON string.

Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
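The default value is a JSON document with two parts: a list of named analyzers and a list of rules that map field names to an analyzer by regular expression. The sketch below parses the default and looks up the analyzer for a field. The helper analyzer_for_field is hypothetical, and treating the regex as a full match against the field name is an assumption.

```python
import json
import re

# Default value copied from the property reference above.
DEFAULT_ANALYZER_CONFIG = """{ "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}"""

config = json.loads(DEFAULT_ANALYZER_CONFIG)


def analyzer_for_field(config: dict, field_name: str):
    """Return the first analyzer whose field rule matches field_name, or None."""
    analyzers = {analyzer["name"]: analyzer for analyzer in config["analyzers"]}
    for rule in config["fields"]:
        # Assumption: the rule's regex must match the whole field name.
        if re.fullmatch(rule["regex"], field_name):
            return analyzers[rule["analyzer"]]
    return None


if __name__ == "__main__":
    analyzer = analyzer_for_field(config, "query_s")
    print(analyzer["tokenizer"]["type"])
    print([f["type"] for f in analyzer["filters"]])
```

In the default, the ".+" rule applies the single analyzer to every non-empty field name. Despite the name StdTokLowerStop, the only token filter listed in the default is lowercase.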

type - string, required

Default: build-training

Allowed values: build-training
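Putting the required properties together, a job definition might look like the sketch below, built as a Python dictionary and serialized to JSON. The collection names and the catalog field names (catalog, catalog_signals_aggr, query_classification_training, category_s, id) are hypothetical placeholders; only the property names, the type value, and the defaults come from this reference.

```python
import json

# Properties marked "required" in the reference above.
REQUIRED_PROPERTIES = [
    "id", "type", "fieldToVectorize", "catalogPath", "catalogFormat",
    "signalsPath", "outputPath", "categoryField", "catalogIdField",
    "itemIdField", "countField", "analyzerConfig",
]

analyzer_config = {
    "analyzers": [{
        "name": "StdTokLowerStop",
        "charFilters": [{"type": "htmlstrip"}],
        "tokenizer": {"type": "standard"},
        "filters": [{"type": "lowercase"}],
    }],
    "fields": [{"regex": ".+", "analyzer": "StdTokLowerStop"}],
}

job = {
    "id": "build-training-data",                    # hypothetical job ID
    "type": "build-training",                       # the only allowed value
    "fieldToVectorize": "query_s",                  # documented default
    "catalogPath": "catalog",                       # hypothetical collection
    "catalogFormat": "solr",
    "signalsPath": "catalog_signals_aggr",          # hypothetical collection
    "outputPath": "query_classification_training",  # hypothetical collection
    "categoryField": "category_s",                  # hypothetical field
    "catalogIdField": "id",                         # hypothetical field
    "itemIdField": "doc_id_s",                      # documented default
    "countField": "aggr_count_i",                   # documented default
    "topCategoryProportion": 0.5,                   # documented default
    "topCategoryThreshold": 1,                      # documented default
    # analyzerConfig is typed as a string, so the JSON is serialized first.
    "analyzerConfig": json.dumps(analyzer_config),
}

missing = [name for name in REQUIRED_PROPERTIES if name not in job]

if __name__ == "__main__":
    print("missing required properties:", missing)
    print(json.dumps(job, indent=2))
```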