Content-Based Recommender Jobs (Experimental)

Table of Contents

Tuning tips
Query pipeline setup
Configuration properties

Use this job when you want to compute item similarities based on their content, such as product descriptions.

Default job name

COLLECTION_NAME_content_recs

Input

Searchable content from the primary collection.

Output

Items-for-item recommendations (the COLLECTION_NAME_content_recs collection by default)

First, item content is vectorized; different vectorization methods are available. Then, similar items are selected based on cosine similarity ("nearest neighbor") between their vectors.
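The selection step can be illustrated with a minimal sketch. It assumes the item vectors already exist and uses an exact, brute-force search; the job itself may use approximate search and a different implementation. The item IDs, vectors, and function names here are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def nearest_neighbors(item_id, vectors, k):
    """Return the k item IDs most similar to item_id (exact search)."""
    target = vectors[item_id]
    scored = [
        (cosine_similarity(target, vec), other_id)
        for other_id, vec in vectors.items()
        if other_id != item_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other_id for _, other_id in scored[:k]]


# Hypothetical toy vectors standing in for vectorized item content.
item_vectors = {
    "shoe-1": [1.0, 0.0, 0.2],
    "shoe-2": [0.9, 0.1, 0.3],
    "lamp-1": [0.0, 1.0, 0.0],
}

print(nearest_neighbors("shoe-1", item_vectors, k=2))
```

Brute-force search compares every pair of items, which is why an approximate index becomes attractive as the collection grows.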

At a minimum, you must specify these:

An ID for this job
The name of the training collection, that is, the collection with your content
An output collection; create a separate collection for this
The name of the ID field for documents in the training collection, such as item_id_s
The names of one or more content fields in the training collection
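As a rough illustration, the minimum settings listed above could be assembled as follows. The property names and the type value come from the Configuration properties section; the collection names and the description_t field are hypothetical placeholders, and how the configuration is submitted is not shown.

```python
import json

# Minimal job configuration sketch. Collection and field names are
# hypothetical; replace them with the ones from your own deployment.
job_config = {
    "id": "products_content_recs",             # an ID for this job
    "type": "argo-item-recommender-content",   # the only allowed type value
    "trainingCollection": "products",          # collection with your content
    "trainingFormat": "solr",                  # documented default
    "outputCollection": "products_content_recs",  # separate output collection
    "outputFormat": "solr",                    # documented default
    "itemIdField": "item_id_s",                # ID field in the training collection
    "contentField": ["description_t"],         # one or more content fields
}

print(json.dumps(job_config, indent=2))
```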

Content-based recommendations dataflow

Tuning tips

Configure Metadata fields for item-item evaluation to use those fields during evaluation to determine whether pairs belong to the same category.
Perform approximate nearest neighbor search is enabled by default to significantly reduce the job’s running time, with a small decrease in accuracy. If your training dataset is very small, then you can disable this option.
If your content contains a lot of domain-specific jargon, enable Use Word2Vec for vectorization.
If your documents are too short or too long, enable Use TF-IDF for vectorization.

Query pipeline setup

Download the APPName_item_item_rec_pipelines_content.json file and import it to create the query pipeline that consumes this job’s output. See Fetch Content-Based Items-for-Item Recommendations for details.

Configuration properties

contentField - array[string], required

Names of the fields that contain item content, such as a product description.

deleteOldRecs - boolean

Whether to delete previous recommendations. If this option is disabled, old recommendations are kept and new recommendations are appended with a different job ID, so both sets of recommendations are stored in the same collection. This option only works when the output format is solr.

Default: true

excludeFromDeleteFilter - string

If the 'Delete Old Recommendations' flag is enabled, then use this query filter to identify existing recommendation docs to exclude from delete. The filter should identify recommendation docs you want to keep.

id - string, required

The ID for this job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_). The ID must start with a letter.

<= 63 characters

Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
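A job ID can be checked against these constraints before the job is created. This is a minimal sketch that assumes the pattern must match the whole string; the function name is hypothetical.

```python
import re

# Pattern and length limit copied from the id property description.
JOB_ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_JOB_ID_LENGTH = 63


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id fits the length limit and the match pattern."""
    if len(job_id) > MAX_JOB_ID_LENGTH:
        return False
    # Assumption: the pattern applies to the entire ID, not a substring.
    return JOB_ID_PATTERN.fullmatch(job_id) is not None


print(is_valid_job_id("products_content_recs"))
print(is_valid_job_id("1st_job"))
```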

indexNN - integer

If ANN is performed, the depth of the constructed index. Higher values improve recall at the expense of longer indexing time. Reasonable range: 100 to 2000.

>= 100

<= 2000

exclusiveMinimum: false

exclusiveMaximum: false

itemIdField - string, required

Field name containing stored item IDs.

>= 1 characters

Default: item_id_s

itemMetadataFields - array[string]

List of item metadata fields to include in the recommendation output documents.

jobRunName - string

Identifier for this job run. Use it to filter recommendations from particular runs

lowercaseText - boolean

Select if you want the text to be lowercased.

Default: true

maxNeighbors - integer

If ANN is performed, the number of potential neighbors considered during the indexing phase. Higher values lead to better recall and shorter retrieval times, at the expense of longer indexing time. Reasonable range: 5 to 100.

>= 5

<= 100

exclusiveMinimum: false

exclusiveMaximum: false

metadataCategoryFields - array[string]

These fields will be used for item-item evaluation and for determining if the recommendation pair belongs to the same category.

numSimsPerItem - integer

Number of recommendations that will be saved per item.

>= 1

exclusiveMinimum: false

Default: 10

outputBatchSize - string

Number of documents per batch when pushing results to Solr.

Default: 15000

outputCollection - string, required

Solr collection or cloud storage path where output data is to be written.

outputFormat - string, required

The format of the output data, such as solr or parquet.

>= 1 characters

Default: solr

partitionFields - string

If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output

performANN - boolean

Whether to perform approximate nearest neighbor search (ANN). ANN will drastically reduce training time, but accuracy will drop a little. Disable only if dataset is very small.

Default: true

randomSeed - integer

Seed for the pseudorandom number generator. Keep this value constant to make job runs reproducible.

Default: 12345

readOptions - array[object]

Options used when reading input from Solr or other sources.

object attributes:
key (required) - display name: Parameter Name, type: string
value - display name: Parameter Value, type: string
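Each entry in this array is an object with a key and a value, both strings. The sketch below converts an ordinary dictionary into that shape; the helper name and the option name example_option are hypothetical, since the supported option names depend on the data source and are not listed here.

```python
def to_option_list(options):
    """Convert a dict into the [{"key": ..., "value": ...}] array shape.

    Keys and values are converted to strings because both attributes
    are declared with type string.
    """
    return [{"key": str(key), "value": str(value)} for key, value in options.items()]


# "example_option" is a placeholder, not a documented option name.
read_options = to_option_list({"example_option": 1000})
print(read_options)
```

The same shape is used by the other array[object] properties with key and value attributes.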

searchNN - integer

If ANN is performed, the depth of the search used to find neighbors. Higher values improve recall at the expense of longer retrieval time. Reasonable range: 100 to 2000.

>= 100

<= 2000

exclusiveMinimum: false

exclusiveMaximum: false

secretName - string

Name of the secret used to access cloud storage as defined in the K8s namespace

>= 1 characters

sparkConfig - array[object]

Provide additional key/value pairs to be injected into the training JSON map at runtime. Values will be inserted as-is, so use " to surround string values

object attributes:
key (required) - display name: Parameter Name, type: string
value - display name: Parameter Value, type: string

topKAnn - integer

Number of candidate recommendations to fetch per item, so that the number of recommendations to save per item (numSimsPerItem) is most likely satisfied after filtering. This is normally set to 10 * numSimsPerItem.

>= 1

exclusiveMinimum: false

Default: 100
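The rule of thumb for this property is simple arithmetic. The sketch below assumes that the number of item recommendations to compute corresponds to numSimsPerItem; the function name is hypothetical.

```python
def suggested_top_k_ann(num_sims_per_item: int, multiplier: int = 10) -> int:
    """Apply the rule of thumb: topKAnn = 10 * recommendations per item."""
    if num_sims_per_item < 1:
        raise ValueError("num_sims_per_item must be at least 1")
    return multiplier * num_sims_per_item


# With the default of 10 recommendations per item, this yields the
# documented topKAnn default of 100.
print(suggested_top_k_ann(10))
```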

trainingCollection - string, required

Solr collection or cloud storage path where training data is present.

>= 1 characters

trainingDataFilterQuery - string

Solr or SQL query to filter training data. Use solr query when solr collection is specified in Training Path. Use SQL query when cloud storage location is specified. The table name for SQL is `spark_input`.
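The two query styles might look like the fragments below. This is a sketch under assumptions: the field category_s, the collection name, and the storage path are hypothetical, and writing the SQL filter as a full SELECT statement is a guess. Only the table name spark_input comes from the description above.

```python
# Training data in a Solr collection: the filter is a Solr query.
solr_fragment = {
    "trainingCollection": "products",            # hypothetical collection
    "trainingFormat": "solr",
    "trainingDataFilterQuery": "category_s:shoes",
}

# Training data in cloud storage: the filter is SQL against spark_input.
cloud_fragment = {
    "trainingCollection": "gs://example-bucket/products/",  # hypothetical path
    "trainingFormat": "parquet",
    "trainingDataFilterQuery": "SELECT * FROM spark_input WHERE category_s = 'shoes'",
}

for fragment in (solr_fragment, cloud_fragment):
    print(fragment["trainingFormat"], "->", fragment["trainingDataFilterQuery"])
```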

trainingFormat - string, required

The format of the training data, such as solr or parquet.

>= 1 characters

Default: solr

trainingSampleFraction - number

Choose a fraction of the data for training.

<= 1

exclusiveMaximum: false

Default: 1

type - string, required

Default: argo-item-recommender-content

Allowed values: argo-item-recommender-content

unidecodeText - boolean

Select to transliterate Unicode text to its closest ASCII representation (unidecode).

Default: true

vectorizationDlBatchSize - integer

Batch size used when computing encodings. Use a smaller batch size if the hardware runs out of memory.

>= 1

exclusiveMinimum: false

vectorizationDlEnsembleWeight - number

Ensemble weight for deep learning based vectorization if more than one method of vectorization is selected.

Default: 1

vectorizationFasttextEnsembleWeight - number

Ensemble weight for Fasttext based vectorization if more than one method of vectorization is selected.

Default: 1

vectorizationFasttextEpochs - integer

Number of epochs to train custom Word2Vec embeddings.

>= 1

exclusiveMinimum: false

Default: 15

vectorizationFasttextMaxVocabSize - integer

Maximum number of tokens to consider for the vocab. Less frequent tokens will be omitted.

>= 1

exclusiveMinimum: false

vectorizationFasttextVectorsSize - integer

Word vector dimensions for Word2Vec vectorizer.

>= 1

exclusiveMinimum: false

Default: 150

vectorizationFasttextWindowSize - integer

The window size (context words from [-window, window]) for Word2Vec.

>= 1

exclusiveMinimum: false

Default: 5

vectorizationTfIdfMaxVocabSize - integer

Maximum number of tokens to consider for the vocab. Less frequent tokens will be omitted.

>= 1

exclusiveMinimum: false

vectorizationTfidfEnsembleWeight - number

Ensemble weight for Tf-Idf based vectorization if more than one method of vectorization is selected.

Default: 1

vectorizationTfidfFilterStopwords - boolean

Whether to filter out stopwords before generating Tf-Idf weights.

Default: true

vectorizationTfidfMaxNgram - integer

Maximum Ngram size to be used.

>= 1

exclusiveMinimum: false

Default: 3

vectorizationTfidfMinNgram - integer

Minimum Ngram size to be used.

>= 1

exclusiveMinimum: false

Default: 1
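To make the effect of the minimum and maximum n-gram sizes concrete, the sketch below generates word n-grams for a size range. It is only an illustration: it uses plain whitespace tokenization, which is an assumption and not necessarily how the job tokenizes text, and the function name is hypothetical.

```python
def word_ngrams(text, min_n=1, max_n=3):
    """Return all word n-grams of text with min_n <= n <= max_n."""
    tokens = text.lower().split()  # assumption: simple whitespace tokenizer
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


# Using the documented defaults: minimum size 1, maximum size 3.
print(word_ngrams("Red running shoes"))
```

A larger maximum size captures longer phrases but increases the vocabulary, which is where a maximum vocabulary size limit becomes relevant.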

vectorizationTfidfUseCharacters - boolean

Whether to use characters instead of words as tokens. By default, words are used.

vectorizationUseDl - boolean

Select to use deep learning as the vectorization method. You can select other methods as well, in which case an ensemble is used.

Default: true

vectorizationUseFasttext - boolean

Select to use Word2Vec as the vectorization method. You can select other methods as well, in which case an ensemble is used. Custom embeddings are learned from your content, which is useful for domain-specific jargon.

vectorizationUseTfidf - boolean

Select to use TF-IDF as the vectorization method. You can select other methods as well, in which case an ensemble is used.

writeOptions - array[object]

Options used when writing output to Solr or other sources

object attributes:
key (required) - display name: Parameter Name, type: string
value - display name: Parameter Value, type: string