Synonym Detection Jobs

Table of Contents

Input
Output
- The similar queries table
- The synonyms table
Configuration properties

Use this job to generate pairs of synonyms and pairs of similar queries. Two words are considered potential synonyms when they are used in a similar context in similar queries.

Default job name

COLLECTION_NAME_synonym_detection

Input

Aggregated signals (the COLLECTION_NAME_signals_aggr collection by default)
Spell Correction job output (the COLLECTION_NAME_query_rewrite_staging collection by default)
Phrase Extraction job output (the COLLECTION_NAME_query_rewrite_staging collection by default)

Output

Synonyms (the COLLECTION_NAME_query_rewrite_staging collection by default)

Required signals fields:

query
count_i
type
timestamp_tdt
user_id
doc_id
session_id
fusion_query_id

For best job speed and to avoid memory issues, use aggregated signals instead of raw signals as input for this job.

Output from the Token and Phrase Spell Correction job and the Phrase Extraction job can be used as input for this job.

To review, edit, deploy, or delete output from this job, see Query Rewriting UI.
To review, edit, or add synonyms, see Use Synonym Detection.

Input

This job takes one or more of the following as input:

Signal data (required)
Misspelling job results
Phrase detection job results
Keywords
Custom synonyms

Signal data is required; the other inputs are optional. Signal data can be either raw or aggregated. The job runs faster with aggregated signals. When raw signals are used as input, the job performs the aggregation itself.

Use the trainingCollection/Input Collection parameter to specify the collection that contains the signal data.

Misspelling job results

Token and Phrase Spell Correction job results can be used to keep this job from returning mostly misspellings, or from mixing synonyms with misspellings.

Use the misspellingCollection/Misspelling Job Result Collection parameter to specify the collection that contains these results.

Phrase detection job results

Phrase Extraction job results can be used to find synonyms with multiple tokens, such as "lithium ion" and "ion battery".

Use the keyPhraseCollection/Phrase Extraction Job Result Collection parameter to specify the collection that contains these results.

Keywords

A keywords list in the blob store can serve as a blacklist to prevent common attributes from being identified as potential synonyms.

The list can include common attributes such as color, brand, material, and so on. For example, by including color attributes you can prevent "red" and "blue" from being identified as synonyms due to their appearance in similar queries such as "red bike" and "blue bike".

The keywords file is in CSV format with two fields: keyword and type. You can add your custom keywords to this list with the type value "stopword". An example file is shown below:

keyword,type
cu,stopword
ft,stopword
mil,stopword
watt,stopword
wat,stopword
foot,stopword
feet,stopword
gal,stopword
unit,stopword
lb,stopword
wt,stopword
cc,stopword
cm,stopword
kg,stopword
km,stopword
oz,stopword
nm,stopword
qt,stopword
sale,stopword
on sale,stopword
for sale,stopword
clearance,stopword
gb,stopword
gig,stopword
color,stopword
blue,stopword
white,stopword
black,stopword
ivory,stopword
grey,stopword
brown,stopword
silver,stopword
light blue,stopword
light ivory,stopword
light grey,stopword
light brown,stopword
light silver,stopword
light green,stopword

Use the keywordsBlobName/Keywords Blob Store parameter to specify the name of the blob that contains this list.
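A keywords file in this format can be produced with a short script. The following is a minimal sketch using Python's standard csv module; the function name and the sample keywords are illustrative only, and uploading the result to the blob store is not shown.

```python
import csv
import io


def build_keywords_csv(keywords):
    """Return CSV text with a 'keyword,type' header and one 'stopword' row per keyword."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["keyword", "type"])
    for keyword in keywords:
        # Every entry uses the type value "stopword", as in the example file above.
        writer.writerow([keyword.strip(), "stopword"])
    return buffer.getvalue()


# Hypothetical attribute values to exclude from synonym detection.
csv_text = build_keywords_csv(["red", "blue", "light green"])
print(csv_text)
```

Multi-word entries such as "light green" are written as a single keyword value, matching entries like "on sale" in the example file.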

Custom synonyms

Some deployments need to reuse existing synonym definitions. You can import existing synonyms into the Synonym Detection job as a text file. Upload your synonyms text file to the blob store and reference that blob with the synonymBlobName parameter when creating the job.

Output

The output collection contains two tables. Both have the doc_type value "query_rewrite" and are distinguished by the type field.

The similar queries table

If query leads to clicks on documents 1, 2, 3, and 4, and similar_query leads to clicks on documents 2, 3, 4, and 5, then there is sufficient overlap between the two queries to consider them similar.

A statistic is constructed to compute similarities based on overlap counts and query counts. The resulting table consists of documents whose doc_type value is "query_rewrite" and type value is "simq". The similar queries table contains similar query pairs with these fields:

query

The first half of the two-query pair.

similar_query

The second half of the two-query pair.

similarity

A score between 0 and 1 indicating how similar the two queries are.

All similarity values are greater than or equal to the configured Query Similarity Threshold to ensure that only high-similarity queries are kept and used as input to find synonyms.

query_count

The number of clicks received by the query in the query field.

To save computation time, only queries with at least as many clicks as the configured Query Clicks Threshold parameter are kept and used as input to find synonyms.

similar_query_count

The number of clicks received by the query in the similar_query field.
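The exact similarity statistic is not specified here. As a rough illustration only, the sketch below uses the Jaccard overlap of clicked-document sets as a stand-in and applies a click-count filter and a similarity filter like the ones described above. The function names, the sample click data, and the threshold values 5 and 0.5 are assumptions made for the example, not the job's actual implementation.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two sets: size of intersection divided by size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_query_pairs(clicks, min_clicks=5, min_similarity=0.5):
    """clicks maps query -> {doc_id: click_count}; returns rows shaped like the table above."""
    # Drop queries with too few clicks before comparing anything.
    counts = {q: sum(docs.values()) for q, docs in clicks.items()}
    kept = sorted(q for q, n in counts.items() if n >= min_clicks)
    rows = []
    for q1, q2 in combinations(kept, 2):
        score = jaccard(set(clicks[q1]), set(clicks[q2]))
        # Keep only pairs at or above the similarity threshold.
        if score >= min_similarity:
            rows.append({
                "query": q1,
                "similar_query": q2,
                "similarity": score,
                "query_count": counts[q1],
                "similar_query_count": counts[q2],
            })
    return rows


# Hypothetical click data: two queries share documents 2, 3, and 4.
clicks = {
    "laptop": {"1": 3, "2": 2, "3": 1, "4": 1},
    "notebook": {"2": 2, "3": 2, "4": 1, "5": 1},
    "bike": {"9": 1},
}
rows = similar_query_pairs(clicks)
print(rows)
```

With this data, "bike" is dropped for having too few clicks, and the remaining pair shares three of five distinct documents, giving a Jaccard score of 0.6.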

The synonyms table

The synonyms table consists of documents whose doc_type value is "query_rewrite" and type value is "synonym":

surface_form

The first half of the two-synonym pair.

synonym

The second half of the two-synonym pair.

context

If there are more than two words or phrases with the same meaning, such as "macbook, apple mac, mac", then this field shows the group to which this pair belongs.

similarity

A similarity score to measure confidence.

count

The number of different contexts in which this synonym pair appears.

The bigger the number, the higher the quality of the pair.

suggestion

The algorithm automatically selects context, synonym words or phrases, or the synonym_group, and puts it in this field.

Use this field as the field to review.

category

Whether the synonym is actually a misspelling.
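Because both tables share one output collection, a consumer has to separate them by the type field. The sketch below does this on documents that have already been fetched and are represented as plain dictionaries; the helper name and the sample records are made up for illustration, and fetching the documents is left out.

```python
def split_output(docs):
    """Separate query-rewrite documents into similar-query rows and synonym rows."""
    rewrites = [d for d in docs if d.get("doc_type") == "query_rewrite"]
    similar = [d for d in rewrites if d.get("type") == "simq"]
    synonyms = [d for d in rewrites if d.get("type") == "synonym"]
    # Pairs that appear in more contexts are described above as higher quality,
    # so put them first for review.
    synonyms.sort(key=lambda d: d.get("count", 0), reverse=True)
    return similar, synonyms


# Hypothetical output documents.
docs = [
    {"doc_type": "query_rewrite", "type": "synonym",
     "surface_form": "mac", "synonym": "macbook", "count": 4},
    {"doc_type": "query_rewrite", "type": "simq",
     "query": "mac case", "similar_query": "macbook case"},
    {"doc_type": "query_rewrite", "type": "synonym",
     "surface_form": "tv", "synonym": "television", "count": 9},
]
similar, synonyms = split_output(docs)
print([d["surface_form"] for d in synonyms])
```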

Configuration properties

Use this job to generate synonym and similar query pairs.

id - string, required

The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-), and underscore (_). Maximum length: 63 characters.

<= 63 characters

Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
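A job ID can be checked against these constraints before the job is created. The sketch below is one way to do that in Python; it assumes the match pattern must cover the whole ID, and the function name is hypothetical.

```python
import re

# Pattern and length limit copied from the property description above.
ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_ID_LENGTH = 63


def is_valid_job_id(job_id):
    """True if job_id matches the documented pattern in full and is short enough."""
    return len(job_id) <= MAX_ID_LENGTH and ID_PATTERN.fullmatch(job_id) is not None


print(is_valid_job_id("products_synonym_detection"))
print(is_valid_job_id("1products"))
```

Under this reading, an ID must start with a letter, so "1products" is rejected while "products_synonym_detection" is accepted.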

sparkConfig - array[object]

Spark configuration settings.

object attributes: {
  key (required): { display name: Parameter Name, type: string }
  value: { display name: Parameter Value, type: string }
}
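Properties of type array[object], such as sparkConfig, take a list of key/value objects rather than a flat map. As a small sketch, a helper like the following (its name is hypothetical) converts an ordinary Python dictionary into that shape; the two Spark settings are only example values.

```python
def to_key_value_list(settings):
    """Convert {name: value} into a list of {"key": ..., "value": ...} objects."""
    # Values are declared as strings in the schema above, so convert them.
    return [{"key": key, "value": str(value)} for key, value in settings.items()]


spark_config = to_key_value_list({
    "spark.executor.memory": "4g",
    "spark.sql.shuffle.partitions": 200,
})
print(spark_config)
```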

trainingCollection - string, required

Collection containing queries, document IDs, and event counts. Can be either a signal aggregation collection or a raw signals collection.

>= 1 characters

fieldToVectorize - string, required

Field containing queries. Change to query when running against raw signals.

>= 1 characters

Default: query_s

dataFormat - string, required

Spark-compatible format that contains training data (such as 'solr', 'parquet', or 'orc').

>= 1 characters

Default: solr

trainingDataFrameConfigOptions - object

Additional spark dataframe loading configuration options

trainingDataFilterQuery - string

Solr query to use when loading training data if using Solr, Spark SQL expression for all other data sources

Default: *:*

sparkSQL - string

Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input

Default: SELECT * from spark_input

trainingDataSamplingFraction - number

Fraction of the training data to use

<= 1

exclusiveMaximum: false

Default: 1

randomSeed - integer

For any deterministic pseudorandom number generation

Default: 1234

outputCollection - string

Collection to store synonym and similar query pairs.

dataOutputFormat - string

Spark-compatible output format (like 'solr', 'parquet', etc)

>= 1 characters

Default: solr

partitionCols - string

If writing to non-Solr sources, this field will accept a comma-delimited list of column names for partitioning the dataframe before writing to the external output

writeOptions - array[object]

Options used when writing output to Solr or other sources

object attributes: {
  key (required): { display name: Parameter Name, type: string }
  value: { display name: Parameter Value, type: string }
}

readOptions - array[object]

Options used when reading input from Solr or other sources.

object attributes: {
  key (required): { display name: Parameter Name, type: string }
  value: { display name: Parameter Value, type: string }
}

misspellingCollection - string

Solr collection containing the reviewed results of the Token and Phrase Spell Correction job. Defaults to the query_rewrite_staging collection for the app.

misspellingsFilterQuery - string

Solr query to additionally filter the misspelling results. Defaults to reading all approved spell corrections.

Default: type:spell

keyPhraseCollection - string

Solr collection containing the reviewed results of the Phrase Extraction job. Defaults to the query_rewrite_staging collection for the app.

keyPhraseFilterQuery - string

Solr query to additionally filter the phrase extraction results. Defaults to reading all approved phrases.

Default: type:phrase

misspellingSQL - string

Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spell_input

Default: SELECT surface_form AS misspelling_s, output AS correction_s FROM spell_input WHERE doc_type = 'query_rewrite' AND type = 'spell' AND review IN ('approved' OR 'auto')

misspellingSQLDataFormat - string, required

Spark-compatible format that contains spelling data (such as 'solr', 'parquet', or 'orc').

>= 1 characters

Default: solr

phraseSQL - string

Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as phrase_input

Default: SELECT surface_form AS phrases_s, coalesce(confidence, lit(1d)) AS likelihood_d, coalesce(word_count, lit(1d)) AS word_num_i FROM phrase_input WHERE doc_type = 'query_rewrite' AND type = 'phrase' AND review IN ('approved' OR 'auto')

phraseSQLDataFormat - string, required

Spark-compatible format that contains phrase data (such as 'solr', 'parquet', or 'orc').

>= 1 characters

Default: solr

countField - string, required

Solr field containing the number of events (for example, the number of clicks). Change to count_i when running against raw signals.

Default: aggr_count_i

docIdField - string, required

Solr field containing the ID of the document that the user clicked. Change to doc_id when running against a raw signals collection.

Default: doc_id_s
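Three properties change together when the job reads raw signals instead of aggregated signals: fieldToVectorize, countField, and docIdField. The sketch below collects both sets of values from the descriptions above in one place; the constant and function names are hypothetical.

```python
# Defaults, intended for an aggregated signals collection.
AGGREGATED_FIELDS = {
    "fieldToVectorize": "query_s",
    "countField": "aggr_count_i",
    "docIdField": "doc_id_s",
}

# Values the property descriptions say to use for raw signals.
RAW_SIGNAL_FIELDS = {
    "fieldToVectorize": "query",
    "countField": "count_i",
    "docIdField": "doc_id",
}


def signal_field_settings(use_raw_signals):
    """Return a copy of the field settings matching the kind of input collection."""
    return dict(RAW_SIGNAL_FIELDS if use_raw_signals else AGGREGATED_FIELDS)


print(signal_field_settings(use_raw_signals=True))
```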

overlapThreshold - number

The threshold above which query pairs are considered similar. Because this value acts as a minimum, lowering it yields more synonym pairs, but their quality may be reduced.

Default: 0.5

similarityThreshold - number

The threshold above which synonym pairs are considered similar. Because this value acts as a minimum, lowering it yields more synonym pairs, but their quality may be reduced.

Default: 0.01

minQueryCount - integer

The minimum number of clicked documents needed for comparing queries.

Default: 5

keywordsBlobName - string

Name of the keywords blob resource. Typically, this is a CSV file uploaded to the blob store in the keyword,type format described under Keywords.

synonymBlobName - string

Name of the custom synonym blob resource. This is a Solr synonym file that will be used in the synonym detection job and will override any generated synonyms (indicated by a 'supplied' field in the Rules UI).

analyzerConfigQuery - string

LuceneTextAnalyzer schema for tokenizing queries (JSON-encoded)

>= 1 characters

Default: { "analyzers": [ { "name": "LetterTokLowerStem","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "letter" },"filters": [{ "type": "lowercase" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "KStem" }] }],"fields": [{ "regex": ".+", "analyzer": "LetterTokLowerStem" } ]}
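To get a feel for what the default analyzer does to a query, the sketch below parses the default configuration and approximates its tokenizer and first two filters in plain Python. It is not the Lucene implementation: the HTML strip character filter and KStem stemming are left out, and the regular expression is only an approximation of the letter tokenizer. The function name is hypothetical.

```python
import json
import re

# Default value of analyzerConfigQuery, copied from above.
DEFAULT_ANALYZER_CONFIG = (
    '{ "analyzers": [ { "name": "LetterTokLowerStem",'
    '"charFilters": [ { "type": "htmlstrip" } ],'
    '"tokenizer": { "type": "letter" },'
    '"filters": [{ "type": "lowercase" },'
    '{ "type": "length", "min": "2", "max": "32767" },'
    '{ "type": "KStem" }] }],'
    '"fields": [{ "regex": ".+", "analyzer": "LetterTokLowerStem" } ]}'
)

config = json.loads(DEFAULT_ANALYZER_CONFIG)
filters = config["analyzers"][0]["filters"]
length_filter = next(f for f in filters if f["type"] == "length")
MIN_LEN = int(length_filter["min"])
MAX_LEN = int(length_filter["max"])


def approximate_tokens(query):
    """Split on non-letters, lowercase, and drop tokens outside the length bounds."""
    tokens = re.findall(r"[^\W\d_]+", query)  # runs of letters only
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if MIN_LEN <= len(t) <= MAX_LEN]


print(approximate_tokens("Red Bike's 2-pack"))
```

In this approximation, digits and punctuation act as separators, and the one-letter token left over from "Bike's" is removed by the length filter.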

enableAutoPublish - boolean

If true, automatically publishes rewrites for rules. Default is false to allow for initial human-aided reviewing

Default: false

sparkPartitions - integer

Spark will re-partition the input to have this number of partitions. Increase for greater parallelism

Default: 200

type - string, required

Default: synonymDetection

Allowed values: synonymDetection
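Putting the properties together, a job definition might look like the sketch below, which builds the configuration as a Python dictionary and serializes it to JSON. The collection prefix "products" stands in for COLLECTION_NAME and is hypothetical; the remaining values are the defaults listed above. How the JSON is submitted is not covered here.

```python
import json

# Properties marked "required" in the list above.
REQUIRED_PROPERTIES = [
    "id", "trainingCollection", "fieldToVectorize", "dataFormat",
    "misspellingSQLDataFormat", "phraseSQLDataFormat",
    "countField", "docIdField", "type",
]

job_config = {
    "type": "synonymDetection",
    "id": "products_synonym_detection",             # hypothetical job ID
    "trainingCollection": "products_signals_aggr",  # hypothetical collection
    "fieldToVectorize": "query_s",
    "dataFormat": "solr",
    "countField": "aggr_count_i",
    "docIdField": "doc_id_s",
    "misspellingSQLDataFormat": "solr",
    "phraseSQLDataFormat": "solr",
    # Optional properties, shown with their documented defaults.
    "outputCollection": "products_query_rewrite_staging",  # hypothetical collection
    "overlapThreshold": 0.5,
    "similarityThreshold": 0.01,
    "minQueryCount": 5,
    "enableAutoPublish": False,
    "sparkPartitions": 200,
}

# Simple sanity check that no required property was forgotten.
missing = [name for name in REQUIRED_PROPERTIES if name not in job_config]
payload = json.dumps(job_config, indent=2)
print(missing)
print(payload)
```

Leaving enableAutoPublish set to false keeps the generated rewrites in the staging collection for human review before they are published.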