Phrase Extraction Jobs
Identify multi-word phrases in signals.
Default job name | COLLECTION_NAME_phrase_extraction
Input | Raw signals (the COLLECTION_NAME_signals collection by default)
Output | Extracted phrases (the COLLECTION_NAME_query_rewrite_staging collection by default)
Required signals fields | query, count_i, type, timestamp_tdt, user_id, doc_id, session_id, fusion_query_id
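For reference, a raw signal containing the required fields might look like the following sketch; all of the values are invented for illustration:

    {
      "query": "red wine glasses",
      "count_i": 1,
      "type": "click",
      "timestamp_tdt": "2024-01-15T12:00:00Z",
      "user_id": "u-1001",
      "doc_id": "d-2002",
      "session_id": "s-3003",
      "fusion_query_id": "q-4004"
    }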
This job writes to the COLLECTION_NAME_query_rewrite_staging collection. It also uses reviewed documents from that collection to improve the accuracy of the job. You can review, edit, deploy, or delete output from this job using the Query Rewriting UI.
For most use cases, the minimum configuration for this job consists of these fields (a minimal example follows this list):
- id /Spark Job ID: Give this job an arbitrary ID string.
- trainingCollection /Training Collection: Specify the input collection.
- fieldToVectorize /Field to Vectorize: Specify the field in the input collection where phrases can be found.
- outputCollection /Output Collection: Specify the collection in which the output documents should be indexed.
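For illustration, a minimal job configuration might look like the following sketch. The job ID, collection names, and field name are placeholders; the type value sip comes from the configuration reference below:

    {
      "type": "sip",
      "id": "my_app_phrase_extraction",
      "trainingCollection": "my_app_signals",
      "fieldToVectorize": "query",
      "outputCollection": "my_app_query_rewrite_staging"
    }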
When running this job over a content document collection, be sure to set attachPhrases /Extract Key Phrases from Input Text to "true". The default is "false", which works well when running the job over a signals collection.
By default, the job only outputs the phrases found in the original documents. In each row of the phrases output, these fields are the most useful:
- The phrase itself is in the phrases_s field, which can be used for faceting.
- The likelihood_d field gives the likelihood that the phrase is legitimate, from 0 to infinity. Low-probability phrases are automatically trimmed from the results.
- When a phrase's likelihood value is ambiguous, the review field is set to "true" to indicate that the phrase should be reviewed (see the query sketch after this list).
- The phrase_count field indicates the number of instances of the phrase in the input collection.
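As an example of the review workflow described above, a Solr JSON Request API body like the following sketch would return phrases flagged for review, highest phrase_count first; the field list and limit are arbitrary:

    {
      "query": "doc_type_s:key_phrases AND review:true",
      "sort": "phrase_count desc",
      "fields": ["phrases_s", "likelihood_d", "phrase_count"],
      "limit": 20
    }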
The complete list of output fields is shown below.
aggr_id_s | The name of the Phrase Extraction job that generated this document.
doc_type_s | This is always key_phrases for documents generated by a Phrase Extraction job.
id | A unique ID for this document.
input_collection | The collection used for this job's input.
likelihood_d | The likelihood that this phrases_s is a phrase, from 0 to infinity.
phrase_count | The number of occurrences of this phrase in the input collection.
phrases_s | The phrase detected by the job.
review | "True" indicates that this may not be a valid phrase and should be reviewed.
score |
timestamp | The date and time when the document was generated.
word_num_i | The number of words in this phrase.
_version_ | An internal Solr field used for partial updates.
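For illustration, a single phrase document in the output collection might look like the following sketch; the values are invented and the exact set of fields may vary:

    {
      "id": "a1b2c3d4",
      "aggr_id_s": "my_app_phrase_extraction",
      "doc_type_s": "key_phrases",
      "input_collection": "my_app_signals",
      "phrases_s": "red wine glasses",
      "phrase_count": 1250,
      "word_num_i": 3,
      "likelihood_d": 7.42,
      "review": "false",
      "timestamp": "2024-01-15T12:00:00Z"
    }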
If the attachPhrases /Extract Key Phrases from Input Text parameter is set to "true", then the job also outputs the original documents from the input collection with an appended field, phrases_extracted_tt, that lists the extracted phrases from each document (see the sketch below). The way to distinguish the phrases output from the original document output is the doc_type_s field: phrase documents have a doc_type_s value of key_phrases.
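Assuming attachPhrases is "true", an original document written to the output collection might carry the appended field as in the following sketch; the id and title_t values are invented, and the number of phrases attached depends on the source document:

    {
      "id": "doc-42",
      "title_t": "Cabernet sauvignon food pairing guide",
      "phrases_extracted_tt": ["cabernet sauvignon", "food pairing"]
    }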
Use this job when you want to identify statistically significant phrases in your content. The job's configuration properties are described below.
analyzerConfig - string (required)
The style of text analyzer you would like to use.
Default:
    {
      "analyzers": [
        {
          "name": "StdTokLowerStop",
          "charFilters": [ { "type": "htmlstrip" } ],
          "tokenizer": { "type": "standard" },
          "filters": [ { "type": "lowercase" } ]
        }
      ],
      "fields": [
        { "regex": ".+", "analyzer": "StdTokLowerStop" }
      ]
    }
attachPhrases - boolean
If checked ("true"), the job attaches the extracted phrases to each source document and writes the documents back to the output collection. If the input data is signals, it is suggested to leave this option off. Currently, this option cannot be enabled when writing to a _query_rewrite_staging collection.
Default: false
dataFormat - string (required)
Spark-compatible format of the training data, such as 'solr', 'hdfs', 'file', or 'parquet'.
>= 1 characters
Default: solr
enableAutoPublish - boolean
If "true", the job automatically publishes rewrites for rules. The default is "false", to allow for initial human-aided review.
Default: false
fieldToVectorize - string (required)
Solr field containing the text training data. Data from multiple fields with different weights can be combined by specifying them as field1:weight1,field2:weight2, and so on (see the example below).
>= 1 characters
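For example, assuming hypothetical title_t and description_t fields in the input collection, a weighted combination could be specified as:

    "fieldToVectorize": "title_t:1.5,description_t:1.0"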
id - string (required)
The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.
<= 63 characters
Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
minLikelihood - number
Phrases with a likelihood below this threshold are not written to the output of this job.
minmatch - integer
The number of times a phrase must occur in order to be considered. Note: if the input is not signals data, reduce this number (to 5, for example).
>= 1
exclusiveMinimum: false
Default: 100
ngramSize - integer
The number of words in the n-grams to consider for the SIPs (statistically significant phrases).
>= 2
<= 5
exclusiveMinimum: false
exclusiveMaximum: false
Default: 3
outputCollection - string
Solr Collection to store extracted phrases; defaults to the query_rewrite_staging collection for the associated app.
randomSeed - integer
Seed used for deterministic pseudorandom number generation.
Default: 8180
readOptions - array[object]
Options used when reading input from Solr or other sources.
Object attributes:
- key (required): Parameter Name (string)
- value: Parameter Value (string)
sourceFields - string
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
sparkConfig - array[object]
Spark configuration settings.
Object attributes:
- key (required): Parameter Name (string)
- value: Parameter Value (string)
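As an illustrative sketch, each sparkConfig entry is a key/value pair holding a standard Spark setting; the values shown here are arbitrary:

    "sparkConfig": [
      { "key": "spark.executor.memory", "value": "4g" },
      { "key": "spark.sql.shuffle.partitions", "value": "200" }
    ]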
sparkPartitions - integer
Spark will re-partition the input to have this number of partitions. Increase this value for greater parallelism.
Default: 200
trainingCollection - string (required)
Solr collection containing labeled training data.
>= 1 characters
trainingDataFilterQuery - string
Solr query to use when loading training data (when the data source is Solr); a Spark SQL expression for all other data sources.
Default: *:*
trainingDataFrameConfigOptions - object
Additional Spark DataFrame loading configuration options.
trainingDataSamplingFraction - number
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
type - string (required)
Default: sip
Allowed values: sip
writeOptions - array[object]
Options used when writing output to Solr.
Object attributes:
- key (required): Parameter Name (string)
- value: Parameter Value (string)
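Putting several of these properties together, a fuller job configuration might look like the following sketch. All collection names, the job ID, and the specific values are illustrative only and should be adjusted to your data:

    {
      "type": "sip",
      "id": "my_app_phrase_extraction",
      "trainingCollection": "my_app_signals",
      "fieldToVectorize": "query",
      "dataFormat": "solr",
      "trainingDataFilterQuery": "*:*",
      "outputCollection": "my_app_query_rewrite_staging",
      "attachPhrases": false,
      "minmatch": 100,
      "ngramSize": 3,
      "sparkPartitions": 200
    }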