Phrase Extraction Jobs

Identify multi-word phrases in signals.

The Phrase Extraction job is designed for use with small amounts of text only. Large documents can cause the job to fail when it runs at scale.

This job writes to the _query_rewrite_staging collection. It also uses reviewed documents from that collection to improve the accuracy of the job.

To use this job, you must first upload the OpenNLP Maxent model to the blob store.

Minimum configuration

For most use cases, the minimum configuration for this job consists of these fields:

id/Spark Job ID

Give this job an arbitrary ID string.

trainingCollection/Training Collection

Specify the input collection.

fieldToVectorize/Field to Vectorize

Specify the field in the input collection where phrases can be found.

outputCollection/Output Collection

Specify the collection in which the output documents should be indexed.

When running this job over a content document collection, be sure to set attachPhrases/Extract Key Phrases from Input Text to "true". The default is "false", which works well when running the job over a signals collection.
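As an illustration, the minimum configuration could be assembled as a plain dictionary and serialized to JSON. This is a sketch only: the property names come from the fields described above, but the helper function and every value (the job ID, the collection names, and the field name) are hypothetical placeholders. Other properties that the job schema marks as required, such as type and analyzerConfig, are not shown here.

```python
import json


def minimal_phrase_job_config(job_id, training_collection, field,
                              output_collection, content_collection=False):
    """Build the minimum configuration as a dict (hypothetical helper)."""
    return {
        "id": job_id,
        "trainingCollection": training_collection,
        "fieldToVectorize": field,
        "outputCollection": output_collection,
        # False (the default) suits a signals collection;
        # use True when the input is a content document collection.
        "attachPhrases": content_collection,
    }


# All values below are made up for illustration.
config = minimal_phrase_job_config(
    "phrase-extraction-demo",
    "products_signals",
    "query_s",
    "products_phrases",
)
print(json.dumps(config, indent=2))
```

Passing content_collection=True to the helper corresponds to setting attachPhrases to "true" for a content document collection.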

Output documents

By default, the job outputs only the phrases found in the input documents. In each row of the phrases output, these fields are most useful:

The phrase itself is in the phrases_s field, which can be used for faceting.
The likelihood_d field gives the likelihood that the phrase is legitimate, from 0 to infinity. Low-probability phrases are automatically trimmed from the results.
When a phrase’s likelihood value is ambiguous, the review field is set to "true" to indicate that the phrase should be reviewed.
A phrase_count field indicates the number of instances of the phrase in the input collection.
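The fields above lend themselves to a simple triage step after the job runs. The sketch below assumes the phrase rows have already been fetched from the output collection as a list of dictionaries. The function name and the sample rows are invented for illustration, and the review flag is handled as either a boolean or the string "true" because its stored type is not specified here.

```python
def triage_phrases(rows):
    """Split phrase rows into (accepted, needs_review).

    Both lists are sorted by likelihood_d, highest first.
    """
    def is_flagged(row):
        # The review flag may arrive as a boolean or as the string "true".
        return str(row.get("review", "false")).lower() == "true"

    by_likelihood = sorted(rows, key=lambda r: r["likelihood_d"], reverse=True)
    accepted = [r for r in by_likelihood if not is_flagged(r)]
    needs_review = [r for r in by_likelihood if is_flagged(r)]
    return accepted, needs_review


# Made-up sample rows, not real job output.
sample_rows = [
    {"phrases_s": "laptop bag", "likelihood_d": 12.4, "review": False, "phrase_count": 31},
    {"phrases_s": "usb c", "likelihood_d": 48.9, "review": False, "phrase_count": 120},
    {"phrases_s": "for sale", "likelihood_d": 3.1, "review": "true", "phrase_count": 9},
]

accepted, needs_review = triage_phrases(sample_rows)
for row in accepted:
    print(row["phrases_s"], row["likelihood_d"], row["phrase_count"])
```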

The complete list of output fields is shown below.

Output fields

aggr_id_s

The name of the Phrase Extraction job that generated this document.

doc_type_s

This is always key_phrases for phrase documents generated by a Phrase Extraction job.

id

A unique ID for this document.

input_collection

The collection used for this job’s input.

likelihood_d

The likelihood that the value of phrases_s is a legitimate phrase, from 0 to infinity.

phrase_count

The number of occurrences of this phrase in the input collection.

phrases_s

The phrase detected by the job.

review

"true" indicates that this may not be a valid phrase and should be reviewed.

score

This is always "1".

timestamp

The date and time when the document was generated.

word_num_i

The number of words in this phrase.

_version_

An internal Solr field used for partial updates.

If the attachPhrases/Extract Key Phrases from Input Text parameter is set to "true", then the job also outputs the original documents from the input collection with an appended field, phrases_extracted_tt, that lists the extracted phrases from this document.

To distinguish the phrases output from the original document output, check the doc_type_s field, which has one of these values:

key_phrases denotes phrases output.
original_doc_with_phrases denotes the original documents.
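When both kinds of document share one output collection, client code can separate them by doc_type_s. The following is a minimal sketch over documents already loaded as dictionaries. The function name and sample documents are hypothetical, and representing phrases_extracted_tt as a list of strings is an assumption about how the field is returned.

```python
from collections import defaultdict


def split_by_doc_type(docs):
    """Group output documents by their doc_type_s value."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc.get("doc_type_s", "unknown")].append(doc)
    return groups


# Made-up sample documents, not real job output.
sample_docs = [
    {"id": "p1", "doc_type_s": "key_phrases", "phrases_s": "noise cancelling"},
    {"id": "d1", "doc_type_s": "original_doc_with_phrases",
     "phrases_extracted_tt": ["noise cancelling"]},
    {"id": "p2", "doc_type_s": "key_phrases", "phrases_s": "wireless charger"},
]

groups = split_by_doc_type(sample_docs)
phrase_rows = groups["key_phrases"]
original_docs = groups["original_doc_with_phrases"]
print(len(phrase_rows), "phrase rows,", len(original_docs), "original documents")
```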

Phrase Extraction Jobs

id - string, required

trainingCollection - string, required

fieldToVectorize - string, required

dataFormat - string

trainingDataFrameConfigOptions - object

trainingDataFilterQuery - string

trainingDataSamplingFraction - number

randomSeed - integer

outputCollection - string, required

sourceFields - string

ngramSize - integer

minmatch - integer

analyzerConfig - string, required

attachPhrases - boolean

type - string, required
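Before submitting a job definition, it can help to confirm that every property marked as required in the list above is present. The check below is a sketch under that assumption: the helper name is invented, the sample values are placeholders, and it does not validate property types or allowed values.

```python
# Properties marked "required" in the property list above.
REQUIRED_PROPERTIES = (
    "id",
    "trainingCollection",
    "fieldToVectorize",
    "outputCollection",
    "analyzerConfig",
    "type",
)


def missing_required(config):
    """Return the required property names that are absent or empty."""
    return [name for name in REQUIRED_PROPERTIES
            if config.get(name) in (None, "")]


# Placeholder values for illustration only.
draft = {
    "id": "phrase-extraction-demo",
    "trainingCollection": "products_signals",
    "fieldToVectorize": "query_s",
    "outputCollection": "products_phrases",
}

missing = missing_required(draft)
if missing:
    print("Missing required properties:", ", ".join(missing))
```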