Collection Analysis Jobs

Use this job when you want to compute basic metrics about your collection, such as average word length, phrase percentages, and outlier documents (those with very many or very few words).

Computes basic metrics about a collection and writes the results to an output collection.

id - string, required

The ID for this Spark job. Used in the API to reference this job.

<= 128 characters

Match pattern: ^[A-Za-z0-9_\-]+$
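
The length limit and match pattern above can be checked before a job is submitted. The following is a minimal sketch, not part of the product API; the helper name `is_valid_job_id` is hypothetical.

```python
import re

# Pattern and length limit as listed for the `id` property above.
ID_PATTERN = re.compile(r"^[A-Za-z0-9_\-]+$")
MAX_ID_LENGTH = 128


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id satisfies the documented length and pattern constraints.

    Hypothetical helper for illustration only.
    """
    return len(job_id) <= MAX_ID_LENGTH and ID_PATTERN.fullmatch(job_id) is not None
```

For example, `is_valid_job_id("collection_analysis-01")` passes, while an ID containing a space or exceeding 128 characters does not.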

trainingCollection - string, required

Solr Collection containing labeled training data

>= 1 characters

fieldToVectorize - string, required

Solr field containing text training data for prediction/clustering instances. To analyze multiple fields with different weights, use the format field1:weight1,field2:weight2

>= 1 characters
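
To illustrate the field1:weight1,field2:weight2 format, here is a small sketch of how such a value could be split into field names and weights. The function `parse_field_weights` and the field names in the usage note are hypothetical, and the default weight of 1.0 for a field listed without a weight is an assumption, not documented behavior.

```python
from typing import Dict


def parse_field_weights(spec: str) -> Dict[str, float]:
    """Parse 'field1:weight1,field2:weight2' into a {field: weight} mapping.

    Assumption: a field given without ':weight' is assigned weight 1.0.
    """
    weights: Dict[str, float] = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # skip empty entries such as a trailing comma
        name, sep, weight = part.partition(":")
        weights[name.strip()] = float(weight) if sep else 1.0
    return weights
```

With this sketch, `parse_field_weights("title_t:2.0,body_t:1.0")` yields a mapping in which `title_t` counts twice as much as `body_t`.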

dataFormat - string

Spark-compatible format of the training data (for example 'solr', 'hdfs', 'file', or 'parquet')

Default: solr

Allowed values: solr, hdfs, file, parquet

trainingDataFrameConfigOptions - object

Additional Spark DataFrame loading configuration options

trainingDataFilterQuery - string

Solr query to use when loading training data

>= 3 characters

Default: *:*

trainingDataSamplingFraction - number

Fraction of the training data to use

<= 1

exclusiveMaximum: false

Default: 1

randomSeed - integer

Seed for any deterministic pseudorandom number generation

Default: 1234
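
The interaction between trainingDataSamplingFraction and randomSeed can be pictured with a plain-Python sketch: each document is kept with probability equal to the fraction, and a fixed seed makes the selection repeatable. This is only an illustration of seeded sampling, not the job's actual Spark implementation; `sample_documents` is a hypothetical name, and the lower bound of 0 on the fraction is an assumption.

```python
import random


def sample_documents(docs, fraction=1.0, seed=1234):
    """Keep each document independently with probability `fraction`.

    Assumption: fraction lies in [0, 1]; only the upper bound is documented.
    """
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # same seed, same sequence, same sample
    return [doc for doc in docs if rng.random() < fraction]
```

Running the function twice with the same seed returns the same subset, and a fraction of 1 keeps every document.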

outputCollection - string

Solr Collection to store model-labeled data to

sourceFields - string

Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.

numDeviations - integer, required

Number of standard deviations away from the mean that is considered acceptable for this collection. To keep all documents, set this to a high value.

exclusiveMinimum: false
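
As a rough illustration of what a standard-deviation threshold means, the sketch below flags values that lie more than a given number of standard deviations from the mean. It assumes the metric is a per-document length and uses the population standard deviation; both are assumptions for illustration, and `find_outliers` is a hypothetical helper rather than the job's own logic.

```python
from statistics import mean, pstdev


def find_outliers(lengths, num_deviations):
    """Return indices of values more than num_deviations std devs from the mean."""
    mu = mean(lengths)
    sigma = pstdev(lengths)  # population standard deviation (an assumption)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    return [i for i, n in enumerate(lengths) if abs(n - mu) > num_deviations * sigma]


# Hypothetical document lengths: six short documents and one very long one.
lengths = [10, 12, 11, 9, 10, 11, 200]
outliers_at_2 = find_outliers(lengths, 2)
outliers_at_3 = find_outliers(lengths, 3)
```

In this sample the long document is flagged at 2 deviations but not at 3, which matches the guidance that a higher value treats more documents as acceptable.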

dateField - string

The field that corresponds to the date field you will be using

type - string, required

Default: collection_analysis

Allowed values: collection_analysis
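
Putting the properties together, a job configuration might look like the following sketch, built as a Python dictionary and serialized to JSON. The collection names, field name, job ID, and numDeviations value are hypothetical placeholders; the defaults shown are the ones listed above. How the resulting JSON is submitted is outside the scope of this reference.

```python
import json

# Hypothetical example values; only the property names and defaults come from this page.
job_config = {
    "type": "collection_analysis",
    "id": "collection-analysis-example",           # hypothetical job ID
    "trainingCollection": "my_collection",         # hypothetical collection name
    "fieldToVectorize": "body_t",                  # hypothetical field name
    "dataFormat": "solr",                          # documented default
    "trainingDataFilterQuery": "*:*",              # documented default
    "trainingDataSamplingFraction": 1,             # documented default
    "randomSeed": 1234,                            # documented default
    "outputCollection": "my_collection_analysis",  # hypothetical collection name
    "numDeviations": 3,                            # hypothetical value
}

# Properties marked as required on this page.
REQUIRED = ("id", "trainingCollection", "fieldToVectorize", "numDeviations", "type")
missing = [key for key in REQUIRED if key not in job_config]
if missing:
    raise ValueError(f"missing required properties: {missing}")

config_json = json.dumps(job_config, indent=2)
print(config_json)
```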