Outlier Detection Jobs
Use this job when you want to find outliers from a set of documents and attach labels for each outlier group.
Legacy Product
The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, 0-9, dash (-) and underscore (_). The ID must start with a letter. Maximum length: 63 characters.
<= 63 characters
Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?
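As an illustration of the two constraints above, the following minimal sketch checks a candidate ID against the documented pattern and length limit before the job is created. The function name `is_valid_job_id` is hypothetical and not part of the product API.

```python
import re

# Pattern copied from the reference above; the length cap is checked separately.
JOB_ID_PATTERN = re.compile(r"[a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?")
MAX_JOB_ID_LENGTH = 63


def is_valid_job_id(job_id: str) -> bool:
    """Return True if job_id fits the documented pattern and length limit."""
    if len(job_id) > MAX_JOB_ID_LENGTH:
        return False
    # fullmatch requires the whole string to match, not just a prefix.
    return JOB_ID_PATTERN.fullmatch(job_id) is not None


if __name__ == "__main__":
    for candidate in ["outlier-detection_1", "1job", "my job"]:
        print(candidate, is_valid_job_id(candidate))
```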
Spark configuration settings.
object attributes: {
  key (required): { display name: Parameter Name; type: string }
  value: { display name: Parameter Value; type: string }
}
Solr collection containing the documents to be clustered.
>= 1 characters
Solr field containing text training data. To combine data from multiple fields with different weights, specify them as field1:weight1,field2:weight2 and so on.
>= 1 characters
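To make the field1:weight1,field2:weight2 syntax concrete, here is a minimal sketch of how such a value could be parsed into per-field weights. This is not the job's own parser. The function name `parse_weighted_fields` is hypothetical, and giving a field without an explicit weight a weight of 1.0 is an assumption, not documented behavior.

```python
def parse_weighted_fields(spec: str, default_weight: float = 1.0) -> dict[str, float]:
    """Parse 'field1:weight1,field2:weight2' into {field: weight}.

    Assumption: a field listed without ':weight' gets default_weight.
    """
    weights: dict[str, float] = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        name, sep, weight = part.partition(":")
        weights[name.strip()] = float(weight) if sep else default_weight
    return weights


if __name__ == "__main__":
    # Hypothetical field names, used only for illustration.
    print(parse_weighted_fields("title_t:2.0,body_t:1.0"))
```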
Spark-compatible format that contains the training data (such as 'solr', 'parquet', or 'orc').
>= 1 characters
Default: solr
Additional spark dataframe loading configuration options
Solr query to use when loading training data if using Solr
Default: *:*
Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input
Default: SELECT * from spark_input
Fraction of the training data to use
<= 1
exclusiveMaximum: false
Default: 1
Seed for any deterministic pseudorandom number generation.
Default: 1234
Solr collection in which to store model-labeled data.
>= 1 characters
Spark-compatible output format (like 'solr', 'parquet', etc)
>= 1 characters
Default: solr
Solr fields to load (comma-delimited). Leave empty to allow the job to select the required fields to load at runtime.
If writing to non-Solr sources, this field accepts a comma-delimited list of column names used to partition the dataframe before it is written to the external output.
Options used when writing output to Solr or other sources
object attributes: {
  key (required): { display name: Parameter Name; type: string }
  value: { display name: Parameter Value; type: string }
}
Options used when reading input from Solr or other sources.
object attributes: {
  key (required): { display name: Parameter Name; type: string }
  value: { display name: Parameter Value; type: string }
}
Identifier for the model to be trained; uses the supplied Spark Job ID if not provided.
>= 1 characters
Output field name for unique outlier group id.
Default: outlier_group_id
Output field name for the most frequent terms that are (mostly) unique to each outlier group, computed from TF-IDF and the group ID.
Default: outlier_group_label
If true, only outliers are saved in the output collection; otherwise, the whole dataset is saved.
Default: false
Field containing the unique ID for each document.
>= 1 characters
Default: id
LuceneTextAnalyzer schema for tokenization (JSON-encoded)
>= 1 characters
Default: { "analyzers": [{ "name": "StdTokLowerStop","charFilters": [ { "type": "htmlstrip" } ],"tokenizer": { "type": "standard" },"filters": [{ "type": "lowercase" },{ "type": "KStem" },{ "type": "length", "min": "2", "max": "32767" },{ "type": "fusionstop", "ignoreCase": "true", "format": "snowball", "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],"fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
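Because the schema is passed as a JSON-encoded string, a quick sanity check before submitting a customized version can catch malformed JSON or a field rule that points at an analyzer that was never defined. The sketch below does this for the default value shown above. The helper name `undefined_analyzers` is hypothetical, and the check reflects only the structure visible in the default (an `analyzers` list and a `fields` list), not the full LuceneTextAnalyzer schema.

```python
import json

# The default schema from the reference above, reflowed for readability.
DEFAULT_ANALYZER_SCHEMA = """
{ "analyzers": [{ "name": "StdTokLowerStop",
    "charFilters": [ { "type": "htmlstrip" } ],
    "tokenizer": { "type": "standard" },
    "filters": [{ "type": "lowercase" },
                { "type": "KStem" },
                { "type": "length", "min": "2", "max": "32767" },
                { "type": "fusionstop", "ignoreCase": "true", "format": "snowball",
                  "words": "org/apache/lucene/analysis/snowball/english_stop.txt" }] }],
  "fields": [{ "regex": ".+", "analyzer": "StdTokLowerStop" } ]}
"""


def undefined_analyzers(schema_json: str) -> list[str]:
    """Return analyzer names referenced under 'fields' but not defined under 'analyzers'.

    Raises json.JSONDecodeError if the string is not valid JSON.
    """
    schema = json.loads(schema_json)
    defined = {analyzer["name"] for analyzer in schema.get("analyzers", [])}
    return [
        field["analyzer"]
        for field in schema.get("fields", [])
        if field["analyzer"] not in defined
    ]


if __name__ == "__main__":
    print(undefined_analyzers(DEFAULT_ANALYZER_SCHEMA))
```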
Output field name for top frequent terms in each cluster. These may overlap with other clusters.
Default: freq_terms
Output field name for the document's distance to its corresponding cluster center (a measure of how representative the document is).
Default: dist_to_center
p-norm to normalize vectors with (choose -1 to turn normalization off)
Default: 2
Allowed values: -1, 0, 1, 2
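For readers unfamiliar with p-norm normalization, the following sketch shows what the setting means for p = 1 and p = 2, with -1 leaving the vector unchanged. It is a plain-Python illustration, not the job's Spark implementation. The function name `normalize` is hypothetical, and the sketch deliberately does not cover p = 0 because this reference does not say how the job interprets it.

```python
def normalize(vector: list[float], p: int) -> list[float]:
    """Scale vector so its p-norm equals 1; p = -1 turns normalization off."""
    if p == -1:
        return list(vector)  # normalization disabled
    if p not in (1, 2):
        raise ValueError("this sketch only covers p = 1, p = 2, and p = -1 (off)")
    # p-norm: (sum of |x|^p) ^ (1/p)
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0:
        return list(vector)  # an all-zero vector cannot be scaled
    return [x / norm for x in vector]


if __name__ == "__main__":
    print(normalize([3.0, 4.0], 2))
    print(normalize([1.0, 3.0], 1))
```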
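Minimum number of documents in which a term must appear. A value < 1.0 denotes a percentage expressed as a fraction (for example, 0.75 means 75% of documents), a value of 1.0 denotes 100%, and a value > 1.0 denotes an exact number of documents.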
Default: 5
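Maximum number of documents in which a term can appear. A value < 1.0 denotes a percentage expressed as a fraction (for example, 0.75 means 75% of documents), a value of 1.0 denotes 100%, and a value > 1.0 denotes an exact number of documents.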
Default: 0.75
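The dual meaning of these two settings (fraction versus exact count) is easy to misread, so here is a small sketch of how the documented rule could be applied with the defaults of 5 and 0.75. The names `resolve_doc_threshold` and `keep_term` are hypothetical, and treating both bounds as inclusive is an assumption that this reference does not confirm.

```python
def resolve_doc_threshold(value: float, total_docs: int) -> float:
    """Turn a threshold setting into a document count.

    value <= 1.0 is read as a fraction of total_docs (1.0 means 100%);
    value > 1.0 is read as an exact number of documents.
    """
    if value <= 0:
        raise ValueError("threshold must be positive")
    return value * total_docs if value <= 1.0 else value


def keep_term(doc_freq: int, total_docs: int,
              min_df: float = 5, max_df: float = 0.75) -> bool:
    """Return True if a term's document frequency lies within both bounds.

    Assumption: both bounds are inclusive.
    """
    lower = resolve_doc_threshold(min_df, total_docs)
    upper = resolve_doc_threshold(max_df, total_docs)
    return lower <= doc_freq <= upper


if __name__ == "__main__":
    # With 1,000 documents, the defaults keep terms found in 5 to 750 documents.
    for df in (4, 5, 750, 751):
        print(df, keep_term(df, total_docs=1000))
```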
Number of Keywords needed for labeling each cluster.
Default: 5
Number of clusters to help find outliers.
Default: 10
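A cluster is identified as an outlier group if it contains less than this percentage of the total documents. A value < 1.0 denotes a percentage expressed as a fraction (for example, 0.01 means 1%), a value of 1.0 denotes 100%, and a value > 1.0 denotes an exact number of documents.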
Default: 0.01
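To show how the cutoff interacts with cluster sizes, the sketch below flags clusters whose size falls below the threshold. It illustrates the rule as described here and is not the job's actual implementation. The function name `outlier_groups` and the example cluster sizes are made up for illustration.

```python
def outlier_groups(cluster_sizes: dict[int, int], threshold: float = 0.01) -> list[int]:
    """Return the IDs of clusters small enough to count as outlier groups.

    threshold <= 1.0 is read as a fraction of all documents;
    threshold > 1.0 is read as an exact document count.
    """
    total_docs = sum(cluster_sizes.values())
    cutoff = threshold * total_docs if threshold <= 1.0 else threshold
    # "Less than" in the description is taken literally: strict comparison.
    return sorted(cid for cid, size in cluster_sizes.items() if size < cutoff)


if __name__ == "__main__":
    sizes = {0: 600, 1: 395, 2: 5}  # hypothetical cluster ID -> document count
    print(outlier_groups(sizes))        # default 0.01: cutoff is 1% of 1,000 documents
    print(outlier_groups(sizes, 400))   # exact-count form: cutoff is 400 documents
```

With the default of 0.01 and the default of 10 clusters, a cluster needs to hold fewer than 1% of the documents to be flagged, so raising the number of clusters tends to produce more small clusters that qualify.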
Default: outlier_detection
Allowed values: outlier_detection