Aggregation Jobs

Table of Contents

Aggregation properties
Configuration properties

Aggregation jobs compile your raw signals into aggregated signals. Most Fusion jobs that consume signals require aggregated signals.

Default job name

COLLECTION_NAME_click_signals_aggregation

Input

Raw signals (the COLLECTION_NAME_signals collection by default)

Output

Aggregated signals (the COLLECTION_NAME_signals_aggr collection by default)

query

count_i

type

timstamp_tdt

user_id

doc_id

session_id

fusion_query_id

Required signals fields:

[1]

1 Required if you are using response signals.

Aggregation properties

The aggregation process is specified by an aggregation type consisting of the following list of properties:

Name Description

Name	Description
`id`	Aggregation ID
`groupingFields`	List of signal field names
`signalTypes`	List of signal types
`aggregator`	Symbolic name of the aggregator implementation
`selectQuery`	Query string, default `:`
`sort`	Ordering of aggregated signals
`timeRange`	String specifying time range, e.g., `[* TO NOW]`
`outputPipeline`	Pipeline ID for processing aggregated events
`outputCollection`	Output collection name
`rollupPipeline`	Rollup pipeline ID
`rollupAggregator`	Name of the aggregator implementation used for rollups
`sourceRemove`	Boolean, default is false
`sourceCatchup`	Boolean, default is true
`outputRollup`	Boolean, default is true
`aggregates`	List of aggregation functions
`params`	Arbitrary parameters to be used by specific aggregator implementations

id

Aggregation ID

groupingFields

List of signal field names

signalTypes

List of signal types

aggregator

Symbolic name of the aggregator implementation

selectQuery

Query string, default *:*

sort

Ordering of aggregated signals

timeRange

String specifying time range, e.g., [* TO NOW]

outputPipeline

Pipeline ID for processing aggregated events

outputCollection

Output collection name

rollupPipeline

Rollup pipeline ID

rollupAggregator

Name of the aggregator implementation used for rollups

sourceRemove

Boolean, default is false

sourceCatchup

Boolean, default is true

outputRollup

Boolean, default is true

aggregates

List of aggregation functions

params

Arbitrary parameters to be used by specific aggregator implementations

Configuration properties

Use this job when you want to aggregate your data in some way.

id - stringrequired

The ID for this Spark job. Used in the API to reference this job. Allowed characters: a-z, A-Z, dash (-) and underscore (_). Maximum length: 63 characters.

<= 63 characters

Match pattern: [a-zA-Z][_\-a-zA-Z0-9]*[a-zA-Z0-9]?

sparkConfig - array[object]

Spark configuration settings.

object attributes:{key required : {
display name: Parameter Name
type: string
}value : {
display name: Parameter Value
type: string
}}

inputCollection - stringrequired

Collection containing signals to be aggregated.

outputCollection - string

The collection to write the aggregates to on output. This property is required if the selected output / rollup pipeline requires it (the default pipeline does). A special value of '-' disables the output.

>= 1 characters

rows - integer

Number of rows to read from the source collection per request.

Default: 10000

sql - stringrequired

Use SQL to perform the aggregation. You do not need to include a time range filter in the WHERE clause as it gets applied automatically before executing the SQL statement.

>= 1 characters

rollupSql - string

Use SQL to perform a rollup of previously aggregated docs. If left blank, the aggregation framework will supply a default SQL query to rollup aggregated metrics.

>= 1 characters

readOptions - array[object]

Additional configuration settings to fine-tune how input records are read for this aggregation.

object attributes:{key required : {
display name: Parameter Name
type: string
}value : {
display name: Parameter Value
type: string
}}

sourceCatchup - boolean

If checked, only aggregate new signals created since the last time the job was successfully run. If there is a record of such previous run then this overrides the starting time of time range set in 'timeRange' property. If unchecked, then all matching signals are aggregated and any previously aggregated docs are deleted to avoid double counting.

Default: true

sourceRemove - boolean

If checked, remove signals from source collection once aggregation job has finished running.

Default: false

aggregationTime - string

Timestamp to use for the aggregation results. Defaults to NOW.

referenceTime - string

Timestamp to use for computing decays and to determine the value of NOW.

skipCheckEnabled - boolean

If the catch-up flag is enabled and this field is checked, the job framework will execute a fast Solr query to determine if this run can be skipped.

Default: true

skipJobIfSignalsEmpty - boolean

Skip Job run if signals collection is empty

parameters - array[object]

Other aggregation parameters (e.g. timestamp field etc..).

object attributes:{key required : {
display name: Parameter Name
type: string
}value : {
display name: Parameter Value
type: string
}}

signalTypes - array[string]

The signal types. If not set then any signal type is selected

selectQuery - string

The query to select the desired input documents.

>= 1 characters

Default: *:*

timeRange - string

The time range to select signals on.

>= 1 characters

useNaturalKey - boolean

Use a natural key provided in the raw signals data for aggregation, rather than relying on Solr UUIDs. Migrated aggregations jobs from Fusion 4 will need this set to false.

Default: true

optimizeSegments - integer

If set to a value above 0, the aggregator job will optimize the resulting Solr collection into this many segments

exclusiveMinimum: false

Default: 0

dataFormat - stringrequired

Spark-compatible format that contains training data (like 'solr', 'parquet', 'orc' etc)

>= 1 characters

Default: solr

sparkSQL - string

Use this field to create a Spark SQL query for filtering your input data. The input data will be registered as spark_input

Default: SELECT * from spark_input

sparkPartitions - integer

Spark will re-partition the input to have this number of partitions. Increase for greater parallelism

Default: 200

type - stringrequired

Default: aggregation

Allowed values: aggregation