Indexing Signals
Signals are indexed just like other data, but instead of using a connector, you use the Signals API.
Signals data flow
This diagram shows the flow of signals data from the search app through Fusion. The numbered steps are explained below.
-
The search app sends a query to a Fusion query pipeline.
The query request should include a user ID and session query parameter to identify the user.
-
Optionally, the Fusion query pipeline queries the
COLLECTION_NAME_signals_aggr
collection to get boosts for the main query based on aggregated click data. -
The search app also sends a request signal to the Fusion
/signals
endpoint.The primary intent of a request signal is to capture the raw user query and contextual information about the user’s current activity in the app, such as the user agent and the page where they generated the query. The request signal does not contain any information about the results sent to Solr; it is created before a query is processed.
-
Once Solr returns the response to Fusion, the SearchLogger component indexes the complete request/response data into the
COLLECTION_NAME_signals
collection as a response signal using the_signals_ingest
pipeline. Therefore, the response signal captures all results from Fusion as it related to the original query.This is a departure from pre-4.0 versions of Fusion where query impressions were logged in a separate COLLECTION_NAME_logs
collection. Query activity is no longer indexed into the_logs
collection. All response signals use thefusion_query_id
(see below) as the unique document ID in Solr. -
When the user clicks a link in the search results, the search app sends a click event to the Fusion signals endpoint (which invokes the
_signals_ingest
pipeline behind the scenes).The click signal must include a field named
fusion_query_id
in theparams
object of the raw click signal. Thefusion_query_id
field is returned in the query response (from step 1) in a response header namedx-fusion-query-id
. This allows Fusion to associate a click signal with the response signal generated in step 4. Thefusion_query_id
is also used by Fusion to associate click signals with experiments. For experiments to work, each click signal must contain the correspondingfusion_query_id
that produced the document/item that was clicked. -
The
_signals_ingest
pipeline enriches signals before indexing into theCOLLECTION_NAME_signals
collection.This enrichment includes field mapping, geolocation resolution, and updating the
has_clicks
flag to "true" on request signals when the first click signal is encountered for a given request using the Update Related Document index stage. -
Fusion queries the
COLLECTION_NAME_signals
collection through a Fusion query pipeline to generate query analytics reports from raw signals. -
Behind the scenes, the SQL aggregation framework aggregates click signals to compute a weight for each query +
doc_id
+ filters group.The resulting metrics are saved to the
COLLECTION_NAME_signals_aggr
collection to generate boosts on queries to the main collection (step 2 above). -
Recommendations also use aggregated documents in the
COLLECTION_NAME_signals_aggr
collection to build a collaborative filtering-based recommender model.
Default index pipeline for signals
When indexing signals of any type for any Fusion app, Fusion always uses a default index pipeline named _signals_ingest
unless you explicitly specify a different index pipeline.
Because this pipeline is not associated with any Fusion app, it does not automatically appear in the list of index pipelines. You can find it in the Object Explorer by clicking the In No Apps filter. |
Default stages
The _signals_ingest
index pipeline has several stages:
-
Update
has_clicks
flag stageThe Update
has_clicks
flag stage is an instance of the Update Related Document stage that updates thehas_clicks
flag to "true" on an existing request signal after the first click signal is processed for the request.
The Update has_clicks
flag stage works as follows:
-
When a click signal is encountered (
type==click
) -
Look at the incoming click signal for a field named
request_id_s
, which gets set by the Format Signals stage using a distributed cache of recently processed request signals.If the
request_id_s
field is set, then send a real-timeGET
query to Solr to find a request signal with ID equal to the value of therequest_id_s
field on the click signal. To avoid re-updating request signals, the RTG query also filters onhas_clicks==false
, which avoids duplicate atomic updates on the same document in Solr. Real-timeGET
is used to avoid timing issues between a request signal being sent to Solr and when it gets committed. This prevents missing updates when clicks occur soon after the initial request signal is sent by the search app. -
If the click signal does not have the
request_id_s
field set, then do a normal Solr lookup for the request signal using:+query_id:"${query_id}" +type:request +has_clicks:false
. A click signal may not have arequest_id_s
if there is a cache miss in the distributed cache used by the Format Signals stage. -
If the stage performs a normal query, there may be multiple request signals that have the same
query_id
. This is because thequery_id
is based onsession
+query
+filter
, so if a user sends the samequery
+filter
during the same session, there will be multiple request signals with the samequery_id
value. Thus, the stage sorts to get the latest request signal to update. -
If a related document is found (in this case a request signal), then the stage updates the
has_clicks
field to true and performs an atomic update in Solr.
This stage performs its work in a background thread, so it does not impact the indexing performance of the click signal.