Configure AEM V2 Connector

Table of Contents

Configure AEM Datasource
- Configuration settings
Field data population

This document explains how to configure an AEM V2 connector to crawl data in Adobe Experience Manager. Refer to the AEM reference to learn more about how this connector works. This connector is compatible with Fusion 5.5.1 and later.

Configure AEM Datasource

Under Indexing > Datasources, click Add, then select AEM.
Enter a Configuration ID.
Enter the AEM URL (the URL used to access the AEM Admin UI) as well as the AEM username and password used to authenticate access to the QueryBuilder JSON Servlet.
In the CRXDE UI, select a path to crawl. Enter this path into Fusion. Click Add to crawl multiple paths.
Optional: To exclude paths from the crawl, enter a Java regular expression (regex) that matches the paths to exclude from the indexed content.
Enter the AEM type to crawl. In the CRXDE UI, this is the jcr:primaryType. In this example, the AEM connector is configured to crawl the AEM Type cq:Page, which represents web content pages.
To index assets with a particular file extension, locate a file type in CRXDE and enter the value of the jcr:primaryType into Fusion. In this example, the value of NY_FairHealth.pdf is dam:Asset.
You can choose which content properties to include in or exclude from the index. These parameter values are Java regexes. For example, to include only properties that start with “jcr”, enter jcr:(.*).
In Fusion, click Save when you’re done configuring the AEM datasource.
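
The regex-based settings in the steps above can be illustrated with a short sketch. It is written in Python as a stand-in for Java regex, which behaves the same way for patterns this simple. The excluded path and the helper names are hypothetical, and the sketch assumes a pattern must match the whole value; the connector's actual matching mode is not stated here.

```python
import re

# Hypothetical values. Only jcr:(.*) comes from the steps above.
EXCLUDE_PATH_PATTERNS = [r"/content/site/archive/.*"]
INCLUDE_PROPERTY_PATTERNS = [r"jcr:(.*)"]


def matches_any(patterns, value):
    """Return True if any pattern matches the whole value (assumed mode)."""
    return any(re.fullmatch(p, value) for p in patterns)


def is_path_crawled(path):
    """A path is crawled unless an exclude pattern matches it."""
    return not matches_any(EXCLUDE_PATH_PATTERNS, path)


def filter_properties(properties):
    """Keep only the properties whose names match an include pattern."""
    return {
        name: value
        for name, value in properties.items()
        if matches_any(INCLUDE_PROPERTY_PATTERNS, name)
    }


props = {
    "jcr:title": "Home",
    "jcr:primaryType": "cq:Page",
    "sling:resourceType": "site/page",
}
print(filter_properties(props))
print(is_path_crawled("/content/site/en/home"))
print(is_path_crawled("/content/site/archive/2019"))
```

Testing a pattern against a few sample paths and property names in this way before saving the datasource can help catch a regex that is too broad.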

Configuration settings

Setting	Notes
AEM URL	Required. This is the URL used to access the AEM Admin UI.
AEM Username	Required. The user should have sufficient permissions to read content paths and, if Security Trimming is needed, to access the Users/Group APIs.
AEM Password	Required.
Page Batch Size	Number of documents to fetch per page request. A higher value can increase crawling speed but also increases memory usage.
Thread wait (ms)	Number of milliseconds to wait between fetch requests. This property can be used to throttle a crawl if necessary.
Paths to search	Required. The AEM content paths to crawl.
Paths that should not be fetched	Java regex for paths that should not be fetched.
AEM Types	Required. AEM document type `jcr:primaryType` to include in the index. Examples: `cq:Page`, `dam:Asset`.
Attachment types	File extensions to index.
Content Property Include Regexes	A list of regex strings of content properties to include in indexed documents. Example: `jcr:.*`.
Content Property Exclude Regexes	A list of regex strings of content properties to exclude from indexed documents. Example: `sling:.*`.
Enable Security Trimming	Enable this setting to filter query results based on the user ID passed in with the query.
Group Mappings	AEM user groups mapped to indexed values in the security trimming field. These values are used to filter content based on the user ID passed in with the query.
Cache Expire Time (m)	Specifies how long, in minutes, a query is cached.
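
To show how Page Batch Size and Thread wait (ms) interact, here is a minimal paging sketch. It is not the connector's implementation: `crawl_pages`, `fetch_page`, and the in-memory `DOCS` list are hypothetical names used only for illustration, and the stop condition (a page shorter than the batch size) is an assumption.

```python
import time


def crawl_pages(fetch_page, page_batch_size=100, thread_wait_ms=0):
    """Yield documents one page at a time until a short page is returned."""
    offset = 0
    while True:
        batch = fetch_page(offset, page_batch_size)
        yield from batch
        if len(batch) < page_batch_size:
            break  # last page reached
        offset += page_batch_size
        # Throttle the crawl between page requests.
        time.sleep(thread_wait_ms / 1000.0)


# Stand-in for a real page request against AEM.
DOCS = [f"/content/page-{i}" for i in range(5)]


def fake_fetch(offset, limit):
    return DOCS[offset:offset + limit]


print(list(crawl_pages(fake_fetch, page_batch_size=2, thread_wait_ms=10)))
```

A larger batch size means fewer requests but more documents held in memory at once, while a longer wait reduces load on the AEM server at the cost of crawl time.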

Field data population

AEM data is indexed from multiple sources. The data from the /bin/querybuilder.json endpoint is mandatory: a document is indexed only if this data exists.
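
For orientation, a request to the QueryBuilder JSON Servlet can be sketched as follows. The parameter names (`path`, `type`, `p.limit`, `p.offset`, `p.hits`) are standard AEM QueryBuilder parameters, but the exact set the connector sends is an assumption, and the host and content path are hypothetical.

```python
from urllib.parse import urlencode


def querybuilder_url(aem_url, path, aem_type, limit, offset=0):
    """Build a QueryBuilder JSON Servlet URL for one page of results."""
    params = {
        "path": path,        # content path to search
        "type": aem_type,    # jcr:primaryType, for example cq:Page
        "p.limit": limit,    # page size
        "p.offset": offset,  # index of the first hit on this page
        "p.hits": "full",    # ask for all properties of each hit
    }
    return f"{aem_url.rstrip('/')}/bin/querybuilder.json?{urlencode(params)}"


# Hypothetical host and path, for illustration only.
url = querybuilder_url("http://localhost:4502", "/content/we-retail", "cq:Page", 100)
print(url)
```

Opening such a URL in a browser while logged in to AEM is a quick way to check that the configured user can read the path and that the type returns hits.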

The following fields can appear in an indexed document:

Field	Source	Comments
`id`	`<AEM_URL>/bin/querybuilder.json`	Field: path
`content_txt`	`<AEM_URL>/bin/querybuilder.json`	Whole data in text format.
`<rest fields>`	`<AEM_URL>/bin/querybuilder.json`	All top-level fields of the JSON object.
`body_t`	`<AEM_URL>/crx/de/download.jsp`	Used if the path ends with one of the `Attachment types` OR the path does not end with `/jcr:content`.
`body_t`	`<AEM_URL><id>`	Used if there is no `jcr` data. If the response status code is anything other than 200, Fusion assumes there is no file to download under that path.
`body_t`	`[content_txt]`	Defaults to `[content_txt]` if `body_t` is empty.
`parentPage`	`Id` of the document that contains the attachment or link.	Populated for attachments and links.
`type`	File extension of the path.	Populated for attachments and links.
`file_size`	`<AEM_URL>/bin/querybuilder.json`	`:jcr:data;` used if `jcr` data is not empty.
`file_size`	`<AEM_URL>/bin/querybuilder.json`	`dam:size;` used if `jcr` data is empty.
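
The fallback order in the table can be summarized in a small sketch. The function names and the flat dictionary used for a hit are hypothetical; where these keys actually sit in the QueryBuilder response is not specified here.

```python
def choose_body(jcr_body, direct_body, content_txt):
    """Return the first non-empty body candidate, in the table's order."""
    for candidate in (jcr_body, direct_body):
        if candidate:
            return candidate
    # body_t defaults to content_txt when nothing else was found.
    return content_txt


def choose_file_size(hit):
    """Prefer ':jcr:data' when jcr data is present, else fall back to 'dam:size'."""
    jcr_size = hit.get(":jcr:data")
    if jcr_size:
        return jcr_size
    return hit.get("dam:size")


print(choose_body("", "", "page text"))
print(choose_file_size({":jcr:data": 1024, "dam:size": 2048}))
print(choose_file_size({"dam:size": 2048}))
```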
