SharePoint and SharePoint Online Connectors
The SharePoint connector retrieves content and metadata from an on-premises SharePoint repository.
Platform versions
V1 connectors
The SharePoint V1 connectors were deprecated in Fusion 5.2. However:
- The SharePoint V1 connector can be used in Fusion 4.x and Fusion 5.1 - Fusion 5.4.
- The SharePoint Online V1 connector can be used in Fusion 4.x and Fusion 5.1 - Fusion 5.3.
V2 connectors
The SharePoint V2 connector can be used in Fusion 5.1 - Fusion 5.5. This connector is deprecated as of June 19, 2023. The scheduled date to remove the connector is January 31, 2024.
Key differences between V1 and V1 Optimized
CSOM REST API
The V1 platform version uses SOAP API. This API style was deprecated as of SharePoint 2013.
The V1 Optimized platform version uses CSOM REST API. This API style provides a variety of benefits not found with SOAP API:
- CSOM REST API supports bulk operations for faster crawl operations.
- CSOM REST API uses traffic decoration and is therefore less susceptible to throttling.
- CSOM REST API is considerably more efficient, resulting in less data being transferred during crawl operations.
Active Directory Connector for ACLs dependency
The V1 platform version has a key limitation with regard to LDAP/ActiveDirectory access. In order to look up user group memberships, each SharePoint datasource was required to perform its own LDAP queries. If multiple SharePoint datasources utilized a single LDAP/ActiveDirectory backend, duplicate LDAP lookup operations took place unnecessarily, resulting in excessive LDAP overhead.
In Fusion 4.2.4, the Active Directory (AD) Connector for ACLs was introduced.
The SharePoint V1 Optimized connector works in tandem with the AD Connector for ACLs to create a sidecar collection which is used in graph security trimming queries. As a result, all LDAP/ActiveDirectory operations are fully dependent on the AD Connector for ACLs.
If you are using SharePoint Online, and it is not backed by Azure Active Directory or Active Directory Federation Services (ADFS), the V1 Optimized connector does not depend on the AD Connector for ACLs.
Changes API
The V1 platform version does not use the SharePoint Changes API. As a result, the recrawl process required all items to be revisited in order. For large SharePoint collections, incremental crawls took an excessive amount of time.
The V1 Optimized platform version is able to take advantage of the Changes API to perform incremental crawls. The Changes API tracks all additions, updates, and deletions since the previous crawl operation for a collection.
This process significantly improves incremental crawl speed.
Graph security trimming
The security trimming approach used by the V1 platform version had notable drawbacks:
- LDAP/ActiveDirectory information is stored in an inefficient manner. When a document is fetched for indexing, it returns the users and groups with permission to view the document. However, SharePoint does not explicitly list these users and groups. The security trimming approach requires that all nested LDAP/ActiveDirectory groups be fetched and added to the document ACLs.
  As a result, if the nested LDAP/ActiveDirectory group relationships change, the content sometimes must be reindexed even though it has not changed in SharePoint. This can lead to massive reindexing operations.
- Each SharePoint datasource requires a separate Solr filter. With the V1 platform version, SharePoint datasources cannot share the same security filter, even if they point to the same SharePoint farm. This restriction can be severely inefficient.
  In a use case with five SharePoint datasources, for example, five Solr filter queries (fqs) would be required. The more fqs you have, the more work Solr performs for each query, resulting in slower queries. This inefficiency scales with the number of SharePoint datasources, and it is not uncommon to have 30-50 datasources in an application.
- SharePoint security filters cannot be shared with other connectors. For example, if a SharePoint datasource and an SMB2 datasource are backed by the same ActiveDirectory, you are still required to have an individual security filter for each datasource. Again, this inefficiency scales with the number of datasources you have.
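To make the fq overhead concrete, here is a minimal sketch of how one filter query per datasource accumulates under the V1 approach. The datasource IDs and the `acl_*` field names are hypothetical, chosen only for illustration:

```python
# Illustrative sketch: under the V1 approach, every SharePoint datasource
# contributes its own Solr filter query (fq) to each search request.
# Datasource IDs and acl_* field names here are hypothetical.
def build_v1_security_fqs(user_acls_by_datasource):
    """Build one fq per datasource; Solr must evaluate every one per query."""
    fqs = []
    for ds_id, acl_values in user_acls_by_datasource.items():
        clause = " OR ".join(f'acl_{ds_id}:"{v}"' for v in acl_values)
        fqs.append(f"({clause})")
    return fqs

fqs = build_v1_security_fqs({
    "sp_hr": ["DOMAIN\\jdoe", "DOMAIN\\hr-readers"],
    "sp_finance": ["DOMAIN\\jdoe"],
})
# Two datasources already mean two fq parameters on every query;
# 30-50 datasources mean 30-50 fqs.
```

Each additional datasource adds another fq that Solr evaluates on every query, which is the scaling problem described above.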
Unlike the V1 platform version, the V1 Optimized platform version uses a Solr graph query approach. Advantages include:
- LDAP/ActiveDirectory information is not stored in nested groups on the content document ACL fields.
- ACLs in SharePoint content documents are stored in a field. Each SharePoint document that you crawl contains ACLs. As the document is indexed by Fusion, a field is populated with any role assignments attached to the document to ensure only users with appropriate permissions can view it. For example, when performing a security-trimmed query, you can input the username that is performing the search, and a Solr fq is formed with the values that match the ACL field on each document. The documents that are returned are restricted to what the user is permitted to view.
- A single filter can perform a security trimming query against datasources backed by the same ActiveDirectory instance. This is not restricted to the SharePoint V1 Optimized connector. Other connectors, such as the SMB2 connector, can use the same filter.
- Group membership lookups (LDAP queries) are separated from the SharePoint connector. Now, the AD Connector for ACLs is used to create a separate ACL Solr sidecar collection. First, a Solr graph query is performed to obtain a user's groups and nested groups from the sidecar collection. Then, a join query is used to match the ACL fields on the content documents.
  This process is performed behind the scenes. The V1 Optimized connector uses the security trimming stage like all other connectors.
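The graph-plus-join pattern described above can be sketched as follows. This is an illustrative approximation, not Fusion's actual implementation: the sidecar collection name and the fields `user_id`, `member_of`, `acl_id`, and `acl_ss` are assumptions, and the exact nesting of query parsers in Fusion's generated queries may differ.

```python
# Illustrative sketch of a graph security-trimming filter. All collection
# and field names are assumptions, not Fusion's actual schema.
def build_graph_trimming_fq(username, sidecar_collection="acl_sidecar"):
    # Step 1: a Solr graph query walks nested group membership in the
    # ACL sidecar collection, starting from the user's record.
    graph_q = "{!graph from=member_of to=acl_id}" + f'user_id:"{username}"'
    # Step 2: a cross-collection join matches the resolved group/ACL ids
    # against the ACL field stored on each content document.
    return (
        f"{{!join from=acl_id to=acl_ss fromIndex={sidecar_collection}}}"
        + graph_q
    )

fq = build_graph_trimming_fq("DOMAIN\\jdoe")
```

The point of the pattern is that a single fq of this shape can trim results for every datasource backed by the same directory, instead of one fq per datasource.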
Multiple crawl phases
The V1 platform version does not support multiple crawl phases.
The V1 Optimized platform version performs crawl operations in two phases:
- Pre-fetch phase - This phase:
  - Utilizes the CSOM REST API to fetch all relevant metadata in large batches. This creates a pre-fetch database, which is exported for use by the post-fetch phase.
  - Does not download the file content of list items. It only fetches the metadata.
  - Is saved in $FUSION_HOME/var/log/connectors/connectors-classic/connectors-classic.log and $FUSION_HOME/var/log/connectors/connectors-classic/sharepoint-exporter-DSID.log, where DSID is the SharePoint Optimized datasource ID. The counters in the datasource job status window only increase when the content documents begin to index.
- Post-fetch phase - After the pre-fetch phase has completed, the crawl operation is ready to index documents during the post-fetch phase. The crawl will iterate through all items identified in the pre-fetch phase and index them into the pipeline. If there is file content associated with a pre-fetch list item, that content will be downloaded and parsed using the Fusion parser.
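As a rough illustration of the pre-fetch idea, the sketch below builds a SharePoint REST request that pages through list-item metadata in large batches without downloading any file content. The site URL, list title, and selected fields are placeholders, and this is not the connector's actual request:

```python
from urllib.parse import quote

# Illustrative sketch of the pre-fetch idea: request list-item metadata
# in large pages via the SharePoint REST API, deferring file downloads
# to the post-fetch phase. Site URL, list title, and fields are
# placeholders, not what the connector actually requests.
def prefetch_metadata_url(site_url, list_title, page_size=5000):
    # $select limits the response to metadata fields (no file content);
    # $top sets the batch size per request.
    select = "Id,FileRef,Modified"
    return (
        f"{site_url}/_api/web/lists/getbytitle('{quote(list_title)}')/items"
        f"?$select={select}&$top={page_size}"
    )

url = prefetch_metadata_url("https://sp.example.com/sites/hr", "Documents")
```

Fetching only metadata in large pages like this is what makes the pre-fetch phase fast; content bytes are only transferred later, in the post-fetch phase.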
See View the SharePoint Export Database File for information about viewing the database file created when the SharePoint V1 Optimized connector executes a crawl.
SharePoint (on-premises)
This connector can access a SharePoint repository running on the following platforms:
- Microsoft SharePoint 2013
- Microsoft SharePoint 2016
- Microsoft SharePoint 2019
Understanding incremental crawls
After your first crawl completes successfully (with no errors), all subsequent crawls are "incremental crawls".
Incremental crawls use SharePoint’s Changes API. For each site collection, this uses the change token (timestamp) to get all additions, updates, and deletions since the full crawl was started.
If the Limit Documents > Fetch all site collections checkbox is selected, you are crawling an entire SharePoint Web application, and a site collection was deleted since the last crawl, then the incremental crawl removes it from your index.
If you are filtering on fields, be sure to leave the lw fields in place. These fields are required for successful incremental crawling.
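For illustration, a change query against SharePoint's REST Changes API might look like the sketch below, which asks for all additions, updates, and deletions since a stored change token. The token value is a placeholder, and the connector's actual requests may differ:

```python
import json

# Illustrative sketch of a Changes API request: ask SharePoint for
# everything added, updated, or deleted since the change token recorded
# by the previous crawl. Field choices and the token are placeholders.
def build_change_query(change_token):
    return {
        "query": {
            "__metadata": {"type": "SP.ChangeQuery"},
            "Add": True,
            "Update": True,
            "DeleteObject": True,
            "Item": True,
            # Resume from the token recorded by the previous crawl.
            "ChangeTokenStart": {
                "__metadata": {"type": "SP.ChangeToken"},
                "StringValue": change_token,
            },
        }
    }

# A body like this would be POSTed to <site>/_api/site/getchanges.
body = json.dumps(build_change_query("token-from-previous-crawl"))
```

Because the server enumerates only the changes since the token, an incremental crawl avoids revisiting every item, which is the speedup described above.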
Throttling or rate limiting
SharePoint Online is a cloud API. As such, it necessarily has rate limiting (throttling) policies, which can be an issue during crawling.
Ideally, a SharePoint Online crawl runs as fast as possible, but in practice throttling limits crawl speed. The SharePoint Online documentation has important information about this.
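A common way to cope with throttling is to honor the Retry-After header that SharePoint Online returns alongside HTTP 429 (and sometimes 503) responses. The sketch below is a generic retry loop, not the connector's implementation; `fetch` is a stand-in for any HTTP call returning a status, headers, and a body:

```python
import time

# Illustrative throttling-aware retry loop. SharePoint Online signals
# throttling with HTTP 429/503 and a Retry-After header telling the
# client how long to pause. `fetch` is a stand-in HTTP call returning
# (status, headers, body); this is not the connector's actual code.
def fetch_with_backoff(fetch, url, max_retries=5):
    for attempt in range(max_retries):
        status, headers, body = fetch(url)
        if status not in (429, 503):
            return body
        # Prefer the server's hint; fall back to exponential backoff.
        delay = int(headers.get("Retry-After", 2 ** attempt))
        time.sleep(delay)
    raise RuntimeError(f"still throttled after {max_retries} attempts")
```

Respecting Retry-After keeps the crawl as fast as the service allows without triggering longer blocks.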
User permission configuration options
The SharePoint connectors provide a variety of configuration options for accessing SharePoint and SharePoint Online. Permissions settings should follow the principle of least privilege, as described in the Microsoft SharePoint docs:
Follow the principle of least-privileged: Users should have only the permission levels or individual permissions they must have to perform their assigned tasks.
SharePoint
Account type | Account config | Description |
---|---|---|
Active Directory Service Account | Account is set up as a Site Collection Auditor | Allows you to list all site collections. |
Active Directory Service Account | Account is set up with limited permissions | Does not allow you to list site collections in your SharePoint web application. You must list each site collection you want to crawl manually. Additionally, noindex tags are ignored. Sites are always indexed regardless of their noindex settings. |
See Configure A SharePoint V1 Optimized Datasource for configuration instructions.
SharePoint Online
Account type | Account config | Description |
---|---|---|
Full Admin | Azure App Only | Allows you to list all site collections in the tenant. |
Full Admin | OAuth App Only | Does not allow you to list site collections in your SharePoint web application. You must list each site collection you want to crawl manually. |
ADFS Account | Account is set up as a Site Collection Auditor | Allows you to list all site collections if the user is a tenant administrator. |
ADFS Account | Account is set up with limited permissions | Does not allow you to list site collections in your SharePoint web application. You must list each site collection you want to crawl manually. Use this option if your deployment requires the Lucidworks crawl account to have the fewest privileges possible. |