SharePoint Optimized V2 Connector Configuration Reference

Table of Contents

Remote connectors
Security trimming
Configuration

The SharePoint Optimized V2 connector retrieves content and metadata from on-premises and cloud-based SharePoint repositories.

Verify your connector version

This connector depends on specific Fusion versions. See the following table for the required versions:

Fusion version            Connector version
Fusion 5.6.1 and later    v1.1.0 through v1.6.0
Fusion 5.9.0              v1.6.0 or later
Fusion 5.9.1 and later    v2.0.0 and later

For connector downloads, see Download Connectors. For instructions on installing a connector, see Install a Connector.

Note the following guidelines for using the SharePoint Optimized V2 connector:

There is a pod limit. The SharePoint Optimized V2 connector does not support running multiple instances. Don’t run the connector on more than one pod.
Watch for connector compatibility. Use the LDAP ACLs V2 connector with this connector.

For details on crawls and incremental crawls, see How to crawl using the SharePoint Optimized V2 Connector.

Remote connectors

You can configure the SharePoint Optimized V2 Connector (v2.0.0 and later) to run remotely in Fusion versions 5.9.1 and later. Refer to Configure Remote V2 Connectors.

Security trimming

The SharePoint Optimized V2 connector supports security trimming. Refer to Configure Security Trimming for SharePoint Optimized V2.

Configuration

To change the number of items to retrieve per page, set the value of apiQueryRowLimit. The default value is 5000.

To change the number of change events to retrieve per page, set the value of changeApiQueryRowLimit. The default value is 2000.

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
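The difference between the two forms comes from JSON string escaping. The following is a minimal sketch, assuming the API request body is JSON; the field name `delimiter` is hypothetical and is used only to illustrate the escaping, not to name a real connector property.

```python
import json

# Hypothetical field name, used only to illustrate escaping.
# In the UI you would type the two characters: backslash, then t.
ui_value = "\\t"  # Python literal for backslash + t

# Serializing the same value for an API request escapes the backslash,
# so the JSON text contains two backslashes followed by t.
api_body = json.dumps({"delimiter": ui_value})
print(api_body)

# Decoding the JSON body recovers the value exactly as typed in the UI.
assert json.loads(api_body)["delimiter"] == ui_value
```

If you build the request body with a JSON library, as here, the extra backslash is added for you; it only has to be written by hand when the JSON is typed manually.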

An Optimized Connector for SharePoint 2010, 2013, 2016, 2019 and SharePoint Online

description - string

Optional description

<= 125 characters

pipeline - string (required)

Name of the IndexPipeline used for processing output.

>= 1 characters

Match pattern: ^[a-zA-Z0-9_-]+$
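To check a name against this pattern before submitting a configuration, a small sketch like the following may help. The function name `is_valid_name` is an illustrative assumption; only the pattern itself comes from this reference.

```python
import re

# Character class taken from the match pattern above.
# fullmatch() is used instead of ^...$ so a trailing newline is rejected too.
NAME_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")

def is_valid_name(value: str) -> bool:
    """Return True if value is non-empty and uses only letters, digits, '_' or '-'."""
    return NAME_PATTERN.fullmatch(value) is not None

print(is_valid_name("sharepoint_pipeline-1"))
print(is_valid_name("my pipeline"))
```

The same pattern applies to the id field of the configuration.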

diagnosticLogging - boolean

Enable diagnostic logging; disabled by default

Default: false

parserId - string (required)

The Parser to use in the associated IndexPipeline.

coreProperties - Core Properties

Common behavior and performance settings.

fetchSettings - Fetch Settings

System level settings for controlling fetch behavior and performance.

pluginInstances - number

Maximum number of plugin instances for distributed fetching. Only the specified number of plugin instances will fetch. This is useful for distributing load between different instances.

>= 1

<= 1

exclusiveMinimum: false

exclusiveMaximum: false

Default: 1

Multiple of: 1

asyncParsing - boolean

When enabled, content will be indexed asynchronously.

Default: false

numFetchThreads - number

Maximum number of fetch threads; defaults to 20. This setting controls the number of threads that call the connector's fetch method. Higher values can, but do not always, help with overall fetch performance.

>= 1

<= 500

exclusiveMinimum: false

exclusiveMaximum: false

Default: 20

Multiple of: 1

indexingThreads - number

Maximum number of indexing threads; defaults to 4. This setting controls the number of threads in the indexing service used for processing content documents emitted by this datasource. Higher values can sometimes help with overall fetch performance.

>= 1

<= 10

exclusiveMinimum: false

exclusiveMaximum: false

Default: 4

Multiple of: 1

fetchResponseScheduledTimeout - number

The maximum amount of time for a response to be scheduled. The task will be canceled if this setting is exceeded.

>= 1000

<= 500000

exclusiveMinimum: false

exclusiveMaximum: false

Default: 300000

Multiple of: 1

indexingInactivityTimeout - number

The maximum amount of time to wait for indexing results (in seconds). If exceeded, the job will fail with an indexing inactivity timeout.

>= 60

<= 691200

exclusiveMinimum: false

exclusiveMaximum: false

Default: 86400

Multiple of: 1

pluginInactivityTimeout - number

The maximum amount of time to wait for plugin activity (in seconds). If exceeded, the job will fail with a plugin inactivity timeout.

>= 60

<= 691200

exclusiveMinimum: false

exclusiveMaximum: false

Default: 600

Multiple of: 1

indexMetadata - boolean

When enabled, the metadata of skipped items will be indexed to the content collection.

Default: false

indexContentFields - boolean

When enabled, content fields will be indexed to the crawl-db collection.

Default: false

id - string (required)

A unique identifier for this Configuration.

>= 1 characters

Match pattern: ^[a-zA-Z0-9_-]+$

properties - SharePoint properties

Plugin specific properties.

webApplication - Web application config

The SharePoint Web application to crawl.

webApplicationUrl - string

>= 1 characters

fetchSiteCollections - boolean

This feature requires site collection administrator rights on your SharePoint instance. If enabled, the SharePoint crawler will fetch all site collections from the web application automatically. If not enabled, you must explicitly list all site collections in the siteCollections parameter.

Default: true

forceFullCrawl - boolean

Enable this to force a full crawl each time you run this datasource.

Default: false

siteCollections - array[string]

A list of site collections to crawl. Because only site collection administrators or site collection auditors can list the site collections in a SharePoint web application, use this setting when you are crawling as a user who is not an administrator or auditor. It lets you explicitly list the site collections you want to crawl. Specify paths relative to the web application URL, such as /sites/site1.

Default:

includedFileExtensions - array[string]

Set of file extensions to be fetched. If specified, all non-matching files will be skipped.

Default:

excludedFileExtensions - array[string]

A set of all file extensions to be skipped from the fetch.

Default:

inclusiveRegexes - array[string]

Regular expressions for URI patterns to include. This will limit this datasource to only URIs that match the regular expression.

Default:

exclusiveRegexes - array[string]

Regular expressions for URI patterns to exclude. This will limit this datasource to only URIs that do not match the regular expression.

Default:
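The combined effect of inclusiveRegexes and exclusiveRegexes can be sketched as follows. This is an illustration under stated assumptions, not the connector's implementation: it assumes a URI is kept when it matches at least one inclusive pattern (if any are given) and no exclusive pattern, and it assumes partial matching via `re.search`. The function name and the example.com URIs are hypothetical.

```python
import re
from typing import Iterable, List

def filter_uris(uris: Iterable[str], inclusive: List[str], exclusive: List[str]) -> List[str]:
    """Keep URIs that match an inclusive pattern (if any exist) and no exclusive pattern."""
    include = [re.compile(p) for p in inclusive]
    exclude = [re.compile(p) for p in exclusive]
    kept = []
    for uri in uris:
        # Assumption: an empty inclusive list means "include everything".
        if include and not any(p.search(uri) for p in include):
            continue
        # Assumption: any exclusive match removes the URI.
        if any(p.search(uri) for p in exclude):
            continue
        kept.append(uri)
    return kept

# Hypothetical URIs for illustration only.
sample = [
    "https://sp.example.com/sites/hr/doc1.docx",
    "https://sp.example.com/sites/hr/archive/old.docx",
    "https://sp.example.com/sites/it/doc2.docx",
]
print(filter_uris(sample, inclusive=[r"/sites/hr/"], exclusive=[r"/archive/"]))
```

Testing patterns offline like this can catch an overly broad or overly narrow expression before a crawl is started.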

includeContentsExtensions - array[string]

Only files with these file extensions will have their contents downloaded when indexing this item. For all other files, the list item metadata will still be indexed but the file contents will not. The comparison is not case sensitive, and you do not have to specify the '.', but it still works if you do. For example, "zip" and ".zip" are both acceptable. Whitespace is also trimmed.

Default:

excludeContentsExtensions - array[string]

File extensions of files that will not have their contents downloaded when indexing this item. The list item metadata will still be indexed but the file contents will not. The comparison is not case sensitive, and you do not have to specify the '.', but it still works if you do. For example, "zip" and ".zip" are both acceptable. Whitespace is also trimmed.

Default:
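The extension comparison described above (case-insensitive, optional leading dot, trimmed whitespace) can be sketched like this. The function names are hypothetical, and the way the file name is split is an assumption made for illustration.

```python
from typing import List

def normalize_extension(ext: str) -> str:
    """Trim whitespace, drop a leading '.', and lowercase the extension."""
    return ext.strip().lstrip(".").lower()

def contents_excluded(filename: str, excluded: List[str]) -> bool:
    """Return True if the file's extension is in the configured exclusion list."""
    normalized = {normalize_extension(e) for e in excluded}
    # Assumption: the extension is whatever follows the last '.' in the name.
    _, dot, suffix = filename.rpartition(".")
    return bool(dot) and suffix.lower() in normalized

print(contents_excluded("Archive.ZIP", [" .zip "]))
print(contents_excluded("report.pdf", ["zip"]))
```

Under this sketch, "zip", ".zip", and " ZIP " all configure the same exclusion.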

restrictToSpecificItems - array[string]

Instead of specifying regular expressions to restrict the SharePoint items that are crawled, this allows you to specify the SharePoint item URLs of the resources to be crawled. The crawl is then restricted to only these SharePoint item URLs. You can specify list, sub-site, folder, and list item URLs.

Default:

apiQueryRowLimit - number

>= 1

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 5000

Multiple of: 1

changeApiQueryRowLimit - number

>= 1

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 2000

Multiple of: 1

aclCommitAfter - number

When performing a Solr update to the ACL collection, specify the commitWithin parameter to use when updating.

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 60000

Multiple of: 1

siteCollectionDeletionThreshold - number

Site collections will be removed from the index after they are no longer available for this many hours. Set to 0 for immediate deletion. Default is 2 weeks.

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 336

Multiple of: 1

solrSocketTimeout - number

Socket timeout when performing Solr operations.

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 60000

Multiple of: 1

moderationStatusFilter - array[number]

If specified, only index items with the following moderation statuses specified. Valid values are: 0 = The list item is approved, 1 = The list item has been denied approval, 2 = The list item is pending approval, 3 = The list item is in the draft or checked out state, 4 = The list item is scheduled for automatic approval at a future date.
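For readability, the numeric status codes can be given names when a configuration is built programmatically. The following sketch uses the codes listed above; the enum and its member names are illustrative assumptions and are not part of the connector.

```python
from enum import IntEnum

class ModerationStatus(IntEnum):
    """Numeric moderation statuses as listed in this reference (names are illustrative)."""
    APPROVED = 0
    DENIED = 1
    PENDING = 2
    DRAFT = 3
    SCHEDULED = 4

# Example: index only approved items and items scheduled for approval.
moderation_status_filter = [
    int(ModerationStatus.APPROVED),
    int(ModerationStatus.SCHEDULED),
]
print(moderation_status_filter)
```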

fetchTaxonomies - boolean

Fetch taxonomy data from SharePoint.

Default: false

siteCollectionTaxonomyCacheSize - number

To make the connector faster, when the taxonomy terms for a site collection are needed, they are cached to avoid looking up from disk again. This is the size of that cache.

>= 1

<= 10000

exclusiveMinimum: false

exclusiveMaximum: false

Default: 10

Multiple of: 1

fetchACLs - boolean

Fetch Access Control Data

Default: true

asyncParsing - boolean

Enable only if Tika Async is configured in the Fusion environment. Note: To enable async-parsing, check Core Properties -> Fetch Settings -> Async Parsing (since Fusion 5.8.0)

Default: false

zkHosts - string

Solr ZooKeeper (zk) hosts string used for direct connections to Solr.

contentCommitAfter - number

When performing a Solr update to the content collection, specify the commitWithin parameter to use when updating.

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 60000

Multiple of: 1

zkChroot - string

Solr ZooKeeper (zk) chroot string used for direct connections to Solr.

solrConnectionTimeout - number

Connection timeout when performing Solr operations.

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 60000

Multiple of: 1

includedListBaseTypes - array[string]

If specified, the only SharePoint lists that will be fetched are the ones that match one of these base types. Accepts values (not case sensitive): [None, GenericList, DocumentLibrary, Unused, DiscussionBoard, Survey, Issue]

includedObjectTypes - array[string]

If specified, only fetch specific SharePoint objects. SharePoint object types that can be specified (not case sensitive): [Site, List, List_Item, Folder, Attachment]

proxyProperties - Proxy options

A set of options for configuring the proxy.

url - string

The proxy URL

>= 1 characters

username - string

Proxy username

>= 1 characters

password - string

Proxy password

>= 1 characters

ntlmProperties - NTLM Authentication settings

user - string

User

>= 1 characters

password - string

Password

>= 1 characters

domain - string

Domain

>= 1 characters

workstation - string

Workstation

>= 1 characters

sharepointOnlineAuthProperties - SharePoint Online Authentication

Settings relevant only when crawling SharePoint Online.

account - string

Your Microsoft SharePoint Online Account name which takes the form of username@domain.com

>= 1 characters

password - string

Password for your Microsoft SharePoint Online Account.

>= 1 characters

sessionExpirationMs - number

How long, in milliseconds, before new SharePoint Online authentication cookies should be fetched.

>= 1

<= 172800000

exclusiveMinimum: false

exclusiveMaximum: false

Default: 7200000

Multiple of: 1

userAgent - string

The user agent header decorates the HTTP traffic. This is important for preventing hard rate limiting by SharePoint Online.

Default: ISV|Lucidworks|Fusion/5.x

capUserAgent - string

When "O365 Conditional Access Policy (CAP) setting" is enabled, we need to use a compliant User-Agent that matches one of the supported devices when doing O365 STS authentication. For example if iOS is a supported platform, set this to 'Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_3 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) CriOS/60.0.3112.89 Mobile/14G60 Safari/602.1'

<= 4000 characters

>= 1 characters

appAuthClientId - string

Applicable to SharePoint Online App-Auth Public/Private Service Account. The Azure client ID of your application.

<= 100 characters

>= 1 characters

appAuthPkcs12KeystoreBase64String - string

Applicable to SharePoint Online App-Auth only. This is the base64 string of your PKCS12 keystore loaded with the PFX certificate file supplied by Azure AD. To get this value, first take the yourcert.pfx file you received from Azure AD and convert it to PKCS12 keystore format (for example, "keytool -importkeystore -srckeystore yourcert.pfx -srcstoretype pkcs12 -destkeystore yourcert.p12 -deststoretype pkcs12"). Next, convert yourcert.p12 to a base64 string.

<= 10000 characters

>= 1 characters
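The final step, converting the keystore file to a base64 string, can be done with any base64 tool. The following is a minimal Python sketch; the function name is hypothetical, and yourcert.p12 is the example file name used in the description above.

```python
import base64
from pathlib import Path

def keystore_to_base64(path: str) -> str:
    """Read a keystore file and return its contents as a single-line base64 string."""
    raw = Path(path).read_bytes()
    # b64encode produces no line breaks, which suits a single string field.
    return base64.b64encode(raw).decode("ascii")
```

For example, `keystore_to_base64("yourcert.p12")` returns the string to paste into appAuthPkcs12KeystoreBase64String. Because the field accepts at most 10000 characters, checking `len()` of the result before submitting may be worthwhile.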

appAuthPkcs12KeystorePassword - string

Applicable to SharePoint Online App-Auth Public/Private Service Account. Password of the PKCS12 keystore.

<= 100 characters

>= 1 characters

appAuthClientSecret - string

Applicable to SharePoint Online OAuth App-Auth only. The Azure client secret of your application.

<= 100 characters

>= 1 characters

appAuthRefreshToken - string

Applicable to SharePoint Online OAuth App-Auth only. This is a refresh token which is reusable for up to 12 hours. You must obtain a new token using the OAuth login process if the token expires.

<= 1000 characters

>= 1 characters

appAuthTenant - string

Applicable to SharePoint Online App-Auth only. The Office365 tenant name to use when authenticating with Azure AD.

<= 2083 characters

>= 1 characters

appAuthAzureLoginEndpoint - string

Applicable to SharePoint Online App-Auth Public/Private Service Account. The Azure login endpoint to use when authenticating.

<= 2083 characters

>= 1 characters

Default: https://login.windows.net

jsAuthConfigJson - string

The JS Auth config JSON file, which contains a list of WebCredential entries used for the web driver login process.

jsAuthLoginUrl - string

JS Auth Login Url to use when doing the login process.

jsAuthSeleniumUrl - string

URL of the Selenium grid service to use while performing WebDriver authentication to SharePoint Online.

maximumItemLimitConfig - Item Count Limit

maxItems - number

Limits the number of items emitted to the configured IndexPipeline. The default is no limit (-1).

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: -1

Multiple of: 1

sizeLimitProperties - Item Size Limits

For documents that do not meet the maximum/minimum size limits, only metadata will be indexed, without the body. The documents will indicate the reason why content is not indexed with the field '_lw_contents_excluded_s: file size'.

maxSizeBytes - number

Used for excluding items when the item size is larger than the configured value.

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: -1

Multiple of: 1

minSizeBytes - number

Used for excluding items when the item size is smaller than the configured value.

>= -2147483648

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 1

Multiple of: 1
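The size check can be pictured with the short sketch below. It assumes that a negative limit disables that check, which the default of -1 for maxSizeBytes suggests but this reference does not state; the function name is hypothetical.

```python
def content_indexed(size_bytes: int, max_size_bytes: int = -1, min_size_bytes: int = 1) -> bool:
    """Return True if the item's body would be indexed under the given size limits.

    Assumption: a negative limit means the corresponding check is disabled.
    """
    if max_size_bytes >= 0 and size_bytes > max_size_bytes:
        return False  # larger than the configured maximum
    if min_size_bytes >= 0 and size_bytes < min_size_bytes:
        return False  # smaller than the configured minimum
    return True

print(content_indexed(10))
print(content_indexed(2048, max_size_bytes=1024))
```

With the default values under this assumption, only empty (0-byte) items fall outside the limits, and only their metadata would be indexed.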

fetchRetryProperties - Retry Options

A set of options for configuring retry behavior.

maxDelayTimeMs - number

The maximum wait time between successive retries.

>= 1

<= 600000

exclusiveMinimum: false

exclusiveMaximum: false

Default: 300000

Multiple of: 1

maxTimeLimitMs - number

This setting is used to limit the maximum amount of time spent on retries. Note: this will be ignored if "Maximum Retries" is specified.

>= 1

<= 28800000

exclusiveMinimum: false

exclusiveMaximum: false

Default: 600000

Multiple of: 1

errorExclusions - array[string]

Optional list of regular expressions that will be matched against a failed attempt's exception class and message. If any regex matches, the request is not retried. This is needed to prevent the retryer from retrying non-recoverable errors that were not already ignored by the connector implementation.

maxRetries - number

The retryer will retry failed operations in the case that they might succeed if attempted again. This parameter states the number of attempts to retry until giving up. This parameter, if specified, will override the "Stop retrying after time (milliseconds)" parameter.

<= 100

exclusiveMinimum: false

exclusiveMaximum: false

Default: 3

Multiple of: 1

delayFactor - number

The retryer will retry failed operations in the case that they might succeed if attempted again. The retryer will sleep an exponential amount of time after the first failed attempt and retry in exponentially incrementing amounts after each failed attempt up to the maximumTime. nextWaitTime = exponentialIncrement * multiplier.

>= 1

<= 9999

exclusiveMinimum: false

exclusiveMaximum: false

Default: 2

Multiple of: 1

delayMs - number

Sets the delay between retries, exponentially backing off to the maxDelayTimeMs and multiplying successive delays by the delayFactor

>= 1

<= 9223372036854776000

exclusiveMinimum: false

exclusiveMaximum: false

Default: 1000

Multiple of: 1
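To see how delayMs, delayFactor, maxDelayTimeMs, and maxRetries interact, the following sketch computes a delay schedule. It assumes the wait before retry n is delayMs multiplied by delayFactor n times and capped at maxDelayTimeMs; the connector's exact formula is not given in this reference, so treat this as an approximation. The function name is hypothetical.

```python
from typing import List

def retry_delays(delay_ms: int = 1000, delay_factor: int = 2,
                 max_delay_ms: int = 300000, max_retries: int = 3) -> List[int]:
    """Return the assumed wait (in ms) before each retry, using the documented defaults."""
    delays = []
    current = delay_ms
    for _ in range(max_retries):
        # Cap each wait at the configured maximum delay.
        delays.append(min(current, max_delay_ms))
        # Grow the next wait exponentially.
        current *= delay_factor
    return delays

print(retry_delays())                 # schedule with the default settings
print(retry_delays(max_retries=10))   # longer schedule that reaches the cap
```

Under these assumptions, the defaults give waits of 1, 2, and 4 seconds, and the 300000 ms cap is first reached on the tenth retry.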

connections - Http client options

A set of options for configuring the http client.

maxConnections - number

The maximum number of connections

>= 1

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 5000

Multiple of: 1

maxPerRoute - number

Defines a connection limit per HTTP route. In simple cases you can understand this as a per-target-host limit. Under the hood things are a bit more interesting: HttpClient maintains a couple of HttpRoute objects, each of which represents a chain of hosts, like proxy1 -> proxy2 -> targetHost. Connections are pooled on a per-route basis. In simple cases, when you're using the default route-building mechanism and provide no proxy support, your routes are likely to include the target host only, so the per-route connection pool limit effectively becomes a per-host limit.

>= 1

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 1000

Multiple of: 1

ignoreSSLValidationExceptions - boolean

Do not attempt to do an SSL Handshake and do not verify the hostname of SSL certificates. Use this when accessing an https url with a self-signed or enterprise certificate authority that you do not want to put in the Java keystore.

Default: false

readTimeoutMs - number

>= -1

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 60000

Multiple of: 1

connectTimeoutMs - number

>= -1

<= 2147483647

exclusiveMinimum: false

exclusiveMaximum: false

Default: 300000

Multiple of: 1

debug - Debug options

Special properties used for debugging the connector.

onlyFetchAcls - boolean

Do a full crawl that only crawls ACLs. When the ACLs are all fully indexed, any old ACL documents from previous crawls for this datasource are cleared. This gives you fresh SharePoint ACLs without affecting the content.

Default: false

logThreadDumpEveryNSeconds - number

For diagnostic purposes, write a thread dump to logs every N seconds. If set <= 0, no dump is taken.

>= -1

<= 9999999

exclusiveMinimum: false

exclusiveMaximum: false

Default: -1

Multiple of: 1

simulate429ErrorsEveryNRequests - number

If > 0, simulate a SharePoint 429 status (too-many-requests) error such that there will be one error per this many requests.

>= -1

<= 999999

exclusiveMinimum: false

exclusiveMaximum: false

Default: -1

Multiple of: 1

preserveFullExportDb - boolean

The list* tables are normally cleared prior to saving the crawl database. This option leaves them in place for analysis. This parameter is ignored if using a persistent volume to store the crawl DB because the data will always be saved in that case.

Default: false

onlyFetchMetadata - boolean

For diagnostic purposes, do a dry run where the connector only generates the SharePoint metadata export database and indexes the ACL records in the ACL collection, but does not fetch content.

Default: false

logAclInserts - boolean

For diagnostic purposes, log all documents inserted into the ACL collection.

Default: false

security -

collectionId - string

Id of the collection to be used for storing ACL records. If not specified, ACL collection name will be generated automatically using pattern '<datasource_id>_access_control_hierarchy'.
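The automatically generated name follows directly from the pattern above. A one-line sketch, with a hypothetical function name and a hypothetical datasource ID:

```python
def default_acl_collection(datasource_id: str) -> str:
    """Build the ACL collection name from the documented pattern."""
    return f"{datasource_id}_access_control_hierarchy"

# "sharepoint_hr" is a made-up datasource ID used only for illustration.
print(default_acl_collection("sharepoint_hr"))
```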