aliasExpiration - integer
The number of crawls after which an alias will expire. The default is 1 crawl.
Default: 1
chunkSize - integer
The number of items to batch for each round of fetching. A higher value can make crawling faster, but memory usage is also increased. The default is 1.
Default: 1
crawlDBType - string
The type of crawl database to use, in-memory or on-disk.
Default: in-memory
Allowed values: in-memory, on-disk
db - Connector DB
Type and properties for a ConnectorDB implementation to use with this datasource.
aliases - boolean
Keep track of original URIs that resolved to the current URI. This negatively impacts performance and the size of the DB.
Default: false
inlinks - boolean
Keep track of incoming links. This negatively impacts performance and size of DB.
Default: false
inv_aliases - boolean
Keep track of target URIs that the current URI resolves to. This negatively impacts performance and the size of the DB.
Default: false
type - string
Fully qualified class name of ConnectorDb implementation.
>= 1 characters
Default: com.lucidworks.connectors.db.impl.MapDbConnectorDb
dedupe - boolean
If true, documents will be deduplicated. Deduplication can be based on an analysis of the content, on the content of a specific field, or on a JavaScript function. If neither a field nor a script is defined, content analysis will be used.
Default: false
dedupeField - string
Field to be used for dedupe. Define either a field or a dedupe script, otherwise the full raw content of each document will be used.
dedupeSaveSignature - boolean
If true, the signature used for dedupe will be stored in a 'dedupeSignature_s' field. Note that this may cause errors about 'immense terms' in that field.
Default: false
dedupeScript - string
Custom JavaScript to dedupe documents. The script must define a 'genSignature(content){}' function, but can use any combination of document fields. The function must return a string.
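As an illustration of the contract described for dedupeScript, here is a minimal sketch of a signature function. It assumes that `content` can be coerced to a string, which this reference does not confirm. The whitespace normalization and the 32-bit FNV-1a-style hash are illustrative choices only, not the connector's built-in content analysis.

```javascript
// Sketch only: assumes `content` can be coerced to a string (not confirmed
// by this reference). Returns a short hexadecimal string as the signature.
function genSignature(content) {
  // Normalize case and whitespace so trivially different copies match.
  var text = String(content).toLowerCase().replace(/\s+/g, ' ').trim();

  // 32-bit FNV-1a-style hash over UTF-16 code units.
  var hash = 0x811c9dc5;
  for (var i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 using shifts, modulo 2^32.
    hash = (hash + (hash << 1) + (hash << 4) + (hash << 7) +
            (hash << 8) + (hash << 24)) >>> 0;
  }
  return hash.toString(16);
}
```

A 32-bit hash can collide on large collections, so a longer digest may be preferable in practice. A short signature does have the side benefit of avoiding oversized terms if dedupeSaveSignature is enabled.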
delete - boolean
Set to true to remove documents from the index when they can no longer be accessed as unique documents.
Default: true
deleteErrorsAfter - integer
Number of fetch failures to tolerate before removing a document from the index. The default of -1 means documents are never removed because of fetch failures.
Default: -1
diagnosticMode - boolean
Enable to print more detailed information to the logs about each request.
Default: false
emitThreads - integer
The number of threads used to send documents from the connector to the index pipeline. The default is 5.
Default: 5
enable_security_trimming - Enable Security Trimming
f.cacheUserGroupLimit - integer
Only applicable when cacheUserGroups is enabled, this will limit the number of users who will have their groups cached. This is used for testing purposes only. The default of -1 will cause all users to be cached.
>= -1
exclusiveMinimum: false
Default: -1
f.cacheUserGroups - boolean
If true, user groups will be cached so that the Confluence API is not called at query time.
Default: false
f.enableSecurityTrimming - boolean
Enable security trimming of Confluence searches.
Default: true
f.indexGroupPermissions - boolean
Enable indexing of user groups that have permission to view Confluence content.
Default: true
f.indexUserPermissions - boolean
Enable indexing of users who have permission to view Confluence content.
Default: true
f.userGroupCacheCollectionName - string
The name of the Solr collection that will store this datasource's user group cache. This user group cache collection can be shared with other datasources. There is a `ds_id_s` field that is used to query users and groups separately for each datasource.
Default: confluence_usr_grp
excludeExtensions - array[string]
File extensions that should not be fetched. This will limit this datasource to all extensions except those in this list.
excludeRegexes - array[string]
Regular expressions for URI patterns to exclude. This will limit this datasource to only URIs that do not match the regular expression.
f.attachmentMaxSizeBytes - integer
Maximum size, in bytes, of an attachment to fetch.
Default: 4194304
f.commentFormat - string
Index comments as JSON in 'comments_ss' or as separate documents?
Default: Separate doc
Allowed values: Embedded JSON, Separate doc
f.confluenceAuthType - string
Authentication method to use. Basic is the only allowed method for connecting to Confluence hosted by Atlassian.
Default: basic
Allowed values: basic, request
f.confluenceCtxPath - string
Context path under which Confluence instance is deployed.
Default: /
f.confluenceHost - string
Hostname of the Confluence server to crawl.
f.confluencePassword - string
Password for the Confluence user.
f.confluencePort - integer
Port for the Confluence server.
Default: 443
f.confluenceUsername - string
Name of a Confluence user who has admin permissions.
f.crawlAttachments - boolean
Enable indexing of attachments.
Default: true
f.crawlBlogPosts - boolean
Enable indexing of Confluence blog posts.
Default: true
f.crawlComments - boolean
Enable indexing of comments.
Default: true
f.crawlPages - boolean
Enable indexing of Confluence pages.
Default: true
f.crawlPersonalSpaces - boolean
Enable indexing of personal spaces of Confluence users.
Default: true
f.excludedSpaces - array[string]
Confluence Spaces that should be skipped during the crawl.
f.includeArchivedSpaces - boolean
If true, archived spaces are included in the crawl.
Default: true
f.includePrivateContent - boolean
If true, all private content is included in the crawl.
Default: true
f.includedSpaces - array[string]
Confluence Spaces that should be crawled.
f.indexNonCurrentContent - boolean
Enable indexing of non-current (older) versions of Confluence content.
Default: false
f.indexSpacesAsDocs - boolean
Create a separate document for each Confluence Space indexed.
Default: false
f.sessionTTL - integer
Time in milliseconds until HTTP session is considered expired and re-login is performed.
Default: 150000
f.timeout - integer
Time in milliseconds to wait for a server response.
Default: 10000
f.useHttps - boolean
Enable to use SSL when connecting to the Confluence server.
Default: true
f.verify_access - boolean
Try to connect to Confluence server with current properties before saving changes to datasource.
Default: true
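To show how the connection properties above fit together, the following sketch collects them in one object and derives a base URL from them. Only the property names come from this reference. The host, credentials, and the `buildBaseUrl` helper are hypothetical, and the way the connector itself assembles its URL is an assumption.

```javascript
// Hypothetical values; only the property names come from this reference.
var connection = {
  'f.confluenceHost': 'confluence.example.com',
  'f.confluencePort': 443,
  'f.confluenceCtxPath': '/',
  'f.useHttps': true,
  'f.confluenceUsername': 'crawler-admin',
  'f.confluencePassword': 'changeme',
  'f.confluenceAuthType': 'basic',
  'f.timeout': 10000
};

// Illustrative helper, not part of the connector: combines scheme, host,
// port, and context path into a single base URL.
function buildBaseUrl(props) {
  var scheme = props['f.useHttps'] ? 'https' : 'http';
  var path = props['f.confluenceCtxPath'] || '/';
  if (path.charAt(0) !== '/') {
    path = '/' + path;
  }
  return scheme + '://' + props['f.confluenceHost'] + ':' +
         props['f.confluencePort'] + path;
}
```

Calling `buildBaseUrl(connection)` is a quick way to sanity-check the host, port, and context path before saving a datasource with f.verify_access enabled.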
failFastOnStartLinkFailure - boolean
If true, when Fusion cannot connect to any of the provided start links, the crawl is stopped and an exception logged.
Default: true
fetchDelayMS - integer
Number of milliseconds to wait between fetch requests. The default is 100. This property can be used to throttle a crawl if necessary.
Default: 100
fetchThreads - integer
The number of threads to use during fetching. The default is 5.
Default: 5
forceRefresh - boolean
Set to true to recrawl all items even if they have not changed since the last crawl.
Default: false
includeExtensions - array[string]
File extensions to be fetched. This will limit this datasource to only these file extensions.
includeRegexes - array[string]
Regular expressions for URI patterns to include. This will limit this datasource to only URIs that match the regular expression.
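The include and exclude properties (includeExtensions, excludeExtensions, includeRegexes, excludeRegexes) can be pictured as a filter applied to each candidate URI. The sketch below is one plausible reading, not the connector's implementation. It assumes that extensions are written with a leading dot, that an empty include list allows everything, that a regular expression may match anywhere in the URI, and that all four rules must pass. The function names are hypothetical.

```javascript
// Returns true if any pattern in the list matches somewhere in the URI.
function matchesAny(patterns, uri) {
  for (var i = 0; i < patterns.length; i++) {
    if (new RegExp(patterns[i]).test(uri)) {
      return true;
    }
  }
  return false;
}

// Simplistic: takes the text from the last '.' in the last path segment,
// lowercased, ignoring any query string or fragment.
function extensionOf(uri) {
  var path = uri.split(/[?#]/)[0];
  var name = path.substring(path.lastIndexOf('/') + 1);
  var dot = name.lastIndexOf('.');
  return dot === -1 ? '' : name.substring(dot).toLowerCase();
}

// Assumed semantics: every configured rule must allow the URI.
function isAllowed(uri, cfg) {
  var ext = extensionOf(uri);
  var incExt = cfg.includeExtensions || [];
  var excExt = cfg.excludeExtensions || [];
  var incRe = cfg.includeRegexes || [];
  var excRe = cfg.excludeRegexes || [];

  if (incExt.length > 0 && incExt.indexOf(ext) === -1) return false;
  if (excExt.indexOf(ext) !== -1) return false;
  if (incRe.length > 0 && !matchesAny(incRe, uri)) return false;
  if (matchesAny(excRe, uri)) return false;
  return true;
}
```

For example, with `excludeExtensions: ['.zip']` and `excludeRegexes: ['/display/ARCHIVE/']`, this sketch rejects a `.zip` download and any page under the hypothetical ARCHIVE space while allowing other pages.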
indexCrawlDBToSolr - boolean
EXPERIMENTAL: Set to true to index the crawl-database into a 'crawldb_<datasource-ID>' collection in Solr.
Default: false
initial_mapping - Initial field mapping
Provides mapping of fields before documents are sent to an index pipeline.
condition - string
Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.
label - string
A unique label for this stage.
<= 255 characters
mappings - array[object]
List of mapping rules
Default:
{"operation":"move","source":"charSet","target":"charSet_s"}
{"operation":"move","source":"fetchedDate","target":"fetchedDate_dt"}
{"operation":"move","source":"lastModified","target":"lastModified_dt"}
{"operation":"move","source":"signature","target":"dedupeSignature_s"}
{"operation":"move","source":"contentSignature","target":"signature_s"}
{"operation":"move","source":"length","target":"length_l"}
{"operation":"move","source":"mimeType","target":"mimeType_s"}
{"operation":"move","source":"parent","target":"parent_s"}
{"operation":"move","source":"owner","target":"owner_s"}
{"operation":"move","source":"group","target":"group_s"}
object attributes:
operation - string (display name: Operation)
source - string, required (display name: Source Field)
target - string (display name: Target Field)
reservedFieldsMappingAllowed - boolean
Default: false
skip - boolean
Set to true to skip this stage.
Default: false
unmapped - Unmapped Fields
If fields do not match any of the field mapping rules, these rules will apply.
operation - string
The type of mapping to perform: move, copy, delete, add, set, or keep.
Default: copy
Allowed values: copy, move, delete, set, add, keep
source - string
The name of the field to be mapped.
target - string
The name of the field to be mapped to.
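The effect of the mapping rules under initial_mapping can be sketched on a plain object. This is an illustration of what "move", "copy", and "delete" are assumed to mean (rename, duplicate, and drop a field), not Fusion's implementation. The `applyMappings` function is hypothetical, and the "add", "set", and "keep" operations are deliberately left out.

```javascript
// Illustrative only: applies move/copy/delete rules to a copy of `doc`.
// Rules whose source field is absent, and other operations, are ignored.
function applyMappings(doc, mappings) {
  var has = Object.prototype.hasOwnProperty;
  var out = {};
  for (var key in doc) {
    if (has.call(doc, key)) {
      out[key] = doc[key];
    }
  }
  for (var i = 0; i < mappings.length; i++) {
    var rule = mappings[i];
    if (!has.call(out, rule.source)) {
      continue;
    }
    if (rule.operation === 'move') {
      out[rule.target] = out[rule.source];
      delete out[rule.source];
    } else if (rule.operation === 'copy') {
      out[rule.target] = out[rule.source];
    } else if (rule.operation === 'delete') {
      delete out[rule.source];
    }
  }
  return out;
}

// Two of the default rules listed for the mappings property.
var rules = [
  { operation: 'move', source: 'charSet', target: 'charSet_s' },
  { operation: 'move', source: 'length', target: 'length_l' }
];
var mapped = applyMappings({ charSet: 'UTF-8', length: 42, title: 'Home' }, rules);
```

Under these assumptions, `mapped` holds `charSet_s`, `length_l`, and the untouched `title` field, while the original `charSet` and `length` fields are gone.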
reevaluateCrawlDbOnStart - boolean
Reevaluate existing crawldb entries for legality on startup?
Default: false
refreshAll - boolean
Set to true to always recrawl all items found in the crawldb.
Default: true
refreshErrors - boolean
Set to true to recrawl items that failed during the last crawl.
Default: false
refreshIDPrefixes - array[string]
Prefixes used to select items for recrawling. All items whose IDs begin with one of these values are recrawled.
refreshIDRegexes - array[string]
Regular expressions used to select items for recrawling. All items whose IDs match one of these patterns are recrawled.
refreshOlderThan - integer
Age threshold in seconds. Items whose last fetched date is further in the past than this value are recrawled.
Default: -1
refreshScript - string
A JavaScript function ('shouldRefresh()') to customize the items recrawled.
refreshStartLinks - boolean
Set to true to recrawl items specified in the list of start links.
Default: false
retainOutlinks - boolean
Set to true for links found during fetching to be stored in the crawldb. This increases precision in certain recrawl scenarios, but requires more memory and disk space.
Default: false
retryEmit - boolean
Set to true for emit batch failures to be retried on a document-by-document basis.
Default: true
rewriteLinkScript - string
A JavaScript function 'rewriteLink(link) { }' to modify links to documents before they are fetched.
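A minimal sketch of such a rewriteLinkScript function follows. It assumes that `link` can be treated as a URL string and that the function returns the rewritten URL as a string. Neither is confirmed by this reference. Stripping a `;jsessionid=` path parameter and the fragment is only an example of the kind of rewrite that might be useful.

```javascript
// Sketch only: assumes `link` is (or can be coerced to) a URL string and
// that the rewritten URL is returned as a string.
function rewriteLink(link) {
  var url = String(link);

  // Drop the fragment, which does not change the fetched document.
  var hashIndex = url.indexOf('#');
  if (hashIndex !== -1) {
    url = url.substring(0, hashIndex);
  }

  // Remove a ';jsessionid=...' path parameter so the same page is not
  // treated as a new URI on every session.
  return url.replace(/;jsessionid=[^?#]*/i, '');
}
```

With this sketch, `https://example.com/display/DOC/Page;jsessionid=ABC123?x=1#top` would be rewritten to `https://example.com/display/DOC/Page?x=1`.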