db - Connector DB
Type and properties for a ConnectorDB implementation to use with this datasource.
type - string
Fully qualified class name of ConnectorDb implementation.
>= 1 characters
Default: com.lucidworks.connectors.db.impl.MapDbConnectorDb
inlinks - boolean
Keep track of incoming links. This negatively impacts performance and size of DB.
Default: false
aliases - boolean
Keep track of original URIs that resolved to the current URI. This negatively impacts performance and size of DB.
Default: false
inv_aliases - boolean
Keep track of target URIs that the current URI resolves to. This negatively impacts performance and size of DB.
Default: false
startLinks - array[string]
The IDs of the folders or files to crawl. For example, if the URL of your folder is https://drive.google.com/drive/folders/0B1u0p7N096R6MWgma3gwUj4j, enter 0B1u0p7N096R6MWgma3gwUj4j here. To crawl the entire Google Drive, enter the special value 'root'.
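As a minimal sketch of the ID extraction described above, the following hypothetical helper (folderIdFromUrl is not part of the connector) pulls the ID out of a folder URL. It assumes the ".../drive/folders/<ID>" URL shape from the example.

```javascript
// Hypothetical helper, not part of the connector: extract the folder ID
// from a Google Drive folder URL so it can be used as a startLinks entry.
function folderIdFromUrl(url) {
  // Assumes the ".../drive/folders/<ID>" URL shape shown in the example.
  var match = /\/drive\/folders\/([^\/?#]+)/.exec(url);
  if (!match) {
    throw new Error("Not a recognized Drive folder URL: " + url);
  }
  return match[1];
}

var startLinks = [
  folderIdFromUrl("https://drive.google.com/drive/folders/0B1u0p7N096R6MWgma3gwUj4j")
];
console.log(startLinks);
```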
dedupe - boolean
If true, documents will be deduplicated. Deduplication can be done based on an analysis of the content, on the content of a specific field, or by a JavaScript function. If neither a field nor a script is defined, content analysis will be used.
Default: false
dedupeField - string
Field to be used for dedupe. Define either a field or a dedupe script; otherwise, the full raw content of each document will be used.
dedupeScript - string
Custom JavaScript to dedupe documents. The script must define a 'genSignature(content){}' function, but can use any combination of document fields. The function must return a string.
dedupeSaveSignature - boolean
If true, the signature used for dedupe will be stored in a 'dedupeSignature_s' field. Note this may cause errors about 'immense terms' in that field.
Default: false
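A minimal sketch of a dedupeScript follows. Only the genSignature(content) name and the string return type come from the description above. Treating content as text, normalizing whitespace and case, and using a 32-bit FNV-1a style hash over UTF-16 code units are all assumptions made for illustration. A short hash keeps the stored signature small, but a 32-bit value can collide on large corpora.

```javascript
// Sketch of a dedupeScript. The function name and string return type are
// from the property description; everything else is an assumption.
function genSignature(content) {
  // Coerce to text and normalize so trivial formatting differences
  // (extra whitespace, letter case) yield the same signature.
  var text = String(content).replace(/\s+/g, " ").trim().toLowerCase();

  // 32-bit FNV-1a style hash over UTF-16 code units.
  var hash = 0x811c9dc5; // FNV offset basis
  for (var i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime (16777619) using shifts, modulo 2^32.
    hash = (hash + ((hash << 1) + (hash << 4) + (hash << 7) +
                    (hash << 8) + (hash << 24))) >>> 0;
  }
  // Return a fixed-width, 8-character hex string.
  return ("00000000" + hash.toString(16)).slice(-8);
}
```

With this sketch, two documents whose text differs only in spacing or capitalization receive the same signature and would be treated as duplicates.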
delete - boolean
Set to true to remove documents from the index when they can no longer be accessed as unique documents.
Default: true
deleteErrorsAfter - integer
Number of fetch failures to tolerate before removing a document from the index. The default of -1 means documents are never removed because of fetch failures.
Default: -1
fetchThreads - integer
The number of threads to use during fetching. The default is 5.
Default: 5
emitThreads - integer
The number of threads used to send documents from the connector to the index pipeline. The default is 5.
Default: 5
chunkSize - integer
The number of items to batch for each round of fetching. A higher value can make crawling faster, but memory usage is also increased. The default is 1.
Default: 1
fetchDelayMS - integer
Number of milliseconds to wait between fetch requests. The default is 0. This property can be used to throttle a crawl if necessary.
Default: 0
refreshAll - boolean
Set to true to always recrawl all items found in the crawldb.
Default: false
refreshStartLinks - boolean
Set to true to recrawl items specified in the list of start links.
Default: false
refreshErrors - boolean
Set to true to recrawl items that failed during the last crawl.
Default: false
refreshOlderThan - integer
Recrawl items whose last fetched date is more than this number of seconds ago.
Default: -1
refreshIDPrefixes - array[string]
A list of prefixes. All items whose IDs begin with one of these values are recrawled.
refreshIDRegexes - array[string]
A list of regular expressions. All items whose IDs match one of these patterns are recrawled.
refreshScript - string
A JavaScript function ('shouldRefresh()') to customize the items recrawled.
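The selection implied by refreshIDPrefixes and refreshIDRegexes can be sketched as follows. The function matchesRefreshRules is hypothetical and is not the connector's code. Whether the connector requires a full or a partial regex match is not stated above, so this sketch assumes a partial match and uses JavaScript regex syntax.

```javascript
// Hypothetical illustration of ID-based recrawl selection.
// Assumption: a regex matches if it is found anywhere in the ID.
function matchesRefreshRules(id, prefixes, regexes) {
  var byPrefix = prefixes.some(function (prefix) {
    return id.indexOf(prefix) === 0; // ID begins with this prefix
  });
  var byRegex = regexes.some(function (pattern) {
    return new RegExp(pattern).test(id);
  });
  return byPrefix || byRegex;
}

console.log(matchesRefreshRules("0B1u0p7N096R6", ["0B1"], []));
```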
forceRefresh - boolean
Set to true to recrawl all items even if they have not changed since the last crawl.
Default: false
forceRefreshClearSignatures - boolean
If true, signatures will be cleared if force recrawl is enabled.
Default: true
retryEmit - boolean
Set to true for emit batch failures to be retried on a document-by-document basis.
Default: true
depth - integer
Number of levels in a directory or site tree to descend for documents.
Default: -1
maxItems - integer
Maximum number of documents to fetch. The default (-1) means no limit.
Default: -1
failFastOnStartLinkFailure - boolean
If true, when Fusion cannot connect to any of the provided start links, the crawl is stopped and an exception is logged.
Default: true
crawlDBType - string
The type of crawl database to use, in-memory or on-disk.
Default: on-disk
Allowed values: in-memory, on-disk
commitAfterItems - integer
Commit the crawlDB to disk after this many items have been received. A smaller number here will result in a slower crawl because of commits to disk being more frequent; conversely, a larger number here will cause a resumed job after a crash to need to recrawl more records.
Default: 10000
initial_mapping - Initial field mapping
Provides mapping of fields before documents are sent to an index pipeline.
skip - boolean
Set to true to skip this stage.
Default: false
label - string
A unique label for this stage.
<= 255 characters
condition - string
Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.
reservedFieldsMappingAllowed - boolean
Default: false
mappings - array[object]
List of mapping rules
Default:
{"source":"charSet","target":"charSet_s","operation":"move"}
{"source":"fetchedDate","target":"fetchedDate_dt","operation":"move"}
{"source":"lastModified","target":"lastModified_dt","operation":"move"}
{"source":"signature","target":"dedupeSignature_s","operation":"move"}
{"source":"length","target":"length_l","operation":"move"}
{"source":"mimeType","target":"mimeType_s","operation":"move"}
{"source":"parent","target":"parent_s","operation":"move"}
{"source":"owner","target":"owner_s","operation":"move"}
{"source":"group","target":"group_s","operation":"move"}
{"source":"driveMimeType","target":"driveMimeType_s","operation":"move"}
object attributes: {
source (required): {
display name: Source Field
type: string
}
target: {
display name: Target Field
type: string
}
operation: {
display name: Operation
type: string
}
}
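To illustrate what a mapping rule does, the sketch below applies a few of the default rules to a sample document. The applyMappings function and the sample document are hypothetical. The sketch assumes "move" renames a field and "copy" duplicates it, and it omits the other operations. It is not Fusion's implementation.

```javascript
// A subset of the default rules, one object per rule.
var mappings = [
  { source: "charSet", target: "charSet_s", operation: "move" },
  { source: "fetchedDate", target: "fetchedDate_dt", operation: "move" },
  { source: "length", target: "length_l", operation: "move" }
];

// Hypothetical illustration of "move" and "copy" only.
function applyMappings(doc, rules) {
  // Work on a shallow copy so the input document is left untouched.
  var out = {};
  Object.keys(doc).forEach(function (key) { out[key] = doc[key]; });

  rules.forEach(function (rule) {
    if (!(rule.source in out)) return; // source field absent: nothing to map
    if (rule.operation === "move") {
      out[rule.target] = out[rule.source];
      delete out[rule.source];
    } else if (rule.operation === "copy") {
      out[rule.target] = out[rule.source];
    }
  });
  return out;
}

var mapped = applyMappings(
  { charSet: "UTF-8", length: 1024, name: "a.txt" },
  mappings
);
console.log(mapped);
```

In this sketch, fields with no matching rule (such as name) pass through unchanged.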
unmapped - Unmapped Fields
If fields do not match any of the field mapping rules, these rules will apply.
source - string
The name of the field to be mapped.
target - string
The name of the field to be mapped to.
operation - string
The type of mapping to perform: move, copy, delete, add, set, or keep.
Default: copy
Allowed values: copy, move, delete, set, add, keep
f.maxSizeBytes - integer
Maximum size, in bytes, of documents to fetch, or -1 for unlimited file size.
Default: 4194304
f.minSizeBytes - integer
Minimum size, in bytes, of documents to fetch.
Default: 0
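The two size limits can be read together as the following check. The withinSizeLimits function is a hypothetical sketch, and it assumes both bounds are inclusive, which the descriptions above do not state.

```javascript
// Hypothetical sketch of the size filter implied by f.minSizeBytes and
// f.maxSizeBytes. Assumption: both bounds are inclusive.
function withinSizeLimits(sizeBytes, minSizeBytes, maxSizeBytes) {
  var aboveMin = sizeBytes >= minSizeBytes;
  // -1 disables the upper bound.
  var belowMax = maxSizeBytes === -1 || sizeBytes <= maxSizeBytes;
  return aboveMin && belowMax;
}

// The default maximum of 4194304 bytes is 4 * 1024 * 1024 (4 MiB).
console.log(withinSizeLimits(5000000, 0, 4194304));
```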
f.addFileMetadata - boolean
Set to true to add information about documents found in the filesystem to the document, such as document owner, group, or ACL permissions.
Default: true
f.index_items_discarded - boolean
Enable to index the metadata of discarded documents.
Default: false
enable_security_trimming - Enable Security Trimming
f.fs.enableSecurityTrimming - boolean
Enable indexing and query-time security trimming for Google users.
Default: true
f.fs.userSearchQuery - string
A Google Drive crawl works by first getting a list of users, then crawling each user's files. The User Search Query property lets you customize the query used to fetch users. See https://developers.google.com/admin-sdk/directory/v1/guides/search-users#fields for the query formats. Separate each query with a comma. Example: email:a*,email:b*,email:c*
f.fs.userExcludeList - string
By default, all users' files are crawled. Enter a comma-separated list of usernames (email addresses) to exclude from the crawl. If the last character of the email address is an '*', then all email addresses that start with that prefix will be excluded.
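The exclude-list matching described for f.fs.userExcludeList can be sketched as follows. The isExcludedUser function and the example addresses are hypothetical, and case-sensitive comparison is an assumption.

```javascript
// Hypothetical sketch of exclude-list matching: an entry ending in '*'
// is a prefix match, any other entry must equal the email address.
// Assumption: comparison is case-sensitive.
function isExcludedUser(email, excludeList) {
  return excludeList.split(",").some(function (entry) {
    var pattern = entry.trim();
    if (pattern === "") return false; // ignore empty entries
    if (pattern.charAt(pattern.length - 1) === "*") {
      return email.indexOf(pattern.slice(0, -1)) === 0;
    }
    return email === pattern;
  });
}

var excludeList = "temp.*, bob@example.com";
console.log(isExcludedUser("temp.jane@example.com", excludeList));
```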
f.fs.defaultDomain - string
For Google Drive security trimming to work, the username must be of the form username@domain. During the security trimming query stage, this default domain is applied to any security trimming user name that is not already in 'username@domain' format.
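A minimal sketch of the default-domain rule follows. The withDefaultDomain function is hypothetical, and it assumes that the presence of an '@' character is what marks a name as already being in username@domain format.

```javascript
// Hypothetical sketch: append the default domain only when the user name
// has no '@' (assumed to mean it is not in username@domain format).
function withDefaultDomain(username, defaultDomain) {
  if (username.indexOf("@") !== -1) {
    return username; // already qualified, leave as-is
  }
  return username + "@" + defaultDomain;
}

console.log(withDefaultDomain("jane", "example.com"));
```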
f.fs.applyGroupSecurityFiltering - boolean
Check this box if you want to query the Google Directory API to fetch a user's Google groups during the security trimming stage. These group names, once indexed, can be used to filter search results by Google Group.
Default: true
security_filter_cache - boolean
Cache of document access control rules.
Default: true
cache_expiration_time - integer
Time in seconds before the security filter cache expires.
Default: 7200
retainOutlinks - boolean
Set to true for links found during fetching to be stored in the crawldb. This increases precision in certain recrawl scenarios, but requires more memory and disk space.
Default: false
aliasExpiration - integer
The number of crawls after which an alias will expire. The default is 1 crawl.
Default: 1
f.fs.clientID - string
Google OAuth Client ID for a registered application with access to the Drive API.
f.fs.clientSecret - string
Google OAuth Client Secret for the registered application.
f.fs.refreshToken - string
OAuth Refresh Token to allow re-authorization of the connector to the Drive API.
f.fs.serviceAccountId - string
For Service Account configuration - specifies the Service Account ID to use to connect Fusion to Google Drive.
f.fs.serviceAccountPrivateKey - string
For Service Account configuration - specifies the private key file in JSON format. Open the private key JSON file (from the Google API console) with a text editor, select all the text, copy it, then paste it into this text box. This private key is treated as a password, meaning it is encrypted and is not visible in plain text when editing.
f.fs.serviceAccountEmail - string
For Service Account configuration only - NOTE: Required only when 'Apply Group Security Filtering' is checked. This service account email must be assigned to your Google project as a Service Actor in the console. It must have the ability to list groups for a user, list users, and read Google Drive content.
f.fs.extraFileFieldsToIndex - string
By default, Google Drive indexing covers only "id,createdTime,modifiedTime,size,name,description,mimeType,owners,permissions,webContentLink,webViewLink,fileExtension,trashed,parents". You can specify additional fields to index here. Note: You can only specify top-level fields, such as "capabilities". Specifying subfields like "capabilities(canAddChildren,canRename)" will result in an error.
f.fs.mime_type_includes - string
A comma-separated list of the MIME types to include in this crawl. Includes supersede excludes.
f.fs.mime_type_excludes - string
A comma-separated list of the MIME types to exclude from this crawl. NOTE: This is only used if the "Mime Type Includes" field is empty.
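The precedence between the two MIME type lists can be sketched as follows. The parseList and mimeTypeAllowed functions are hypothetical, and exact string comparison of MIME types is an assumption.

```javascript
// Split a comma-separated property value into trimmed, non-empty entries.
function parseList(csv) {
  return (csv || "").split(",")
    .map(function (s) { return s.trim(); })
    .filter(function (s) { return s !== ""; });
}

// Hypothetical sketch of the documented precedence: when the include list
// is non-empty it alone decides; the exclude list is consulted only when
// the include list is empty.
function mimeTypeAllowed(mimeType, includesCsv, excludesCsv) {
  var includes = parseList(includesCsv);
  if (includes.length > 0) {
    return includes.indexOf(mimeType) !== -1;
  }
  return parseList(excludesCsv).indexOf(mimeType) === -1;
}

console.log(mimeTypeAllowed("application/pdf", "application/pdf", "application/pdf"));
```

In this sketch, a type listed in both properties is still crawled, because the include list takes precedence.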
f.fs.additional_item_filters - string
Additional search parameters used to filter the files that Google returns for indexing. See https://developers.google.com/drive/v3/web/search-parameters#fn4 for the available parameters. Example: modifiedTime > '2012-06-04T12:00:00'
f.fs.indexTrash - boolean
Set to true to index files in users' Trash folders.
Default: false
f.fs.connectTimeout - integer
Determines how long, in milliseconds, a request to the Google Drive API is allowed to take to connect prior to timing out
Default: 20000
f.fs.readTimeout - integer
Determines how long, in milliseconds, a request to the Google Drive API is allowed to attempt to read content prior to timing out
Default: 20000
f.fs.batchPageSize - integer
Incremental crawling batch page size
>= 1
<= 100
exclusiveMinimum: false
exclusiveMaximum: false
Default: 100
f.fs.RecrawlCollectionName - string
The collection name for incremental crawling
Default: system_google_drive_recrawl
diagnosticMode - boolean
Enable to print more detailed information to the logs about each request.
Default: false
batch_incremental_crawling - boolean
Batch Incremental crawling
Default: true
parserRetryCount - integer
The maximum number of times the configured parser will try getting content before giving up
<= 5
exclusiveMinimum: false
exclusiveMaximum: true
Default: 0