FTP V1 Connector Configuration Reference
Retrieve documents using the File Transfer Protocol (FTP).
V1 deprecation and removal notice
Starting in Fusion 5.12.0, all V1 connectors are deprecated. This means they are no longer being actively developed and will be removed in Fusion 5.13.0.
The replacement for this connector is in active development at this time and will be released at a future date.
If you are using this connector, you must migrate to the replacement connector or a supported alternative before upgrading to Fusion 5.13.0. We recommend migrating to the replacement connector as soon as possible to avoid any disruption to your workflows.
The configuration property "url" specifies the protocol (ftp), the host address, and the path to crawl. By default, all files linked to from this URL will be processed. Several configuration properties are available to limit the crawl.
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
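The escaping rule above can be illustrated with a minimal Python sketch. It assumes the API request body is JSON, and the property name `delimiter` is hypothetical, used only to demonstrate escaping. The JSON encoder doubles the backslash, which yields the escaped form described for the API.

```python
import json

# What you would type in the UI: a backslash followed by the letter t.
ui_value = r"\t"

# "delimiter" is a hypothetical property name used only for this demonstration.
payload = json.dumps({"delimiter": ui_value})

# The serialized body contains the escaped form: {"delimiter": "\\t"}
print(payload)

# Decoding the body gives back the original two-character value.
assert json.loads(payload)["delimiter"] == ui_value
```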
A crawler for FTP sites using either simple FTP or SFTP.
id - string, required
Unique name for this datasource.
>= 1 characters
Match pattern: ^[a-zA-Z0-9_-]+$
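A datasource ID can be checked against the match pattern before it is submitted. The following is a minimal sketch; the function name `is_valid_datasource_id` is hypothetical and not part of any Fusion API.

```python
import re

# Match pattern copied from the "id" property above.
ID_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def is_valid_datasource_id(candidate: str) -> bool:
    """Return True if the candidate is non-empty and uses only allowed characters."""
    return ID_PATTERN.fullmatch(candidate) is not None


print(is_valid_datasource_id("ftp-docs_01"))  # letters, digits, '-' and '_' are allowed
print(is_valid_datasource_id("ftp docs"))     # a space is not allowed
```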
pipeline - string, required
Name of an existing index pipeline for processing documents.
>= 1 characters
description - string
Optional description for this datasource.
parserId - string
Parser used when parsing raw content. For some connectors, an advanced setting is available to retry parsing if an error occurs.
properties - Properties
Datasource configuration properties
db - Connector DB
Type and properties for a ConnectorDB implementation to use with this datasource.
type - string
Fully qualified class name of ConnectorDb implementation.
>= 1 characters
Default: com.lucidworks.connectors.db.impl.MapDbConnectorDb
inlinks - boolean
Keep track of incoming links. This negatively impacts performance and size of DB.
Default: false
aliases - boolean
Keep track of original URIs that resolved to the current URI. This negatively impacts performance and size of DB.
Default: false
inv_aliases - boolean
Keep track of target URIs that the current URI resolves to. This negatively impacts performance and size of DB.
Default: false
url - string
The link to the FTP site that is being crawled. It can point to a subdirectory, or the root of the site to crawl the entire site, e.g., ftp://ftp.example.com:21/some_folder/
>= 1 characters
Match pattern: .*:.*
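As a rough pre-flight check, a start URL can be validated against the match pattern and split into its parts. This is a sketch only: the function name `check_ftp_url` is hypothetical, and restricting the scheme to `ftp` is an assumption based on the example URL above, so adjust it if your deployment uses a different scheme.

```python
import re
from urllib.parse import urlsplit

# Match pattern copied from the "url" property above.
URL_PATTERN = re.compile(r".*:.*")


def check_ftp_url(url: str) -> dict:
    """Return host, port, and path of an FTP start URL, or raise ValueError."""
    if not url or URL_PATTERN.fullmatch(url) is None:
        raise ValueError("url must be non-empty and contain a ':'")
    parts = urlsplit(url)
    # Assumption: only the ftp scheme is accepted in this sketch.
    if parts.scheme != "ftp":
        raise ValueError(f"expected ftp scheme, got {parts.scheme!r}")
    return {"host": parts.hostname, "port": parts.port, "path": parts.path}


print(check_ftp_url("ftp://ftp.example.com:21/some_folder/"))
```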
max_docs - integer
Maximum number of documents to fetch. The default (-1) means no limit.
>= -1
exclusiveMinimum: false
Default: -1
max_bytes - integer
Maximum size (in bytes) of documents to fetch or -1 for unlimited file size.
>= -1
exclusiveMinimum: false
Default: 10485760
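The default of 10485760 bytes equals 10 MiB (10 × 1024 × 1024). A small helper such as the following sketch can convert a size in MiB to a `max_bytes` value; the helper name `mib_to_max_bytes` is hypothetical.

```python
BYTES_PER_MIB = 1024 * 1024


def mib_to_max_bytes(mib: int) -> int:
    """Convert a size in MiB to bytes; -1 is passed through as 'unlimited'."""
    if mib == -1:
        return -1
    if mib < 0:
        raise ValueError("size must be -1 (unlimited) or non-negative")
    return mib * BYTES_PER_MIB


# The documented default corresponds to 10 MiB.
assert mib_to_max_bytes(10) == 10485760
```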
index_directories - boolean
Set to true to add directories to the index as documents. If set to false, directories will not be added to the index, but they will still be traversed for documents.
Default: false
max_threads - integer
The maximum number of threads to use for fetching data. Note: Each thread will create a new connection to the repository, which may make overall throughput faster, but this also requires more system resources, including CPU and memory.
Default: 1
add_failed_docs - boolean
Set to true to add documents even if they partially fail processing. Failed documents will be added with as much metadata as available, but may not include all expected fields.
Default: false
crawl_item_timeout - integer
Maximum time, in milliseconds, allowed to fetch any individual document.
exclusiveMinimum: true
Default: 600000
maximum_connections - integer
Maximum number of concurrent connections to the filesystem. A large number of documents could cause a large number of simultaneous connections to the repository and lead to errors or degraded performance. In some cases, reducing this number may help resolve performance issues.
Default: 1000
initial_mapping - Initial field mapping
Provides mapping of fields before documents are sent to an index pipeline.
skip - boolean
Set to true to skip this stage.
Default: false
label - string
A unique label for this stage.
<= 255 characters
condition - string
Define a conditional script that must result in true or false. Use it to determine whether the stage processes a document.
reservedFieldsMappingAllowed - boolean
Default: false
mappings - array[object]
List of mapping rules
Default: {"source":"fetch_time","target":"fetch_time_dt","operation":"move"}, {"source":"ds:description","target":"description","operation":"move"}
object attributes:
source - string, required
Display name: Source Field
target - string
Display name: Target Field
operation - string
Display name: Operation
unmapped - Unmapped Fields
If fields do not match any of the field mapping rules, these rules will apply.
source - string
The name of the field to be mapped.
target - string
The name of the field to be mapped to.
operation - string
The type of mapping to perform: move, copy, delete, add, set, or keep.
Default: copy
Allowed values: copy, move, delete, set, add, keep
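Mapping rules can be sanity-checked before they are submitted. The sketch below uses the two default rules listed for `mappings`. It assumes that rules under `mappings` accept the same operation values listed for `unmapped`, and the function name `validate_mappings` is hypothetical.

```python
# Assumption: "mappings" rules accept the same operations as "unmapped".
ALLOWED_OPERATIONS = {"copy", "move", "delete", "set", "add", "keep"}


def validate_mappings(mappings: list) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for i, rule in enumerate(mappings):
        # "source" is the only attribute marked as required.
        if not rule.get("source"):
            problems.append(f"rule {i}: missing required 'source'")
        op = rule.get("operation")
        if op is not None and op not in ALLOWED_OPERATIONS:
            problems.append(f"rule {i}: unknown operation {op!r}")
    return problems


# The default mapping rules from this reference.
default_mappings = [
    {"source": "fetch_time", "target": "fetch_time_dt", "operation": "move"},
    {"source": "ds:description", "target": "description", "operation": "move"},
]

print(validate_mappings(default_mappings))
```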
username - string
Username with permissions to access the repository, if necessary.
password - string
Password for the user.
crawl_depth - integer
Number of levels in a directory or site tree to descend for documents.
>= -1
exclusiveMinimum: false
Default: -1
bounds - string
Limits the crawl to a specific directory sub-tree, hostname or domain.
Default: tree
Allowed values: tree, host, domain, none
include_paths - array[string]
Regular expressions for URI patterns to include. This will limit this datasource to only URIs that match the regular expression.
exclude_paths - array[string]
Regular expressions for URI patterns to exclude. This will limit this datasource to only URIs that do not match the regular expression.
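The combined effect of `include_paths` and `exclude_paths` can be sketched as follows. This is an approximation, not the connector's implementation: it assumes each pattern must match the whole URI, that exclusions take precedence, and that an empty include list allows every URI. The function name `is_crawlable` and the sample patterns are hypothetical.

```python
import re


def is_crawlable(uri: str, include_paths=(), exclude_paths=()) -> bool:
    """Decide whether a URI passes the include and exclude filters."""
    # Assumption: a URI matching any exclude pattern is always dropped.
    if any(re.fullmatch(pattern, uri) for pattern in exclude_paths):
        return False
    # Assumption: with no include patterns, every remaining URI is accepted.
    if include_paths:
        return any(re.fullmatch(pattern, uri) for pattern in include_paths)
    return True


include = [r".*\.pdf"]
exclude = [r".*/archive/.*"]

print(is_crawlable("ftp://ftp.example.com/docs/a.pdf", include, exclude))
print(is_crawlable("ftp://ftp.example.com/archive/a.pdf", include, exclude))
```

Test patterns against a sample of real URIs from your site before running a full crawl, since the connector's actual matching rules may differ from these assumptions.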
include_extensions - array[string]
List the file extensions to be fetched. Note: Files with possible matching MIME types but non-matching file extensions will be skipped. Extensions should be listed without periods, using whitespace to separate items (e.g., 'pdf zip').
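The extension format described above (no periods, whitespace-separated) can be illustrated with a short sketch. The function names are hypothetical, and the case-insensitive comparison is an assumption rather than documented connector behavior.

```python
from pathlib import PurePosixPath


def parse_extensions(value: str) -> set:
    """Split a whitespace-separated list such as 'pdf zip' into a set of extensions."""
    return {ext.lower() for ext in value.split()}


def has_allowed_extension(path: str, allowed: set) -> bool:
    """Check the file's final extension (without the period) against the allowed set."""
    suffix = PurePosixPath(path).suffix  # for example ".pdf", or "" if there is none
    # Assumption: extensions are compared case-insensitively.
    return suffix[1:].lower() in allowed


allowed = parse_extensions("pdf zip")
print(has_allowed_extension("/some_folder/report.PDF", allowed))
print(has_allowed_extension("/some_folder/notes.txt", allowed))
```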
commit_on_finish - boolean
Set to true to send a commit request to Solr after the last batch has been fetched, so that the documents are committed to the index.
Default: true
verify_access - boolean
Set to true to require successful connection to the filesystem before saving this datasource.
Default: true
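To show how the properties in this reference fit together, the following sketch assembles a datasource definition and serializes it to JSON. The ID, pipeline name, and description are placeholder values, and the nesting of crawl settings under `properties` is an assumption based on the order of this reference. Any additional fields your Fusion version requires, such as those identifying the connector type, are not covered here and are omitted.

```python
import json

# Placeholder values; replace them with names that exist in your deployment.
datasource = {
    "id": "ftp-example",
    "pipeline": "my-index-pipeline",
    "description": "Example FTP datasource",
    # Assumption: crawl settings are nested under "properties".
    "properties": {
        "url": "ftp://ftp.example.com:21/some_folder/",
        "max_docs": -1,          # no limit (default)
        "max_bytes": 10485760,   # 10 MiB (default)
        "crawl_depth": -1,       # default
        "bounds": "tree",        # default
        "include_extensions": ["pdf", "zip"],
        "commit_on_finish": True,
        "verify_access": True,
    },
}

body = json.dumps(datasource, indent=2)
print(body)
```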