Fusion 5.10

    AWS S3 V1 Connector Configuration Reference

    The AWS S3 V1 Connector can access AWS S3 buckets in native format.

    Deprecation and removal notice

    This connector is deprecated as of Fusion 5.2 and is removed or expected to be removed as of Fusion 5.5. Use the AWS S3 V2 connector instead.

    For more information about deprecations and removals, including possible alternatives, see Deprecations and Removals.

    The connector uses the S3 API to request data from S3. It calls the listBucket service, which lists all buckets owned by the user account supplied to the connector.

    When creating an S3 data source using the UI, Fusion automatically verifies that the user information supplied has access to the bucket defined in the URL property. If the bucket is not in the list returned by S3, data source creation may fail. At crawl time, if the bucket is not in the list returned by S3, the crawl fails.

    Permission errors when trying to create or crawl the data source may be caused by incorrect username or password, or they may be due to user account permissions. The user account must have List Bucket permissions for the account which owns the bucket that the crawler is trying to access.

    Configuration

    When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
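
    As a minimal sketch of the escaping rule above, assuming the API request body is JSON: a standard JSON serializer produces the escaped form from the value you would type in the UI. The property name `delimiter` is hypothetical and used only for illustration.

```python
import json

# What a user would type in the UI: a backslash followed by "t" (two characters).
ui_value = r"\t"

# "delimiter" is a hypothetical property name, not a documented connector property.
api_body = json.dumps({"delimiter": ui_value})

# The serialized body carries the escaped form: {"delimiter": "\\t"}
print(api_body)
```

    Building request bodies with a serializer, rather than by hand, avoids having to count backslashes.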

    Connector to index content in AWS S3 buckets.

    id - string, required

    Unique name for this datasource.

    >= 1 characters

    Match pattern: ^[a-zA-Z0-9_-]+$
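
    The constraints above can be checked before a datasource is submitted. This is a sketch only; the helper name `is_valid_datasource_id` is an assumption and not part of Fusion.

```python
import re

# Match pattern copied from the id property above.
ID_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")


def is_valid_datasource_id(value: str) -> bool:
    """Return True if value has at least one character and matches the pattern."""
    # fullmatch avoids accepting a trailing newline, which "$" alone would allow.
    return bool(ID_PATTERN.fullmatch(value))


print(is_valid_datasource_id("my-s3_source1"))  # letters, digits, "_" and "-" only
print(is_valid_datasource_id("my s3 source"))   # contains spaces
```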

    pipeline - string, required

    Name of an existing index pipeline for processing documents.

    >= 1 characters

    description - string

    Optional description for this datasource.

    parserId - string

    Parser used when parsing raw content. For some connectors, an advanced setting is available to retry parsing if an error occurs.

    properties - Properties

    Datasource configuration properties

    db - Connector DB

    Type and properties for a ConnectorDB implementation to use with this datasource.

    type - string

    Fully qualified class name of ConnectorDb implementation.

    >= 1 characters

    Default: com.lucidworks.connectors.db.impl.MapDbConnectorDb

    inlinks - boolean

    Keep track of incoming links. This negatively impacts performance and size of DB.

    Default: false

    aliases - boolean

    Keep track of original URIs that resolved to the current URI. This negatively impacts performance and size of DB.

    Default: false

    inv_aliases - boolean

    Keep track of target URIs that the current URI resolves to. This negatively impacts performance and size of DB.

    Default: false

    url - string

    A fully-qualified S3 URL, including bucket and sub-bucket paths, as required, e.g., 's3://{bucketName}/{path}'.

    >= 1 characters

    Match pattern: .*:.*
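
    To illustrate the URL format, the sketch below splits a value of the form 's3://{bucketName}/{path}' into its bucket and path. The function name `split_s3_url` and the example bucket are assumptions; the connector's own parsing is not shown in this reference and may differ.

```python
import re
from urllib.parse import urlsplit

# Match pattern copied from the url property above.
URL_PATTERN = re.compile(r".*:.*")


def split_s3_url(url: str) -> tuple[str, str]:
    """Split 's3://bucket/path' into (bucket, path); raise ValueError if malformed."""
    if not URL_PATTERN.fullmatch(url):
        raise ValueError(f"URL does not match .*:.* : {url!r}")
    parts = urlsplit(url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"Not a fully-qualified S3 URL: {url!r}")
    # The path is returned without its leading slash; it is empty for a bare bucket.
    return parts.netloc, parts.path.lstrip("/")


bucket, path = split_s3_url("s3://example-bucket/reports/2020")
print(bucket, path)
```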

    max_docs - integer

    Maximum number of documents to fetch. The default (-1) means no limit.

    >= -1

    exclusiveMinimum: false

    Default: -1

    max_bytes - integer

    Maximum size (in bytes) of documents to fetch or -1 for unlimited file size.

    >= -1

    exclusiveMinimum: false

    Default: 10485760

    index_directories - boolean

    Set to true to add directories to the index as documents. If set to false, directories will not be added to the index, but they will still be traversed for documents.

    Default: false

    max_threads - integer

    The maximum number of threads to use for fetching data. Note: Each thread will create a new connection to the repository, which may make overall throughput faster, but this also requires more system resources, including CPU and memory.

    Default: 1

    add_failed_docs - boolean

    Set to true to add documents even if they partially fail processing. Failed documents will be added with as much metadata as available, but may not include all expected fields.

    Default: false

    crawl_item_timeout - integer

    Maximum time, in milliseconds, allowed to fetch any individual document.

    exclusiveMinimum: true

    Default: 600000

    maximum_connections - integer

    Maximum number of concurrent connections to the filesystem. A large number of documents could cause a large number of simultaneous connections to the repository and lead to errors or degraded performance. In some cases, reducing this number may help performance issues.

    Default: 1000

    initial_mapping - Initial field mapping

    Provides mapping of fields before documents are sent to an index pipeline.

    skip - boolean

    Set to true to skip this stage.

    Default: false

    label - string

    A unique label for this stage.

    <= 255 characters

    condition - string

    Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

    reservedFieldsMappingAllowed - boolean

    Default: false

    mappings - array[object]

    List of mapping rules

    Default: [{"source":"fetch_time","target":"fetch_time_dt","operation":"move"}, {"source":"ds:description","target":"description","operation":"move"}]

    object attributes:

    source - string, required (display name: Source Field)

    target - string (display name: Target Field)

    operation - string (display name: Operation)

    unmapped - Unmapped Fields

    If fields do not match any of the field mapping rules, these rules will apply.

    source - string

    The name of the field to be mapped.

    target - string

    The name of the field to be mapped to.

    operation - string

    The type of mapping to perform: move, copy, delete, add, set, or keep.

    Default: copy

    Allowed values: copy, move, delete, set, add, keep

    username - string

    An AWS Access Key ID that can access the content.

    password - string

    The AWS Secret Key associated with the Access Key.

    crawl_depth - integer

    Number of levels in a directory or site tree to descend for documents.

    >= -1

    exclusiveMinimum: false

    Default: -1

    bounds - string

    Limits the crawl to a specific directory sub-tree, hostname or domain.

    Default: tree

    Allowed values: tree, host, domain, none

    include_paths - array[string]

    Regular expressions for URI patterns to include. This will limit this datasource to only URIs that match the regular expression.

    exclude_paths - array[string]

    Regular expressions for URI patterns to exclude. This will limit this datasource to only URIs that do not match the regular expression.
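
    The interaction of include_paths and exclude_paths can be sketched as follows. This is an illustration under stated assumptions, not the connector's implementation: it assumes each pattern must match the whole URI, that an empty include list admits every URI, and that excludes are applied after includes. The function name `uri_allowed` is hypothetical.

```python
import re


def uri_allowed(uri: str, include_paths: list[str], exclude_paths: list[str]) -> bool:
    """Return True if uri passes the include list and is not hit by the exclude list."""
    # Assumption: with no include patterns, every URI is a candidate.
    if include_paths and not any(re.fullmatch(p, uri) for p in include_paths):
        return False
    # Assumption: a single matching exclude pattern rejects the URI.
    if any(re.fullmatch(p, uri) for p in exclude_paths):
        return False
    return True


include = [r".*\.pdf"]
exclude = [r".*/archive/.*"]
for uri in ("s3://example-bucket/docs/a.pdf", "s3://example-bucket/archive/a.pdf"):
    print(uri, uri_allowed(uri, include, exclude))
```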

    include_extensions - array[string]

    List the file extensions to be fetched. Note: Files with possible matching MIME types but non-matching file extensions will be skipped. Extensions should be listed without periods, using whitespace to separate items (e.g., 'pdf zip').
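
    As a sketch of the extension rule above: the filter looks only at the file extension, so a file whose MIME type would qualify is still skipped when its extension is not listed. The helper names are hypothetical, and the case-insensitive comparison and the treatment of an empty list as "fetch everything" are assumptions.

```python
from pathlib import PurePosixPath


def parse_extensions(value: str) -> set[str]:
    """Parse a whitespace-separated list such as 'pdf zip' (no periods)."""
    return {ext.lower() for ext in value.split()}


def extension_allowed(key: str, allowed: set[str]) -> bool:
    """Return True if the object key's extension is in the allowed set."""
    if not allowed:
        # Assumption: no configured extensions means no filtering.
        return True
    suffix = PurePosixPath(key).suffix  # for example ".pdf", or "" when absent
    return suffix[1:].lower() in allowed


allowed = parse_extensions("pdf zip")
print(extension_allowed("reports/summary.pdf", allowed))
print(extension_allowed("reports/summary.docx", allowed))
```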

    use_instance_creds - boolean

    Use a credentials provider chain that uses system properties rather than an AWS key. This can be used to provide AWS EC2 instance credentials (for Fusion hosted on an EC2 instance, whose nodes already have an EC2 instance role assigned). For details, see the AWS SDK documentation (https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/auth/DefaultAWSCredentialsProviderChain.html). You can specify another AWS region through the "SSE-KMS Encryption > AWS Region" option. Other parameters can be set through the Helm chart or by modifying the deployment.

    Default: false

    commit_on_finish - boolean

    Set to true for a request to be sent to Solr after the last batch has been fetched to commit the documents to the index.

    Default: true

    verify_access - boolean

    Set to true to require successful connection to the filesystem before saving this datasource.

    Default: true

    use_sigv4 - boolean

    Sets the signature algorithm used for signing requests to "AWSS3V4SignerType". Required to retrieve objects encrypted with SSE-KMS.

    Default: false

    aws_region - string

    Sets the region to be used by the client. This determines both the service endpoint (e.g., https://sns.us-west-2.amazonaws.com) and the signing region.

    Default: us-west-2

    Allowed values: us-gov-west-1, us-gov-east-1, us-east-1, us-east-2, us-west-1, us-west-2, eu-west-1, eu-west-2, eu-west-3, eu-central-1, eu-north-1, eu-south-1, ap-east-1, ap-south-1, ap-southeast-1, ap-southeast-2, ap-northeast-1, ap-northeast-2, sa-east-1, cn-north-1, cn-northwest-1, ca-central-1, me-south-1, af-south-1

    retryDelay - integer

    The initial retry time delay, in milliseconds.

    >= 1000

    exclusiveMinimum: false

    Default: 1000

    stopRetry - integer

    The maximum time to retry failed requests, in minutes.

    >= 1

    exclusiveMinimum: false

    Default: 5
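
    The relationship between retryDelay and stopRetry can be illustrated with a small schedule calculation. This reference says only that retryDelay is the initial delay and stopRetry is the maximum retry time; the doubling factor below is an assumption made for illustration, and the function name `retry_delays_ms` is hypothetical.

```python
def retry_delays_ms(retry_delay_ms: int = 1000, stop_retry_min: int = 5,
                    factor: int = 2) -> list[int]:
    """List the waits (ms) that fit inside the stopRetry budget.

    Assumption: each wait is `factor` times the previous one. The connector's
    actual backoff policy is not documented in this reference.
    """
    if retry_delay_ms < 1000 or stop_retry_min < 1:
        raise ValueError("retryDelay must be >= 1000 and stopRetry must be >= 1")
    budget_ms = stop_retry_min * 60_000
    delays, elapsed, delay = [], 0, retry_delay_ms
    # Stop as soon as the next wait would exceed the total retry budget.
    while elapsed + delay <= budget_ms:
        delays.append(delay)
        elapsed += delay
        delay *= factor
    return delays


schedule = retry_delays_ms()
print(len(schedule), sum(schedule))  # number of waits and total wait time in ms
```

    Under the doubling assumption, the defaults give 8 waits totalling 255000 ms before the 5-minute limit stops further retries.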

    proxyHost - string

    The optional proxy host the client will connect through.

    proxyPort - integer

    The optional proxy port the client will connect through.

    proxyUsername - string

    The optional proxy user name to use when connecting through a proxy.

    proxyPassword - string

    The optional proxy password to use when connecting through a proxy.

    proxyHttps - boolean

    Force the use of the HTTPS protocol when connecting to the proxy.
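
    Pulling the properties together, the sketch below assembles a datasource configuration as JSON using only field names and defaults listed in this reference. The nesting of connector settings under `properties` is inferred from the listing above and is an assumption, as are the helper name and all example values. A real request may need additional fields that this reference does not list, and the API endpoint is not shown here.

```python
import json


def build_datasource_config(ds_id: str, pipeline: str, url: str,
                            access_key: str, secret_key: str) -> dict:
    """Assemble a datasource configuration from documented property names."""
    return {
        "id": ds_id,            # must match ^[a-zA-Z0-9_-]+$
        "pipeline": pipeline,   # name of an existing index pipeline
        "properties": {         # assumption: connector settings nest here
            "url": url,
            "username": access_key,   # AWS Access Key ID
            "password": secret_key,   # AWS Secret Key
            "max_docs": -1,           # documented default: no limit
            "max_bytes": 10485760,    # documented default
            "aws_region": "us-west-2",
            "verify_access": True,
            "commit_on_finish": True,
        },
    }


# Placeholder values only; never hard-code real credentials.
config = build_datasource_config(
    "example-s3-datasource",
    "example-pipeline",
    "s3://example-bucket/reports",
    "EXAMPLE_ACCESS_KEY_ID",
    "EXAMPLE_SECRET_KEY",
)
body = json.dumps(config, indent=2)
print(body)
```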