Apache Hadoop 2 V1 Connector Configuration Reference
The Apache Hadoop 2 Connector is a MapReduce-enabled crawler that is compatible with Apache Hadoop v2.x.
Deprecation and removal notice
This connector is deprecated as of Fusion 4.2 and is removed or expected to be removed as of Fusion 5.0.
When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
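The difference is easiest to see when a value is serialized for the API. The following Python sketch assumes the API request body is JSON. It shows how a value typed as \t in the UI appears as \\t in the request text. The key arg_value is borrowed from the mapper_args property purely for illustration.

```python
import json

# What a user would type into a UI field: a backslash followed by "t".
ui_value = r"\t"

# Serializing for a JSON request body escapes the backslash itself,
# so the request text contains two backslashes before the "t".
api_text = json.dumps({"arg_value": ui_value})
print(api_text)

# Decoding the JSON recovers the same two-character value the UI would hold.
assert json.loads(api_text)["arg_value"] == ui_value
```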
Connector for using a Hadoop cluster to process documents and forward them to Solr for indexing. This uses a Hadoop job jar to pass arguments to Hadoop for processing with MapReduce.
description - string
Optional description for this datasource.
id - string, required
Unique name for this datasource.
>= 1 characters
Match pattern: ^[a-zA-Z0-9_-]+$
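A datasource id can be checked against this pattern before it is submitted. The sketch below is a minimal Python check; the function name is_valid_datasource_id is hypothetical and not part of Fusion.

```python
import re

# Same character class as the documented pattern. fullmatch() supplies the
# anchoring, so a trailing newline is rejected as well.
ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]+")

def is_valid_datasource_id(candidate: str) -> bool:
    """Return True if the candidate is at least one allowed character long
    and contains only letters, digits, underscores, and hyphens."""
    return ID_PATTERN.fullmatch(candidate) is not None

print(is_valid_datasource_id("hadoop_logs-01"))  # letters, digits, _ and - only
print(is_valid_datasource_id("hadoop logs"))     # contains a space
```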
pipeline - string, required
Name of an existing index pipeline for processing documents.
>= 1 characters
properties - Properties
Datasource configuration properties
db - Connector DB
Type and properties for a ConnectorDB implementation to use with this datasource.
aliases - boolean
Keep track of original URIs that resolved to the current URI. This negatively impacts performance and the size of the DB.
Default: false
inlinks - boolean
Keep track of incoming links. This negatively impacts performance and the size of the DB.
Default: false
inv_aliases - boolean
Keep track of target URIs that the current URI resolves to. This negatively impacts performance and the size of the DB.
Default: false
type - string
Fully qualified class name of ConnectorDb implementation.
>= 1 characters
Default: com.lucidworks.connectors.db.impl.MapDbConnectorDb
fusion_batchsize - integer
Fusion Client Batch Size
>= 1
exclusiveMinimum: true
Default: 500
fusion_buffer_timeoutms - integer
Fusion Client Timeout (ms).
>= 1
exclusiveMinimum: true
Default: 1000
fusion_endpoints - array[string]
Default: "http://localhost:8764"
fusion_fail_on_error - boolean
Fusion Client Fail on Error
Default: false
fusion_login_app_name - string
Application name in the login configuration. Defaults to FusionClient.
Default: FusionClient
fusion_login_config - string
File path of the login configuration used when Fusion is kerberized. The file must be placed on every mapper and reducer node.
fusion_password - string
Password of the Fusion client user. Leave empty if Kerberos is used.
fusion_realm - string
Fusion's realm. If 'native' is selected, the password is mandatory. If 'kerberos' is selected, the login configuration is mandatory.
Default: NATIVE
Allowed values: NATIVE, KERBEROS
fusion_user - string
Fusion client user name, or the principal if Kerberos is used.
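The realm rules above can be expressed as a small pre-flight check. This is a Python sketch only, assuming the properties are held in a dictionary keyed by the names in this reference. The function check_fusion_auth is hypothetical and does not reflect how Fusion validates its own configuration.

```python
def check_fusion_auth(properties: dict) -> list:
    """Return a list of problems with the fusion_* authentication settings."""
    problems = []
    # NATIVE is the documented default when no realm is given.
    realm = properties.get("fusion_realm", "NATIVE")
    if realm == "NATIVE":
        # The native realm requires a password.
        if not properties.get("fusion_password"):
            problems.append("fusion_password is mandatory for the NATIVE realm")
    elif realm == "KERBEROS":
        # The Kerberos realm requires a login configuration file.
        if not properties.get("fusion_login_config"):
            problems.append("fusion_login_config is mandatory for the KERBEROS realm")
    else:
        problems.append("fusion_realm must be NATIVE or KERBEROS")
    return problems

# Hypothetical values, for illustration only.
print(check_fusion_auth({"fusion_realm": "KERBEROS", "fusion_user": "user@YOUR-REALM.COM"}))
```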
hadoop_home - string
Path to the Hadoop home directory where $HADOOP_HOME/bin/hadoop can be found. The connector requires access to either a full Hadoop installation, or a Hadoop client provided by your Hadoop distribution that has been configured to access the Hadoop installation.
>= 1 characters
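Before saving the datasource, it can be useful to confirm that the configured directory really contains the hadoop launcher. The Python sketch below is one way to do that on the machine where the connector runs. Both function names are hypothetical.

```python
import os

def hadoop_launcher_path(hadoop_home: str) -> str:
    """Build the expected location of the launcher: <hadoop_home>/bin/hadoop."""
    return os.path.join(hadoop_home, "bin", "hadoop")

def has_hadoop_launcher(hadoop_home: str) -> bool:
    """Return True if the launcher exists as a file and is executable."""
    path = hadoop_launcher_path(hadoop_home)
    return os.path.isfile(path) and os.access(path, os.X_OK)

# Falls back to a hypothetical install location if HADOOP_HOME is not set.
candidate = os.environ.get("HADOOP_HOME", "/opt/hadoop")
print(candidate, has_hadoop_launcher(candidate))
```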
hadoop_input - string
Hadoop input source file/directory
>= 1 characters
hadoop_mapper - string
Hadoop Ingest Mapper
Default: CSV
Allowed values: CSV, DIRECTORY, GROK, REGEX, SEQUENCE_FILE, SOLR_XML, WARC, ZIP
initial_mapping - Initial field mapping
Provides mapping of fields before documents are sent to an index pipeline.
condition - string
Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.
label - string
A unique label for this stage.
<= 255 characters
mappings - array[object]
List of mapping rules
object attributes:
operation - string
Display name: Operation
source - string, required
Display name: Source Field
target - string
Display name: Target Field
reservedFieldsMappingAllowed - boolean
Default: false
skip - boolean
Set to true to skip this stage.
Default: false
unmapped - Unmapped Fields
If fields do not match any of the field mapping rules, these rules will apply.
operation - string
The type of mapping to perform: move, copy, delete, add, set, or keep.
Default: copy
Allowed values: copy, move, delete, set, add, keep
source - string
The name of the field to be mapped.
target - string
The name of the field to be mapped to.
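To make the shape of initial_mapping concrete, the Python sketch below builds one as a dictionary and checks each rule. It assumes that mapping rules accept the same operations listed for unmapped fields. The label, the field names, and the helper mapping_rule_problems are hypothetical.

```python
import json

# Operations listed in this reference for unmapped fields; assumed here to
# apply to individual mapping rules as well.
ALLOWED_OPERATIONS = {"copy", "move", "delete", "set", "add", "keep"}

def mapping_rule_problems(rule: dict) -> list:
    """Return a list of problems found in one mapping rule (empty if none)."""
    problems = []
    # The source field is the only attribute marked as required.
    if not rule.get("source"):
        problems.append("source is required")
    operation = rule.get("operation")
    if operation is not None and operation not in ALLOWED_OPERATIONS:
        problems.append("unknown operation: %s" % operation)
    return problems

initial_mapping = {
    "label": "hadoop-field-mapping",  # hypothetical label
    "skip": False,
    "reservedFieldsMappingAllowed": False,
    "mappings": [
        # Field names below are hypothetical.
        {"source": "raw_title", "target": "title_t", "operation": "move"},
        {"source": "tmp_debug", "operation": "delete"},
    ],
    "unmapped": {"operation": "copy"},
}

for rule in initial_mapping["mappings"]:
    assert mapping_rule_problems(rule) == []

print(json.dumps(initial_mapping, indent=2))
```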
job_jar - string
Path and name of the Hadoop job jar. Unless you are using a custom job jar, the default provided by Fusion is preferred.
>= 1 characters
Default: lucidworks-hadoop-job-2.2.7.jar
kinit_cache - string
Full path to the Kerberos ticket cache. If this path does not exist, it will be created.
kinit_cmd - string
Full path to the 'kinit' binary.
Default: kinit
kinit_keytab - string
Full path to the Kerberos keytab file.
kinit_principal - string
Kerberos principal name, for example username@YOUR-REALM.COM
mapper_args - array[object]
Parameters for the Hadoop job.
object attributes:
arg_name - string
Display name: name
arg_value - string
Display name: value
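Because mapper_args is an array of name/value objects, a plain dictionary of job parameters has to be reshaped before it is sent. The Python sketch below shows one way to do that. The helper to_mapper_args and the parameter names are hypothetical and do not correspond to real Hadoop job options.

```python
def to_mapper_args(params: dict) -> list:
    """Convert {name: value} pairs into the arg_name/arg_value object list.

    Values are converted to strings because both attributes are typed as string.
    """
    return [
        {"arg_name": name, "arg_value": str(value)}
        for name, value in params.items()
    ]

# Placeholder parameter names, for illustration only.
mapper_args = to_mapper_args({
    "example.param.one": "value-one",
    "example.param.two": 42,
})
print(mapper_args)
```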
reducers - integer
(Expert) Depending on the OutputFormat and your system resources, you may want Hadoop to run a reduce step first so that it does not open too many connections to the output resource.
exclusiveMinimum: false
Default: 0
run_kinit - boolean
If your Hadoop installation requires job requests to authenticate with Kerberos, this option will allow Fusion to run 'kinit' to get a valid ticket.
Default: false
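Putting the properties together, a datasource definition might be assembled as in the following Python sketch. It uses only property names from this reference. All paths, names, and credentials are placeholders, and a real API request may need additional fields that this reference does not list.

```python
import json

datasource = {
    "id": "hadoop2-example",            # must match ^[a-zA-Z0-9_-]+$
    "pipeline": "my-index-pipeline",    # hypothetical existing index pipeline
    "description": "Example Hadoop 2 datasource",
    "properties": {
        "hadoop_home": "/opt/hadoop",   # hypothetical install location
        "hadoop_input": "/data/input",  # hypothetical input directory
        "hadoop_mapper": "CSV",
        "job_jar": "lucidworks-hadoop-job-2.2.7.jar",
        "reducers": 0,
        "fusion_endpoints": ["http://localhost:8764"],
        "fusion_realm": "NATIVE",
        "fusion_user": "example-user",          # placeholder
        "fusion_password": "example-password",  # placeholder
        # Kerberos settings for the Hadoop side; paths are placeholders.
        "run_kinit": True,
        "kinit_cmd": "kinit",
        "kinit_principal": "username@YOUR-REALM.COM",
        "kinit_keytab": "/etc/security/keytabs/example.keytab",
        "kinit_cache": "/tmp/example_krb5cc",
    },
}

print(json.dumps(datasource, indent=2))
```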