Apache Hadoop 2 V1 Connector Configuration Reference

The Apache Hadoop 2 Connector is a MapReduce-enabled crawler that is compatible with Apache Hadoop v2.x.

Deprecation and removal notice

This connector is deprecated as of Fusion 4.2 and is removed or expected to be removed as of Fusion 5.0.

For more information about deprecations and removals, including possible alternatives, see Deprecations and Removals.

A non-MapReduce connector for HDFS filesystems is also available; see the HDFS Connector Configuration Reference for details.

Configuration

When entering configuration values in the UI, use unescaped characters, such as \t for the tab character. When entering configuration values in the API, use escaped characters, such as \\t for the tab character.
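As a minimal sketch of the escaping rule above, assuming the API request body is JSON, a standard JSON serializer produces the escaped form from the value as typed in the UI. The variable names are illustrative only.

```python
import json

# The value as typed in the UI: two characters, a backslash followed by "t".
ui_value = r"\t"

# Serializing the same value for a JSON request body doubles the backslash.
api_value = json.dumps(ui_value)

print(api_value)  # prints "\\t" (including the surrounding quotes)
```

Decoding the JSON text with `json.loads` returns the original two-character value, so both forms describe the same setting.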

This connector uses a Hadoop cluster to process documents and forward them to Solr for indexing. It uses a Hadoop job jar to pass arguments to Hadoop for processing with MapReduce.
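The sketch below assembles a datasource definition from property names documented on this page and serializes it to JSON. The ID, pipeline name, and paths are hypothetical, and a real request may need fields that this page does not list.

```python
import json

# Hypothetical datasource definition; only keys documented on this page are used.
datasource = {
    "id": "hadoop-csv-logs",                # hypothetical ID
    "description": "Example Hadoop 2 datasource",
    "pipeline": "default",                  # hypothetical; must name an existing index pipeline
    "properties": {
        "hadoop_home": "/opt/hadoop",       # hypothetical path
        "hadoop_input": "/data/logs",       # hypothetical path
        "hadoop_mapper": "CSV",             # documented default
        "job_jar": "lucidworks-hadoop-job-2.2.7.jar",  # documented default
        "fusion_endpoints": ["http://localhost:8764"],  # documented default
        "reducers": 0,                      # documented default
    },
}

payload = json.dumps(datasource, indent=2)
print(payload)
```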

description - string

Optional description for this datasource.

id - string, required

Unique name for this datasource.

>= 1 characters

Match pattern: ^[a-zA-Z0-9_-]+$
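A candidate ID can be checked against the match pattern before it is submitted. This is a minimal sketch; the function name is hypothetical and the check is not taken from the connector's own code.

```python
import re

# Pattern copied from the reference above.
ID_PATTERN = re.compile(r"^[a-zA-Z0-9_-]+$")

def is_valid_datasource_id(candidate: str) -> bool:
    """Return True if the candidate has at least one character and matches the pattern."""
    # fullmatch also rejects a trailing newline, which "$" alone would tolerate.
    return ID_PATTERN.fullmatch(candidate) is not None

print(is_valid_datasource_id("hadoop_logs-2"))  # True
print(is_valid_datasource_id("hadoop logs"))    # False (contains a space)
```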

pipeline - string, required

Name of an existing index pipeline for processing documents.

>= 1 characters

properties - Properties

Datasource configuration properties

db - Connector DB

Type and properties for a ConnectorDB implementation to use with this datasource.

aliases - boolean

Keep track of original URIs that resolved to the current URI. This negatively impacts performance and size of DB.

Default: false

inlinks - boolean

Keep track of incoming links. This negatively impacts performance and size of DB.

Default: false

inv_aliases - boolean

Keep track of target URIs that the current URI resolves to. This negatively impacts performance and size of DB.

Default: false

type - string

Fully qualified class name of ConnectorDb implementation.

>= 1 characters

Default: com.lucidworks.connectors.db.impl.MapDbConnectorDb

fusion_batchsize - integer

Fusion Client Batch Size

>= 1

exclusiveMinimum: true

Default: 500

fusion_buffer_timeoutms - integer

Fusion Client Timeout (ms).

>= 1

exclusiveMinimum: true

Default: 1000

fusion_endpoints - array[string]

Default: "http://localhost:8764"

fusion_fail_on_error - boolean

Fusion Client Fail on Error

Default: false

fusion_login_app_name - string

Default: FusionClient

fusion_login_config - string

Path to the login configuration file for a Kerberized Fusion. The file must be present on every mapper and reducer node.

fusion_password - string

Password of the Fusion client user. Leave empty if Kerberos is used.

fusion_realm - string

Fusion's realm. If 'native' is selected, the password is mandatory. If 'kerberos' is selected, the login configuration is mandatory.

Default: NATIVE

Allowed values: NATIVE, KERBEROS

fusion_user - string

Fusion client's user name, or the principal if Kerberos is chosen.
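The dependency between the realm and the other authentication properties can be checked before a datasource is saved. The following is a sketch of that rule as described above; the function name and messages are hypothetical, and the connector's own validation may differ.

```python
def check_fusion_auth(props: dict) -> list:
    """Return a list of problems with the fusion_* authentication properties."""
    problems = []
    realm = props.get("fusion_realm", "NATIVE")  # documented default
    if realm not in ("NATIVE", "KERBEROS"):
        problems.append("fusion_realm must be NATIVE or KERBEROS")
    elif realm == "NATIVE" and not props.get("fusion_password"):
        problems.append("fusion_password is mandatory when fusion_realm is NATIVE")
    elif realm == "KERBEROS" and not props.get("fusion_login_config"):
        problems.append("fusion_login_config is mandatory when fusion_realm is KERBEROS")
    return problems

# Hypothetical property values for illustration.
print(check_fusion_auth({"fusion_realm": "KERBEROS", "fusion_user": "fusion@EXAMPLE.COM"}))
```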

hadoop_home - string

Path to the Hadoop home directory where $HADOOP_HOME/bin/hadoop can be found. The connector requires access to either a full Hadoop installation, or a Hadoop client provided by your Hadoop distribution that has been configured to access the Hadoop installation.

>= 1 characters

hadoop_input - string

Hadoop input source file/directory

>= 1 characters

hadoop_mapper - string

Hadoop Ingest Mapper

Default: CSV

Allowed values: CSV, DIRECTORY, GROK, REGEX, SEQUENCE_FILE, SOLR_XML, WARC, ZIP

initial_mapping - Initial field mapping

Provides mapping of fields before documents are sent to an index pipeline.

condition - string

Define a conditional script that must result in true or false. This can be used to determine if the stage should process or not.

label - string

A unique label for this stage.

<= 255 characters

mappings - array[object]

List of mapping rules

object attributes: {
operation : {
display name: Operation
type: string
}
source (required) : {
display name: Source Field
type: string
}
target : {
display name: Target Field
type: string
}
}

reservedFieldsMappingAllowed - boolean

Default: false

skip - boolean

Set to true to skip this stage.

Default: false

unmapped - Unmapped Fields

If fields do not match any of the field mapping rules, these rules will apply.

operation - string

The type of mapping to perform: move, copy, delete, add, set, or keep.

Default: copy

Allowed values: copy, move, delete, set, add, keep

source - string

The name of the field to be mapped.

target - string

The name of the field to be mapped to.
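An initial_mapping value can be sketched as a nested structure that uses the keys listed above. The field names are hypothetical, and the sketch assumes that rules under mappings accept the same operation names as the unmapped rule, which this page does not state explicitly.

```python
# Operation names documented for the unmapped rule.
ALLOWED_OPERATIONS = {"copy", "move", "delete", "set", "add", "keep"}

def make_mapping(source: str, operation: str, target: str = None) -> dict:
    """Build one mapping rule; source is required, target is optional."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unsupported operation: {operation}")
    rule = {"source": source, "operation": operation}
    if target is not None:
        rule["target"] = target
    return rule

initial_mapping = {
    "skip": False,
    "reservedFieldsMappingAllowed": False,
    "mappings": [
        make_mapping("title_raw", "move", "title"),  # hypothetical field names
        make_mapping("debug_info", "delete"),        # hypothetical field name
    ],
    "unmapped": {"operation": "copy"},               # documented default operation
}
```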

job_jar - string

Path and name of the Hadoop job jar. Unless you are using a custom job jar, the default provided by Fusion is preferred.

>= 1 characters

Default: lucidworks-hadoop-job-2.2.7.jar

kinit_cache - string

Full path to the Kerberos ticket cache. If this path does not exist, it will be created.

kinit_cmd - string

Full path to the 'kinit' binary.

Default: kinit

kinit_keytab - string

Full path to the Kerberos keytab file.

kinit_principal - string

Kerberos principal name, for example, username@YOUR-REALM.COM

mapper_args - array[object]

Parameters for the Hadoop job.

object attributes: {
arg_name : {
display name: name
type: string
}
arg_value : {
display name: value
type: string
}
}
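Because mapper_args is a list of name/value objects, a plain dictionary of job parameters can be converted into that shape. This is a minimal sketch; the helper and the parameter names are hypothetical and are not verified against the job jar.

```python
def to_mapper_args(params: dict) -> list:
    """Convert {name: value} pairs into a list of arg_name/arg_value objects."""
    return [
        {"arg_name": name, "arg_value": str(value)}  # both attributes are strings
        for name, value in params.items()
    ]

# Hypothetical parameter names, for illustration only.
mapper_args = to_mapper_args({"example_option": "abc", "example_limit": 10})
print(mapper_args)
```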

reducers - integer

(Expert) Depending on the OutputFormat and your system resources, you may want Hadoop to run a reduce step first so that it does not open too many connections to the output resource.

exclusiveMinimum: false

Default: 0

run_kinit - boolean

If your Hadoop installation requires job requests to authenticate with Kerberos, this option will allow Fusion to run 'kinit' to get a valid ticket.

Default: false
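The kinit_* properties correspond to arguments of the MIT Kerberos kinit command (-c for the cache, -k and -t for the keytab). The sketch below builds such a command line from the properties without running it. How Fusion actually invokes kinit is not documented here, so the argument order and the helper name are assumptions.

```python
def build_kinit_command(props: dict):
    """Return a kinit argument list, or None when run_kinit is disabled."""
    if not props.get("run_kinit", False):      # documented default: false
        return None
    cmd = [props.get("kinit_cmd", "kinit")]    # documented default: kinit
    cache = props.get("kinit_cache")
    if cache:
        cmd += ["-c", cache]                   # credentials cache location
    cmd += ["-k", "-t", props["kinit_keytab"]] # authenticate from the keytab
    cmd.append(props["kinit_principal"])
    return cmd

# Hypothetical paths and principal, for illustration only.
example = build_kinit_command({
    "run_kinit": True,
    "kinit_cache": "/tmp/krb5cc_fusion",
    "kinit_keytab": "/etc/security/fusion.keytab",
    "kinit_principal": "username@YOUR-REALM.COM",
})
print(example)
```

The resulting list could be passed to `subprocess.run` on a host where kinit is installed.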