Lucid.anda Connector Framework
Lucid.anda is a general framework for efficient traversal of data repositories with a rich set of configuration properties that allow fine-grained control of the kind, amount, and rate of data retrieval. Specific implementations have different configuration properties according to the repository type.
To see which properties are required or optional, query the REST API at api/connectors/plugins/lucid.anda/types/CONNECTOR_TYPE. For example, see the lucid.anda-web plugin properties:
https://FUSION_HOST:FUSION_PORT/api/connectors/plugins/lucid.anda/types/web
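As a rough sketch, the lookup URL can be assembled from the host, port, and connector type. The function name `pluginTypeUrl` and the sample host and port are hypothetical placeholders; only the path layout comes from the example above.

```javascript
// Build the plugin-properties URL for a given connector type.
// The path layout follows the example above; host and port are placeholders.
function pluginTypeUrl(host, port, connectorType) {
  const path = "/api/connectors/plugins/lucid.anda/types/" +
    encodeURIComponent(connectorType);
  return "https://" + host + ":" + port + path;
}

console.log(pluginTypeUrl("fusion.example.com", 8764, "web"));
```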
Basic configuration properties
The basic configuration properties limit the scope of the crawl.
The crawler fetches the contents of the specified startLink property and adds any links it finds. The connector adds nodes to a database known as crawldb to prevent re-processing. This database tracks indexed nodes as well as nodes found to be redirects, duplicates, or other aliases of another node.
Regular expressions can restrict the crawl by defining name patterns to include or exclude.
API Name / UI Label | Description |
---|---|
|
A list of URIs to use as the seed URIs for the crawl. Changing this field after crawling the content requires you to clear the crawldb. |
|
If true, diagnostic information is written to the
The default is false. |
|
If true, the default, the crawler restricts the crawl to only the tree of items below the provided startLinks. |
Changing this field after crawling the content requires you to clear the crawldb. optional |
The number of path levels to descend. The default, -1, indicates unlimited depth, and crawls all URIs that match other definitions of the crawl. Changing this field after crawling the content requires you to clear the crawldb. |
|
Defines the maximum number of items to retrieve during a crawl. This can be used to limit the crawl of a very large dataset to a smaller number of documents to gauge performance or to test pipeline settings. If this setting is modified mid-crawl (a crawl is started, stopped before it finishes, and then restarted), the original value is retained. If a crawl finishes and this property is then decreased, subsequent recrawls respect the new value, but the specific items retrieved are an unpredictable subset of the original document set. The default is -1, to retrieve all documents found that are allowed according to other property definitions. |
|
Defines a list of file extensions to include in the crawl. Changing this field after crawling the content requires you to clear the crawldb. |
|
Defines a list of regular expressions to include specific URIs or URI patterns in the crawl. Changing this field after crawling the content requires you to clear the crawldb. |
|
Defines a list of file extensions to exclude from the crawl. Only the extension is necessary with no additional characters, as in Changing this field after crawling the content requires you to clear the crawldb. |
|
Defines a list of regular expressions to exclude specific URIs or URI patterns from the crawl. Changing this field after crawling the content requires you to clear the crawldb. |
|
The number of items to batch for each round of fetching. The default is 50 items. |
|
The number of fetch threads. The default is 5 threads. |
|
The number of milliseconds to wait between document requests. This property can be used to throttle a crawl in cases where too frequent requests may cause performance issues in the crawled website and the site does not have a robots.txt file in place to control incoming requests from automated agents. The default is 0 milliseconds. |
|
The number of emit threads. The emitter is responsible for the output of documents from the crawler to Fusion. The default is 5 threads. |
|
If true, the default, documents are removed from the index if they are considered "defunct." There are two cases when a document is considered defunct:
|
|
The number of fetch failures before a document is removed from the index. The default is -1, which means documents that return errors on recrawl are never removed. If you would like documents removed after a specific number of failures, set this property to that threshold. |
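The way depth, extension, and regular-expression limits narrow a crawl can be sketched as a single predicate. This is an illustration only, not the connector's actual logic: the settings object, its property names, and the order in which the checks are applied are all assumptions.

```javascript
// Hypothetical crawl-scope settings; the property names here are
// illustrative and are not taken from the table above.
const scope = {
  maxDepth: -1,                  // -1 means unlimited depth
  includeExtensions: [],         // empty means "no extension filter"
  excludeExtensions: ["pdf", "zip"],
  includeRegexes: [],            // empty means "no include filter"
  excludeRegexes: ["/private/"],
};

// Return the lower-cased file extension of a URI path, or "" if none.
function extensionOf(uri) {
  const path = new URL(uri).pathname;
  const name = path.substring(path.lastIndexOf("/") + 1);
  const dot = name.lastIndexOf(".");
  return dot === -1 ? "" : name.substring(dot + 1).toLowerCase();
}

// Decide whether a discovered link falls inside the crawl scope.
function inScope(uri, depth, s) {
  if (s.maxDepth !== -1 && depth > s.maxDepth) return false;
  const ext = extensionOf(uri);
  if (s.excludeExtensions.includes(ext)) return false;
  if (s.includeExtensions.length > 0 && !s.includeExtensions.includes(ext)) {
    return false;
  }
  if (s.excludeRegexes.some((re) => new RegExp(re).test(uri))) return false;
  if (s.includeRegexes.length > 0 &&
      !s.includeRegexes.some((re) => new RegExp(re).test(uri))) {
    return false;
  }
  return true;
}

console.log(inScope("https://example.com/docs/a.html", 2, scope));
console.log(inScope("https://example.com/docs/a.pdf", 2, scope));
```

In this sketch an exclusion always wins over an inclusion; whether the connector resolves conflicts the same way is not stated in this document.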
Fetcher Configuration Properties
Fetcher configuration properties vary by plugin. Fetcher configuration properties are distinguished by prefix "f.", for example "f.maxSizeBytes".
API Name / UI Label | Description |
---|---|
|
The length of time to wait before timing out of connection requests, expressed in milliseconds. The default is 10000 milliseconds, or 10 seconds. |
|
Defines the maximum size of a document to crawl, expressed in bytes. Documents larger than this are dropped from the crawl. The default is 4 MB (4,194,304 bytes) per document. |
|
The location of the HTTP proxy, if any. The proxy address should be expressed in |
|
Boolean value, default is false. If true, this disables security checks against SSL/TLS certificate signers and origins by skipping the hostname-verification logic. This allows certificates signed by anyone, including self-signed certificates. Hostname-verification logic restricts access to only those certificates which are signed by certificate authorities and certificates in the keystore. |
|
The name of the file within the crawler-container directory that contains authentication credentials. This file is in JSON format and should be located in |
|
A list of URLs that are sitemaps. The URLs added with this property, and all URLs found in each sitemap, are added to the list of start links for the data source and crawled accordingly. A sitemap URL that is a sitemap index, or a sitemap that links to other sitemaps, is also supported. Each URL found in each linked sitemap is crawled in accordance with the other include or exclude rules of the crawl. If the data source should only contain a sitemap as the main start link, the sitemap URL should be provided to both the start link property and the sitemap property. Sitemaps are only treated as sitemaps when the URL is provided as part of this property. When using the REST API, the sitemaps should be provided as a list, such as: |
|
Boolean value, default is true. If true, the Allow, Disallow, and other directives found in robots.txt are respected. |
|
Boolean value, default is true. If true, crawl-delay directives found in robots.txt are respected. |
|
Boolean value, default is false. If true, a trailing slash ( |
|
Boolean value, default is true. If true, queries that are part of a link URL are discarded. |
|
Name of default character set. Default is UTF-8 |
|
Name of default MIME type. Default is application/octet-stream. |
|
Boolean value, default is false. If true, the web crawler respects <meta http-equiv="refresh" /> redirects embedded in the <head /> tag of the source HTML, for example:
|
|
The name to provide as the User-Agent name in HTTP request. The default is Lucidworks-Anda/1.0. |
|
An email address to pass with the user-agent information while crawling. The default is empty. |
|
A web address to use as an HTTP user-agent web address. The default is empty. |
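Because fetcher properties share the "f." prefix, they are easy to separate from the rest of a data source configuration. The sketch below assumes the configuration is available as a flat JavaScript object; the helper name `splitFetcherProperties` and the sample values are hypothetical.

```javascript
// Split a flat property map into fetcher ("f."-prefixed) and other properties.
function splitFetcherProperties(properties) {
  const fetcher = {};
  const other = {};
  for (const [name, value] of Object.entries(properties)) {
    if (name.startsWith("f.")) {
      fetcher[name] = value;
    } else {
      other[name] = value;
    }
  }
  return { fetcher, other };
}

// Sample values are placeholders, not recommended settings.
const sampleProperties = { "f.maxSizeBytes": 1048576, refreshAll: false };
console.log(splitFetcherProperties(sampleProperties));
```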
Content Filtering and Selection Configuration Properties
These properties are only used by the web plugin. Like the fetcher property names, they have the prefix "f.".
API Name / UI Label | Description |
---|---|
|
A list of HTML root elements whose child-elements are used to extract the website content. The default list includes body and head. |
|
If true, content is checked for links before it is filtered of other elements in accordance with other include/exclude rules. The default is false, which means links are extracted after other elements have been filtered. |
|
A list of HTML tag names for elements to include with the crawled documents. The default is empty, which means all tags are included. This property may be best used when there is a small list of known tags you know you want to include but also want to exclude all other tags. |
|
A list of HTML tag classes of elements to include in the crawled content. |
|
A list of the HTML tag IDs of elements to include in the crawled content. |
|
A list of Jsoup selectors for elements to include in the crawled content. Jsoup allows using a CSS-like query syntax to find matching elements. For more information on Jsoup selectors, see the Jsoup Cookbook section on Jsoup selector syntax. |
|
A list of HTML tag names for elements to exclude from the crawled documents. |
|
A list of HTML tag classes of elements to exclude from the crawled content. |
|
A list of the HTML tag IDs of elements to exclude from the crawl. |
|
A list of Jsoup selectors for elements to exclude from the crawled content. For more information on Jsoup selectors, see the Jsoup Cookbook section on Jsoup selector syntax. |
|
A list of HTML tag names for elements that are added to their own fields. The new field will have the same name as the tag defined. |
|
A list of HTML tag IDs for elements that are added to their own fields. The new field will have the same name as the tag ID defined. |
|
A list of HTML tag classes for elements that are added to their own fields. The new field will have the same name as the tag class defined. |
|
A list of selectors in Jsoup format to put content into its own field. This property allows you to extract HTML tag elements and put them in their own field. Such as, For more information on Jsoup selectors, see the Jsoup Cookbook section on Jsoup selector syntax. This property was formerly named f.fieldSelectors. |
Refresh Policy Configuration Properties
Refresh policies control which items are recrawled, so they only matter on crawls after the first complete crawl. The default refresh policy is to recrawl all items: the refreshAll property is true by default, so the first step in configuring a refresh policy is to set refreshAll to false.
There are five types of refresh policies: "refreshStartLinks", "refreshErrors", "refreshOlderThan", "refreshIdPrefixes", "refreshIDRegexes".
This is scriptable via a JavaScript function supplied as property "refreshScript", for example:
function shouldRefresh(id, depth, lastModified, lastFetched, lastEmitted, error) {
    if (null !== error) {
        if (null !== error.getCause()) {
            if (-1 !== error.getCause().getMessage().indexOf("503")) {
                return true;
            }
        }
    }
    return false;
}
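To try a refresh script outside Fusion, the error argument can be replaced by a stand-in object. The sketch below assumes only what the script itself uses: an object whose `getCause()` returns either null or something with a `getMessage()` method. The `fakeError` helper, the sample ID, and the sample messages are hypothetical.

```javascript
// The refresh script from above, unchanged in behavior.
function shouldRefresh(id, depth, lastModified, lastFetched, lastEmitted, error) {
  if (null !== error) {
    if (null !== error.getCause()) {
      if (-1 !== error.getCause().getMessage().indexOf("503")) {
        return true;
      }
    }
  }
  return false;
}

// Hypothetical stand-in for the error object passed by the crawler.
function fakeError(message) {
  if (message === null) {
    return { getCause: () => null };
  }
  return { getCause: () => ({ getMessage: () => message }) };
}

const sampleId = "https://example.com/page";
console.log(shouldRefresh(sampleId, 1, 0, 0, 0, fakeError("HTTP 503 Service Unavailable"))); // true
console.log(shouldRefresh(sampleId, 1, 0, 0, 0, fakeError("HTTP 404 Not Found")));           // false
console.log(shouldRefresh(sampleId, 1, 0, 0, 0, null));                                      // false
```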
API Name / UI Label | Description |
---|---|
|
Boolean value, default is true. If true, recrawl all items. |
|
Refresh all items specified in property "startLinks". |
|
Refresh all items that failed in any way last time |
|
Refresh all items whose last-fetched date is older than this property’s value, in seconds. For example, use 86400 to refresh all items that have not been fetched in one day or more. |
|
An array of strings of prefixes. Refresh all items whose ID begins with any of these prefixes, for example "https://lucidworks.com/product/" to only refresh product pages in a crawl of a website. |
|
An array of strings of regexes. Refresh all items which match any regex, for example, ".*/product/.*\.html" to only refresh HTML pages found under any "/product" path. |
|
A script property that allows users to define a |
|
Boolean value, default is false. If true, recrawl all items, even if they have not changed since last crawl. If you make a change to your pipeline or schema that will lead to analyzing/indexing the text differently, you would want to recrawl all items. forceRefresh is different from clearing the data source because it allows you to clear the last-modified date and ETag while retaining its history. |
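The refresh policies can be pictured as a set of independent tests on each item, any one of which marks the item for recrawl. The sketch below is an illustration under several assumptions: the item record and its field names are hypothetical, the policies are combined with OR, a negative refreshOlderThan is treated as "disabled", and regexes are matched anywhere in the ID. None of these details is confirmed by this document.

```javascript
// Decide whether an item should be recrawled under a set of refresh policies.
// "item" is a hypothetical record: { id, isStartLink, failed, lastFetchedMs }.
function matchesRefreshPolicy(item, policy, nowMs) {
  if (policy.refreshAll) return true;
  if (policy.refreshStartLinks && item.isStartLink) return true;
  if (policy.refreshErrors && item.failed) return true;
  // refreshOlderThan is expressed in seconds; negative means disabled here.
  if (policy.refreshOlderThan >= 0 &&
      nowMs - item.lastFetchedMs > policy.refreshOlderThan * 1000) {
    return true;
  }
  if (policy.refreshIdPrefixes.some((p) => item.id.startsWith(p))) return true;
  if (policy.refreshIDRegexes.some((re) => new RegExp(re).test(item.id))) return true;
  return false;
}

const samplePolicy = {
  refreshAll: false,
  refreshStartLinks: false,
  refreshErrors: true,
  refreshOlderThan: 86400, // one day
  refreshIdPrefixes: ["https://lucidworks.com/product/"],
  refreshIDRegexes: [],
};
const nowMs = Date.now();
const recentBlogPage = {
  id: "https://lucidworks.com/blog/post",
  isStartLink: false,
  failed: false,
  lastFetchedMs: nowMs - 3600 * 1000, // fetched one hour ago
};
console.log(matchesRefreshPolicy(recentBlogPage, samplePolicy, nowMs));
```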
Dedupe Configuration Properties
Fusion can be configured to deduplicate documents based on:
- the entire contents of the document
- the contents of a specified field
- custom deduplication based on a document signature generated by a user-supplied JavaScript function genSignature(), which returns a string. The Fusion UI Admin tool provides a JavaScript-aware input box so that you can create and edit this function directly in Fusion.
Dedupe works by maintaining a signature for each document and ensuring that exactly one document appears in Solr for each signature. It does this by designating the first document it encounters with a particular signature as the "canonical" document. All subsequent documents with that signature are designated as "aliases."
It keeps track of the current canonical document for a particular signature across crawls, and when a document signature changes, it maintains its guarantee that exactly one document with each signature shows up in Solr.
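The canonical/alias bookkeeping described above can be sketched with two maps keyed by signature. This is a minimal illustration, not Fusion's implementation: the class name `DedupeIndex` and its method are hypothetical, and the sketch does not handle a document whose signature changes between crawls.

```javascript
// Track one canonical document per signature; later documents become aliases.
class DedupeIndex {
  constructor() {
    this.canonicalBySignature = new Map(); // signature -> canonical doc ID
    this.aliasesBySignature = new Map();   // signature -> array of alias IDs
  }

  // Record a document and return "canonical" or "alias" for it.
  add(docId, signature) {
    if (!this.canonicalBySignature.has(signature)) {
      // First document seen with this signature becomes the canonical one.
      this.canonicalBySignature.set(signature, docId);
      this.aliasesBySignature.set(signature, []);
      return "canonical";
    }
    if (this.canonicalBySignature.get(signature) === docId) {
      return "canonical";
    }
    const aliases = this.aliasesBySignature.get(signature);
    if (!aliases.includes(docId)) {
      aliases.push(docId);
    }
    return "alias";
  }
}

const dedupeIndex = new DedupeIndex();
console.log(dedupeIndex.add("https://example.com/a", "sig-1")); // canonical
console.log(dedupeIndex.add("https://example.com/b", "sig-1")); // alias
```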
In the case where custom deduplication is done using either a field or a custom signature, you must specify the field or the JavaScript function, accordingly. The signature string is stored in the dedupeSignature_s field.
If the property "dedupe" (UI control checkbox "Dedupe on Content") is true but neither a field nor a JavaScript function is specified, the raw contents of the document are used for deduplication. No deduplication signature is generated, so the resulting document does not have a dedupeSignature_s field.
Here is an example of a genSignature() function:
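function genSignature(content) {
    // Build the signature from the sorted values of the "h2" field.
    var signature = "";
    if (content.hasField("h2")) {
        var values = content.getStrings("h2").toArray();
        values.sort();
        // "for each ... in" is a non-standard extension found in Java-hosted engines such as Rhino and Nashorn.
        for each (var value in values) {
            signature += value;
        }
    }
    // Return null when no signature could be generated.
    return signature.length > 0 ? signature : null;
}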
This example finds duplicates based on the h2 fields in each document. This script assumes that the h2 headers in the documents have been pulled into a field with the f.fieldSelectors property. The entire content object is available here, so implementations of this function can dedupe on any combination of fields. The genSignature() function should return null when the fields needed to generate a signature are not present.
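For experimenting outside Fusion, the same idea can be written in standard JavaScript and exercised with a stand-in content object. The `fakeContent` helper is hypothetical and exposes only the two methods the script uses, `hasField()` and `getStrings()`; the real content object passed by Fusion may behave differently.

```javascript
// Standard-JavaScript variant of the example, for running outside Fusion.
function genSignature(content) {
  if (!content.hasField("h2")) {
    return null;
  }
  // Copy and sort so the signature ignores the order of the headers.
  const values = Array.from(content.getStrings("h2")).sort();
  const signature = values.join("");
  return signature.length > 0 ? signature : null;
}

// Hypothetical stand-in exposing only hasField() and getStrings().
function fakeContent(fields) {
  return {
    hasField: (name) => Object.prototype.hasOwnProperty.call(fields, name),
    getStrings: (name) => fields[name] || [],
  };
}

const first = fakeContent({ h2: ["Pricing", "Features"] });
const second = fakeContent({ h2: ["Features", "Pricing"] });
console.log(genSignature(first) === genSignature(second)); // true
console.log(genSignature(fakeContent({})));                // null
```

Sorting before joining means two pages with the same headers in a different order produce the same signature, which mirrors the values.sort() call in the original script.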
API Name / UI Label | Description |
---|---|
|
Boolean value, default is false. If true, the crawler will try to de-duplicate content. This can be done with an analysis of the raw content of the document, or based on content in a specific named field (dedupeField) or with JavaScript (dedupeScript). If a document is identified as a duplicate of another, the URI for the duplicate document is entered into the crawl database as an alias. |
|
Boolean value, default is false. If true, the deduplication signature string is saved as part of the Solr document in the field |
|
A field to use in de-duplication. If no field is defined, and no JavaScript is defined with dedupeScript, the item’s full raw-content is used by default. |
|
Specifies a JavaScript to perform custom de-duplication. The JavaScript should contain a |
Splitter Configuration Properties
These properties determine how to process .csv and .tsv files.
API Name / UI Label | Description |
---|---|
|
If true, the default, CSV or TSV files are split. This means documents are created for the unique rows found in the CSV file. |
|
The format of the CSV file. The options are default, rfc, excel, or mysql. The default is default. |
|
If true, the first row of the CSV file is parsed as a header and its values are treated as column names, which become the field names for the values in each document. The default is false, which means that columns are given numeric field names, starting with "0". |
|
If true, the default, .zip, .tar, .tar.gz, .tgz, .jar, .bzip, .bzip2, .cpio, and .dump files are opened and documents found within the archive are added to the index as individual documents. When archives are split, they are split recursively, meaning that multiple embedded archives will each be opened and indexed (for example, if a .tar file contains a .zip file which contains a .csv file, the .csv file is indexed and split into multiple documents according to the CSV-related properties). Note that .7z files are not supported at the current time. |
|
Specify a column-delimiter character. |
|
Specify the character used to indicate a comment row. |
|
Specify the character set. |
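The effect of the header setting on field names can be illustrated with a small row splitter. This is a deliberately naive sketch: it does not handle quoted fields, comment rows, or duplicate-row detection, and the function name `splitCsv` and its option names are hypothetical rather than connector properties.

```javascript
// Turn CSV text into one document (plain object) per row.
// Naive split: quoted fields and embedded delimiters are not handled.
function splitCsv(text, { hasHeader = false, delimiter = "," } = {}) {
  const rows = text
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .map((line) => line.split(delimiter));
  if (rows.length === 0) return [];
  // With a header, the first row supplies field names; otherwise use "0", "1", ...
  const names = hasHeader ? rows[0] : rows[0].map((_, i) => String(i));
  const dataRows = hasHeader ? rows.slice(1) : rows;
  return dataRows.map((row) => {
    const doc = {};
    row.forEach((value, i) => {
      const name = i < names.length ? names[i] : String(i);
      doc[name] = value;
    });
    return doc;
  });
}

const csvText = "id,name\n1,Ada\n2,Bob";
console.log(splitCsv(csvText, { hasHeader: true }));
console.log(splitCsv(csvText));
```

With the header option on, the two data rows become documents with fields "id" and "name"; with it off, all three rows become documents with fields "0" and "1".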
Other Configuration Properties
API Name / UI Label | Description |
---|---|
|
The default value is "in-memory". The other legal value is "on-disk". Crawl-database type "in-memory" uses a RAMStore-based crawldb during the crawl. At the end of the crawl, it writes the crawldb to disk as a binary compressed file whose filename contains a timestamp showing crawl completion time, so the filename is: "crawldb.<timestamp>.bin.gz". This file is written to directory: Crawl database "on-disk" persists the data to disk throughout the crawl, resulting in files named "data" and "data.p" written to the above directory throughout the crawl. |
|
The number of crawls after which an alias will expire. The default is 1 crawl. |
|
Default value is true. When true, the entire set of links that every single item links to is retained and stored in the crawldb. Setting this property to false will lead to smaller crawldbs persisted on disk (in the case of both crawlDBType=in-memory and crawlDBType=on-disk), and in the case of crawlDBType=in-memory, less memory is consumed during the crawl itself too. crawlDBType=in-memory means that the crawldb lives in memory for the entire crawl and is only persisted to disk at the end, so not retaining the entire set of links for every item saves a lot of RAM. This property will make a big difference in memory and disk consumption for web crawls, where the vast majority of space occupied by each item in the crawldb is taken up by its links, usually. The crawldb shrunk by a factor of 10:1 with retainOutlinks=false for some web crawls. It will make a minimal difference in filesystem crawls, where only directories have any links at all. |
|
Default value is false. If true, on startup, Anda will check the crawlDb and remove all illegal links from it. Used when link-legality rules have been changed, to cull the set of links stored in the crawlDb. |
|
Default value is true. If true, a first-time crawl fails as soon as a missing start link is detected. Given a set of start links, each of which leads to swaths of pages, it is difficult to figure out after the fact why many pages are missing. For a first-time crawl, it is reasonable to expect that all start links are valid; therefore, this property is true by default. |
|
Specifies a JavaScript to perform link rewriting. Changing this field after crawling the content requires you to clear the crawldb. |
|
If true, this will allow links from any sub-domain of a URI in the startURIs list to pass link-legality checks. The default is false. Changing this field after crawling the content requires you to clear the crawldb. |
|
If true, the paths provided in URIs within the startLinks list are used as part of link-legality checks. The default is false. Use this if you only want pages under the defined path(s) to be crawled instead of all documents found in the http://host.domain tree. For example, if you define "http://www.cnn.com/US/" as your startLink and only want to crawl URLs that start with that string, choose this option. Changing this field after crawling the content requires you to clear the crawldb. |
|
Defines a list of host prefixes to ignore when evaluating the list of legal links. For example, adding Changing this field after crawling the content requires you to clear the crawldb. |
|
A list of URI schemes that are considered legal URIs for the crawl. This is expressed as a list in the REST API. The default is a list containing only |
|
If true, the default, when a batch emit fails, documents are tried one-by-one. |
|
If true, existing crawl database entries are evaluated for legality at the start of the crawl. This allows for changing link legality rules (legalURISchemes) between crawls and then purging the crawl database of newly prohibited items. The default is false. |
|
The name of the document collection that documents are indexed into. |
|
A JSON map that applies a set of field mappings specific to a data source which is applied before documents are sent to the index pipeline. The index pipeline may also include an additional field mapping stage. This could be useful if a single field mapping stage is used with multiple data sources; in this case, the initial_mapping property could be used to prepare incoming documents for the index pipeline stage. When using the API, the JSON map should look the same as a field-mapping index stage, such as:
The crawler provides a default initial mapping for |
|
Allows overriding the default ConnectorDb implementation. If it is not defined, the default is used, which is defined in |
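The link-legality properties in the table above (legal URI schemes, sub-domain handling, and path restriction) can be pictured as successive checks on each discovered link. The sketch below is an assumption-laden illustration: the options object and its names are hypothetical, and the exact rules the connector uses to decide what counts as a sub-domain or a matching path are not stated in this document.

```javascript
// Check a discovered link against a start link and hypothetical legality options.
function isLegalLink(link, startLink, opts) {
  const target = new URL(link);
  const start = new URL(startLink);

  // Scheme check, for example "http" or "https".
  const scheme = target.protocol.replace(/:$/, "");
  if (!opts.legalSchemes.includes(scheme)) return false;

  // Host check: same host, or (optionally) a sub-domain of the start host.
  const sameHost = target.hostname === start.hostname;
  const subDomain = target.hostname.endsWith("." + start.hostname);
  if (!sameHost && !(opts.allowSubDomains && subDomain)) return false;

  // Optional path restriction: the link must sit under the start link's path.
  if (opts.restrictToPath && !target.pathname.startsWith(start.pathname)) {
    return false;
  }
  return true;
}

const legalityOptions = {
  legalSchemes: ["http", "https"],
  allowSubDomains: false,
  restrictToPath: true,
};
console.log(isLegalLink("http://www.cnn.com/US/politics", "http://www.cnn.com/US/", legalityOptions));
console.log(isLegalLink("http://www.cnn.com/world/", "http://www.cnn.com/US/", legalityOptions));
```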
Querying a crawldb Solr index
To see all errors with the exception that caused the error:
/solr/crawldb_mywebcrawl/select?q=map_s:ERRORS_MAP
To see all deleted items with any exception that led to deleting them:
/solr/crawldb_mywebcrawl/select?q=map_s:DELETED_MAP
To see all items discovered via links on a particular page:
/solr/crawldb_mywebcrawl/select?q=parentID_s:<some ID>
To see all aliases of a particular page:
/solr/crawldb_mywebcrawl/select?q=id:INVERSE_ALIAS_MAP|<some ID>
Find all pages fetched in the last 24 hours:
/solr/crawldb_mywebcrawl/select?q=fetchedDate_tdt:[NOW-24HOURS TO NOW]
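Several of these queries contain characters such as spaces, brackets, and "|" that need URL encoding when sent from a script. A minimal sketch of a helper that does this, assuming the collection name from the examples above; the function name `crawldbQueryUrl` is hypothetical, and the host and port of the Solr server still need to be prepended.

```javascript
// Build a select URL path for a crawldb collection with the query URL-encoded.
function crawldbQueryUrl(collection, query) {
  return "/solr/" + encodeURIComponent(collection) +
    "/select?q=" + encodeURIComponent(query);
}

console.log(crawldbQueryUrl("crawldb_mywebcrawl", "map_s:ERRORS_MAP"));
console.log(crawldbQueryUrl("crawldb_mywebcrawl", "fetchedDate_tdt:[NOW-24HOURS TO NOW]"));
```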