Connector Datasources API
Use the Connector Datasources API to create and configure datasources and examine or clear items and tables from the crawl database.
See Connectors and Indexing your Data for related information.
Working with the crawl database
Some connectors use a crawl database to track documents seen by prior crawls. They use this information to determine which documents are new, updated, or removed, and then take the appropriate action in the index. The /api/connectors/datasources/DATASOURCE_ID endpoints allow looking into the crawl database and dropping tables or clearing the database.
The connectors that currently support the crawl database are lucid.fs and lucid.solrxml. The lucid.anda connector also uses a crawl database, but it is a different database and has no REST API or other interface for accessing it.
Examining a crawlDB
The output from a GET request to /api/connectors/datasources/DATASOURCE_ID will include several sections detailing the database structure:
- counters: The counters section reports the document counts of database activities, such as table inserts.
- ops: The ops section reports on database operations that have occurred, such as initiating tables, retrieving items, processing items, and table drops.
- tables: The tables section lists the tables in the database with a count of the number of items in each table. Inspecting the items is described in the next section.
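As a rough sketch, the crawlDB summary could be fetched with a small shell helper. The /db suffix is an assumption taken from the DELETE examples elsewhere on this page, and the function names and the FUSION_BASE, FUSION_USER, and FUSION_PASS variables are hypothetical, not part of the API.

```shell
# Hypothetical helper: build the crawlDB URL for a datasource ID.
# Assumes the crawlDB lives under /db, as in the DELETE examples on this page.
crawldb_url() {
  printf '%s/api/connectors/datasources/%s/db\n' \
    "${FUSION_BASE:-https://FUSION_HOST:FUSION_PORT}" "$1"
}

# Hypothetical helper: request the crawlDB summary, which is expected to
# contain the counters, ops, and tables sections described above.
crawldb_get() {
  curl -s -u "${FUSION_USER}:${FUSION_PASS}" "$(crawldb_url "$1")"
}
```

For example, `crawldb_get SolrXML` would request the summary for the datasource with the ID SolrXML.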
Drop tables from a crawlDB
A DELETE request to /api/connectors/datasources/DATASOURCE_ID/db/<table> returns an empty response. Dropping the database does not remove any documents from the index. However, the crawl database will be empty, so on the next datasource run all documents are treated as though the connectors had never seen them.
Likewise, dropping the items table does not delete documents from the index; it changes the database so that the database considers them new documents. Dropping other tables, such as the errors table, merely clears out old error messages.
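A minimal sketch of dropping a single table, assuming the endpoint shape given above. The function name and the DRY_RUN, FUSION_BASE, FUSION_USER, and FUSION_PASS variables are hypothetical conveniences; the dry-run mode only prints the request so it can be reviewed before anything is deleted.

```shell
# Hypothetical wrapper: drop one crawlDB table, for example "errors" or "items".
# Set DRY_RUN=1 to print the command instead of sending it.
drop_crawldb_table() {
  datasource_id="$1"
  table="$2"
  url="${FUSION_BASE:-https://FUSION_HOST:FUSION_PORT}/api/connectors/datasources/${datasource_id}/db/${table}"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "curl -u USERNAME:PASSWORD -X DELETE ${url}"
  else
    # An empty response body is the expected result.
    curl -s -u "${FUSION_USER}:${FUSION_PASS}" -X DELETE "${url}"
  fi
}
```

Running `DRY_RUN=1 drop_crawldb_table SolrXML errors` prints the DELETE request for the errors table without contacting the server.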
Clear or delete items from a crawlDB
A CLEAR request to /api/connectors/datasources/DATASOURCE_ID/db/items/<item> removes the information from the Solr Index only. This option is useful when Solr data gets out of sync with the Crawl Database.
A DELETE request removes the information from the Crawl Database only. Note that this does not affect the Solr Index.
Examples
Get the datasources assigned to the demo collection:
REQUEST
curl -u USERNAME:PASSWORD 'https://FUSION_HOST:FUSION_PORT/api/connectors/datasources?collection=demo'
RESPONSE
[ {
"id" : "database",
"created" : "2014-05-04T19:47:22.867Z",
"modified" : "2014-05-04T19:47:22.867Z",
"connector" : "lucid.jdbc",
"type" : "jdbc",
"description" : null,
"pipeline" : "conn_solr",
"properties" : {
"db" : null,
"commit_on_finish" : true,
"verify_access" : true,
"sql_select_statement" : "select CONTENTID as id from CONTENT;",
"debug" : false,
"collection" : "demo",
"password" : "password",
"url" : "jdbc:postgresql://FUSION_HOST:5432/db",
"nested_queries" : null,
"clean_in_full_import_mode" : true,
"username" : "user",
"delta_sql_query" : null,
"commit_within" : 900000,
"primary_key" : null,
"driver" : "org.postgresql.Driver",
"max_docs" : -1
}
} ]
Create a datasource that indexes Solr-formatted XML files:
REQUEST
curl -u USERNAME:PASSWORD -X POST -H 'Content-type: application/json' -d '{
"id":"SolrXML",
"connector":"lucid.solrxml",
"type":"solrxml",
"properties":{
"path":"/Applications/solr-4.10.2/example/exampledocs",
"generate_unique_key":false,
"collection":"MyCollection"
}
}' https://FUSION_HOST:FUSION_PORT/api/connectors/datasources
RESPONSE
{
"id" : "SolrXML",
"created" : "2015-05-18T15:47:51.199Z",
"modified" : "2015-05-18T15:47:51.199Z",
"connector" : "lucid.solrxml",
"type" : "solrxml",
"properties" : {
"commit_on_finish" : true,
"verify_access" : true,
"generate_unique_key" : false,
"collection" : "MyCollection",
"include_datasource_metadata" : true,
"include_paths" : [ ".*\\.xml" ],
"initial_mapping" : {
"id" : "a35c9ff3-dbb6-434b-af40-597722c2986a",
"skip" : false,
"label" : "field-mapping",
"type" : "field-mapping"
},
"path" : "/Applications/apache-repos/solr-4.10.2/example/exampledocs",
"exclude_paths" : [ ],
"url" : "file:/Applications/apache-repos/solr-4.10.2/example/exampledocs/",
"max_docs" : -1
}
}
Change the max_docs value for the above datasource:
REQUEST
curl -u USERNAME:PASSWORD -X PUT -H 'Content-type: application/json' -d '{
"id":"SolrXML",
"connector":"lucid.solrxml",
"type":"solrxml",
"properties":{
"path":"/Applications/solr-4.10.2/example/exampledocs",
"max_docs":10
}
}' https://FUSION_HOST:FUSION_PORT/api/connectors/datasources/SolrXML
RESPONSE
true
Delete the datasource named database:
REQUEST
curl -u USERNAME:PASSWORD -X DELETE https://FUSION_HOST:FUSION_PORT/api/connectors/datasources/database
RESPONSE
If successful, no response.
To fully clear a datasource's data, use both the Solr update API and the crawlDB API:
REQUEST
curl -X POST 'http://FUSION_HOST:FUSION_PORT/api/solr/COLLECTION_NAME/update?commit=true' -H 'Content-Type: application/json' --data-binary '{"delete":{"query":"_lw_data_source_s:DATASOURCENAME"}}'
curl -X DELETE -u USERNAME:PASSWORD 'http://FUSION_HOST:FUSION_PORT/api/connectors/datasources/DATASOURCENAME/db'
The first command deletes the datasource's documents from the index but does not clear the crawlDB. If you then attempt to index the same document set again, indexing skips the documents because they are still in the crawlDB. Sending the second command afterward deletes the crawlDB, so you can then reload the documents.
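The two-step sequence can be sketched as one shell function that stops if the first request fails. The function name and the DRY_RUN, FUSION_BASE, FUSION_USER, and FUSION_PASS variables are hypothetical. Sending credentials on the Solr request is also an assumption, since the example above omits them.

```shell
# Hypothetical helper: delete a datasource's documents from the index, then
# empty its crawlDB so the next run treats every document as new.
# Set DRY_RUN=1 to print the two requests instead of sending them.
full_clear() {
  collection="$1"
  datasource="$2"
  base="${FUSION_BASE:-http://FUSION_HOST:FUSION_PORT}"
  solr_url="${base}/api/solr/${collection}/update?commit=true"
  crawldb_url="${base}/api/connectors/datasources/${datasource}/db"
  payload="{\"delete\":{\"query\":\"_lw_data_source_s:${datasource}\"}}"

  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "POST ${solr_url} ${payload}"
    echo "DELETE ${crawldb_url}"
    return 0
  fi

  # Step 1: delete the indexed documents; stop here if the request fails.
  curl -sf -u "${FUSION_USER}:${FUSION_PASS}" -X POST "${solr_url}" \
    -H 'Content-Type: application/json' --data-binary "${payload}" || return 1
  # Step 2: clear the crawlDB so the documents can be indexed again.
  curl -sf -u "${FUSION_USER}:${FUSION_PASS}" -X DELETE "${crawldb_url}"
}
```

Calling `DRY_RUN=1 full_clear demo SolrXML` shows both requests in the order they would be sent.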