Crawl REST APIs using the Web connector
This article describes how to crawl REST API endpoints in Fusion using the Web connector. For more information about available Web connectors, see Fusion connectors.
Before crawling a REST API endpoint, the following prerequisites must be met:
-
All endpoints are available using bulk start links or a sitemap
-
The response data is in a parseable format (JSON, XML, etc.)
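Before configuring the datasource, it can help to confirm that a sample response body actually parses. The following is a minimal sketch, not part of Fusion: `detect_format` is a hypothetical helper that only tries JSON and XML using the Python standard library, and the sample bodies are made up for illustration.

```python
import json
import xml.etree.ElementTree as ET


def detect_format(body):
    """Return "json" or "xml" if the body parses as that format, else None."""
    try:
        json.loads(body)
        return "json"
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        pass
    try:
        ET.fromstring(body)
        return "xml"
    except ET.ParseError:
        return None


if __name__ == "__main__":
    # Hypothetical sample bodies; in practice, paste a real endpoint response.
    print(detect_format('{"Response": "True"}'))
    print(detect_format("<result><ok>true</ok></result>"))
    print(detect_format("plain text that is neither"))
```

A return value of `None` suggests the endpoint needs a different parser or is returning an error page rather than data.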
Options
Using bulk start links
If you have a small number of endpoints you want to crawl, enter each endpoint as a bulk start link.
To crawl the API endpoints using bulk start links:
-
Add a new Web connector datasource. To learn how to configure a new datasource, see Configure a New Datasource.
-
Under Start links, enter the main domain that hosts the API endpoints. For example, http://www.restapiendpoint.com.
-
In the Link discovery section under Bulk Start Links, enter the URLs you want to crawl. Separate links with a new line. For example:
http://www.restapiendpoint.com/?apikey=user-token&s=dark%20knight&type=movie&page=1
http://www.restapiendpoint.com/?apikey=user-token&s=superman&type=movie&page=1
http://www.restapiendpoint.com/?apikey=user-token&s=superman&type=movie&page=2
-
Save and run the job.
-
Once complete, check the results in the Index Workbench.
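When the endpoints differ only in their query parameters, the newline-separated list for the Bulk Start Links field can be generated rather than typed by hand. The sketch below assumes the example domain and the `apikey`, `s`, `type`, and `page` parameters shown in this article; `build_start_links` is a hypothetical helper name, not a Fusion API.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://www.restapiendpoint.com/"  # example domain from this article


def build_start_links(api_key, searches, media_type="movie"):
    """Return one endpoint URL per (search term, page) pair.

    `searches` maps each search term to the number of result pages to crawl.
    """
    links = []
    for term, pages in searches.items():
        for page in range(1, pages + 1):
            params = {"apikey": api_key, "s": term, "type": media_type, "page": page}
            # quote_via=quote encodes spaces as %20 instead of "+"
            links.append(BASE_URL + "?" + urlencode(params, quote_via=quote))
    return links


if __name__ == "__main__":
    # One URL per line, ready to paste into the Bulk Start Links field.
    print("\n".join(build_start_links("user-token", {"dark knight": 1, "superman": 2})))
```

The output is plain text with one URL per line, which matches the separator the Bulk Start Links field expects.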
Using a sitemap
If you have a large number of endpoints you want to crawl, use a sitemap containing the API endpoint locations. This is also helpful if someone without access to Fusion maintains the list of endpoint URLs. An example sitemap:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=dark%20knight&amp;type=movie&amp;page=1</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=superman&amp;type=movie&amp;page=1</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=superman&amp;type=movie&amp;page=2</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=superman&amp;type=movie&amp;page=3</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.restapiendpoint.com/?apikey=user-token&amp;s=batman&amp;type=movie&amp;page=1</loc>
    <lastmod>2021-07-21</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
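A sitemap with many endpoint entries is easier to maintain when it is generated from a list of URLs. The following sketch uses Python's standard `xml.etree.ElementTree` module, which escapes each `&` in a URL as `&amp;` when serializing, as the sitemap protocol requires. The `build_sitemap` helper and its default `changefreq` and `priority` values are assumptions chosen to mirror the example above.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, lastmod, changefreq="daily", priority="0.8"):
    """Return a sitemap XML string with one <url> entry per endpoint URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        # The serializer escapes "&" as "&amp;" in element text.
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "lastmod").text = lastmod
        ET.SubElement(entry, "changefreq").text = changefreq
        ET.SubElement(entry, "priority").text = priority
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    endpoints = [
        "http://www.restapiendpoint.com/?apikey=user-token&s=superman&type=movie&page=1",
        "http://www.restapiendpoint.com/?apikey=user-token&s=batman&type=movie&page=1",
    ]
    print(build_sitemap(endpoints, lastmod="2021-07-21"))
```

The resulting string can be saved as sitemap.xml and hosted at the location you enter under Sitemap URLs.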
To crawl the API endpoints using the sitemap:
-
Add a new Web connector datasource. To learn how to configure a new datasource, see Configure a New Datasource.
-
Under Start links, enter the main domain that contains the sitemap. For example, http://www.restapiendpoint.com.
-
In the Link discovery section under Sitemap URLs, click the Add button.
-
Enter the URL of the sitemap. For example, http://www.restapiendpoint.com/sitemap.xml.
-
Save and run the job.
-
Once complete, check the results in the Index Workbench.
Results
Both options above achieve the same result. Fusion indexes the JSON response provided at the endpoints. If the response contains an array of JSON objects, Fusion indexes each object as an individual document.
For example, Fusion creates three documents from the JSON response below:
{
"Search": [{
"Title": "Batman v Superman: Dawn of Justice",
"Year": "2016",
"imdbID": "tt2975590",
"Type": "movie",
"Poster": "https://m.media-amazon.com/images/M/MV5BYThjYzcyYzItNTVjNy00NDk0LTgwMWQtYjMwNmNlNWJhMzMyXkEyXkFqcGdeQXVyMTQxNzMzNDI@._V1_SX300.jpg"
}, {
"Title": "Superman Returns",
"Year": "2006",
"imdbID": "tt0348150",
"Type": "movie",
"Poster": "https://m.media-amazon.com/images/M/MV5BNzY2ZDQ2MTctYzlhOC00MWJhLTgxMmItMDgzNDQwMDdhOWI2XkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_SX300.jpg"
}, {
"Title": "Superman",
"Year": "1978",
"imdbID": "tt0078346",
"Type": "movie",
"Poster": "https://m.media-amazon.com/images/M/MV5BMzA0YWMwMTUtMTVhNC00NjRkLWE2ZTgtOWEzNjJhYzNiMTlkXkEyXkFqcGdeQXVyNjc1NTYyMjg@._V1_SX300.jpg"
}],
"totalResults": "3",
"Response": "True"
}
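To preview how many documents to expect from a response like this one, the array can be split locally. This is an illustrative sketch only and does not reproduce Fusion's parser: `split_into_documents` is a hypothetical helper, the `Search` field name comes from the example response, and the sample is trimmed (the `Poster` field is omitted) for brevity.

```python
import json


def split_into_documents(response_text, array_field="Search"):
    """Return one dict per object in the named JSON array."""
    payload = json.loads(response_text)
    # Each array element becomes its own document; other top-level
    # fields such as "totalResults" are not copied into the documents.
    return [dict(item) for item in payload.get(array_field, [])]


SAMPLE_RESPONSE = """
{
  "Search": [
    {"Title": "Batman v Superman: Dawn of Justice", "Year": "2016", "imdbID": "tt2975590", "Type": "movie"},
    {"Title": "Superman Returns", "Year": "2006", "imdbID": "tt0348150", "Type": "movie"},
    {"Title": "Superman", "Year": "1978", "imdbID": "tt0078346", "Type": "movie"}
  ],
  "totalResults": "3",
  "Response": "True"
}
"""

if __name__ == "__main__":
    for number, doc in enumerate(split_into_documents(SAMPLE_RESPONSE), start=1):
        print(number, doc["imdbID"], doc["Title"])
```

Comparing this count with the document count in the Index Workbench is a quick way to check that the crawl indexed every object.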