
# Advanced Open Crawler Details

## Crawl Lifecycle

The Elastic Open Web Crawler runs crawl jobs based on the configuration file you reference when starting the crawler. As Open Crawler runs, each URL it discovers is handed to a worker thread to be visited, and each crawled page is indexed into Elasticsearch as one document.
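For example, a crawl job is typically defined in a YAML file that is passed to the crawler's CLI. The sketch below is illustrative only: the key names follow the project's documented configuration format, but the domain, index name, and credentials are placeholders, so verify everything against the example configs shipped with the repository.

```yaml
# my-crawler.yml — a minimal, illustrative crawl job configuration.
# All values here are placeholders; check the repository's example
# configs for the authoritative key names and defaults.
domains:
  - url: https://www.example.com      # site to crawl
    seed_urls:
      - https://www.example.com/blog  # starting point for the primary crawl
output_sink: elasticsearch            # send crawl results to Elasticsearch
output_index: my-crawler-index        # index that receives one doc per page
elasticsearch:
  host: http://localhost
  port: 9200
  api_key: "<YOUR_API_KEY>"
```

Assuming the repository's documented entry point, the job would then be started with `bin/crawler crawl my-crawler.yml`.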

Crawls are performed in two stages: a primary crawl and a purge crawl.

### The primary crawl

Starting from the URLs listed in `seed_urls`, the crawler fetches web content. Each link it encounters is added to the crawl queue, unless the link should be ignored due to crawl rules or crawler directives, as sketched below.
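As an illustration of links being excluded, crawl rules are declared per domain in the same configuration file. The `policy`/`type`/`pattern` shape below follows the project's documented crawl rules format, but the `/private` pattern is a hypothetical example.

```yaml
domains:
  - url: https://www.example.com
    seed_urls:
      - https://www.example.com/blog
    crawl_rules:
      # Deny rule: discovered links whose path begins with /private
      # are never added to the crawl queue.
      - policy: deny
        type: begins
        pattern: /private
```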

The results from visiting these webpages are added to a pool. Once the pool reaches the configured size threshold, the pooled results are indexed into Elasticsearch with a single `_bulk` API request.
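To make that flush step concrete, an Elasticsearch `_bulk` request pairs an action line with a document line, as in the sketch below; the index name, IDs, and field values are placeholders rather than real crawler output.

```console
POST /_bulk
{ "index": { "_index": "my-crawler-index", "_id": "66e0ac9d" } }
{ "title": "Blog", "url": "https://www.example.com/blog", "last_crawled_at": "2024-05-01T12:00:00Z" }
{ "index": { "_index": "my-crawler-index", "_id": "8b1f3c2a" } }
{ "title": "About", "url": "https://www.example.com/about", "last_crawled_at": "2024-05-01T12:00:05Z" }
```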

### The purge crawl

After the primary crawl completes, the crawler fetches all documents from the associated index that were not seen during the crawl. It does this by comparing each document's `last_crawled_at` date to the primary crawl's start time. If `last_crawled_at` is earlier than the start time, the webpage was not updated during the primary crawl, so it is added to the purge crawl.
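Conceptually, that selection looks like the range query below. This is a sketch only: the index name and timestamp are placeholders, and the crawler's internal query may differ in detail.

```console
GET /my-crawler-index/_search
{
  "query": {
    "range": {
      "last_crawled_at": {
        "lt": "2024-05-01T12:00:00Z"
      }
    }
  }
}
```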

The crawler then re-crawls all of these webpages. If a page is still accessible, the crawler updates the corresponding document in Elasticsearch. A webpage can be inaccessible for any of the following reasons:

- Updated crawl rules in the configuration file that now exclude the URL
- Updated crawler directives on the server or webpage that now exclude the URL
- A non-200 response from the webserver

At the end of the purge crawl, all docs in the index that were not updated during either the primary crawl or the purge crawl are deleted.

## Document Schema

Open Crawler generates Elasticsearch documents from crawl results. These documents have a predefined list of fields that are always included.

Open Crawler does not impose any mappings onto the indices it ingests documents into. This means you are free to create whatever mappings you like for an index, as long as you create them before indexing any documents.
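For instance, explicit mappings can be created ahead of time with Elasticsearch's standard create-index API. The field types below mirror the schema table later in this section; the index name is a placeholder.

```console
PUT /my-crawler-index
{
  "mappings": {
    "properties": {
      "title":           { "type": "text" },
      "body":            { "type": "text" },
      "url":             { "type": "text" },
      "last_crawled_at": { "type": "date" },
      "url_port":        { "type": "long" }
    }
  }
}
```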

If any content extraction rules have been configured, additional fields can be added to the Elasticsearch documents. However, the predefined fields can never be changed or overwritten by content extraction rules. If you are ingesting into an index that has custom mappings, be sure those mappings don't conflict with the predefined fields listed below (see the extraction rules sketch after the table).

| Field | Type |
| --- | --- |
| `id` | text |
| `body` | text |
| `domains` | text |
| `headings` | text |
| `last_crawled_at` | datetime |
| `links` | text |
| `meta_description` | text |
| `title` | text |
| `url` | text |
| `url_host` | text |
| `url_path` | text |
| `url_path_dir1` | text |
| `url_path_dir2` | text |
| `url_path_dir3` | text |
| `url_port` | long |
| `url_scheme` | text |
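
As a hedged sketch of how extraction rules add fields, the ruleset below follows the project's documented extraction rules format; the `author` field name and the CSS selector are hypothetical, and any field you add must not collide with the predefined fields above.

```yaml
domains:
  - url: https://www.example.com
    extraction_rulesets:
      - url_filters:
          - type: begins
            pattern: /blog          # apply only to blog pages (hypothetical filter)
        rules:
          - action: extract
            field_name: author      # hypothetical custom field added to each doc
            selector: ".author-name"
            join_as: string
            source: html
```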