
# Advanced Open Crawler Details

## Crawl Lifecycle

The Elastic Open Web Crawler runs crawl jobs based on the configuration file you reference when starting the crawler. As Open Crawler runs, each URL it discovers is handed to a worker thread to be visited, and each crawled page is indexed into Elasticsearch as one document.
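For example, a crawl job is typically defined in a YAML file that is passed to the crawler's CLI. The sketch below is illustrative only: the key names follow the project's documented configuration format, but the domain, index name, and credentials are placeholders, so verify everything against the example configs shipped with the repository.

```yaml
# my-crawler.yml — a minimal, illustrative crawl job configuration.
# All values here are placeholders; check the repository's example
# configs for the authoritative key names and defaults.
domains:
  - url: https://www.example.com      # site to crawl
    seed_urls:
      - https://www.example.com/blog  # starting point for the primary crawl
output_sink: elasticsearch            # send crawl results to Elasticsearch
output_index: my-crawler-index        # index that receives one doc per page
elasticsearch:
  host: http://localhost
  port: 9200
  api_key: "<YOUR_API_KEY>"
```

Assuming the repository's documented entry point, the job would then be started with `bin/crawler crawl my-crawler.yml`.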

Crawls are performed in two stages: a primary crawl and a purge crawl.

### The primary crawl

Starting from the URLs listed in `seed_urls`, the crawler fetches web content. Each link it encounters is added to the crawl queue, unless the link should be ignored due to crawl rules or crawler directives, as sketched below.
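As an illustration of links being excluded, crawl rules are declared per domain in the same configuration file. The `policy`/`type`/`pattern` shape below follows the project's documented crawl rules format, but the `/private` pattern is a hypothetical example.

```yaml
domains:
  - url: https://www.example.com
    seed_urls:
      - https://www.example.com/blog
    crawl_rules:
      # Deny rule: discovered links whose path begins with /private
      # are never added to the crawl queue.
      - policy: deny
        type: begins
        pattern: /private
```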

The results from visiting these webpages are added to a pool. Once the pool reaches the configured size threshold, the pooled results are indexed into Elasticsearch with a single `_bulk` API request.
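To make that flush step concrete, an Elasticsearch `_bulk` request pairs an action line with a document line, as in the sketch below; the index name, IDs, and field values are placeholders rather than real crawler output.

```console
POST /_bulk
{ "index": { "_index": "my-crawler-index", "_id": "66e0ac9d" } }
{ "title": "Blog", "url": "https://www.example.com/blog", "last_crawled_at": "2024-05-01T12:00:00Z" }
{ "index": { "_index": "my-crawler-index", "_id": "8b1f3c2a" } }
{ "title": "About", "url": "https://www.example.com/about", "last_crawled_at": "2024-05-01T12:00:05Z" }
```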

### The purge crawl

After the primary crawl completes, the crawler fetches all documents from the associated index that were not seen during the crawl. It does this by comparing each document's `last_crawled_at` date to the primary crawl's start time. If `last_crawled_at` is earlier than the start time, the webpage was not updated during the primary crawl, so it is added to the purge crawl.
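Conceptually, that selection looks like the range query below. This is a sketch only: the index name and timestamp are placeholders, and the crawler's internal query may differ in detail.

```console
GET /my-crawler-index/_search
{
  "query": {
    "range": {
      "last_crawled_at": {
        "lt": "2024-05-01T12:00:00Z"
      }
    }
  }
}
```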

The crawler then re-crawls all of these webpages. If a page is still accessible, the crawler updates the corresponding document in Elasticsearch. A webpage can be inaccessible for any of the following reasons:

- Updated crawl rules in the configuration file that now exclude the URL
- Updated crawler directives on the server or webpage that now exclude the URL
- A non-200 response from the webserver

At the end of the purge crawl, all docs in the index that were not updated during either the primary crawl or the purge crawl are deleted.

## Document Schema

Open Crawler generates Elasticsearch documents from crawl results. These documents have a predefined list of fields that are always included.

Open Crawler does not impose any mappings onto the indices it ingests documents into. This means you are free to create whatever mappings you like for an index, as long as you create them before indexing any documents.
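For instance, explicit mappings can be created ahead of time with Elasticsearch's standard create-index API. The field types below mirror the schema table later in this section; the index name is a placeholder.

```console
PUT /my-crawler-index
{
  "mappings": {
    "properties": {
      "title":           { "type": "text" },
      "body":            { "type": "text" },
      "url":             { "type": "text" },
      "last_crawled_at": { "type": "date" },
      "url_port":        { "type": "long" }
    }
  }
}
```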

If any content extraction rules have been configured, additional fields can be added to the Elasticsearch documents. However, the predefined fields can never be changed or overwritten by content extraction rules. If you are ingesting into an index that has custom mappings, be sure those mappings don't conflict with the predefined fields listed below (see the extraction rules sketch after the table).

| Field | Type |
| --- | --- |
| `id` | text |
| `body` | text |
| `domains` | text |
| `headings` | text |
| `last_crawled_at` | datetime |
| `links` | text |
| `meta_description` | text |
| `title` | text |
| `url` | text |
| `url_host` | text |
| `url_path` | text |
| `url_path_dir1` | text |
| `url_path_dir2` | text |
| `url_path_dir3` | text |
| `url_port` | long |
| `url_scheme` | text |
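
As a hedged sketch of how extraction rules add fields, the ruleset below follows the project's documented extraction rules format; the `author` field name and the CSS selector are hypothetical, and any field you add must not collide with the predefined fields above.

```yaml
domains:
  - url: https://www.example.com
    extraction_rulesets:
      - url_filters:
          - type: begins
            pattern: /blog          # apply only to blog pages (hypothetical filter)
        rules:
          - action: extract
            field_name: author      # hypothetical custom field added to each doc
            selector: ".author-name"
            join_as: string
            source: html
```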