Problem Description
As of v0.22, crawl_rules.policy appears to behave as follows:
allow - Page is crawled and indexed
deny - Page is not crawled and not indexed
For sites that have index pages or site listings (example), this forces you either to index those pages, or at least to send them to Elasticsearch and strip them out with ingest pipelines, which is not very elegant.
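For illustration, here is roughly what the choice looks like in the crawler's YAML config today. This is a sketch only: the domain, the patterns, and the exact rule fields (policy, type, pattern) are my assumptions based on the v0.2x config layout.

```yaml
# crawler.yml - sketch of the current dilemma
# (example.com and the /directory/ pattern are made up)
domains:
  - url: https://example.com
    crawl_rules:
      # Option 1: allow the directory listing pages - they get indexed
      # even though we only want them for link discovery.
      - policy: allow
        type: begins
        pattern: /directory/
      # Option 2: deny them - but then their links are never followed,
      # and the pages they point to are silently dropped from the crawl.
      # - policy: deny
      #   type: begins
      #   pattern: /directory/
```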
Proposed Solution
Change the crawl_rules.policy behaviour as follows:
allow - Page is crawled and indexed; I would also propose renaming this value to index
deny - Page is not crawled and not indexed; I would also propose renaming this value to discard
crawl (new) - Page is crawled but not indexed
The addition of the crawl option would let the crawler pass through matching pages, using their links for further crawling, without forcing the pages themselves to be indexed. This seems more in line with the deny behaviour of the previous Elastic Crawler.
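Under the proposal, the same site could be configured like this (again a sketch, reusing the assumed rule fields from above; only the policy value is new):

```yaml
# crawler.yml - sketch with the proposed policy value
domains:
  - url: https://example.com
    crawl_rules:
      # Proposed: follow links found on directory pages, but do not
      # send the directory pages themselves to Elasticsearch.
      - policy: crawl   # alongside index / discard for today's allow / deny
        type: begins
        pattern: /directory/
```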
Alternatives
Other alternatives would be to have deny behave as in the previous crawler, or to add some kind of post-crawl filtering for such pages.
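For the post-crawl filtering route, an Elasticsearch ingest pipeline with a drop processor already works today, at the cost of fetching and shipping pages you never wanted stored. A sketch, assuming the crawler emits a url field on each document:

```json
PUT _ingest/pipeline/drop-directory-pages
{
  "processors": [
    {
      "drop": {
        "description": "Discard directory listing pages after crawling",
        "if": "ctx.url != null && ctx.url.contains('/directory/')"
      }
    }
  ]
}
```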
Additional Context
This came to my attention while crawling a site with a large, multi-page directory: when I attempted to filter the directory pages out, I noticed a significant drop in the number of crawled pages.