Problem Description
As of v0.22, crawl_rules.policy appears to behave as follows:
allow - Page is crawled and indexed
deny - Page is not crawled and not indexed
For sites that have index pages or site listings (example), this forces you either to index those pages, or at least to send them to Elasticsearch and strip them out with ingest pipelines, which is not very elegant.
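For illustration, here is roughly what the choice looks like in the crawler's YAML config today. This is a sketch only: the domain, the patterns, and the exact rule fields (policy, type, pattern) are my assumptions based on the v0.2x config layout.

```yaml
# crawler.yml - sketch of the current dilemma
# (example.com and the /directory/ pattern are made up)
domains:
  - url: https://example.com
    crawl_rules:
      # Option 1: allow the directory listing pages - they get indexed
      # even though we only want them for link discovery.
      - policy: allow
        type: begins
        pattern: /directory/
      # Option 2: deny them - but then their links are never followed,
      # and the pages they point to are silently dropped from the crawl.
      # - policy: deny
      #   type: begins
      #   pattern: /directory/
```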
Proposed Solution
Change the crawl_rules.policy behaviour as follows:
allow - Page is crawled and indexed; I would also propose renaming this value to index
deny - Page is not crawled and not indexed; I would also propose renaming this value to discard
crawl (new) - Page is crawled but not indexed
The addition of the crawl option would let the crawler pass through matching pages, using their links for further crawling, without forcing the pages themselves to be indexed. This seems more in line with the deny behaviour of the previous Elastic Crawler.
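Under the proposal, the same site could be configured like this (again a sketch, reusing the assumed rule fields from above; only the policy value is new):

```yaml
# crawler.yml - sketch with the proposed policy value
domains:
  - url: https://example.com
    crawl_rules:
      # Proposed: follow links found on directory pages, but do not
      # send the directory pages themselves to Elasticsearch.
      - policy: crawl   # alongside index / discard for today's allow / deny
        type: begins
        pattern: /directory/
```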
Alternatives
Other alternatives would be to have deny behave as in the previous crawler, or to add some kind of post-crawl filtering for such pages.
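For the post-crawl filtering route, an Elasticsearch ingest pipeline with a drop processor already works today, at the cost of fetching and shipping pages you never wanted stored. A sketch, assuming the crawler emits a url field on each document:

```json
PUT _ingest/pipeline/drop-directory-pages
{
  "processors": [
    {
      "drop": {
        "description": "Discard directory listing pages after crawling",
        "if": "ctx.url != null && ctx.url.contains('/directory/')"
      }
    }
  ]
}
```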
Additional Context
This came to my attention while crawling a site with a large, multi-page directory: when I attempted to filter the directory pages out, I noticed a significant drop in the number of crawled pages.