This page explains the individual fields in the extraction ruleset configuration. The last section provides usage examples.
Extraction rules enable you to customize how the Elastic Open Web Crawler extracts content from webpages.
Extraction rules are configured in the crawler config file, under the `domains[].extraction_rulesets` field.
`domains[].extraction_rulesets` is an array tied to the `url` within the same `domains` array item.
If a crawl result's base URL matches the configured `domains[].url`, Open Crawler will then check whether the result's full URL matches any of the URL filters.
If any filter matches, the crawler will execute the associated extraction rules.
URL filters are an array. If a URL matches any of the conditions in this array, then all of the extraction rules will be executed. If the `url_filters` array is empty, the extraction rules will be applied to every crawl result.
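As a minimal sketch, a ruleset with one filter next to a ruleset with an empty filter array might look like this (the domain and path here are purely illustrative):

```yaml
domains:
  - url: https://example.com        # illustrative domain
    extraction_rulesets:
      - url_filters:                # matching ANY filter triggers the rules
          - type: "begins"
            pattern: "/docs"        # hypothetical path prefix
        rules: []                   # rules to run on matching pages
      - url_filters: []             # empty array: rules apply to every crawl result
        rules: []
```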
The type of URL filter that will be used.
Possible values:
- `begins` - The beginning of the URL endpoint
- `ends` - The end of the URL endpoint
- `contains` - Any value match within the endpoint
- `regex` - Any regular expression
The pattern the URL filter will follow, dependent on the value for type.
The following examples would all work for the URL http://example.com/blog/help/contact:
| `type` value | `pattern` example |
|---|---|
| `begins` | `/blog` |
| `ends` | `contact`, `/contact` |
| `contains` | `help`, `/help/` |
| `regex` | `^/blog/help/contact$` |
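Put into configuration form, filters for the example URL above could be sketched as follows; remember that matching any one of them is enough (the patterns are illustrative):

```yaml
url_filters:                        # a URL matching ANY of these triggers the rules
  - type: "begins"
    pattern: "/blog"
  - type: "ends"
    pattern: "/contact"
  - type: "contains"
    pattern: "/help/"
  - type: "regex"
    pattern: "^/blog/help/contact$" # hypothetical regex for the example URL's path
```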
Rules are an array. If any of the URL filters match an endpoint, the crawler will attempt to execute all of the configured rules in the array.
Specifies what action the crawler should take for this rule.
Possible values:
- `extract` - Extracts the full HTML element found using the `selector`.
  - The crawler will directly add it to the document using `field_name` as the doc's field name.
  - If multiple values are found, they will be concatenated according to the `join_as` value.
- `set` - The crawler will check whether the HTML element configured in `selector` exists.
  - If one or multiple elements exist, the crawler will add the configured `value` to the document using `field_name` as the doc's field name.
  - If it does not exist, the crawler will not add anything to the document.
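As a sketch, the two actions differ as follows (the selectors and field names here are hypothetical):

```yaml
rules:
  - action: "extract"            # copies the matched element's content into the document
    field_name: "headings"       # hypothetical field name
    selector: "h2"               # hypothetical CSS selector
    join_as: "array"             # multiple matches become an array
    source: "html"
  - action: "set"                # writes a fixed value when the selector matches anything
    field_name: "has_video"      # hypothetical field name
    selector: "video"            # hypothetical CSS selector
    value: "true"                # stored only if at least one matching element exists
    source: "html"
```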
The `field_name` of the document where the extracted content will be stored.
This can be any string value, as long as it is not one of the predefined field names in the document schema.
The selector for finding the content in HTML.
If `source` is `html`, this can be a CSS selector or an XPath selector.
W3Schools has examples of CSS selector and XPath syntax, and the official W3C documentation covers both in more detail.
If `source` is `url`, this must be a regular expression (regexp).
We recommend using capturing groups to explicitly indicate which part of the regular expression needs to be stored as a content field.
Here are some examples:
| String | Regex | Match result | Match group (final result) |
|---|---|---|---|
| `https://example.org/posts/2023/01/20/post-1` | `posts\/([0-9]{4})` | `posts/2023` | `2023` |
| `https://example.org/posts/2023/01/20/post-1` | `posts\/([0-9]{4})\/([0-9]{2})\/([0-9]{2})` | `posts/2023/01/20` | `[2023, 01, 20]` |
The method for concatenating multiple values.
This is only applicable if `action` is `extract`.
Values can be `string` or `array`.
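A hypothetical pair of rules illustrating the difference (the field names and selector are made up):

```yaml
rules:
  - action: "extract"
    field_name: "tags_joined"   # hypothetical field name
    selector: ".tag"            # hypothetical CSS selector
    join_as: "string"           # multiple matches concatenated into one string
    source: "html"
  - action: "extract"
    field_name: "tags_list"
    selector: ".tag"
    join_as: "array"            # multiple matches stored as an array
    source: "html"
```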
The value to be inserted into the document if the selector finds any value.
This is only applicable if `action` is `set`.
Value can be anything except `null`.
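For example, a `set` rule could flag pages that contain a hypothetical `.sale-banner` element:

```yaml
rules:
  - action: "set"
    field_name: "on_sale"       # hypothetical field name
    selector: ".sale-banner"    # hypothetical CSS selector
    value: "yes"                # stored only when the selector matches something
    source: "html"
```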
Specifies the content source to extract from.
Currently only `html` or `url` is supported.
I have a simple website for an RPG.
A page describing cities in the RPG is hosted at https://totally-real-rpg.com/cities.
The HTML for this page looks like this:
```html
<!DOCTYPE html>
<html>
<body>
  <div>Cities:</div>
  <div class="city">Summerstay</div>
  <div class="city">Drenchwell</div>
  <div class="city">Mezzoterran</div>
</body>
</html>
```

I want to extract all of the cities as an array, but only from the webpage that ends with /cities.
First I must set the url_filters for this extraction rule to apply to only this URL.
Then I can define what the Open Crawler should do when it encounters this webpage.
```yaml
domains:
  - url: https://totally-real-rpg.com
    extraction_rulesets:
      - url_filters:
          - type: "ends"
            pattern: "/cities"
        rules:
          - action: "extract"
            field_name: "cities"
            selector: ".city"
            join_as: "array"
            source: "html"
```

In this example, the output document will include the following field on top of the standard crawl result fields:
```json
{
  "cities": ["Summerstay", "Drenchwell", "Mezzoterran"]
}
```

Now, I also have a blog on this website. There are three posts on this blog, which fall under the following URLs:
- https://totally-real-rpg.com/blog/2023/12/25/beginners-guide
- https://totally-real-rpg.com/blog/2024/01/07/patch-1.0-changes
- https://totally-real-rpg.com/blog/2024/02/18/upcoming-server-maintenance
When these pages are crawled, I want to get only the year that the blog post was published.
First I should define the `url_filters` so that this extraction applies only to blog posts.
Then I can use a regex selector in the rule to fetch the year from the URL.
```yaml
domains:
  - url: https://totally-real-rpg.com
    extraction_rulesets:
      - url_filters:
          - type: "begins"
            pattern: "/blog"
        rules:
          - action: "extract"
            field_name: "publish_year"
            selector: "blog\/([0-9]{4})"
            join_as: "string"
            source: "url"
```

In this example, the ingested documents will include the following fields on top of the standard crawl result fields:
- https://totally-real-rpg.com/blog/2023/12/25/beginners-guide
  `{ "publish_year": "2023" }`
- https://totally-real-rpg.com/blog/2024/01/07/patch-1.0-changes
  `{ "publish_year": "2024" }`
- https://totally-real-rpg.com/blog/2024/02/18/upcoming-server-maintenance
  `{ "publish_year": "2024" }`
There's no limit to the number of extraction rulesets that can be defined for a single crawler. Taking the above two examples, we can combine them into a single configuration.
```yaml
domains:
  - url: https://totally-real-rpg.com
    extraction_rulesets:
      - url_filters:
          - type: "ends"
            pattern: "/cities"
        rules:
          - action: "extract"
            field_name: "cities"
            selector: ".city"
            join_as: "array"
            source: "html"
      - url_filters:
          - type: "begins"
            pattern: "/blog"
        rules:
          - action: "extract"
            field_name: "publish_year"
            selector: "blog\/([0-9]{4})"
            join_as: "string"
            source: "url"
```