This page explains the individual fields in the extraction ruleset configuration. The last section provides usage examples.
Extraction rules enable you to customize how the Elastic Open Web Crawler extracts content from webpages.
Extraction rules are configured in the crawler config file, under the `domains[].extraction_rulesets` field.
`domains[].extraction_rulesets` is an array tied to the `url` within the same `domains` array item.
If a crawl result's base URL matches the configured `domains[].url`, Open Crawler will then check whether the result's full URL matches any of the URL filters.
If any filter matches, the crawler will execute the associated extraction rules.
URL filters are an array. If a URL matches any of the conditions in this array, then all of the extraction rules will be executed. If the `url_filters` array is empty, the extraction rules will be applied to every crawl result.
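As a minimal sketch, a ruleset with one filter next to a ruleset with an empty filter array might look like this (the domain and path here are purely illustrative):

```yaml
domains:
  - url: https://example.com        # illustrative domain
    extraction_rulesets:
      - url_filters:                # matching ANY filter triggers the rules
          - type: "begins"
            pattern: "/docs"        # hypothetical path prefix
        rules: []                   # rules to run on matching pages
      - url_filters: []             # empty array: rules apply to every crawl result
        rules: []
```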
The type of URL filter that will be used.
Possible values:
- `begins` - The beginning of the URL endpoint
- `ends` - The end of the URL endpoint
- `contains` - Any value match within the endpoint
- `regex` - Any regular expression
The pattern the URL filter will follow, dependent on the value for type.
The following examples would all work for the URL http://example.com/blog/help/contact:
| `type` value | `pattern` example |
|---|---|
| `begins` | `/blog` |
| `ends` | `contact`, `/contact` |
| `contains` | `help`, `/help/` |
| `regex` | `^/blog/help/contact$` |
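Put into configuration form, filters for the example URL above could be sketched as follows; remember that matching any one of them is enough (the patterns are illustrative):

```yaml
url_filters:                        # a URL matching ANY of these triggers the rules
  - type: "begins"
    pattern: "/blog"
  - type: "ends"
    pattern: "/contact"
  - type: "contains"
    pattern: "/help/"
  - type: "regex"
    pattern: "^/blog/help/contact$" # hypothetical regex for the example URL's path
```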
Rules are an array. If any of the URL filters match an endpoint, the crawler will attempt to execute all of the configured rules in the array.
Specifies what action the crawler should take for this rule.
Possible values:
- `extract` - Extracts the full HTML element found using the `selector`.
  - The crawler will directly add it to the document using `field_name` as the doc's field name.
  - If multiple values are found, they will be concatenated according to the `join_as` value.
- `set` - The crawler will check whether the HTML element configured in `selector` exists.
  - If one or multiple elements exist, the crawler will add the configured `value` to the document using `field_name` as the doc's field name.
  - If it does not exist, the crawler will not add anything to the document.
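As a sketch, the two actions differ as follows (the selectors and field names here are hypothetical):

```yaml
rules:
  - action: "extract"            # copies the matched element's content into the document
    field_name: "headings"       # hypothetical field name
    selector: "h2"               # hypothetical CSS selector
    join_as: "array"             # multiple matches become an array
    source: "html"
  - action: "set"                # writes a fixed value when the selector matches anything
    field_name: "has_video"      # hypothetical field name
    selector: "video"            # hypothetical CSS selector
    value: "true"                # stored only if at least one matching element exists
    source: "html"
```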
The `field_name` of the document where the extracted content will be stored.
This can be any string value, as long as it is not one of the predefined field names in the document schema.
The selector for finding the content in HTML.
If `source` is `html`, this can be a CSS selector or an XPath selector.
W3Schools has examples of CSS selector and XPath syntax, and the official W3C documentation covers both in more detail.
If `source` is `url`, this must be a regular expression (regexp).
We recommend using capturing groups to explicitly indicate which part of the regular expression needs to be stored as a content field.
Here are some examples:
| String | Regex | Match result | Match group (final result) |
|---|---|---|---|
| `https://example.org/posts/2023/01/20/post-1` | `posts\/([0-9]{4})` | `posts/2023` | `2023` |
| `https://example.org/posts/2023/01/20/post-1` | `posts\/([0-9]{4})\/([0-9]{2})\/([0-9]{2})` | `posts/2023/01/20` | `[2023, 01, 20]` |
The method for concatenating multiple values.
This is only applicable if `action` is `extract`.
Values can be `string` or `array`.
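A hypothetical pair of rules illustrating the difference (the field names and selector are made up):

```yaml
rules:
  - action: "extract"
    field_name: "tags_joined"   # hypothetical field name
    selector: ".tag"            # hypothetical CSS selector
    join_as: "string"           # multiple matches concatenated into one string
    source: "html"
  - action: "extract"
    field_name: "tags_list"
    selector: ".tag"
    join_as: "array"            # multiple matches stored as an array
    source: "html"
```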
The value to be inserted into the document if the selector finds any value.
This is only applicable if `action` is `set`.
Value can be anything except `null`.
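For example, a `set` rule could flag pages that contain a hypothetical `.sale-banner` element:

```yaml
rules:
  - action: "set"
    field_name: "on_sale"       # hypothetical field name
    selector: ".sale-banner"    # hypothetical CSS selector
    value: "yes"                # stored only when the selector matches something
    source: "html"
```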
Specifies the content source to extract from.
Currently only `html` or `url` is supported.
I have a simple website for an RPG.
A page describing cities in the RPG is hosted at https://totally-real-rpg.com/cities.
The HTML for this page looks like this:
```html
<!DOCTYPE html>
<html>
<body>
  <div>Cities:</div>
  <div class="city">Summerstay</div>
  <div class="city">Drenchwell</div>
  <div class="city">Mezzoterran</div>
</body>
</html>
```

I want to extract all of the cities as an array, but only from the webpage that ends with /cities.
First I must set the url_filters for this extraction rule to apply to only this URL.
Then I can define what the Open Crawler should do when it encounters this webpage.
```yaml
domains:
  - url: https://totally-real-rpg.com
    extraction_rulesets:
      - url_filters:
          - type: "ends"
            pattern: "/cities"
        rules:
          - action: "extract"
            field_name: "cities"
            selector: ".city"
            join_as: "array"
            source: "html"
```

In this example, the output document will include the following field on top of the standard crawl result fields:
```json
{
  "cities": ["Summerstay", "Drenchwell", "Mezzoterran"]
}
```

Now, I also have a blog on this website. There are three posts on this blog, which fall under the following URLs:
- https://totally-real-rpg.com/blog/2023/12/25/beginners-guide
- https://totally-real-rpg.com/blog/2024/01/07/patch-1.0-changes
- https://totally-real-rpg.com/blog/2024/02/18/upcoming-server-maintenance
When these pages are crawled, I want to get only the year that the blog post was published.
First I should define the `url_filters` so that this extraction applies only to blog posts.
Then I can use a regex selector in the rule to fetch the year from the URL.
```yaml
domains:
  - url: https://totally-real-rpg.com
    extraction_rulesets:
      - url_filters:
          - type: "begins"
            pattern: "/blog"
        rules:
          - action: "extract"
            field_name: "publish_year"
            selector: "blog\/([0-9]{4})"
            join_as: "string"
            source: "url"
```

In this example, the ingested documents will include the following fields on top of the standard crawl result fields:
- https://totally-real-rpg.com/blog/2023/12/25/beginners-guide
  `{ "publish_year": "2023" }`
- https://totally-real-rpg.com/blog/2024/01/07/patch-1.0-changes
  `{ "publish_year": "2024" }`
- https://totally-real-rpg.com/blog/2024/02/18/upcoming-server-maintenance
  `{ "publish_year": "2024" }`
There's no limit to the number of extraction rulesets that can be defined for a single crawler. Taking the above two examples, we can combine them into a single configuration.
```yaml
domains:
  - url: https://totally-real-rpg.com
    extraction_rulesets:
      - url_filters:
          - type: "ends"
            pattern: "/cities"
        rules:
          - action: "extract"
            field_name: "cities"
            selector: ".city"
            join_as: "array"
            source: "html"
      - url_filters:
          - type: "begins"
            pattern: "/blog"
        rules:
          - action: "extract"
            field_name: "publish_year"
            selector: "blog\/([0-9]{4})"
            join_as: "string"
            source: "url"
```