
Extraction Rules

This page explains the individual fields in the extraction ruleset configuration. The last section provides usage examples.

Summary

Extraction rules enable you to customize how the Elastic Open Web Crawler extracts content from webpages. Extraction rules are configured in the crawler config file, under the domains[].extraction_rulesets field.

domains[].extraction_rulesets is an array tied to the url within the same domains array item. If a crawl result's base URL matches the configured domains[].url, Open Crawler will then check whether the result’s full URL matches any of the URL filters. If any filter matches, the crawler will execute the associated extraction rules.
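To make the nesting concrete, here is a minimal skeleton of the relevant part of the config file. The domain and all field values below are placeholders for illustration, not recommendations:

domains:
  - url: https://example.com          # crawl results must match this base URL
    extraction_rulesets:
      - url_filters:                  # which URLs under the domain trigger the rules
          - type: "begins"
            pattern: "/docs"
        rules:                        # what to do when a filter matches
          - action: "extract"
            field_name: "my_field"
            selector: ".my-selector"
            join_as: "string"
            source: "html"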

URL Filters

URL filters are defined as an array. If a URL matches any of the filters in the array, all of the associated extraction rules will be executed. If the array is empty, the extraction rules will be applied to every crawl result.
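For example, here is a hedged sketch with two filters (the paths are invented for illustration). Because filters are combined with OR logic, the rules in this ruleset would run for any URL that begins with /blog or ends with /about:

url_filters:
  - type: "begins"
    pattern: "/blog"
  - type: "ends"
    pattern: "/about"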

domains[].extraction_rulesets[].url_filters[].type

The type of URL filter that will be used.

Possible values:

  • begins
    • Matches if the URL path begins with the pattern
  • ends
    • Matches if the URL path ends with the pattern
  • contains
    • Matches if the pattern appears anywhere in the URL path
  • regex
    • Matches if the URL satisfies the given regular expression

domains[].extraction_rulesets[].url_filters[].pattern

The pattern that the URL filter will match against, depending on the value of type.

The following examples would all work for the URL http://example.com/blog/help/contact:

| type value | pattern example      |
|------------|----------------------|
| begins     | /blog                |
| ends       | contact, /contact    |
| contains   | help, /help/         |
| regex      | ^/blog/help/contact$ |

Rules

Rules are an array. If any of the URL filters match a crawl result's URL, the crawler will attempt to execute all of the rules configured in the array.

domains[].extraction_rulesets[].rules[].action

Specifies what action the crawler should take for this rule.

Possible values:

  • extract
    • Extracts the full HTML element found using the selector
    • The crawler will add it directly to the document, using field_name as the doc's field name
    • If multiple values are found, they will be concatenated according to the join_as value
  • set
    • The crawler will check whether the HTML element configured in selector exists
    • If one or more elements exist, the crawler will add the configured value to the document, using field_name as the doc's field name
    • If no element exists, the crawler will not add anything to the document (a sketch of a full set rule follows below)
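All of the worked examples later on this page use extract, so here is a hedged sketch of a set rule; the selector and field name are invented for illustration. If any element matches .premium-banner, the document gains a has_premium_content field with the fixed value "true"; if no element matches, the field is omitted entirely:

rules:
  - action: "set"
    field_name: "has_premium_content"
    selector: ".premium-banner"
    value: "true"
    source: "html"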

domains[].extraction_rulesets[].rules[].field_name

The name of the field in the document where the extracted content will be stored. This can be any string value, as long as it is not one of the predefined field names in the document schema.

domains[].extraction_rulesets[].rules[].selector

The selector used to find the content. The required format depends on the rule's source, as described below.

Selectors for html sources

If source is html, this can be a CSS selector or an XPath selector.

For selector syntax, see the examples on W3Schools, or refer to the official W3C documentation for more details.

Selectors for url sources

If source is url, this must be a regular expression (regexp). We recommend using capturing groups to explicitly indicate which part of the regular expression needs to be stored as a content field.

Here are some examples:

| String | Regex | Match result | Match group (final result) |
|---|---|---|---|
| https://example.org/posts/2023/01/20/post-1 | posts\/([0-9]{4}) | posts/2023 | 2023 |
| https://example.org/posts/2023/01/20/post-1 | posts\/([0-9]{4})\/([0-9]{2})\/([0-9]{2}) | posts/2023/01/20 | [2023, 01, 20] |
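Translating the second row of this table into a rule might look like the following sketch (the field name publish_date_parts is invented for illustration). With join_as set to array, the three capture groups would be stored together as an array:

rules:
  - action: "extract"
    field_name: "publish_date_parts"
    selector: "posts\/([0-9]{4})\/([0-9]{2})\/([0-9]{2})"
    join_as: "array"
    source: "url"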

domains[].extraction_rulesets[].rules[].join_as

The method for concatenating multiple values. This is only applicable if action is extract.

Possible values are string and array.

domains[].extraction_rulesets[].rules[].value

The value to be inserted into the document if the selector finds any value. This is only applicable if action is set.

Value can be anything except null.

domains[].extraction_rulesets[].rules[].source

Specifies the content source to extract from. Currently, only html and url are supported.

Examples

Extracting from HTML

I have a simple website for an RPG. A page describing cities in the RPG is hosted at https://totally-real-rpg.com/cities. The HTML for this page looks like this:

<!DOCTYPE html>
<html>
  <body>
    <div>Cities:</div>
    <div class="city">Summerstay</div>
    <div class="city">Drenchwell</div>
    <div class="city">Mezzoterran</div>
  </body>
</html>

I want to extract all of the cities as an array, but only from the webpage that ends with /cities. First I must set the url_filters for this extraction rule to apply to only this URL. Then I can define what the Open Crawler should do when it encounters this webpage.

domains:
  - url: https://totally-real-rpg.com
    extraction_rulesets:
      - url_filters:
          - type: "ends"
            pattern: "/cities"
        rules:
          - action: "extract"
            field_name: "cities"
            selector: ".city"
            join_as: "array"
            source: "html"

In this example, the output document will include the following field on top of the standard crawl result fields:

{
  "cities": ["Summerstay", "Drenchwell", "Mezzoterran"]
}
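If join_as were set to "string" instead, the three city names would presumably be concatenated into a single string value rather than returned as an array.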

Extracting from URLs

Now, I also have a blog on this website. There are three posts on this blog, each published under a URL that includes the year of publication directly after the /blog/ path segment.

When these pages are crawled, I want to capture only the year that each post was published. First I should define the url_filters so that this extraction applies only to blog posts. Then I can use a regex selector in the rule to fetch the year from the URL.

domains:
  - url: https://totally-real-rpg.com
    extraction_rulesets:
      - url_filters:
          - type: "begins"
            pattern: "/blog"
        rules:
          - action: "extract"
            field_name: "publish_year"
            selector: "blog\/([0-9]{4})"
            join_as: "string"
            source: "url"

In this example, each ingested document will include a field like the following on top of the standard crawl result fields, with the value taken from that post's URL (2023 is assumed for illustration):
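{
  "publish_year": "2023"
}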

Multiple rulesets

There's no limit to the number of extraction rulesets that can be defined for a single crawler. Taking the above two examples, we can combine them into a single configuration.

domains:
  - url: https://totally-real-rpg.com
    extraction_rulesets:
      - url_filters:
          - type: "ends"
            pattern: "/cities"
        rules:
          - action: "extract"
            field_name: "cities"
            selector: ".city"
            join_as: "array"
            source: "html"
      - url_filters:
          - type: "begins"
            pattern: "/blog"
        rules:
          - action: "extract"
            field_name: "publish_year"
            selector: "blog\/([0-9]{4})"
            join_as: "string"
            source: "url"