The Elastic Open Web Crawler can extract content from downloadable binary files, such as PDF and DOCX files.
Binary content is extracted by converting file contents to base64 and including the output in a document to index.
This value is picked up by an Elasticsearch ingest pipeline, which converts the base64 content into plain text and stores it in the body field of the same document.
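The encoding step can be sketched in Python. The document shape and the binary_content field name below are illustrative assumptions, not the crawler's actual schema:

```python
import base64

def build_document(url: str, file_bytes: bytes, mime_type: str) -> dict:
    """Sketch of how a crawler might package a downloaded binary file.

    The file's raw bytes are base64-encoded so they can travel inside a
    JSON document; a downstream ingest pipeline decodes the payload and
    extracts plain text from it.
    """
    return {
        "url": url,
        "content_type": mime_type,
        # Illustrative field name; the real crawler's schema may differ.
        "binary_content": base64.b64encode(file_bytes).decode("ascii"),
    }

doc = build_document("https://example.com/report.pdf", b"%PDF-1.7 ...", "application/pdf")
```

The base64 step matters because raw binary bytes cannot be embedded directly in a JSON document sent to Elasticsearch.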
To extract binary content:

- Enable ingest pipelines in the Elasticsearch configuration
- Enable binary content extraction in the crawler configuration
- Select which MIME types should have their contents extracted
- The MIME type is determined by the HTTP response's Content-Type header when downloading a given file
- While intended primarily for PDF and Microsoft Office formats, you can use any of the formats supported by Apache Tika
- No default MIME types are defined, so at least one MIME type must be configured in order to extract non-HTML content
- The ingest attachment processor does not support compressed files, e.g., an archive file containing a set of PDFs
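The MIME-type check described above amounts to a membership test against the configured list. The header parsing below (stripping a charset parameter, lowercasing) is an illustrative assumption about how such a check might be implemented, not the crawler's exact code:

```python
# Mirrors binary_content_extraction_mime_types from the crawler config.
CONFIGURED_MIME_TYPES = {"application/pdf", "application/msword"}

def should_extract(content_type_header: str) -> bool:
    """Decide whether a downloaded file's content should be extracted.

    The MIME type comes from the HTTP response's Content-Type header,
    which may carry parameters such as '; charset=...' that are stripped
    before comparison.
    """
    mime_type = content_type_header.split(";")[0].strip().lower()
    return mime_type in CONFIGURED_MIME_TYPES

print(should_extract("application/pdf"))           # → True
print(should_extract("text/html; charset=UTF-8"))  # → False
```

Because no MIME types are configured by default, an empty configured set would make this check reject every binary file.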
For example, the following configuration enables binary content extraction for PDF and Microsoft Word files, through the default pipeline ent-search-generic-ingestion:
```yaml
binary_content_extraction_enabled: true
binary_content_extraction_mime_types:
  - application/pdf
  - application/msword
elasticsearch:
  pipeline: ent-search-generic-ingestion
  pipeline_enabled: true
```

Read more on ingest pipelines in Open Crawler here.
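On the indexing side, the pipeline's attachment processor decodes the base64 payload and extracts plain text into the document's body field. That step can be sketched conceptually; real extraction from PDF or Office binaries is delegated to Apache Tika, so a plain-text payload stands in here, and the field names are illustrative:

```python
import base64

def process_attachment(doc: dict) -> dict:
    """Conceptual sketch of the ingest pipeline's decode step.

    Decodes the base64 payload and stores the resulting text in the
    document's body field. A real pipeline delegates text extraction
    to the attachment processor (backed by Apache Tika); plain text
    stands in for a binary format in this sketch.
    """
    raw = base64.b64decode(doc["binary_content"])  # illustrative field name
    doc["body"] = raw.decode("utf-8", errors="replace")
    del doc["binary_content"]  # drop the bulky base64 payload after extraction
    return doc

doc = {
    "url": "https://example.com/note.txt",
    "binary_content": base64.b64encode(b"hello from a crawled file").decode("ascii"),
}
print(process_attachment(doc)["body"])  # → hello from a crawled file
```

Dropping the base64 field after extraction keeps indexed documents small, which is one reason the conversion happens in an ingest pipeline rather than in the stored document itself.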