Use Mozilla's PDF.js to extract OCR text #59
figadore wants to merge 10 commits into scambier:master
Conversation
lib/src/pdf/pdf-worker.ts
Outdated
onmessage = async evt => {
  const buffer = Uint8Array.from(decodedPlugin, c => c.charCodeAt(0))
  await plugin.default(Promise.resolve(buffer))
onmessage = async path => {
Log it to make sure, but this should probably stay async evt =>. This call is triggered when you call worker.run() in pdf-manager.ts, so you should receive an event object of the shape:
{
data: {
path: string,
name: string
}
}
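A minimal sketch of how the worker handler could destructure an event of that shape. This is illustrative only, not the PR's actual code; `extractText` is a hypothetical stand-in for the real extraction call.

```typescript
// Sketch only, assuming the event shape shown above.
type PdfRequest = { path: string; name: string }
type PdfEvent = { data: PdfRequest }

// Pure helper so the destructuring itself is easy to verify.
function parseRequest(evt: PdfEvent): PdfRequest {
  const { path, name } = evt.data
  return { path, name }
}

const handleMessage = async (evt: PdfEvent): Promise<void> => {
  const { path, name } = parseRequest(evt)
  // const text = await extractText(path)  // hypothetical extraction call
  // postMessage({ name, text })
}
// In the worker file this would be wired up as: onmessage = handleMessage
```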
lib/package.json
Outdated
"@apollo/utils.createhash": "^3.0.0",
"mammoth": "^1.6.0",
"p-queue": "^7.4.1",
"pdfjs-dist": "^4.2.67",
You probably don't need to add a dependency to pdf.js, as it's already bundled in Obsidian and you should be able to use it with something like this.
const arrayBuffer = await app.vault.readBinary(file);
// @ts-ignore
const document = await window.pdfjsLib.getDocument(arrayBuffer).promise;
for (let i = 1; i <= document.numPages; i++) {
const page = await document.getPage(i);
  // etc.
But! The bundled version might cause issues, so it's still worth trying with an external dependency 👍
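Filling in the "etc." above, per-page extraction could continue with `page.getTextContent()`, which resolves to an object whose `items` each carry a `str` field. A hedged sketch (the `window.pdfjsLib` global is assumed to exist in Obsidian's renderer, as suggested in the comment):

```typescript
type TextItem = { str: string }

// Pure helper: join the text items pdf.js returns for one page.
function joinPageText(items: TextItem[]): string {
  return items.map(item => item.str).join(' ')
}

// Sketch of full-document extraction with the bundled pdf.js build.
async function extractPdfText(arrayBuffer: ArrayBuffer): Promise<string> {
  // @ts-ignore -- pdfjsLib is assumed to be injected globally by Obsidian
  const doc = await window.pdfjsLib.getDocument(arrayBuffer).promise
  const pages: string[] = []
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i)
    const content = await page.getTextContent()
    pages.push(joinPageText(content.items))
  }
  return pages.join('\n')
}
```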
I'll try the bundled version first to reduce the number of variables; I'm having a hard enough time just getting a basic development/iteration workflow going.
@scambier I made a few changes and I think I got pdf.js working on a small number of files. I added a 10-second delay, but with a large number of files, it still crashes within a few seconds.
Thanks @scambier for the help and guidance so far. The latest place I'm stuck seems to be related to how Omnisearch and text-extractor work together. At least that's my best guess so far.

My workflow for testing my changes has been to quit Obsidian, load a bunch of PDFs into the vault directory, and re-open Obsidian. Once open, since Omnisearch has "PDFs content indexing" enabled, the developer console lists a whole bunch of entries. If there is a large enough number of PDFs, Obsidian crashes; otherwise it successfully extracts all the text from all the new PDFs. None of the debug messages that I placed in pdf-worker.ts or pdf-manager.ts show up, though, as if it's somehow shortcutting the queue and calling the extraction library directly.

When I right-click on a file and choose "Text Extractor" -> "Extract Text to clipboard", however, my debug log messages do show up (and the pdf-worker script throws an exception).

Any thoughts? Am I on the right track in thinking Omnisearch may not be using the PDF queue mechanism?
Omnisearch will get a list of all indexable files and asynchronously convert them. This conversion is done in 3 different ways:
The queue management happens in Text Extractor. I'm quite confident it works as intended; I had several problems when spawning too many web workers to process the files, and the CPU usage went through the roof (hence this small trick to leave some room for the CPU to breathe).
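The point above is about capping how many extractions run at once. A hand-rolled sketch of that idea (this is a stand-in for illustration, not the plugin's actual p-queue code):

```typescript
// Run `fn` over `items` with at most `limit` tasks in flight at a time.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length)
  let next = 0
  // Each "lane" pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++
      results[i] = await fn(items[i])
    }
  }
  const lanes = Array.from({ length: Math.min(limit, items.length) }, lane)
  await Promise.all(lanes)
  return results
}
```

With `limit` set to a small number, a large batch of PDFs is processed a few at a time instead of spawning one worker per file, which matches the CPU-pressure reasoning above.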
Progress report: gah, I wasted so many hours testing changes to the text-extractor repo, and none of my changes were showing up. I finally realized it was because of my original attempt at solving this by modifying the Omnisearch plugin back when I cloned #290. My changes there meant that nothing I have been testing was having any effect for Omnisearch indexing/extracting (so it was sort of skipping the queue mechanism, just not for the reasons I guessed).

I also somehow missed that pdf-worker.ts is a "web worker", which has its own set of rules for sharing state, loading libraries, etc. I'm not able to use
Mmmh, I think PDF.js uses its own web worker(s), so you should be able to remove pdf-worker.ts and instead directly call PDF.js here. Because yeah, web workers only take serializable data as input/output, so they're kinda difficult to work with for anything that requires external dependencies :/
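A small illustration of the serializable-data constraint mentioned above (the type names here are hypothetical): only structured-cloneable values can cross a worker boundary, so the message should stay plain data.

```typescript
// Plain strings/numbers/arrays/objects survive the structured clone that
// postMessage performs; functions and class instances with methods do not
// (posting them throws a DataCloneError).
type WorkerSafeMessage = { path: string; name: string }

// Build the message from plain data only -- no callbacks, no library objects.
function toWorkerMessage(path: string, name: string): WorkerSafeMessage {
  return { path, name }
}
// A shape like { extract: (buf: ArrayBuffer) => string } could NOT be posted,
// which is why the worker has to load its own copy of any extraction library.
```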
I've been looking at per-page extraction and was thinking about this PR... I have spotted that there's a pdf-parse library that wraps pdf.js and provides a much easier-to-use API. I ran a test by having a single Node process extract the text of 1080 PDFs sequentially and it worked without crashing. @figadore, how many PDFs did you experience a crash after?
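The sequential test described above could be sketched like this. Per pdf-parse's README, the library resolves to an object with a `text` field; that detail is an assumption here, so the parser and file reader are injected, keeping the loop itself independent of the library.

```typescript
// A hypothetical signature matching pdf-parse's documented result shape.
type PdfParser = (buf: Uint8Array) => Promise<{ text: string }>

// Extract every PDF one at a time -- the sequential approach described above.
async function extractAll(
  paths: string[],
  read: (path: string) => Uint8Array,
  parse: PdfParser
): Promise<Map<string, string>> {
  const out = new Map<string, string>()
  for (const path of paths) {
    // One PDF at a time; no worker pool, no concurrency.
    const { text } = await parse(read(path))
    out.set(path, text)
  }
  return out
}
// In a real run this might be wired up roughly as (names assumed):
//   extractAll(pdfPaths, p => fs.readFileSync(p), buf => pdfParse(buf))
```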
Intended to address #21
I'm currently unable to effectively test changes to the code.