[GH-ISSUE #1096] Manual reprocess required for PDF screenshots to work #721

Closed
opened 2026-03-02 11:52:10 +03:00 by kerem · 1 comment
Owner

Originally created by @vhsdream on GitHub (Mar 7, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1096

Describe the Bug

I've just cloned Main to test out updating to v0.23 for the Proxmox LXC version. I'm not sure if this is a bug, or perhaps I'm just impatient, but when I add a PDF, the text extraction (OCR) job runs and the screenshot generation job does not. The PDFs look like this:
Image: https://github.com/user-attachments/assets/44966bb6-be1d-4075-b178-6b7e3435e7ac

Then I go into Admin settings and trigger a reprocess and the screenshot is generated:
Image: https://github.com/user-attachments/assets/89fa139a-2f40-414e-926b-a56bdba4e036

Steps to Reproduce

  1. Install/update Hoarder to latest, based on Main
  2. Add a PDF
  3. Wait and refresh the page
  4. Trigger a manual reprocess job from the Admin settings; only then is the screenshot generated

Expected Behaviour

Unless I'm misunderstanding how it's supposed to work, I was thinking that upon adding a PDF, multiple jobs would be running; at least one for the OCR and another for the screenshot gen.

Screenshots or Additional Context

I'm not using the Docker version, but a Proxmox LXC install using the script we created.

I've broken the log output into sections, but nothing is left out; the breaks just mark when certain events occur as a result of my actions.

The new dependencies are installed:

root@hoarder-v023:~# dpkg -l | grep ghostscript
ii  ghostscript                     10.0.0~dfsg-11+deb12u6              amd64        interpreter for the PostScript language and for PDF
root@hoarder-v023:~# dpkg -l | grep graphicsmagick
ii  graphicsmagick                  1.4+really1.3.40-4                  amd64        collection of image processing tools
ii  libgraphicsmagick-q16-3         1.4+really1.3.40-4                  amd64        format-independent image processing - C shared library
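
A package showing up in `dpkg -l` does not by itself prove that the worker process can find the executables. As a rough cross-check, the sketch below looks up the Ghostscript (`gs`) and GraphicsMagick (`gm`) binaries on the current `PATH`. Two assumptions here: that the worker resolves these tools through `PATH`, and that the script runs as the same user and in the same environment as the service. The function name `find_pdf_tools` is made up for this example.

```python
import shutil


def find_pdf_tools(names=("gs", "gm")):
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    # "gs" is the Ghostscript binary, "gm" is the GraphicsMagick binary.
    for name, path in find_pdf_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```

If either tool is reported as not found when run in the service's environment, that would point to an install problem rather than to the job logic.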

Adding a PDF:

Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.350Z info: [Crawler] Connecting to existing browser instance: http://127.0.0.1:9222
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.350Z info: [Crawler] Successfully resolved IP address, new address: http://127.0.0.1:9222/
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting crawler worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting inference worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting search indexing worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting tidy assets worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.399Z info: Starting video worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.400Z info: Starting feed worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.400Z info: Starting asset preprocessing worker ...
Mar 06 19:06:20 hoarder-v023 pnpm[12413]: 2025-03-07T00:06:20.400Z info: Starting webhook worker ...
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.583Z info: [Crawler][69] Will crawl "https://getsamplefiles.com/download/pdf/sample-1.pdf" for link with id "df6uiumrmz6mluy1bq9r26zf"
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.584Z info: [Crawler][69] Attempting to determine the content-type for the url https://getsamplefiles.com/download/pdf/sample-1.pdf
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.627Z info: [webhook][71] Starting a webhook job for bookmark with id "df6uiumrmz6mluy1bq9r26zf"
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.627Z info: [webhook][71] Completed successfully
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.634Z info: [search][70] Attempting to index bookmark with id df6uiumrmz6mluy1bq9r26zf ...
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.704Z info: [search][70] Completed successfully
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.710Z info: [Crawler][69] Content-type for the url https://getsamplefiles.com/download/pdf/sample-1.pdf is "application/pdf"
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.710Z info: [Crawler][69] Downloading pdf from "https://getsamplefiles.com/download/pdf/sample-1.pdf"
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.731Z info: [Crawler][69] Downloaded pdf as assetId: 3ff23b35-95c1-444c-b6a9-73146ce01a44
Mar 06 19:08:03 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:03.742Z info: [Crawler][69] Completed successfully
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.639Z info: [assetPreprocessing][72] Starting an asset preprocessing job for bookmark with id "df6uiumrmz6mluy1bq9r26zf"
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.642Z info: [assetPreprocessing][72] Attempting to extract text from pdf.
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: Warning: Setting up fake worker.
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.840Z info: [assetPreprocessing][72] Extracted 2212 characters from pdf.
Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.850Z info: [assetPreprocessing][72] Completed successfully
Mar 06 19:08:05 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:05.629Z debug: [inference][73] No inference client configured, nothing to do now
Mar 06 19:08:05 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:05.629Z info: [inference][73] Completed successfully
Mar 06 19:08:05 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:05.789Z info: [search][74] Attempting to index bookmark with id df6uiumrmz6mluy1bq9r26zf ...
Mar 06 19:08:05 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:05.924Z info: [search][74] Completed successfully
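
To confirm from the log alone that job 72 never reached the screenshot step, the lines can be grouped by job ID and checked for any screenshot-related message. This is a small sketch, assuming the `[worker][jobId] message` layout seen in the output above; the helper names are invented for illustration and are not part of Karakeep.

```python
import re
from collections import defaultdict

# Matches the "[worker][jobId] message" part of a log line.
JOB_LINE = re.compile(r"\[(?P<worker>\w+)\]\[(?P<job>\d+)\]\s+(?P<msg>.*)$")


def messages_by_job(lines, worker="assetPreprocessing"):
    """Group log messages by job ID for a single worker."""
    jobs = defaultdict(list)
    for line in lines:
        match = JOB_LINE.search(line)
        if match and match.group("worker") == worker:
            jobs[int(match.group("job"))].append(match.group("msg"))
    return dict(jobs)


def screenshot_mentioned(lines):
    """For each asset preprocessing job, report whether any message mentions a screenshot."""
    return {
        job: any("screenshot" in msg.lower() for msg in msgs)
        for job, msgs in messages_by_job(lines).items()
    }


# The asset preprocessing lines logged when the PDF was first added.
FIRST_RUN = [
    'Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.639Z info: [assetPreprocessing][72] Starting an asset preprocessing job for bookmark with id "df6uiumrmz6mluy1bq9r26zf"',
    "Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.642Z info: [assetPreprocessing][72] Attempting to extract text from pdf.",
    "Mar 06 19:08:04 hoarder-v023 pnpm[12413]: Warning: Setting up fake worker.",
    "Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.840Z info: [assetPreprocessing][72] Extracted 2212 characters from pdf.",
    "Mar 06 19:08:04 hoarder-v023 pnpm[12413]: 2025-03-07T00:08:04.850Z info: [assetPreprocessing][72] Completed successfully",
]

if __name__ == "__main__":
    print(screenshot_mentioned(FIRST_RUN))
```

Job 72 logs neither an attempt nor a skip for the screenshot, which suggests the step was not reached at all rather than having failed.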

Manually triggering a reprocessing job from the Admin console:

Mar 06 19:09:06 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:06.948Z info: [assetPreprocessing][75] Starting an asset preprocessing job for bookmark with id "df6uiumrmz6mluy1bq9r26zf"
Mar 06 19:09:06 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:06.952Z info: [assetPreprocessing][75] Skipping PDF text extraction as it's already been extracted.
Mar 06 19:09:06 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:06.952Z info: [assetPreprocessing][75] Attempting to generate PDF screenshot for bookmarkId: df6uiumrmz6mluy1bq9r26zf
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.293Z info: [assetPreprocessing][75] Successfully saved PDF screenshot to database
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.295Z info: [assetPreprocessing][75] Completed successfully
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.298Z info: [assetPreprocessing][76] Starting an asset preprocessing job for bookmark with id "jw045iecs73tcp72cs90xtwz"
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.299Z info: [assetPreprocessing][76] Skipping PDF text extraction as it's already been extracted.
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.299Z info: [assetPreprocessing][76] Skipping PDF screenshot generation as it's already been generated.
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.299Z info: [assetPreprocessing][76] Completed successfully
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.727Z debug: [inference][77] No inference client configured, nothing to do now
Mar 06 19:09:07 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:07.728Z info: [inference][77] Completed successfully
Mar 06 19:09:08 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:08.032Z info: [search][78] Attempting to index bookmark with id df6uiumrmz6mluy1bq9r26zf ...
Mar 06 19:09:08 hoarder-v023 pnpm[12413]: 2025-03-07T00:09:08.106Z info: [search][78] Completed successfully
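
In the reprocess run, a single asset preprocessing job makes two separate decisions: it skips text extraction because the text already exists, then generates the screenshot because none exists yet. The sketch below illustrates that guarded, per-step shape, which is what one would also expect on the first run. It is a hypothetical illustration only: `PdfAsset`, `preprocess_pdf`, and the callables are invented names, and this is not Karakeep's actual code or a diagnosis of the bug.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class PdfAsset:
    """Hypothetical record of what has already been derived from a PDF."""
    text: Optional[str] = None
    screenshot: Optional[bytes] = None


def preprocess_pdf(
    asset: PdfAsset,
    extract_text: Callable[[], str],
    render_screenshot: Callable[[], bytes],
) -> List[str]:
    """Run each step only if its output is missing; return a log of what happened."""
    log = []
    # Each step is guarded independently, so finishing one never skips the other.
    if asset.text is None:
        asset.text = extract_text()
        log.append("extracted text")
    else:
        log.append("skipped text extraction")
    if asset.screenshot is None:
        asset.screenshot = render_screenshot()
        log.append("generated screenshot")
    else:
        log.append("skipped screenshot generation")
    return log


if __name__ == "__main__":
    asset = PdfAsset()
    print(preprocess_pdf(asset, lambda: "sample text", lambda: b"png-bytes"))
    print(preprocess_pdf(asset, lambda: "sample text", lambda: b"png-bytes"))
```

With this shape, a freshly added PDF would get both text and screenshot in one job, and a repeat run would skip both, as job 76 does in the log above.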

Device Details

Firefox (latest) on Arch Linux

Exact Hoarder Version

Pulled from Main

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem
kerem 2026-03-02 11:52:10 +03:00
Author
Owner

@MohamedBassem commented on GitHub (Mar 7, 2025):

This indeed is a bug. Thanks for the report, will send a quick fix.
