[GH-ISSUE #1677] Archiving a page that is a PDF fails #1046

Closed
opened 2026-03-02 11:54:37 +03:00 by kerem · 0 comments
Owner

Originally created by @Fmstrat on GitHub (Jun 25, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1677

Describe the Bug

Archiving https://legislation.nysenate.gov/pdf/bills/2023/S5640, which is a PDF, fails.

|archiver-karakeep | 2025-06-25T14:45:35.778838760Z 2025-06-25T14:45:35.778Z info: [Crawler][15783] Will crawl "https://legislation.nysenate.gov/pdf/bills/2023/S5640" for link with id "y3h5bbsebq0au9gw6lrdkpg2"
|archiver-karakeep | 2025-06-25T14:45:35.778875099Z 2025-06-25T14:45:35.778Z info: [Crawler][15783] Attempting to determine the content-type for the url https://legislation.nysenate.gov/pdf/bills/2023/S5640
|archiver-karakeep | 2025-06-25T14:45:35.913397418Z 2025-06-25T14:45:35.913Z info: [Crawler][15783] Content-type for the url https://legislation.nysenate.gov/pdf/bills/2023/S5640 is "application/pdf;charset=UTF-8"
|archiver-karakeep | 2025-06-25T14:45:36.146469898Z 2025-06-25T14:45:36.146Z error: [Crawler][15783] Crawling job failed: Error: net::ERR_ABORTED at https://legislation.nysenate.gov/pdf/bills/2023/S5640
|archiver-karakeep | 2025-06-25T14:45:36.146510255Z Error: net::ERR_ABORTED at https://legislation.nysenate.gov/pdf/bills/2023/S5640
|archiver-karakeep | 2025-06-25T14:45:36.146520174Z     at navigate (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/src/cdp/Frame.ts:193:13)
|archiver-karakeep | 2025-06-25T14:45:36.146527839Z     at process.processTicksAndRejections (node:internal/process/task_queues:105:5)
|archiver-karakeep | 2025-06-25T14:45:36.146534932Z     at async Function.race (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/src/util/Deferred.ts:49:14)
|archiver-karakeep | 2025-06-25T14:45:36.146542226Z     at async CdpFrame.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/src/cdp/Frame.ts:146:17)
|archiver-karakeep | 2025-06-25T14:45:36.146549500Z     at async CdpPage.goto (/app/apps/workers/node_modules/.pnpm/puppeteer-core@22.3.0/node_modules/puppeteer-core/src/api/Page.ts:1581:12)
|archiver-karakeep | 2025-06-25T14:45:36.146556724Z     at async crawlPage (/app/apps/workers/workers/crawlerWorker.ts:301:22)
|archiver-karakeep | 2025-06-25T14:45:36.146563767Z     at async crawlAndParseUrl (/app/apps/workers/workers/crawlerWorker.ts:661:14)
|archiver-karakeep | 2025-06-25T14:45:36.146570741Z     at async runCrawler (/app/apps/workers/workers/crawlerWorker.ts:845:27)
|archiver-karakeep | 2025-06-25T14:45:36.146577854Z     at async Object.run (/app/apps/workers/utils.ts:6:12)
|archiver-karakeep | 2025-06-25T14:45:36.146584928Z     at async Runner.runOnce (/app/apps/workers/node_modules/.pnpm/liteque@0.3.2_better-sqlite3@11.3.0/node_modules/liteque/dist/runner.js:76:13)
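
Note on the log above: the crawler reports the content-type as "application/pdf;charset=UTF-8" and then still reaches `page.goto`, which fails with `net::ERR_ABORTED` (an error commonly seen when Chromium treats a response as a download rather than a page). One possible explanation, which is an assumption and not confirmed anywhere in this report, is that the content-type is compared against the bare string `application/pdf`, so the `;charset=UTF-8` parameter makes the check miss. A minimal sketch of a parameter-tolerant check follows; `isPdfContentType` is a hypothetical helper, not a function from the Karakeep source.

```typescript
// Hypothetical helper (not from the Karakeep codebase): decides whether a
// Content-Type header value denotes a PDF, ignoring parameters and case.
function isPdfContentType(header: string | null | undefined): boolean {
  if (!header) {
    return false;
  }
  // A Content-Type value has the form "type/subtype; param=value; ...".
  // Keep only the media type that precedes the first ";".
  const [mediaType = ""] = header.split(";");
  return mediaType.trim().toLowerCase() === "application/pdf";
}

// The header value taken from the log in this issue.
console.log(isPdfContentType("application/pdf;charset=UTF-8")); // true
// An ordinary HTML page should not be treated as a PDF.
console.log(isPdfContentType("text/html; charset=utf-8")); // false
```

If a check like this were used before navigation, a PDF response could be routed to a direct download path instead of the browser. Whether the crawler is structured that way is not stated in this issue.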

Steps to Reproduce

  • Archive the above URL

Expected Behaviour

  • PDF is saved

Screenshots or Additional Context

No response

Device Details

No response

Exact Karakeep Version

0.25.0

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem