[GH-ISSUE #1966] Extension not scraping whole page #1223

Closed
opened 2026-03-02 11:55:52 +03:00 by kerem · 9 comments
Owner

Originally created by @alfureu on GitHub (Sep 21, 2025).
Original GitHub issue: https://github.com/karakeep-app/karakeep/issues/1966

Describe the Bug

On many pages, the extension / Karakeep does not scrape the whole text, which makes the archived copy of the page incomplete.

Steps to Reproduce

  1. Go to a page, e.g. https://arstechnica.com/ai/2025/09/science-journalists-find-chatgpt-is-bad-at-summarizing-scientific-papers/

  2. Click on the Karakeep browser extension (currently v1.2.6)

  3. Check the Karakeep page:

Expected Behaviour

The extension / Karakeep should scrape the whole text.

Screenshots or Additional Context

Screenshot of the beginning of the original web article:

Image: https://github.com/user-attachments/assets/751feff5-2e49-4f16-9ff4-6afb4a69bc17

Screenshot in Karakeep (from the beginning, not scrolled):

Image: https://github.com/user-attachments/assets/5b3ac442-a96f-4977-a505-847946903618

Device Details

Vivaldi

Exact Karakeep Version

0.27.1

Have you checked the troubleshooting guide?

  • I have checked the troubleshooting guide and I haven't found a solution to my problem

@alfureu commented on GitHub (Sep 23, 2025):

In addition to this, it would be great if the browser extension archived the whole visible page itself instead of offloading the download to the server. That would make it possible to archive paywalled articles that the user can access (via a subscription), whereas implementing this in the server-side scraper might be challenging.


@qixing-jk commented on GitHub (Sep 23, 2025):

Consider this solution: https://docs.karakeep.app/guides/singlefile. There are no plans to add an archive feature to the Karakeep extension.


@alfureu commented on GitHub (Sep 23, 2025):

> Consider this solution (https://docs.karakeep.app/guides/singlefile), the karakeep extension is not planned to add the archive feature

Thanks, this is helpful. Please change "?ifexists=MODE" to "&ifexists=MODE" in the linked guide; at first the URL did not work for me with the question mark.
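
For context on the "?" versus "&" question: the first query parameter of a URL follows "?", and every later one follows "&". The sketch below is a minimal illustration, assuming a placeholder endpoint (`karakeep.example.com/upload` is made up, and `MODE` is left as the literal placeholder from the guide). The helper `withQueryParam` is hypothetical and only shows how the standard `URL` API picks the separator.

```typescript
// Hypothetical helper (not part of Karakeep or SingleFile): adds a query
// parameter and lets the URL API choose "?" or "&" as the separator.
function withQueryParam(base: string, key: string, value: string): string {
  const url = new URL(base);
  url.searchParams.set(key, value);
  return url.toString();
}

// Placeholder endpoint; substitute the upload URL from the guide.
const plain = withQueryParam("https://karakeep.example.com/upload", "ifexists", "MODE");
const existing = withQueryParam("https://karakeep.example.com/upload?a=1", "ifexists", "MODE");

console.log(plain);    // https://karakeep.example.com/upload?ifexists=MODE
console.log(existing); // https://karakeep.example.com/upload?a=1&ifexists=MODE
```

If the upload URL has no other query parameters, "?ifexists=MODE" is the correct form; "&" is only needed when another parameter already precedes it.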


@qixing-jk commented on GitHub (Sep 23, 2025):

Really? Could you provide a detailed description?


@alfureu commented on GitHub (Sep 23, 2025):

> Really? Could you provide a detailed description?

I just followed the guide in the linked article when I set up the SingleFile extension.


@qixing-jk commented on GitHub (Sep 23, 2025):

> first the url did not work with the question mark

Which URL has the problem? That tutorial should not have any problems; it works for me.


@alfureu commented on GitHub (Sep 24, 2025):

> > first the url did not work with the question mark
>
> What is the URL with the problem? That tutorial should not have any problems; there is no problem for me.

Sorry, you are right, my bad. The tutorial linked above is fine. Still, for the article linked in the issue, I am unable to scrape the beginning of the text, no matter what I do.


@qixing-jk commented on GitHub (Sep 24, 2025):

> > > first the url did not work with the question mark
> >
> > What is the URL with the problem? That tutorial should not have any problems; there is no problem for me.
>
> Sorry, you are right, my bad. All is fine with the tutorial linked above. Still, in the above article I am unable to scrape the beginning of the article, no matter what I do

There is no good solution at the moment. Content extraction is currently implemented through https://github.com/mozilla/readability, and there is no way to customize the extraction step. Related code:
https://github.com/karakeep-app/karakeep/blob/9fe09bfa9021c8d85d2d9aef591936101cab19f6/apps/workers/workers/crawlerWorker.ts#L526
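
To check whether the truncation really happens at the extraction step, one option is to compare the paragraphs of the original page with the extracted text. The sketch below is only a diagnostic idea and is not part of Karakeep or Readability. The helper `missingLeadingParagraphs` and the sample strings are hypothetical, and it assumes you already have the original paragraphs and the extracted text as plain strings.

```typescript
// Hypothetical diagnostic helper (not from Karakeep or Readability):
// returns the paragraphs at the start of the original article that do not
// appear in the extracted text.
function missingLeadingParagraphs(
  originalParagraphs: string[],
  extractedText: string,
): string[] {
  // Collapse whitespace and ignore case so formatting differences don't matter.
  const normalize = (s: string): string =>
    s.replace(/\s+/g, " ").trim().toLowerCase();

  const haystack = normalize(extractedText);
  const missing: string[] = [];

  for (const paragraph of originalParagraphs) {
    const needle = normalize(paragraph);
    if (needle.length === 0) continue; // skip empty paragraphs
    if (haystack.includes(needle)) break; // extraction starts here
    missing.push(paragraph);
  }
  return missing;
}

// Made-up sample data standing in for a real page and its extracted text.
const original = ["Intro paragraph.", "Second   paragraph.", "Third paragraph."];
const extracted = "Second paragraph.\nThird paragraph.";
const dropped = missingLeadingParagraphs(original, extracted);
console.log(dropped); // [ 'Intro paragraph.' ]
```

An empty result means the extraction begins where the article does; a non-empty result lists what was dropped from the top.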

@MohamedBassem commented on GitHub (Oct 12, 2025):

As @qixing-jk mentioned, that's probably a mozilla/readability problem. Unfortunately, there's not much we can do about it.
