[GH-ISSUE #1203] Feature Request: a web clipper #3763

Open
opened 2026-03-15 00:22:49 +03:00 by kerem · 3 comments
Owner

Originally created by @berezovskyi on GitHub (Aug 8, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1203

Type

  • [ ] General question or discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

Being able to clip webpage contents that are hard to fetch using ArchiveBox (captchas, datacenter IP blocks, authentication).

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

Evernote Web Clipper did it perfectly.

What hacks or alternative solutions have you tried to solve the problem?

Using Evernote Web Clipper for pages that ArchiveBox cannot archive. Tried Joplin and it seems to do the job too.

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [x] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually

  • [x] I'm willing to contribute dev time (https://github.com/ArchiveBox/ArchiveBox#archivebox-development) / money to fix this issue
  • [x] I like ArchiveBox so far / would recommend it to a friend
  • [ ] I've had a lot of difficulty getting ArchiveBox set up

@berezovskyi commented on GitHub (Aug 8, 2023):

I started looking into how this could be done technically, and my first idea is to take an OSS clipper extension and fork it to suit AB's needs, e.g. https://github.com/go-shiori/shiori-web-ext

Regarding the upload, I think the best way would be to allow AB to import WARCs (also see https://github.com/ArchiveBox/ArchiveBox/issues/160). Then, perhaps, an extension like https://github.com/machawk1/warcreate could be used unchanged or with minimal changes (to automatically upload the WARC).


@gerroon commented on GitHub (Dec 15, 2023):

This would be so awesome! Joplin has a good web clipper. Trilium's web clipper is OK.


@pirate commented on GitHub (Dec 15, 2023):

In the meantime, as a workaround if you urgently need this: any files placed into the snapshot folder (./archive/<timestamp>/) will be respected by ArchiveBox. So if you have any external WARC, PNG, PDF, etc. files, you can drag them into the snapshot folder manually or create a small script to place them in there.

If you overwrite the existing files or use the default names ArchiveBox uses, it will even display them properly in the UI as part of the snapshot.

I try to respect the UNIX "everything is a file" mentality, and may even move towards supporting more pure filesystem-based manipulation of the archives in future releases.
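
For anyone wanting to try the "small script" route, here is a minimal sketch of what it might look like. The function name `place_in_snapshot` and its parameters are hypothetical and not part of ArchiveBox; it only copies a file into an existing `./archive/<timestamp>/` folder and assumes you already know the snapshot's timestamp and the target filename you want.

```python
import shutil
from pathlib import Path
from typing import Optional


def place_in_snapshot(
    source: Path,
    archive_root: Path,
    timestamp: str,
    target_name: Optional[str] = None,
) -> Path:
    """Copy an externally captured file into <archive_root>/<timestamp>/.

    Hypothetical helper, not an ArchiveBox API. `target_name` lets the
    caller pick the filename inside the snapshot folder; if omitted,
    the source file's own name is kept.
    """
    snapshot_dir = archive_root / timestamp
    if not snapshot_dir.is_dir():
        # Refuse to create snapshot folders: only existing ones are used.
        raise FileNotFoundError(f"no snapshot folder at {snapshot_dir}")

    destination = snapshot_dir / (target_name or source.name)
    # copy2 preserves timestamps and overwrites an existing file.
    shutil.copy2(source, destination)
    return destination
```

A call might look like `place_in_snapshot(Path("clip.png"), Path("./archive"), "<timestamp>", target_name="screenshot.png")`, where `screenshot.png` is only an assumed default name; check an existing snapshot folder to see which filenames your ArchiveBox version actually uses.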
