[GH-ISSUE #639] Feature Request: allow for importing WebScrapbook archives #1906

Open
opened 2026-03-01 17:54:50 +03:00 by kerem · 3 comments
Owner

Originally created by @osborne6 on GitHub (Jan 30, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/639

Type

  • [ ] General question or discussion
  • [x] Propose a brand new feature
  • [ ] Request modification of existing behavior or design

What is the problem that your feature request solves

It would be useful to be able to import archive snapshots of websites created with the WebScrapbook extension for Firefox, Chrome, etc.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

A sub-command in the current tool that would convert/import a WebScrapbook archive into an ArchiveBox archive.

Something like:

archivebox import --type webscrapbook <dir-with-webscrapbook-archives>

What hacks or alternative solutions have you tried to solve the problem?

I have not tried anything else; I'm putting this here as a suggestion. I may try to code something up and submit a pull request in the future, unless someone else beats me to it.

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [x] It would be nice to have eventually

@tabashir commented on GitHub (Jan 31, 2021):

Funnily enough, I was just playing around with AB as an alternative to WSB myself.

To get a list of sources to import, I found that the following should give you what is needed:
grep '"source":' tree/meta.js | sed 's/.*"source": "//' | sed 's/".*$//' | tee import.txt
You can then import this list by pasting it into the web frontend, or with:
cat import.txt | docker-compose run --rm archivebox archivebox add
However, if some of your sources have since disappeared from the internet, as some of mine have, then this doesn't help.
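
A slightly more defensive variant of the extraction step above could be sketched as a small shell function. This is only a sketch: the function name `extract_sources` is made up for illustration, and it assumes that every entry in tree/meta.js keeps its "source" key and URL on a single line, which the grep/sed approach above already relies on. It also drops duplicate URLs so the same page is not queued twice.

```shell
#!/bin/sh
# Hypothetical helper (not part of ArchiveBox or WebScrapbook):
# print each unique "source" URL found in a WebScrapbook meta file,
# one URL per line.
# Assumption: each "source" key and its value sit on a single line
# and the URL contains no double quotes.
extract_sources() {
    meta_file="$1"
    # Keep only the '"source": "..."' fragments, strip the key and the
    # surrounding quotes, then sort and de-duplicate the URLs.
    grep -o '"source": *"[^"]*"' "$meta_file" \
        | sed -e 's/^"source": *"//' -e 's/"$//' \
        | sort -u
}

# Example usage (the paths are assumptions):
#   extract_sources tree/meta.js > import.txt
```

The resulting import.txt can then be fed to ArchiveBox in the same way as shown above.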

Also, I'm now hitting this error:
thread 'sonic-channel-client' panicked at 'buffer overflow (20093/20002 bytes)', src/channel/handle.rs:149:29
It is spewed out repeatedly during the import, and things are failing to be collected and indexed, though that isn't specifically related to this issue.


@pirate commented on GitHub (Jan 31, 2021):

As a stopgap measure you can always just put the files directly into the snapshot folders on disk. ArchiveBox won't delete those files (it won't index them either), and you can get to them from the UI by going to the snapshot index /archive/<timestamp>/index.html and clicking the Files... link.
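
For illustration, that manual copy could look roughly like the sketch below. Everything specific in it is an assumption rather than ArchiveBox behavior: the function name `copy_into_snapshot`, the `webscrapbook` subfolder name, and the idea that the snapshot folder lives at `archive/<timestamp>` inside the data directory (inferred from the index path mentioned above).

```shell
#!/bin/sh
# Hypothetical helper (not an ArchiveBox command): copy the contents of
# one WebScrapbook item folder into an existing snapshot folder, under a
# "webscrapbook" subfolder (an assumed name chosen for this sketch).
copy_into_snapshot() {
    data_dir="$1"    # ArchiveBox data directory (assumed to contain archive/)
    timestamp="$2"   # name of the snapshot folder under archive/
    wsb_item="$3"    # WebScrapbook item folder to copy from
    snapshot_dir="$data_dir/archive/$timestamp"

    # Only copy into snapshot folders that already exist; never create one.
    if [ ! -d "$snapshot_dir" ]; then
        echo "no such snapshot folder: $snapshot_dir" >&2
        return 1
    fi

    # "src/." copies the folder contents, including hidden files.
    mkdir -p "$snapshot_dir/webscrapbook" &&
        cp -R "$wsb_item/." "$snapshot_dir/webscrapbook/"
}

# Example usage (all three arguments are placeholders):
#   copy_into_snapshot /path/to/data <timestamp> /path/to/webscrapbook/item
```

The copied files are not indexed, as noted above, but they should then show up when browsing the snapshot's files from the UI.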

P.S. I think that sonic overflow is fixed in v0.5.4 fyi @tabashir. Give dev a try:

docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
docker run -v $PWD:/data archivebox:dev ...

@tabashir commented on GitHub (Jan 31, 2021):

> As a stopgap measure you can always just put the files directly into the snapshot folders on disk. ArchiveBox won't delete those files (it won't index them either), and you can get to them from the UI by going to the snapshot index /archive/<timestamp>/index.html and clicking the Files... link.
>
> P.S. I think that sonic overflow is fixed in v0.5.4 fyi @tabashir. Give dev a try:
>
> docker build -t archivebox:dev https://github.com/ArchiveBox/ArchiveBox.git#dev
> docker run -v $PWD:/data archivebox:dev ...

Thanks so much for the pointers here, @pirate. I'll give that a go for the missing ones.

Regarding the sonic issue: after posting this, I found a reference to it in this PR: https://github.com/ArchiveBox/ArchiveBox/pull/625. I manually hacked sonic.py within my Docker image, and that seems to have done the trick too. I will give the dev image a go as well, though.
