mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 09:36:01 +03:00
[GH-ISSUE #639] Feature Request: allow for importing WebScrapbook archives #1906
Originally created by @osborne6 on GitHub (Jan 30, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/639
Type
What is the problem that your feature request solves
It may be useful to import archive snapshots of websites created with the WebScrapbook plugin for Firefox, Chrome, etc.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
A sub-command in the current tool that would convert/import a WebScrapbook archive into an ArchiveBox archive.
Something like:
What hacks or alternative solutions have you tried to solve the problem?
I have not tried anything else, but I am putting this here as a suggestion. I may try to code something up and submit a pull request in the future, unless someone else beats me to it.
How badly do you want this new feature?
@tabashir commented on GitHub (Jan 31, 2021):
Funnily enough, I was just playing around with AB as an alternative to WSB myself.
I found that, to get a list of sources to import, the following should give you what is needed, I think:
cat tree/meta.js |grep '"source":' | sed 's/.*"source": "//' |sed 's/".*$//' |tee import.txt
You can then import this by pasting it into the web frontend, or:
cat import.txt | docker-compose run --rm archivebox archivebox add
However, if some of your sources have since disappeared from the internet, as some of mine have, then this doesn't help.
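The grep/sed pipeline above can be sketched a little more defensively as a small shell function. This is only a sketch, not part of ArchiveBox or WebScrapbook: the name extract_sources is made up for illustration, and it assumes tree/meta.js stores each original URL as a "source": "..." pair, as the pipeline above implies. It also drops duplicates and anything that is not an http(s) URL, which is an extra filtering choice and not something the original command does.

```shell
#!/bin/sh
# Hypothetical helper (not an ArchiveBox or WebScrapbook command):
# print each unique http(s) "source" URL found in a WebScrapbook
# meta.js-style file, one URL per line.
extract_sources() {
    # $1: path to the metadata file, e.g. tree/meta.js (assumed layout)
    grep -o '"source": *"[^"]*"' "$1" |          # keep only the "source": "..." pairs
        sed 's/^"source": *"//; s/"$//' |        # strip the key and surrounding quotes
        grep -E '^https?://' |                   # skip file:// and other non-web sources
        sort -u                                  # sort and remove duplicates
}

# Example usage (paths are assumptions):
#   extract_sources tree/meta.js | tee import.txt
```

The resulting import.txt can be fed to ArchiveBox the same way as before. Values containing escaped quotes would be cut short by this pattern, so it is worth spot-checking the output.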
Also, I'm now hitting a
thread 'sonic-channel-client' panicked at 'buffer overflow (20093/20002 bytes)', src/channel/handle.rs:149:29
spewing out over the import and things failing to be collected and indexed, though that isn't specifically related to this issue.
@pirate commented on GitHub (Jan 31, 2021):
As a stopgap measure you can always just put the files directly into the snapshot folders on disk. ArchiveBox won't delete those files (it won't index them either), and you can get to them from the UI by going to the snapshot index /archive/<timestamp>/index.html and clicking the Files... link.
P.S. I think that sonic overflow is fixed in v0.5.4 fyi @tabashir. Give dev a try:
@tabashir commented on GitHub (Jan 31, 2021):
Thanks so much for the pointers here @pirate, I'll give that a go for the missing ones.
In relation to the sonic issue, after posting this I did find a reference to it in the PR: https://github.com/ArchiveBox/ArchiveBox/pull/625. I manually hacked the sonic.py within my docker image and that seems to have done the trick too. Will give the dev image a go too though.