[GH-ISSUE #1556] Question: Importing Data? #3946

Closed
opened 2026-03-15 01:05:00 +03:00 by kerem · 2 comments

Originally created by @Godly-Avenger on GitHub (Oct 21, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1556

If I already have archived data from another source (in the form of WACZ or WARC files), is it possible to import it into ArchiveBox? If so, how?

kerem closed this issue 2026-03-15 01:05:05 +03:00

@pirate commented on GitHub (Oct 21, 2024):

I recommend adding all the URLs to ArchiveBox with just the `title` and `favicon` methods selected first.

This will create snapshot dirs for each URL.

Then iterate over the output of `archivebox list --csv=timestamp,url` to get the snapshot IDs (timestamps) for all the directories it created, and the corresponding URLs they're for.

If you match the filesystem structure archivebox expects in each dir, it should show your files in the UI:

archive/
    <snapshot ts>/
        index.json
        index.html
        dom.html
        example.com/index.html
        example.com/images/
            icon.png
        media/
            video_files_go_here.mp4
            or_audio_files.mp3
        ...

They won't have corresponding `ArchiveResult` entries in the Admin UI (you can create those manually if you need them), but that's not necessary if you only want to see your files on the Snapshot detail view page.

For more see here: https://github.com/ArchiveBox/ArchiveBox#output-formats
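
The matching step above can be sketched in Python. This is only a rough illustration, not an official import tool: the helper names `parse_snapshot_csv` and `copy_into_snapshots` and the `url_to_source` mapping are hypothetical, and it assumes the CSV has the timestamp in the first column and the URL in the second, and that your existing files are already unpacked into one directory per URL.

```python
import csv
import io
import shutil
from pathlib import Path


def parse_snapshot_csv(csv_text: str) -> dict[str, str]:
    """Map each URL to its snapshot timestamp from `timestamp,url` CSV rows."""
    mapping = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or row[0].strip().lower() == "timestamp":
            continue  # skip blank lines and a header row, if one is present
        timestamp = row[0].strip()
        # Re-join in case an unquoted URL itself contained commas.
        url = ",".join(row[1:]).strip()
        mapping[url] = timestamp
    return mapping


def copy_into_snapshots(url_to_ts, url_to_source, archive_dir):
    """Copy each URL's existing files into archive/<timestamp>/.

    Files in the source dir overwrite same-named files in the snapshot dir,
    so keep files such as index.json out of the source dirs.
    Returns the list of snapshot dirs that received files.
    """
    updated = []
    for url, source_dir in url_to_source.items():
        timestamp = url_to_ts.get(url)
        if timestamp is None:
            continue  # this URL has no snapshot in the CSV yet
        snapshot_dir = Path(archive_dir) / timestamp
        if not snapshot_dir.is_dir():
            continue  # the snapshot dir was not created on disk
        shutil.copytree(source_dir, snapshot_dir, dirs_exist_ok=True)
        updated.append(snapshot_dir)
    return updated
```

To use it, save the output of `archivebox list --csv=timestamp,url` to a file, pass its contents to `parse_snapshot_csv`, and call `copy_into_snapshots` with the path to the `archive/` folder of your data directory. Trying it on a copy of the data directory first is probably wise.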


@pirate commented on GitHub (Oct 21, 2024):

Let's move the ongoing discussion here, though, to keep the conversation in one place: https://github.com/ArchiveBox/ArchiveBox/issues/160
