[GH-ISSUE #557] Re-running archivebox init loses metadata #3374

Closed
opened 2026-03-14 22:29:00 +03:00 by kerem · 13 comments
Owner

Originally created by @shepner on GitHub (Nov 30, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/557

This extracts a defect raised in the comments on #556

Describe the bug

Running archivebox init is a dangerous task (data is lost when the DB is recreated)

Steps to reproduce

Run archivebox init

Software versions

archivebox/archivebox:latest as of 11/29/2020

Discussion

archivebox init should not be dangerous. Did some folders get wiped, or some entries in the database get lost when you ran it? Can you reliably reproduce it? It would be very helpful if that is the case.

Originally posted by @cdvv7788 in https://github.com/ArchiveBox/ArchiveBox/issues/556#issuecomment-735478673

I was mainly referring to the Timestamp and Title fields. I haven't used tags yet, so I haven't tested that. When archivebox init re-adds the snapshots to the DB, the timestamp gets overwritten and the title is re-generated. The net result is that I have a few hundred entries that claim to have been added in the same minute and a handful of titles that reverted back to "403 Forbidden" due to the WARC method failing. Finally, the Files indicators don't seem to be re-populating correctly in the admin panel.
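One way to confirm the timestamp symptom described above is to count how many snapshots share the same "added" minute. The following is only a minimal sketch: it assumes the "added" values have already been exported from the index as ISO-formatted strings, and the function names and the threshold are hypothetical, not part of ArchiveBox.

```python
from collections import Counter
from datetime import datetime


def snapshots_per_minute(added_times):
    """Count how many snapshots share each 'added' minute."""
    counts = Counter()
    for value in added_times:
        # Truncate each timestamp to minute precision before counting.
        minute = datetime.fromisoformat(value).replace(second=0, microsecond=0)
        counts[minute.isoformat(sep=" ", timespec="minutes")] += 1
    return counts


def suspicious_minutes(added_times, threshold=50):
    """Return the minutes whose snapshot count meets or exceeds the threshold."""
    per_minute = snapshots_per_minute(added_times)
    return {minute: n for minute, n in per_minute.items() if n >= threshold}
```

A minute that holds hundreds of snapshots is a hint that the values were rewritten in bulk rather than recorded when each URL was originally added.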

Files indicators:
image

Timestamps:
image

kerem closed this issue 2026-03-14 22:29:06 +03:00

@pirate commented on GitHub (Dec 1, 2020):

Thanks for reporting, this is definitely a serious bug, and we'll look into it before releasing v0.5.0 to make sure it doesn't affect anyone else.

@cdvv7788 can you investigate further if you have some time.


@cdvv7788 commented on GitHub (Dec 1, 2020):

Yes, I will make this my priority.


@cdvv7788 commented on GitHub (Dec 1, 2020):

@shepner does this happen when you are updating from v0.4 to v0.5? Or just re-running it on v0.5 triggers the issue?


@shepner commented on GitHub (Dec 1, 2020):

Good question. The script has been pulling the latest version when it runs, and I never looked to see what version I'm on. Apparently v0.5.0:

$ sudo docker exec -it archivebox su archivebox -c "archivebox --version"
ArchiveBox v0.5.0
Linux Linux-5.4.0-53-generic-x86_64-with-glibc2.28 x86_64

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY      /usr/local/bin/archivebox                                                    v0.5.0          valid 
 √  PYTHON_BINARY          /usr/local/bin/python3.9                                                     v3.9.0          valid 
 √  DJANGO_BINARY          /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py            v3.1.3          valid 
 √  CURL_BINARY            /usr/bin/curl                                                                v7.64.0         valid 
 √  WGET_BINARY            /usr/bin/wget                                                                v1.20.1         valid 
 √  NODE_BINARY            /usr/bin/node                                                                v15.3.0         valid 
 √  SINGLEFILE_BINARY      /node/node_modules/single-file/cli/single-file                               v0.1.14         valid 
 √  READABILITY_BINARY     /node/node_modules/readability-extractor/readability-extractor               v0.1.0          valid 
 √  MERCURY_BINARY         /node/node_modules/@postlight/mercury-parser/cli.js                          v1.0.0          valid 
 √  GIT_BINARY             /usr/bin/git                                                                 v2.20.1         valid 
 √  YOUTUBEDL_BINARY       /usr/local/bin/youtube-dl                                                    v2020.11.26     valid 
 √  CHROME_BINARY          /usr/bin/chromium                                                            v83.0.4103.116  valid 

[i] Source-code locations:
 √  PACKAGE_DIR            /app/archivebox                                                              21 files        valid 
 √  TEMPLATES_DIR          /app/archivebox/themes/legacy                                                7 files         valid 

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR   None                                                                         -               disabled 
 -  COOKIES_FILE           None                                                                         -               disabled 

[i] Data locations:
 √  OUTPUT_DIR             /data                                                                        7 files         valid 
 √  SOURCES_DIR            /data/sources                                                                4 files         valid 
 √  LOGS_DIR               /data/logs                                                                   0 files         valid 
 √  ARCHIVE_DIR            /data/archive                                                                407 files       valid 
 √  CONFIG_FILE            /data/ArchiveBox.conf                                                        193.0 Bytes     valid 
 √  SQL_INDEX              /data/index.sqlite3                                                          412.0 KB        valid 

Just ran sudo docker exec -it archivebox su archivebox -c "archivebox init" again. I don't see any differences in the times, etc., so perhaps I started with v0.4.x and at some point got switched over to v0.5.0? I haven't been using this very long, so I'm guessing it just occurred a few days ago?

On a related note, I suggest that going forward archivebox/archivebox:latest should point to your latest stable version. I.e., from the documentation, I was expecting to see v0.4.21, as that's what is listed on the releases page. This might help alleviate later confusion/issues as more people adopt the Docker container.


@pirate commented on GitHub (Dec 1, 2020):

Aha that explains it, v0.5.0 is an unreleased working branch, still full of bugs. Definitely downgrade your image back to v0.4.21.

Unfortunately I haven't figured out an easy way for us to control the :latest tag because Docker Hub assigns it to whatever the last uploaded image is.

We need to be able to build and share images of our working branches (which get pushed more often than stable releases), do you know of a good way to do that without bumping latest each time we push some WIP branch? No worries if not, I'll probably do some research on my own, and worst case we can put our WIP code on a separate docker hub acct entirely.


@shepner commented on GitHub (Dec 3, 2020):

This part of the conversation probably should be moved elsewhere, but:

Note that I do not work as a developer, my own projects/repos aren't at the point where anyone (myself included) particularly cares about releases, and professionally I'm a bit removed from that side of the process. That said, I'm pretty sure having :latest assigned to the, well, latest image is to be expected. :)

I suspect there are several ways to go about this, but I haven't gone through the nuances of each. For example:

  • One approach is to have a well-structured set of branches: the dev branch being where you do your day-to-day work, a series of release branches, and master being the current stable. I'm guessing that is the intent for this project.
  • Another doesn't seem to care much about branches at all, and somehow relies on git's release mechanism to trigger scripted automation so docker build only sees that release version.
    I haven't messed with Jenkins nearly enough, but I'm guessing the latter is closer to the ideal, if for no other reason than that it doesn't require code merges or the dev team remembering what branch to use.

In thinking about it, that latter approach would mean Docker Hub would only receive the images pushed to it, and anyone wanting to play can just run their own build locally. The hard part is figuring out how to configure the Jenkins job (Jenkins itself is easy enough to get running).


@pirate commented on GitHub (Dec 3, 2020):

We already have GitHub Actions CI (we don't use Jenkins) set up to auto-push built images on every commit, so it would be easy for us to limit that to master only and work on the dev branch instead.

I think that's probably the way to go, it's what we had before and it worked well, I was just hoping we could autobuild all commits because it makes for easier development if you don't have to build locally, but it seems to be causing more trouble than it's worth with the constant :latest bumping.


@cdvv7788 commented on GitHub (Dec 3, 2020):

@pirate we discussed this approach before. It looks like the way to go... we should do that. If you want to have a build for everything, let's set up another Docker Hub repository.


@pirate commented on GitHub (Dec 3, 2020):

Done github.com/ArchiveBox/ArchiveBox@b186e98cd2 (we no longer push every commit to docker hub as :latest images, only the full releases)
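The policy discussed in this thread (only full releases move :latest) can be expressed as a small tag-selection function. This is an illustrative sketch, not the project's actual CI configuration: the function name, the "dev" and "sha-" tags, and the release-tag pattern are all assumptions. The ref strings follow the refs/tags/... and refs/heads/... form that GitHub Actions exposes.

```python
import re

# Hypothetical convention: release tags look like v0.4.21.
RELEASE_TAG = re.compile(r"^refs/tags/v(\d+\.\d+\.\d+)$")


def docker_tags_for_ref(git_ref, short_sha):
    """Return the image tags a CI job would push for a given git ref."""
    match = RELEASE_TAG.match(git_ref)
    if match:
        # Full release: publish the pinned version and move :latest.
        return [match.group(1), "latest"]
    if git_ref == "refs/heads/dev":
        # Work in progress: publish under tags that never touch :latest.
        return ["dev", f"sha-{short_sha}"]
    # Any other branch: build locally, push nothing.
    return []
```

With a rule like this, anyone pulling :latest only ever receives a released version, while work-in-progress images remain available under an explicit opt-in tag.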


@cdvv7788 commented on GitHub (Dec 5, 2020):

@shepner do you have the original archive (a copy before you ran archivebox init)?
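For anyone wanting to keep such a copy in future, a minimal sketch of backing up the SQLite index before running archivebox init might look like the following. The function name and the backup file naming are hypothetical; the index.sqlite3 filename is taken from the version output earlier in this thread. Copying a database file that is being written to can capture an inconsistent state, so this assumes ArchiveBox is stopped while it runs.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_index(data_dir, now=None):
    """Copy index.sqlite3 to a timestamped sibling file and return the new path."""
    data_dir = Path(data_dir)
    source = data_dir / "index.sqlite3"
    if not source.is_file():
        raise FileNotFoundError(f"no index found at {source}")
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = data_dir / f"index.sqlite3.{stamp}.bak"
    # copy2 preserves file metadata such as the modification time.
    shutil.copy2(source, target)
    return target
```

Calling backup_index("/data") before an upgrade or a re-init leaves a file that can be diffed against the new index if metadata appears to have changed.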


@shepner commented on GitHub (Dec 5, 2020):

Sorry, no. TBH, it isn't a big deal on my part. I'm still getting a feel for how this works and if/how I want to use it. That latter part is probably best suited for a new thread.


@pirate commented on GitHub (Feb 1, 2021):

I suspect the original issue here is being caused by this https://github.com/ArchiveBox/ArchiveBox/issues/640


@pirate commented on GitHub (Apr 6, 2021):

I believe these issues are all fixed in the latest versions, if you're still having any issues with corruption please comment back here and we'll investigate.
