mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #557] Re-running archivebox init loses metadata #3374
Originally created by @shepner on GitHub (Nov 30, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/557
This extracts a defect mentioned in a comment on #556.
Describe the bug
Running archivebox init is a dangerous task (data is lost when the DB is recreated)
Steps to reproduce
Run archivebox init
Software versions
archivebox/archivebox:latest as of 11/29/2020
Discussion
archivebox init should not be dangerous. Did some folders get wiped, or did some entries in the database get lost when you ran it? Can you reliably reproduce it? It would be very helpful if that is the case.
Originally posted by @cdvv7788 in https://github.com/ArchiveBox/ArchiveBox/issues/556#issuecomment-735478673
I was mainly referring to the Timestamp and Title fields. I haven't used tags yet, so I haven't tested that. When archivebox init re-adds the snapshots to the DB, the timestamp gets overwritten and the title is regenerated. The net result is that I have a few hundred entries that claim to have been added in the same minute and a handful of titles that reverted to "403 Forbidden" due to the WARC method failing. Finally, the Files indicators don't seem to be repopulating correctly in the admin panel.
[Screenshot: Files indicators]
[Screenshot: Timestamps]
@pirate commented on GitHub (Dec 1, 2020):
Thanks for reporting, this is definitely a serious bug, and we'll look into it before releasing v0.5.0 to make sure it doesn't affect anyone else.
@cdvv7788 can you investigate further if you have some time?
@cdvv7788 commented on GitHub (Dec 1, 2020):
Yes, I will make this my priority.
@cdvv7788 commented on GitHub (Dec 1, 2020):
@shepner does this happen when you are updating from v0.4 to v0.5? Or does just re-running it on v0.5 trigger the issue?
@shepner commented on GitHub (Dec 1, 2020):
Good question. The script has been pulling the latest version when it runs, and I never looked to see what version I'm on. Apparently it is v0.5.0.
Just ran sudo docker exec -it archivebox su archivebox -c "archivebox init" again. I don't see any differences in the times etc., so perhaps I started with v0.4.x and at some point got switched over to v0.5.0? I haven't been using this very long, so I'm guessing it occurred just a few days ago.
On a related note, I suggest that archivebox/archivebox:latest should always point to your latest stable version. That is, from the documentation, I was expecting to see v0.4.21, as that is what is listed on the releases page. This might help avoid later confusion and issues as more people adopt the Docker container.
@pirate commented on GitHub (Dec 1, 2020):
Aha that explains it, v0.5.0 is an unreleased working branch, still full of bugs. Definitely downgrade your image back to v0.4.21.
Unfortunately I haven't figured out an easy way for us to control the :latest tag because Docker Hub assigns it to whatever the last uploaded image is.
We need to be able to build and share images of our working branches (which get pushed more often than stable releases). Do you know of a good way to do that without bumping latest each time we push some WIP branch? No worries if not, I'll probably do some research on my own, and worst case we can put our WIP code on a separate Docker Hub account entirely.
@shepner commented on GitHub (Dec 3, 2020):
This part of the conversation should probably be moved elsewhere, but:
Note that I do not work as a developer, my own projects/repos aren't at the point where anyone (myself included) particularly cares about releases, and professionally I'm a bit removed from that side of the process. That said, I'm pretty sure having :latest assigned to the, well, latest image is to be expected. :)
I suspect there are several ways to go about this, but I haven't gone through the nuances of each. For example:
docker build only sees that release version.
I haven't messed with Jenkins nearly enough, but I'm guessing the latter is closer to the ideal, if for no other reason than that it doesn't require code merges or the dev team remembering what branch to use.
Thinking about it, that latter approach would mean Docker Hub would only receive the images pushed to it, and anyone wanting to play can just run their own build locally. The hard part is figuring out how to configure the Jenkins job (Jenkins itself is easy enough to get running).
@pirate commented on GitHub (Dec 3, 2020):
We already have GitHub Actions CI (we don't use Jenkins) set up to auto-push built images on every commit, so it would be easy for us to limit that to master only and work on the dev branch instead.
I think that's probably the way to go. It's what we had before and it worked well. I was just hoping we could autobuild all commits because it makes for easier development if you don't have to build locally, but it seems to be causing more trouble than it's worth with the constant :latest bumping.
@cdvv7788 commented on GitHub (Dec 3, 2020):
@pirate we discussed this approach before. It looks like the way to go, so we should do that. If you want to have a build for everything, let's set up another Docker Hub repository.
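The tagging policy discussed in this thread (only builds from master move :latest, while work-in-progress branches are built locally or pushed elsewhere) can be sketched roughly as below. This is an illustration only, not the project's actual CI configuration: the function name docker_tags_for_ref, the ref strings, and the choice to also push a version tag are all assumptions made for the example.

```python
from typing import List


def docker_tags_for_ref(git_ref: str, version: str) -> List[str]:
    """Hypothetical helper: decide which Docker tags a CI build may push.

    Assumption: stable releases are built from the master branch and
    every other branch (e.g. dev) is treated as work in progress.
    """
    if git_ref == "refs/heads/master":
        # Stable build: publish the version tag and move :latest to it.
        return [version, "latest"]
    # Work-in-progress build: push nothing, so :latest never moves.
    return []


if __name__ == "__main__":
    print(docker_tags_for_ref("refs/heads/master", "0.4.21"))
    print(docker_tags_for_ref("refs/heads/dev", "0.5.0"))
```

Keeping the decision in one place means a WIP push can never move :latest by accident; publishing WIP images to a separate repository would be an extra branch in the same function.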
@pirate commented on GitHub (Dec 3, 2020):
Done: github.com/ArchiveBox/ArchiveBox@b186e98cd2 (we no longer push every commit to Docker Hub as :latest images, only the full releases).
@cdvv7788 commented on GitHub (Dec 5, 2020):
@shepner do you have the original archive (a copy from before you ran archivebox init)?
@shepner commented on GitHub (Dec 5, 2020):
Sorry, no. TBH, it isn't a big deal on my part. I'm still getting a feel for how this works and if/how I want to use it. That latter part is probably best suited for a new thread.
@pirate commented on GitHub (Feb 1, 2021):
I suspect the original issue here is being caused by this https://github.com/ArchiveBox/ArchiveBox/issues/640
@pirate commented on GitHub (Apr 6, 2021):
I believe these issues are all fixed in the latest versions, if you're still having any issues with corruption please comment back here and we'll investigate.
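One way to check whether re-running archivebox init changed snapshot metadata is to copy the index database beforehand and compare the two copies afterwards. The following is a minimal sketch, not part of ArchiveBox: it assumes the index is a SQLite file containing a core_snapshot table with url, timestamp, and title columns, so verify those names against your own database first. The function names and file names are hypothetical.

```python
import sqlite3
from typing import Dict, List, Optional, Tuple

# Assumption: snapshots are stored in a table named core_snapshot with
# url, timestamp and title columns. Check your own schema before use.
QUERY = "SELECT url, timestamp, title FROM core_snapshot"

Metadata = Tuple[Optional[str], Optional[str]]


def load_snapshots(db_path: str) -> Dict[str, Metadata]:
    """Map each snapshot URL to its (timestamp, title) pair."""
    conn = sqlite3.connect(db_path)
    try:
        return {url: (ts, title) for url, ts, title in conn.execute(QUERY)}
    finally:
        conn.close()


def changed_snapshots(before_db: str, after_db: str) -> List[str]:
    """Return URLs whose timestamp or title differ, or which disappeared."""
    before = load_snapshots(before_db)
    after = load_snapshots(after_db)
    return sorted(url for url, meta in before.items() if after.get(url) != meta)
```

For example, after copying the index to index.before.sqlite3 and then running archivebox init, changed_snapshots("index.before.sqlite3", "index.sqlite3") would list the affected URLs; an empty list means no timestamps or titles changed.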