mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-27 10:06:00 +03:00
[GH-ISSUE #500] Bugfix: Exception thrown from wget extractor #3346
Originally created by @jrruethe on GitHub (Oct 5, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/500
Describe the bug
The Wget extractor is throwing an exception:
Failed to archive link: ValueError: '/data/index.html' does not start with '/data/archive/1601694245.432877'
which appears to originate from this block of code.
Steps to reproduce
I am using archivebox add --update-all --depth 1 http://bookmarks?do=atom, where http://bookmarks?do=atom is my Shaarli instance. Essentially, I am archiving all my bookmarks.
The log output shows everything works fine for quite a while, then this exception occurs and the Docker container dies.
The folder referenced in the log output below is empty (/data/archive/1601694245.432877).
I believe that the archived link beforehand is fine (https://blog.thefactual.com/what-are-the-best-nonpartisan-news-sources), and that the exception is occurring on the next link; however, the log output doesn't show me which link it is trying to archive. The /data/index.html file that the exception refers to appears to be the main index file, so I am not sure whether this is happening at the very end, after all links have been indexed, when it tries to rebuild the main index.html file as a last step.
This is with a recently published Docker image, nikisweeting/archivebox@sha256:ee4c84369b8620c53f7d5772b70ad86aa22ca71d3ae3648ef19b75ba2c14efaf. I was previously using nikisweeting/archivebox:0.4.21, and when I switched to the newer image, I did not run any init or migration steps before calling add again.
Thank you for your time and help, let me know if there is additional information I can provide that would be useful.
BTW, I saw the comments in this issue describing how the main index is in the process of being removed. It very well could be that this issue is just a consequence of running code that is mid-refactor.
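For anyone hitting the same message: its wording matches what pathlib's relative_to raised in older Python versions when one path is not located under another. Whether the wget extractor calls relative_to at that point is an assumption, not something confirmed in this thread. A minimal sketch of the failure mode, with a hypothetical helper (relative_to_snapshot) and a made-up file path inside the snapshot folder:

```python
from pathlib import PurePosixPath
from typing import Optional


def relative_to_snapshot(output_path: str, snapshot_dir: str) -> Optional[str]:
    """Hypothetical helper: return output_path relative to snapshot_dir,
    or None when output_path lies outside snapshot_dir."""
    try:
        return str(PurePosixPath(output_path).relative_to(snapshot_dir))
    except ValueError:
        # relative_to() raises ValueError when output_path is not under snapshot_dir
        return None


# Snapshot folder taken from the error message in this report.
snapshot_dir = "/data/archive/1601694245.432877"

# A file inside the snapshot folder (illustrative path) resolves normally.
inside = relative_to_snapshot(f"{snapshot_dir}/example.com/index.html", snapshot_dir)

# The main index lives outside the snapshot folder, so a bare relative_to() would raise.
outside = relative_to_snapshot("/data/index.html", snapshot_dir)

print(inside)
print(outside)
```

Here the second call returns None instead of raising. This only illustrates why a path such as /data/index.html cannot be made relative to a snapshot folder; it is not the project's actual fix.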
Screenshots or log output
Software versions
Docker image nikisweeting/archivebox@sha256:ee4c84369b8620c53f7d5772b70ad86aa22ca71d3ae3648ef19b75ba2c14efaf
@cdvv7788 commented on GitHub (Oct 5, 2020):
I will check it. It was introduced in the last refactor probably. Thanks for the report.
@cdvv7788 commented on GitHub (Oct 5, 2020):
We are not able to reproduce the issue. Can you try running init before this command? (Back up your archive first.)
@jrruethe commented on GitHub (Oct 7, 2020):
Yes, I will try this and report back with the results. Thank you
@jrruethe commented on GitHub (Oct 13, 2020):
This issue can be closed. I tried a few things, I'm not sure exactly what fixed it, but here is what I did:
data/index.html
nikisweeting/archivebox@sha256:f3db6ca0ac5eb9405daf5110dcb934cf7ba20b0a362adc380ccbf2d086a679c3
archivebox init
archivebox add ...
This ran for a bit, then completed successfully.
Thank you!
@jrruethe commented on GitHub (Oct 17, 2020):
I have hit this exception again. I'm still investigating to see if this is something strange with my setup / configuration. The following occurred with Docker image nikisweeting/archivebox@sha256:5810591719d05f15cb3af20fce517fb9866f468dda23e321e8312a4baa455009. It appears to happen right at the end, when it should be writing out the /data/index.html file. The /data/index.json file was properly written; in my case it is 4.0GB, so I have a pretty large archive. Because of this, even a small update takes a while to process before the exception occurs, so it will take me some time to reproduce.
@cdvv7788 commented on GitHub (Oct 19, 2020):
@jrruethe can you try with this branch? https://github.com/pirate/ArchiveBox/pull/502
You will need to build the image yourself:
docker build -t archivebox --no-cache
The actual archiving should be faster in that branch, making testing easier.
@jrruethe commented on GitHub (Oct 19, 2020):
Sure, I will give that a try later this week and report back my results. Thank you.
@pirate commented on GitHub (Oct 22, 2020):
Wow, a 4GB index.json! I am sorry you had to suffer through that file being rewritten so many times in the old version. As @cdvv7788 said, the situation is much improved after #502. The new release with that improvement should be out within the next week or two.
@jrruethe commented on GitHub (Oct 22, 2020):
No suffering at all, I have it running on a cronjob in the background and I don't check on it too often, so I didn't notice for a while.
Honestly, ArchiveBox handles the large archive pretty well; I've been impressed.
I haven't had a chance to try out the #502 branch yet, I'll see if I can get to it soon.
Thanks!
@jrruethe commented on GitHub (Oct 22, 2020):
Ok, I tested out this branch, and it fixes the exception. This issue can be closed again.
Right now, I am using an nginx container to serve the data/index.html file, but I am going to work on switching to the archivebox server method.
Thanks!
@cdvv7788 commented on GitHub (Oct 22, 2020):
Glad it worked. Let us know if it reappears.