mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #236] Bugfixes: Large crawls eventually crash during json loading/dumping #1675
Originally created by @anarcat on GitHub (May 7, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/236
Describe the bug
This is yet another 0.4.1 bug, feel free to close it but do notice that I can't upgrade either. ;)
Steps to reproduce
TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
Screenshots or log output
Full backtrace:
I have tried upgrading archivebox to the django branch but then it fails with:
I suspect the database structure has changed but it's not immediately obvious to me how to fix that...
Software versions
@pirate commented on GitHub (May 8, 2019):
Run `archivebox init` to migrate the db to the latest version, and definitely don't continue using v0.4.1, it's full of bugs (my latest local version is already v0.4.5 but I'm hesitant to release more alpha versions as there's still lots of stuff unfinished and I don't want to ruin people's archives).
@anarcat commented on GitHub (May 9, 2019):
On 2019-05-08 16:35:11, Nick Sweeting wrote:
I'll try that, thanks.
I'd say you should release what you have. :)
@anarcat commented on GitHub (May 9, 2019):
`archivebox init` doesn't solve the problem, i still get:
to be honest, i'd be fine with flushing this entire archive and starting from scratch - i have no history to keep, really, so it's not a big deal. if you're confident the `__init__` bug is fixed and the `UNIQUE` stuff is just a weird fluke, i'm happy to close this and move on.
i'm just worried it would do the same thing after crawling 20GB for three hours. ;)
@anarcat commented on GitHub (Jun 3, 2019):
any suggestion on where i should start to debug this? should i scrap the 20GB archive and/or database and start from scratch?
thanks in advance! :)
@pirate commented on GitHub (Jul 9, 2019):
Sorry for the long delay @anarcat I'm still swamped by my day job, going to try to get to this in the next couple months but it may be tricky with upcoming travel and client meetings.
Whatever you do don't scrap that archive, it's 100% recoverable, I'm sure there's a simple fix I can add for this in v0.4, I just need a solid block of time to figure it out.
@anarcat commented on GitHub (Jul 9, 2019):
awesome, thanks for the update! no rush, of course :)
@pirate commented on GitHub (May 9, 2020):
Closing this for now, I think I've fixed a few of these bugs in the `django` branch, and the atomicity/corruption issue is moved to #234.
@dvpc commented on GitHub (May 16, 2020):
I was running into the exact same problem (tested both v.0.4.2 and v.0.4.3 branches) yesterday and
noticed that the type error (below) occurs when a link couldn't be processed (e.g. 404).
tldr:
The `output` field in the class `ArchiveResult` must always (i guess) contain a string value. In case of an error it holds an instance of the error object, which in turn makes the deepcopy operation at the end of the json serialization throw the type error.
Solution:
in archivebox/extractors/title.py (line 62)
Change the value of output from err to str(err).
I don't know if i did overlook something else but this appears to fix the error.
@pirate commented on GitHub (May 18, 2020):
Thanks, good catch @dvpc.
@dvpc commented on GitHub (May 18, 2020):
You're welcome :)
I happened to look into the code of the other extractors and i guess they should convert the error responses as well.
Should i create a patch?
@pirate commented on GitHub (May 21, 2020):
If you can that would be awesome, otherwise if you're ok waiting an indefinite amount of time I've already saved this issue in the queue of the long list of things left for me to do in the 0.4 release.
@dvpc commented on GitHub (Sep 9, 2020):
Sorry for the delay. If it's still relevant, i attach the patch file (from branch origin/v0.4.3) here.
fix_json_type_error.patch.txt
PS
I had to rename the file (add .txt suffix), that seems to be a "wrong" workflow i guess (I didn't fork the project yet).
Hope it helps anyway.
@cdvv7788 commented on GitHub (Sep 9, 2020):
@dvpc Can you please describe a way to reliably reproduce it, and create a PR if this happens in the latest version?
There have been a LOT of changes, so I would like to make sure it still makes sense. Thanks!
@dvpc commented on GitHub (Sep 10, 2020):
@cdvv7788 Sorry i don't have any time now. If it helps, i took a quick look into for example
https://github.com/pirate/ArchiveBox/blob/v0.5.0/archivebox/extractors/archive_org.py and in line 73 it still says:
`output = err`, which reliably leads to the error described in the first post of this thread:
TypeError: __init__() missing 5 required positional arguments: 'url', 'code', 'msg', 'hdrs', and 'fp'
So it seems it still makes sense.
(Assuming that v0.5.0 is the latest version - which i can't say since i didn't follow the progress).
All changes in the patch are one-liners for all extractors. Take a look at the patch file and my first post. It is very simple.