mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #617] Question: archivebox server throws 500 error #382
Originally created by @winteriscariot on GitHub (Jan 15, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/617
I'm using nginx in front of archivebox on my local intranet to access my archives, and it's throwing a 500 error whenever I simply load the front page. nginx is recording no errors, so I assume this is happening internal to archivebox. The `archivebox server` console -- running in a screen -- simply reports the 500 error and nothing further.
Is there anything I can do to troubleshoot this? Perhaps increase the verbosity of `archivebox server` at the console so I can see where the fault exists?
I tried an `archivebox init` on my archive directory to no result.
I can dump the archivebox list to html, which is what I'm doing for now, but it's not ideal since I can't change the order of links (I prefer newest links at the top of the list, while the html dump puts oldest at top; it's a pet peeve but the archivebox server allows me to change that).
Arch Linux, archivebox 0.5.3 via pip
Thanks!
@winteriscariot commented on GitHub (Jan 15, 2021):
Here is the result of running `archivebox server --debug` and then reloading the main archivebox page (at which point I receive the 500 error):
Note: the archivebox server was working fine, until yesterday it just started throwing the 500 error. I'm pretty much at a loss, as nothing has changed on the server since then. (literally went to bed one night and woke up to it not working).
Not saying nothing has changed (there's probably something) I'm just kind of clueless as to where to start looking.
EDIT: Did some more testing. I created a new archive directory and copied my .archivebox/archive directory into it, ran `archivebox init` and then `archivebox update`, created a superuser with `archivebox manage createsuperuser`, confirmed no errors, then ran `archivebox server` in the new directory. It continues to throw a 500 error. I will likely try creating a fresh archivebox directory (without the existing links) to see if the server will run properly without any customizing of the config or anything.
@winteriscariot commented on GitHub (Jan 15, 2021):
After creating a new archive dir without the existing archives, I was able to avoid the 500 error. Therefore I'm marking this as closed, since it must be something to do with one of my links.
@mAAdhaTTah commented on GitHub (Jan 15, 2021):
If you set DEBUG=True in your config, you'll get a Django debug stack trace, which can help with stuff like this.
@winteriscariot commented on GitHub (Jan 15, 2021):
Hm, the issue has started to recur. Thanks for that @mAAdhaTTah -- I was able to get more info when loading /admin/core/snapshot/. Django traceback output:
I assume this has something to do with something I archived? I took my original archive, dumped all the URLs to a text file, then created a new collection in a new directory, and imported the URLs into the new collection. It's just not super clear to me how that might affect the rendering of the admin page?
I did try removing everything via pip and then reinstalling (including removing stuff from the python site-packages dir), but the issue persists.
@winteriscariot commented on GitHub (Jan 15, 2021):
A few things:
@dohlin commented on GitHub (Jan 15, 2021):
I too am seeing the issue the moment my Chrome .html bookmarks file gets written to the database (starting a fresh install). For me, it seems to be only the /public URI that throws the error 500; the admin section seems to work fine (from what I've tested). There's clearly something going on here.
EDIT: Even spun up a brand new Ubuntu 20.04 server to test setup from scratch, same issue. Tried an old html bookmarks backup file I had laying around from several months back and same issue. Everything works until I start the initial archive.
@pirate commented on GitHub (Jan 16, 2021):
Yup, definitely a valid bug, we'll look into it, thanks for reporting.
If you're able to narrow it down to a specific link that breaks that would help us a ton.
@aspensmonster commented on GitHub (Jan 23, 2021):
I have the same issue as @dohlin, which I suspect might be different from the issue @winteriscariot is experiencing. In my case, after enabling Django debug output, I get a KeyError exception:
The Django error page that is shown has lots of helpful information. I've uploaded it here (you'll probably need to download it and view it in a browser though).
So far as I can tell, the problem is with a URL that has curly braces in it (`{` and `}`). I know nothing about Django or its templating engine, but I'm assuming it uses those characters for template expansion, and complains about not having an actual variable called `1f5a0aab-2088-4ecc-84e1-6eaa4de7d6c3` to populate from.
The offending link itself:
http://learning.microsoft.com/manager/LearningPlanV2.aspx?resourceId=%7b1f5a0aab-2088-4ecc-84e1-6eaa4de7d6c3%7d&clang=en-US
And the braces showing up in the `wget_path` property of `canon`:
And it does look like local var `output`, which gets pushed into the `format_html` function in `archivebox/index/html.py`, has raw curly braces in it:
'<a ' 'href="/archive/1577435751.7223/learning.microsoft.com/manager/LearningPlanV2.aspx@resourceId={1f5a0aab-2088-4ecc-84e1-6eaa4de7d6c3}&clang=en-US.html" ' 'class="exists-False" title="wget">🆆 </a><a '
Presumably, `format_html` eventually triggers template expansion and causes Django to barf.
Full disclosure, I'm in the process of migrating backups from early 2019 into version 0.5.3. Presently, all of my datadirs are "invalid", as it looks like the datadir index.json format has changed significantly.
That being said, I did start an empty archive in a new directory (`archivebox init`), enabled `DEBUG` (`archivebox config --set DEBUG=True`), added the single offending link to the archive (`archivebox add http://learning.microsoft.com/manager/LearningPlanV2.aspx?resourceId=%7b1f5a0aab-2088-4ecc-84e1-6eaa4de7d6c3%7d&clang=en-U`), and got the same HTTP 500 error and KeyError exception Django error output. So I suspect that my efforts of porting old backups to 0.5.3 are unrelated to this bug.
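For readers following the curly-brace diagnosis, here is a minimal sketch of why raw braces would produce a KeyError. It relies on one known fact: Django's `format_html` runs its first argument through Python's `str.format`, so braces in that string are read as replacement fields rather than literal text. The `RAW_OUTPUT` constant and the helper names below are hypothetical stand-ins for illustration, not code from ArchiveBox, and doubling the braces is only one possible workaround, not necessarily what any ArchiveBox release does.

```python
# Hypothetical stand-in for the link markup described above; only the
# brace-containing path segment matters for the demonstration.
RAW_OUTPUT = (
    '<a href="/archive/1577435751.7223/learning.microsoft.com/manager/'
    'LearningPlanV2.aspx@resourceId={1f5a0aab-2088-4ecc-84e1-6eaa4de7d6c3}'
    '&clang=en-US.html" class="exists-False" title="wget">wget</a>'
)


def render_unsafe(markup: str) -> str:
    """Use the markup itself as the format string, as format_html would."""
    return markup.format()


def escape_braces(markup: str) -> str:
    """Double every brace so str.format emits it literally."""
    return markup.replace("{", "{{").replace("}", "}}")


def render_safe(markup: str) -> str:
    """Escape braces first, then format; the text comes back unchanged."""
    return escape_braces(markup).format()


try:
    render_unsafe(RAW_OUTPUT)
except KeyError as exc:
    # The "missing variable" is the text between the braces in the URL.
    print("KeyError:", exc)

print(render_safe(RAW_OUTPUT) == RAW_OUTPUT)
```

Another option, under the same assumptions, would be to pass the URL as an argument to the format call instead of embedding it in the format string, so that it is never parsed for replacement fields.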
EDIT (2021-01-23T16:51:00-06:00): I've also included the index.json (path `archive/1611442069.323929/index.json`) from the test case here (again, you'll need to download the file). The curly braces are present in the output here too.
@pirate commented on GitHub (Jan 23, 2021):
Supremely helpful, thank you @aggroskater. I suspect your diagnosis is correct with the curly braces; that template rendering code is being improved anyway in another branch I'm working on, so hopefully that will clear up some of these issues as well (`canon` is getting ripped out completely, it's always been a source of problems). If anyone wants to submit a quick patch for this I'll approve and merge it in time for v0.5.4, otherwise expect a few weeks until I have my next chunk of free time for AB development.
If you're able to post one of the `./archive/<timestamp>/index.json` files that's being flagged as "invalid" I can help you get it into v0.5.3. Alternatively you can do a 2-step migration through one of the v0.4 versions to get it into v0.5. Everything from v0.4 and up has a sqlite db + rollback-safe migrations system to avoid upgrading pains in the future, so once you get to v0.4 it should be easier from then on to do upgrades.
@pirate commented on GitHub (Feb 1, 2021):
v0.5.4 is released, please give it a try. Report back here if you have any further issues and I can reopen the ticket.
@berezovskyi commented on GitHub (Jun 6, 2021):
I just got a similar error with a URL `https://link.foreignaffairs.com/click/60b4025ed373750fd780a8d9/aHR0cHM6Ly93d3cuZm9yZWlnbmFmZmFpcnMuY29tL2ZhX3VzZXIvc2ltcGxlX3JlZy9hdXRvbG9naW4_dG9rZW49YlVKelQzc0lDZ3FjWjB1bVhLQlVWdklxZHN1ajRFaEtWM3FYbE5TUnNtMXFFV3haM3loRmFTN05mWlc1RVRRSUlzOUxhQ1dvJTJCM1RQTnJaUE0lMkJaTlNSSTRNZUFycUZXVTNnNkxRdVJIN21zJTNEJmRlc3RpbmF0aW9uPS9ub2RlLzExMjc0NjcmdXRtX21lZGl1bT1wcm9tb19lbWFpbCZ1dG1fc291cmNlPWxvX2Zsb3dzJnV0bV9jYW1wYWlnbj1yZWdpc3RlcmVkX3VzZXJfd2VsY29tZSZ1dG1fdGVybT1lbWFpbF8xJnV0bV9jb250ZW50PTIwMjEwNTMw/60b4025ee8467f2b795d0d96B1f48eb65`:
Here is how to fix the system without losing the index:
Run `sqlite3 %filename%` and then the following query: `select url, added from core_snapshot order by added desc limit 10;`. You should see the full URL now.
Then run `docker exec -it -u archivebox archivebox archivebox remove 'https://link.foreignaffairs.com/click/60b4025ed373750fd780a8d9/aHR0cHM6Ly93d3cuZm9yZWlnbmFmZmFpcnMuY29tL2ZhX3VzZXIvc2ltcGxlX3JlZy9hdXRvbG9naW4_dG9rZW49YlVKelQzc0lDZ3FjWjB1bVhLQlVWdklxZHN1ajRFaEtWM3FYbE5TUnNtMXFFV3haM3loRmFTN05mWlc1RVRRSUlzOUxhQ1dvJTJCM1RQTnJaUE0lMkJaTlNSSTRNZUFycUZXVTNnNkxRdVJIN21zJTNEJmRlc3RpbmF0aW9uPS9ub2RlLzExMjc0NjcmdXRtX21lZGl1bT1wcm9tb19lbWFpbCZ1dG1fc291cmNlPWxvX2Zsb3dzJnV0bV9jYW1wYWlnbj1yZWdpc3RlcmVkX3VzZXJfd2VsY29tZSZ1dG1fdGVybT1lbWFpbF8xJnV0bV9jb250ZW50PTIwMjEwNTMw/60b4025ee8467f2b795d0d96B1f48eb65'`, where my URL is replaced by the URL that is causing errors on your system.
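The lookup step above can also be scripted. This is a rough sketch using Python's standard `sqlite3` module; it assumes only what the query in this comment implies, namely a `core_snapshot` table with `url` and `added` columns. The helper names are hypothetical, and the `suspicious` heuristic covers only the curly-brace case discussed earlier in the thread, so it would not flag every URL that breaks rendering (the URL in this comment contains no braces, for instance).

```python
import sqlite3


def recent_snapshots(conn: sqlite3.Connection, limit: int = 10):
    """Return (url, added) rows for the most recently added snapshots."""
    cur = conn.execute(
        "SELECT url, added FROM core_snapshot ORDER BY added DESC LIMIT ?",
        (limit,),
    )
    return cur.fetchall()


def suspicious(url: str) -> bool:
    """Hypothetical heuristic: raw or percent-encoded curly braces in the URL."""
    lowered = url.lower()
    return any(token in lowered for token in ("{", "}", "%7b", "%7d"))


def report(conn: sqlite3.Connection, limit: int = 10) -> None:
    """Print recent snapshots, marking the ones the heuristic flags."""
    for url, added in recent_snapshots(conn, limit):
        marker = "!!" if suspicious(url) else "  "
        print(marker, added, url)
```

To use it, open the index database with `sqlite3.connect(path)`, where `path` is whatever file you would otherwise pass as `%filename%`, and call `report(conn)`. Work on a copy of the database if you are unsure; the sketch only reads, but that is the safer habit.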