mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #461] Slow to add new URLs and a lack of verbose logging output #304
Originally created by @pigmonkey on GitHub (Aug 31, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/461
I am running ArchiveBox v0.4.21. The archive method section of my config is this:
Here is the output of adding one new URL from my latest Pinboard JSON dump:
In total, my shell tells me this took 7 minutes and 43 seconds to complete, which seems exceptionally long for a single URL. There are a few things to note:
You can tell from the timestamps of the first two log entry blocks that there is an almost two minute gap (19:26:48 to 19:28:39) between ArchiveBox starting up and actually beginning to add links. There is no indication of what ArchiveBox is doing during this time.
The time between the second log entry block ("Adding 871 links...") and the third ("Writing 880 links...") is another two minutes (19:28:39 to 19:30:23). Pretty much all of this time is spent parsing the 871 URLs from the input. During this time, ArchiveBox shows me a progress bar, so I do know it is doing something, unlike during the previous pause. But two minutes seems like a really long time to check which of 871 strings exist in a SQLite database. (I assume ArchiveBox has already parsed the URLs out of the input JSON at this point, since it has already printed the "Adding 871 links..." message.) Is ArchiveBox trying to do something more complicated here, or is this indicative of a problem?
The time between the log entry for the single new URL and the completion message is another two minutes (19:30:29 to 19:32:40). The individual archive methods all show a progress bar when they are running, and they all complete as fast as I would expect, so this is not a network issue. After the final (archive.org) archive method completes, I counted a 1 minute and 46 second pause where ArchiveBox seems to just hang without giving me any indication of what it is doing.
The time between the last two log entries is yet another two minutes (19:32:40 to 19:34:26). What happens here is another mystery. After the hint output was printed, I counted 1 minute and 33 seconds where ArchiveBox seems to hang again, with no indication of what it is doing, before it prints the final "Writing 880 links..." output.
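For what it's worth, the existence check described above (finding which of 871 URLs are already in the SQLite index) should take well under a second when done as a few batched queries rather than one query per URL. A rough sketch of that approach, using a hypothetical snapshot(url) table rather than ArchiveBox's real schema:

```python
import sqlite3

def find_new_urls(db_path, urls, chunk=500):
    """Return the subset of `urls` not already present in the index.

    Hypothetical schema: a `snapshot` table with a `url` column. Queries
    are batched to stay under SQLite's bound-parameter limit (999 by
    default), so even thousands of URLs need only a handful of queries.
    """
    conn = sqlite3.connect(db_path)
    existing = set()
    for i in range(0, len(urls), chunk):
        batch = urls[i:i + chunk]
        placeholders = ",".join("?" * len(batch))
        rows = conn.execute(
            f"SELECT url FROM snapshot WHERE url IN ({placeholders})", batch
        )
        existing.update(r[0] for r in rows)
    conn.close()
    # Preserve input order for the URLs that still need archiving.
    return [u for u in urls if u not in existing]
```

This is just a sketch of the general technique, not ArchiveBox's actual code path; the point is that a two-minute dedup pass over 871 strings suggests per-row overhead somewhere, not an inherent SQLite cost.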
Overall, it seems like this process is slower than it should be. The lack of any logging output during the long pauses is exceptionally frustrating, since it makes it seem like the program has frozen and makes it difficult to debug what the problem may be. A flag to make the logging more verbose might be a good start to addressing this problem.
The slowness has kept me using the old pre-Django version of ArchiveBox, where it feels that the only performance limitation when adding new URLs is my network connection.
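The verbose-logging flag suggested above could be as simple as wiring a command-line switch to timestamped log output, so that every long pause is bracketed by a "starting X" / "finished X" pair. A minimal sketch (the flag and messages here are illustrative, not an existing ArchiveBox option):

```python
import logging
import sys

def setup_logging(verbose: bool) -> None:
    """Configure timestamped logging to stderr.

    With verbose=True, DEBUG-level messages are shown as well, so slow
    internal phases (index checks, index rebuilds) become visible.
    """
    logging.basicConfig(
        stream=sys.stderr,
        level=logging.DEBUG if verbose else logging.INFO,
        format="[%(asctime)s] %(levelname)s %(message)s",
        datefmt="%H:%M:%S",
        force=True,  # replace any previously configured handlers
    )

# Usage: log before AND after each slow phase so pauses are attributable:
#   logging.info("Checking %d URLs against the index...", 871)
#   ...do the work...
#   logging.info("Index check finished in %.1fs", elapsed)
```

Even without fixing the slowness itself, bracketing each phase this way would turn the mysterious hangs described above into attributable, timed steps.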
@pirate commented on GitHub (Aug 31, 2020):
This should improve dramatically with the v0.5 release that's coming out soon. v0.4 is a transition release where we do lots of slow index writing to make sure the JSON and SQL indexes match and are correct. The delays you're seeing are likely because it's actually writing the main index 9 times in your test!
I'll leave this open for now; once the next release drops, comment back here if it's still slow for you. Out of curiosity, what device are you on? And are you running it in Docker or directly on the host?
@pigmonkey commented on GitHub (Aug 31, 2020):
Ok, I'll report back with 0.5.
This is running on a Thinkpad x260 (16GB RAM, Intel i7 6600U, SSD) in a virtualenv, not with Docker.
@cdvv7788 commented on GitHub (Oct 7, 2020):
@pigmonkey did you experience improvements with the current master branch?
@pigmonkey commented on GitHub (Oct 7, 2020):
Yes, with the current master branch the speed is much improved. But it still seems to hang at the end without any indication of activity.
It only takes 1 second from executing the command (16:52:52) to when it begins archiving the new URL (16:52:53), which is great.
The time between beginning to archive the new URL (16:52:53) and showing the success message (16:53:08) is taken up by the various archive methods I have enabled, all of which show progress indicators and are constrained mostly just by my network connection, so that is great.
However, after it prints the hint about opening the index file or running the server, it hangs for about 2.5 minutes without showing any indication of what it is doing. Then it prints out the path to the index JSON and HTML files. I assume that during this gap it is rebuilding or verifying the index. Some sort of message to this effect would improve the user experience.
It also might be odd for it to prompt me to view the index (via the hint) while it is still doing something to the index. That hint may be more appropriate as the final output after everything else is done.
Tangentially: I also used archivebox remove --delete ... to delete a URL from my archive. This command also seemed to hang, taking much longer than needed to rm the directory and remove the entry from the index. I assume that it is also doing some sort of rebuilding or verification of the index here, after it has deleted the URL. But again there's no output to that effect, so as a user it feels like something is wrong and I am tempted to ctrl+c it. A simple "Rebuilding index..." message whenever this occurs would go a long way.
@cdvv7788 commented on GitHub (Oct 7, 2020):
Well, we are in the middle of a big refactor. The JSON and HTML indexes are deprecated, but not yet removed. At the end of every command they are being rebuilt, which can still cause some delays. They will be completely removed from the process in a later iteration.
I need to review the rm command, but the issue is probably the same. I will create a PR removing those bits so you can further experiment with the speed improvements. It is a good idea to measure that on a decently sized archive.
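Until the legacy rebuild is removed entirely, the "Rebuilding index..." message requested above could be added by wrapping the slow step in a small timing helper. A hypothetical sketch, not ArchiveBox code:

```python
import sys
import time
from contextlib import contextmanager

@contextmanager
def status(message):
    """Print a status line before a slow step and the elapsed time after.

    Intended usage (rebuild_legacy_indexes is a placeholder name):
        with status("Rebuilding JSON/HTML index"):
            rebuild_legacy_indexes()
    """
    print(f"{message}...", file=sys.stderr, flush=True)
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        print(f"{message} done in {elapsed:.1f}s", file=sys.stderr)
```

Because the closing line is printed in a finally block, the elapsed time is reported even if the wrapped step raises, so an interrupted rebuild still leaves a trace in the output.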
@pigmonkey commented on GitHub (Oct 8, 2020):
Ok. I'm happy to test anything against my archive.
@cdvv7788 commented on GitHub (Oct 8, 2020):
@pigmonkey just added #502. Please check how your index behaves with those changes. The remove command was indeed re-creating these indexes, which is why it could have been that slow.
Expected behavior: your CLI commands should be faster overall, with no long hang-ups at the beginning or the end of the process. Let me know how it goes.
@pigmonkey commented on GitHub (Oct 8, 2020):
That completely solves the problem.
When running the latest master, the add command takes an average of 3 minutes to add a single URL, despite it only taking 10-20 seconds to archive that URL. When using #502, the entire command only adds about a second on top of the archive time. Removing is also close to instant.
The add command still prints the hint suggesting that I load the HTML index. That should probably go away if that file is going to be out of date until I run archivebox list --html >! index.html.
@cdvv7788 commented on GitHub (Oct 9, 2020):
Good point. I will remove the warning too.
@cdvv7788 commented on GitHub (Nov 13, 2020):
This is solved on master. Please reopen if there is something still pending.