mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #184] Quiet / Minimal verbosity flag #127
Originally created by @n0ncetonic on GitHub (Mar 19, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/184
Type:
What is the problem that your feature request solves
ArchiveBox could see a performance increase by allowing users to minimize or completely disable the mostly informational/debug messages during archiving.
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
ArchiveBox is great at what it does and the multi-output makes for an extremely robust archival solution, but it is pretty slow, as I'm experiencing while archiving 4.3k URLs. I know part of this is because the URLs are not processed in parallel (which was being discussed elsewhere), but another big contributor to many command-line applications taking a performance hit is the I/O blocking from logging to the console. Terminal output of every individual phase of archiving a URL adds to the I/O footprint of ArchiveBox because it forces the terminal to wait for input, draw to the screen, update its view, refresh the screen, etc. for every message that is posted.
I'm interested in knowing if anyone has run benchmarks on ./archive while redirecting stdout to /dev/null vs standard output with no progress bar vs standard output with progress bars, to assess the possibility of introducing a flag/option that will either silence output entirely or limit archival status messages to the [+] [2019-03-19 13:46:40] ... message output when a URL is beginning to be archived.
What hacks or alternative solutions have you tried to solve the problem?
Tests could be done by running cat inputFile.txt | ./archive > /dev/null
How badly do you want this new feature?
Here are some links further detailing the need for buffered I/O when dealing with applications that are heavily impacted in performance by blocking I/O operations:
https://stackoverflow.com/questions/3857052/why-is-printing-to-stdout-so-slow-can-it-be-sped-up
https://eklitzke.org/stdout-buffering
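A rough way to test the claim above is to time the same burst of status lines against different destinations. The sketch below is only a micro-benchmark under stated assumptions: `time_writes` is a hypothetical helper, the line count is arbitrary, and it measures Python's own writes rather than anything inside ArchiveBox.

```python
import io
import sys
import time


def time_writes(stream, n_lines=50_000):
    """Return the seconds spent writing n_lines status-style lines to stream."""
    # Status prefix copied from the issue text; the rest of the line is elided there too.
    line = "[+] [2019-03-19 13:46:40] ...\n"
    start = time.perf_counter()
    for _ in range(n_lines):
        stream.write(line)
    stream.flush()
    return time.perf_counter() - start


if __name__ == "__main__":
    # Baseline: an in-memory buffer, so no terminal or file is involved.
    in_memory = time_writes(io.StringIO())
    # Whatever stdout is attached to: a terminal, a file, or /dev/null.
    to_stdout = time_writes(sys.stdout)
    # Report on stderr so the numbers survive a `> /dev/null` redirect.
    print(f"in-memory: {in_memory:.3f}s  stdout: {to_stdout:.3f}s", file=sys.stderr)
```

Running the script once attached to a terminal and once with `> /dev/null` gives two stdout timings to compare. The difference only matters if it is large relative to the seconds spent archiving each URL.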
@pirate commented on GitHub (Mar 19, 2019):
Thanks for the suggestion!
On a multi-core machine the stdout buffering is not going to be the limiting factor on ArchiveBox performance until it's running at least 2 or 3 orders of magnitude faster than right now.
If you're interested in the performance breakdown, there are several major performance tickets that are going to be cleared up in the next ~6 months before stdout buffering becomes worth investigating:
I would be open to adding a -q option for quieter output, but only if it does something more than just > /dev/null.
Maybe it could output just the status lines at the start and end, in a format suitable for logging, e.g.:
@n0ncetonic commented on GitHub (Mar 20, 2019):
So I was running into weird issues trying to pipe to /dev/null and so that experiment was put on hold. Here I'm posting some logs of runs with no progress bar or color vs my branch with some flags added to the chromium headless execution.
Current ArchiveBox master branch
Optimization flags on Chromium
master branch has an average archival time per url using all archival methods except Google Favicon of about 12 seconds with a range of 10 seconds to 13 seconds. With some added flags to help optimize the performance of headless chromium and the same archival methods I saw an average time of about 9 seconds with a range of 8 seconds to 10 seconds. Slight increase in performance of only a few seconds. At 12 seconds a request I am getting approximately 5 links archived a minute vs 6 links a minute with optimized chromium headless.
While optimized flags helped a bit I noticed the major bottleneck in archival speed was with wget and there isn't a whole lot that can be done in this regard as wget's functionality is not present in many other commandline download utilities and is almost non-existent in faster tools such as axel or aria2. I did stumble upon GNU wget2 which is a spiritual successor to GNU wget with much more modern performance focused features such as support for HTTP/2, parallel connections, and If-Modified-Since header support as well as support for processing RSS feeds. I am going to test wget2 both for performance and to determine if it can easily be dropped into the master branch with minimal changes to the overall code base.
My hope is that the bottleneck in processing links will be alleviated and will add to the performance gains expected once the project is moved to Django.
wget2 project home is https://gitlab.com/gnuwget/wget2 in case others are interested
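To put the per-URL averages reported in this comment in perspective, here is a small worked calculation. It assumes strictly sequential archiving and takes the "4.3k urls" from the original issue as exactly 4300. Both helper names are made up for illustration.

```python
def links_per_minute(seconds_per_link):
    """Throughput when links are archived one after another."""
    return 60 / seconds_per_link


def total_hours(n_links, seconds_per_link):
    """Wall-clock hours for n_links at a fixed average time per link."""
    return n_links * seconds_per_link / 3600


if __name__ == "__main__":
    # Averages quoted in the comment: about 12 s on master, about 9 s with extra Chromium flags.
    for label, secs in [("master", 12), ("optimized Chromium flags", 9)]:
        print(
            f"{label}: {links_per_minute(secs):.1f} links/min, "
            f"{total_hours(4300, secs):.2f} h for 4300 links"
        )
```

At these averages the full run drops from roughly 14.3 hours to 10.75 hours, so a 3 second saving per link adds up to several hours over a list of this size.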
@pirate commented on GitHub (Mar 20, 2019):
We're already planning on moving to wpull to replace wget in the near future, see here for more info: #177
If it ends up being IO bound and not CPU bound we can always stick an event loop in each worker and do simultaneous async archiving of multiple links on each core.
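The event-loop idea can be sketched with `asyncio`. This is only an illustration of the general pattern, not ArchiveBox's worker design: `archive_link` is a hypothetical stand-in that sleeps instead of fetching anything, and the concurrency cap is an arbitrary choice.

```python
import asyncio


async def archive_link(url, semaphore):
    """Stand-in for one I/O-bound archive job (hypothetical, not ArchiveBox code)."""
    async with semaphore:
        await asyncio.sleep(0.01)  # placeholder for network or subprocess waits
        return url, "ok"


async def archive_all(urls, max_concurrent=4):
    """Run many archive jobs on one event loop, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrent)
    tasks = [archive_link(url, semaphore) for url in urls]
    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(10)]
    for url, status in asyncio.run(archive_all(urls)):
        print(status, url)
```

The semaphore keeps a single worker from opening an unbounded number of connections at once. One such loop per worker process would match the "event loop in each worker" arrangement described above.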
@pirate commented on GitHub (Apr 10, 2021):
Going to close this for now because the speed concerns have been addressed in other ways (v0.6 is 10-100x faster than v0.5 in many operations), and much of archivebox's CLI output is now split between stderr and stdout, so if you want less verbose output you can always pipe stdout to /dev/null and just read stderr.
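For scripted use, the same redirect can be done from Python with `subprocess`. The sketch below runs a stand-in child process rather than archivebox itself, so it makes no claim about which messages archivebox sends to which stream. `run_quietly` is a hypothetical helper name.

```python
import subprocess
import sys


def run_quietly(cmd):
    """Run cmd, discard its stdout, and return (exit code, captured stderr text)."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.DEVNULL,  # same effect as `> /dev/null`
        stderr=subprocess.PIPE,
        text=True,
    )
    return proc.returncode, proc.stderr


# Stand-in child that writes to both streams. It is NOT archivebox.
CHILD = [
    sys.executable,
    "-c",
    "import sys; print('verbose detail'); print('status line', file=sys.stderr)",
]

if __name__ == "__main__":
    code, err = run_quietly(CHILD)
    print(code, repr(err))
```

Replacing `CHILD` with a real command line keeps the pattern the same: stdout is dropped and only the stderr text comes back to the caller.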
I thought about adding a SHOW_HINTS=True/False config flag to further reduce verbosity but decided against it, as it's not worth the overhead of another config option.
Instead I'm just hiding most hints once you have more than 25 snapshots in your archive, as it assumes you know how to use it by then.
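The threshold rule described in this comment amounts to a one-line check. The sketch below is a guess at its shape, not the actual ArchiveBox code: the function and constant names are hypothetical, and only the figure 25 comes from the comment.

```python
HINT_SNAPSHOT_THRESHOLD = 25  # figure taken from the comment; the name is made up


def should_show_hints(snapshot_count, threshold=HINT_SNAPSHOT_THRESHOLD):
    """Show hints only while the archive holds at most `threshold` snapshots."""
    return snapshot_count <= threshold
```

Deriving the behaviour from the archive size avoids adding another user-facing setting, which is the trade-off the comment describes.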