[GH-ISSUE #184] Quiet / Minimal verbosity flag #1638

Closed
opened 2026-03-01 17:52:24 +03:00 by kerem · 4 comments

Originally created by @n0ncetonic on GitHub (Mar 19, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/184

Type:

  • [ ] General Question or Discussion
  • [ ] Propose a brand new feature
  • [x] Request modification of existing behavior or design

What is the problem that your feature request solves
ArchiveBox could see a performance increase by allowing users to minimize or completely disable the mostly informational/debug messages during archiving.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

ArchiveBox is great at what it does, and the multi-output approach makes for an extremely robust archival solution, but it is pretty slow, as I'm finding while archiving 4.3k URLs. I know part of this is because the URLs are not processed in parallel (which was being discussed elsewhere), but another big contributor to many command-line applications taking a performance hit is blocking I/O from logging to the console. Terminal output for every individual phase of archiving a URL adds to the I/O footprint of ArchiveBox because it forces the terminal to wait for input, draw to the screen, update its view, refresh the screen, etc. for every message that is posted.

I'm interested in knowing if anyone has run benchmarks on ./archive comparing three cases: redirecting stdout to /dev/null, standard output with no progress bar, and standard output with progress bars. The results would help assess whether to introduce a flag/option that either silences output entirely or limits archival status messages to the [+] [2019-03-19 13:46:40] ... message printed when a URL begins to be archived.

What hacks or alternative solutions have you tried to solve the problem?
Tests could be done by running cat inputFile.txt | ./archive > /dev/null
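
A comparison like the one proposed above could be sketched in Python as follows. This is a minimal illustration, not part of ArchiveBox: the function names `time_writes` and `compare_sinks` and the sample status line are assumptions made for the example, and it only measures the cost of writing lines, not a real archive run.

```python
import io
import os
import sys
import time


def time_writes(stream, n_lines=10_000):
    """Return the seconds spent writing n_lines status lines to stream."""
    # Sample line shaped like the status message quoted in this issue.
    line = "[+] [2019-03-19 13:46:40] archiving https://example.com/page\n"
    start = time.perf_counter()
    for _ in range(n_lines):
        stream.write(line)
    stream.flush()
    return time.perf_counter() - start


def compare_sinks(n_lines=10_000):
    """Time the same writes against memory, /dev/null, and the current stdout."""
    with open(os.devnull, "w") as devnull:
        return {
            "memory": time_writes(io.StringIO(), n_lines),
            "devnull": time_writes(devnull, n_lines),
            "stdout": time_writes(sys.stdout, n_lines),
        }
```

Calling `compare_sinks()` once with stdout attached to a terminal and once with stdout redirected to a file would give a rough idea of how much of the difference comes from terminal rendering. Printing the returned timings to stderr keeps them out of the stream being measured.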

How badly do you want this new feature?

  • [ ] It's an urgent deal-breaker, I can't live without it
  • [ ] It's important to add it in the near-mid term future
  • [ ] It would be nice to have eventually
  • [x] I'm willing to contribute to development

Here are some links further detailing the need for buffered I/O when dealing with applications that are heavily impacted in performance by blocking I/O operations:

https://stackoverflow.com/questions/3857052/why-is-printing-to-stdout-so-slow-can-it-be-sped-up
https://eklitzke.org/stdout-buffering


@pirate commented on GitHub (Mar 19, 2019):

Thanks for the suggestion!

On a multi-core machine the stdout buffering is not going to be the limiting factor on ArchiveBox performance until it's running at least 2 or 3 orders of magnitude faster than right now.

If you're interested in the performance breakdown, there are several major performance tickets that are going to be cleared up in the next ~6 months before stdout buffering becomes worth investigating:

  • an entire instance of Chrome is launched and killed 3 times for every link, this will be fixed by moving to pyppeteer: #177 (the current design is so ridiculously inefficient, I'm amazed that no one has opened a ticket complaining about it yet)
  • we create child processes to call out to wget, youtube-dl, and curl multiple times for each link, this will also be fixed by moving to pure python versions: #177
  • we load and rewrite the entire main index file on every link as a hack to get semi-realtime index UI updates during the archive process
  • it's single-threaded. pyppeteer will fix this as well by allowing us to create 1 headless browser per core and archive up to n links at a time
  • it uses static HTML and JSON files to store the data instead of SQLite with indexes, so everything constantly has to be read, parsed, iterated+filtered, and dumped back to disk, this will be fixed by: #57
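
To illustrate the last point in the list above, here is a minimal sketch of an indexed SQLite store using Python's standard `sqlite3` module. The table name, columns, and helper functions are hypothetical and are not ArchiveBox's actual schema; the point is only that recording one link touches one row instead of re-reading and rewriting a whole index file.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a hypothetical link index; the schema is illustrative."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS links ("
        " url TEXT PRIMARY KEY,"
        " timestamp TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_links_status ON links (status)")
    return conn


def record_link(conn, url, timestamp, status):
    """Insert or replace one row; no other rows are read or rewritten."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO links (url, timestamp, status) VALUES (?, ?, ?)",
            (url, timestamp, status),
        )


def count_by_status(conn, status):
    """Filter in the database instead of parsing and scanning a JSON file."""
    row = conn.execute(
        "SELECT COUNT(*) FROM links WHERE status = ?", (status,)
    ).fetchone()
    return row[0]
```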

I would be open to adding a -q option for quieter output, but only if it does something more than just > /dev/null.
Maybe it could output just the status lines at the start and end, in a format suitable for logging, e.g.:

[2019-03-12 07:24:53] ArchiveBox started. Importing 30 new links from output/sources/sharli-example.txt (Parsed as Plain Text)...
[2019-03-12 07:24:53] ArchiveBox finished. Imported 30 new links: 2 failed, 28 succeeded. Saved index (2,945 links) to output/index.html.
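
One way the quiet mode described above might work is sketched below. The class `StatusLog` and its `summary`/`detail` methods are hypothetical names invented for this example, not ArchiveBox code; the sketch only shows the idea of always printing timestamped start/finish lines while dropping per-link detail when quiet.

```python
import sys
from datetime import datetime


class StatusLog:
    """Hypothetical logger: in quiet mode only summary lines are printed."""

    def __init__(self, quiet=False, stream=None):
        self.quiet = quiet
        self.stream = stream if stream is not None else sys.stdout

    def _emit(self, message):
        # Prefix every line with a timestamp in the format shown above.
        stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        self.stream.write(f"[{stamp}] {message}\n")

    def summary(self, message):
        # Start/finish lines are always written, even with quiet=True.
        self._emit(message)

    def detail(self, message):
        # Per-link progress lines are dropped in quiet mode.
        if not self.quiet:
            self._emit(message)


if __name__ == "__main__":
    log = StatusLog(quiet=True)
    log.summary("ArchiveBox started. Importing 30 new links...")
    log.detail("archiving link 1 of 30")  # suppressed because quiet=True
    log.summary("ArchiveBox finished. Imported 30 new links.")
```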

@n0ncetonic commented on GitHub (Mar 20, 2019):

So I was running into weird issues trying to pipe to /dev/null and so that experiment was put on hold. Here I'm posting some logs of runs with no progress bar or color vs my branch with some flags added to the chromium headless execution.

Current ArchiveBox master branch

[+] [2019-03-19 10:04:15] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourUserInterface/InternationalizingYourUserInterface.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourUserInterface/InternationalizingYourUserInterface.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1280 (new)
      <snip>
[+] [2019-03-19 10:04:34] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingYourCode/InternationalizingYourCode.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1281 (new)
    <snip>
[+] [2019-03-19 10:04:49] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingLocaleData/InternationalizingLocaleData.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/InternationalizingLocaleData/InternationalizingLocaleData.html
   <snip>
[+] [2019-03-19 10:05:05] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/Glossary/Glossary.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPInternational/Glossary/Glossary.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1283 (new)
  <snip>
[+] [2019-03-19 10:05:18] "https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPFrameworks/Tasks/InstallingFrameworks.html"
    https://developer.apple.com/library/archive/documentation/MacOSX/Conceptual/BPFrameworks/Tasks/InstallingFrameworks.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1284 (new)

Optimization flags on Chromium

[+] [2019-03-19 15:56:07] "https://developer.apple.com/library/archive/documentation/General/Conceptual/MOSXAppProgrammingGuide/AppRuntime/AppRuntime.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/MOSXAppProgrammingGuide/AppRuntime/AppRuntime.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1909 (new)
<snip>
[+] [2019-03-19 15:56:16] "https://developer.apple.com/library/archive/documentation/General/Conceptual/GameplayKit_Guide/index.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/GameplayKit_Guide/index.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1910 (new)
<snip>
[+] [2019-03-19 15:56:25] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ExtensibilityPG/index.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ExtensibilityPG/index.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1911 (new)
<snip>
[+] [2019-03-19 15:56:34] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/ThreadMigration/ThreadMigration.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/ThreadMigration/ThreadMigration.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1912 (new)
<snip>
[+] [2019-03-19 15:56:43] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/RevisionHistory.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/RevisionHistory.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1913 (new)
 <snip>
[+] [2019-03-19 15:56:52] "https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html"
    https://developer.apple.com/library/archive/documentation/General/Conceptual/ConcurrencyProgrammingGuide/OperationQueues/OperationQueues.html
    > /Volumes/home/Archive/ArchiveBox/archive/1552993745.1914 (new)

The master branch has an average archival time per URL, using all archival methods except Google Favicon, of about 12 seconds, with a range of 10 to 13 seconds. With some added flags to help optimize the performance of headless Chromium and the same archival methods, I saw an average time of about 9 seconds, with a range of 8 to 10 seconds. That is a slight improvement of only a few seconds. At 12 seconds per link I am getting approximately 5 links archived a minute vs 6 links a minute with optimized headless Chromium.

While the optimized flags helped a bit, I noticed the major bottleneck in archival speed was wget, and there isn't a whole lot that can be done in this regard, as wget's functionality is not present in many other command-line download utilities and is almost non-existent in faster tools such as axel or aria2. I did stumble upon GNU wget2, a spiritual successor to GNU wget with more modern, performance-focused features such as support for HTTP/2, parallel connections, and If-Modified-Since headers, as well as support for processing RSS feeds. I am going to test wget2 both for performance and to determine whether it can be dropped into the master branch with minimal changes to the overall code base.
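
The throughput figures above follow from simple arithmetic, which can be checked with a short sketch. The helper names are made up for illustration, and the 4,300-link total is the "4.3k urls" figure from the original post, assuming every link takes the average time.

```python
def links_per_minute(seconds_per_link):
    """Convert an average per-link time into links archived per minute."""
    return 60.0 / seconds_per_link


def total_hours(num_links, seconds_per_link):
    """Estimate the total run time in hours for a serial archive run."""
    return num_links * seconds_per_link / 3600.0


# Averages reported above: about 12 s per link on master, about 9 s with the flags.
master_rate = links_per_minute(12)       # 5 links per minute
optimized_rate = links_per_minute(9)     # roughly 6.7 links per minute
master_total = total_hours(4300, 12)     # roughly 14.3 hours
optimized_total = total_hours(4300, 9)   # 10.75 hours
```

Under these assumptions the flags would save about three and a half hours on a 4.3k-link run, which still leaves the run serial and slow.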

My hope is that the bottleneck in processing links will be alleviated and will add to the performance gains expected once the project is moved to Django.

wget2 project home is https://gitlab.com/gnuwget/wget2 in case others are interested


@pirate commented on GitHub (Mar 20, 2019):

We're already planning on moving to wpull to replace wget in the near future, see here for more info: #177

If it ends up being IO bound and not CPU bound we can always stick an event loop in each worker and do simultaneous async archiving of multiple links on each core.
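
A minimal sketch of what an event loop inside one worker could look like, using Python's standard `asyncio` module. The function `archive_link` is a stand-in that only sleeps to simulate network wait, and `archive_all` and `max_concurrent` are hypothetical names; none of this is ArchiveBox's actual design.

```python
import asyncio


async def archive_link(url, delay=0.01):
    """Stand-in for an I/O-bound archive step; the sleep simulates network wait."""
    await asyncio.sleep(delay)
    return url, "succeeded"


async def archive_all(urls, max_concurrent=4):
    """Archive many links concurrently, at most max_concurrent in flight at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(url):
        # The semaphore caps how many links this worker handles at the same time.
        async with semaphore:
            return await archive_link(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

A worker would call `asyncio.run(archive_all(urls))` on its share of the links. This only helps if the work is I/O bound, as the comment above notes; CPU-bound steps would still need separate processes.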


@pirate commented on GitHub (Apr 10, 2021):

Going to close this for now because the speed concerns have been addressed in other ways (v0.6 is 10-100x faster than v0.5 in many operations), and much of archivebox's CLI output is now split between stderr and stdout, so if you want less verbose output you can always pipe stdout to /dev/null and just read stderr.
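
From a script, the same stdout/stderr split can be used with Python's standard `subprocess` module. The sketch below runs a stand-in child process rather than archivebox itself, since which messages land on which stream is not spelled out here; `run_quietly` and `CHILD_CODE` are names made up for the example.

```python
import subprocess
import sys

# Stand-in child process that writes one line to each stream; it is NOT archivebox.
CHILD_CODE = (
    "import sys\n"
    "print('verbose detail on stdout')\n"
    "print('status line on stderr', file=sys.stderr)\n"
)


def run_quietly(command):
    """Run command, discard its stdout, and return what it wrote to stderr."""
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,  # same effect as `> /dev/null` in a shell
        stderr=subprocess.PIPE,     # capture stderr so the caller can read it
        text=True,
        check=True,
    )
    return completed.stderr


if __name__ == "__main__":
    print(run_quietly([sys.executable, "-c", CHILD_CODE]), end="")
```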

I thought about adding a SHOW_HINTS=True/False config flag to further reduce verbosity but decided against it, as it's not worth the overhead of another config option.
Instead I'm just hiding most hints once you have more than 25 snapshots in your archive, as it assumes you know how to use it by then.
