[GH-ISSUE #461] Slow to add new URLs and a lack of verbose logging output #304

Closed
opened 2026-03-01 14:42:17 +03:00 by kerem · 10 comments
Owner

Originally created by @pigmonkey on GitHub (Aug 31, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/461

I am running ArchiveBox v0.4.21. The archive method section of my config is this:

[ARCHIVE_METHOD_TOGGLES]
SAVE_GIT = false
SAVE_MEDIA = false
SAVE_SINGLEFILE = false
SAVE_READABILITY = False

Here is the output of adding one new URL from my latest Pinboard JSON dump:

$ archivebox add < ~/library/conf/pinboard.json
[i] [2020-08-31 19:26:48] ArchiveBox v0.4.21: archivebox add
    > /home/pigmonkey/tmp/bookmarks

[+] [2020-08-31 19:28:39] Adding 871 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1598902119-import.txt
    > Parsed 871 URLs from input (Generic JSON)
    > Found 1 new URLs not already in index

[*] [2020-08-31 19:30:23] Writing 880 links to main index...
    √ /home/pigmonkey/tmp/bookmarks/index.sqlite3
    √ /home/pigmonkey/tmp/bookmarks/index.json
    √ /home/pigmonkey/tmp/bookmarks/index.html

[▶] [2020-08-31 19:30:29] Collecting content for 1 Snapshots in archive...

[+] [2020-08-31 19:30:29] "welcome the covid influencer - Culture Study"
    https://annehelen.substack.com/p/welcome-the-covid-influencer
    > ./archive/1598900405.0
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > archive_org

[√] [2020-08-31 19:32:40] Update of 1 pages complete (2.18 min)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors

    Hint: To view your archive index, open:
        /home/pigmonkey/tmp/bookmarks/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-31 19:34:26] Writing 880 links to main index...
    √ /home/pigmonkey/tmp/bookmarks/index.sqlite3
    √ /home/pigmonkey/tmp/bookmarks/index.json
    √ /home/pigmonkey/tmp/bookmarks/index.html

In total, my shell tells me this took 7 minutes and 43 minutes to complete, which seems exceptionally long for a single URL. There are a few things to note:

  • You can tell from the timestamps of the first two log entry blocks that there is an almost two minute gap (19:26:48 to 19:28:39) between ArchiveBox starting up and actually beginning to add links. There is no indication of what ArchiveBox is doing during this time.

  • The time between the second log entry block ("Adding 871 links...") and the third ("Writing 880 links...") is another two minutes (19:28:39 to 19:30:23). Pretty much all of this time is spent parsing the 871 URLs from the input. During this time, ArchiveBox shows me a progress bar, so I do know it is doing something, unlike during the previous pause. But two minutes seems like a really long time to check which of 871 strings exist in a SQLite database. (I assume ArchiveBox has already parsed the URLs out of the input JSON at this point, since it has already printed the "Adding 871 links..." message.) Is ArchiveBox trying to do something more complicated here, or is this indicative of a problem?

  • The time between the log entry for the single new URL and the completion message is another two minutes (19:30:29 to 19:32:40). The individual archive methods all show a progress bar when they are running, and they all complete as fast as I would expect, so this is not a network issue. After the final (archive.org) archive method completes, I counted a 1 minute and 46 pause where ArchiveBox seems to just hang without giving me any indication of what it is doing.

  • The time between the last two log entries is yet another two minutes (19:32:40 to 19:34:26). What happens here is another mystery. After the hint output was printed, I counted 1 minute and 33 seconds where ArchiveBox seems to hang again, with no indication of what it is doing, before it prints the final "Writing 880 links..." output.

Overall, it seems like this process is slower than it should be. The lack of any logging output during the long pauses is exceptionally frustrating, since it makes it seem like the program has frozen and makes it difficult to debug what the problem may be. A flag to make the logging more verbose might be a good start to addressing this problem.

The slowness has kept me using the old pre-Django version of ArchiveBox, where it feels that the only performance limitation when adding new URLs is my network connection.

Originally created by @pigmonkey on GitHub (Aug 31, 2020). Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/461 I am running ArchiveBox v0.4.21. The archive method section of my config is this: ``` [ARCHIVE_METHOD_TOGGLES] SAVE_GIT = false SAVE_MEDIA = false SAVE_SINGLEFILE = false SAVE_READABILITY = False ``` Here is the output of adding one new URL from my latest Pinboard JSON dump: ``` $ archivebox add < ~/library/conf/pinboard.json [i] [2020-08-31 19:26:48] ArchiveBox v0.4.21: archivebox add > /home/pigmonkey/tmp/bookmarks [+] [2020-08-31 19:28:39] Adding 871 links to index (crawl depth=0)... > Saved verbatim input to sources/1598902119-import.txt > Parsed 871 URLs from input (Generic JSON) > Found 1 new URLs not already in index [*] [2020-08-31 19:30:23] Writing 880 links to main index... √ /home/pigmonkey/tmp/bookmarks/index.sqlite3 √ /home/pigmonkey/tmp/bookmarks/index.json √ /home/pigmonkey/tmp/bookmarks/index.html [▶] [2020-08-31 19:30:29] Collecting content for 1 Snapshots in archive... [+] [2020-08-31 19:30:29] "welcome the covid influencer - Culture Study" https://annehelen.substack.com/p/welcome-the-covid-influencer > ./archive/1598900405.0 > favicon > wget > pdf > screenshot > dom > archive_org [√] [2020-08-31 19:32:40] Update of 1 pages complete (2.18 min) - 0 links skipped - 1 links updated - 0 links had errors Hint: To view your archive index, open: /home/pigmonkey/tmp/bookmarks/index.html Or run the built-in webserver: archivebox server [*] [2020-08-31 19:34:26] Writing 880 links to main index... √ /home/pigmonkey/tmp/bookmarks/index.sqlite3 √ /home/pigmonkey/tmp/bookmarks/index.json √ /home/pigmonkey/tmp/bookmarks/index.html ``` In total, my shell tells me this took 7 minutes and 43 minutes to complete, which seems exceptionally long for a single URL. There are a few things to note: * You can tell from the timestamps of the first two log entry blocks that there is an almost two minute gap (19:26:48 to 19:28:39) between ArchiveBox starting up and actually beginning to add links. There is no indication of what ArchiveBox is doing during this time. * The time between the second log entry block ("Adding 871 links...") and the third ("Writing 880 links...") is another two minutes (19:28:39 to 19:30:23). Pretty much all of this time is spent parsing the 871 URLs from the input. During this time, ArchiveBox shows me a progress bar, so I do know it is doing something, unlike during the previous pause. But two minutes seems like a really long time to check which of 871 strings exist in a SQLite database. (I assume ArchiveBox has already parsed the URLs out of the input JSON at this point, since it has already printed the "Adding 871 links..." message.) Is ArchiveBox trying to do something more complicated here, or is this indicative of a problem? * The time between the log entry for the single new URL and the completion message is another two minutes (19:30:29 to 19:32:40). The individual archive methods all show a progress bar when they are running, and they all complete as fast as I would expect, so this is not a network issue. After the final (archive.org) archive method completes, I counted a 1 minute and 46 pause where ArchiveBox seems to just hang without giving me any indication of what it is doing. * The time between the last two log entries is yet another two minutes (19:32:40 to 19:34:26). What happens here is another mystery. After the hint output was printed, I counted 1 minute and 33 seconds where ArchiveBox seems to hang again, with no indication of what it is doing, before it prints the final "Writing 880 links..." output. Overall, it seems like this process is slower than it should be. The lack of any logging output during the long pauses is exceptionally frustrating, since it makes it seem like the program has frozen and makes it difficult to debug what the problem may be. A flag to make the logging more verbose might be a good start to addressing this problem. The slowness has kept me using the old pre-Django version of ArchiveBox, where it feels that the only performance limitation when adding new URLs is my network connection.
kerem closed this issue 2026-03-01 14:42:17 +03:00
Author
Owner

@pirate commented on GitHub (Aug 31, 2020):

This should improve dramatically with the v0.5 release that's coming out soon. v0.4 is a transition release where we do lots of slow index writing to make sure the JSON and SQL indexes match and are correct. The delays you're seeing are likely because it's actually writing the main index 9 times in your test!

I'll leave this open for now, once the next release drops you should comment back here if it's still slow for you. Out of curiosity what device re you on? And are you running it in docker or directly on the host?

<!-- gh-comment-id:684062272 --> @pirate commented on GitHub (Aug 31, 2020): This should improve dramatically with the v0.5 release that's coming out soon. v0.4 is a transition release where we do lots of slow index writing to make sure the JSON and SQL indexes match and are correct. The delays you're seeing are likely because it's actually writing the main index 9 times in your test! I'll leave this open for now, once the next release drops you should comment back here if it's still slow for you. Out of curiosity what device re you on? And are you running it in docker or directly on the host?
Author
Owner

@pigmonkey commented on GitHub (Aug 31, 2020):

Ok, I'll report back with 0.5.

This is running on a Thinkpad x260 (16GB RAM, Intel i7 6600U, SSD) in a virtualenv, not with Docker.

<!-- gh-comment-id:684066223 --> @pigmonkey commented on GitHub (Aug 31, 2020): Ok, I'll report back with 0.5. This is running on a Thinkpad x260 (16GB RAM, Intel i7 6600U, SSD) in a virtualenv, not with Docker.
Author
Owner

@cdvv7788 commented on GitHub (Oct 7, 2020):

@pigmonkey did you experience improvements with the current master branch?

<!-- gh-comment-id:705012706 --> @cdvv7788 commented on GitHub (Oct 7, 2020): @pigmonkey did you experience improvements with the current `master` branch?
Author
Owner

@pigmonkey commented on GitHub (Oct 7, 2020):

Yes, with the current master branch the speed is much improved. But it still seems to hang at the end without any indication of activity.

$ archivebox add < ~/library/conf/pinboard.json
[i] [2020-10-07 16:52:52] ArchiveBox v0.4.21: archivebox add
    > /home/pigmonkey/tmp/bookmarks

[+] [2020-10-07 16:52:52] Adding 912 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1602089572-import.txt
    > Parsed 912 URLs from input (Generic JSON)
    > Found 1 new URLs not already in index

[*] [2020-10-07 16:52:53] Writing 1 links to main index...
    √ /home/pigmonkey/tmp/bookmarks/index.sqlite3

[▶] [2020-10-07 16:52:53] Collecting content for 1 Snapshots in archive...

[+] [2020-10-07 16:52:53] "A cluster randomized clinical trial comparing fit‐tested and non‐fit‐tested N95 respirators to medical masks to prevent respiratory virus infection in health care workers"
    https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1750-2659.2011.00198.x
    > ./archive/1583349878.0
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > headers
      > archive_org

[√] [2020-10-07 16:53:08] Update of 1 pages complete (15.03 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors

    Hint: To view your archive index, open:
        /home/pigmonkey/tmp/bookmarks/index.html
    Or run the built-in webserver:
        archivebox server
    √ /home/pigmonkey/tmp/bookmarks/index.json
    √ /home/pigmonkey/tmp/bookmarks/index.html

It only takes 1 second from executing the command (16:52:52) to when it begins archiving the new URL (16:52:53), which is great.

The time between beginning to archive the new URL (16:52:53) and showing the success message (16:53:08) is taken up by the various archive methods I have enabled, all of which show progress indicators and are constrained mostly just by my network connection, so that is great.

However, after it prints the hint about opening the index file or running the server, it hangs for about 2.5 minutes without showing any indication of what it is doing. Then it prints out the path to the index JSON and HTML files. I assume that during this gap it is rebuilding or verifying the index. Some sort of message to this effect would improve the user experience.

It also might be odd for it to prompt me to view the index (via the hint) while it is still doing something to the index. That hint may be more appropriate as the final output after everything else is done.

Tangentially: I also used archivebox remove --delete ... to delete a URL from my archive. This command also seemed to hang, taking much longer than needed to rm the directory and remove the entry from the index. I assume that it is also doing some sort of rebuilding or verification of the index here, after it has deleted the URL. But again there's no output to that effect, so as a user it feels like something is wrong and I am tempted to ctrl+c it. A simple "Rebuilding index..." message whenever this occurs would go a long way.

<!-- gh-comment-id:705075136 --> @pigmonkey commented on GitHub (Oct 7, 2020): Yes, with the current master branch the speed is much improved. But it still seems to hang at the end without any indication of activity. ``` $ archivebox add < ~/library/conf/pinboard.json [i] [2020-10-07 16:52:52] ArchiveBox v0.4.21: archivebox add > /home/pigmonkey/tmp/bookmarks [+] [2020-10-07 16:52:52] Adding 912 links to index (crawl depth=0)... > Saved verbatim input to sources/1602089572-import.txt > Parsed 912 URLs from input (Generic JSON) > Found 1 new URLs not already in index [*] [2020-10-07 16:52:53] Writing 1 links to main index... √ /home/pigmonkey/tmp/bookmarks/index.sqlite3 [▶] [2020-10-07 16:52:53] Collecting content for 1 Snapshots in archive... [+] [2020-10-07 16:52:53] "A cluster randomized clinical trial comparing fit‐tested and non‐fit‐tested N95 respirators to medical masks to prevent respiratory virus infection in health care workers" https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1750-2659.2011.00198.x > ./archive/1583349878.0 > favicon > wget > pdf > screenshot > dom > headers > archive_org [√] [2020-10-07 16:53:08] Update of 1 pages complete (15.03 sec) - 0 links skipped - 1 links updated - 0 links had errors Hint: To view your archive index, open: /home/pigmonkey/tmp/bookmarks/index.html Or run the built-in webserver: archivebox server √ /home/pigmonkey/tmp/bookmarks/index.json √ /home/pigmonkey/tmp/bookmarks/index.html ``` It only takes 1 second from executing the command (16:52:52) to when it begins archiving the new URL (16:52:53), which is great. The time between beginning to archive the new URL (16:52:53) and showing the success message (16:53:08) is taken up by the various archive methods I have enabled, all of which show progress indicators and are constrained mostly just by my network connection, so that is great. However, after it prints the hint about opening the index file or running the server, it hangs for about 2.5 minutes without showing any indication of what it is doing. Then it prints out the path to the index JSON and HTML files. I assume that during this gap it is rebuilding or verifying the index. Some sort of message to this effect would improve the user experience. It also might be odd for it to prompt me to view the index (via the hint) while it is still doing something to the index. That hint may be more appropriate as the final output after everything else is done. Tangentially: I also used `archivebox remove --delete ...` to delete a URL from my archive. This command also seemed to hang, taking much longer than needed to `rm` the directory and remove the entry from the index. I assume that it is also doing some sort of rebuilding or verification of the index here, after it has deleted the URL. But again there's no output to that effect, so as a user it feels like something is wrong and I am tempted to `ctrl+c` it. A simple "Rebuilding index..." message whenever this occurs would go a long way.
Author
Owner

@cdvv7788 commented on GitHub (Oct 7, 2020):

Well, we are in the middle of a big refactor. The json and html indexes are deprecated, but not yet removed. At the end of every command, they are being rebuild, which can still cause some delays. They will be completely removed from the process in a later iteration.
I need to review the rm command, but the issue is probably the same. I will create a PR removing those bits so you can further experiment the speed improvements. It is a good idea to document that in a decently sized archive.

<!-- gh-comment-id:705077717 --> @cdvv7788 commented on GitHub (Oct 7, 2020): Well, we are in the middle of a big refactor. The json and html indexes are deprecated, but not yet removed. At the end of every command, they are being rebuild, which can still cause some delays. They will be completely removed from the process in a later iteration. I need to review the rm command, but the issue is probably the same. I will create a PR removing those bits so you can further experiment the speed improvements. It is a good idea to document that in a decently sized archive.
Author
Owner

@pigmonkey commented on GitHub (Oct 8, 2020):

Ok. I'm happy to test anything against my archive.

<!-- gh-comment-id:705266738 --> @pigmonkey commented on GitHub (Oct 8, 2020): Ok. I'm happy to test anything against my archive.
Author
Owner

@cdvv7788 commented on GitHub (Oct 8, 2020):

@pigmonkey just added #502 . Please check how your index behaves with those changes. The remove command was indeed re-creating these indexes, which is why that could have been that slow.
Expected behavior: Your cli commands should be faster overall, with no long hang ups at the beginning or the end of the process.

Let me know how it goes.

<!-- gh-comment-id:705670714 --> @cdvv7788 commented on GitHub (Oct 8, 2020): @pigmonkey just added #502 . Please check how your index behaves with those changes. The `remove` command was indeed re-creating these indexes, which is why that could have been that slow. Expected behavior: Your `cli` commands should be faster overall, with no long hang ups at the beginning or the end of the process. Let me know how it goes.
Author
Owner

@pigmonkey commented on GitHub (Oct 8, 2020):

That completely solves the problem.

When running the latest master, the add command takes an average of 3 minutes to add a single URL, despite it only taking 10-20 seconds to archive that URL. When using #502, the entire command only adds about a second on top of the archive time.

$ time archivebox add https://github.com/pirate/ArchiveBox/issues/461
[i] [2020-10-08 23:51:20] ArchiveBox v0.4.21: archivebox add https://github.com/pirate/ArchiveBox/issues/461
    > /home/pigmonkey/tmp/bookmarks

[+] [2020-10-08 23:51:20] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1602201080-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-10-08 23:51:20] Writing 1 links to main index...
    √ /home/pigmonkey/tmp/bookmarks/index.sqlite3

[▶] [2020-10-08 23:51:20] Collecting content for 1 Snapshots in archive...

[+] [2020-10-08 23:51:20] "github.com/pirate/ArchiveBox/issues/461"
    https://github.com/pirate/ArchiveBox/issues/461
    > ./archive/1602201080.524471
      > title
      > favicon
      > wget
      > pdf
      > screenshot
      > dom
      > headers
      > archive_org

[√] [2020-10-08 23:51:34] Update of 1 pages complete (13.65 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors

    Hint: To view your archive index, open:
        /home/pigmonkey/tmp/bookmarks/index.html
    Or run the built-in webserver:
        archivebox server

real    0m14.607s
user    0m5.573s
sys     0m1.419s

Removing is also close to instant.

$ time archivebox remove --delete --yes https://github.com/pirate/ArchiveBox/issues/461
[i] [2020-10-08 23:52:40] ArchiveBox v0.4.21: archivebox remove --delete --yes https://github.com/pirate/ArchiveBox/issues/461
    > /home/pigmonkey/tmp/bookmarks

[*] Finding links in the archive index matching these exact patterns:
    https://github.com/pirate/ArchiveBox/issues/461

---------------------------------------------------------------------------------------------------
timestamp        | is_archived      | num_outputs      | url
"1602201080.524471" | true             | 0                | "https://github.com/pirate/ArchiveBox/issues/461"
---------------------------------------------------------------------------------------------------

[i] Found 1 matching URLs to remove.
    1 Links will be de-listed from the main index, and their archived content folders will be deleted from disk.
    (1 data folders with 0 archived files will be deleted!)

[√] Removed 1 out of 922 links from the archive index.
    Index now contains 921 links.

real    0m0.897s
user    0m0.742s
sys     0m0.152s

The add command still prints the hint suggesting that I load the HTML index. That should probably go away if that file is going to be out of date until I run archivebox list --html >! index.html.

<!-- gh-comment-id:705888911 --> @pigmonkey commented on GitHub (Oct 8, 2020): That completely solves the problem. When running the latest master, the `add` command takes an average of 3 minutes to add a single URL, despite it only taking 10-20 seconds to archive that URL. When using #502, the entire command only adds about a second on top of the archive time. ``` $ time archivebox add https://github.com/pirate/ArchiveBox/issues/461 [i] [2020-10-08 23:51:20] ArchiveBox v0.4.21: archivebox add https://github.com/pirate/ArchiveBox/issues/461 > /home/pigmonkey/tmp/bookmarks [+] [2020-10-08 23:51:20] Adding 1 links to index (crawl depth=0)... > Saved verbatim input to sources/1602201080-import.txt > Parsed 1 URLs from input (Plain Text) > Found 1 new URLs not already in index [*] [2020-10-08 23:51:20] Writing 1 links to main index... √ /home/pigmonkey/tmp/bookmarks/index.sqlite3 [▶] [2020-10-08 23:51:20] Collecting content for 1 Snapshots in archive... [+] [2020-10-08 23:51:20] "github.com/pirate/ArchiveBox/issues/461" https://github.com/pirate/ArchiveBox/issues/461 > ./archive/1602201080.524471 > title > favicon > wget > pdf > screenshot > dom > headers > archive_org [√] [2020-10-08 23:51:34] Update of 1 pages complete (13.65 sec) - 0 links skipped - 1 links updated - 0 links had errors Hint: To view your archive index, open: /home/pigmonkey/tmp/bookmarks/index.html Or run the built-in webserver: archivebox server real 0m14.607s user 0m5.573s sys 0m1.419s ``` Removing is also close to instant. ``` $ time archivebox remove --delete --yes https://github.com/pirate/ArchiveBox/issues/461 [i] [2020-10-08 23:52:40] ArchiveBox v0.4.21: archivebox remove --delete --yes https://github.com/pirate/ArchiveBox/issues/461 > /home/pigmonkey/tmp/bookmarks [*] Finding links in the archive index matching these exact patterns: https://github.com/pirate/ArchiveBox/issues/461 --------------------------------------------------------------------------------------------------- timestamp | is_archived | num_outputs | url "1602201080.524471" | true | 0 | "https://github.com/pirate/ArchiveBox/issues/461" --------------------------------------------------------------------------------------------------- [i] Found 1 matching URLs to remove. 1 Links will be de-listed from the main index, and their archived content folders will be deleted from disk. (1 data folders with 0 archived files will be deleted!) [√] Removed 1 out of 922 links from the archive index. Index now contains 921 links. real 0m0.897s user 0m0.742s sys 0m0.152s ``` The `add` command still prints the hint suggesting that I load the HTML index. That should probably go away if that file is going to be out of date until I run `archivebox list --html >! index.html`.
Author
Owner

@cdvv7788 commented on GitHub (Oct 9, 2020):

Good point. I will remove the warning too.

<!-- gh-comment-id:706192790 --> @cdvv7788 commented on GitHub (Oct 9, 2020): Good point. I will remove the warning too.
Author
Owner

@cdvv7788 commented on GitHub (Nov 13, 2020):

This is solved on master. Please reopen if there is something still pending.

<!-- gh-comment-id:726796891 --> @cdvv7788 commented on GitHub (Nov 13, 2020): This is solved on master. Please reopen if there is something still pending.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
starred/ArchiveBox#304
No description provided.