mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #91] Architecture: Use multiple cores to run link archiving in parallel #3084
Originally created by @pirate on GitHub (Aug 30, 2018).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/91
Add a `--parallel=8` CLI option to enable using multiprocessing to download a large number of links in parallel. Default to the number of cores on the machine; allow `--parallel=1` to override it back to a single core.

@pirate commented on GitHub (Aug 30, 2018):
Inspired by https://github.com/aurelg/linkbak I've had this on my mind for a while since it's super easy to implement, but @aurelg inspired me to actually make an issue for it.
@aurelg commented on GitHub (Aug 31, 2018):
The relevant code is in this file:
This is actually pretty easy. In your case, the only difficulty might be to handle the screen output/progression bar properly: if different workers are updating the screen at the same time, it may quickly become a bit messy.
@pirate commented on GitHub (Jan 23, 2019):
I think I can fix the parallel process stdout multiplexing problem with one of two solutions:
some simplified pseudocode:
@anarcat commented on GitHub (May 6, 2019):
parallel downloading is tricky - you need to be nice to remote hosts and not hit them too much. there's a per-host limit specified in RFCs which most web browsers override. most clients do 7 or 10 requests per domain, IIRC.
but since you're crawling different URLs, you have a more complicated problem to deal with and can fetch more than 7-10 simultaneous queries globally... you need to be careful because it requires coordination among the workers as well, or have a director that does the right thing. in my feed reader, I believe I just throw about that number of threads (with `multiprocessing.Pool` and `pool.apply_async`, see the short design discussion) at all sites and hope for the best, but that's clearly inefficient.

I came here because I was looking at a bug report about running multiple `archivebox add` in parallel: I think this is broken right now (see #234 for an example failure) because the `index.json` gets corrupted or stolen between processes. it would be nice to have at least a lock file to prevent such problems from happening.

@LaserWires commented on GitHub (Sep 21, 2019):
Most site-mirroring apps incorporate downloading by proxy; it is a very common feature that could be implemented in ArchiveBox, in which case a large `--parallel` value is not a problem. A list of proxies can enable an archive operation to use a very large `--parallel` value. Most webservers today have high-bandwidth connections and very capable hardware, so ArchiveBox users overloading them isn't likely to be an issue.

@pirate commented on GitHub (Sep 22, 2019):
It's not an issue of overloading ArchiveBox, it's an issue of overloading / hitting rate limits on the content servers, which piping through a proxy won't solve.
@karlicoss commented on GitHub (Oct 25, 2020):
Even more than 8 cores might make sense, because often things are blocked on IO with no throughput, e.g. pages that would time out. Might need some careful scheduling, but it would be very cool to have!
Another IMO useful thing is having some sort of "pipeline" concurrency, e.g. one executor only archives DOM and always runs in front. The other executors run behind and handle singlepage/screenshots/media/etc, i.e. slower, but not as essential bits. This might also make it easier to schedule the load depending on which archivers are IO/CPU bound.
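The "pipeline" concurrency idea above can be sketched with stdlib threads and a queue: a fast front stage always runs ahead and hands work to slower back stages. This is a minimal illustration, assuming made-up stage names, not ArchiveBox's real extractors.

```python
# Sketch of the pipeline idea: a fast front stage (e.g. a DOM grab) feeds a
# queue consumed by a slower back stage (screenshots/media/etc.). Stage
# bodies are stand-ins; only the hand-off structure is the point.
import queue
import threading

def front_stage(urls, handoff):
    for url in urls:
        dom = f"<html>{url}</html>"        # stand-in for the cheap DOM archive
        handoff.put((url, dom))
    handoff.put(None)                       # sentinel: no more work

def back_stage(handoff, results):
    while True:
        item = handoff.get()
        if item is None:
            break
        url, dom = item
        results.append((url, len(dom)))     # stand-in for slow screenshot/media work

urls = ["https://example.com/1", "https://example.com/2"]
handoff, results = queue.Queue(), []
t1 = threading.Thread(target=front_stage, args=(urls, handoff))
t2 = threading.Thread(target=back_stage, args=(handoff, results))
t1.start(); t2.start(); t1.join(); t2.join()
```

Because the front stage never waits on the back stage, the essential DOM snapshots would always land first even when the heavier extractors fall behind.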
@pirate commented on GitHub (Dec 10, 2020):
A quick update for everyone watching this, v0.5.0 is going to be released soon with improvements to how ArchiveResults are stored (we moved them into the SqliteDB). This was a necessary blocker to fix before we can get around to parallel archiving in the next version.
v0.5.0 will be faster, but it won't have built-in concurrent archiving support yet; that will be the primary focus for v0.6.0. The plan is to add a background task queue handler like dramatiq, or more likely huey (because it has sqlite3 support, so we don't need to run redis).
Once we have the background task worker system in place, we can implement a worker pool for Chrome/playwright and each of the other extractor methods. Then archiving can run in parallel by default, archiving like 5-10 sites at a time depending on the system resources available and how well the worker pool system performs for each extractor type. Huey and dramatiq both have built-in rate limiting systems that will allow us to cap the number of concurrent requests going to each site or being handled by each extractor. There's still quite a bit of work left, but we're getting closer!
Having a background task system will also enable us to do many other cool things, like building the scheduled import system into the UI #578, using a single shared chrome process instead of relaunching chrome for each link, and many other small improvements to performance.
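The per-site concurrency cap mentioned above (rate-limiting requests going to each site) can be approximated with a per-domain semaphore, shown here as a stdlib stand-in for what a huey/dramatiq rate limit would enforce. All names are illustrative assumptions, not ArchiveBox code.

```python
# Stdlib stand-in for a rate-limited worker pool: cap how many concurrent
# fetches may hit each domain, regardless of total pool size.
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

MAX_PER_DOMAIN = 2
# One semaphore per domain; for a sketch we accept the tiny race on first
# creation (a real queue backend would manage this centrally).
domain_slots = defaultdict(lambda: threading.Semaphore(MAX_PER_DOMAIN))

def archive(url):
    domain = urlparse(url).netloc
    with domain_slots[domain]:              # blocks if MAX_PER_DOMAIN in flight
        # ... real extractor work (wget, chrome, etc.) would run here ...
        return (domain, url)

urls = [f"https://example.com/{i}" for i in range(5)]
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(archive, urls))
```

Even with 8 pool workers, at most 2 fetches touch `example.com` at any moment, which is the politeness property @anarcat raised earlier in the thread.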
@pirate commented on GitHub (Apr 12, 2021):
With v0.6 released now we've taken another step towards the goal of using a message-passing architecture to fully support parallel archiving. v0.6 moves that last bit of ArchiveResult state into the SQLite3 db where it can be managed with migrations and kept ACID compliant.
The next step of the process is to implement a worker queue for DB writes, and have all writes made to Snapshot/ArchiveResult models processed in a single thread, opening up other threads to be able to do things in parallel without locking the db anymore. Message passing is a big change though, so expect it to come in increments, with about 3~6 months of work to go depending on how much free time I have for ArchiveBox.
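The single-writer pattern described in the paragraph above can be sketched in a few lines: every DB write goes onto a queue that one dedicated thread drains, so parallel archiving threads never contend for the SQLite write lock. Table and column names here are hypothetical, not the real ArchiveBox schema.

```python
# Sketch of a single-writer DB thread: all writes funnel through one queue
# drained by the only connection that ever writes. Schema is made up.
import os
import queue
import sqlite3
import tempfile
import threading

db_path = os.path.join(tempfile.mkdtemp(), "index.sqlite3")
writes = queue.Queue()

def writer_loop(path):
    db = sqlite3.connect(path)              # the sole writing connection
    db.execute("CREATE TABLE IF NOT EXISTS snapshot (url TEXT)")
    while True:
        url = writes.get()
        if url is None:                     # sentinel: shut the writer down
            break
        db.execute("INSERT INTO snapshot (url) VALUES (?)", (url,))
        db.commit()
    db.close()

writer = threading.Thread(target=writer_loop, args=(db_path,))
writer.start()
for url in ("https://example.com/a", "https://example.com/b"):
    writes.put(url)                         # any worker thread can enqueue freely
writes.put(None)
writer.join()
```

Readers can keep their own connections; only writes are serialized, which is what avoids the `database locked` contention mentioned later in the thread.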
Side note: the UX of v0.6 is >10x faster in many other ways though (web UI, indexing, management tasks, etc.); only archiving itself remains to be sped up now. You can also still attempt to run `archivebox add` commands in parallel: it's safe and already speeds up archiving a lot, but you may encounter occasional `database locked` warnings that mean you have to restart stuck additions manually.

@runkaiz commented on GitHub (May 3, 2021):
Sorry, quick question: I run ArchiveBox in a Docker container; currently, would allocating more than one CPU core or thread to it have any performance gains?
@pirate commented on GitHub (May 7, 2021):
@1105420698, allocating more than 1 CPU is definitely still advised, as Django will use all available cores to handle incoming requests in parallel, and a few of the extractors already take advantage of multiple cores to render pages faster (e.g. Chrome).
ArchiveBox is already fairly multicore-capable (e.g. you can run multiple `add` or `update` threads at the same time); it's just a few remaining edge cases and highly-parallel write scenarios that will be improved by the pending message queue refactoring work.

@pirate commented on GitHub (Jun 30, 2021):
Ok I'm pretty set on using Huey at this point for the job scheduler, it can use SQLite, it comes with a great django admin dashboard, and it supports nested tasks and mutexes.
https://github.com/boxine/django-huey-monitor/#screenshots
Here's the approach I'm thinking of to massage all critical operations into a message-passing / queue / worker arrangement in rough pseudocode:
`archivebox add --depth=1 'https://example.com/feed.xml'` leads to these tasks being triggered and handled in this order ->

WIP, ignore this, just toying around with different patterns / styles to find something with good ergonomics:
I'm worried that the heavy reliance on mutexes and locking will lead to difficult-to-debug deadlock scenarios where parents spawn children that eat up all the worker slots, then are unable to complete, causing the parent to time out and force-kill those workers prematurely.
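The nested parent/child task shape under discussion (an `add` fanning out into per-URL jobs, which fan out into per-extractor jobs) can be illustrated with a plain-Python stand-in. The WIP pseudocode from the thread isn't reproduced here; every name below is a hypothetical placeholder for a would-be huey task.

```python
# Plain-Python stand-in for the nested task fan-out: add -> snapshots ->
# extractor jobs. Each function body fakes what a queued task would do.
def parse_feed(url):
    # stand-in for fetching/parsing feed.xml when --depth=1 is given
    return [f"{url}/entry/{i}" for i in (1, 2)]

def run_extractors(url):
    # stand-in for enqueueing one job per extractor (one ArchiveResult each)
    return [(extractor, url) for extractor in ("dom", "screenshot")]

def add(url, depth=0):
    tasks = []
    urls = [url] + (parse_feed(url) if depth > 0 else [])
    for u in urls:
        tasks.extend(run_extractors(u))
    return tasks

tasks = add("https://example.com/feed.xml", depth=1)
```

The deadlock risk described above shows up exactly here: if `add` holds worker slots while waiting on its `run_extractors` children, a full pool can leave the children with nowhere to run, which is why queue systems usually make parents enqueue children and return rather than block on them.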
I also reached out to the folks who are building `django-huey-monitor`, as it looks like a great fit for our job-handling UI: https://github.com/boxine/django-huey-monitor/issues/40

@jgoerzen commented on GitHub (Jul 5, 2021):
Over in #781 it was stated that parallel adds don't work yet. Over at https://github.com/ArchiveBox/ArchiveBox/wiki/Usage#large-archives there is an example of doing this that should probably be removed until this is fixed.
@pirate commented on GitHub (Jul 6, 2021):
It works better in some cases (fast SSDs) than others, so it's still worth trying; it shouldn't be dangerous to data integrity, it'll just lock up if it's on a slow filesystem. I added a note to the Usage page.
@pirate commented on GitHub (Apr 12, 2022):
Note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting
Contributions/suggestions welcome there.