mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #296] Bugfix: hangs when importing .csv, hogs memory #3236
Originally created by @sandriaas on GitHub (Nov 19, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/296
Describe the bug
Importing a .csv file causes ArchiveBox to hang and hog memory. My machine has 8 GB of RAM and 16 GB of swap.
The file is around 30 MB and contains about 190k entries. When I look in the task manager, it seems that ArchiveBox is trying to archive the whole list in one batch?
Steps to reproduce
Screenshots or log output
Software versions
94dba22
@sandriaas commented on GitHub (Nov 19, 2019):
By the way, I forgot to capture the task manager with the VSZ column. I've already restarted my laptop, sorry.
@pirate commented on GitHub (Dec 6, 2019):
190k is by far the biggest import list I've seen anyone attempt so far; the most that's been tested before was 60k, and it took a powerful machine to complete. Unfortunately, ArchiveBox is not currently designed to ingest that many things at once, and performance is low on the list of priorities (see the Roadmap wiki).
For now I suggest splitting up your list into separate files of ~1000 links each and importing them one by one.
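The splitting step could be scripted roughly as follows. This is only a minimal sketch, not part of ArchiveBox: the function name `split_file`, the output naming scheme, and the assumption that the .csv has one link per line with no header row and no multi-line quoted fields are all illustrative.

```python
from itertools import islice
from pathlib import Path


def split_file(src, out_dir, chunk_size=1000):
    """Split `src` into numbered files of at most `chunk_size` lines each.

    Hypothetical helper: assumes one link per line and no header row.
    Returns the list of chunk paths in the order they were written.
    """
    src = Path(src)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    with src.open(encoding="utf-8") as lines:
        part = 0
        while True:
            # Read the next chunk lazily so the whole file is never in memory.
            chunk = list(islice(lines, chunk_size))
            if not chunk:
                break
            part += 1
            dest = out_dir / f"{src.stem}_{part:04d}{src.suffix}"
            dest.write_text("".join(chunk), encoding="utf-8")
            written.append(dest)
    return written
```

Calling `split_file("links.csv", "chunks")` (both names are placeholders) would write `chunks/links_0001.csv`, `chunks/links_0002.csv`, and so on, which can then be imported one file at a time.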
@pirate commented on GitHub (Dec 6, 2019):
I'm going to close this for now because I don't foresee being able to handle a list this large until at least v0.5 (which is still several versions away).
Performance improvements will come slowly as a result of natural refactoring over time, not as one big PR or issue. You can follow multicore import support here: https://github.com/pirate/ArchiveBox/issues/91, that's the next major performance improvement planned.