[GH-ISSUE #296] Bugfix: hangs when importing .csv, hogs memory #1725

Closed
opened 2026-03-01 17:53:10 +03:00 by kerem · 3 comments
Owner

Originally created by @sandriaas on GitHub (Nov 19, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/296

Describe the bug

Importing a .csv causes a hang and hogs memory. My RAM is 8 GB, swap 16 GB.
The file is around 30 MB with about 190k entries. When I look in the task manager, it seems that ArchiveBox is trying to archive the whole list in one batch?

Steps to reproduce

  1. Ran `docker-compose exec archivebox /bin/archive /data/sources/history.csv`
  2. Saw this output during archiving: "parsing new links from output/sources/history.csv..."
  3. The UI didn't show what I was expecting; it got stuck on only the first output and hangs a lot, with my RAM usage at 95%+ and swap at 30%+

Screenshots or log output

![MVIMG_20191120_054615](https://user-images.githubusercontent.com/21377475/69193411-3062d600-0b59-11ea-9e8d-51bb22d3cece.jpg)

Software versions

  • OS: manjaro 18.1.3 xfce
  • ArchiveBox version: 94dba22
  • Python version: 3.7.4
  • Chrome version: 80.0.0.3964
Author
Owner

@sandriaas commented on GitHub (Nov 19, 2019):

Btw i forgot to capture the task manager with VSZ column. I already restarted my laptop, sorry

Author
Owner

@pirate commented on GitHub (Dec 6, 2019):

190k is by far the biggest attempted import list I've seen so far, the most that's been tested before was 60k and it took a powerful machine to complete. Unfortunately ArchiveBox is not currently designed to ingest that many things at once and performance is low on the list of priorities (see the Roadmap wiki).

For now I suggest splitting your list into separate files and importing them one-by-one, each with ~1000 links.

```shell-session
$ cd ~/path-to-archivebox/data/sources
$ cp ~/Desktop/big_import_list.txt ./
$ split big_import_list.txt links_chunk_
$ for chunk in links_chunk_*; do docker-compose exec archivebox /bin/archive /data/sources/$chunk; done
# this runs one import per chunk; duplicates are automatically skipped, so it's safe to re-run if cancelled partway through
```
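As a side note, GNU `split` defaults to 1000 lines per output file, which is why the snippet above lands at roughly the ~1000-link chunk size suggested. A minimal sketch of the same approach with an explicit `-l` line count, using a generated placeholder list instead of a real import file (the `echo` stands in for the actual `docker-compose` command):

```shell
#!/bin/sh
# Build a demo URL list (stand-in for big_import_list.txt) with 2500 entries
seq 1 2500 | sed 's|^|https://example.com/page/|' > big_import_list.txt

# Split into chunks of at most 1000 lines each,
# producing links_chunk_aa, links_chunk_ab, links_chunk_ac
split -l 1000 big_import_list.txt links_chunk_

# One import per chunk; echo here instead of the real command
for chunk in links_chunk_*; do
    echo "would run: docker-compose exec archivebox /bin/archive /data/sources/$chunk"
done
```

Adjusting `-l` lets you trade fewer, larger imports against lower peak memory per run.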
Author
Owner

@pirate commented on GitHub (Dec 6, 2019):

I'm going to close this for now because I don't foresee being able to handle a list this large until at least v0.5 (which is still several versions away).

Performance improvements will come slowly as a result of natural refactoring over time, not as one big PR or issue. You can follow multicore import support here: https://github.com/pirate/ArchiveBox/issues/91, that's the next major performance improvement planned.
