[GH-ISSUE #234] Architecture: Concurrent runs accidentally delete each other's temp files, leaving the index broken #1671

Closed
opened 2026-03-01 17:52:43 +03:00 by kerem · 4 comments

Originally created by @anarcat on GitHub (May 6, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/234

Describe the bug

As part of my ridiculously large archiving attempt (partly documented in #233), I have done a first batch of URL imports with the first 100 URLs found. For a reason I can't explain (maybe because I ran two archivebox add commands in parallel?), that eventually crashed with:

FileNotFoundError: [Errno 2] No such file or directory: '/srv/backup/archive/archivebox/index.json.tmp' -> '/srv/backup/archive/archivebox/index.json'

No problem, I thought - I can resume! So I did that with

archivebox add --update-all

But that crashed as well, with:

TypeError: save_file_to_sources(..., path: str) got unexpected NoneType argument path=None

I suspect this is because --update-all actually expects a list of URLs to be passed, but the usage doesn't make that clear and we shouldn't be crashing there.

Steps to reproduce

  1. call archivebox add --update-all with no other URLs

Screenshots or log output

First, the original crash, not the subject of this bug report:

[...]
[+] [2019-05-06 21:39:14] "www.hjdskes.nl/projects/cage"                                                                                                           
    https://www.hjdskes.nl/projects/cage/                                                                                                                          
    > ./archive/1557178364.10                                                                                                                                      
      > title                                                                                                                                                      
      > favicon                                                                                                                                                    
      > wget                                                                                                                                                       
        Failed:                                                                                                                                                    
            TimeoutExpired Command 'wget' timed out after 60 seconds                                                                                               
        Run to see full output:                                                                                                                                    
            cd /srv/backup/archive/archivebox/archive/1557178364.10;                                                                                               
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --warc-file=warc/1557178755 --page-requisites "--user-agent=ArchiveBox/0.4.1 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto https://www.hjdskes.nl/projects/cage/                                                                                                                   
                                                                                                                                                                   
      > pdf                                                                                                                                                        
      > screenshot                                                                                                                                                 
      > dom                                                                                                                                                        
      > media                                                                                                                                                      
      > archive_org                                                                                                                                                
    ! Failed to archive link: FileNotFoundError: [Errno 2] No such file or directory: '/srv/backup/archive/archivebox/index.json.tmp' -> '/srv/backup/archive/archivebox/index.json'                                                                                                                                                      
                                                                                                                                                                   
Traceback (most recent call last):                                                                                                                                 
  File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 10, in <module>                                                                                
    sys.exit(main())                                                                                                                                               
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/__main__.py", line 10, in main                                                
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)                                                                                                            
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox.py", line 58, in main                                          
    pwd=pwd or OUTPUT_DIR,                                                                                                                                         
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand                                  
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore                                                                                      
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox_add.py", line 55, in main                                      
    out_dir=pwd or OUTPUT_DIR,                                                                                                                                     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/main.py", line 521, in add                                                    
    archive_link(link, out_dir=link.link_dir)                                                                                                                      
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/extractors/__init__.py", line 85, in archive_link                             
    patch_main_index(link)                                                                                                                                         
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/__init__.py", line 323, in patch_main_index                             
    write_json_main_index(patched_links, out_dir=out_dir)                                                                                                          
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/index/json.py", line 77, in write_json_main_index                             
    atomic_write(main_index_json, os.path.join(out_dir, JSON_INDEX_FILENAME))                                                                                      
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/system.py", line 79, in atomic_write                                          
    os.rename(tmp_file, path)
FileNotFoundError: [Errno 2] No such file or directory: '/srv/backup/archive/archivebox/index.json.tmp' -> '/srv/backup/archive/archivebox/index.json'             

Re-adding the list does nothing:

[1]anarcat@curie:archivebox(master)$ archivebox add wallabag-p1.list                                                                                               
    > ./sources/wallabag-p1.list-1557179130.txt                                                                                                                    
                                                                                                                                                                   
[*] [2019-05-06 21:45:30] Parsing new links from output/sources/wallabag-p1.list-1557179130.txt...                                                                 
    > Parsed 100 links as Plain Text (0 new links added)                                                                                                           
                                                                                                                                                                   
[*] [2019-05-06 21:45:30] Writing 101 links to main index...                                                                                                       
    √ /srv/backup/archive/archivebox/index.sqlite3                                                                                                                 
    √ /srv/backup/archive/archivebox/index.json                                                                                                                    
    √ /srv/backup/archive/archivebox/index.html                                                                                                                    
                                                                                                                                                                   
[▶] [2019-05-06 21:45:31] Updating content for 0 matching pages in archive...                                                                                      
                                                                                                                                                                   
[√] [2019-05-06 21:45:31] Update of 0 pages complete (0.00 sec)                                                                                                    
    - 0 links skipped                                                                                                                                              
    - 0 links updated                                                                                                                                              
    - 0 links had errors                                                                                                                                           
                                                                                                                                                                   
    To view your archive, open:                                                                                                                                    
        /srv/backup/archive/archivebox/index.html                                                                                                                  
    Or run the built-in webserver:                                                                                                                                 
        archivebox server                                                                                                                                          
                                                                                                                                                                   
[*] [2019-05-06 21:45:31] Writing 101 links to main index...                                                                                                       
    √ /srv/backup/archive/archivebox/index.sqlite3                                                                                                                 
    √ /srv/backup/archive/archivebox/index.json                                                                                                                    
    √ /srv/backup/archive/archivebox/index.html                                                                                                                    

Looking at -h, I noticed --update-all, so I tried that:

anarcat@curie:archivebox(master)$ archivebox add --update-all                                                                                                      
Traceback (most recent call last):                                                                                                                                 
  File "/home/anarcat/.virtualenvs/archivebox/bin/archivebox", line 10, in <module>                                                                                
    sys.exit(main())                                                                                                                                               
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/__main__.py", line 10, in main                                                
    archivebox.main(args=sys.argv[1:], stdin=sys.stdin)                                                                                                            
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox.py", line 58, in main                                          
    pwd=pwd or OUTPUT_DIR,                                                                                                                                         
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 55, in run_subcommand                                  
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore                                                                                      
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/cli/archivebox_add.py", line 55, in main                                      
    out_dir=pwd or OUTPUT_DIR,                                                                                                                                     
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 104, in typechecked_function                                   
    return func(*args, **kwargs)                                                                                                                                   
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/main.py", line 496, in add                                                    
    import_path = save_file_to_sources(import_path, out_dir=out_dir)                                                                                               
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 98, in typechecked_function                                    
    check_argument_type(arg_key, arg_val)                                                                                                                          
  File "/home/anarcat/.virtualenvs/archivebox/lib/python3.7/site-packages/archivebox/util.py", line 92, in check_argument_type                                     
    str(arg_val)[:64],                                                                                                                                             
TypeError: save_file_to_sources(..., path: str) got unexpected NoneType argument path=None   

The correct call is of course to retry with the same URLs:

anarcat@curie:archivebox(master)$ archivebox add --update-all wallabag-p1.list

which works, but it would be nice to (a) not crash when --update-all is passed without an argument (maybe just fail more politely during argument parsing) and (b) eventually just do the right thing, which is probably to retry any failed URL from the database.
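
To illustrate suggestion (a), here is a minimal sketch of rejecting a missing import path during argument parsing instead of crashing later in save_file_to_sources. The parser below is hypothetical and is not ArchiveBox's actual CLI definition: the function names, the help strings, and the assumption that a positional import path is always required are illustrative only.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser for illustration, not ArchiveBox's real one.
    parser = argparse.ArgumentParser(prog="archivebox add")
    parser.add_argument("--update-all", action="store_true",
                        help="flag from the report above")
    parser.add_argument("import_path", nargs="?", default=None,
                        help="file or URL to import links from")
    return parser


def parse_add_args(argv):
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.import_path is None:
        # parser.error prints the usage line and this message to stderr,
        # then exits with status 2 instead of raising a TypeError later on.
        parser.error("an import path or URL is required, "
                     "e.g. archivebox add --update-all wallabag-p1.list")
    return args
```

With this shape, `archivebox add --update-all` on its own would end with a one-line usage error rather than a traceback.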

Software versions

  • OS: Debian buster 10 up to date
  • ArchiveBox version: 0.4.1 installed from pip
  • Python version: 3.7.3something
  • Chrome version: irrelevant?

Thanks for your hard work, and sorry for the flood of bug reports! :)


@pirate commented on GitHub (May 6, 2019):

I added something recently called atomic_write, and I think the behavior you're seeing is just a bug in my implementation that can be fixed quite easily. This is how atomic_write works right now:

def atomic_write(contents, path):
    try:
        # 1. create temp file
        # 2. write to temp file
        # 3. rename temp file over actual destination file
    finally:
        # if anything fails, delete temp file to clean up
        if os.path.exists(tmp_file):
            os.remove(tmp_file)

What you're encountering is the finally clause deleting a temp file that's being created by a different process. It can be fixed by making every temp file have a random, unique suffix such that two processes never attempt to modify the same temp file. After I push the fix I'll comment back and close this. I'll also improve testing and support for multicore runs in general in v0.4.0.
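
A minimal sketch of the unique-suffix idea described above, assuming only the Python standard library. This is not the code that ended up in ArchiveBox; the temp-file naming scheme and the fsync call are illustrative choices.

```python
import os
import tempfile


def atomic_write(contents: str, path: str) -> None:
    """Write contents to path through a uniquely named temp file."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates a new file with a random, unused name, so two
    # processes never share (or delete) each other's temp file.
    fd, tmp_file = tempfile.mkstemp(
        prefix=os.path.basename(path) + ".", suffix=".tmp", dir=directory
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            f.write(contents)
            f.flush()
            os.fsync(f.fileno())
        # The temp file lives in the destination directory, so this rename
        # stays on one filesystem and replaces the target atomically on POSIX.
        os.replace(tmp_file, path)
    finally:
        # After a successful replace the temp name is gone; this only cleans
        # up our own leftover when the write or the rename failed.
        if os.path.exists(tmp_file):
            os.remove(tmp_file)
```

Note that this only prevents processes from clobbering each other's temp files. If two processes write the same index concurrently, the last rename still wins, so their changes are not merged.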


@anarcat commented on GitHub (May 6, 2019):

you might want to reuse existing code for this, e.g.

https://github.com/untitaker/python-atomicwrites
https://github.com/rec/safer


@pirate commented on GitHub (Jul 24, 2020):

This should all be fixed on the latest django branch (we ended up using python-atomicwrites).

git checkout django
git pull
docker build . -t archivebox
docker run -v $PWD/output:/data archivebox init
docker run -v $PWD/output:/data archivebox add 'https://example.com'

If you still see any issues, comment back and I'll reopen the ticket.
I still recommend running it single-threaded only for now. The next version will have much better multicore support, since we'll be removing the index.json and index.html main indexes that cause so many locking issues and write race conditions.


@pirate commented on GitHub (Apr 12, 2022):

Note I've added a new DB/filesystem troubleshooting area to the wiki that may help people arriving here from Google: https://github.com/ArchiveBox/ArchiveBox/wiki/Upgrading-or-Merging-Archives#database-troubleshooting

Contributions/suggestions welcome there.
