[GH-ISSUE #611] Question: Adding any options to [ARCHIVE_METHOD_OPTIONS] causes wget to fail #3398

Closed
opened 2026-03-14 22:38:58 +03:00 by kerem · 4 comments

Originally created by @winteriscariot on GitHub (Jan 10, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/611

If I add anything to the [ARCHIVE_METHOD_OPTIONS] section it causes wget to throw an exception. Here is my current ArchiveBox.conf:

[GENERAL_CONFIG]
TIMEOUT = 120

[SERVER_CONFIG]
SECRET_KEY = <super_secret>

[ARCHIVE_METHOD_TOGGLES]
SAVE_ARCHIVE_DOT_ORG = FALSE
SAVE_PDF = FALSE
SAVE_SCREENSHOT = FALSE
SAVE_MEDIA = FALSE

[ARCHIVE_METHOD_OPTIONS]
COOKIES_FILE = /home/winteriscariot/cookies.txt
WGET_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"

[DEPENDENCY_CONFIG]
CHROME_BINARY = /usr/bin/chromium

If I just have one option (such as COOKIES_FILE) it still fails. It does NOT fail if I remove the COOKIES_FILE and WGET_USER_AGENT from the ArchiveBox.conf. With the above config I get the following exception thrown, about halfway through the wget process:

$ archivebox add https://www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released/
[i] [2021-01-10 20:20:08] ArchiveBox v0.5.3: archivebox add https://www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released/
    > /mnt/storage/Archive

[+] [2021-01-10 20:20:08] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1610310008-import.txt
    > Parsed 1 URLs from input (Plain Text)                                                                                                                                                 
    > Found 1 new URLs not already in index

[*] [2021-01-10 20:20:08] Writing 1 links to main index...
    √ /mnt/storage/Archive/index.sqlite3                                                                                                                                                    

[▶] [2021-01-10 20:20:08] Starting archiving of 1 snapshots in index...

[+] [2021-01-10 20:20:08] "www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released"
    https://www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released/
    > ./archive/1610310008.636074
      > title
      > favicon                                                                                                                                                                             
      > wget                                                                                                                                                                                
    ! Failed to archive link: Exception: Exception in archive_methods.save_wget(Link(url=https://www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released/))               

Traceback (most recent call last):
  File "/usr/lib/python3.9/site-packages/archivebox/extractors/__init__.py", line 108, in archive_link
    result = method_function(link=link, out_dir=out_dir)
  File "/usr/lib/python3.9/site-packages/archivebox/util.py", line 112, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/archivebox/extractors/wget.py", line 115, in save_wget
    return ArchiveResult(
  File "<string>", line 12, in __init__
  File "/usr/lib/python3.9/site-packages/archivebox/index/schema.py", line 46, in __post_init__
    self.typecheck()
  File "/usr/lib/python3.9/site-packages/archivebox/index/schema.py", line 57, in typecheck
    assert all(isinstance(arg, str) and arg for arg in self.cmd)
AssertionError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/archivebox", line 8, in <module>
    sys.exit(main())
  File "/usr/lib/python3.9/site-packages/archivebox/cli/__init__.py", line 129, in main
    run_subcommand(
  File "/usr/lib/python3.9/site-packages/archivebox/cli/__init__.py", line 69, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/usr/lib/python3.9/site-packages/archivebox/cli/archivebox_add.py", line 85, in main
    add(
  File "/usr/lib/python3.9/site-packages/archivebox/util.py", line 112, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/archivebox/main.py", line 593, in add
    archive_links(new_links, overwrite=False, **archive_kwargs)
  File "/usr/lib/python3.9/site-packages/archivebox/util.py", line 112, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/archivebox/extractors/__init__.py", line 173, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/usr/lib/python3.9/site-packages/archivebox/util.py", line 112, in typechecked_function
    return func(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/archivebox/extractors/__init__.py", line 122, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_wget(Link(url=https://www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released/))
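
For context on the `AssertionError` above: the failing line in `schema.py` requires every element of the recorded command to be a non-empty string. The sketch below re-applies that same check to two made-up command lists to show which kinds of values trip it. The helper name `failing_args` and both command lists are illustrative assumptions, not ArchiveBox code, and nothing in this thread confirms which argument was actually at fault.

```python
from pathlib import Path


def failing_args(cmd):
    """Return (index, value) pairs that would fail the check
    `isinstance(arg, str) and arg` quoted in the traceback."""
    return [
        (i, arg)
        for i, arg in enumerate(cmd)
        if not (isinstance(arg, str) and arg)
    ]


# Hypothetical command lists, NOT ArchiveBox's real wget invocation.
ok_cmd = ["wget", "--load-cookies", "/home/user/cookies.txt", "https://example.com"]
bad_cmd = ["wget", "--load-cookies", Path("/home/user/cookies.txt"), "", None]

if __name__ == "__main__":
    print(failing_args(ok_cmd))   # every element is a non-empty str
    print(failing_args(bad_cmd))  # a Path object, an empty string, and None
```

In other words, the assertion says nothing about wget itself; it fires whenever some element of the command is empty, `None`, or not a `str`.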

I'm running on an up-to-date Arch Linux install (updated this morning to try and fix it). I installed ArchiveBox via pip, and it is version 0.5.3.

wget version (just the default from the arch repos):

$ wget --version
GNU Wget 1.20.3 built on linux-gnu.

-cares +digest -gpgme +https +ipv6 +iri +large-file -metalink +nls 
+ntlm +opie +psl +ssl/gnutls 

Wgetrc: 
    /etc/wgetrc (system)
Locale: 
    /usr/share/locale 
Compile: 
    gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc" 
    -DLOCALEDIR="/usr/share/locale" -I. -I../lib -I../lib 
    -D_FORTIFY_SOURCE=2 -I/usr/include/p11-kit-1 -DHAVE_LIBGNUTLS 
    -DNDEBUG -march=x86-64 -mtune=generic -O2 -pipe -fno-plt 
Link: 
    gcc -I/usr/include/p11-kit-1 -DHAVE_LIBGNUTLS -DNDEBUG 
    -march=x86-64 -mtune=generic -O2 -pipe -fno-plt 
    -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -lpcre2-8 -luuid 
    -lidn2 -lnettle -lgnutls -lz -lpsl ftp-opie.o gnutls.o http-ntlm.o 
    ../lib/libgnu.a /usr/lib/libunistring.so 

Unfortunately I'm not familiar enough with Python to debug this myself, or even to have a good idea whether this is a bug in ArchiveBox or a config or dependency issue. Is there something obvious that I should be looking at for this?

Any help would be awesome, thanks!

kerem closed this issue 2026-03-14 22:39:03 +03:00

@winteriscariot commented on GitHub (Jan 10, 2021):

Here's stdout of archiving that same link with both ARCHIVE_METHOD_OPTIONS removed:

functional ArchiveBox.conf:

[GENERAL_CONFIG]
TIMEOUT = 120

[SERVER_CONFIG]
SECRET_KEY =

[ARCHIVE_METHOD_TOGGLES]
SAVE_ARCHIVE_DOT_ORG = FALSE
SAVE_PDF = FALSE
SAVE_SCREENSHOT = FALSE
SAVE_MEDIA = FALSE

[ARCHIVE_METHOD_OPTIONS]
#COOKIES_FILE = /home/winteriscariot/cookies.txt
#WGET_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"

[DEPENDENCY_CONFIG]
CHROME_BINARY = /usr/bin/chromium

Successful wget stdout:

$ archivebox add https://www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released/
[i] [2021-01-10 20:28:55] ArchiveBox v0.5.3: archivebox add https://www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released/
    > /mnt/storage/Archive

[+] [2021-01-10 20:28:55] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1610310535-import.txt
    > Parsed 1 URLs from input (Plain Text)                                                                                                                                                 
    > Found 1 new URLs not already in index

[*] [2021-01-10 20:28:55] Writing 1 links to main index...
    √ /mnt/storage/Archive/index.sqlite3                                                                                                                                                    

[▶] [2021-01-10 20:28:55] Starting archiving of 1 snapshots in index...

[+] [2021-01-10 20:28:55] "www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released"
    https://www.ghacks.net/2021/01/10/password-manager-keepass-2-47-has-been-released/
    > ./archive/1610310535.61918
      > title
      > favicon                                                                                                                                                                             
      > wget                                                                                                                                                                                
      > singlefile                                                                                                                                                                          
      > dom                                                                                                                                                                                 
      > readability                                                                                                                                                                         
      > mercury                                                                                                                                                                             
      > headers                                                                                                                                                                             
                                                                                                                                                                                            
[√] [2021-01-10 20:29:41] Update of 1 pages complete (45.69 sec)
    - 0 links skipped
    - 1 links updated
    - 0 links had errors

    Hint: To manage your archive in a Web UI, run:
        archivebox server 0.0.0.0:8000

Unfortunately I need my cookies.txt file for archiving some logged in sites, and at least one site throws a 403 whenever I connect with the default ArchiveBox user agent, so not using those options isn't a great choice.

I labeled this as a question mostly because I'm not sure if this is a bug or a config issue.


@winteriscariot commented on GitHub (Jan 10, 2021):

Just confirmed that the same behavior occurs with the same config on a different Arch machine. Are my options maybe malformed?


@pirate commented on GitHub (Jan 11, 2021):

Strange, I haven't seen this failure mode before. Can you try removing the quotes from your User Agent config line?
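
To illustrate why the quotes could matter: Python's standard `configparser` keeps surrounding double quotes as part of an INI value rather than stripping them. The sketch below is a standalone demonstration with `configparser` only; whether ArchiveBox post-processes the value afterwards is not shown in this thread, so treat it as an assumption-laden illustration rather than a description of ArchiveBox's config loader.

```python
from configparser import ConfigParser

# Config fragment copied from the report above.
SAMPLE = """
[ARCHIVE_METHOD_OPTIONS]
WGET_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
"""

parser = ConfigParser(interpolation=None)  # treat '%' literally
parser.optionxform = str                   # keep option names upper-case
parser.read_string(SAMPLE)

raw = parser["ARCHIVE_METHOD_OPTIONS"]["WGET_USER_AGENT"]
unquoted = raw.strip('"')  # what the value looks like without the quotes

if __name__ == "__main__":
    print(repr(raw))       # the quotes are still part of the string
    print(repr(unquoted))
```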


@winteriscariot commented on GitHub (Jan 12, 2021):

It actually may be my cookies file that was causing this issue, not the wget useragent. I left the useragent and removed the cookie file and it works.

I'll have to verify that the cookies text file is formatted correctly. Closing, thanks!
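
One way to check the formatting is Python's standard-library `http.cookiejar.MozillaCookieJar`, which parses the Netscape `cookies.txt` format that wget's `--load-cookies` option also reads. This is a minimal sketch, assuming the file is meant to be in that format. The helper names `count_cookies` and `describe_cookies_file` are hypothetical, and a file that parses cleanly here is not guaranteed to resolve the error reported above.

```python
from http.cookiejar import LoadError, MozillaCookieJar


def count_cookies(path: str) -> int:
    """Parse a Netscape-format cookies.txt and return how many cookies it holds.

    Raises http.cookiejar.LoadError if the file is not valid Netscape format.
    """
    jar = MozillaCookieJar(path)
    # Keep session and expired cookies so every parseable line is counted.
    jar.load(ignore_discard=True, ignore_expires=True)
    return len(jar)


def describe_cookies_file(path: str) -> str:
    """Return a one-line, human-readable verdict for the given file."""
    try:
        return f"OK: {count_cookies(path)} cookie(s) parsed"
    except FileNotFoundError:
        return "missing: file does not exist"
    except LoadError as err:
        return f"malformed: {err}"


if __name__ == "__main__":
    # Path taken from the config in the original report.
    print(describe_cookies_file("/home/winteriscariot/cookies.txt"))
```

A valid file starts with a `# Netscape HTTP Cookie File` header line, followed by one cookie per line as seven tab-separated fields.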
