[GH-ISSUE #445] Bugfix: Archiving process dies when readability extractor falls back to download_url on a URL that's timing out #300

Closed
opened 2026-03-01 14:42:13 +03:00 by kerem · 1 comment
Owner

Originally created by @mpeteuil on GitHub (Aug 16, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/445

Describe the bug

The readability extractor tries to reuse the output of the singlefile, wget and dom extractors. If none of them produced a file for the URL, it falls back to the download_url method (https://github.com/pirate/ArchiveBox/blob/v0.4.14/archivebox/extractors/readability.py#L43). If that download times out, the exception is not caught and the whole archiving process dies.
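
To make the fallback concrete, here is a minimal sketch of the behaviour described above, with the download wrapped so that a timeout is reported instead of propagating. It is not the actual ArchiveBox code: `get_html_or_none`, `CANDIDATE_FILES`, the file names in it and the injected `download` callable are all hypothetical and only stand in for the real extractor outputs and `download_url`.

```python
from pathlib import Path
from typing import Callable, Optional

# Hypothetical output file names; the names the real extractors write may differ.
CANDIDATE_FILES = ("singlefile.html", "output.html")


def get_html_or_none(out_dir: str, url: str, download: Callable[[str], str]) -> Optional[str]:
    """Reuse HTML saved by an earlier extractor, else download it; None on failure."""
    for name in CANDIDATE_FILES:
        path = Path(out_dir) / name
        if path.is_file():
            return path.read_text(encoding="utf-8", errors="replace")

    try:
        # Fallback: nothing was saved earlier, so fetch the page directly.
        return download(url)
    except Exception as err:
        # Broad catch on purpose in this sketch: a timeout (or any other
        # download error) becomes "no document" instead of killing the run.
        print(f"download fallback failed for {url}: {type(err).__name__}: {err}")
        return None
```

The caller would then need to treat `None` as a failed readability result rather than passing it on to the readability binary.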

Steps to reproduce

  1. Enable the readability extractor and point it to the correct binary.
  2. Run archivebox add on a URL that will time out (for example a nonexistent LAN IP: archivebox add "http://192.168.999.999").
  3. All extractors should time out. When the readability extractor runs, it has to fall back to download_url from util.py, which kills the process with an uncaught exception. The same should happen when running only the readability extractor.
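
For illustration, the expected behaviour (one extractor failing without ending the whole run) could look roughly like the sketch below. This is an assumption about the desired outcome, not ArchiveBox's real loop: `run_extractors` and the `extractors` mapping are hypothetical names, and the extractors here are plain callables that take a URL.

```python
from typing import Callable, Dict, Tuple


def run_extractors(url: str, extractors: Dict[str, Callable[[str], str]]) -> Dict[str, Tuple[str, str]]:
    """Run every extractor and record a status per extractor instead of aborting."""
    results: Dict[str, Tuple[str, str]] = {}
    for name, extractor in extractors.items():
        try:
            results[name] = ("succeeded", extractor(url))
        except Exception as err:
            # Record the failure and keep going with the remaining extractors.
            results[name] = ("failed", f"{type(err).__name__}: {err}")
    return results
```

With this shape, a timeout in the readability step would show up as a failed entry for that extractor, and any remaining extractors or links would still be processed.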

Screenshots or log output

> readability
! Failed to archive link: Exception: Exception in archive_methods.save_readability(Link(url=http://192.168.999.999/))

Traceback (most recent call last):
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/urllib3/connection.py", line 160, in _new_conn
    (self._dns_host, self.port), self.timeout, **extra_kw
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/urllib3/util/connection.py", line 84, in create_connection
    raise err
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/urllib3/util/connection.py", line 74, in create_connection
    sock.connect(sa)
socket.timeout: timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/urllib3/connectionpool.py", line 677, in urlopen
    chunked=chunked,
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/urllib3/connectionpool.py", line 392, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/Users/mpeteuil/.pyenv/versions/3.7.8/lib/python3.7/http/client.py", line 1262, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/Users/mpeteuil/.pyenv/versions/3.7.8/lib/python3.7/http/client.py", line 1308, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/Users/mpeteuil/.pyenv/versions/3.7.8/lib/python3.7/http/client.py", line 1257, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/Users/mpeteuil/.pyenv/versions/3.7.8/lib/python3.7/http/client.py", line 1028, in _send_output
    self.send(msg)
  File "/Users/mpeteuil/.pyenv/versions/3.7.8/lib/python3.7/http/client.py", line 968, in send
    self.connect()
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/urllib3/connection.py", line 187, in connect
    conn = self._new_conn()
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/urllib3/connection.py", line 167, in _new_conn
    % (self.host, self.timeout),
urllib3.exceptions.ConnectTimeoutError: (<urllib3.connection.HTTPConnection object at 0x115929f50>, 'Connection to 192.168.999.999 timed out. (connect timeout=60)')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/urllib3/connectionpool.py", line 727, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='192.168.999.999', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x115929f50>, 'Connection to 192.168.999.999 timed out. (connect timeout=60)'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/extractors/__init__.py", line 91, in archive_link
    result = method_function(link=link, out_dir=out_dir)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/extractors/readability.py", line 65, in save_readability
    document = get_html(link, out_dir)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/extractors/readability.py", line 43, in get_html
    return download_url(link.url)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/util.py", line 164, in download_url
    timeout=timeout,
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/requests/api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/requests/sessions.py", line 530, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/requests/sessions.py", line 643, in send
    r = adapter.send(request, **kwargs)
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/lib/python3.7/site-packages/requests/adapters.py", line 504, in send
    raise ConnectTimeout(e, request=request)
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='192.168.999.999', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x115929f50>, 'Connection to 192.168.999.999 timed out. (connect timeout=60)'))

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/mpeteuil/.local/share/virtualenvs/ArchiveBox-GFvPTHwb-/Users/mpeteuil/.pyenv/shims/python/bin/archivebox", line 11, in <module>
    load_entry_point('archivebox', 'console_scripts', 'archivebox')()
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/cli/__init__.py", line 126, in main
    pwd=pwd or OUTPUT_DIR,
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/cli/__init__.py", line 62, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/cli/archivebox_update.py", line 119, in main
    out_dir=pwd or OUTPUT_DIR,
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/main.py", line 700, in update
    archive_links(to_archive, overwrite=overwrite, out_dir=out_dir)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/extractors/__init__.py", line 150, in archive_links
    archive_link(link, overwrite=overwrite, methods=methods, out_dir=link.link_dir)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/Users/mpeteuil/projects/ArchiveBox/archivebox/extractors/__init__.py", line 104, in archive_link
    )) from e
Exception: Exception in archive_methods.save_readability(Link(url=http://192.168.999.999/))
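
A note on reading the log above: "During handling of the above exception, another exception occurred" marks an exception raised inside an `except` block (implicit chaining via `__context__`), while "The above exception was the direct cause of the following exception" marks an explicit `raise ... from e` (stored in `__cause__`), which matches the `)) from e` line in `archive_link`. The sketch below reproduces only that last link of the chain; `download`, `save_readability_sketch` and `describe_chain` are hypothetical stand-ins, not ArchiveBox functions.

```python
from typing import List, Optional


def download(url: str) -> str:
    # Stand-in for a network call that times out.
    raise TimeoutError(f"connection to {url} timed out")


def save_readability_sketch(url: str) -> str:
    try:
        return download(url)
    except Exception as err:
        # "from err" sets __cause__ on the new exception, which Python prints as
        # "The above exception was the direct cause of the following exception".
        raise Exception(f"Exception in save_readability({url})") from err


def describe_chain(exc: Optional[BaseException]) -> List[str]:
    """List exception type names from the outermost exception down to the root cause."""
    names: List[str] = []
    while exc is not None:
        names.append(type(exc).__name__)
        exc = exc.__cause__ or exc.__context__
    return names
```

Walking the chain this way shows that the generic `Exception` at the bottom of the log is only a wrapper, and that the timeout raised during the download is the root cause.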

Software versions

  • OS: macOS 10.15.6
  • ArchiveBox version: 26022fc
  • Python version: 3.7.8
kerem 2026-03-01 14:42:13 +03:00
Author
Owner

@pirate commented on GitHub (Aug 18, 2020):

This should be fixed in the latest version. If you still have any issues, comment back here and we'll reopen the ticket.

pip install --upgrade archivebox
# or if you use docker
docker pull nikisweeting/archivebox