[GH-ISSUE #533] JPEG image misunderstood as website #340

Closed
opened 2026-03-01 14:42:39 +03:00 by kerem · 1 comment
Owner

Originally created by @kedorlaomer on GitHub (Nov 12, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/533

My environment is as follows:

  • python3 --version says Python 3.8.6
  • I got archivebox via pip3 install
  • The version of archivebox is ArchiveBox v0.4.21 (do you need more of its output?)
  • The whole mess runs on alpine linux, if this is of any interest
  • anything more needed?

I did the following:

  1. python -m archivebox init
  2. python3 -m archivebox config --set SAVE_ARCHIVE_DOT_ORG=False
  3. python3 -m archivebox add --depth 1 --overwrite https://blog.haschek.at/

The first two commands ran fine; the third one failed with the following output (I'm showing only the suffix I consider relevant):

[+] [2020-11-12 15:05:37] "www.pictshare.net/xu2gjzrm74.jpg"
    https://www.pictshare.net/xu2gjzrm74.jpg
    > ./archive/1605193454.385444
      > title
        Extractor failed:
             Unable to detect page title
        Run to see full output:
            cd /tmp/src/output/archive/1605193454.385444;
            curl --silent --max-time 60 --location --compressed --user-agent "ArchiveBox/0.4.21 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.72.0 (x86_64-alpine-linux-musl)" https://www.pictshare.net/xu2gjzrm74.jpg

      > favicon
      > wget
      > singlefile
      > pdf
      > screenshot
      > dom
      > readability
    ! Failed to archive link: Exception: Exception in archive_methods.save_readability(Link(url=https://www.pictshare.net/xu2gjzrm74.jpg))

Traceback (most recent call last):
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/extractors/__init__.py", line 91, in archive_link
    result = method_function(link=link, out_dir=out_dir)
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/extractors/readability.py", line 109, in save_readability
    cmd=cmd,
UnboundLocalError: local variable 'cmd' referenced before assignment

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/__main__.py", line 11, in <module>
    main(args=sys.argv[1:], stdin=sys.stdin)
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/cli/__init__.py", line 122, in main
    run_subcommand(
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/cli/__init__.py", line 62, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/cli/archivebox_add.py", line 78, in main
    add(
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/main.py", line 574, in add
    archive_links(imported_links, overwrite=True, out_dir=out_dir)
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/extractors/__init__.py", line 150, in archive_links
    archive_link(link, overwrite=overwrite, methods=methods, out_dir=link.link_dir)
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/util.py", line 111, in typechecked_function
    return func(*args, **kwargs)
  File "/home/alexander/.local/lib/python3.8/site-packages/archivebox/extractors/__init__.py", line 101, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_readability(Link(url=https://www.pictshare.net/xu2gjzrm74.jpg))
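
For readers unfamiliar with this class of error, here is a minimal sketch of how an `UnboundLocalError` like the one above can arise. It is not the actual ArchiveBox source: `save_sketch`, `save_sketch_fixed`, the `html_available` flag, and the `hypothetical-tool` command are all made up for illustration. The assumption is only that a variable such as `cmd` is assigned inside a `try` block and then read after it, so a failure before the assignment leaves the name unbound.

```python
from typing import Dict, List, Union

Result = Dict[str, Union[str, List[str]]]


def save_sketch(html_available: bool) -> Result:
    """Hypothetical extractor step; NOT ArchiveBox's save_readability."""
    status = "succeeded"
    try:
        if not html_available:
            # Simulates an earlier step failing (e.g. the URL is an image, not a page)
            raise ValueError("no HTML to process")
        cmd = ["hypothetical-tool", "page.html"]  # only bound if we get this far
    except ValueError:
        status = "failed"
    # If the exception fired before the assignment, `cmd` is unbound here
    return {"cmd": cmd, "status": status}


def save_sketch_fixed(html_available: bool) -> Result:
    """Same hypothetical step, with `cmd` bound before the try block."""
    status = "succeeded"
    cmd: List[str] = []  # always bound, even if the try body fails early
    try:
        if not html_available:
            raise ValueError("no HTML to process")
        cmd = ["hypothetical-tool", "page.html"]
    except ValueError:
        status = "failed"
    return {"cmd": cmd, "status": status}
```

In this sketch, `save_sketch(False)` raises `UnboundLocalError`, while `save_sketch_fixed(False)` returns a result with an empty `cmd` and a `"failed"` status. Whether the fix on master takes this shape is not something this report confirms.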

Thank you for your awesome project!

kerem 2026-03-01 14:42:39 +03:00
Author
Owner

@cdvv7788 commented on GitHub (Nov 12, 2020):

This issue is already fixed on master. You can try using that version by cloning and then running pip install . in the project folder.
Please give that a try, and reopen if it still happens on that version.
