[GH-ISSUE #374] Bugfix: django branch start_ts error on init #1766

Closed
opened 2026-03-01 17:53:29 +03:00 by kerem · 15 comments

Originally created by @drpfenderson on GitHub (Jul 20, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/374

Describe the bug

When running `archivebox init` with version 0.4.3 in an old archive, archivebox fails at `Collecting links from any existing indexes and archive folders...` with `KeyError: 'start_ts'`.

Steps to reproduce

  1. Installed the django branch with `git clone` and `pip install .`.
  2. Navigated to the old archive directory.
  3. Ran `archivebox init`.
  4. archivebox goes through most of the import process, then dies with the error listed below.

Screenshots or log output

Traceback (most recent call last):
  File "/home/USERNAME/.local/bin/archivebox", line 8, in <module>
    sys.exit(main())
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 126, in main
    pwd=pwd or OUTPUT_DIR,
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 62, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/archivebox_init.py", line 34, in main
    out_dir=pwd or OUTPUT_DIR,
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 108, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/main.py", line 316, in init
    for link in load_main_index(out_dir=out_dir, warn=False)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 108, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/__init__.py", line 250, in load_main_index
    all_links = list(parse_json_main_index(out_dir))
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/json.py", line 52, in parse_json_main_index
    yield Link.from_json(link_json)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 203, in from_json
    cast_result = ArchiveResult.from_json(json_result)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 62, in from_json
    info['start_ts'] = parse_date(info['start_ts'])
KeyError: 'start_ts'

Software versions

  • OS: Ubuntu 18.04.4 LTS
  • ArchiveBox version: 848977e
  • Python version: Python 3.7.8
kerem closed this issue 2026-03-01 17:53:29 +03:00

@pirate commented on GitHub (Jul 20, 2020):

Interesting, what version was the archive folder originally made with? Can you post a sample `output/archive/<timestamp>/index.json` file from your archive so I can investigate further?


@drpfenderson commented on GitHub (Jul 20, 2020):

The index.html says that it was created with version a3a048d4. [Here is a gist](https://gist.github.com/drpfenderson/f7ec110aac64320f34f57e2c7ef852d3) containing the output of one of the most recent index.json files, with redacted personal info.

@pirate commented on GitHub (Jul 21, 2020):

Ah sorry, I just noticed the error is actually in the main index parse, not the link details parse. Can you post a redacted/shortened version of your main index `output/index.json` file when you get a chance?


@drpfenderson commented on GitHub (Jul 21, 2020):

[Here is a snippet](https://gist.github.com/drpfenderson/ae78c2d6a32263a07045c51da5a71a55) from the beginning of the main index.json file. [Here is another](https://gist.github.com/drpfenderson/50fbd2dc0566d64097e4bf85b4bfaa07) snippet from later in the file. Let me know if you would like/need more, or are looking for something in particular.

@cdvv7788 commented on GitHub (Jul 22, 2020):

@drpfenderson can you please test with the django branch again? We pushed a change that should help with your issue.


@drpfenderson commented on GitHub (Jul 22, 2020):

Different set of errors. This is with 36124f2. I'm going to list how I did things, just in case I'm missing a step.

$ pip uninstall archivebox
/* it was successful */

$ git pull
On branch django
Your branch is up to date with 'origin/django'.

$ pip install .
/* successful again */

$ cd /archive_output
$ archivebox init

Traceback (most recent call last):
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/json.py", line 54, in parse_json_main_index
    yield Link.from_json(link_json)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 203, in from_json
    cast_result = ArchiveResult.from_json(json_result)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 62, in from_json
    info['start_ts'] = parse_date(info['start_ts'])
KeyError: 'start_ts'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/USERNAME/.local/bin/archivebox", line 8, in <module>
    sys.exit(main())
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 126, in main
    pwd=pwd or OUTPUT_DIR,
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 62, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/archivebox_init.py", line 34, in main
    out_dir=pwd or OUTPUT_DIR,
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 109, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/main.py", line 316, in init
    for link in load_main_index(out_dir=out_dir, warn=False)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 109, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/__init__.py", line 250, in load_main_index
    all_links = list(parse_json_main_index(out_dir))
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/json.py", line 57, in parse_json_main_index
    yield parse_json_link_details(str(detail_index_path))
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 109, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/json.py", line 105, in parse_json_link_details
    return Link.from_json(link_json)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 203, in from_json
    cast_result = ArchiveResult.from_json(json_result)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 62, in from_json
    info['start_ts'] = parse_date(info['start_ts'])
KeyError: 'start_ts'

EDIT: To clarify, the error is thrown at the same point in the init process, [*] Collecting links from any existing indexes and archive folders...


@drpfenderson commented on GitHub (Jul 23, 2020):

Saw you had made some changes, and pulled 4cb671a, rebuilt, ran on a copy of the archive. Finally went through the init process completely, though it listed a ton of the indexes with errors!

$ archivebox status
/* snip */
[*] Scanning archive data directories...
    /home/USERNAME/archivebox/archive/*
    Size: 32.7 GB across 128366 files in 138147 directories

    > indexed: 260                   (indexed links without checking archive status or data directory validity)
      > archived: 260                (indexed links that are archived with a valid data directory)
      > unarchived: 0                (indexed links that are unarchived with no data directory or an empty data directory)

    > present: 1761                  (dirs that actually exist in the archive/ folder)
      > valid: 260                   (dirs with a valid index matched to the main index and archived content)
      > invalid: 1501                (dirs that are invalid for any reason: corrupted/duplicate/orphaned/unrecognized)
        > duplicate: 0               (dirs that conflict with other directories that have the same link URL or timestamp)
        > orphaned: 0                (dirs that contain a valid index but aren't listed in the main index)
        > corrupted: 0               (dirs that don't contain a valid index and aren't listed in the main index)
        > unrecognized: 1501         (dirs that don't contain recognizable archive data and aren't listed in the main index)

Here is a sampling of some of the indexes that it says are invalid: [1144634362](https://gist.github.com/drpfenderson/4b0d1228ed7f77c40922e21e97137e53), [1160641317](https://gist.github.com/drpfenderson/a75425701d15614955d2116d7e73a95b), [1222742076](https://gist.github.com/drpfenderson/719c4bc87be16b80d7b6e992e4b8f4b5).

@pirate commented on GitHub (Jul 23, 2020):

Perfect, thanks for those samples. It confirms our suspicion that you had a few links archived with a very old version before we introduced start_ts. We'll add a workaround that will handle that older schema and upgrade those files to the new style.

(Also thanks for the sponsorship @drpfenderson!)
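
For readers following along, the kind of workaround described above can be sketched as below. This is a minimal illustration, not the code that was actually pushed to the django branch: `upgrade_legacy_result` and `parse_ts` are hypothetical helpers, and the assumption is that old entries simply lack the `start_ts` (and possibly `end_ts`) keys while newer ones store them as ISO-8601 strings.

```python
from datetime import datetime
from typing import Any, Dict, Optional


def parse_ts(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO-8601 timestamp string, returning None when it is absent."""
    if not value:
        return None
    return datetime.fromisoformat(value)


def upgrade_legacy_result(info: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of an archive-result dict with the timestamp keys always present.

    Hypothetical helper: entries written before `start_ts` existed get None
    instead of raising KeyError, which is what info['start_ts'] does.
    """
    upgraded = dict(info)
    upgraded["start_ts"] = parse_ts(info.get("start_ts"))
    upgraded["end_ts"] = parse_ts(info.get("end_ts"))
    return upgraded
```

Using `dict.get` rather than indexing is the whole point of the sketch: a missing key becomes an explicit `None` that later code can fill in or ignore.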


@cdvv7788 commented on GitHub (Jul 24, 2020):

@drpfenderson can you test again please? fingers crossed


@drpfenderson commented on GitHub (Jul 24, 2020):

New error. :(

I hate being this problem person, but I do sincerely appreciate y'all continuing to support me with this severely out-of-date index. It's a very important, personal historical archive for me, and I would have just recreated it from scratch using the new version and the old list of links, but a lot of those sites in the original index are no longer online.

I uninstalled, deleted working dir, git pull, pip install ., copy archive, archivebox init. This is with 74ad79f :

[*] Collecting links from any existing indexes and archive folders...
Traceback (most recent call last):
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/json.py", line 54, in parse_json_main_index
    yield Link.from_json(link_json)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 228, in from_json
    cast_result = ArchiveResult.from_json(json_result, guess)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 87, in from_json
    info['start_ts'] = parse_date(info['start_ts'])
KeyError: 'start_ts'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/json.py", line 58, in parse_json_main_index
    yield parse_json_link_details(str(detail_index_path))
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 109, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/json.py", line 111, in parse_json_link_details
    return Link.from_json(link_json, guess)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 228, in from_json
    cast_result = ArchiveResult.from_json(json_result, guess)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 87, in from_json
    info['start_ts'] = parse_date(info['start_ts'])
KeyError: 'start_ts'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/USERNAME/.local/bin/archivebox", line 8, in <module>
    sys.exit(main())
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 126, in main
    pwd=pwd or OUTPUT_DIR,
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/__init__.py", line 62, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/cli/archivebox_init.py", line 35, in main
    out_dir=pwd or OUTPUT_DIR,
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 109, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/main.py", line 314, in init
    for link in load_main_index(out_dir=out_dir, warn=False)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/util.py", line 109, in typechecked_function
    return func(*args, **kwargs)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/__init__.py", line 253, in load_main_index
    all_links = list(parse_json_main_index(out_dir))
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/json.py", line 62, in parse_json_main_index
    yield Link.from_json(link_json, guess=True)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 228, in from_json
    cast_result = ArchiveResult.from_json(json_result, guess)
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 90, in from_json
    return cls(**info)
  File "<string>", line 11, in __init__
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 38, in __post_init__
    self.typecheck()
  File "/home/USERNAME/.local/lib/python3.7/site-packages/archivebox/index/schema.py", line 48, in typecheck
    assert isinstance(self.cmd, list)
AssertionError

Let me know if there is anything else I can provide for you.

Side question: I know I can run git rev-parse HEAD | head -c7 to get the current revision on the git directory, but is there a command to find out exactly which revision might be installed in pip? As sometimes they will not be in sync. This isn't really that important, was just curious.
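
The final `AssertionError` in the traceback above comes from `assert isinstance(self.cmd, list)`, so the remaining legacy entries evidently carry a `cmd` value that is not a list. The thread does not show what those entries contain; the sketch below assumes they hold either nothing or a single command string, and `normalize_cmd` is a hypothetical helper rather than ArchiveBox's own fix.

```python
import shlex
from typing import Any, List


def normalize_cmd(cmd: Any) -> List[str]:
    """Coerce a legacy `cmd` value into the list-of-strings form the schema asserts.

    Assumption: old entries stored either nothing or one shell-style string.
    """
    if cmd is None:
        return []
    if isinstance(cmd, str):
        # Split like a shell would, so quoted arguments stay together.
        return shlex.split(cmd)
    # Already a sequence (list or tuple): make sure every part is a string.
    return [str(part) for part in cmd]
```

Applying a normalization like this before constructing the result object would let the type check pass without loosening the assertion itself.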


@cdvv7788 commented on GitHub (Jul 24, 2020):

@drpfenderson one more try please.
Also, if you install it with `pip install -e .`, the installed version will always be the code currently checked out (i.e. no need to re-run pip install after changing branches).


@drpfenderson commented on GitHub (Jul 24, 2020):

(Mostly) successful import with 5582d8a! Thanks again for the quick responses and fixes. I would say that this specific bug is crushed, but want to make sure the next part of the error is unrelated first.

[*] Collecting links from any existing indexes and archive folders...
    √ Loaded 1369 links from existing main index.
    √ Added 7 orphaned links from existing archive directories.
    ! Skipped adding 149 invalid link data directories.

Not sure exactly what that means, but 149 is much easier to handle. From a random sampling, the 1369 properly loaded links work fine. But `archivebox init` listed specific pages with errors, so I looked a few up. They are all in the index, but partially corrupted. The main archive for each, reached by clicking the title of the link in the archive index, loads just fine. However, clicking the Files link, which shows the various versions that were captured, actually loads the files for a different link. Following the other link's Files sometimes points back to the first link, forming a loop: Link A Files (click) > Link B files appear, Link B Files (click) > Link A files appear. In other cases the loop is longer: Link A Files (click) > Link B files appear, Link B Files (click) > Link C files appear, Link C Files (click) > Link A files appear. Very weird.

If this error is unrelated, please feel free to mark this as closed, and I can definitely file a separate report.


@pirate commented on GitHub (Jul 25, 2020):

Ah, I have seen that issue before; I think it might've been caused by a previous version, actually. If I remember correctly, the last time we saw this bug it was caused by the timestamp deduplication code switching the timestamps of two existing links during the deduping process.

The older versions didn't display the info on startup about which links were invalid/valid, so it's possible it just went un-noticed. Do you happen to have a backup from before you did the upgrade with archivebox init? If so, you can check if the swapped timestamps were present previously, and that would help us rule out a bug in this version. If not, no worries, we have some things we can try to make the new version auto-fix this type of situation.


@drpfenderson commented on GitHub (Jul 27, 2020):

Ah! I believe you are correct. The error exists in those links in my previous/backup version. The paths of the "loops" I mentioned align with the progression of a directory structure. For example:

        X /.archivebox-output/archive-working/archive/1222742097 [1222742097] www.famfamfam.com/lab/icons/silk "famfamfam.com: Silk Icons"
        X /.archivebox-output/archive-working/archive/1222742097.0 [1222742097.0] www.drawspace.com "Drawspace.com - Drawing lessons"

To clarify, I click 1222742097 from the index, and it loads 1222742097.0, and vice-versa. So then, yes! This must be that duplication error you recognize from a previous version, and it sounds like this specific bug is closed.

Do you have any advice for the other error, or maybe link to an issue # if it already exists? Even if it requires me hand-editing all of the incorrect ones in nano, I would be super happy.
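
Before hand-editing anything, it may help to list exactly which folders are affected. The sketch below is an illustration only and rests on two assumptions not confirmed in this thread: that the main `index.json` keeps its entries under a `links` key, and that both the main entries and each `archive/<timestamp>/index.json` carry `timestamp` and `url` fields. `find_swapped_links` is a hypothetical name.

```python
import json
from pathlib import Path
from typing import List, Tuple


def find_swapped_links(out_dir: Path) -> List[Tuple[str, str, str]]:
    """Return (timestamp, url_in_main_index, url_in_folder_index) for each mismatch.

    Assumed layout: out_dir/index.json lists links under "links", and every
    archive/<timestamp>/index.json records the "url" that folder belongs to.
    """
    main_index = json.loads((out_dir / "index.json").read_text(encoding="utf-8"))
    mismatches = []
    for link in main_index.get("links", []):
        timestamp, url = str(link.get("timestamp")), link.get("url")
        detail_path = out_dir / "archive" / timestamp / "index.json"
        if not detail_path.exists():
            continue  # a missing folder is a different problem; skip it here
        detail = json.loads(detail_path.read_text(encoding="utf-8"))
        if detail.get("url") != url:
            mismatches.append((timestamp, url, detail.get("url")))
    return mismatches
```

Run against a copy of the archive, this only reports mismatches and changes nothing, so it is safe to try before deciding how to repair the swapped pairs.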


@pirate commented on GitHub (Jul 27, 2020):

Awesome, that's a relief to hear. We were worried it was a regression from the latest version. I'm going to close this issue for now but I'll keep responding to your comments here, don't worry.

If you post a ZIP of a handful of those swapped folders (or email it to me), I'll write you a bash script that fixes it.
