[GH-ISSUE #544] Question / Bug : What's the intended use of --index-only on the update command ? #3365

Closed
opened 2026-03-14 22:23:43 +03:00 by kerem · 4 comments

Originally created by @jdcaballerov on GitHub (Nov 21, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/544

The update command code path starts in `archivebox_update.py`, which calls the `update` function as follows:

    update(
        resume=command.resume,
        only_new=command.only_new,
        index_only=command.index_only,
        overwrite=command.overwrite,
        filter_patterns_str=filter_patterns_str,
        filter_patterns=command.filter_patterns,
        filter_type=command.filter_type,
        status=command.status,
        after=command.after,
        before=command.before,
        out_dir=pwd or OUTPUT_DIR,
    )

The return value is not captured. If `--index-only` is passed, the `update` function in `archivebox/main.py` is executed as follows:


    check_data_folder(out_dir=out_dir)
    check_dependencies()
    new_links: List[Link] = [] # TODO: Remove input argument: only_new

    # Step 1: Filter for selected_links
    matching_snapshots = list_links(
        filter_patterns=filter_patterns,
        filter_type=filter_type,
        before=before,
        after=after,
    )

    matching_folders = list_folders(
        links=matching_snapshots,
        status=status,
        out_dir=out_dir,
    )
    all_links = [link for link in matching_folders.values() if link]

    if index_only:
        return all_links

Nothing is then done with the output of this code path. What is the current or intended use?

kerem closed this issue 2026-03-14 22:23:48 +03:00

@pirate commented on GitHub (Nov 21, 2020):

It's supposed to only write the json and HTML index files for the links without running any extractors.


@cdvv7788 commented on GitHub (Nov 21, 2020):

@pirate this flag is for updating the index only, not for generating the legacy indexes. I need to check whether something happened to it during the migration.


@pirate commented on GitHub (Nov 22, 2020):

Correct, by index files I meant it should only update the `data/archive/<timestamp>/index.{json,html}` files (which we are still using and plan to keep), not the old main index in `data/index.{json,html}`. The `--index-only` flag is useful because when an update changes the format or HTML/CSS styling of the link details pages, those files are not always updated automatically. `archivebox update --index-only` is a way to manually update those detail indexes to the latest styling / format.
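
To make the intended behavior concrete, here is a minimal sketch of it, not ArchiveBox's actual code. Everything in it is an assumption made for illustration: the names `update_sketch`, `write_detail_index`, and `run_extractors` are hypothetical, a plain dict stands in for a `Link`, and the HTML written is a placeholder rather than the real details-page template. The one point it shows is that with `index_only` set, the per-snapshot `index.{json,html}` files are still rewritten while the extractors are skipped.

```python
import html
import json
import tempfile
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical stand-in for ArchiveBox's Link objects (assumption).
Snapshot = Dict[str, str]


def write_detail_index(snapshot: Snapshot, archive_dir: Path) -> Path:
    """Rewrite <archive_dir>/<timestamp>/index.{json,html} for one snapshot."""
    snap_dir = archive_dir / snapshot["timestamp"]
    snap_dir.mkdir(parents=True, exist_ok=True)
    (snap_dir / "index.json").write_text(json.dumps(snapshot, indent=2))
    url = html.escape(snapshot["url"], quote=True)
    # Placeholder markup only; the real template is not reproduced here.
    (snap_dir / "index.html").write_text(
        f'<html><body><a href="{url}">{url}</a></body></html>'
    )
    return snap_dir


def update_sketch(
    snapshots: List[Snapshot],
    archive_dir: Path,
    index_only: bool,
    run_extractors: Callable[[Snapshot], None],
) -> List[Snapshot]:
    """Refresh detail indexes; run extractors only when index_only is False."""
    for snapshot in snapshots:
        if not index_only:
            run_extractors(snapshot)  # hypothetical extractor hook
        write_detail_index(snapshot, archive_dir)
    return snapshots


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = [{"timestamp": "1605916800", "url": "https://example.com"}]
        update_sketch(demo, Path(tmp), index_only=True, run_extractors=print)
        print(sorted(p.name for p in (Path(tmp) / "1605916800").iterdir()))
```

In this sketch the index write happens inside the loop regardless of the flag, so an early `return` before any write (as in the snippet quoted in the issue) cannot silently turn `--index-only` into a no-op.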


@cdvv7788 commented on GitHub (Dec 5, 2020):

@pirate https://github.com/ArchiveBox/ArchiveBox/blob/1b8abc09616175a1a4180211e8c72a5de7dcfdbf/archivebox/main.py#L451-#L488 I think that the behavior mentioned above is not what was originally intended. When `--index-only` was present, this code aborted the `update` function after writing to the main index, not to the detail index in every archive.
I will refactor this to behave as you described.
