[GH-ISSUE #776] Question: Integrating Archivebox into my workflow #2000

Closed
opened 2026-03-01 17:55:44 +03:00 by kerem · 2 comments

Originally created by @zblesk on GitHub (Jun 29, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/776

Hello,

there are a few things I'd like ArchiveBox to do as part of my personal infrastructure and workflow.
I understand that the HTTP API won't be here anytime soon, so I will work around that.

These are the 3 things I need to do:

  1. Add new links to ArchiveBox as my system ingests them.
  2. For any link, answer the question "do I have this link archived?" and, if so, return the links to the archived version(s).
  3. Add the text of the downloaded pages (article view only) to my search index.

I think I can solve #1 by creating a txt endpoint that contains the links to be archived, and having ArchiveBox fetch from that.
For #2 and #3, I have looked at the DB briefly.
It seems I can use `core_snapshot` to fetch a list of the URLs I have in my archive.
But how do I find which kinds of archival were successful? Is it somewhere in the DB, or do I have to crawl the filesystem and check "manually"?

And what about the URLs? Will they always be in the form of `archive/<folder name>/readability/content.html` (where the last 2 parts are specific for each archive type)?

Is there anything else I should keep in mind?

Thank you!


@pirate commented on GitHub (Jun 29, 2021):

> These are the 3 things I need to do:

  1. `archivebox schedule --every=day --depth=1 ./path/to/links.txt`
  2. `archivebox list --csv=timestamp,is_archived 'https://example.com/url/to/check' | grep true`
  3. `cat ./archive/<timestamp>/**/*.{txt,html,htm} | <cmd to add to search index>`
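For an application that would rather not shell out, step 3 might be approximated in Python along these lines. This is only a sketch: `html_to_text` and `snapshot_texts` are hypothetical helper names, the tag stripping is deliberately naive, and feeding the result into a search index is left to the caller.

```python
from html.parser import HTMLParser
from pathlib import Path

TEXT_SUFFIXES = {".txt", ".html", ".htm"}  # same extensions as the glob above


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text buffered at the end of the input
    return parser.text()


def snapshot_texts(snapshot_dir):
    """Yield (path, text) for every .txt/.html/.htm file under snapshot_dir."""
    for path in sorted(Path(snapshot_dir).rglob("*")):
        suffix = path.suffix.lower()
        if suffix not in TEXT_SUFFIXES or not path.is_file():
            continue
        raw = path.read_text(encoding="utf-8", errors="replace")
        text = raw if suffix == ".txt" else html_to_text(raw)
        yield path, text
```

Each yielded pair could then be handed to whatever indexer is in use, in place of the `<cmd to add to search index>` placeholder.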

> But how do I find which kinds of archival were successful? Is it somewhere in the DB, or do I have to crawl the filesystem and check "manually"?

You can either use the CLI (`archivebox list --csv=timestamp,is_archived 'https://example.com/url/to/check' | grep true`) or check the SQL DB using the `core_archiveresult` table. It contains all the results you see on the `Log` page in the UI; you can select rows using the foreign key to the snapshot and look for any successful `ArchiveResult` entries (which means archiving succeeded).
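A query of that shape could look roughly like the following. Only the table names `core_snapshot` and `core_archiveresult` come from this thread; the column names (`snapshot_id`, `extractor`, `output`, `status`, `url`, `timestamp`) and the status value `'succeeded'` are assumptions about the schema, so it is worth checking them with `.schema core_archiveresult` in the `sqlite3` shell first. `successful_results` is a hypothetical helper name.

```python
import sqlite3


def successful_results(conn, url):
    """Return (timestamp, extractor, output) rows for successful results of url.

    Assumes core_archiveresult.snapshot_id references core_snapshot.id and
    that successful rows carry status = 'succeeded' (both are assumptions).
    """
    query = """
        SELECT s.timestamp, r.extractor, r.output
        FROM core_snapshot AS s
        JOIN core_archiveresult AS r ON r.snapshot_id = s.id
        WHERE s.url = ? AND r.status = 'succeeded'
        ORDER BY s.timestamp, r.extractor
    """
    return conn.execute(query, (url,)).fetchall()
```

An empty list then answers "do I have this link archived?" with "no" (or "not successfully"), and a non-empty one lists which archive methods produced output.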

> And what about the URLs? Will they always be in the form of `archive/<folder name>/readability/content.html` (where the last 2 parts are specific for each archive type)?

Yes; however, some are stored directly in the root of the archive subfolder, like `archive/<folder name>/singlefile.html` (for legacy reasons).
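Since outputs can live either in a per-method subfolder or in the root of the snapshot folder, one cautious approach is to probe a list of known relative paths rather than assume a single layout. The sketch below uses only the two paths mentioned in this thread; `CANDIDATE_OUTPUTS` and `existing_outputs` are hypothetical names, and the list would need extending for other archive types.

```python
from pathlib import Path

# Relative output paths mentioned in this thread; extend as needed.
CANDIDATE_OUTPUTS = ("readability/content.html", "singlefile.html")


def existing_outputs(archive_root, folder_name, candidates=CANDIDATE_OUTPUTS):
    """Return the candidate relative paths that exist for one snapshot folder."""
    snapshot_dir = Path(archive_root) / folder_name
    return [rel for rel in candidates if (snapshot_dir / rel).is_file()]
```

Joining each returned path onto `archive/<folder name>/` gives the location of the corresponding archived version.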

> Is there anything else I should keep in mind?

ArchiveBox has full-text indexing built in already, why not call out to that directly instead of building your own index on top of it?

`archivebox list --filter-type=search 'some text to search'`

This uses [Sonic](https://github.com/valeriansaliou/sonic) if you have it [set up](https://github.com/ArchiveBox/ArchiveBox/blob/dev/docker-compose.yml#L30), or `ripgrep` as the fallback if you don't have Sonic enabled.

@zblesk commented on GitHub (Jun 29, 2021):

Thank you for the quick response, and for the specific commands!
I will probably end up not using them directly; I'd like to avoid having my application execute chains of bash commands on the host system, especially when both the app and archivebox are running in Docker.
(However, I know I can easily give my app access to the sqlite file for reading.)
Still, I can look at the source code for those commands for inspiration, if needed.
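Reading the SQLite file from a separate application could be sketched like this. It relies on Python's standard `sqlite3` URI support with `mode=ro`, so the connection cannot write to the archive's database by accident. `open_readonly` is a hypothetical helper, and the database path (for example a mounted `index.sqlite3`) is an assumption about how the data directory is shared between containers.

```python
import sqlite3
from pathlib import Path


def open_readonly(db_path):
    """Open an existing SQLite file in read-only mode via a file: URI."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)
```

Any write attempted through such a connection raises `sqlite3.OperationalError`, which keeps the consuming application strictly on the reading side.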

> ArchiveBox has full-text indexing built in already, why not call out to that directly instead of building your own index on top of it?

I've... not noticed this, to be honest.
It won't really solve my use case: I'm building a universal search that will search through all of my various data sources, so it's not just ArchiveBox.
I haven't done that before, and am using Elasticsearch because it seems popular; but now that you've told me about it, I'll take a look at Sonic. It looks promising and might even end up suiting my modest needs better.

And thanks for your work on Archivebox! It's great.
