mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #776] Question: Integrating Archivebox into my workflow #490
Originally created by @zblesk on GitHub (Jun 29, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/776
Hello,
there are a few things I'd like Archivebox to do, as part of my personal infrastructure and workflow.
I understand that the HTTP API won't be here anytime soon, so I will work around that.
These are the 3 things I need to do:
I think I can fix #1 by creating a txt endpoint that will contain the links to be archived, and having Archivebox fetch from that.
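A txt endpoint of this kind could be sketched as below, using only the Python standard library. All names here (`render_links`, `make_handler`, the port) are hypothetical, and whether ArchiveBox is pointed at the URL itself or at a file downloaded from it is not settled in this thread; the reply below only shows a local `links.txt` path.

```python
from http.server import BaseHTTPRequestHandler


def render_links(urls):
    """Return the URLs as plain text, one per line, deduplicated in order."""
    # dict.fromkeys keeps first-seen order while dropping repeats.
    unique = dict.fromkeys(u.strip() for u in urls if u.strip())
    return "".join(f"{u}\n" for u in unique)


def make_handler(get_urls):
    """Build a request handler that serves whatever get_urls() returns.

    get_urls is a hypothetical callback supplied by the application; it
    should return an iterable of URL strings waiting to be archived.
    """

    class LinksHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            body = render_links(get_urls()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return LinksHandler


# Example wiring (assumed host and port, not run here):
#   from http.server import HTTPServer
#   pending = ["https://example.com/a", "https://example.com/b"]
#   HTTPServer(("127.0.0.1", 8000), make_handler(lambda: pending)).serve_forever()
```

Keeping the rendering in a separate function makes it easy to write the same text to a file instead, if a file path turns out to be the simpler input.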
For #2 and #3, I have looked at the DB briefly.
Seems I can use `core_snapshot` to fetch a list of URLs I have in my archive.
But how do I find which kinds of archival were successful? Is it somewhere in the DB, or do I have to crawl the filesystem and check "manually"?
And what about the URLs? Will they always be in the form of `archive/<folder name>/readability/content.html` (where the last 2 parts are specific for each archive type)?
Is there anything else I should keep in mind?
Thank you!
@pirate commented on GitHub (Jun 29, 2021):
`archivebox schedule --every=day --depth=1 ./path/to/links.txt`
`archivebox list --csv=timestamp,is_archived 'https://example.com/url/to/check' | grep true`
`cat ./archive/<timestamp>/**/*.{txt,html,htm} | <cmd to add to search index>`
You can use either the CLI `archivebox list --csv=timestamp,is_archived 'https://example.com/url/to/check' | grep true` or you can check the SQL db using the `core_archiveresult` table. It contains all the results you see on the `Log` page in the UI, and you can select using the foreign key to the snapshot and look for any successful `ArchiveResult` entries (which means it's succeeded in archiving).
Yes, however some are stored directly in the root of the archive subfolder like: `archive/<folder name>/singlefile.html` (for legacy reasons).
ArchiveBox has full-text indexing built in already, why not call out to that directly instead of building your own index on top of it?
`archivebox list --filter-type=search 'some text to search'`
This uses Sonic if you have it set up, or `ripgrep` as the fallback if you don't have Sonic enabled.
@zblesk commented on GitHub (Jun 29, 2021):
Thank you for the quick response, and for the specific commands!
I will probably end up not using them directly; I'd like to avoid having my application execute chains of bash commands on the host system, especially when both the app and archivebox are running in Docker.
(However, I know I can easily give my app access to the sqlite file for reading.)
Still, I can look at the source code for those commands for inspiration, if needed.
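Reading the sqlite file directly might look roughly like the sketch below. It rests on assumptions not confirmed in this thread: that `core_archiveresult` has `snapshot_id`, `extractor`, and `status` columns, that `core_snapshot` has `id` and `url`, and that successful rows carry the status `succeeded`. `successful_extractors` is a hypothetical helper name; check the actual schema before relying on it.

```python
import sqlite3
from collections import defaultdict
from pathlib import Path


def successful_extractors(db_path):
    """Map each snapshot URL to the set of extractors that succeeded for it."""
    # Open the file read-only so this app can never modify ArchiveBox's index.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        # ASSUMPTION: table and column names, and the 'succeeded' status
        # value, are taken from memory of the schema, not from this thread.
        rows = conn.execute(
            """
            SELECT s.url, r.extractor
            FROM core_archiveresult AS r
            JOIN core_snapshot AS s ON s.id = r.snapshot_id
            WHERE r.status = 'succeeded'
            """
        ).fetchall()
    finally:
        conn.close()

    found = defaultdict(set)
    for url, extractor in rows:
        found[url].add(extractor)
    return dict(found)
```

A URL that is present in `core_snapshot` but absent from the returned mapping would then have no successful archive results under these assumptions.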
I've... not noticed this, to be honest.
It won't really solve my use-case - I'm building a universal search that will search through all of my various data sources, so it's not just Archivebox.
I haven't done that before, and am using Elasticsearch because that one seems popular; but now that you've told me about it, I'll take a look at Sonic, it looks promising and might even end up suiting my modest needs better.
And thanks for your work on Archivebox! It's great.