mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1497] Sonic indexing not working on v0.7.2 bare metal install (needed pip install archivebox[sonic]) #2392
Originally created by @pirate on GitHub (Aug 27, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1497
Moved from: https://github.com/ArchiveBox/ArchiveBox/issues/1170#issuecomment-2313002236
@virtadpt posted:
I'm seeing the same problem with a bare-metal install of Archivebox (Python v3.11.6, Archivebox v0.7.2 installed with pip into a venv, Sonic v1.4.0 running on the same machine, listening on port 1491/tcp (and can be connected to with telnet)).
My ArchiveBox.conf file:
I turned off a couple of things to minimize the number of variables to keep track of when debugging.
Setting `loglevel=debug` for Sonic just results in this over and over:
...so it doesn't look like Archivebox is even contacting the Sonic server to send it stuff to index.
Running `DEBUG=True archivebox init` in my archive directory doesn't result in anything interesting in `logs/errors.log` ("> /home/drwho/archivebox/bin/archivebox init; TS=2024-08-27__16:09:32 VERSION=0.7.2 IN_DOCKER=False IS_TTY=True") or any novel output to the terminal compared to without `DEBUG=True` (i.e., only "Verifying and updating existing ArchiveBox collection to v0.7.2...", "Verifying archive folder structure...", "Verifying main SQL index and running any migrations needed...", and so forth, but no actual debugging output; the same thing happens if I put `DEBUG=True` in ArchiveBox.conf).
Originally posted by @virtadpt in https://github.com/ArchiveBox/ArchiveBox/issues/1170#issuecomment-2313002236
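As a scripted stand-in for the telnet check mentioned in the report, here is a minimal Python sketch that tests whether anything accepts TCP connections on the Sonic port. The host `127.0.0.1` and the helper name `port_is_open` are assumptions made for illustration; only port 1491 comes from the report.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError (refused, timeout, unreachable) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# 127.0.0.1 is an assumed host; 1491 is the port named in the report.
print(port_is_open("127.0.0.1", 1491))
```

A `True` result only shows that a listener exists. It says nothing about whether ArchiveBox itself ever tries to connect, which is the question this thread is about.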
@virtadpt commented on GitHub (Aug 27, 2024):
@pirate It does not. I even tried running Sonic in one terminal and `archivebox update --index-only` in another (both in debug mode) and Sonic never received any connections from archivebox.
Readability... I thought that had been installed with `archivebox setup` but kept erroring out, which is why I turned it off. I just installed nodejs-readability-cli, turned it on in ArchiveBox.conf, and set `READABILITY_BINARY = /usr/bin/readable` to make sure it could be found. `archivebox update --index-only` is running right now and will take a while.
Incidentally, that seems like a thing that should be in the "how to configure search" documentation. Just what Readability is doesn't seem to be in the docs anywhere (and searching for it turns up a bunch of AI slop about readability in the context of writing papers and advertising and stuff).
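Before pointing `READABILITY_BINARY` at a path, it can help to confirm that the path resolves to an executable. A minimal sketch using only the Python standard library; the helper name `resolve_binary` is hypothetical, and this is not how ArchiveBox itself locates its dependencies.

```python
import shutil
from typing import Optional


def resolve_binary(path_or_name: str) -> Optional[str]:
    """Return the path of an executable, or None if it cannot be found.

    Accepts either a bare command name (looked up on PATH) or a full path.
    """
    return shutil.which(path_or_name)


# /usr/bin/readable is the path used in the comment above.
print(resolve_binary("/usr/bin/readable"))
```

If this prints `None`, the binary is either missing or not executable by the current user.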
@pirate commented on GitHub (Aug 27, 2024):
@virtadpt I need three things to help:
1. `archivebox version`
2. `archivebox add 'https://example.com/#123456'` (I need to see which methods are working / any errors)
3. `archivebox update --index-only` + check if that sends stuff to Sonic
Please note if you have `SAVE_READABILITY = False` there may not be anything to send to Sonic if any of the other text-based output methods fail, as Sonic only indexes specific text-based output types.
@virtadpt commented on GitHub (Aug 27, 2024):
@virtadpt commented on GitHub (Aug 27, 2024):
@virtadpt commented on GitHub (Aug 27, 2024):
`archivebox update --index-only` is executing right now. When it's done I'll paste the output.
Incidentally, I installed Readability (I think) and enabled it (`archivebox config --set READABILITY_BINARY=/usr/bin/readable`) before this --index-only run.
@pirate commented on GitHub (Aug 27, 2024):
Ok so because both single-file and readability were missing, you were running on steam as far as full-text generation goes. You were relying on only htmltotext and mercury, and mercury is deprecated: the company that built it unfortunately abandoned the project about a year ago, so it hasn't been updated in a long time.
It's not something I document explicitly because search is constantly improving and evolving as we add, remove, and re-configure extractors. The new-ish `htmltotext` extractor was added as a fallback to cover this exact circumstance where there are no working JS-based extractors, for example.
It looks like htmltotext and mercury both worked in your example though, so let's see why they're not sending anything to sonic.
Can you post the contents of `~/ArchiveBox/archive/1724785476.865335/index.json` please? It will show exactly what texts were sent to sonic (if any). In the past almost all of the issues people have run into with sonic have been one of these two:
I want to really rule out those two common causes before diving into more intricate debugging.
Also note v0.8.x is bringing lots of improvements to the dependency management (single-file, readability, sonic, etc. should all auto-install more easily and reliably in the future).
@virtadpt commented on GitHub (Aug 27, 2024):
Oh! Something landed in logs/errors.log when I added that URL for example.com:
@virtadpt commented on GitHub (Aug 27, 2024):
Contents of ~/ArchiveBox/archive/1724785476.865335/index.json:
@virtadpt commented on GitHub (Aug 27, 2024):
Incidentally, I'm really looking forward to v0.8.x because the REST API should be part of that.
@virtadpt commented on GitHub (Aug 27, 2024):
Is `archivebox update --index-only` supposed to create those /index.*/ files? I seem to recall reading in a couple of closed tickets' comments that this is deprecated and should not happen anymore.
@virtadpt commented on GitHub (Aug 27, 2024):
It just failed on me. Stack trace:
Is `archivebox setup` supposed to install the module sonic-client or pysonic-channel or python-sonic-client or something? Or is that something that `pip install archivebox` is supposed to do? Or...?
Basically, did I make a procedural mistake (when originally installing Archive Box), an operational mistake (when I installed and set up Sonic, did I not blow away the venv and reinstall Archive Box because `archivebox setup` needs to be re-run whenever something changes), or is it a bug?
@virtadpt commented on GitHub (Aug 27, 2024):
Okay. So I installed the sonic-client module (`pip install sonic-client`) and re-ran `archivebox update --index-only`. It took just over an hour to execute. However, it ran to completion as expected, with no errors reported. Moreover, the previously empty ArchiveBox/sonic/ directory structure now has 21 megabytes of indices in it. Sonic's logs have only one new entry ("(WARN) - took a lot of time: 476ms to process channel message") but that seems to be it.
@pirate commented on GitHub (Aug 28, 2024):
Aha yeah you figured it out, you needed `pip install sonic-client` or `pip install archivebox[sonic]` to use Sonic on a bare metal install (Docker comes with it included). `archivebox setup` does not install it, that's only for extractor dependencies. I've just updated the docs to make sure that's clear.
This is fixed already in v0.8.x
github.com/ArchiveBox/ArchiveBox@6a4e568d1b (diff-5fabc1178e)
It is indeed not supposed to create those, you are correct this is a bug. I'll fix it in v0.8, thanks for reporting!
https://github.com/ArchiveBox/ArchiveBox/issues/1500
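The root cause in this thread was a missing Python client library inside the venv. A quick way to check for that, sketched below, is to ask the interpreter that runs ArchiveBox whether the module can be found. The module name `sonic` is an assumption about what the `sonic-client` package installs, and `module_available` is a hypothetical helper, not part of ArchiveBox.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


# "sonic" is the assumed import name provided by `pip install sonic-client`.
print(module_available("sonic"))
```

Run this with the same Python interpreter (the same venv) that ArchiveBox uses; a system-wide interpreter may give a different answer.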
Note in your `~/ArchiveBox/archive/1724785476.865335/index.json` it shows that it produced no indexable text for that URL, you can see all the `"index_texts": null,` and `"index_texts": [],` entries, so that page won't have any text added to the Sonic index.
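To check a snapshot for this condition without reading the JSON by eye, the sketch below walks the whole document and reports every `index_texts` entry it finds. It deliberately makes no assumption about where those keys sit in `index.json`; the sample data, its nesting, and the helper names are all hypothetical.

```python
from typing import Any, Iterator, Tuple


def find_index_texts(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for every "index_texts" key anywhere in the JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key == "index_texts":
                yield child, value
            else:
                yield from find_index_texts(value, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_index_texts(item, f"{path}[{i}]")


def has_indexable_text(doc: Any) -> bool:
    """True if at least one "index_texts" entry is a non-empty list."""
    return any(isinstance(v, list) and len(v) > 0 for _, v in find_index_texts(doc))


# Hypothetical sample mimicking the null and [] entries described above.
sample = {"history": {"a": [{"index_texts": None}], "b": [{"index_texts": []}]}}
print(has_indexable_text(sample))  # prints False
```

For a real snapshot, load the file with `json.load` and pass the result to `has_indexable_text`.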