mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #654] Bugfix: Error: Search Backend only searching default admin search fields #410
Originally created by @alsokpisz on GitHub (Feb 15, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/654
Describe the bug
Bug:
The bug occurs when I attempt to search any query. An error message appears saying:
"Error from the search backend, only showing results from default admin search fields -Error:[Errno -3] Temporary failure in name resolution."
If the search query is a word in the title of a website, it will return results with that word in it.
If it is only in the wget snapshot of the item, it will not return that item.
Context:
I am running ArchiveBox on Windows 10 with docker-compose and have launched the web UI, which I am successfully accessing at http://127.0.0.1:8000. As far as I can tell, all the snapshots are functional and there are no pending links. The output directory is on an external hard drive, but there have been no issues reading/writing from this drive (except for speed, though I can't tell if that's just how the Django Web UI is or not).
Relevant Info:
Bug seems similar to @jdcaballerov comment when search enabled but backend failed in his testing (see screenshot 4).
Steps to reproduce
mkdir archivebox && cd archivebox
curl -O https://raw.githubusercontent.com/ArchiveBox/ArchiveBox/master/docker-compose.yml
Edit the docker-compose yml's volumes section to read:
(unsure if external drive-specific setup needed to reproduce, so wanted to include)
Open a Windows Terminal in administrator mode, navigate to D:/archivebox/, open a Git Bash tab and run the following:
> docker-compose up -d
> docker-compose run archivebox init
> docker-compose run archivebox manage createsuperuser
> docker-compose run archivebox add 'https://www.dailydot.com/parsec/fandom/dieselpunk-steampunk-beginners-guide/'
Navigate to http://127.0.0.1:8000 and search "beginners" (see screenshot 1). Because it is in the title, it will show up. The error message will also show up.
Search "biopunk" (see screenshot 2). Even though it is in the wget file, it will not show up (see screenshot 3). The error message will show up. I have not done extensive testing on whether different filetype snapshots will get searched or not, but I don't think it picks any of them up if they are not in title.
Screenshots or log output
Screenshot 1:

Screenshot 2:

Screenshot 3:

Screenshot 4:

ArchiveBox version
>docker -version
Docker version 20.10.2, build 2291f61
>docker-compose --version
docker-compose version 1.27.4, build 40524192
@jdcaballerov commented on GitHub (Feb 15, 2021):
@alsokpisz The error message describes a misconfigured DNS in the docker-compose setup. If the search backend can't be queried, the search will only cover the URL and title, the default admin fields.
@pirate commented on GitHub (Feb 15, 2021):
@alsokpisz as @jdcaballerov mentioned this is likely a DNS resolving issue inside your docker-compose network. Docker on macOS is infamous for having container DNS issues, so I wouldn't be surprised if Docker on Windows is plagued by similar bugs.
First please make sure you have Sonic's config.cfg file present in ./etc/sonic next to your docker-compose.yml file (if not, create that dir and download the config file within):
Then confirm that Sonic is up and running and accessible from the archivebox container. Can you run these Python commands manually and report back what output you get:
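The commands posted at this point were not preserved in this copy of the thread. A minimal Python sketch of this kind of connectivity check is given here instead. It assumes the Sonic container is reachable under the hostname sonic on port 1491 (Sonic's usual default, not stated in this thread). The helper name probe_tcp_banner is hypothetical and not part of ArchiveBox or Sonic.

```python
import socket


def probe_tcp_banner(host, port, payload=b"Hello, world", timeout=5.0):
    """Open a TCP connection, send a small payload, return the first reply bytes.

    Raises socket.gaierror if the hostname cannot be resolved (the DNS case)
    and another OSError subclass if the connection is refused or times out.
    """
    # create_connection performs the DNS lookup and the TCP connect in one step
    with socket.create_connection((host, port), timeout=timeout) as s:
        s.sendall(payload)
        return s.recv(1024)


# Assumed usage from a Python prompt inside the archivebox container:
#   print('Received', repr(probe_tcp_banner('sonic', 1491)))
```

If inter-container DNS and the TCP socket both work, the call should return some bytes from the server rather than raising. A name resolution failure like the one in the original error message would surface as socket.gaierror.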
@alsokpisz commented on GitHub (Feb 15, 2021):
Several errors in the terminal (see screenshot 1).
Is the tree I've set up above not correct?
EDIT:
I hard-coded the volume specifier again in the Sonic section (screenshot 2), and everything starts up fine (screenshot 3). Of note, the error messages no longer appear on search queries, but the search is still not working correctly.
> docker-compose run archivebox C:/Python37/python
(which I think would be the equivalent command to launch it with Python) just brings up (screenshot 4). Sorry if I'm missing something obvious, I don't see why the way you wrote that command wouldn't cause a subargument issue.
EDIT 2 (one hot cup of coffee later):
> docker ps
has both services running. I made a Python file with the code you posted above.
> py SONICTEST.py
Result:

Screenshots

Screenshot 1.
Screenshot 2.

Screenshot 3.

Screenshot 4.

@pirate commented on GitHub (Feb 16, 2021):
Inside of Docker is always Linux, so having a Windows path in this docker command doesn't make sense:
docker-compose run archivebox C:/Python37/python
Run it verbatim as I posted above, and paste in the script line by line (don't make a file):
@alsokpisz commented on GitHub (Feb 16, 2021):
Causes a subargument issue.
@pirate commented on GitHub (Feb 16, 2021):
try
docker-compose run archivebox shell
@alsokpisz commented on GitHub (Feb 16, 2021):
Result:
Received b'CONNECTED <sonic-server v1.3.0>\r\nENDED '
I tried this when docker ps still shows both services running.
@pirate commented on GitHub (Feb 17, 2021):
Great! That means both the inter-container DNS and the TCP socket to the sonic container are working. Try this next in docker-compose run archivebox shell:
@alsokpisz commented on GitHub (Feb 17, 2021):
Results:
['3ad870d4-82b5-4974-a6ce-ee8cc6a235fa', '5d6734a5-1b9d-418a-a215-8e1e1dbdb8e5', '74696c7d-4421-46b8-8f35-9f1c9537ee1b', 'fca5096f-13da-4d94-8afd-5d742d7b3fb4', '6f009e8d-947a-4fa7-94d7-f21a94c2b525']
EDIT: While troubleshooting why mass-imported links never seem to get the Chrome headless outputs captured (pdf, screenshot, dom), I essentially re-imported all of my links. Two new folders appeared in archivebox/data: fst and kv. The search can now find wget text, but only in admin mode, not in the signed-out mode.
@pirate commented on GitHub (Feb 17, 2021):
Ok, getting closer, sounds like Sonic is working and connected but it's not getting text to index. Can you try running this to force a re-index:
Then you can test full-text search from the CLI like so:
If it works from the Admin and the CLI then we can try and track down why the public index isn't working. If it's broken on the CLI then there's still an issue with the Sonic backend we have to figure out. Thanks for bearing with me here!
@alsokpisz commented on GitHub (Feb 18, 2021):
Seems like any page which is a .pdf or .jpg causes this error during the index command:
And then it hangs.
Sometimes I'd get just the one error. I think this happened after I got rid of the .pdf links. I jotted it down but didn't write any context with it.
After removing all the .pdf/.jpg links, there are no errors in the terminal when I run the re-index command, but it will still spend ages on random pages. Notably stuff with 'weirder' components like live webcam feeds or something. I removed those one by one until it managed to get through the 80ish bookmarks in less than 10 minutes.
The terminal results were the same ones as the admin search, but the public search still didn't work.
@jdcaballerov commented on GitHub (Feb 18, 2021):
@alsokpisz the public search view is not connected to the search backend for security and performance reasons.
@alsokpisz commented on GitHub (Feb 18, 2021):
Well that's that then I suppose.
Is there a way to set a "timeout" per link on the docker-compose run archivebox update --index-only command, so that it will skip links it spends more than, say, a minute trying to index?
@pirate commented on GitHub (Apr 6, 2021):
Ok this should be somewhat improved in f67a5a2. It will be out with the next v0.6 release soon.
You can also try it early by adding this line to your docker-compose config:
build: https://github.com/ArchiveBox/ArchiveBox.git#dev
Comment back here if you're still having issues with indexing failures/hanging and I'll reopen the issue.
@ghost commented on GitHub (Nov 12, 2021):
This is what I get when I follow the troubleshooting:
>>> print('Received', repr(data))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'data' is not defined
Anyone know what I'm doing incorrectly?
@pirate commented on GitHub (Nov 12, 2021):
Looks like you messed up the indentation. Make sure to copy-paste that whole block above together, or remove the extra newline before that print to be doubly sure. @jdqw210