mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #643] Questions: Alternate title source, regex woes, updating single links, blacklisting updates, and piping lists to delete/remove links in Docker #400
Originally created by @drpfenderson on GitHub (Feb 1, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/643
Y'all have been so amazingly helpful with bugs and errors, but every once in a while I find a tidbit that my perusing of the issues/documentation can't seem to answer. I'm sure I'm also overlooking some things. No single issue ever seems big enough to ask about on its own, so I've been compiling a few of my queries from the past year into one post. These are definitely not feature requests, more "I think I'm missing something" or not understanding how a command works. Thanks for any clarification you can provide!
I'm having a recurring problem with Cloudflare/whatever blocking wget as a method, so it fails to pull the title of a link and instead just gives me "Access denied | www.example.com used Cloudflare to restrict access". However, singlefile and other methods seem to pull the information and page just fine. Is there a way to force archivebox to pull that information from one of the other capture sources without just disabling all wget captures? Or possibly blacklisting WGET on certain urls?
`docker-compose.yml`. This is my code:
I usually pipe my list to it with `archivebox add < ~/urls.txt`, and have also attempted it with `archivebox add --depth=1 https://pinboard.in/u:username`, but it still adds the blacklisted url to the archive:
I've tried a couple of variations that I have validated (using regex101), but I either run into an `ERROR: Invalid interpolation format for "environment" option in service` error, or it just ignores it and adds the urls from that domain anyway. Another one that validates, but fails to prevent the url from being archived: `URL_BLACKLIST='http(s)?:\/\/(.+)?(pinboard\.in)\/.*'`
Is this a bug that I should report, a docker-compose limitation, or something wrong with my regex? I would love to just point this at my private pinboard feed with a cron task, without picking up those extra links to the RSS spec or the pinboard homepage.
Is there a way to update a specific, single link from the command-line? (Or would that be utilizing the web-interface?) There are some 404s or missing pages sprinkled through my archive, some of them temporary, and I would love to update just those specific links without hitting all errored/unfinished links.
I guess that leads to the question, is it possible to set a page to not be checked/rearchived, even if it has an error? I have certain older links that have partial archives or something, so I don't want to remove them, but the links don't exist on the web anymore and will just indefinitely 404.
Is it possible to pipe a list of urls to remove/delete from docker? I've tried using various `stdout` methods, like `<` or `cat`, with no luck. When it gets to "Do you want to process with removing these? y/n", it just pops back over to the command-line without letting me choose. I've tried using the program `yes` as well, but no dice.
@pirate commented on GitHub (Feb 2, 2021):
The title is actually fetched using `requests.get`, not with wget, but I can see why you want to switch it to Singlefile. If we switch over to playwright and get the headless browser scripting side of things worked out, `title` will be one of the first extractors we'll switch over to JS. That probably won't be for at least 4+ months though.
You can test that your regex works for ArchiveBox by using the python3 shell:
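A minimal sketch of that kind of check, assuming ArchiveBox compiles the pattern with Python's `re` module and searches each URL for it. The helper name `is_blacklisted` and the `re.IGNORECASE` flag are illustrative assumptions, not taken from the ArchiveBox source:

```python
import re

# Pattern copied from the question above; the raw string keeps the backslashes intact.
URL_BLACKLIST = r'http(s)?:\/\/(.+)?(pinboard\.in)\/.*'


def is_blacklisted(url, pattern=URL_BLACKLIST):
    """Hypothetical helper: True if the pattern matches anywhere in the URL."""
    # Assumption: a case-insensitive search; ArchiveBox's real flags may differ.
    return re.search(pattern, url, re.IGNORECASE) is not None


for url in (
    'https://pinboard.in/u:username',
    'https://pinboard.in',
    'https://example.com/page',
):
    print(f'{url} -> {is_blacklisted(url)}')
```

One thing a check like this makes visible: this particular pattern requires a slash after the domain, so a bare `https://pinboard.in` is not matched by it.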
If it works in the shell but not in archivebox, then it's a Docker/env escaping issue. Try setting it with `docker-compose run archivebox config --set URL_BLACKLIST='...'` instead of using environment variables.
The `Archive` button in the UI is also equivalent to running `archivebox update` on it.
Not yet, follow these:
@drpfenderson commented on GitHub (Feb 2, 2021):
`docker-compose.yml` file isn't sufficient. I set it in `config` by passing it like you recommended, it wrote it to my `.conf` file, and that seemed to work! Though I did have to put quotes around the env variable, like `docker-compose run archivebox config --set "URL_BLACKLIST='...'"`, otherwise it threw an error.
`SAVE_GIT=False` and other env variables when running from the webui. It respects those now, so either of these methods is perfect. Fantastic!
`--yes` tag is precisely what I was missing. Worked!
Thanks so much for the time/energy you've taken to help me. Y'all are the best.
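For anyone scripting the piped removal discussed in this thread, a rough sketch of driving it from Python follows. It assumes `archivebox remove` reads its URL patterns from stdin, and that `docker-compose run` needs `-T` (no pseudo-TTY) for piped input to reach the container. The helper names `build_remove_command` and `remove_urls` are hypothetical:

```python
import shlex
import subprocess


def build_remove_command(use_docker=True):
    """Hypothetical helper: argv for a non-interactive `archivebox remove`."""
    # Assumption: -T disables the pseudo-TTY so that piped stdin is passed through.
    prefix = ['docker-compose', 'run', '-T', 'archivebox'] if use_docker else ['archivebox']
    # --yes skips the confirmation prompt mentioned in the question.
    return prefix + ['remove', '--yes']


def remove_urls(urls, use_docker=True):
    """Pipe one URL per line into the remove command (assumes stdin is accepted)."""
    payload = '\n'.join(urls) + '\n'
    return subprocess.run(build_remove_command(use_docker), input=payload, text=True, check=True)


# Print the command for inspection; nothing is removed until remove_urls() is called.
print(shlex.join(build_remove_command()))
```

Calling `remove_urls(['https://example.com/dead-page'])` from the directory containing `docker-compose.yml` would then run the removal without stopping at the prompt.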