[GH-ISSUE #1363] parser=auto will almost always just fall back to parser=generic_txt, needs to let the first parser to find URLs win #2345
Originally created by @jimwins on GitHub (Feb 27, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1363
When you run:

```
archivebox add --parser=rss --depth=1 https://example.org/example.rss
```

it fails, because what happens is that the URL is written to `sources/{{id}}-import.txt` on a line by itself, and then this code tries to read that file with the specified parser: github.com/ArchiveBox/ArchiveBox@f02b27920c/archivebox/main.py (L622-L642)

If I am reading and understanding the code correctly, that first call to `parse_links_from_source` should actually include `parser="url_list"`, because that is always what the file it is reading will be.

Then the call to `parse_links_from_source` on line 640 should include `parser=parser`, so the parser specified on the command line is the one used to parse the contents of the URL that was originally provided.

If I am understanding things correctly, using `--parser=rss` or other non-text parsers with `--depth=0` (the default) will always just fall over, because it will try to read the URL list using the specified parser.
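In other words, the suggestion is roughly this (paraphrasing the two calls; the variable names and signatures here are approximations of the code linked above, not exact):

```python
# First call: sources/{{id}}-import.txt always contains one URL per line,
# so it should always be read with the url_list parser rather than the --parser value:
new_links = parse_links_from_source(import_txt_path, parser='url_list')

# Second call (the --depth=1 pass): only the downloaded *contents* of each of those
# URLs should be parsed with the parser the user gave on the command line:
for new_link in new_links:
    downloaded_path = save_remote_source(new_link.url)   # placeholder for the existing download step
    new_links_depth += parse_links_from_source(downloaded_path,
                                                root_url=new_link.url,
                                                parser=parser)
```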
@pirate commented on GitHub (Feb 27, 2024):
Here's how it should work:
- `--depth=0` + `--parser=None` + passed blob containing URL(s) via stdin/args
- `--depth=0` + `--parser=None` + passed local file path(s) via stdin/args
- `--depth=0` + `--parser=rss` + passed URL(s) or local file path(s) via stdin/args
- `--depth=0` + `--parser=rss` + passed a single raw RSS blob directly via stdin
- `--depth=1` + `--parser=rss` + passed reference(s) to RSS file(s) via stdin/args
- `--depth=1` + `--parser=rss` + passed a single raw RSS blob directly via stdin

This code is brittle and has had many fixes over the years. There are many different edge cases around weird ways users can pass things (e.g. passing local XML file paths inside a single TXT file whose path is passed via STDIN to archivebox).
I should really document all the behaviors and update the test cases at some point... I wrote some tests for it but at this point some are outdated / missing coverage of some edge cases.
I intend to clean it up someday when I add `--depth=2` and `--filter=<filter pattern>` (e.g. `url.netloc == 'example.com'`) support.

@jimwins commented on GitHub (Feb 27, 2024):
Okay, that makes more sense. I think it would help to explicitly throw an error when `--parser` is specified and we get one or more URLs passed on the command line, so I'll look into creating a patch that does that and write some test cases.
(I started down this rabbit hole when I was trying to write test cases for my patch for #1171. That should be easy now that I understand how it is supposed to work!)
@jimwins commented on GitHub (Feb 27, 2024):
Another behavior I question, which I ran across while writing tests: `--parser=auto` tries all of the parsers and picks the one that returns the most URLs, but because the `generic_txt` parser just hoovers up anything that looks like a URL, it gets used even if a more specific parser is successful. This contradicts the following comment in the code: github.com/ArchiveBox/ArchiveBox@f02b27920c/archivebox/parsers/__init__.py#L138-L141

This is problematic with RSS feeds or other XML formats, for example, since they may include URLs for namespaces that the user probably doesn't really intend to archive.

Looks like this behavior was introduced with the commit for the new generic_html parser.

I would have thought it would be problematic there, too, since `generic_txt` would pick up anything URL-like in the HTML file, which will be more than `generic_html` will find in just the `<a href="">` tags.

@pirate commented on GitHub (Feb 29, 2024):
Ah yeah, I know about this shortcoming of `generic_txt` superseding everything else, but at the time we were getting more issues about URLs failing to parse than about too many URLs being parsed, so this was the result of a hasty "fuck it, parse everything at all costs" bad decision.

Great job spelunking through the git history to find the context of that change! And I greatly appreciate your willingness to stick with debugging this even though it's clearly one of the poorest designed / highest technical debt areas of the codebase. I sent you a GitHub Sponsorship as a token of my appreciation, since you're going above and beyond casual contributor level here.
I think now that the dedicated parsers are improving, we can walk back that clobbery design decision and return it to the "first one that succeeds" approach.
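A minimal sketch of that "first one that succeeds" selection (the parser registry and parsers here are simplified stand-ins, not the actual `archivebox/parsers/__init__.py` code):

```python
import json
import re
from typing import Callable, Optional

URL_REGEX = re.compile(r'https?://\S+')

# Illustrative stand-ins for the real parsers; each returns URLs or raises.
def parse_json_export(text: str) -> list[str]:
    return [entry['url'] for entry in json.loads(text)]   # raises if not a JSON export

def parse_generic_txt(text: str) -> list[str]:
    return URL_REGEX.findall(text)                         # catch-all: grabs anything URL-like

# Most-specific parsers first, generic fallbacks last.
PARSERS: list[tuple[str, Callable[[str], list[str]]]] = [
    ('json', parse_json_export),
    ('generic_txt', parse_generic_txt),
]

def parse_with_auto(text: str) -> tuple[Optional[str], list[str]]:
    """Return (parser_name, urls) from the first parser that succeeds,
    instead of whichever parser happens to return the most URLs."""
    for name, parse in PARSERS:
        try:
            urls = parse(text)
        except Exception:
            continue                                       # this parser can't handle the input
        if urls:
            return name, urls                              # first success wins
    return None, []
```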
Assumptions to verify:
@jimwins commented on GitHub (Mar 1, 2024):
That makes more sense. I'll redo my RSS parser pull request to not muck around with this area, and work on adding test cases for how `--parser=auto` should work (and make it work that way).

Still thinking about the complicated interactions of `--parser`, `--depth`, input from stdin, and URLs vs. files on the command line. Right now I'm thinking that it may make sense to add some more explicit commands that don't try to be so clever, and then `add` can be as messy as it wants without cluttering up the underlying implementation as much.

@jimwins commented on GitHub (Mar 4, 2024):
(Thanks for the sponsorship, it was very unexpected!)
This is a long comment, and gets longer every time I try to make it shorter. The 'tl;dr' is that the way archivebox tries to do things now leaves ambiguity and I fear building a tower of workarounds that is going to implode. You can see that in this:
> Either stdin is parsed as an RSS file, or the filenames read from stdin are parsed as RSS; I'm not sure you can really make a reliable determination without doing some hacky and dangerous things.[1]
I pushed test cases[5] that implement all of the scenarios you outlined in this comment, and with the current dev branch, only 4 of the 19 pass. After fixing `parser == 'auto'` to favor the first successful parser, that only improves to 9 of 19. Making the change to interpret `--parser` as I outline below gets us to 15 of 19. The four that still fail are due to the ambiguous behavior above.[6] (I haven't committed that change because I'm not happy with it yet.)

Here's how I am thinking about it:
- We start with an empty list of URLs to archive.[7]
- We get a list of zero or more filenames or URLs on the command line. That's `args = Optional[list[str]]`.
- If `not args`, we read stdin and run it through the parser. It could be RSS, JSON, or just a blob of text with URLs that the `generic_txt` parser will figure out when `parser = "auto"`, and so we add those to our list of URLs to archive.[8]
- Now, `for arg in args`, we have to figure out what each argument is.
- If `arg` is a file, we don't have a URL so it can't be added to our list of URLs, so we will parse the contents and add the URLs contained within to our list of URLs to archive.
- If `arg` looks like a URL, now we have to figure out the user's intent. It could either be a resource that the user wanted to archive, or something like an RSS file where they want us to parse it and archive all the URLs mentioned.

The intent is "obvious" for an RSS feed, but ambiguous for an HTML file, because maybe it's an export file of bookmarks. I don't think we want to always download and parse the resource, because with `--parser=auto` we'll probably always find something in any URL we're given, but for most URLs the user probably really intends to archive that specific resource. (This also would conflict with `--index-only`, which I think shows an intent to not download URLs absent other signals.)

That means that `archivebox add https://example.com/index.rss` will archive `https://example.com/index.rss` and not parse the RSS file (and archive the URLs it finds).

However, I think when the user has specified a parser (`parser != "auto"`), we can use that as a signal that we really want to run every URL in `args` through the parser. So when `--parser=whatever` is specified, we don't archive each URL, we download it and run it through the parser.[9]

Rough code outline without error checking or lots of other option handling:
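A minimal sketch of that outline, following the steps above (`parse_with()` and `download()` are placeholder helpers, not real ArchiveBox functions):

```python
import re
import sys
from pathlib import Path

URL_REGEX = re.compile(r'https?://\S+')

def parse_with(parser: str, text: str) -> list[str]: ...   # dispatch to the chosen parser (placeholder)
def download(url: str) -> str: ...                          # fetch the body of a remote URL (placeholder)

def collect_urls(args: list[str], parser: str = "auto") -> list[str]:
    urls_to_archive: list[str] = []

    if not args:
        # No args: stdin is a blob (RSS, JSON, plain text, ...) to run through the parser.
        return parse_with(parser, sys.stdin.read())

    for arg in args:
        if Path(arg).exists():
            # A local file: parse its contents and archive the URLs found inside it.
            urls_to_archive += parse_with(parser, Path(arg).read_text())
        elif re.findall(URL_REGEX, arg) and parser != "auto":
            # A URL plus an explicit --parser: download it and run it through that parser.
            urls_to_archive += parse_with(parser, download(arg))
        else:
            # A URL with the default parser: the user means to archive this exact resource.
            urls_to_archive.append(arg)

    return urls_to_archive
```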
Note that `--depth` isn't part of this yet. To add that, thinking about it recursively makes more sense to me. Let's wrap all of the above in a function called `add(stdin: Optional[IO] = None, args: List[str], depth: int = 0, parser: str = "auto")`.

Now we can turn our last loop into:
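Roughly, reusing the same placeholder helpers as the previous sketch (a guess at the shape, not exact code):

```python
from typing import IO, Optional

def add(args: list[str], stdin: Optional[IO] = None,
        depth: int = 0, parser: str = "auto") -> list[str]:
    urls_to_archive: list[str] = []

    if stdin is not None and not args:
        return parse_with(parser, stdin.read())

    for arg in args:
        if Path(arg).exists():
            found = parse_with(parser, Path(arg).read_text())
        elif re.findall(URL_REGEX, arg) and parser != "auto":
            found = parse_with(parser, download(arg))
        else:
            urls_to_archive.append(arg)
            continue

        if depth > 0:
            # Recurse: everything we just parsed becomes the args for the next hop.
            urls_to_archive += add(found, depth=depth - 1, parser="auto")
        else:
            urls_to_archive += found

    return urls_to_archive
```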
We can also use `depth` as an additional signal of intent from the user on what to do with arguments they originally gave on the command line, so we can expand that to `re.findall(URL_REGEX, arg) and (parser != "auto" or depth > 0)`.

This can support any arbitrary depth, but we'll want to add some options to control two parts of that -- filtering URLs (so we can restrict the crawl to a specific site, for example) and deciding which leaves of the tree we're walking we actually archive. Back to the RSS example: when we do `archivebox --depth=1 https://example.org/test.rss` we don't want to archive the RSS page, but when we do `archivebox --depth=1 https://example.org/page1.html` we probably do. Or, pushing things out, when we do `archivebox add --depth=2 https://example.org/subscriptions.opml` (where that's an OPML file of RSS subscriptions), we probably are okay with just archiving the `depth = 0` links, but if we were to do `archivebox add --depth=2 https://example.org/` we probably want to archive every page, so maybe we need an option something like `--archive-depth=N`.

I haven't really captured the problematic interplay of `--depth` and `--parser` here either, or how it works with stdin. Because under this sketch, if you run `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=json`, it will fall over because stdin is not in JSON format. I think that `parser` wants to be a stack. With that, I could do something like `archivebox add --depth=2 --parser=opml,rss https://example.com/feeds.opml` to archive every link in each RSS file listed in the OPML file. Or `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=url_list,json`.
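One way to read that parser-stack idea, reusing the same placeholder helpers (purely illustrative; a comma-separated `--parser` list is a proposal here, not an existing flag):

```python
def add_stacked(args: list[str], parsers: list[str], depth: int) -> list[str]:
    """--parser=opml,rss --depth=2: hop 1 parses the OPML with the opml parser,
    hop 2 parses each RSS feed with the rss parser, and whatever the last hop
    yields is what actually gets archived."""
    if depth <= 0:
        return args                                    # nothing left to parse: archive these
    head, rest = parsers[0], (parsers[1:] or ["auto"])
    found = [url for arg in args for url in parse_with(head, download(arg))]
    return add_stacked(found, rest, depth - 1)
```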
save_text_as_sourcefunction is part of the culprit here. As it works now, it gets stdin as a string, splits it into lines, reads the contents of any line that has a path that exists, and then writes back out a file that is stdin with every file it read appended. This explains that hack in the JSON parser. But this is just madness and means multiple files on the command line or read via stdin just can't work, or a mix of files or URLs, or whatever. I haven't yet added test cases that show these failures. ↩︎Filenames can have spaces. And imagine
cat compromised-file | archivebox addwhere someone has figured out how to inject/etc/passwdor some other should-be-secure-path into some automated pipeline. It seems like gratuitously dangerous behavior to leave as a default. ↩︎This gets back to my idea that this really should be two commands.
archivebox archive [url]which only ever archives the URLs (and only URLs) specified (and also takes a list of URLs, and only a list, on stdin) and knows nothing about "parsers," andarchivebox import [url_or_filename_or_stdin]which is like what I'm suggestingarchivebox add --parser=autocould do.4 ↩︎Another idea to reduce ambiguity - don't implicitly read stdin when we have no arguments, require
-to be supplied as a filename. ↩︎Very much a work in progress. In particular, the tests really should be making better use of fixtures to simplify the individual test cases and dump more useful information when they fail. ↩︎
And one (
test_add_file_json) only succeeds because of the combined hacks ofsave_text_as_sourceand the JSON parser handling a first line of junk. If that test case listed two JSON files on the command line, it would fail. ↩︎In my thinking about this, and in tests, I'm using
archivebox add --index-onlybecause I think ofarchivebox addas two steps: collect the URLs to add to the archive, and then add them to a queue of URLs that need the actual snapshot to be taken or updated. That means I don't have to think about--overwrite,--update,--update-all, and--extractbecause those all impact the second step, not the first. ↩︎If we want to also handle filenames via stdin we could add a parser that is only used when parsing stdin that is like
url_listfor files. (This is kind of what happens now insidesave_text_as_sourcebut I would say it only really works accidentally in narrow cases.) Call itfile_listorurl_or_file_listif you want to handle both URLs and filenames on stdin at the same time. But the more I think about it, the more I think handling filenames on stdin is not worth the headaches.2 ↩︎We could distinguish between the unspecified parser and
--parser=autoso thatarchivebox add --parser=auto http://example.com/resourceadded the list of URLs parsed from thehttp://example.com/resourcebut not that URL.3 ↩︎@pirate commented on GitHub (Apr 11, 2024):
After thinking about this a bunch I agree with all of it, especially these:
Firstly, I think I'm going to banish all filesystem paths being used as input unless they start with `file://data/sources/` or `file://data/archive/`. This allows us to keep piping filesystem paths as an internal mechanism without the security risk of importing from anywhere in the filesystem. If people want to import other paths they can always pipe them in manually using `<` stdin redirection.

I also think we should mandate that args can only ever be lists of URIs or snapshot_ids; you can never pass text to be parsed as an arg. These two assertions can be enforced at runtime, and should greatly simplify all our logic.
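A minimal sketch of what enforcing those two rules at runtime might look like (the allowed prefixes come from the comment above; everything else here, including the id regex, is illustrative):

```python
import re

ALLOWED_FILE_PREFIXES = ('file://data/sources/', 'file://data/archive/')
URI_REGEX = re.compile(r'^[a-z][a-z0-9+.-]*://')      # http://, https://, file://, ...
SNAPSHOT_ID_REGEX = re.compile(r'^[0-9A-Z]{26}$')     # illustrative id shape, not the real format

def validate_arg(arg: str) -> str:
    """Args may only be URIs or snapshot ids; file:// URIs must stay inside data/."""
    if arg.startswith('file://') and not arg.startswith(ALLOWED_FILE_PREFIXES):
        raise ValueError(f'Refusing to import filesystem path outside data/: {arg}')
    if not (URI_REGEX.match(arg) or SNAPSHOT_ID_REGEX.match(arg)):
        raise ValueError(f'Args must be URIs or snapshot ids, not raw text: {arg!r}')
    return arg
```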
I've written a new extractor called `out_links` that parses the hrefs/urls in a bytes blob to `out_links.json` during archiving, allowing a snapshot of a URL to be the starting point to get its contents, instead of using a helper func to download remote files. We can move all the parsers to be extractors based on this base extractor (e.g. `rss_outlinks`, `md_outlinks`, `jsonl_outlinks`).

I propose creating three new commands, each with a smaller scope, designed to be chained with unix pipes:
```
archivebox snapshot [url1] [url2] [url3]
echo -e 'https://example.com/a\nhttps://example.com/b\n' | archivebox snapshot
```

Takes only a url_list as args, or JSONL†/url_list via stdin; creates Snapshots (and Tags if needed, as a side effect) for each, and echoes out one created snapshot id per line.

```
archivebox extract --extractors=[all]|screenshot|pdf|...|rss_outlinks|html_outlinks|... --csv=url,timestamp,title,tags [snapshot_id1] [snapshot_id2] [snapshot_id3]
```

Takes a single snapshot_id or a list of snapshot ids as args or stdin, runs the specified extractors on them, and saves the output files into the snapshot directory for each (creating an ArchiveResult for each output file). Echoes out each snapshot id after running extractors on it (so it can be easily chained with other commands).

```
archivebox crawl [--parser=auto|html|rss|jsonl|...] [url1|snapshot_id1] [url2|snapshot_id2]
```

Takes a single starting-point url (or snapshot_id) as an arg and returns outlink URLs. Runs:

```bash
# if a url is passed instead of a snapshot_id, create a snapshot for it:
hop_0_snapshot_id=$(archivebox snapshot [url] --csv=id)

# if parsed outlinks are not present in the snapshot dir, extract them using the chosen parser:
archivebox extract --extract=auto_outlinks $hop_0_snapshot_id

# return parsed outlinks as jsonl or url_list
cat ./archive/$hop_0_snapshot_id/auto_outlinks.jsonl

# does not handle recursing, only echoes out URLs one hop away from the starting point
```
This lets us keep `archivebox add`, but make it just a shortcut for running combos of the above commands. For example (rough equivalents are sketched after this list):

- `archivebox add 'http://example.com'` can become:
- `archivebox add --depth=1 'http://example.com'` can become:
- `archivebox add --parser=rss 'https://example.com/some/feed.xml'` can become:
- `archivebox add --parser=rss --depth=1 < some_feed.xml` can become:

Does that sound reasonable?
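Using only the three commands described above, those equivalents might look roughly like this (the exact flags and chaining are assumptions, not a confirmed CLI):

```bash
# archivebox add 'http://example.com'  (no depth, no parser)
archivebox snapshot 'http://example.com' | archivebox extract --extractors=all

# archivebox add --depth=1 'http://example.com'
archivebox snapshot 'http://example.com' | archivebox extract --extractors=all
archivebox crawl 'http://example.com' | archivebox snapshot | archivebox extract --extractors=all

# archivebox add --parser=rss 'https://example.com/some/feed.xml'
archivebox crawl --parser=rss 'https://example.com/some/feed.xml' \
  | archivebox snapshot | archivebox extract --extractors=all
```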
†: I'm imagining all these commands can yield/ingest bare snapshot IDs or URLs, but it would be nice if they also support a normalized JSONL format for archivebox objects like this so we can pass metadata (like tags, bookmarked_ts, etc.) with each line:
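For instance, each line might be a self-describing JSON object along these lines (the field names here are hypothetical, not a finalized schema):

```
{"type": "snapshot", "url": "https://example.com/a", "tags": ["news", "tech"], "bookmarked_at": "2024-04-11T00:00:00Z"}
{"type": "snapshot", "url": "https://example.com/b", "tags": [], "bookmarked_at": null}
```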
A format like this is extensible as we pass more metadata/object-types in the future, it's streamable (can process each line independently), and it's easy to debug with `jq`.

@pirate commented on GitHub (Oct 6, 2024):
Work on fixing this has finally started in https://github.com/ArchiveBox/ArchiveBox/releases/tag/v0.8.5-rc.
I'm moving forward with the new design as posted above. Snapshots are now uniquely identified by an ABID (essentially a fancy UUID), there is a new Crawl model to track crawls, and there are new pipeable CLI commands started for `crawl`, `snapshot`, and `extract`.

@AramZS commented on GitHub (Nov 30, 2024):
I can confirm that the rss parser doesn't seem to parse RSS feeds. I have an RSS feed I've built specifically of other URLs intended to be processed by ArchiveBox here. The design is a little different than a normal RSS feed in that the `link` elements are all different TLD+1s from the base domain of the feed. I'm just digging into this project, but glad to help if I can talk to expected formats for RSS or anything like that. Def glad to help test stuff too.

@pirate commented on GitHub (Dec 1, 2024):
There's a lot of parser improvement work going on currently in 0.8.6rc0 (dev) but it's still WIP! Will keep posting updates here.