[GH-ISSUE #1363] parser=auto will almost always just fall back to parser=generic_txt, needs to let the first parser to find URLS win #2345

Open
opened 2026-03-01 17:58:20 +03:00 by kerem · 10 comments
Owner

Originally created by @jimwins on GitHub (Feb 27, 2024).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1363

When you run:

```sh
archivebox add --parser=rss --depth=1 https://example.org/example.rss
```

It fails, because what happens is that the URL is written to `sources/{{id}}-import.txt` on a line by itself, and then this code tries to read that file with the specified parser:

https://github.com/ArchiveBox/ArchiveBox/blob/f02b27920c41a9a1182da4d1871f7ba693c20c3a/archivebox/main.py#L622-L642

If I am reading and understanding the code correctly, that first call to `parse_links_from_source` should actually include `parser="url_list"`, because the file it is reading will always be a plain list of URLs.

Then the call to `parse_links_from_source` on line 640 should include `parser=parser`, so that the parser specified on the command line is the one used to parse the contents of the URL that was originally provided.

If I am understanding things correctly, using `--parser=rss` or other non-text parsers with `--depth=0` (the default) will always just fall over, because it will try to read the URL list using the specified parser.
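To make the proposed split concrete, here is a toy sketch (the parsers are crude stand-ins I made up, not ArchiveBox's real ones) of why the two calls need different parsers: the import file is always a bare URL list, while the user's `--parser` should only apply to the content each URL points at.

```python
import re

URL_REGEX = re.compile(r'https?://[^\s<>"\']+')

def parse_url_list(text):
    # Stand-in for the url_list parser: one URL per line,
    # like the contents of sources/{id}-import.txt.
    return [line.strip() for line in text.splitlines()
            if URL_REGEX.match(line.strip())]

def parse_rss(text):
    # Stand-in for the rss parser: only URLs inside <link> tags.
    return re.findall(r'<link>\s*(https?://[^\s<]+)\s*</link>', text)

# Stage 1: the import file is always just a URL list...
import_file = "https://example.org/example.rss\n"
feed_urls = parse_url_list(import_file)

# Stage 2: ...and only the *downloaded content* gets the user's --parser.
downloaded = "<rss><channel><link>https://example.org/post1</link></channel></rss>"
post_urls = parse_rss(downloaded)
```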


@pirate commented on GitHub (Feb 27, 2024):

Here's how it should work:

### `--depth=0` + `--parser=None` + passed blob containing URL(s) via stdin/args

```bash
# treats provided args/stdin as a single any-format blob containing URLs
archivebox add --depth=0 'https://website.one' 'https://website.two'
echo -e 'https://website.one\nhttps://website.two' | archivebox add --depth=0
echo -e '<rss>...some urls</rss>' | archivebox add --depth=0
>>> archivebox.main.add('https://website.one\nhttps://website.two', depth=0)
... saves raw args/stdin into local sources/import_stdin.txt first
... parses sources/import_stdin.txt using autodetected best parser
... then archive each url found as a new Snapshot
```

### `--depth=0` + `--parser=None` + passed local file path(s) via stdin/args

```bash
# parse provided args/stdin as file paths pointing to any-format blobs containing URLs
archivebox add --depth=0 ~/Downloads/some/rss.xml
echo '~/Downloads/some/rss.xml' | archivebox add --depth=0
>>> archivebox.main.add('~/Downloads/some/rss.xml', depth=0)
... saves raw args/stdin into local sources/import_stdin.txt first
... copies each local file into its own sources/some_file.txt
... for each new sources/ file, parse using autodetected best parser
... then archive each url found as a new Snapshot
```

### `--depth=0` + `--parser=rss` + passed URL(s) or local file path(s) via stdin/args

```bash
# parse provided args/stdin as URLs or local file paths (treat as reference(s) to some RSS)
archivebox add --depth=0 --parser=rss 'https://dropbox.tech/feed'
archivebox add --depth=0 --parser=rss '~/Downloads/some/rss/feed.xml'
echo 'https://dropbox.tech/feed' | archivebox add --depth=0 --parser=rss
echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss
... saves raw args/stdin string into a local sources/import__stdin.txt first
... if args/stdin are raw URLs or file paths, download/copy the content each arg *points to*
    into a corresponding file, e.g. sources/example_com_some_rss_feed.xml
... then parse each new sources/ file as rss to get a list of urls to archive
... then archive each first hop url as a new Snapshot
```

### `--depth=0` + `--parser=rss` + passed a single raw RSS blob directly via stdin

```bash
# parse provided args/stdin as RSS directly (using RSS parser to find URLs within)
curl -s 'https://dropbox.tech/feed' | archivebox add --depth=0 --parser=rss
cat ~/Downloads/some/rss/feed.xml | archivebox add --depth=0 --parser=rss
# not allowed: archivebox add --depth=0 --parser=rss '<rss>...</rss>'
>>> archivebox.main.add('<rss>...</rss>', depth=0, parser='rss')
... saves stdin into sources/import_stdin.txt file
... parses sources/import_stdin.txt as rss to get list of URLs
... creates a Snapshot for each found URL
```

### `--depth=1` + `--parser=rss` + passed reference(s) to RSS file(s) via stdin/args

```bash
# parse provided args/stdin as URLs or local file paths (treat as reference(s) to some RSS)
archivebox add --depth=1 --parser=rss 'https://dropbox.tech/feed'
archivebox add --depth=1 --parser=rss ~/Downloads/some/rss/feed.xml
echo 'https://dropbox.tech/feed' | archivebox add --depth=1 --parser=rss
echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=1 --parser=rss
>>> archivebox.main.add('https://example.com/some/rss/feed.xml', depth=1, parser='rss')
... saves raw args/stdin string into a local sources/import__urls.txt first
... if args/stdin are raw URLs or file paths, download/copy the content each arg *points to*
    into a corresponding file, e.g. sources/example_com_some_rss_feed.xml
... then parse each new sources/ file as rss to get a list of urls to archive
... then archive each first hop url as a new Snapshot
... then crawl each url to get a list of URLs a second hop away (depth=1)
... then archive all those subsequent URLs
```

### `--depth=1` + `--parser=rss` + passed a single raw RSS blob directly via stdin

```bash
# parse provided stdin as RSS blob directly (using RSS parser to find URLs within)
curl -s 'https://dropbox.tech/feed' | archivebox add --depth=1 --parser=rss
cat ~/Downloads/some/rss/feed.xml | archivebox add --depth=1 --parser=rss
# not allowed: archivebox add --depth=1 --parser=rss '<rss>...</rss>'  (this should fail)
>>> archivebox.main.add('<rss xmlns:content...>...</rss>', depth=1, parser='rss')
... saves stdin into a local file first: sources/stdin_urls.rss
... parse local file as rss to get a list of urls to archive
... then archive each first hop url as a new Snapshot
... then crawl each url to get a list of URLs a second hop away (depth=1)
... then archive all those subsequent URLs
```

This code is brittle and has had many fixes over the years. There are many different edge cases around the weird ways users can pass things (e.g. passing local XML file paths inside a single TXT file whose path is passed via stdin to archivebox).

I should really document all the behaviors and update the test cases at some point... I wrote some tests for it, but at this point some are outdated or missing coverage of some edge cases.

I intend to clean it up someday when I add `--depth=2` and `--filter=<filter pattern>` (e.g. `url.netloc == 'example.com'`) support.
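The scenarios above can be condensed into a rough dispatch sketch. The naming here is hypothetical (ArchiveBox doesn't expose `classify`/`action` helpers); it just encodes the rule that an explicit `--parser` or a positive `--depth` signals "fetch and parse" for URL/file inputs, while the default treats them as things to archive directly.

```python
import os
import re

URL_REGEX = re.compile(r'^https?://')

def classify(arg: str) -> str:
    """Rough classification of a single CLI arg or stdin line."""
    if URL_REGEX.match(arg):
        return "url"
    if os.path.exists(os.path.expanduser(arg)):
        return "file"
    return "blob"

def action(arg: str, parser: str = "auto", depth: int = 0) -> str:
    """What the scenarios above imply should happen to one input."""
    kind = classify(arg)
    if kind == "blob":
        # raw content on stdin: parse it (autodetect unless --parser given)
        return "parse blob with " + ("autodetected parser" if parser == "auto" else parser)
    if parser != "auto" or depth > 0:
        # explicit parser or depth>0 means the arg is a *reference* to content
        return f"fetch {kind}, parse as {parser}, archive found URLs"
    return "archive directly as a Snapshot"
```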


@jimwins commented on GitHub (Feb 27, 2024):

Okay, that makes more sense. I think it would help to explicitly throw an error when `--parser` is specified and we get one or more URLs passed on the command line, so I'll look into creating a patch that does that and write some test cases.

(I started down this rabbit hole when I was trying to write test cases for my patch for #1171. That should be easy now that I understand how it is supposed to work!)


@jimwins commented on GitHub (Feb 27, 2024):

Another behavior I question, which I ran across while writing tests: `--parser=auto` tries all of the parsers and picks the one that returns the most URLs, but because the `generic_txt` parser just hoovers up anything that looks like a URL, it gets used even when a more specific parser succeeds. This contradicts the following comment in the code:

https://github.com/ArchiveBox/ArchiveBox/blob/f02b27920c41a9a1182da4d1871f7ba693c20c3a/archivebox/parsers/__init__.py#L138-L141

This is problematic with RSS feeds or other XML formats, for example, since they may include URLs for namespaces that the user probably doesn't really intend to archive.

Looks like this behavior was introduced with [the commit for the new generic_html parser](https://github.com/ArchiveBox/ArchiveBox/commit/15efb2d5ed1163fb5f6388646fb167efa7dd1afa).

I would have thought it would be problematic there, too, since `generic_txt` will pick up anything URL-like in the HTML file, which is more than `generic_html` will find in just the `<a href="">` tags.
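A quick illustration of the problem, using crude regexes as stand-ins for `generic_txt` and a dedicated RSS parser: the catch-all regex also picks up the `xmlns` namespace URL from the feed header, so under "most URLs wins" it beats the RSS parser even though the extra URL is junk.

```python
import re

rss = '''<rss xmlns:atom="http://www.w3.org/2005/Atom" version="2.0">
<channel><item><link>https://example.org/post1</link></item></channel></rss>'''

# stand-in for generic_txt: grab anything URL-shaped anywhere in the text
generic_txt_urls = re.findall(r'https?://[^\s<>"\']+', rss)

# stand-in for a dedicated RSS parser: only URLs inside <link> tags
rss_urls = re.findall(r'<link>(https?://[^<]+)</link>', rss)

# generic_txt "wins" on count, but only because of the namespace URL
```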


@pirate commented on GitHub (Feb 29, 2024):

Ah yeah, I know about this shortcoming of `generic_txt` superseding everything else, but at the time we were getting more issues about URLs failing to parse than about too many URLs being parsed, so this was the result of a hasty `fuckit, parse everything at all costs` bad decision.

Great job spelunking through the git history to find the context of that change! And I greatly appreciate your willingness to stick with debugging this even though it's clearly one of the poorest-designed, highest-technical-debt areas of the codebase. I sent you a GitHub Sponsorship as a token of my appreciation, since you're going above and beyond casual contributor level here.

I think now that the dedicated parsers are improving, we can walk back that clobbery design decision and return it to the "first one that succeeds" approach.

Assumptions to verify:

- a parser returning 0 URLs should count as a failure, even if it didn't throw an Exception
- there are no cases where a parser might return only a subset of the URLs without erroring while a later parser might find more (excluding garbage atom/schema URLs, etc. found in XML headers)
- the parser list is sorted from most-specific to least-specific
- manually specifying a parser should try only that one, with no fallbacks
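A minimal sketch of the "first one that succeeds" approach under those assumptions (toy parsers, hypothetical names): try parsers in most-specific-first order, treat zero URLs as a failure, and skip fallbacks entirely when a parser is named explicitly.

```python
import re

def parse_rss(text):
    # toy dedicated parser: only URLs inside <link> tags
    return re.findall(r'<link>(https?://[^<]+)</link>', text)

def parse_generic_txt(text):
    # toy catch-all parser: anything URL-shaped
    return re.findall(r'https?://[^\s<>"\']+', text)

# ordered most-specific first, generic_txt last
PARSERS = [("rss", parse_rss), ("generic_txt", parse_generic_txt)]

def parse_first_success(text, parser="auto"):
    if parser != "auto":
        # manually specified parser: try only that one, no fallbacks
        candidates = [(n, fn) for n, fn in PARSERS if n == parser]
    else:
        candidates = PARSERS
    for name, fn in candidates:
        try:
            urls = fn(text)
        except Exception:
            continue  # an exception counts as a failure
        if urls:      # 0 URLs also counts as a failure
            return name, urls
    raise ValueError("no parser found any URLs")
```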

@jimwins commented on GitHub (Mar 1, 2024):

That makes more sense. I'll redo my RSS parser pull request to not muck around with this area, and work on adding test cases for how `--parser=auto` should work (and make it work that way).

Still thinking about the complicated interactions of `--parser`, `--depth`, input from stdin, and URLs vs. files on the command line. Right now I'm thinking that it may make sense to add some more explicit commands that don't try to be so clever, so that `add` can be as messy as it wants without cluttering up the underlying implementation as much.


@jimwins commented on GitHub (Mar 4, 2024):

(Thanks for the sponsorship, it was very unexpected!)

This is a long comment, and it gets longer every time I try to make it shorter. The tl;dr is that the way archivebox tries to do things now leaves ambiguity, and I fear building a tower of workarounds that is going to implode. You can see that in this:

```sh
$ echo '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss
$ cat '~/Downloads/some/rss/feed.xml' | archivebox add --depth=0 --parser=rss
```

Either stdin is parsed as an RSS file, or the filenames read from stdin are parsed as RSS; I'm not sure you can make a reliable determination without doing some hacky and dangerous things.[1]

I [pushed test cases](https://github.com/jimwins/ArchiveBox/tree/issue-1363)[5] that implement all of the scenarios you outlined in [this comment](https://github.com/ArchiveBox/ArchiveBox/issues/1363#issuecomment-1966177173), and with the current dev branch, only 4 of the 19 pass. After fixing `parser == 'auto'` to favor the first successful parser, that only improves to 9 of 19. Making the change to interpret `--parser` as I outline below gets us to 15 of 19. The four that still fail are due to the ambiguous behavior above.[6] (I haven't committed that change because I'm not happy with it yet.)

Here's how I am thinking about it:

We start with an empty list of URLs to archive.[7]

We get a list of zero or more filenames or URLs on the command line. That's `args: Optional[list[str]]`.

If `not args`, we read stdin and run it through the parser. It could be RSS, JSON, or just a blob of text with URLs that the `generic_txt` parser will figure out when `parser == "auto"`, and so we add those to our list of URLs to archive.[8]

Now, `for arg in args`, we have to figure out what each argument is.

If `arg` is a file, we don't have a URL, so it can't be added to our list of URLs directly; instead we parse its contents and add the URLs contained within to our list of URLs to archive.

If `arg` looks like a URL, we have to figure out the user's intent. It could either be a resource the user wants to archive, or something like an RSS file that they want us to parse so we archive all the URLs it mentions.

```sh
$ archivebox add https://example.com/feed.rss https://example.com/content.html
```

The intent is "obvious" for an RSS feed, but ambiguous for an HTML file, because maybe it's an export file of bookmarks.

I don't think we want to always download and parse the resource, because with `--parser=auto` we'll probably always find something in any URL we're given, but for most URLs the user probably intends to archive that specific resource. (This would also conflict with `--index-only`, which I think signals an intent to not download URLs absent other signals.)

That means that `archivebox add https://example.com/index.rss` will archive `https://example.com/index.rss` and **not** parse the RSS file (and archive the URLs it finds).

However, I think when the user has specified a parser (`parser != "auto"`), we can use that as a signal that they really want to run every URL in `args` through the parser. So when `--parser=whatever` is specified, we don't archive each URL; we download it and run it through the parser.[9]

Rough code outline without error checking or lots of other option handling:

```python
urls_to_archive = []

if not args:
    urls_to_archive += parse_file(stdin, parser)

for arg in args:
    if parser == "auto" and re.findall(URL_REGEX, arg):
        urls_to_archive += [arg]
    else:
        urls_to_archive += parse_url_or_filename(arg, parser)

for url in urls_to_archive:
    queue_for_archive(url)
```

Note that `--depth` isn't part of this yet. To add that, thinking about it recursively makes more sense to me. Let's wrap all of the above in a function called `add(args: list[str], stdin: Optional[IO] = None, depth: int = 0, parser: str = "auto")`.

Now we can turn our last loop into:

```python
for url in urls_to_archive:
    if already_queued(url):
        continue  # skip URLs we've already queued
    queue_for_archive(url)
    if depth > 0:
        add(args=parse_url_or_filename(url), depth=depth - 1, parser="auto")
```

We can also use `depth` as an additional signal of intent from the user about what to do with the arguments they originally gave on the command line, so the condition for downloading and parsing a URL argument (rather than archiving it directly) becomes `re.findall(URL_REGEX, arg) and (parser != "auto" or depth > 0)`.

This can support any arbitrary depth, but we'll want to add some options to control two parts of it: filtering URLs (so we can restrict the crawl to a specific site, for example) and deciding which leaves of the tree we're walking we actually archive. Back to the RSS example: when we do `archivebox add --depth=1 https://example.org/test.rss` we don't want to archive the RSS page itself, but when we do `archivebox add --depth=1 https://example.org/page1.html` we probably do. Pushing things out further, when we do `archivebox add --depth=2 https://example.org/subscriptions.opml` (where that's an OPML file of RSS subscriptions), we're probably okay with just archiving the depth = 0 links, but if we were to do `archivebox add --depth=2 https://example.org/` we probably want to archive every page. So maybe we need an option something like `--archive-depth=N`.

I haven't really captured the problematic interplay of `--depth` and `--parser` here either, nor how it works with stdin. Under this sketch, if you run `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=json`, it will fall over because stdin is not in JSON format. I think `parser` wants to be a stack. With that, I could do something like `archivebox add --depth=2 --parser=opml,rss https://example.com/feeds.opml` to archive every link in each RSS file listed in the OPML file. Or `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=url_list,json`.
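A sketch of what such a parser stack could look like (hypothetical, not existing ArchiveBox behavior): each parser in the chain runs over the content fetched from the URLs the previous stage produced, with `fetch` as a stand-in for downloading a URL's content.

```python
import re

def parse_url_list(text):
    # one URL per line
    return [l.strip() for l in text.splitlines() if l.strip().startswith("http")]

def parse_rss(text):
    # toy RSS parser: only URLs inside <link> tags
    return re.findall(r'<link>(https?://[^<]+)</link>', text)

PARSERS = {"url_list": parse_url_list, "rss": parse_rss}

def parse_chain(text, chain, fetch):
    """Apply a comma-separated parser chain like 'url_list,rss'.

    The first parser runs on the raw input; each later parser runs on the
    fetched content of every URL the previous stage found."""
    names = chain.split(",")
    urls = PARSERS[names[0]](text)
    for name in names[1:]:
        next_urls = []
        for url in urls:
            next_urls += PARSERS[name](fetch(url))
        urls = next_urls
    return urls

# usage sketch with a fake fetch() backed by a dict:
feeds = {"https://example.com/1.rss": "<link>https://example.com/a</link>"}
stdin_blob = "https://example.com/1.rss\n"
found = parse_chain(stdin_blob, "url_list,rss", feeds.get)
```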


  1. The `save_text_as_source` function is part of the culprit here. As it works now, it gets stdin as a string, splits it into lines, reads the contents of any line that is a path that exists, and then writes back out a file that is stdin with every file it read appended. This explains the hack in the JSON parser. But this is just madness, and it means multiple files on the command line or read via stdin just can't work, or a mix of files and URLs, or whatever. I haven't yet added test cases that show these failures.

  2. Filenames can have spaces. And imagine `cat compromised-file | archivebox add` where someone has figured out how to inject `/etc/passwd` or some other should-be-secure path into some automated pipeline. It seems like gratuitously dangerous behavior to leave as a default.

  3. This gets back to my idea that this really should be two commands: `archivebox archive [url]`, which only ever archives the URLs (and only URLs) specified (and also takes a list of URLs, and only a list, on stdin) and knows nothing about "parsers," and `archivebox import [url_or_filename_or_stdin]`, which is like what I'm suggesting `archivebox add --parser=auto` could do.[4]

  4. Another idea to reduce ambiguity: don't implicitly read stdin when we have no arguments; require `-` to be supplied as a filename.

  5. Very much a work in progress. In particular, the tests really should be making better use of fixtures to simplify the individual test cases and dump more useful information when they fail.

  6. And one (`test_add_file_json`) only succeeds because of the combined hacks of `save_text_as_source` and the JSON parser handling a first line of junk. If that test case listed two JSON files on the command line, it would fail.

  7. In my thinking about this, and in tests, I'm using `archivebox add --index-only`, because I think of `archivebox add` as two steps: collect the URLs to add to the archive, then add them to a queue of URLs that need an actual snapshot to be taken or updated. That means I don't have to think about `--overwrite`, `--update`, `--update-all`, and `--extract`, because those all impact the second step, not the first.

  8. If we want to also handle filenames via stdin, we could add a parser, used only when parsing stdin, that is like `url_list` but for files. (This is kind of what happens now inside `save_text_as_source`, but I would say it only really works accidentally in narrow cases.) Call it `file_list`, or `url_or_file_list` if you want to handle both URLs and filenames on stdin at the same time. But the more I think about it, the more I think handling filenames on stdin is not worth the headaches.[2]

  9. We could distinguish between an unspecified parser and `--parser=auto`, so that `archivebox add --parser=auto http://example.com/resource` adds the list of URLs parsed from `http://example.com/resource` but not that URL itself.[3]

Now we can turn our last loop into: ```python for url in urls_to_archive: if already_queued(url): continue # skip URLs we've already queued queue_for_archive(url) if depth > 0: add(args= parse_url_or_filename(url), depth= depth - 1, parser="auto") ``` We can also use ``depth`` as an additional signal of intent from the user on what to do with arguments they originally gave on the command line, so we can expand that to `re.findall(URL_REGEX, arg) and (parser != "auto" or depth > 0)`. This can support any arbitrary depth, but we'll want to add some options to control two parts of that -- filtering URLs (so we can restrict the crawl to a specific site, for example) and deciding which leaves of the tree we're walking we actually archive. Back to the RSS example, when we do `archivebox --depth=1 https://example.org/test.rss` we don't want to archive the RSS page, but when we do `archivebox --depth=1 https://example.org/page1.html` we probably do, or pushing things out, when we do `archivebox add --depth=2 https://example.org/subscriptions.opml` (where that's an OPML file of RSS subscriptions), we probably are okay with just archiving the `depth = 0` links, but if we were to do `archivebox add --depth=2 https://example.org/` we probably want archive every page, so maybe we need an option something like `--archive-depth=N`. I haven't really captured the problematic interplay of `--depth` and `--parser` here, either and how it works with stdin. Because under this sketch, if you run `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=json"` if will fall over because stdin is not in JSON format. I think that `parser` wants to be a stack. With that, I could do something like `archivebox add --depth=2 --parser=opml,rss https://example.com/feeds.opml` to archive every link in each RSS file listed in the OPML file. Or `echo -e "https://example.com/1.json\nhttps://example.com/2.json" | archivebox add --parser=url_list,json"`.
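Pulling the two loops and the depth/parser signals above together, here is one way the recursive shape could look as runnable Python. This is a sketch of the idea only: `parse_url_or_filename`, `queue_for_archive`, and `FAKE_CONTENTS` are hypothetical stand-ins, not ArchiveBox's real internals.

```python
import re
from typing import IO, List, Optional

URL_REGEX = re.compile(r'https?://\S+')

# Hypothetical stand-in data: maps a "feed" URL to the URLs a parser would find in it.
FAKE_CONTENTS = {
    'https://example.com/feed.rss': ['https://example.com/a', 'https://example.com/b'],
}

def parse_url_or_filename(arg: str, parser: str = 'auto') -> List[str]:
    """Pretend to download/read `arg` and run it through `parser`."""
    return FAKE_CONTENTS.get(arg, [])

queued: List[str] = []

def queue_for_archive(url: str) -> None:
    queued.append(url)

def add(args: List[str], stdin: Optional[IO] = None,
        depth: int = 0, parser: str = 'auto') -> None:
    urls_to_archive: List[str] = []
    if not args and stdin is not None:
        urls_to_archive += parse_url_or_filename(stdin.read(), parser)
    for arg in args:
        # a bare URL is archived as-is unless --parser or --depth signals we should parse it
        if parser == 'auto' and depth == 0 and URL_REGEX.match(arg):
            urls_to_archive.append(arg)
        else:
            urls_to_archive += parse_url_or_filename(arg, parser)
    for url in urls_to_archive:
        if url in queued:
            continue  # skip URLs we've already queued
        queue_for_archive(url)
        if depth > 0:
            add(args=parse_url_or_filename(url), depth=depth - 1, parser='auto')
```

With this shape, `add(['https://example.com/feed.rss'], parser='rss')` queues the feed's entries rather than the feed itself, matching the intent rule described in the comment.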

@pirate commented on GitHub (Apr 11, 2024):

After thinking about this a bunch I agree with all of it, especially these:

> We could distinguish between the unspecified parser and `--parser=auto` so that `archivebox add --parser=auto http://example.com/resource` added the list of URLs parsed from `http://example.com/resource` but not that URL.

> This gets back to my idea that this really should be two commands: `archivebox archive [url]`, which only ever archives the URLs (and only URLs) specified (and also takes a list of URLs, and only a list, on stdin) and knows nothing about "parsers," and `archivebox import [url_or_filename_or_stdin]`, which is like what I'm suggesting `archivebox add --parser=auto` could do.

Firstly I think I'm going to banish all filesystem paths being used as input unless they start with `file://data/sources/` or `file://data/archive/`. This allows us to keep piping filesystem paths as an internal mechanism without the security risk of importing from anywhere in the filesystem. If people want to import other paths they can always pipe them in manually using `<` stdin redirection.

I also think we should mandate that args can only ever be lists of URIs or snapshot_ids, you can never pass text to be parsed as an arg. These two assertions can be enforced at runtime, and should greatly simplify all our logic.
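As a rough illustration (not ArchiveBox's actual validation code), those two runtime assertions could look something like the sketch below; the snapshot-id pattern here is a placeholder guess, not the real ABID format:

```python
import re

URI_RE = re.compile(r'^(https?|file)://\S+$')
# Placeholder shape for a snapshot id -- the real ABID format may differ.
SNAPSHOT_ID_RE = re.compile(r'^[0-9A-Za-z]{10,32}$')
ALLOWED_FILE_PREFIXES = ('file://data/sources/', 'file://data/archive/')

def validate_args(args):
    """Enforce that every arg is a URI or snapshot_id (never raw text to be parsed),
    and that file:// paths stay inside data/sources/ or data/archive/."""
    for arg in args:
        if arg.startswith('file://'):
            assert arg.startswith(ALLOWED_FILE_PREFIXES), f'forbidden path: {arg!r}'
        else:
            assert URI_RE.match(arg) or SNAPSHOT_ID_RE.match(arg), \
                f'not a URI or snapshot_id: {arg!r}'
```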

I've written a new extractor called `out_links` that parses the hrefs/urls in a bytes blob to `out_links.json` during archiving, allowing a snapshot of a URL to be the starting point to get its contents **instead of using a helper func to download remote files**. We can move all the parsers to be extractors based on this base extractor (e.g. `rss_outlinks`, `md_outlinks`, `jsonl_outlinks`).


I propose creating three new commands, each with a smaller scope, designed to be chained with unix pipes:

- `archivebox snapshot [url1] [url2] [url3]`
  `echo -e 'https://example.com/a\nhttps://example.com/b\n' | archivebox snapshot`
  Takes only url_list as args or JSONL†/url_list via stdin, creates Snapshots (and Tags if needed as a side effect) for each, and echoes out one created snapshot id per line.

- `archivebox extract --extractors=[all]|screenshot|pdf|...|rss_outlinks|html_outlinks|... --csv=url,timestamp,title,tags [snapshot_id1] [snapshot_id2] [snapshot_id3]`
  Takes a single snapshot_id or list of snapshot ids as args or stdin, runs the specified extractors on them, and saves the output files into the snapshot directory for each (creating an ArchiveResult for each output file). Echoes out each snapshot id after running extractors on it (so it can be easily chained with other commands).

- `archivebox crawl [--parser=auto|html|rss|jsonl|...] [url1|snapshot_id1] [url2|snapshot_id2]`
  Takes a single starting-point url (or snapshot_id) as an arg and returns outlink URLs. Runs:

  ```bash
  # if a url is passed instead of a snapshot_id, create a snapshot for it:
  hop_0_snapshot_id=$(archivebox snapshot [url] --csv=id)
  # if parsed outlinks are not present in the snapshot dir, extract them using the chosen parser:
  archivebox extract --extractors=auto_outlinks $hop_0_snapshot_id
  # return parsed outlinks as jsonl or url_list
  cat ./archive/$hop_0_snapshot_id/auto_outlinks.jsonl
  # does not handle recursing, only echoes out URLs one hop away from the starting point
  ```

This lets us keep `archivebox add` but make it just a shortcut for running combos of the above commands.

For example:

`archivebox add 'http://example.com'` can become:

```bash
archivebox snapshot 'https://example.com' > hop_0_snapshot_id.txt
archivebox extract --extractors=all < hop_0_snapshot_id.txt

# aka: archivebox snapshot 'https://example.com' | archivebox extract
```

`archivebox add --depth=1 'http://example.com'` can become:

```bash
archivebox crawl --parser=auto 'https://example.com' > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt

# aka: archivebox crawl 'https://example.com' | archivebox snapshot | archivebox extract
```

`archivebox add --parser=rss 'https://example.com/some/feed.xml'` can become:

```bash
archivebox crawl --parser=rss 'https://example.com/some/feed.xml' > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt

# aka: archivebox crawl --parser=rss 'https://example.com/some/feed.xml' | archivebox snapshot | archivebox extract
```

`archivebox add --parser=rss --depth=1 < some_feed.xml` can become:

```bash
# save stdin to file://data/sources/add_stdin_rss.txt
archivebox crawl --parser=rss file://data/sources/add_stdin_rss.txt > hop_0_outlinks.txt
archivebox snapshot < hop_0_outlinks.txt > hop_0_snapshot_ids.txt
archivebox extract --extractors=all < hop_0_snapshot_ids.txt

for snapshot_id in $(cat hop_0_snapshot_ids.txt); do
    archivebox crawl file://data/archive/$snapshot_id/outlinks.json > hop_1_outlinks.txt
    archivebox snapshot < hop_1_outlinks.txt > hop_1_snapshot_ids.txt
    archivebox extract --extractors=all < hop_1_snapshot_ids.txt
done

# aka:
archivebox crawl --parser=rss 'file://data/sources/add_stdin_rss.txt' |  # outputs urls found in file
  archivebox snapshot |  # creates a snapshot for each first-hop url
  archivebox extract |   # saves all the content for each first hop, outputs snapshot ids
  archivebox crawl |     # gets outlink urls from first-hop snapshots
  archivebox snapshot |  # creates snapshots for each outlink
  archivebox extract     # saves the rest of the content for each outlink
```

Does that sound reasonable?


†: I'm imagining all these commands can yield/ingest bare snapshot IDs or URLs, but it would be nice if they also supported a normalized JSONL format for archivebox objects like this, so we can pass metadata (like tags, bookmarked_ts, etc.) with each line:

```jsonl
{"url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc"}
{"type": "Snapshot", "url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc", "id": "124235353425"}
{"type": "Tag", "name": "sometagabc"}
```

A format like this is extensible as we pass more metadata/object-types in the future, it's streamable (each line can be processed independently), and it's easy to debug with `jq`.
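As a quick illustration of why the line-oriented format is convenient (this is not an ArchiveBox API, just plain Python), a consumer can process the stream one `json.loads` at a time, treating bare `{"url": ...}` lines as Snapshot-ish records:

```python
import json

# Sample stream in the JSONL format proposed above.
jsonl_stream = '''{"url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc"}
{"type": "Snapshot", "url": "https://example.com/some/article.html", "tags": "sometagxyz,sometagabc", "id": "124235353425"}
{"type": "Tag", "name": "sometagabc"}'''

records = []
for line in jsonl_stream.splitlines():
    if not line.strip():
        continue  # each line stands alone, so blanks can be skipped safely
    record = json.loads(line)
    record.setdefault('type', 'Snapshot')  # bare url lines default to Snapshot records
    records.append(record)

urls = [r['url'] for r in records if r['type'] == 'Snapshot']
tags = [r['name'] for r in records if r['type'] == 'Tag']
```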


@pirate commented on GitHub (Oct 6, 2024):

Work on fixing this has finally started in https://github.com/ArchiveBox/ArchiveBox/releases/tag/v0.8.5-rc.

I'm moving forward with the new design as posted above. Snapshots are now uniquely identified by an ABID (essentially a fancy UUID), there is a new Crawl model to track crawls, and there are new pipeable CLI commands started for `crawl`, `snapshot`, and `extract`.


@AramZS commented on GitHub (Nov 30, 2024):

I can confirm that the rss parser doesn't seem to parse RSS feeds. I have an RSS feed I've built specifically of other URLs intended to be processed by ArchiveBox [here](https://aramzs.xyz/amplify/rss/index.xml). The design is a little different than a normal RSS feed in that the `link` elements are all different TLD+1s from the base domain of the feed. I'm just digging into this project, but glad to help if I can talk through expected formats for RSS or anything like that. Definitely glad to help test stuff too.


@pirate commented on GitHub (Dec 1, 2024):

There's a lot of parser improvement work going on currently in 0.8.6rc0 (dev) but it's still WIP! Will keep posting updates here.
