mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 01:26:00 +03:00
[GH-ISSUE #968] Bug: Fails to parse list of URLs txt file #3621
Originally created by @rossvor on GitHub (Apr 20, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/968
Describe the bug
I can't seem to get archivebox to add any URLs from a simple txt file with a newline-separated list of URLs.
Based on the error message, it fails to parse it. I may be doing something wrong.
Steps to reproduce
archivebox add /tmp/urls.txt
Screenshots or log output
Here's the output I get:
ArchiveBox version
@pirate commented on GitHub (Apr 20, 2022):
Are you sure your URLs have schemes at the front? They have to be fully qualified URLs (e.g. google.com/example doesn't work but https://google.com/example does).
Can you post a redacted snippet of your actual urls.txt file?
You can also try to force a specific parser with archivebox add --parse=generic_txt /tmp/urls.txt or archivebox add --parse=url_list /tmp/urls.txt. Also check the contents of sources/1650470713-import.txt to see how ArchiveBox is interpreting the file.
@rossvor commented on GitHub (Apr 20, 2022):
Yep, can confirm that file has fully qualified URLs.
Sure.
I've tried setting the parser explicitly as you suggested. None of them picked up the URLs, with slightly varying errors.
archivebox add --parse=generic_txt /tmp/urls.txt
Result:
archivebox add: error: argument --parser: invalid choice: 'generic_txt'
archivebox add --parse=url_list /tmp/urls.txt
Result:
[X] No links found using URL List parser
The contents of sources/1650479354-import.txt (with all the above variations of parsers) are just the file path itself, so I guess it tries to interpret the path as a URL instead of a path.
Contents of sources/1650479354-import.txt:
/tmp/urls.txt
I can confirm that using input redirection does work fine, so this works:
archivebox add < /tmp/urls.txt
@pirate commented on GitHub (Apr 20, 2022):
Try with --depth=1 and passing the file path as the first argument.
@rossvor commented on GitHub (Apr 20, 2022):
Doesn't seem to change the error
@rossvor commented on GitHub (Apr 20, 2022):
I've also tried this on a fresh Docker-image-based installation and it fails similarly:
/tmp/ff/urls.txt being the same simple file:
@pirate commented on GitHub (Apr 21, 2022):
Ah sorry, I forgot I removed loading directly from a file path in a previous version because it conflicted with the new --depth=1 implementation!
I'll reopen and merge your original PR https://github.com/ArchiveBox/ArchiveBox/pull/967. For future reference, stdin redirection is indeed necessary, or passing --depth=1 /path/to/file.txt also works.
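For readers landing here, the two checks discussed in this thread (URLs must carry a scheme, and the file should be fed on stdin) can be sketched in Python roughly as follows. This is only an illustration, not ArchiveBox's own code: the helper names unqualified_lines and add_via_stdin are hypothetical, and restricting the check to http/https is an assumption made for the example.

```python
import subprocess
from urllib.parse import urlsplit


def unqualified_lines(text: str) -> list[str]:
    """Return non-empty lines that lack an http(s) scheme or a host.

    Hypothetical helper; the http/https restriction is an assumption.
    """
    bad = []
    for line in text.splitlines():
        url = line.strip()
        if not url:
            continue  # skip blank lines
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            bad.append(url)
    return bad


def add_via_stdin(path: str) -> None:
    """Rough equivalent of `archivebox add < path` (hypothetical wrapper)."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    bad = unqualified_lines(text)
    if bad:
        raise ValueError(f"URLs missing a scheme or host: {bad}")
    # Feed the file contents on stdin instead of passing the path as an argument.
    subprocess.run(["archivebox", "add"], input=text, text=True, check=True)
```

Calling add_via_stdin("/tmp/urls.txt") would first reject entries such as google.com/example and then pipe the list to archivebox add, mirroring the redirection that was confirmed to work above.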