mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-26 01:26:00 +03:00
[GH-ISSUE #39] Link Parsing: Add support for plain-text list of URLs #3046
Originally created by @nodiscc on GitHub (Aug 3, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/39
Currently the input file is required to be in the Netscape HTML export format. I have lists of URLs I'd like to archive, in plain-text format. Converting them to HTML should be doable, but I wish bookmark-archiver would accept plain-text files, such as:

./archive.py urls.txt

It would be reasonable to filter the file, retaining only actual http(s)?://... URLs.

@pirate commented on GitHub (Aug 3, 2017):
Yeah I thought about this, but BA doesn't just take URLs; it needs titles and timestamps too to work properly. Not sure how to do this easily, since it would require adding support for parsing titles out of the downloaded HTML, which is not something I'm keen on doing (it would require re-architecting a fair bit of the code too).
@nodiscc commented on GitHub (Aug 3, 2017):
I could write a basic txt -> Netscape HTML converter (just a quick bash proof of concept, but I could try to rewrite it in Python) and let BA work from there. This would add a first pass that finds all links in the text file, gathers their page titles, and writes a BA-compatible HTML file. The downside is that it would double the number of page downloads; maybe they can be cached? At least it would solve my problem without requiring refactoring.
Another option is to generate an HTML file without page titles :/
What do you think?
Edit: updated script
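The txt -> Netscape HTML idea discussed above can be sketched in Python. This is a hypothetical helper, not nodiscc's actual script: `extract_urls` and `to_netscape_html` are names invented here, and titles are simply set to the URL itself to sidestep the extra page downloads the thread worries about.

```python
import re
import time

# Rough pattern for the http(s)?://... filtering nodiscc suggests.
URL_RE = re.compile(r'https?://\S+')

def extract_urls(text):
    """Return the http(s) URLs found in a plain-text blob."""
    return URL_RE.findall(text)

def to_netscape_html(urls, timestamp=None):
    """Render a minimal Netscape bookmark export for the given URLs.

    Titles are left as the URLs themselves; fetching real page
    titles would require downloading every page up front.
    """
    ts = int(timestamp if timestamp is not None else time.time())
    lines = [
        '<!DOCTYPE NETSCAPE-Bookmark-file-1>',
        '<TITLE>Bookmarks</TITLE>',
        '<H1>Bookmarks</H1>',
        '<DL><p>',
    ]
    for url in urls:
        lines.append(f'    <DT><A HREF="{url}" ADD_DATE="{ts}">{url}</A>')
    lines.append('</DL><p>')
    return '\n'.join(lines)
```

Feeding the output of this converter to BA would then work without any changes on BA's side, which is the workaround being proposed.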
@pirate commented on GitHub (Aug 5, 2017):
Actually adding timestamps isn't needed; the script handles deduping them already. If you just set all the timestamps to 1, it will parse them as 1.1, 1.2, 1.3, 1.4, ...
The tricky issue is titles, but I predict your parser will work for most sites.
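One plausible reading of the deduping behaviour @pirate describes (all-identical timestamps becoming 1.1, 1.2, 1.3, ...) is sketched below; `dedupe_timestamps` is a hypothetical name, not the actual function in the codebase.

```python
def dedupe_timestamps(timestamps):
    """Make duplicate timestamps unique by appending a .1, .2, ...
    suffix per occurrence, so identical stamps stay distinct keys.

    A sketch of the behaviour described in the thread, not the
    real implementation.
    """
    counts = {}
    result = []
    for ts in timestamps:
        counts[ts] = counts.get(ts, 0) + 1
        result.append(f"{ts}.{counts[ts]}")
    return result
```

With this scheme, a plain-text importer can assign the same placeholder timestamp to every URL and still get unique identifiers for each snapshot.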
@sbrl commented on GitHub (Nov 26, 2018):
Great script, @nodiscc! I've taken the liberty of improving it slightly. Here's a gist of the result of 1/2 hour of tinkering: https://gist.github.com/sbrl/95d1c23e18def900aeca35c2f2e57f24
@pirate commented on GitHub (Jan 11, 2019):
This should be working on master now as of
github.com/pirate/ArchiveBox@cf9d1875c7. Give it a try and let me know if it breaks for any reason.
@sbrl commented on GitHub (Jan 12, 2019):
Looks like it's working great for me! I've written a pair of support scripts for my own use: archive-custom and archive-url.
@pirate commented on GitHub (Jan 14, 2019):
Actually I like your user agent setup with the version included, I'm going to make that the default for everyone. I'll let you know once it's pushed so you can take it out of your script.
I'll also look into making it possible to archive individual URLs without having to make a file, maybe by allowing passing links via stdin.
@pirate commented on GitHub (Jan 14, 2019):
Ok passing links in via stdin is now supported, and the new WGET_USER_AGENT is added as well. Thanks for your help!
300b5c6
@sbrl commented on GitHub (Jan 14, 2019):
Thanks so much! And thanks for the extra options & features :D