[GH-ISSUE #39] Link Parsing: Add support for plain-text list of URLs #1536

Closed
opened 2026-03-01 17:51:32 +03:00 by kerem · 9 comments

Originally created by @nodiscc on GitHub (Aug 3, 2017).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/39

Currently the input file is required to be in the Netscape HTML export format. I have lists of URLs I'd like to archive, in plain-text format. Converting them to HTML should be doable, but I wish bookmark-archiver would accept plain text files, such as

./archive.py urls.txt

It would be reasonable to filter the file, retaining actual `http(s)?://...` URLs.
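Such a filter is easy to prototype. A minimal sketch in Python (a hypothetical pre-processing step, not part of bookmark-archiver; the regex and helper name are illustrative):

```python
import re

# Match http(s) URLs; stop at whitespace, quotes, and angle brackets.
URL_RE = re.compile(r'https?://[^\s<>"]+')

def extract_urls(text: str) -> list[str]:
    """Return all http(s) URLs in a block of plain text,
    stripping trailing markdown bold markers ("**")."""
    return [u.rstrip('*') for u in URL_RE.findall(text)]

print(extract_urls('see **https://example.com/a** and http://b.c/d'))
# → ['https://example.com/a', 'http://b.c/d']
```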


@pirate commented on GitHub (Aug 3, 2017):

Yeah, I thought about this, but BA doesn't just take URLs; it needs titles and timestamps too to work properly. I'm not sure how to do this easily, since it would require adding support for parsing titles out of the downloaded HTML, which is not something I'm keen on doing (it would also require re-architecting a fair bit of the code).


@nodiscc commented on GitHub (Aug 3, 2017):

I could write a basic txt -> Netscape HTML converter like

```bash
#!/bin/bash
set -o errexit
set -o nounset
##############
textfile="links.txt"
output_html_file="output-netscape.html"
##############
date=$(date +%s)
# find all URLs in the file, excluding trailing "**"
# if present (markdown bold formatting)
urllist=$(grep -E --only-matching 'https?://[^ "*]*' "$textfile")
for pageurl in $urllist; do
    date=$(( date + 1 ))
    # extract the webpage <title> (multi-character RS requires gawk)
    pagetitle=$(wget "$pageurl" -q -O - | awk -v RS="</title>" '/<title>/{gsub(/.*<title>|\n+/,"");print;exit}')
    if [ -z "$pagetitle" ]; then pagetitle="$pageurl"; fi
    htmlentry="<DT><A HREF=\"$pageurl\" ADD_DATE=\"$date\">$pagetitle</A>"
    echo "[info] Processing $pageurl"
    echo "$htmlentry" | tee -a "$output_html_file"
done
echo "[info] Completed. Output saved to $output_html_file"
```

(just a quick bash proof of concept, but I could try to rewrite it in Python)

and let BA work from there. This would add a first pass to find all the links in the text file, gather their page titles, and write a BA-compatible HTML file. The downside is that it would double the number of page downloads; maybe they can be cached? At least it would solve my problem without requiring refactoring.

Another option is to generate an HTML file without page titles :/

```
<DT><A HREF="$pageurl" ADD_DATE="$date">$pageurl</A>
```

What do you think?

Edit: updated script
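For what it's worth, a rough Python equivalent of the shell proof of concept above might look like this (an untested sketch using only the standard library; the function names and the 64 KB read limit are my own choices):

```python
"""Convert a plain-text list of URLs to a minimal Netscape bookmarks file."""
import re
import time
import urllib.request

URL_RE = re.compile(r'https?://[^\s"*]+')
TITLE_RE = re.compile(r'<title[^>]*>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def page_title(url: str) -> str:
    """Fetch a page and return its <title>, falling back to the URL itself."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read(65536).decode('utf-8', errors='replace')
        match = TITLE_RE.search(html)
        return match.group(1).strip() if match else url
    except OSError:
        return url

def convert(text: str) -> str:
    """Emit one Netscape-format <DT> line per URL found in `text`."""
    date = int(time.time())
    entries = []
    for url in URL_RE.findall(text):
        date += 1  # keep ADD_DATE values unique, like the shell version
        entries.append(f'<DT><A HREF="{url}" ADD_DATE="{date}">{page_title(url)}</A>')
    return '\n'.join(entries)

# Usage: print(convert(open('links.txt').read()))
```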


@pirate commented on GitHub (Aug 5, 2017):

Actually, adding timestamps isn't needed; the script already handles deduping them. If you just set all the timestamps to 1, it will parse them as 1.1, 1.2, 1.3, 1.4, ...

The tricky issue is titles, but I predict your parser will work for most sites.
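The deduping described above could work roughly like this (a hypothetical illustration of the idea, not ArchiveBox's actual implementation):

```python
def dedupe_timestamps(timestamps: list[str]) -> list[str]:
    """Make repeated timestamps unique by appending fractional
    suffixes: 1, 1, 1, 1 -> 1.1, 1.2, 1.3, 1.4."""
    counts: dict[str, int] = {}
    result = []
    for ts in timestamps:
        counts[ts] = counts.get(ts, 0) + 1
        # only suffix timestamps that actually collide
        result.append(f'{ts}.{counts[ts]}' if timestamps.count(ts) > 1 else ts)
    return result

print(dedupe_timestamps(['1', '1', '1', '1']))  # → ['1.1', '1.2', '1.3', '1.4']
```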


@sbrl commented on GitHub (Nov 26, 2018):

Great script, @nodiscc! I've taken the liberty of improving it slightly. Here's a gist of the result of 1/2 hour of tinkering: https://gist.github.com/sbrl/95d1c23e18def900aeca35c2f2e57f24


@pirate commented on GitHub (Jan 11, 2019):

This should be working on master now as of https://github.com/pirate/ArchiveBox/commit/cf9d1875c74cff7639c9e6d00036518956a69165.

Give it a try and let me know if it breaks for any reason.


@sbrl commented on GitHub (Jan 12, 2019):

Looks like it's working great for me! I've written a pair of support scripts for my own use:

`archive-custom`

```bash
#!/usr/bin/env bash

export FETCH_MEDIA=true;
export FETCH_SCREENSHOT=false;
export FETCH_PDF=false;
export FETCH_DOM=false;
export RESOLUTION="1920x1080"; # Only used for screenshots
export WGET_USER_AGENT="ArchiveBox/$(git rev-parse HEAD | head -c7) (+https://github.com/pirate/ArchiveBox/) wget/$(wget --version | head -n1 | cut -d" " -f3)";
export OUTPUT_PERMISSIONS=644;

./archive "$@"
```

`archive-url`

```bash
#!/usr/bin/env bash
temp_file=$(mktemp archivebox-url-XXXXXXXXXX.txt)
echo "$1" > "${temp_file}"
export ONLY_NEW=true;
./archive-custom "${temp_file}"
rm "${temp_file}";
```

@pirate commented on GitHub (Jan 14, 2019):

Actually I like your user agent setup with the version included, I'm going to make that the default for everyone. I'll let you know once it's pushed so you can take it out of your script.

I'll also look into making it possible to archive individual URLs without having to make a file, maybe by allowing passing links via stdin.


@pirate commented on GitHub (Jan 14, 2019):

Ok, passing links in via stdin is now supported, and the new WGET_USER_AGENT is added as well. Thanks for your help! 300b5c6

```bash
echo 'https://example.com' | ./archive
```

```python
WGET_USER_AGENT = os.getenv('WGET_USER_AGENT', 'ArchiveBox/{GIT_SHA} (+https://github.com/pirate/ArchiveBox/) wget/{WGET_VERSION}')
```
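The `{GIT_SHA}` and `{WGET_VERSION}` placeholders would presumably be filled in at startup; one plausible sketch of that step (my own guess, not ArchiveBox's actual code; assumes `git` and `wget` are on the PATH):

```python
import subprocess

def build_user_agent(template: str) -> str:
    """Substitute the current git SHA and wget version into the
    user-agent template (illustrative only)."""
    git_sha = subprocess.run(
        ['git', 'rev-parse', '--short=7', 'HEAD'],
        capture_output=True, text=True,
    ).stdout.strip() or 'unknown'
    wget_out = subprocess.run(
        ['wget', '--version'], capture_output=True, text=True,
    ).stdout
    # first line looks like "GNU Wget 1.21.3 built on linux-gnu."
    wget_version = wget_out.split()[2] if wget_out else 'unknown'
    return template.format(GIT_SHA=git_sha, WGET_VERSION=wget_version)
```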

@sbrl commented on GitHub (Jan 14, 2019):

Thanks so much! And thanks for the extra options & features :D
