[GH-ISSUE #233] PSA: archiving large wallabag collections #1670


Originally created by @anarcat on GitHub (May 6, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/233

## Wiki Page URL

I'm not sure where to put this... I found a mention of Wallabag in https://github.com/pirate/ArchiveBox/wiki/Usage#CLI-Usage and also in https://github.com/pirate/ArchiveBox#can-import-links-from-many-formats

## Suggested Edit

I'm not sure how to phrase this either, but the former says I can export my list of URLs from Wallabag using the "Export" button. It doesn't say which format should be used, but in my case I have over 10,000 links archived in Wallabag, which makes exporting impractical, to say the least: every one of the export buttons just fails with a blank page.

So I made this horrid script to pull all the links into JSON files. It's atrocious, but it works.

```
page=1
# fetch entries 100 at a time; --check-status makes http exit non-zero on a 404
while http --check-status GET 'https://lib3.net/wallabag/api/entries.json?perPage=100&page='$page 'Authorization:Bearer [REDACTED]' > entries-p$page.json; do
    sleep 1
    page=$(($page + 1))
    echo "fetching page $page"
done
```
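
Each resulting `entries-pN.json` file is one page of wallabag's HAL-style API response. Trimmed down to the parts that matter here, it looks roughly like this (field values are illustrative):

```
{
  "page": 2,
  "limit": 100,
  "_embedded": {
    "items": [
      {"id": 1234, "url": "https://example.com/some-article", "title": "..."}
    ]
  }
}
```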

How to get the Bearer token is explained in "How to create my first app" in the very intuitively named "API clients management" section, e.g. http://wallabag.example.com/developer/howto/first-app
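
For the record, getting that token is a standard OAuth2 password grant against wallabag's `/oauth/v2/token` endpoint. Roughly, with the same `http` client (the hostname and credentials are placeholders; the client id and secret come from the app you create under "API clients management"):

```
http --form POST https://wallabag.example.com/oauth/v2/token \
    grant_type=password \
    client_id=YOUR_CLIENT_ID \
    client_secret=YOUR_CLIENT_SECRET \
    username=YOUR_USERNAME \
    password=YOUR_PASSWORD
```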

~~This loop will never complete, because the `http` client is too dumb to return proper exit codes when it hits a 404, and I was too lazy to fix that in the script. Just let it run for a while until you start seeing files like this appear~~ `--check-status` FTW!
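
If you'd rather have explicit status handling than lean on `http`'s exit codes, here's a minimal sketch of the same loop using Python's `requests` (the URL and token are placeholders for your own instance):

```
#!/usr/bin/python3
# Sketch: fetch wallabag entries 100 per page until the API stops returning 200
import requests

BASE_URL = 'https://wallabag.example.com/api/entries.json'  # placeholder
TOKEN = '[REDACTED]'  # placeholder: your Bearer token

page = 1
while True:
    resp = requests.get(
        BASE_URL,
        params={'perPage': 100, 'page': page},
        headers={'Authorization': 'Bearer ' + TOKEN},
    )
    if resp.status_code != 200:  # wallabag 404s past the last page
        break
    with open('entries-p%d.json' % page, 'w') as fp:
        fp.write(resp.text)
    print('fetched page', page)
    page += 1
```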

Then the *real* fun begins! While the wiki says archivebox can parse stuff from wallabag, it can't actually parse that JSON!

```
$ archivebox add pages/entries-p2.json
    > ./sources/entries-p2.json-1557178820.txt

[*] [2019-05-06 21:40:20] Parsing new links from output/sources/entries-p2.json-1557178820.txt...
    > Parsed 0 links as Failed to parse (0 new links added)

[*] [2019-05-06 21:40:20] Writing 101 links to main index...
    √ /srv/backup/archive/archivebox/index.sqlite3
    √ /srv/backup/archive/archivebox/index.json
    √ /srv/backup/archive/archivebox/index.html

[▶] [2019-05-06 21:40:21] Updating content for 0 matching pages in archive...

[√] [2019-05-06 21:40:21] Update of 0 pages complete (0.00 sec)
    - 0 links skipped
    - 0 links updated
    - 0 links had errors

    To view your archive, open:
        /srv/backup/archive/archivebox/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2019-05-06 21:40:21] Writing 101 links to main index...
    √ /srv/backup/archive/archivebox/index.sqlite3
    √ /srv/backup/archive/archivebox/index.json
    √ /srv/backup/archive/archivebox/index.html
```

Undeterred, and unwilling to re-download everything again, I made this stupid Python script to generate a list of URLs:

```
#!/usr/bin/python3
"""Print the URL of every entry in the given wallabag API JSON dumps."""

import json
import sys


def find_urls(fp):
    try:
        blob = json.load(fp)
    except json.decoder.JSONDecodeError:
        return
    if '_embedded' not in blob:
        return
    # wallabag wraps each page HAL-style: {"_embedded": {"items": [...]}}
    for item in blob['_embedded']['items']:
        yield item['url']


def open_files(paths):
    for path in paths:
        with open(path) as fp:
            for url in find_urls(fp):
                yield url


def main():
    for url in open_files(sys.argv[1:]):
        print(url)


if __name__ == '__main__':
    main()
```

... which you call like this:

```
../wtf-wallabag.py $(ls | sort -tp -k2 -n) > ../wallabag.list
```

(the `sort` pipeline orders the files numerically by page number, since a plain lexical sort would put entries-p10.json before entries-p2.json; another oversight of my poor design)
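
Alternatively, the numeric ordering could live in the script itself instead of the shell; a small sketch of what that might look like (not part of the original script), using `glob`:

```
import glob
import re

# order entries-p1.json, entries-p2.json, ... numerically rather than lexically
paths = sorted(glob.glob('entries-p*.json'),
               key=lambda p: int(re.search(r'p(\d+)\.json$', p).group(1)))
```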

Then you can throw archivebox at `wallabag.list` and let it run for a long time.
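
That is just the same `add` command as above, pointed at the plain list of URLs:

```
archivebox add wallabag.list
```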

I know this is not very useful in itself; it would be best to have this in a wiki page. But I truly didn't know where to put it in your carefully crafted wiki (nor whether I could!) so I figured it would still be useful to post it here.
