[GH-ISSUE #439] Feature Request: archive.today family integration #294

Closed
opened 2026-03-01 14:42:10 +03:00 by kerem · 14 comments

Originally created by @jaw-sh on GitHub (Aug 13, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/439

The archive.today sites (including archive.is, archive.md, archive.vn, archive.fi, etc.) should have special integrations.

Type

  - [ ] General question or discussion
  - [ ] Propose a brand new feature
  - [x] Request modification of existing behavior or design
What is the problem that your feature request solves

archive.today's webmaster uses the site's status for activism: using a browser the webmaster does not like (e.g. Brave) makes the site unusable. I would like to archive all archive.today links locally.

Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes

There is a .ZIP download available for every archive which can be downloaded, unzipped, and converted into the archive format.
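For illustration, a minimal sketch of that flow in Python. The ZIP download URL is a hypothetical placeholder: archive.today offers a "download .zip" link on each snapshot page, but the URL pattern is not a documented, stable API.

```python
import io
import zipfile

import requests


def import_archive_today_zip(download_url: str, snapshot_dir: str) -> None:
    """Fetch an archive.today snapshot ZIP and unpack it into a local dir.

    `download_url` is hypothetical: archive.today exposes a "download .zip"
    link per snapshot, but the exact URL pattern is not a stable API.
    """
    resp = requests.get(download_url, timeout=60)
    resp.raise_for_status()
    # Unpack the saved page and its assets alongside other snapshot outputs.
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        zf.extractall(snapshot_dir)


# Hypothetical usage -- the real download link would be scraped from the page:
# import_archive_today_zip("https://archive.ph/download/nX7fq.zip", "./archive/nX7fq")
```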

What hacks or alternative solutions have you tried to solve the problem?

Currently, my attempts at `archivebox add`-ing archive.today links result in the archive failing.

How badly do you want this new feature?

  - [ ] It's an urgent deal-breaker, I can't live without it
  - [x] It's important to add it in the near-mid term future
  - [ ] It would be nice to have eventually

  - [x] I'm willing to contribute dev time / money to fix this issue
  - [x] I like ArchiveBox so far / would recommend it to a friend
  - [ ] I've had a lot of difficulty getting ArchiveBox set up

@jaw-sh commented on GitHub (Aug 13, 2020):

When I try to archive an archive.today page, I get errors and the archive is a directory of junk instead of the actual page.


```
[+] [2020-08-13 15:18:24] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597331904-import.txt
    > Parsed 1 URLs from input (Plain Text)
    > Found 1 new URLs not already in index

[*] [2020-08-13 15:18:24] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3
    √ /opt/archive/index.json
    √ /opt/archive/index.html

[▶] [2020-08-13 15:18:25] Collecting content for 1 Snapshots in archive...

[+] [2020-08-13 15:18:25] "archive.is/nX7fq"
    http://archive.is/nX7fq
    > ./archive/1597331904
      > title
        Failed:
            ConnectionError HTTPConnectionPool(host='archive.is', port=80): Max retries exceeded with url: /nX7fq (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 113] No route to host'))
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            curl --silent --max-time 60 --location --compressed --user-agent "ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" http://archive.is/nX7fq

      > favicon
      > wget
        Failed:
            Wget failed or got an error from the server
            Got wget response code: 4.
            failed: No route to host.
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=warc/1597331908 --page-requisites "--user-agent=ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto http://archive.is/nX7fq

      > singlefile
        Failed:
            Exception Failed to chmod: /opt/archive/archive/1597331904/singlefile.html does not exist (did the previous step fail?)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            /opt/SingleFile/cli/single-file --browser-executable-path=chromium "--browser-args="["--headless", "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36", "--window-size=1440,2000"]"" http://archive.is/nX7fq /opt/archive/archive/1597331904/singlefile.html

      > pdf
        Failed:
            Failed to save PDF
            [0813/151832.239630:ERROR:viz_main_impl.cc(152)] Exiting GPU process due to errors during initialization
            ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0063
            Received signal 11 SEGV_MAPERR 00000ffa003f
            #0 0x563968f29529 (/usr/lib/chromium/chromium+0x51f9528)
            #1 0x563968e87253 (/usr/lib/chromium/chromium+0x5157252)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf http://archive.is/nX7fq

      > screenshot
        Failed:
            Failed to save screenshot
            [0813/151847.542441:ERROR:viz_main_impl.cc(152)] Exiting GPU process due to errors during initialization
            ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0063
            Received signal 11 SEGV_MAPERR 00000ffa003f
            #0 0x55ee106ed529 (/usr/lib/chromium/chromium+0x51f9528)
            #1 0x55ee1064b253 (/usr/lib/chromium/chromium+0x5157252)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --screenshot http://archive.is/nX7fq

      > dom
        Failed:
            Failed to save DOM
            [0813/151902.868207:ERROR:viz_main_impl.cc(152)] Exiting GPU process due to errors during initialization
            ../../sandbox/linux/seccomp-bpf-helpers/sigsys_handlers.cc:**CRASHING**:seccomp-bpf failure in syscall 0063
            Received signal 11 SEGV_MAPERR 00000ffa003f
            #0 0x55d478e36529 (/usr/lib/chromium/chromium+0x51f9528)
            #1 0x55d478d94253 (/usr/lib/chromium/chromium+0x5157252)
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            chromium --headless "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --dump-dom http://archive.is/nX7fq

      > media
        Failed:
            Failed to save media
            Got youtube-dl response code: 1.
            WARNING: Could not send HEAD request to http://archive.is/nX7fq:
            ERROR: Unable to download webpage: (caused by URLError(OSError(113, 'No route to host')))
        Run to see full output:
            cd /opt/archive/archive/1597331904;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata --yes-playlist http://archive.is/nX7fq

      > archive_org

[√] [2020-08-13 15:19:09] Update of 1 pages complete (44.27 sec)
    - 0 links skipped
    - 0 links updated
    - 2 links had errors

    Hint: To view your archive index, open:
        /opt/archive/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-13 15:19:09] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3
    √ /opt/archive/index.json
    √ /opt/archive/index.html
```

@cdvv7788 commented on GitHub (Aug 13, 2020):

@jaw-sh can you provide the command (with the URL) you are testing? I can give it a check (it is probably being blocked by the target URL).


@jaw-sh commented on GitHub (Aug 13, 2020):

@cdvv7788 http://archive.is/nX7fq


@cdvv7788 commented on GitHub (Aug 13, 2020):

![image](https://user-images.githubusercontent.com/5531776/90155108-8ebb8e80-dd50-11ea-8c9b-fab820e3d20d.png)
It works for me. How are you trying to run `archivebox`? Are you setting some environment variable or changing some configuration? Are you on `master`?


@jaw-sh commented on GitHub (Aug 13, 2020):

http://archive.vn/nX7fq
https://tinf.io/archive/1597333033/

All I really want is the static, non-interactive version of the page they already archived.

```
sudo -u archive archivebox add http://archive.vn/nX7fq

[i] [2020-08-13 15:37:12] ArchiveBox v0.4.13: archivebox add http://archive.vn/nX7fq
    > /opt/archive

[+] [2020-08-13 15:37:13] Adding 1 links to index (crawl depth=0)...
    > Saved verbatim input to sources/1597333033-import.txt
    > Parsed 1 URLs from input (Plain Text)                                                  
    > Found 1 new URLs not already in index                                                  

[*] [2020-08-13 15:37:13] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3                                                             
    √ /opt/archive/index.json                                                                
    √ /opt/archive/index.html                                                                

[▶] [2020-08-13 15:37:14] Collecting content for 1 Snapshots in archive...

[+] [2020-08-13 15:37:15] "archive.vn/nX7fq"
    http://archive.vn/nX7fq
    > ./archive/1597333033
      > title
        Failed:                                                                              
            ConnectionError HTTPConnectionPool(host='archive.vn', port=80): Max retries exceeded with url: /nX7fq (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f52fc3b7ac8>: Failed to establish a new connection: [Errno -5] No address associated with hostname'))
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            curl --silent --max-time 60 --location --compressed --user-agent "ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" http://archive.vn/nX7fq

      > favicon
      > wget                                                                                 
        Failed:                                                                              
            TimeoutExpired Command '['wget', '--no-verbose', '--adjust-extension', '--convert-links', '--force-directories', '--backup-converted', '--span-hosts', '--no-parent', '-e', 'robots=off', '--timeout=60', '--restrict-file-names=windows', '--warc-file=warc/1597333035', '--page-requisites', '--user-agent=ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1', '--compression=auto', 'http://archive.vn/nX7fq']' timed out after 60 seconds
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=warc/1597333035 --page-requisites "--user-agent=ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) wget/GNU Wget 1.20.1" --compression=auto http://archive.vn/nX7fq

      > singlefile
        Failed:                                                                              
            Exception Failed to chmod: /opt/archive/archive/1597333033/singlefile.html does not exist (did the previous step fail?)
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            /opt/SingleFile/cli/single-file --browser-executable-path=chromium "--browser-args="["--headless", "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36", "--window-size=1440,2000"]"" http://archive.vn/nX7fq /opt/archive/archive/1597333033/singlefile.html

      > pdf
      > screenshot                                                                           
      > dom                                                                                  
      > media                                                                                
        Failed:                                                                              
             Failed to save media
            Got youtube-dl response code: 1.
            WARNING: Could not send HEAD request to http://archive.vn/nX7fq: <urlopen error [Errno 99] Cannot assign requested address>
            ERROR: Unable to download webpage: <urlopen error [Errno 99] Cannot assign requested address> (caused by URLError(OSError(99, 'Cannot assign requested address')))
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --no-check-certificate --user-agent --all-subs --extract-audio --keep-video --ignore-errors --geo-bypass --audio-format mp3 --audio-quality 320K --embed-thumbnail --add-metadata --yes-playlist http://archive.vn/nX7fq

      > archive_org
        Failed:                                                                              
             WaybackException: java.lang.IllegalStateException: Payload size does not match content-length!
        Run to see full output:
            cd /opt/archive/archive/1597333033;
            curl --silent --location --head --compressed --max-time 60 --user-agent "ArchiveBox/0.4.13 (+https://github.com/pirate/ArchiveBox/) curl/curl 7.64.0 (x86_64-pc-linux-gnu)" https://web.archive.org/save/http://archive.vn/nX7fq


[√] [2020-08-13 15:38:20] Update of 1 pages complete (1.10 min)
    - 0 links skipped
    - 0 links updated
    - 1 links had errors

    Hint: To view your archive index, open:
        /opt/archive/index.html
    Or run the built-in webserver:
        archivebox server

[*] [2020-08-13 15:38:20] Writing 2 links to main index...
    √ /opt/archive/index.sqlite3                                                             
    √ /opt/archive/index.json                                                                
    √ /opt/archive/index.html
```

@jaw-sh commented on GitHub (Aug 13, 2020):

ArchiveBox really needs a way to capture the DOM at "first rest", when the page is fully loaded. With Twitter, the archive is completely mangled because it tries to replicate the entire living Twitter webpage. Instagram is also completely broken.

I can open a new issue for this and I am willing to put cash bounties on these things.

https://twitter.com/dril/status/134787490526658561
https://tinf.io/archive/1597335380/twitter.com/dril/status/134787490526658561.html
![image](https://user-images.githubusercontent.com/9262991/90159873-4c667180-dd91-11ea-9a85-2fcb086a7bad.png)


@pirate commented on GitHub (Aug 13, 2020):

@jaw-sh it does capture at first rest with two of the methods, the DOM dump and SingleFile. Have you tried looking at those outputs?


@jaw-sh commented on GitHub (Aug 13, 2020):

I have the single-file binary set. It didn't work at all before I set it.


@cdvv7788 commented on GitHub (Aug 13, 2020):

> I have the single-file binary set. It didn't work at all before I set it.

We have a fix in an incoming PR that will disable it by default. The Docker image supports all of the extractors out of the box.


@jaw-sh commented on GitHub (Aug 13, 2020):

@cdvv7788 Sounds good.

What I really, really need is this:

  • A way to request URLs be archived with a POST/GET request from another service programmatically (auto-archiving links posted by users).
  • A way to request web content be archived at its first-rest state, identical to how archive.today works.
  • A way to import archive.today archives exactly as they appear on archive.today, instead of just the archive.today page.

I am willing to pay for this.


@pirate commented on GitHub (Aug 13, 2020):

> A way to request URLs be archived with a POST/GET request from another service programmatically (auto-archiving links posted by users).

This is already present: you can POST to `https://127.0.0.1:8000/admin/core/snapshot/add/` with the following fields to archive a link (see the sketch after this list):

  • `url: str` (a string containing any number of URLs)
  • `depth: int` (either 0 or 1, as detailed in `archivebox add --help`)
  • your session cookie header to authenticate the request
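A minimal sketch of that request from Python, assuming the server address from the comment above. The CSRF-token handling is an assumption based on this being a standard Django admin form, not documented ArchiveBox behavior:

```python
import requests

BASE = "https://127.0.0.1:8000"
ADD_URL = f"{BASE}/admin/core/snapshot/add/"

session = requests.Session()
# Reuse the session cookie of an already-logged-in admin user.
session.cookies.set("sessionid", "<your-session-cookie>")

# Assumption: like other Django admin forms, the add form wants a CSRF
# token, which we pick up by GETting the form first.
session.get(ADD_URL, verify=False)
csrf = session.cookies.get("csrftoken")

resp = session.post(
    ADD_URL,
    data={
        "url": "https://example.com",  # may contain any number of URLs
        "depth": 0,                    # 0 or 1, per `archivebox add --help`
        "csrfmiddlewaretoken": csrf,
    },
    headers={"Referer": ADD_URL},
    verify=False,  # local https://127.0.0.1 is typically self-signed
)
resp.raise_for_status()
```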

> A way to request web content be archived at its first-rest state, identical to how archive.today works.

As mentioned above, this is already present: both the DOM dump and SingleFile methods archive "at first rest", i.e. ~1s after the DOM.ready event fires.
The other methods do not execute JS, so "page ready" is not a concept that applies to them.
Your screenshot is of the wget output method only; have you tried viewing the SingleFile or DOM dump outputs? They should generally work fine for Twitter.
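For reference, the DOM dump boils down to the same headless-Chromium invocation shown in the logs above; a rough Python wrapper (paths and timeouts illustrative, not ArchiveBox's actual code) might look like:

```python
import subprocess


def dump_dom(url: str, out_path: str, timeout_sec: int = 60) -> None:
    # Headless Chromium loads the page, lets it settle, then serializes
    # the rendered DOM to stdout (same flags as in the logs above).
    result = subprocess.run(
        ["chromium", "--headless", f"--timeout={timeout_sec * 1000}",
         "--dump-dom", url],
        capture_output=True, check=True, timeout=timeout_sec + 30,
    )
    with open(out_path, "wb") as f:
        f.write(result.stdout)


dump_dom("https://twitter.com/dril/status/134787490526658561", "output.html")
```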

> A way to import archive.today archives exactly as they appear on archive.today, instead of just the archive.today page.

I'm afraid this is not easily possible; archive.today explicitly does not expose an API that allows users to download their snapshots. If they did have such an API, then that task would fall under the umbrella of this ticket: https://github.com/pirate/ArchiveBox/issues/160

![image](https://user-images.githubusercontent.com/511499/90187262-83378a00-dd87-11ea-839a-2fc37ace4681.png)

If you are serious about this, be aware that funding development on this issue would be on the order of $5k USD or more. We run a software consultancy; you can find more info about hiring us at Monadical.com.

Also related (for improving exporting to sites like archive.today/archive.org): https://github.com/pirate/ArchiveBox/issues/146


@jaw-sh commented on GitHub (Aug 13, 2020):

archive.today/is/vn/fi does not use the WARC format; they export a .zip download. Even if it's not easy, converting that .zip download into WARC and using it as a snapshot is something I would pay for. I have thousands of these links I would like to host myself.
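A rough sketch of what such a conversion could look like, using the `warcio` library. The ZIP layout and the mapping of archive members back to original URLs are assumptions (and the genuinely hard part), not a confirmed format:

```python
import io
import zipfile

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter


def zip_to_warc(zip_path: str, warc_path: str, base_url: str) -> None:
    """Wrap each file in an archive.today ZIP export as a WARC response record.

    Assumption: each member's original URL is approximated by joining its
    name onto `base_url`; a real importer would need smarter URL mapping.
    """
    with zipfile.ZipFile(zip_path) as zf, open(warc_path, "wb") as fh:
        writer = WARCWriter(fh, gzip=True)
        for name in zf.namelist():
            payload = zf.read(name)
            headers = StatusAndHeaders(
                "200 OK",
                [("Content-Length", str(len(payload)))],
                protocol="HTTP/1.1",
            )
            record = writer.create_warc_record(
                f"{base_url.rstrip('/')}/{name}",
                "response",
                payload=io.BytesIO(payload),
                http_headers=headers,
            )
            writer.write_record(record)
```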

I must be missing something re: the single file archive. Is there a special config setting I have to set to explicitly use single file? I believe I am already using it but Instagram and Twitter archives are malformed. I had to create a binary to get any archive to work.


@pirate commented on GitHub (Aug 13, 2020):

We might be able to download that ZIP and rehost it verbatim in the ArchiveBox index without converting it to WARC. ArchiveBox wouldn't be able to run any of its own extractors though (wget, youtubedl, git, chrome, etc.), you'd basically just see the archive.today version in the index with none of ArchiveBox's own functionality. Is that what you're asking for?

https://github.com/pirate/ArchiveBox/wiki/Usage#ui-usage

All archive methods (that are installed) are run for every URL, you can access them by clicking the favicon next to the title, or any of the icons in the "Files" column.

![Screen Shot 2020-08-13 at 5 51 18 PM](https://user-images.githubusercontent.com/511499/90190780-ac5b1900-dd8d-11ea-9453-2525bd74599c.png)
![Screen Shot 2020-08-13 at 5 52 37 PM](https://user-images.githubusercontent.com/511499/90190875-d01e5f00-dd8d-11ea-89e0-ba1a866b2555.png)

@pirate commented on GitHub (Jun 13, 2023):

I'm merging this feature with https://github.com/ArchiveBox/ArchiveBox/issues/160, which is a more general TODO to add support for searching/importing from 3rd party archiving platforms.

Please subscribe to that issue for progress updates / discussions.
