[GH-ISSUE #331] It fails to grab cnn.com properly #238

Closed
opened 2026-03-01 14:41:45 +03:00 by kerem · 2 comments

Originally created by @gerroon on GitHub (Mar 20, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/331

Describe the bug

AB gets only some elements of cnn.com but not the actual content. Please see the screenshot for what it grabbed from cnn.com.

Bear in mind that it does scrape the page, just not properly. I can see the files in the data folder, so it can't be a permission issue.

Steps to reproduce

echo "https://edition.cnn.com" | docker-compose exec -T archivebox /bin/archive

Screenshots or log output

https://i.imgur.com/SCgez2G.png

Log below

Software versions

Debian Testing, docker-compose, 83197ef

echo "https://edition.cnn.com" | docker-compose exec -T archivebox /bin/archive
fatal: Not a git repository (or any of the parent directories): .git
[*] [2020-03-20 15:41:28] Parsing new links from output/sources/stdin-1584718888.txt...
    > Adding 1 new links to index (parsed import as Plain Text)
[*] [2020-03-20 15:41:28] Saving main index files...
    √ /data/index.json
    √ /data/index.html
[▶] [2020-03-20 15:41:28] Updating content for 2 pages in archive...

[+] [2020-03-20 15:41:28] "https://edition.cnn.com"
    https://edition.cnn.com
    > /data/archive/1584718888
      > title
      > favicon
      > wget
        Failed:TimeoutExpired Command 'wget' timed out after 60 seconds
        Run to see full output:
            cd /data/archive/1584718888;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --restrict-file-names=windows --timeout=60 --warc-file=warc/1584718888 --page-requisites "--user-agent=ArchiveBox/ (+https://github.com/pirate/ArchiveBox/) wget/1.18" https://edition.cnn.com
      > pdf
      > screenshot
      > dom
      > media
      > archive_org

[*] [2020-03-20 15:42:41] "GitHub - pirate/ArchiveBox: 🗃 The open source self-hosted web archive. Takes browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more..."
    https://github.com/pirate/ArchiveBox
    √ /data/archive/1584718296
      > pdf
        Failed:Exception Failed to chmod: output.pdf does not exist (did the previous step fail?)
        Run to see full output:
            cd /data/archive/1584718296;
            google-chrome-unstable --headless --no-sandbox --disable-gpu "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --print-to-pdf https://github.com/pirate/ArchiveBox
      > screenshot
        Failed:Exception Failed to chmod: screenshot.png does not exist (did the previous step fail?)
        Run to see full output:
            cd /data/archive/1584718296;
            google-chrome-unstable --headless --no-sandbox --disable-gpu "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36" --window-size=1440,2000 --timeout=60000 --screenshot https://github.com/pirate/ArchiveBox
[√] [2020-03-20 15:42:51] Update of 2 pages complete (1.39 min)
    - 0 links skipped
    - 0 links updated
    - 2 links had errors
    To view your archive, open: /data/index.html
[*] [2020-03-20 15:42:51] Saving main index files...
    √ /data/index.json
    √ /data/index.html
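
The `Failed:TimeoutExpired Command 'wget' timed out after 60 seconds` line in the log above has the same shape as the message Python's `subprocess.TimeoutExpired` produces, which suggests the whole wget process was stopped by an outer 60-second limit. wget's own `--timeout=60` flag is a per-operation network timeout (DNS, connect, read), not a cap on total run time, so a page with many requisites can still be downloading when the outer limit fires. A minimal sketch of that behaviour follows; `run_with_timeout` and the sleeping child process are hypothetical stand-ins for illustration, not ArchiveBox's actual code.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout):
    """Run cmd and return (ok, message).

    Hypothetical helper: it only illustrates how an outer process
    timeout surfaces as subprocess.TimeoutExpired.
    """
    try:
        subprocess.run(cmd, capture_output=True, timeout=timeout, check=False)
    except subprocess.TimeoutExpired as err:
        # str(err) has the form "Command '...' timed out after N seconds"
        return False, f"Failed:{type(err).__name__} {err}"
    return True, "ok"


# Stand-in for a slow wget run: a child process that outlives the limit.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
ok, message = run_with_timeout(slow_cmd, timeout=0.5)
print(ok, message)
```

If this reading is right, raising only wget's `--timeout` would not help on its own; the outer limit would have to be raised as well.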


kerem closed this issue 2026-03-01 14:41:45 +03:00

@pirate commented on GitHub (Mar 22, 2020):

Unfortunately many news sites are quite hostile to scripted access, and do everything in their power to prevent it.

I tried a few approaches to archive it but failed. I think you may have to rely on archive.org or another tool if you archive many CNN sites for now. I don't have any magic fix unfortunately.

We are adding other archive methods in the future, so I hope to see this situation improve. For now I recommend trying pywb/webrecorder.io or https://github.com/gildas-lormeau/SingleFile.

Because many sites have different issues with archiving depending on subtool-specific problems, we generally don't keep issues open for them unless the bugs are caused by archivebox directly. You can see more discussion on a similar case here: https://github.com/pirate/ArchiveBox/issues/328#issuecomment-599796868

I am particularly disappointed that CNN doesn't work though, so I'll keep an eye out for potential fixes and post back on this issue if I find any. If you find any combination of chrome headless / wget command line arguments that make it work, let us know and I'll add them as config options!
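
For anyone who wants to experiment with argument combinations, here is a rough sketch that rebuilds the wget command from the log in this issue with the timeout and user agent made adjustable. The helper name `build_wget_cmd` and its parameters are hypothetical, the flags are copied from the logged command, and nothing here is known to get past CNN's blocking.

```python
import shlex


def build_wget_cmd(url, timeout=60, user_agent=None, warc_name=None):
    """Assemble a wget argument list like the one shown in the log.

    Hypothetical helper for experimentation only; the flags are the
    ones from the logged command, with timeout and user agent exposed.
    """
    cmd = [
        "wget",
        "--no-verbose",
        "--adjust-extension",
        "--convert-links",
        "--force-directories",
        "--backup-converted",
        "--span-hosts",
        "--no-parent",
        "-e", "robots=off",
        "--restrict-file-names=windows",
        f"--timeout={timeout}",
        "--page-requisites",
    ]
    if warc_name:
        cmd.append(f"--warc-file=warc/{warc_name}")
    if user_agent:
        cmd.append(f"--user-agent={user_agent}")
    cmd.append(url)
    return cmd


# Example: a longer timeout plus the browser user agent that the
# Chrome commands in the log already use.
chrome_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36"
)
cmd = build_wget_cmd("https://edition.cnn.com", timeout=120, user_agent=chrome_ua)
print(shlex.join(cmd))
```

The printed line can be pasted into a shell inside the archive folder to compare results by hand before proposing anything as a config option.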


@gerroon commented on GitHub (Mar 22, 2020):

Hi

Thanks for the reply. It makes sense, as long as it is not a bug with the app. My plan is to archive big news sites daily.
