mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #331] It fails to grab cnn.com properly #238
Originally created by @gerroon on GitHub (Mar 20, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/331
Describe the bug
AB grabs only some elements of cnn.com, not the actual content. Please see the screenshot for what it grabbed from cnn.com.
Bear in mind that it does scrape the page, just not properly. I can see the files in the data folder, so it can't be a permission issue.
Steps to reproduce
echo "https://edition.cnn.com" | docker-compose exec -T archivebox /bin/archive
Screenshots or log output
https://i.imgur.com/SCgez2G.png
Log below
Software versions
Debian Testing, docker-compose, 83197ef
@pirate commented on GitHub (Mar 22, 2020):
Unfortunately, many news sites are quite hostile to scripted access and do everything in their power to prevent it.
I tried a few approaches to archive it but failed. I think you may have to rely on archive.org or another tool if you archive many CNN sites for now. I don't have any magic fix unfortunately.
We are adding other archive methods in the future, so I hope to see this situation improve. For now I recommend trying pywb/webrecorder.io or https://github.com/gildas-lormeau/SingleFile.
Because many sites have different issues with archiving depending on subtool-specific problems, we generally don't keep issues open for them unless the bugs are caused by archivebox directly. You can see more discussion on a similar case here: https://github.com/pirate/ArchiveBox/issues/328#issuecomment-599796868
I am particularly disappointed that CNN doesn't work though, so I'll keep an eye out for potential fixes and post back on this issue if I find any. If you find any combination of chrome headless / wget command line arguments that make it work, let us know and I'll add them as config options!
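For anyone who wants to experiment with wget arguments as suggested above, here is a minimal sketch of a starting point. It only assembles a command line from standard wget flags; the helper name `build_wget_args`, the default user agent string, and the timeout value are all hypothetical and are not taken from ArchiveBox. There is no claim that this particular combination makes CNN archive correctly.

```python
import shlex

# Hypothetical helper for experimenting; none of these names come from ArchiveBox.
def build_wget_args(url, user_agent="Mozilla/5.0 (X11; Linux x86_64)", timeout=60):
    """Return an argv list for a single wget attempt against `url`."""
    return [
        "wget",
        "--page-requisites",    # also fetch images, CSS, and scripts the page needs
        "--convert-links",      # rewrite links so the saved copy works offline
        "--adjust-extension",   # save HTML/CSS with matching file extensions
        "--span-hosts",         # allow requisites hosted on other domains (CDNs)
        "--no-parent",          # never ascend above the starting directory
        "-e", "robots=off",     # assumption: you have decided to ignore robots.txt
        f"--timeout={timeout}",
        f"--user-agent={user_agent}",
        url,
    ]

if __name__ == "__main__":
    # Print the command so it can be copied into a shell and tweaked by hand.
    print(shlex.join(build_wget_args("https://edition.cnn.com")))
```

Changing one flag at a time and comparing the saved output makes it easier to report back which argument actually made a difference.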
@gerroon commented on GitHub (Mar 22, 2020):
Hi
Thanks for the reply. It makes sense, as long as it is not a bug with the app. My plan is to archive big news sites daily.