mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #318] Feature Request: Add SingleFile CLI option to fire scroll event and load deferred images to better support WeChat archiving #3251
Originally created by @KagurazakaShirosatosu on GitHub (Feb 2, 2020).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/318
Type
What is the problem that your feature request solves
I archived a page from the WeChat Open Platform (such as https://mp.weixin.qq.com/s/ri4nDgPQo4OVWaIWG9EQZA),
but I found that none of the images on the page were archived
(https://archive.sager.wang/archive/1580637904/mp.weixin.qq.com/s/ri4nDgPQo4OVWaIWG9EQZA.html),
and the title also shows "Unable to detect page title" in the ArchiveBox index.
WeChat is the biggest IM in China and it has the strictest censorship there,
so I am hoping ArchiveBox can archive the page with images.
I am sorry for my bad English :-)
Describe the ideal specific solution you'd want, and whether it fits into any broader scope of changes
I hope ArchiveBox can archive pages (with images) from the WeChat Open Platform.
What hacks or alternative solutions have you tried to solve the problem?
archive.is can archive pages from the WeChat Open Platform with images.
How badly do you want this new feature?
@wych42 commented on GitHub (Mar 22, 2020):
It's caused by the WeChat media platform (article) page having an empty
@cdvv7788 commented on GitHub (Jul 17, 2020):
@pirate the title is not present, is it ok to look in the og:title tag as a fallback? (the retrieval is stopping because of this)
About the images, they are being lazy-loaded. I am not sure if wget can handle that, but that is something we can check after the title issue is fixed.
@pirate commented on GitHub (Jul 17, 2020):
Yeah, you can look in og:title, but let's not handle the image lazy loading right now, that's a very complex problem.
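For illustration, the og:title fallback discussed above could look roughly like the following. This is only a minimal sketch using the Python standard library, not ArchiveBox's actual title-extraction code; the names `TitleParser` and `extract_title` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Optional


class TitleParser(HTMLParser):
    """Collects the <title> text and the og:title meta content, if present."""

    def __init__(self) -> None:
        super().__init__()
        self.title_text = ""
        self.og_title: Optional[str] = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attr_map = dict(attrs)
            # Keep only the first og:title we see.
            if attr_map.get("property") == "og:title" and self.og_title is None:
                self.og_title = attr_map.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_text += data


def extract_title(html: str) -> Optional[str]:
    """Return the <title> text, falling back to og:title when it is empty."""
    parser = TitleParser()
    parser.feed(html)
    title = parser.title_text.strip()
    if title:
        return title
    og_title = (parser.og_title or "").strip()
    return og_title or None
```

With a page whose title tag is empty but which carries an og:title meta tag, `extract_title` returns the og:title content; when neither is present it returns `None`, so the caller can decide how to report the missing title.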
@hope1 commented on GitHub (Jan 11, 2022):
Greetings. I wonder if it is possible at this point to revisit this problem?
Given the prominence of Wechat (and the most comprehensive censoring, as OP also mentioned), I'd wager that articles on mp.weixin.qq.com are probably the most common target for archiving for users in China. It certainly is the case for me, as most of the articles I have felt a need to archive are on there. It would be wonderful to have ArchiveBox available for this usage, especially as the archive.* sites now block all web host proxy users.
Apologies for digging up an old thread and thank you all for your hard work.
@pirate commented on GitHub (Jan 11, 2022):
I recommend seeing if they can be archived with SingleFile, and if not, raising the issue on that repo. ArchiveBox does not itself do any archiving, it's just a collection of other utilities that do the actual archiving. If there are issues with archive fidelity in general, the fix is to raise those issues with the sub-utilities or add a new extractor.
@KagurazakaShirosatosu commented on GitHub (Jan 19, 2022):
Hi,
I found that mp.weixin.qq.com can be captured by SingleFile, including pictures, if I scroll down the entire webpage manually and wait for all pictures to finish loading. I think ArchiveBox could scroll to the bottom of the page, wait for networkidle0, and then call SingleFile to capture it.
@pirate commented on GitHub (Jan 22, 2022):
ArchiveBox does not have granular control over how the singlefile capture is done, we only call the SingleFile CLI. Unless they provide a CLI option to scroll before capturing, we cannot do that.
@canoziia commented on GitHub (Dec 1, 2022):
Hello, I found that SingleFile CLI has an option --load-deferred-images-dispatch-scroll-event. When it is enabled, lazy-loaded images can be saved perfectly (at least on WeChat's pages).
https://github.com/gildas-lormeau/single-file-cli/blob/master/args.js#L206
@pirate commented on GitHub (Dec 3, 2022):
That's a great find! 🥳 Thanks. Let's add that option to ArchiveBox by default then.
@melyux commented on GitHub (Jul 16, 2023):
On a lot of my pages, turning --load-deferred-images-dispatch-scroll-event to true causes some "subscribe to my newsletter" popup to come up, while having it off prevents this and still loads deferred images. So it probably shouldn't be a default.
This is all moot though because the bundled singlefile is outdated and doesn't support deferred images at all right now.
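One way to reconcile the two positions in this thread would be to make the flag opt-in rather than a default. The sketch below only shows how such a command line might be assembled; it is not ArchiveBox's actual SingleFile extractor. The function name `build_singlefile_cmd` and its parameters are hypothetical, and both the `=true` value syntax and the URL-then-output argument order are assumptions that should be checked against the installed single-file CLI version.

```python
from typing import List


def build_singlefile_cmd(
    url: str,
    output_path: str,
    dispatch_scroll_event: bool = False,  # opt-in, since it can trigger popups on some pages
    binary: str = "single-file",          # assumed name of the SingleFile CLI binary
) -> List[str]:
    """Assemble an argv list for a SingleFile CLI capture."""
    cmd = [binary]
    if dispatch_scroll_event:
        # Flag name taken from this thread; the "=true" form is an assumption.
        cmd.append("--load-deferred-images-dispatch-scroll-event=true")
    cmd += [url, output_path]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd)`. Leaving the option off by default keeps the current behaviour for pages where the scroll event brings up a newsletter popup, while WeChat users can switch it on.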