Mirror of https://github.com/ArchiveBox/ArchiveBox.git, synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #134] Intermittent network response dropping when building and executing inside docker #90
Originally created by @mawmawmawm on GitHub (Jan 25, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/134
It seems like the Pocket RSS feeds are not being parsed correctly, and fragments of the XML / HTML tags are being included in the links.
Here's how to reproduce this:
I created a Pocket account with two links in it; the corresponding RSS feed that gets downloaded looks like this:
Instead of the two <link> elements, the software now tries to pull in 10 links and seems to mess up the URLs (note the <guid> at the end of the URL wget is trying to download). In the end, no links could be saved:
Latest stable version.
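The symptom described above (tag fragments like <guid> leaking into the extracted URLs) is what typically happens when links are pulled out of a feed with regex or string splitting rather than an XML parser. A minimal sketch, not ArchiveBox's actual parser and with made-up feed content, contrasting the two approaches on a Pocket-style feed:

```python
# Sketch: extracting links from a Pocket-style RSS feed. A naive URL regex
# also matches <guid> values, inflating the link count; a real XML parser
# returns only the text of each item's <link> element.
import re
import xml.etree.ElementTree as ET

FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <item>
    <link>https://example.com/article-1</link>
    <guid>https://getpocket.com/item/111</guid>
  </item>
  <item>
    <link>https://example.com/article-2</link>
    <guid>https://getpocket.com/item/222</guid>
  </item>
</channel></rss>"""

# Naive approach: grab anything URL-shaped. This over-collects, pulling in
# the <guid> URLs too (4 matches here instead of 2).
naive = re.findall(r"https?://[^\s<]+", FEED)

# Parser-based approach: only each item's <link> text.
tree = ET.fromstring(FEED)
links = [item.findtext("link") for item in tree.iter("item")]

print(len(naive), len(links))  # the regex over-collects; the parser finds 2
print(links)
```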
@pirate commented on GitHub (Jan 26, 2019):
What a coincidence, I just ran into this myself 20min ago, not sure what broke but I'll take a look.
@pirate commented on GitHub (Jan 26, 2019):
Can you give this a shot: ff125d9 (latest master)?
Hope I didn't mess up your archive; it might be a pain, but you should be able to use https://github.com/pirate/ArchiveBox/blob/master/archivebox/purge.py to remove any bad links.
@mawmawmawm commented on GitHub (Jan 26, 2019):
Hey, I’d love to try it, but I’m running into the same old error again that I mentioned in my other ticket. In the meantime it had worked, however. Could this be a server issue on Google's side of things?
I tried deleting and rebuilding everything at least 5 times with no cache, but no luck. It errors out on different lines / chars, btw.
@mawmawmawm commented on GitHub (Jan 26, 2019):
I tried a couple of things, e.g. updating npm via npm i npm (now npm@5.6.0 /usr/local/lib/node_modules/npm), and also npm cache clean --force as well as npm update, but I’m still running into the same issue.
docker-compose build --no-cache --force-rm also didn’t help. Other sites mention this could be due to a broken npm caching mechanism in older versions; any idea how to fix this?
@pirate commented on GitHub (Jan 26, 2019):
It's very odd because I can't reproduce it, and it's happening during the build process, which should theoretically be the same no matter what machine you're building on.
Try pulling the latest master; I bumped the base image version up from 8 to 11.
@mawmawmawm commented on GitHub (Jan 27, 2019):
Thanks for your quick response, but no luck - I'm sorry.
I tried everything I could think of...
docker rmi <id> (and the old containers as well), then docker-compose build --no-cache --force-rm (which takes considerably longer since everything is done from scratch again). But no luck, still the same issue in step 7/15 with ...
As mentioned, the actual "end" varies, and the "while parsing near ..." is always different.
I don't know why it worked once (basically before I created this issue) and now not anymore...
@mawmawmawm commented on GitHub (Jan 28, 2019):
OK, I did more digging. I edited the Dockerfile to include a RUN npm cache clean --force before the puppeteer installation (now step 8 instead of 7), but no luck there as well:
I then reduced the Dockerfile to the bare minimum to see if that would give me any clue:
But still (this time errored out on the same spot):
So the error must be within the npm package of puppeteer?!
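One way to test the "broken puppeteer package" hypothesis is to download the tarball yourself and verify it against the checksum the registry advertises: a silently truncated transfer fails the check on one machine, whereas a genuinely broken package would fail it everywhere. A hedged sketch using the npm registry's published version metadata (the dist.tarball and dist.shasum fields); verify_tarball and the example package/version are illustrative, not part of ArchiveBox:

```python
# Sketch: verify a downloaded npm tarball against the SHA-1 checksum the
# registry publishes in its version metadata. A False result points at a
# corrupt or truncated download (i.e. the network) rather than the package.
import hashlib
import json
import urllib.request

REGISTRY = "https://registry.npmjs.org"

def verify_tarball(package: str, version: str, registry: str = REGISTRY) -> bool:
    """Download a package tarball and check it against the registry checksum."""
    # GET /:package/:version returns JSON with dist.tarball / dist.shasum.
    with urllib.request.urlopen(f"{registry}/{package}/{version}") as resp:
        meta = json.load(resp)
    with urllib.request.urlopen(meta["dist"]["tarball"]) as resp:
        tarball = resp.read()
    return hashlib.sha1(tarball).hexdigest() == meta["dist"]["shasum"]

# e.g. verify_tarball("puppeteer", "1.11.0")
```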
@mawmawmawm commented on GitHub (Jan 28, 2019):
So... even more...
I changed puppeteer to puppeteer-core (a version of Puppeteer that doesn't download Chromium by default) in the Dockerfile, because we're installing Chromium separately anyway. At first this failed as well:
There seems to be something going on either with my network connection or the npm servers.
I tried again:
This finally did work. Not sure about the tarball errors.
Back to the original purpose of the ticket, pocket feeds not being properly imported: I tried the same RSS feed and this time my two links were parsed / downloaded correctly; screenshot, html, pdf confirmed and working.
Thanks again for your support and this project. Love it, and I think it's very important. You might want to consider puppeteer-core.
@pirate commented on GitHub (Jan 28, 2019):
I suspect the network error might actually still be happening, but that by reducing the number of requests needed to install it managed to succeed once?
Because it seems like a lower-level network issue, if you want you could try testing curl https://registry.npmjs.org/rimraf inside the built container via exec, or by putting it somewhere in the Dockerfile, then checking that the output isn't truncated.
Is the working build reproducible for you? (Does it succeed reliably multiple times when doing a --no-cache --force-rm rebuild?)
If puppeteer-core does fix it reliably, then I'm happy to change to it, especially as you mentioned we're already installing Chrome manually.
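The curl check suggested here can also be scripted so it is easy to repeat inside the container (e.g. via docker exec). A rough sketch, not part of ArchiveBox: compare the advertised Content-Length against the bytes that actually arrive, treating a connection cut mid-body (which urllib surfaces as http.client.IncompleteRead) as a truncated response. check_once is a hypothetical helper name.

```python
# Sketch of a repeatable truncation check: one GET per call, returning how
# many bytes the server promised vs. how many actually arrived.
import http.client
import urllib.request

def check_once(url: str, timeout: float = 30.0) -> tuple[int, int]:
    """Return (advertised Content-Length, bytes actually received)."""
    # Ask for an uncompressed body so Content-Length matches the raw bytes.
    req = urllib.request.Request(url, headers={"Accept-Encoding": "identity"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        advertised = int(resp.headers.get("Content-Length", -1))
        try:
            body = resp.read()
        except http.client.IncompleteRead as e:
            body = e.partial  # the connection was cut before the full body
    return advertised, len(body)

# Example loop (URL from the thread):
#   for i in range(20):
#       a, r = check_once("https://registry.npmjs.org/rimraf")
#       print(i, a, r, "OK" if a == r else "TRUNCATED")
```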
@mawmawmawm commented on GitHub (Jan 28, 2019):
Looks like something with this server is weird. I tried this in the container:
Another try:
Now I got the file on the first try, but not the first time around; see the "Read error". wget was then smart enough to fetch the rest of the file (HTTP 206; I diff'd the files and they're the same), but I don't know if npm does the same thing (since an HTTP 200 was sent). If not, that would explain the weird behavior, I believe.
@pirate commented on GitHub (Jan 28, 2019):
Very interesting, it looks like you have some network issue that starts dropping response data after ~64kb. I've never seen anything like that.
What OS are you running on? Have you experienced anything like these truncated responses in your day-to-day (non-docker) activities?
Do you mind re-running that wget test with -v, or using curl -v? At this point, I'm just curious what could be causing this...
@mawmawmawm commented on GitHub (Jan 28, 2019):
It has never been an issue so far; Docker is running on my Synology DS716+ NAS with their (Linux-based) DSM 6.2.1 software / OS.
Here we go:
That worked as expected. I gave it more tries, and after 13 successful ones I saw the error / partial download again, though now at a different byte.
This time from a different server (104.16.16.35), however.
@pirate commented on GitHub (Jan 29, 2019):
Can you try with curl, passing -H "Accept-Encoding: gzip"?
@mawmawmawm commented on GitHub (Jan 30, 2019):
I tried
instead - same thing; I just find the wget output a bit friendlier to read. I did this directly on my machine as well as on the NAS.
I tried it ~50 times, and I saw that the Content-Length was sometimes different (see the values in [brackets]: sometimes 10284, 10070, 10281, etc.):
I renamed those files to .zip, extracted and diff'd them, and they were the same afterwards. This is a mystery to me :)
@pirate commented on GitHub (Feb 4, 2019):
I'm going to close this since it seems to be specific to your machine and unlikely to affect other people. I'm still very much interested in figuring it out though. You could try Wiresharking the traffic from the container to see if there's any ICMP or TCP weirdness that's ending the stream early.
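A closing note on the varying Content-Length values observed earlier in the thread (10284, 10070, 10281) despite byte-identical extracted contents: this is consistent with a server or CDN compressing on the fly, since the same payload can legitimately produce different gzip streams (different compression level, header metadata, etc.), so differing lengths alone do not prove corruption. A small self-contained demonstration of the effect; it does not prove that is what happened here:

```python
# Demonstration: the same payload gzip-compressed with different settings
# yields streams of different lengths that decompress to identical bytes.
import gzip
import io

payload = b"the same tarball bytes " * 500

def gzip_bytes(data: bytes, mtime: int, level: int) -> bytes:
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime,
                       compresslevel=level) as f:
        f.write(data)
    return buf.getvalue()

a = gzip_bytes(payload, mtime=0, level=9)
b = gzip_bytes(payload, mtime=1548633600, level=1)

print(len(a), len(b))                            # different "Content-Lengths"
print(gzip.decompress(a) == gzip.decompress(b))  # identical payloads
```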