mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[GH-ISSUE #984] Bug: Indexing subtitles in media extractor fails when they're not UTF-8 encoded #2120
Originally created by @kylrth on GitHub (May 27, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/984
I get the following when archiving a link to a YouTube video:
When this happens it stops processing the rest of the URLs I provided.
ArchiveBox version
@pirate commented on GitHub (Jun 9, 2022):
Seems like after the media extractor completes, it's trying to load some subtitle / video metadata files for full-text indexing (generated by youtube-dl) that aren't encoded with UTF-8. I don't know of an easy full solution to this other than attempting to detect the encoding of those files dynamically (which is difficult and often error-prone).
What we can do for now is just add an exception catcher that skips trying to index those files if they throw encoding errors and displays a warning like:
> Warning: Skipped adding some files to full-text index as they are not in UTF-8 format
@turian commented on GitHub (Aug 20, 2022):
That would be a great workaround.
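A minimal sketch of the workaround proposed above, assuming the indexer reads each candidate file as UTF-8 text. The function name `read_texts_for_index` and its signature are hypothetical, chosen for illustration; this is not ArchiveBox's actual indexing code.

```python
import sys
from pathlib import Path
from typing import Iterable, Iterator


def read_texts_for_index(paths: Iterable[Path]) -> Iterator[str]:
    """Yield the text of each file, skipping any that are not valid UTF-8.

    Hypothetical helper: the name and signature are assumptions,
    not part of ArchiveBox.
    """
    skipped = []
    for path in paths:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid bytes.
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Skip this file instead of aborting the whole add/update run.
            skipped.append(path)
            continue
        yield text
    if skipped:
        # Emit one warning after all files have been attempted.
        print(
            "Warning: Skipped adding some files to full-text index "
            "as they are not in UTF-8 format",
            file=sys.stderr,
        )
```

Catching only `UnicodeDecodeError` keeps other failures (missing files, permission errors) visible, and a single bad subtitle file no longer stops the remaining URLs from being processed.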
@pirate commented on GitHub (Aug 22, 2022):
For anyone landing on this issue, just know it's fairly harmless. Despite an error being displayed the archive method still completes successfully and the files are saved, it's just the full-text indexing part that fails. Hence why I haven't prioritized fixing it already. PRs welcome though! Otherwise I'll get around to it on the next sprint after I do 0.6.3 (which is already bloated and late).
@jgoerzen commented on GitHub (Aug 25, 2022):
The problem is that it crashes the whole add/update run. I've got a thousand other files to do, and they never get saved.
@turian commented on GitHub (Aug 27, 2022):
Agree with @jgoerzen that this bug is a showstopper from getting me to migrate to archivebox currently :(
@turian commented on GitHub (Sep 11, 2022):
@pirate How far out is 0.6.3?
@turian commented on GitHub (Sep 12, 2022):
I believe I fixed this in https://github.com/ArchiveBox/ArchiveBox/pull/1026
TL;DR, until that's merged:
Add this to ArchiveBox.conf:
If that doesn't work and you still get crap UnicodeDecodeErrors, you can use my Docker image turian/archivebox:kludge-984-UTF8-bug instead of archivebox/archivebox for now. Or use my branch and pip install or whatever from there.
@jgoerzen commented on GitHub (Sep 14, 2022):
@turian Thanks for your work on this!
Unfortunately, on your Docker image, I get:
PermissionError: [Errno 13] Permission denied: '/app/archivebox/core/migrations/0021_auto_20220914_0213.py'
And there is no /usr/bin/yt-dlp in the standard Docker image.
@pirate commented on GitHub (Sep 15, 2022):
Probably still a month or two out. I'm currently trying to find new housing in Oakland and that's taking up all my free time.
Might try and secure a $20-50k grant to work on ArchiveBox full-time in the near future! Will keep y'all posted, sorry for the brutal delay with this release, I know it's taking a lot longer than usual and I know that has real impact on everyone's workflows.
@turian commented on GitHub (Sep 15, 2022):
@jgoerzen I have fixed both these issues in my branch and have submitted PRs. The migrations bug is something in dev, but I pushed a minor PR to fix it. You can even create the migration yourself from dev with django manage.py createmigrations
You can use my Docker turian/archivebox:kludge-984-UTF8-bug, instead of archivebox/archivebox for now. Or use my branch and pip install or whatever from there.
@turian commented on GitHub (Sep 15, 2022):
Ach damn :(
I lived in the bay area. I feel your pain. Have you considered moving to Berlin?
Well as a Berliner you could apply for an EU grant. Somehow memex got one even tho they are for-profit now. It seems like a cool project but they refuse to implement bulk export. Their sponsors
If you ping me later I might have other ideas for sponsors.
Can I please ask for a tiny request?
As a new contributor, can you please just enable access so that GitHub Actions CI/CD will run on my PRs?
Besides my larger PRs on yt-dlp (which I know you are too busy to review since it requires some thought), I have this tiny one to fix everyone's migration complaint about dev: https://github.com/ArchiveBox/ArchiveBox/pull/1027 and this one-liner documentation change: https://github.com/ArchiveBox/ArchiveBox/pull/1023
Good luck with the move!
@pirate commented on GitHub (Sep 22, 2022):
Thanks so much @turian for all your work here! I'll get on reviewing those PRs and I'll enable the CI checks for contributors.
@pirate commented on GitHub (Jan 19, 2024):
The original issue should be fixed here as of v0.7.2! Comment back if you're still having issues and I'll re-open.