[GH-ISSUE #984] Bug: Indexing subtitles in media extractor fails when they're not UTF-8 encoded #2120

Closed
opened 2026-03-01 17:56:38 +03:00 by kerem · 13 comments
Owner

Originally created by @kylrth on GitHub (May 27, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/984

I get the following when archiving a link to a YouTube video:

[+] [2022-05-27 13:56:21] "youtu.be/_ZNCCttVMg8?t=1592"
    https://youtu.be/_ZNCCttVMg8?t=1592
    > ./archive/1653659773.131913
      > title
      > favicon
      > headers
      > singlefile
      > screenshot
      > wget
      > readability
      > mercury
      > media
Traceback (most recent call last):
  File "/app/archivebox/extractors/__init__.py", line 109, in archive_link
    ! Failed to archive link: Exception: Exception in archive_methods.save_media(Link(url=https://youtu.be/_ZNCCttVMg8?t=1592))

    result = method_function(link=link, out_dir=out_dir)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/media.py", line 75, in save_media
    index_texts = [
  File "/app/archivebox/extractors/media.py", line 76, in <listcomp>
    text_file.read_text(encoding='utf-8').strip()
  File "/usr/local/lib/python3.10/pathlib.py", line 1133, in read_text
    return f.read()
  File "/usr/local/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 87545: invalid start byte

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/bin/archivebox", line 33, in <module>
    sys.exit(load_entry_point('archivebox', 'console_scripts', 'archivebox')())
  File "/app/archivebox/cli/__init__.py", line 140, in main
    run_subcommand(
  File "/app/archivebox/cli/__init__.py", line 80, in run_subcommand
    module.main(args=subcommand_args, stdin=stdin, pwd=pwd)    # type: ignore
  File "/app/archivebox/cli/archivebox_add.py", line 103, in main
    add(
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/main.py", line 626, in add
    archive_links(new_links, overwrite=False, **archive_kwargs)
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/__init__.py", line 181, in archive_links
    archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
  File "/app/archivebox/util.py", line 114, in typechecked_function
    return func(*args, **kwargs)
  File "/app/archivebox/extractors/__init__.py", line 130, in archive_link
    raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
Exception: Exception in archive_methods.save_media(Link(url=https://youtu.be/_ZNCCttVMg8?t=1592))

When this happens it stops processing the rest of the URLs I provided.
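
The `UnicodeDecodeError` in the traceback can be reproduced in isolation. The sketch below is illustrative only: the sample bytes are made up, and decoding with Windows-1252 (`cp1252`) is an assumption, since the thread never establishes the subtitle file's real encoding.

```python
# 0x96 has the bit pattern 10010110, which UTF-8 reserves for continuation
# bytes, so it can never begin a character. That is the "invalid start byte".
raw = b"caption one \x96 caption two"  # made-up sample data

try:
    raw.decode("utf-8")
    message = ""
except UnicodeDecodeError as err:
    message = str(err)
    print(message)

# Assumption: if the file were Windows-1252, the same byte would be an en dash.
decoded = raw.decode("cp1252")
print(decoded)
```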

ArchiveBox version

ArchiveBox v0.6.3
Cpython Linux Linux-5.4.0-113-generic-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=sonic

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.10.4         valid     /usr/local/bin/python3.10                                                   
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py          
 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2022.04.08     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v100.0.4896.127  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           24 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            7 files         valid     /data                                                                       
 √  SOURCES_DIR           4 files         valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           35 files        valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             672.0 KB        valid     ./index.sqlite3
kerem closed this issue 2026-03-01 17:56:39 +03:00

@pirate commented on GitHub (Jun 9, 2022):

Seems like after the media extractor completes, it tries to load some subtitle / video metadata files (generated by youtube-dl) for full-text indexing that aren't encoded as UTF-8. I don't know of an easy full solution to this other than attempting to detect the encoding of those files dynamically (which is difficult and often error-prone).
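
One rough way to approximate "detecting the encoding" without extra dependencies is to try a short list of candidate encodings in order. This is only a sketch: `read_text_with_fallback` and the candidate list are hypothetical, not part of ArchiveBox, and a wrong guess can still yield mojibake without raising an error, which is why this approach is error-prone.

```python
from pathlib import Path
from typing import Sequence


def read_text_with_fallback(
    path: Path, encodings: Sequence[str] = ("utf-8", "cp1252")
) -> str:
    """Decode a file by trying each candidate encoding in turn (hypothetical helper)."""
    data = path.read_bytes()
    for encoding in encodings:
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue  # try the next candidate
    # Last resort: keep what is decodable and mark the rest with U+FFFD.
    return data.decode("utf-8", errors="replace")
```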

What we can do for now is just add an exception catcher that skips trying to index those files if they throw encoding errors + display a warning like (> Warning: Skipped adding some files to full-text index as they are not in UTF-8 format).
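
A minimal sketch of that exception catcher, modeled on the list comprehension shown in the traceback (`text_file.read_text(encoding='utf-8').strip()`). The function name `collect_index_texts` is hypothetical, and this is not the patch that was eventually merged.

```python
import sys
from pathlib import Path
from typing import Iterable, List


def collect_index_texts(text_files: Iterable[Path]) -> List[str]:
    """Read files for full-text indexing, skipping any that are not valid UTF-8."""
    texts: List[str] = []
    skipped: List[Path] = []
    for text_file in text_files:
        try:
            texts.append(text_file.read_text(encoding="utf-8").strip())
        except UnicodeDecodeError:
            skipped.append(text_file)  # remember it, but keep going
    if skipped:
        print(
            "> Warning: Skipped adding some files to full-text index "
            "as they are not in UTF-8 format",
            file=sys.stderr,
        )
    return texts
```

With this shape, a badly encoded subtitle file costs only its own index entry; the media files themselves are already saved by the time indexing runs.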


@turian commented on GitHub (Aug 20, 2022):

> What we can do for now is just add an exception catcher that skips trying to index those files if they throw encoding errors + display a warning like (> Warning: Skipped adding some files to full-text index as they are not in UTF-8 format).

That would be a great workaround.


@pirate commented on GitHub (Aug 22, 2022):

For anyone landing on this issue, just know it's fairly harmless. Despite the error being displayed, the archive method still completes successfully and the files are saved; it's just the full-text indexing part that fails. That's why I haven't prioritized fixing it yet. PRs welcome though! Otherwise I'll get around to it on the next sprint after I do 0.6.3 (which is already bloated and late).


@jgoerzen commented on GitHub (Aug 25, 2022):

The problem is that it crashes the whole add/update run. I've got a thousand other files to do, and they never get saved.
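
The behavior being asked for is per-link isolation: one failing URL should be recorded and the loop should move on. A minimal sketch of that pattern follows; `archive_all` and `archive_one` are hypothetical names and do not reflect how ArchiveBox's `archive_links` is actually structured.

```python
from typing import Callable, Iterable, List, Tuple


def archive_all(
    urls: Iterable[str], archive_one: Callable[[str], None]
) -> List[Tuple[str, Exception]]:
    """Archive every URL, collecting failures instead of aborting the run."""
    failures: List[Tuple[str, Exception]] = []
    for url in urls:
        try:
            archive_one(url)
        except Exception as err:  # deliberately broad: isolate each link
            failures.append((url, err))
    return failures
```

The caller can then report the collected failures at the end of the run rather than losing the remaining links.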


@turian commented on GitHub (Aug 27, 2022):

Agree with @jgoerzen that this bug is a showstopper that is currently keeping me from migrating to ArchiveBox :(


@turian commented on GitHub (Sep 11, 2022):

@pirate How far out is 0.6.3?


@turian commented on GitHub (Sep 12, 2022):

I believe I fixed this in https://github.com/ArchiveBox/ArchiveBox/pull/1026

TL;DR, until that's merged:

Add this to ArchiveBox.conf:

YOUTUBEDL_BINARY=/usr/bin/yt-dlp

If that doesn't work and you still get crap UnicodeDecodeErrors, you can use my Docker image turian/archivebox:kludge-984-UTF8-bug instead of archivebox/archivebox for now. Or use my branch (https://github.com/turian/ArchiveBox/tree/feature/kludge-984-UTF8-bug) and pip install or whatever from there.


@jgoerzen commented on GitHub (Sep 14, 2022):

@turian Thanks for your work on this!

Unfortunately, on your Docker image, I get:

PermissionError: [Errno 13] Permission denied: '/app/archivebox/core/migrations/0021_auto_20220914_0213.py'

And there is no /usr/bin/yt-dlp in the standard Docker image.
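
Since the binary's location differs between installs (the version output at the top of this issue lists it at /usr/local/bin/yt-dlp), it can help to check where it lives before setting `YOUTUBEDL_BINARY`. A small sketch using the standard library; `find_binary` and its candidate names are hypothetical and not part of ArchiveBox.

```python
import shutil
from typing import Optional, Sequence


def find_binary(candidates: Sequence[str] = ("yt-dlp", "youtube-dl")) -> Optional[str]:
    """Return the full path of the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    # Run inside the container to see which path to put in ArchiveBox.conf.
    print(find_binary())
```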


@pirate commented on GitHub (Sep 15, 2022):

Probably still a month or two out. I'm currently trying to find new housing in Oakland and that's taking up all my free time.

Might try and secure a $20-50k grant to work on ArchiveBox full-time in the near future! Will keep y'all posted, sorry for the brutal delay with this release, I know it's taking a lot longer than usual and I know that has real impact on everyone's workflows.


@turian commented on GitHub (Sep 15, 2022):

@jgoerzen I have fixed both these issues in my branch and have submitted PRs. The migrations bug is something in dev, but I pushed a minor PR to fix it. You can even create the migration yourself from dev by running Django's makemigrations management command.

You can use my Docker image turian/archivebox:kludge-984-UTF8-bug instead of archivebox/archivebox for now. Or use my branch (https://github.com/turian/ArchiveBox/tree/feature/kludge-984-UTF8-bug) and pip install or whatever from there.


@turian commented on GitHub (Sep 15, 2022):

> Probably still a month or two out. I'm currently trying to find new housing in Oakland and that's taking up all my free time.

Ach damn :(
I lived in the Bay Area. I feel your pain. Have you considered moving to Berlin?

> Might try and secure a $20-50k grant to work on ArchiveBox full-time in the near future! Will keep y'all posted, sorry for the brutal delay with this release, I know it's taking a lot longer than usual and I know that has real impact on everyone's workflows.

Well, as a Berliner you could apply for an EU grant. Somehow memex (https://memex.garden/) got one even though they are for-profit now. It seems like a cool project, but they refuse to implement bulk export. Their sponsors:

[image]

If you ping me later I might have other ideas for sponsors.

Can I please ask for a tiny request?

As a new contributor, could you please enable GitHub Actions CI/CD to run on my PRs?

[image]

Besides my larger PRs on yt-dlp (which I know you are too busy to review since it requires some thought), I have this tiny one to fix everyone's migration complaint about dev: https://github.com/ArchiveBox/ArchiveBox/pull/1027

and this one-liner documentation change: https://github.com/ArchiveBox/ArchiveBox/pull/1023

Good luck with the move!


@pirate commented on GitHub (Sep 22, 2022):

Thanks so much @turian for all your work here! I'll get on reviewing those PRs and I'll enable the CI checks for contributors.


@pirate commented on GitHub (Jan 19, 2024):

The original issue should be fixed here as of v0.7.2! Comment back if you're still having issues and I'll re-open.
