[GH-ISSUE #991] Bug: Docker install not able to save YouTube videos (media failure) #616

Closed
opened 2026-03-01 14:45:02 +03:00 by kerem · 7 comments
Owner

Originally created by @CorneliousJD on GitHub (Jun 17, 2022).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/991

Describe the bug

With a basic Docker install, ArchiveBox fails to save YouTube videos out of the box.

Steps to reproduce

Set up a Docker install and archive any YouTube video; the actual video files are not saved.

Screenshots or log output

Sometimes it reports that it failed to retrieve media; other times it reports success, but no video files are actually saved.

ArchiveBox version

:latest

ArchiveBox v0.6.3
Cpython Linux Linux-5.15.46-Unraid-x86_64-with-glibc2.31 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.10.4         valid     /usr/local/bin/python3.10                                                   
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/site-packages/django/bin/django-admin.py          
 √  CURL_BINARY           v7.74.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21           valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v17.9.0         valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.30.2         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2022.04.08     valid     /usr/local/bin/yt-dlp                                                       
 √  CHROME_BINARY         v101.0.4951.41  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           24 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         4 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /data                                                                       
 √  SOURCES_DIR           11 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           5 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             248.0 KB        valid     ./index.sqlite3

@pirate commented on GitHub (Jul 12, 2022):

Quoting the error output from @jansendotsh's issue:

Howdy 👋 I set up an ArchiveBox instance on a public facing VPS with docker-compose (using just ArchiveBox & Sonic). Whenever I send over a YouTube video, it seems to get wedged in a "not yet archived" state while the media archive method appears to not even kick off. I dug into the container's logs a bit and there looks to be an issue starting the media.py tool:

archivebox_1  | Internal Server Error: /add/
archivebox_1  | Traceback (most recent call last):
archivebox_1  |   File "/app/archivebox/extractors/__init__.py", line 109, in archive_link
archivebox_1  |     result = method_function(link=link, out_dir=out_dir)
archivebox_1  |   File "/app/archivebox/util.py", line 114, in typechecked_function
archivebox_1  |     return func(*args, **kwargs)
archivebox_1  |   File "/app/archivebox/extractors/media.py", line 74, in save_media
archivebox_1  |     index_texts = [
archivebox_1  |   File "/app/archivebox/extractors/media.py", line 75, in <listcomp>
archivebox_1  |     text_file.read_text(encoding='utf-8').strip()
archivebox_1  |   File "/usr/local/lib/python3.9/pathlib.py", line 1257, in read_text
archivebox_1  |     return f.read()
archivebox_1  |   File "/usr/local/lib/python3.9/codecs.py", line 322, in decode
archivebox_1  |     (result, consumed) = self._buffer_decode(data, self.errors, final)
archivebox_1  | UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 414699: invalid continuation byte
archivebox_1  | The above exception was the direct cause of the following exception:
archivebox_1  | Traceback (most recent call last):
archivebox_1  |   File "/usr/local/lib/python3.9/site-packages/django/core/handlers/exception.py", line 47, in inner
archivebox_1  |     response = get_response(request)
archivebox_1  |   File "/usr/local/lib/python3.9/site-packages/django/core/handlers/base.py", line 181, in _get_response
archivebox_1  |     response = wrapped_callback(request, *callback_args, **callback_kwargs)
archivebox_1  |   File "/usr/local/lib/python3.9/site-packages/django/views/generic/base.py", line 70, in view
archivebox_1  |     return self.dispatch(request, *args, **kwargs)
archivebox_1  |   File "/usr/local/lib/python3.9/site-packages/django/contrib/auth/mixins.py", line 109, in dispatch
archivebox_1  |     return super().dispatch(request, *args, **kwargs)
archivebox_1  |   File "/usr/local/lib/python3.9/site-packages/django/views/generic/base.py", line 98, in dispatch
archivebox_1  |     return handler(request, *args, **kwargs)
archivebox_1  |   File "/usr/local/lib/python3.9/site-packages/django/views/generic/edit.py", line 142, in post
archivebox_1  |     return self.form_valid(form)
archivebox_1  |   File "/app/archivebox/core/views.py", line 286, in form_valid
archivebox_1  |     add(**input_kwargs)
archivebox_1  |   File "/app/archivebox/util.py", line 114, in typechecked_function
archivebox_1  |     return func(*args, **kwargs)
archivebox_1  |   File "/app/archivebox/main.py", line 624, in add
archivebox_1  |     archive_links(new_links, overwrite=False, **archive_kwargs)
archivebox_1  |   File "/app/archivebox/util.py", line 114, in typechecked_function
archivebox_1  |     return func(*args, **kwargs)
archivebox_1  |   File "/app/archivebox/extractors/__init__.py", line 181, in archive_links
archivebox_1  |     archive_link(to_archive, overwrite=overwrite, methods=methods, out_dir=Path(link.link_dir))
archivebox_1  |   File "/app/archivebox/util.py", line 114, in typechecked_function
archivebox_1  |     return func(*args, **kwargs)
archivebox_1  |   File "/app/archivebox/extractors/__init__.py", line 130, in archive_link
archivebox_1  |     raise Exception('Exception in archive_methods.save_{}(Link(url={}))'.format(
archivebox_1  | Exception: Exception in archive_methods.save_media(Link(url=https://www.youtube.com/watch?v=r02eaOHenE0))

Any thoughts on how I can clean this up to allow a successful archive?
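
The traceback above ends in text_file.read_text(encoding='utf-8') inside media.py: a strict UTF-8 decode raises as soon as one of the files it reads contains a byte sequence that is not valid UTF-8. As a minimal sketch of one way to tolerate such files (not necessarily the fix that was later merged), Python's errors='replace' option substitutes U+FFFD for undecodable bytes instead of raising. The helper name read_text_lenient and the sample file are hypothetical.

```python
from pathlib import Path
import tempfile


def read_text_lenient(path: Path) -> str:
    """Read a file as UTF-8, replacing undecodable bytes with U+FFFD."""
    # errors='replace' keeps the read from raising UnicodeDecodeError
    return path.read_text(encoding='utf-8', errors='replace').strip()


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / 'subtitles.txt'  # hypothetical file name
        # 0xd5 starts a two-byte sequence, so a following ASCII 'a'
        # is an invalid continuation byte, as in the traceback
        sample.write_bytes(b'caption \xd5a text\n')
        print(read_text_lenient(sample))
```

The trade-off is that the replaced characters are lost from the indexed text, which is usually acceptable for search indexing but not for a byte-exact copy.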


@pirate commented on GitHub (Jul 20, 2022):

May be the same issue as https://github.com/ArchiveBox/ArchiveBox/issues/984, not sure yet


@jgoerzen commented on GitHub (Aug 22, 2022):

I'm seeing this also, and annoyingly, it crashes the entire add/update operation so that other sites aren't archiving either.
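
The traceback earlier in this thread shows why one bad video stops everything: the exception raised inside save_media is re-raised by archive_link and propagates up through archive_links, which ends the whole add. A minimal sketch of isolating failures per link follows; archive_all, archive_one, and fake_archiver are hypothetical stand-ins for illustration, not ArchiveBox functions.

```python
from typing import Callable, Dict, Iterable, List


def archive_all(urls: Iterable[str],
                archive_one: Callable[[str], None]) -> Dict[str, List[str]]:
    """Run archive_one on every URL, recording failures instead of aborting."""
    results: Dict[str, List[str]] = {'succeeded': [], 'failed': []}
    for url in urls:
        try:
            archive_one(url)
            results['succeeded'].append(url)
        except Exception as err:  # broad on purpose: one bad link must not stop the rest
            print(f'[!] {url}: {type(err).__name__}: {err}')
            results['failed'].append(url)
    return results


def fake_archiver(url: str) -> None:
    """Hypothetical extractor that fails on YouTube URLs to simulate the bug."""
    if 'youtube.com' in url:
        # Simulate the decode failure from the traceback
        b'\xd5a'.decode('utf-8')


if __name__ == '__main__':
    urls = ['https://example.com', 'https://www.youtube.com/watch?v=x']
    print(archive_all(urls, fake_archiver))
```

Catching broadly at the per-link boundary trades precision for resilience: the failed link is still reported, but the remaining links get their turn.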


@turian commented on GitHub (Sep 12, 2022):

I believe I fixed this in https://github.com/ArchiveBox/ArchiveBox/pull/1026

TL;DR, until that's merged:

Add this to ArchiveBox.conf:

YOUTUBEDL_BINARY=/usr/bin/yt-dlp

If that doesn't work and you still get crap UnicodeDecodeErrors, you can use my Docker image turian/archivebox:kludge-984-UTF8-bug instead of archivebox/archivebox for now. Or use my branch (https://github.com/turian/ArchiveBox/tree/feature/kludge-984-UTF8-bug) and pip install or whatever from there.
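
Before pointing YOUTUBEDL_BINARY at a fixed path, it may help to confirm where yt-dlp actually lives in the environment; the version output at the top of this issue lists /usr/local/bin/yt-dlp rather than /usr/bin/yt-dlp. The following is a small standard-library sketch, assuming it is run inside the same container or environment as ArchiveBox. The function name find_binary and the candidate list are assumptions made for illustration.

```python
import os
import shutil
from typing import Iterable, Optional


def find_binary(name: str, candidates: Iterable[str] = ()) -> Optional[str]:
    """Return the first usable path for `name`, or None if none is found."""
    for path in candidates:
        # Accept an explicit candidate only if it exists and is executable
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    # Fall back to searching the directories on PATH
    return shutil.which(name)


if __name__ == '__main__':
    found = find_binary('yt-dlp', ['/usr/bin/yt-dlp', '/usr/local/bin/yt-dlp'])
    if found:
        # Print a line in the same KEY=value form used in ArchiveBox.conf
        print(f'YOUTUBEDL_BINARY={found}')
    else:
        print('yt-dlp not found; install it before setting YOUTUBEDL_BINARY')
```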

@WakeReality commented on GitHub (Jan 8, 2023):

I'm using docker-compose to run ArchiveBox. How do I get yt-dlp inside the Docker container? I don't understand how Docker works well enough to populate the /usr/bin/yt-dlp binary. Thank you.


@turian commented on GitHub (Apr 2, 2023):

@WakeReality try my latest fix, which is merged into dev but, I believe, not yet into main.


@pirate commented on GitHub (Jun 13, 2023):

Thanks @turian for that fix! I'm currently doing a GitHub issue review and am closing old issues + collecting TODOs for the next release.

If anyone is still experiencing this issue on archivebox/archivebox:dev, comment back here and I can reopen the issue.
