[GH-ISSUE #761] COOKIES_FILE isn't used when fetching page titles, leading to saving captcha-page titles like "Before you continue to YouTube..." #3501

Open
opened 2026-03-14 23:17:27 +03:00 by kerem · 5 comments

Originally created by @dansbandit on GitHub (Jun 5, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/761

Describe the bug

The title becomes 'Before you continue to YouTube' instead of the video title because YouTube redirects to a cookie consent form. This could be solved if a cookie file could be passed to the curl command that is run:

["curl", "--silent", "--location", "--compressed", "--max-time", "60", "--user-agent", "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+https://github.com/ArchiveBox/ArchiveBox/) curl/curl 7.76.0 (amd64-portbld-freebsd12.2)", "https://www.youtube.com/watch?v=aP8sRCun63M"]
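
As a rough illustration of the request, the argument list above could be built with an optional cookie file appended. This is only a sketch: `build_curl_cmd` and its parameters are hypothetical names, not ArchiveBox's actual code. The one thing it relies on is curl's documented `--cookie` behavior, where an argument without an `=` sign is treated as a file to read cookies from.

```python
from pathlib import Path
from typing import List, Optional


def build_curl_cmd(
    url: str,
    user_agent: str,
    cookies_file: Optional[str] = None,  # hypothetical parameter, e.g. the COOKIES_FILE setting
    timeout: int = 60,
) -> List[str]:
    """Build a curl argument list like the one in the report, optionally with cookies."""
    cmd = [
        "curl",
        "--silent",
        "--location",
        "--compressed",
        "--max-time", str(timeout),
        "--user-agent", user_agent,
    ]
    # curl treats a --cookie argument without "=" as a file to read cookies from
    # (a path that itself contains "=" would be parsed as cookie data instead).
    if cookies_file and Path(cookies_file).is_file():
        cmd += ["--cookie", cookies_file]
    cmd.append(url)
    return cmd
```

When no cookie file is configured, or the path does not exist, the command is the same as the one shown above.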

Steps to reproduce

  1. archivebox add https://www.youtube.com/watch?v=aP8sRCun63M
  2. Title becomes 'Before you continue to YouTube' when it should be 'ArchiveBox'

Screenshots or log output

N/A

ArchiveBox version

ArchiveBox v0.6.2
Cpython FreeBSD FreeBSD-12.2-RELEASE-p6-amd64-64bit-ELF amd64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     ./.local/bin/archivebox                                                     
 √  PYTHON_BINARY         v3.7.10         valid     /usr/local/bin/python3.7                                                    
 √  DJANGO_BINARY         v3.1.12         valid     ./.local/lib/python3.7/site-packages/django/bin/django-admin.py             
 √  CURL_BINARY           v7.76.0         valid     /usr/local/bin/curl                                                         
 √  WGET_BINARY           v1.21           valid     /usr/local/bin/wget                                                         
 √  NODE_BINARY           v14.16.1        valid     /usr/local/bin/node                                                         
 √  SINGLEFILE_BINARY     v0.3.13         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.1.0          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.31.1         valid     /usr/local/bin/git                                                          
 √  YOUTUBEDL_BINARY      v2021.05.16     valid     /home/archivebox/.local/bin/youtube-dl                                      
 √  CHROME_BINARY         v90.0.4430.212  valid     /usr/local/bin/chrome                                                       
 √  RIPGREP_BINARY        v12.1.1         valid     /usr/local/bin/rg                                                           

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     ./.local/lib/python3.7/site-packages/archivebox                             
 √  TEMPLATES_DIR         3 files         valid     ./.local/lib/python3.7/site-packages/archivebox/templates                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 √  CHROME_USER_DATA_DIR  1 files         valid     ./~/.config/chromium                                                        
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:

@SoraMakes commented on GitHub (Jun 11, 2021):

ArchiveBox provides a way to add a cookies file. I use the Docker image, and there I added the following environment variable for that: COOKIES_FILE=/data/cookies.txt

I think it is only used for media and wget.


@dansbandit commented on GitHub (Jun 15, 2021):

> ArchiveBox provides a way to add a cookies file. I use the Docker image, and there I added the following environment variable for that: COOKIES_FILE=/data/cookies.txt
>
> I think it is only used for media and wget.

Yes I've tried that environment variable and it seems that it doesn't affect the title.


@pirate commented on GitHub (Jun 18, 2021):

Unfortunately the cookies file does not apply to the title, so there's no easy way to get around this right now till we push a fix to use the cookies in download_url() (see archivebox/extractors/title.py).

You'll have to edit the titles manually in the Admin to fix them, or try to stay under the rate limits that YouTube uses so that you're not throttled and served captcha pages. You can always click Pull Title in the Admin UI to force re-fetching the title.
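
For anyone picking this up, the fix described above might look roughly like the following. This is a standard-library sketch under stated assumptions, not the real `download_url()`: the function names, parameters, and the regex-based title extraction are all hypothetical, and the actual code in ArchiveBox may be structured differently. It assumes the cookies file is in the Netscape `cookies.txt` format, which is what `http.cookiejar.MozillaCookieJar` reads.

```python
import re
from html import unescape
from http.cookiejar import MozillaCookieJar
from typing import Optional
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Simplistic <title> matcher for illustration only; a real HTML parser is more robust.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def load_cookie_jar(cookies_file: Optional[str]) -> MozillaCookieJar:
    """Load a Netscape-format cookies.txt, or return an empty jar if none is set."""
    jar = MozillaCookieJar()
    if cookies_file:
        # Keep session cookies and expired entries so the file is used as-is.
        jar.load(cookies_file, ignore_discard=True, ignore_expires=True)
    return jar


def fetch_html(
    url: str,
    cookies_file: Optional[str] = None,  # hypothetical: would come from COOKIES_FILE
    timeout: int = 60,
    user_agent: str = "Mozilla/5.0",
) -> str:
    """Fetch a page, sending any cookies from cookies_file that match the URL."""
    opener = build_opener(HTTPCookieProcessor(load_cookie_jar(cookies_file)))
    request = Request(url, headers={"User-Agent": user_agent})
    with opener.open(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html: str) -> Optional[str]:
    """Return the whitespace-normalized <title> text, or None if there is none."""
    match = TITLE_RE.search(html)
    if not match:
        return None
    return unescape(" ".join(match.group(1).split())) or None
```

With this shape, the title extractor would call something like `extract_title(fetch_html(url, cookies_file=...))`, so a consent or login cookie stored in the file is sent along with the title request.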


@dansbandit commented on GitHub (Jun 21, 2021):

If I recall correctly, the cookie consent form affects all European users regardless of rate limits.

In the meantime I will try to get the titles another way.


@JoshMock commented on GitHub (Feb 5, 2023):

Would this still be a good first ticket? Looking to start making some contributions to the project, but want to get familiar with the codebase first.
