[GH-ISSUE #889] Bug: Unable to download TikTok page #550

Closed
opened 2026-03-01 14:44:30 +03:00 by kerem · 3 comments

Originally created by @aidenmitchell on GitHub (Nov 13, 2021).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/889

Describe the bug

I'm not sure whether this is a bug, a limitation of the TikTok website, or a limitation of ArchiveBox, but ArchiveBox fails to download all of the TikToks on a user's page.

Steps to reproduce

Ran ArchiveBox with depth=1 using the media archive method.

Screenshots or log output

[+] [2021-11-12 23:56:55] Adding 1 links to index (crawl depth=1)...
    > Saved verbatim input to sources/1636761415-import.txt
    > Parsed 1 URLs from input (Generic TXT)

[*] Starting crawl of 1 sites 1 hop out from starting point
    > Downloading https://vm.tiktok.com/REDACTED_URL contents
    > Saved verbatim input to sources/1636761415.248872-crawl-vm.tiktok.com.txt
    > Parsed 0 URLs from input (Failed to parse)
    > Found 1 new URLs not already in index

[*] [2021-11-12 23:56:55] Writing 1 links to main index...
    √ ./index.sqlite3
[▶] [2021-11-12 23:56:55] Starting archiving of 1 snapshots in index...

[+] [2021-11-12 23:56:55] "vm.tiktok.com/REDACTED_URL"
    https://vm.tiktok.com/REDACTED_URL
    > ./archive/1636761415.248872
      > media
        Extractor failed:
            Failed to save media
            Got youtube-dl response code: 1.
            WARNING: The program functionality for this site has been marked as broken, and will probably not work.
            ERROR: Unable to download JSON metadata: HTTP Error 502: Bad Gateway (caused by ); please report this issue on https://yt-dl.org/bug . Make sure you are using the latest version; see https://yt-dl.org/update on how to update. Be sure to call youtube-dl with the --verbose flag and include its complete output.
        Run to see full output:
            cd /data/archive/1636761415.248872;
            youtube-dl --write-description --write-info-json --write-annotations --write-thumbnail --no-call-home --write-sub --all-subs --write-auto-sub --convert-subs=srt --yes-playlist --continue --ignore-errors --geo-bypass --add-metadata --max-filesize=750m https://vm.tiktok.com/REDACTED_URL
        2 files (238.4 KB) in 0:00:03s

[√] [2021-11-12 23:56:59] Update of 1 pages complete (3.54 sec)
    - 0 links skipped
    - 7 links updated
    - 3 links had errors

    Hint: To manage your archive in a Web UI, run:
        archivebox server 0.0.0.0:8000

ArchiveBox version

ArchiveBox v0.6.3
Cpython Linux Linux-5.4.0-90-generic-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.3          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.8          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.13         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 √  GIT_BINARY            v2.20.1         valid     /usr/bin/git                                                                
 √  YOUTUBEDL_BINARY      v2021.06.06     valid     /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v90.0.4430.212  valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /data                                                                       
 √  SOURCES_DIR           12 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           3 files         valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             228.0 KB        valid     ./index.sqlite3 
kerem closed this issue 2026-03-01 14:44:30 +03:00

@pirate commented on GitHub (Nov 13, 2021):

Can you post the (optionally redacted) contents of sources/1636761415.248872-crawl-vm.tiktok.com.txt? According to ArchiveBox, there aren't any URLs in that file.

It also looks like there's an underlying issue between youtube-dl and TikTok that they need to fix. You might need to wait for them to push a fix upstream and then upgrade your youtube-dl version:

... youtubedl output from your logs above:
WARNING: The program functionality for this site has been marked as broken, and will probably not work.
ERROR: Unable to download JSON metadata: HTTP Error 502: Bad Gateway
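
To pick diagnostics like these out of a longer ArchiveBox log, a small filter can help. The following is a minimal sketch, not part of ArchiveBox or youtube-dl: the function name `extract_diagnostics` and the regular expression are assumptions made for illustration, and it only recognizes the `WARNING:` / `ERROR:` prefixes seen in the log above.

```python
import re

# Assumption: diagnostics start with "WARNING: " or "ERROR: " (as in the log above)
# and run until the next such prefix or the end of the line.
DIAGNOSTIC_RE = re.compile(
    r"\b(WARNING|ERROR): (.+?)(?=\s+(?:WARNING|ERROR): |$)",
    re.MULTILINE,
)


def extract_diagnostics(log_text: str) -> list[tuple[str, str]]:
    """Return (level, message) pairs found in log_text, in order of appearance."""
    return [(level, message.strip()) for level, message in DIAGNOSTIC_RE.findall(log_text)]


sample = (
    "WARNING: The program functionality for this site has been marked as broken, "
    "and will probably not work. "
    "ERROR: Unable to download JSON metadata: HTTP Error 502: Bad Gateway"
)
diagnostics = extract_diagnostics(sample)
for level, message in diagnostics:
    print(f"{level}: {message}")
```

The pattern works whether the two messages sit on one line (as pasted in this thread) or on separate lines.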

@aidenmitchell commented on GitHub (Nov 13, 2021):

sources/1636761415.248872-crawl-vm.tiktok.com.txt is empty.

As for youtube-dl, it does work for downloading individual videos if you give ArchiveBox the video link.
For example, https://vm.tiktok.com/ZM8477Q11/ will download using the media archiver, so I'm hoping ArchiveBox can do that with each link on a profile page.
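
As a possible workaround for the "each link on a profile page" case, the per-video links could be collected separately and passed to ArchiveBox one by one. The sketch below is only an illustration under stated assumptions: the URL patterns in `VIDEO_URL_RE` are guesses at what TikTok video links look like, `extract_video_urls` and the `@someuser` profile are hypothetical, and the page text has to come from somewhere other than the crawl file, which was empty in this case (for example, a copy of the profile page saved from a browser).

```python
import re

# Assumption: video links are either full URLs of the form
# https://www.tiktok.com/@<user>/video/<id> or short vm.tiktok.com links.
VIDEO_URL_RE = re.compile(
    r"https://(?:www\.tiktok\.com/@[\w.]+/video/\d+|vm\.tiktok\.com/\w+/?)"
)


def extract_video_urls(page_text: str) -> list[str]:
    """Return unique video URLs from page_text, preserving first-seen order."""
    seen: dict[str, None] = {}
    for url in VIDEO_URL_RE.findall(page_text):
        seen.setdefault(url.rstrip("/"), None)  # treat trailing-slash variants as one URL
    return list(seen)


page_text = """
<a href="https://www.tiktok.com/@someuser/video/1234567890">clip</a>
<a href="https://vm.tiktok.com/ZM8477Q11/">short link</a>
<a href="https://vm.tiktok.com/ZM8477Q11">same short link again</a>
"""
urls = extract_video_urls(page_text)
print("\n".join(urls))
```

Printed one URL per line, the output could then be piped into `archivebox add`, which accepts URLs on standard input.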

Side note: how would I upgrade youtube-dl if/when necessary? Thanks for all your help!
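
One common way to do this, assuming youtube-dl was installed with pip inside the container (the version output above shows it at /usr/local/bin/youtube-dl, which is consistent with a pip install but does not confirm it), is to upgrade the package through pip. The helper below is a hypothetical sketch rather than an ArchiveBox feature; by default it only builds and returns the command so it can be inspected before anything is run.

```python
import subprocess
import sys


def upgrade_with_pip(package: str = "youtube-dl", dry_run: bool = True) -> list[str]:
    """Build the pip upgrade command for `package`; run it only if dry_run is False."""
    command = [sys.executable, "-m", "pip", "install", "--upgrade", package]
    if not dry_run:
        # Assumption: pip is available for this interpreter and has network access.
        subprocess.run(command, check=True)
    return command


command = upgrade_with_pip()
print(" ".join(command))
```

In a Docker setup this would need to run inside the ArchiveBox container, and the upgrade may not persist when the container is recreated from the image.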


@pirate commented on GitHub (Jan 19, 2024):

Going to close this as stale for now. Many improvements have been released since this was opened, including switching from youtube-dl to yt-dlp.

We've also added SINGLEFILE_ARGS so users can point singlefile at a logged-in Chrome profile, which should allow archiving content that requires login.

Please open a new issue if you're still having problems with TikTok / any particular situations in archivebox >0.7.2!
