[GH-ISSUE #1086] Bug: Not archiving Twitter correctly #680

Closed
opened 2026-03-01 14:45:29 +03:00 by kerem · 5 comments

Originally created by @m-primo on GitHub (Jan 19, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1086

Describe the bug

The screenshot, SingleFile, and output.html outputs are not saved correctly: instead of the tweet itself, they show "Hmm...this page doesn’t exist. Try searching for something else."
See the screenshot below.
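
A quick way to spot affected snapshots is to scan the saved HTML for the error text quoted above. The sketch below is illustrative only: the helper names are hypothetical, and the file names (`output.html`, `singlefile.html`) inside a snapshot folder are assumptions based on this report, not a documented ArchiveBox API.

```python
from pathlib import Path

# Error text quoted in this report; the apostrophe may be curly or straight.
ERROR_MARKER = "this page doesn't exist"

# Assumed names of the HTML outputs inside one snapshot folder.
HTML_OUTPUTS = ("output.html", "singlefile.html")


def looks_like_twitter_error(html: str) -> bool:
    """Return True if the HTML contains the 'page doesn't exist' text."""
    # Normalize the curly apostrophe and case before searching.
    normalized = html.replace("\u2019", "'").lower()
    return ERROR_MARKER in normalized


def broken_outputs(snapshot_dir: str) -> list[str]:
    """List the HTML outputs in snapshot_dir that contain the error text."""
    broken = []
    for name in HTML_OUTPUTS:
        path = Path(snapshot_dir) / name
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="replace")
            if looks_like_twitter_error(text):
                broken.append(name)
    return broken
```

Calling `broken_outputs` on a snapshot folder returns the names of the HTML files that captured the error page instead of the tweet. It cannot inspect the screenshot, which would still need a visual check.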

Steps to reproduce

  1. Open your ArchiveBox instance.
  2. Archive any Twitter tweet.
  3. Check for yourself.

Even in your own demo instance (https://demo.archivebox.io/archive/1670286539.005187/index.html#1598332648723976193.html) it doesn't work!

Screenshots or log output

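image: https://user-images.githubusercontent.com/44984918/213495557-74c65e13-27d3-403e-8200-516d9afb0b4b.png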

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-5.15.0-58-generic-x86_64-with-glibc2.35 x86_64
IN_DOCKER=False DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox
 √  PYTHON_BINARY         v3.10.6         valid     /usr/bin/python3.10
 √  DJANGO_BINARY         v3.1.14         valid     /usr/local/lib/python3.10/dist-packages/django/bin/django-admin.py
 √  CURL_BINARY           v7.81.0         valid     /usr/bin/curl
 √  WGET_BINARY           v1.21.2         valid     /usr/bin/wget
 √  NODE_BINARY           v18.12.1        valid     /usr/bin/node
 √  SINGLEFILE_BINARY     v1.0.25         valid     ./node_modules/single-file/cli/single-file
 √  READABILITY_BINARY    v0.0.4          valid     ./node_modules/readability-extractor/readability-extractor
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js
 √  GIT_BINARY            v2.34.1         valid     /usr/bin/git
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /usr/local/bin/youtube-dl
 √  CHROME_BINARY         v90.0.4430.212  valid     /usr/bin/chromium
 X  RIPGREP_BINARY        ?               invalid   rg

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /usr/local/lib/python3.10/dist-packages/archivebox
 √  TEMPLATES_DIR         3 files         valid     /usr/local/lib/python3.10/dist-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled
 -  COOKIES_FILE          -               disabled

[i] Data locations:
 √  OUTPUT_DIR            6 files         valid     /home/<USERNAME_REDACTED>/archivebox
 √  SOURCES_DIR           1 files         valid     ./sources
 √  LOGS_DIR              2 files         valid     ./logs
 √  ARCHIVE_DIR           2 files         valid     ./archive
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf
 √  SQL_INDEX             216.0 KB        valid     ./index.sqlite3

[!] Warning: Missing 1 recommended dependencies
    ! RIPGREP_BINARY: rg (unable to detect version)
kerem closed this issue 2026-03-01 14:45:29 +03:00

@m-primo commented on GitHub (Jan 19, 2023):

Btw, I tried saving tweets with headless Chromium and I got the same result.


@pirate commented on GitHub (Jan 21, 2023):

Yup, you should archive the equivalent Nitter URLs (or use another alternative front-end instead of Twitter). Archiving Twitter has always been very broken. This is also true for Reddit -> Teddit, Instagram -> Bibliogram, and a couple of other big companies that implement advanced bot-detection and blocking. See a longer list of alternative front-ends here: https://hackmd.io/MCpUlTbLThyF6cw_fywT_g?view. It's not ideal, but it's better than not having any solution.

Follow here for updates: https://github.com/ArchiveBox/ArchiveBox/issues/345
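
The URL swap suggested above can be sketched as a small helper. This is a minimal illustration, not part of ArchiveBox: `nitter.example.org` is a placeholder host (substitute an instance you trust), the function name is hypothetical, and the code assumes the front-end mirrors Twitter's path layout for tweet URLs. Other front-ends such as Teddit or Bibliogram may need path changes as well, so they are left out of the mapping.

```python
from urllib.parse import urlsplit, urlunsplit

# Placeholder mapping (assumption): replace the value with a real instance.
FRONTENDS = {
    "twitter.com": "nitter.example.org",
}


def to_alternative_frontend(url: str) -> str:
    """Swap a known host for its alternative front-end; leave other URLs unchanged."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Treat common subdomains as the same site.
    for prefix in ("www.", "mobile."):
        if host.startswith(prefix):
            host = host[len(prefix):]
            break
    replacement = FRONTENDS.get(host)
    if replacement is None:
        return url
    # Keep scheme, path, query, and fragment; only the host changes.
    return urlunsplit(parts._replace(netloc=replacement))
```

The rewritten URLs can then be passed to `archivebox add` in place of the original Twitter links.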


@m-primo commented on GitHub (Jan 22, 2023):

> Yup, you should archive the equivalent Nitter URLs (or use another alternative front-end instead of Twitter). Archiving Twitter has always been very broken. This is also true for Reddit -> Teddit, Instagram -> Bibliogram, and a couple of other big companies that implement advanced bot-detection and blocking. See a longer list of alternative front-ends here: https://hackmd.io/MCpUlTbLThyF6cw_fywT_g?view. It's not ideal, but it's better than not having any solution.
>
> Follow here for updates: #345

That's what I thought at first, but I opened this issue in case anyone can help or find a solution. I've tried many archiving solutions and some workarounds, and I guess the only one that worked was pywb. But thanks, I'll take a look at the link in your reply.


@pirate commented on GitHub (Jan 22, 2023):

Yeah, if you're doing a lot of Twitter/FB/Insta/etc. archiving, I highly recommend https://github.com/webrecorder/browsertrix-crawler. It uses the same engine as pywb and is written by the same team.

Check out their whole suite here: https://webrecorder.net/


@m-primo commented on GitHub (Jan 23, 2023):

Okay, thank you so much.
