mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 09:06:02 +03:00
[GH-ISSUE #1670] Bug: instagram title #2507
Labels
No labels
expected: maybe someday
expected: next release
expected: release after next
expected: unlikely unless contributed
good first ticket
help wanted
pull-request
scope: all users
scope: windows users
size: easy
size: hard
size: medium
size: medium
status: backlog
status: blocked
status: done
status: idea-phase
status: needs followup
status: wip
status: wontfix
touches: API/CLI/Spec
touches: configuration
touches: data/schema/architecture
touches: dependencies/packaging
touches: docs
touches: js
touches: views/replayers/html/css
why: correctness
why: functionality
why: performance
why: security
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
starred/ArchiveBox#2507
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Originally created by @hydrargyrum on GitHub (Mar 29, 2025).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1670
Originally assigned to: @pirate on GitHub.
Provide a screenshot and describe the bug
When submitting an instagram URL into archivebox, the fetched title is only "instagram".
I noticed the title is set in javascript in some circumstances but not all:
For example, I tried this outside archivebox:
curl https://www.instagram.com/instagram/reel/DHd6PNKs_Mj/ | htmlq titleAnd it printed
So I tried to set
CURL_USER_AGENT=curl/8.13.0-rc3env var in archivebox and submitted another instagram URL, but archivebox still only extracted title "instagram"Steps to reproduce
Logs or errors
ArchiveBox Version
How did you install the version of ArchiveBox you are using?
Docker (or Podman/LXC/K8s/TrueNAS/Proxmox/etc)
What operating system are you running on?
Linux (Ubuntu/Debian/Arch/Alpine/etc.)
What type of drive are you using to store your ArchiveBox data?
data/is on a local SSD or NVMe drivedata/is on a spinning hard drive or external USB drivedata/is on a network mount (e.g. NFS/SMB/Ceph/GlusterFS/etc.)data/is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/Google Drive/Dropbox/etc.)Docker Compose Configuration
ArchiveBox Configuration
@hydrargyrum commented on GitHub (Mar 29, 2025):
with wget:
wget --header "User-Agent: curl" -O- https://www.instagram.com/instagram/reel/DHd6PNKs_Mj/ | htmlq titlewget -O- https://www.instagram.com/instagram/reel/DHd6PNKs_Mj/ | htmlq title, yielding<title id="pageTitle">Update your browser</title>@hydrargyrum commented on GitHub (Mar 29, 2025):
if i'm reading the code correctly:
cmdisn't used anywhere in the function! (except for the result reporting, which is a big lie)i would have no idea how to fix this, it's quite a mess
@pirate commented on GitHub (Mar 30, 2025):
Trust me it's that way for good reason, the cmd is assembled so the user has something to copy-paste to debug further, this is the case for several of the Python based extractors. Curl wont exactly replicate what archivebox does but it's close enough that it will help the user find the source of the issue 99% of the time.
@hydrargyrum commented on GitHub (Mar 30, 2025):
sure but in the end:
all of this is very confusing for the user, and i can't see a workaround to make archivebox extract the title correctly
@pirate commented on GitHub (Mar 30, 2025):
If you'd like to submit a PR that applies
CURL_USER_AGENTcorrectly to the title curl method I'd happily review it.PRs to improve the docs are also welcome: https://github.com/ArchiveBox/docs
@pirate commented on GitHub (Apr 1, 2025):
oh wait you were on
v0.7.3! no wonder. the newer betas already added the configurable user agent when downloading the url in python. it doens't make sense to use the CURL_USER_AGENT for the download because it's done in python and doesn't actually use curl. we have a new defaultUSER_AGENTconfig option for exactly this type of situation in the new betas.@hydrargyrum commented on GitHub (Apr 1, 2025):
i'm using
latesttag from https://hub.docker.com/r/archivebox/archivebox/tags@hydrargyrum commented on GitHub (Apr 1, 2025):
yes, but that's what the whole UI says! it shows a curl command! nowhere it's documented that the download is done with python with USER_AGENT!
plus, if there's a download with dom/singlefile/wget, then that one is used, so USER_AGENT is ignored
@pirate commented on GitHub (Apr 1, 2025):
It's not ignored, USER_AGENT is applied to the other methods too. CURL_USER_AGENT just overrides USER_AGENT in the case that any curl command is rendered either for execution or for log display.
The goal is to ensure anytime curl is executed, it has that agent (to avoid ever leaking original agents). The command shown in terminal is a suggested command you can use to debug the behavior, we may use python dotted import function references in the future as well, and in that case we can show the true underlying command, until then a replica curl command that respects CURL_USER_AGENT is good enough.