[GH-ISSUE #1670] Bug: instagram title #2507

Open
opened 2026-03-01 17:59:31 +03:00 by kerem · 9 comments
Owner

Originally created by @hydrargyrum on GitHub (Mar 29, 2025).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1670

Originally assigned to: @pirate on GitHub.

Provide a screenshot and describe the bug

When submitting an instagram URL into archivebox, the fetched title is only "instagram".
I noticed the title is set in javascript in some circumstances but not all:

  • if a firefox/chrome user-agent is sent, the real title is set with javascript
  • with a curl user-agent, a full title is returned
    For example, I tried this outside archivebox: curl https://www.instagram.com/instagram/reel/DHd6PNKs_Mj/ | htmlq title
    And it printed
<title>Instagram | Nailed it 😆🛹👏

#InTheMoment

Video by @jonjustino 
Music by The Dave Brubeck Quartet | Instagram</title>

So I tried to set CURL_USER_AGENT=curl/8.13.0-rc3 env var in archivebox and submitted another instagram URL, but archivebox still only extracted title "instagram"
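
The comparison above can be scripted. What follows is a minimal sketch using only the Python standard library. `TitleParser`, `extract_title` and `fetch_title` are hypothetical helper names chosen for illustration, not ArchiveBox functions, and the titles Instagram serves for a given user-agent may change over time.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> counts; attributes such as id= are ignored.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html):
    """Return the first <title> text with whitespace collapsed, or None."""
    parser = TitleParser()
    parser.feed(html)
    title = " ".join("".join(parser.parts).split())
    return title or None


def fetch_title(url, user_agent):
    """Download url with the given User-Agent header and return its <title>."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=30) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_title(html)
```

Calling `fetch_title` on the reel URL once with `curl/8.13.0-rc3` and once with a browser-like user-agent string should show whether the server-side title differs between the two, independently of ArchiveBox.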

Steps to reproduce

1. submit an instagram URL (like `https://www.instagram.com/instagram/reel/DHd6PNKs_Mj/`) with at least `title` extractor

Logs or errors


ArchiveBox Version

v0.7.3

How did you install the version of ArchiveBox you are using?

Docker (or Podman/LXC/K8s/TrueNAS/Proxmox/etc)

What operating system are you running on?

Linux (Ubuntu/Debian/Arch/Alpine/etc.)

What type of drive are you using to store your ArchiveBox data?

  • [ ] some of data/ is on a local SSD or NVMe drive
  • [ ] some of data/ is on a spinning hard drive or external USB drive
  • [ ] some of data/ is on a network mount (e.g. NFS/SMB/Ceph/GlusterFS/etc.)
  • [ ] some of data/ is on a FUSE mount (e.g. SSHFS/RClone/S3/B2/Google Drive/Dropbox/etc.)

Docker Compose Configuration


ArchiveBox Configuration


Author
Owner

@hydrargyrum commented on GitHub (Mar 29, 2025):

with wget:

  • this extracts the right title (because UA is curl): wget --header "User-Agent: curl" -O- https://www.instagram.com/instagram/reel/DHd6PNKs_Mj/ | htmlq title
  • this extracts a bad title (wget UA): wget -O- https://www.instagram.com/instagram/reel/DHd6PNKs_Mj/ | htmlq title, yielding <title id="pageTitle">Update your browser</title>
Author
Owner

@hydrargyrum commented on GitHub (Mar 29, 2025):

if i'm reading the code correctly:

  • this part (https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/pkgs/abx-plugin-title/abx_plugin_title/extractor.py#L114) builds a valid curl command to download the page in order to extract the title
  • but cmd isn't used anywhere in the function! (except for the result reporting, which is a big lie)
  • instead, the page is either reused (https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/pkgs/abx-plugin-title/abx_plugin_title/extractor.py#L64) from dom/wget (which aren't curl, and their UA is set by a different env variable)
  • or the page is downloaded with python requests (https://github.com/ArchiveBox/ArchiveBox/blob/dev/archivebox/misc/util.py#L226) and its UA is not configurable anyway

i would have no idea how to fix this, it's quite a mess

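
The flow described in this comment can be summarised in code. This is a simplified, hypothetical sketch based only on the description above, not the actual ArchiveBox implementation. The function name, its parameters and the return shape are all made up for illustration.

```python
def get_title_html(url, cached_html, download):
    """Return (html, source, cmd) following the flow the comment describes.

    cached_html: output of an earlier dom/wget extractor, or None (assumption).
    download:    callable standing in for the Python-side HTTP download.
    """
    # Display-only command: assembled for the result log, never executed here.
    cmd = ["curl", "--silent", "--location", url]

    if cached_html is not None:
        # Reuse a page already fetched by another extractor, which used
        # that extractor's own user-agent setting rather than curl's.
        return cached_html, "cached", cmd

    # Otherwise download in Python; curl-specific settings do not apply.
    return download(url), "python", cmd
```

In both branches the returned `cmd` names curl even though curl never ran, which is the mismatch the comment objects to.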
Author
Owner

@pirate commented on GitHub (Mar 30, 2025):

Trust me, it's that way for good reason: the cmd is assembled so the user has something to copy-paste to debug further. This is the case for several of the Python-based extractors. Curl won't exactly replicate what archivebox does, but it's close enough that it will help the user find the source of the issue 99% of the time.

Author
Owner

@hydrargyrum commented on GitHub (Mar 30, 2025):

sure but in the end:

  1. it doesn't extract the title
  2. the documented env var (CURL_USER_AGENT) is worthless
  3. the "it's only to replicate 99% the issue" caveat is not advertised as such in either the interface or the documentation

all of this is very confusing for the user, and i can't see a workaround to make archivebox extract the title correctly

Author
Owner

@pirate commented on GitHub (Mar 30, 2025):

If you'd like to submit a PR that applies CURL_USER_AGENT correctly to the title curl method I'd happily review it.

PRs to improve the docs are also welcome: https://github.com/ArchiveBox/docs

Author
Owner

@pirate commented on GitHub (Apr 1, 2025):

oh wait, you were on v0.7.3! no wonder. the newer betas already added a configurable user agent for downloading the url in python. it doesn't make sense to use CURL_USER_AGENT for the download because it's done in python and doesn't actually use curl. we have a new default USER_AGENT config option for exactly this type of situation in the new betas.

Author
Owner

@hydrargyrum commented on GitHub (Apr 1, 2025):

i'm using latest tag from https://hub.docker.com/r/archivebox/archivebox/tags

Author
Owner

@hydrargyrum commented on GitHub (Apr 1, 2025):

> it doesn't make sense to use the CURL_USER_AGENT for the download because it's done in python and doesn't actually use curl

yes, but that's what the whole UI says! it shows a curl command! nowhere is it documented that the download is done with python using USER_AGENT!

plus, if there's a download with dom/singlefile/wget, then that one is used, so USER_AGENT is ignored

Author
Owner

@pirate commented on GitHub (Apr 1, 2025):

It's not ignored; USER_AGENT is applied to the other methods too. CURL_USER_AGENT just overrides USER_AGENT whenever a curl command is rendered, either for execution or for log display.

The goal is to ensure that anytime curl is executed, it has that agent (to avoid ever leaking original agents). The command shown in the terminal is a suggested command you can use to debug the behavior. We may use python dotted import function references in the future, in which case we can show the true underlying command; until then, a replica curl command that respects CURL_USER_AGENT is good enough.
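
The precedence described above can be illustrated with a short sketch. It assumes both settings are read from plain environment variables, which is a simplification. `resolve_user_agents` is a hypothetical helper, not part of ArchiveBox, whose real config loading may differ.

```python
import os


def resolve_user_agents(env=None):
    """Return the user agents for Python downloads and rendered curl commands."""
    env = os.environ if env is None else env
    default_ua = env.get("USER_AGENT")
    # CURL_USER_AGENT wins only where a curl command is rendered or executed;
    # when it is unset, curl falls back to the general USER_AGENT.
    curl_ua = env.get("CURL_USER_AGENT") or default_ua
    return {"python": default_ua, "curl": curl_ua}
```

Under this reading, setting only `CURL_USER_AGENT` changes the curl command that is displayed but not the user agent of the Python-side download, which matches what the reporter observed.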
