mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
[PR #1396] [MERGED] fix the URL_REGEX used in generic_html parsers #1398
📋 Pull Request Information
Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1396
Author: @tqobqbq
Created: 4/7/2024
Status: ✅ Merged
Merged: 4/24/2024
Merged by: @pirate
Base: dev ← Head: fix-URL_REGEX
📝 Commits (4)
4ae765e fix the URL_REGEX used in generic_html parsers
e4dc270 fix URL_REGEX 2
c6f8a33 Update util.py
17f40f3 Merge branch 'dev' into fix-URL_REGEX
📊 Changes
1 file changed (+89 additions, -8 deletions)
📝 archivebox/util.py (+89 -8)
📄 Description
Fix the URL_REGEX in util.py.
I crawled this URL:

url="https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ 公式サイト&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
and re.findall(URL_REGEX, url) in generic_html.py returns
['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ', 'https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ'].
The second entry is not a valid URL: it is the value of the url= query parameter with the following &text=... parameter still attached, and it raises an error both in a real browser and in requests.
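The behaviour described above can be reproduced with a small sketch. The pattern below is a simplified stand-in written for illustration (the name OVERLAPPING_URL_REGEX is hypothetical, and this is not the actual URL_REGEX from archivebox/util.py); it only assumes that the real pattern, like this one, captures inside a lookahead so that re.findall can report overlapping matches.

```python
import re
from urllib.parse import urlparse

# Simplified stand-in pattern (an assumption, NOT ArchiveBox's URL_REGEX).
# The capture sits inside a lookahead, so each match is zero-width and
# findall keeps scanning from the next character, reporting nested URLs too.
OVERLAPPING_URL_REGEX = re.compile(r'(?=(https?://[^\s<>"\']+))')

url = (
    "https://twitter.com/share?url=https://akaao.success-corp.co.jp"
    "&text=アカイイト&アオイシロ 公式サイト&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
)

matches = OVERLAPPING_URL_REGEX.findall(url)
for m in matches:
    print(m)

# The nested match has no '/', '?' or '#' after the host, so the leftover
# query parameters end up inside the network location.
nested = matches[1]
print(urlparse(nested).netloc)
```

With this stand-in, the second match is the embedded URL with "&text=..." glued onto the host name, which is why neither a browser nor requests can resolve it.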
In fact, in the original URL_REGEX:
lines 3, 4, and 5 each match only a single character, because no '+' quantifier follows them.
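A minimal sketch of the missing-quantifier effect, using two hypothetical patterns (single and repeated are illustrative names, not taken from util.py): an alternation of character classes without a trailing '+' consumes exactly one character, while the same group followed by '+' consumes the whole run.

```python
import re

# Hypothetical patterns for illustration; not the actual ArchiveBox URL_REGEX.
# Without a quantifier, the group matches exactly one character after the scheme.
single = re.compile(r'https?://(?:[a-zA-Z]|[0-9]|[-_.])')
# With '+', the same group matches one or more allowed characters.
repeated = re.compile(r'https?://(?:[a-zA-Z]|[0-9]|[-_.])+')

text = 'https://example.com/page'
print(single.match(text).group())    # https://e
print(repeated.match(text).group())  # https://example.com
```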
I changed it to a more standard one:
and it works for my example.
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.