[PR #1396] [MERGED] fix the URL_REGEX used in generic_html parsers #4412

Closed

opened 2026-03-15 01:43:06 +03:00 by kerem · 0 comments

kerem (Owner) commented 2026-03-15 01:43:06 +03:00

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1396
Author: @tqobqbq
Created: 4/7/2024
Status: ✅ Merged
Merged: 4/24/2024
Merged by: @pirate

Base: dev ← Head: fix-URL_REGEX

📝 Commits (4)

4ae765e fix the URL_REGEX used in generic_html parsers
e4dc270 fix URL_REGEX 2
c6f8a33 Update util.py
17f40f3 Merge branch 'dev' into fix-URL_REGEX

📊 Changes

1 file changed (+89 additions, -8 deletions)

📝 archivebox/util.py (+89 -8)

📄 Description

Fix the URL_REGEX in archivebox/util.py

I crawled this URL:
url="https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト＆アオイシロ公式サイト&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
and re.findall(URL_REGEX, url) in generic_html.py returns
['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト＆アオイシロ', 'https://akaao.success-corp.co.jp&text=アカイイト＆アオイシロ'].
The second entry is not a valid URL; it raises an error both in a real browser and in requests.
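
The overlapping match can be reproduced with a short sketch. It assumes the pattern is the pre-fix URL_REGEX quoted in this description and that the URL is passed as a bare string. The exact tail of each match depends on where whitespace, quotes, or brackets follow the URL in the crawled page, so the sketch only looks at how many matches there are and where the second one starts. The name `OLD_URL_REGEX` is made up for this example.

```python
import re

# Pre-fix pattern, copied from this PR description (renamed for the example).
OLD_URL_REGEX = re.compile(
    r'(?=('
    r'http[s]?://'
    r'(?:[a-zA-Z]|[0-9]'
    r'|[-_$@.&+!*\(\),]'
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))'
    r'[^\]\[\(\)<>"\'\s]+'
    r'))',
    re.IGNORECASE,
)

url = (
    "https://twitter.com/share?url=https://akaao.success-corp.co.jp"
    "&text=アカイイト＆アオイシロ公式サイト"
    "&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
)

# The lookahead lets findall report overlapping matches, so the URL nested
# in the query string comes back as a second, separate entry. That entry
# keeps running past ".co.jp" into "&text=...", which is not a usable URL.
old_matches = OLD_URL_REGEX.findall(url)
for match in old_matches:
    print(match)
```

The second printed entry begins with `https://akaao.success-corp.co.jp&text=`, which is the malformed URL described above.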

In fact, look at the original URL_REGEX:

URL_REGEX = re.compile(
    r'(?=('
    r'http[s]?://'
    r'(?:[a-zA-Z]|[0-9]'                      # 3
    r'|[-_$@.&+!*\(\),]'                     # 4
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))'  # 5
    r'[^\]\[\(\)<>"\'\s]+' 
    r'))',
    re.IGNORECASE,
)

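The alternation on the lines marked # 3, # 4 and # 5 matches only a single character, because no '+' follows the group; the rest of the URL is matched by the negated character class on the next line.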
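
To illustrate the effect of the missing quantifier, here is a minimal sketch with two simplified patterns. The names `single` and `repeated` and the sample URL are made up for this example and are not part of ArchiveBox.

```python
import re

# Same alternation style as the old pattern, captured so it can be inspected.
single = re.compile(r'http[s]?://((?:[a-zA-Z]|[0-9]))')     # no quantifier
repeated = re.compile(r'http[s]?://((?:[a-zA-Z]|[0-9])+)')  # with '+'

text = 'https://example.com/page'

single_hit = single.match(text).group(1)
repeated_hit = repeated.match(text).group(1)

print(single_hit)    # e
print(repeated_hit)  # example
```

Without the '+', the group consumes exactly one character after the scheme, so it does nothing to constrain the host part of the URL.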
I changed it to a more standard pattern:

URL_REGEX = re.compile(
    r'(?=('
    r'https?://'                          # scheme: http or https only (ftp is not matched)
    r'(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9-]+'  # domain: dot-separated labels
    r'(?::\d+)?'                          # port, optional
    r'(?:/[^\\#\f\n\r\t\v]*)?'            # path and query, optional
    # r'(?:#[^\]\[\(\)<>"\'\s]*){0,1}'    # fragment: left disabled, not actually needed
    r'))',
    # re.IGNORECASE,                      # left disabled, case-insensitive matching is not needed
)

This pattern works for my example.
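
As a quick check, the sketch below runs the proposed pattern against the example URL. It assumes the pattern exactly as written in this description, which may differ from the code that was finally merged. The name `NEW_URL_REGEX` is made up for this example.

```python
import re

# Proposed pattern, copied from this PR description (renamed for the example).
NEW_URL_REGEX = re.compile(
    r'(?=('
    r'https?://'                          # scheme
    r'(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9-]+'  # domain
    r'(?::\d+)?'                          # optional port
    r'(?:/[^\\#\f\n\r\t\v]*)?'            # optional path and query
    r'))',
)

url = (
    "https://twitter.com/share?url=https://akaao.success-corp.co.jp"
    "&text=アカイイト＆アオイシロ公式サイト"
    "&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
)

new_matches = NEW_URL_REGEX.findall(url)
for match in new_matches:
    print(match)
# First line printed: the full share URL, unchanged.
# Second line printed: https://akaao.success-corp.co.jp
```

The nested URL now stops at the end of the domain, because the path part is only matched when it starts with '/', and here the next character is '&'.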

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.


kerem closed this issue and added the pull-request label 2026-03-15 01:43:06 +03:00
