[PR #1396] fix the URL_REGEX used in generic_html parsers #2908

Closed
opened 2026-03-01 18:01:06 +03:00 by kerem · 0 comments
Owner

Original Pull Request: https://github.com/ArchiveBox/ArchiveBox/pull/1396

State: closed
Merged: Yes


Fix the URL_REGEX in utils.py

I crawled the following URL:
url="https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ 公式サイト&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
and the re.findall(URL_REGEX, url) call in generic_html.py returns
['https://twitter.com/share?url=https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ', 'https://akaao.success-corp.co.jp&text=アカイイト&アオイシロ'],
The second entry is not a valid URL: requesting it raises an error both in a real browser and with requests.

For reference, the original URL_REGEX is:

URL_REGEX = re.compile(
    r'(?=('
    r'http[s]?://'
    r'(?:[a-zA-Z]|[0-9]'                      # 3
    r'|[-_$@.&+!*\(\),]'                     # 4
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))'  # 5
    r'[^\]\[\(\)<>"\'\s]+' 
    r'))',
    re.IGNORECASE,
)

The alternatives marked # 3, # 4, and # 5 together match only a single character, because no '+' quantifier follows the group that contains them.
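The behavior described above can be reproduced with a short script. This is a minimal sketch that copies the original pattern from this PR; the names OLD_URL_REGEX and old_matches are chosen here for illustration and do not appear in ArchiveBox.

```python
import re

# Original pattern, copied from the PR description (name is illustrative).
OLD_URL_REGEX = re.compile(
    r'(?=('
    r'http[s]?://'
    r'(?:[a-zA-Z]|[0-9]'
    r'|[-_$@.&+!*\(\),]'
    r'|(?:%[0-9a-fA-F][0-9a-fA-F]))'   # the group above consumes one character
    r'[^\]\[\(\)<>"\'\s]+'             # everything else, up to whitespace
    r'))',
    re.IGNORECASE,
)

url = (
    "https://twitter.com/share?url=https://akaao.success-corp.co.jp"
    "&text=アカイイト&アオイシロ 公式サイト"
    "&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
)

# The lookahead lets findall report a match at every position where a URL
# starts, including the URL nested inside the query string.
old_matches = re.findall(OLD_URL_REGEX, url)
for match in old_matches:
    print(match)
```

Both matches stop at the first whitespace character, and the nested match keeps the trailing "&text=..." that belongs to the outer query string.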
I changed it to a more standard pattern:

URL_REGEX = re.compile(
    r'(?=('
    r'https?://'                          # match the http and https schemes (ftp is not matched)
    r'(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9-]+'  # match the domain
    r'(?::\d+)?'                          # match the port, if present
    r'(?:/[^\\#\f\n\r\t\v]*)?'            # match the path and query, if present
    # r'(?:#[^\]\[\(\)<>"\'\s]*){0,1}'    # match the fragment (disabled: not actually needed)
    r'))',
    # re.IGNORECASE,                      # disabled: the character classes already cover both cases
)

With this pattern, my example is parsed correctly.
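As a rough check of this claim, the sketch below runs the new pattern against the same example URL. The names NEW_URL_REGEX and new_matches are illustrative only, and the script tests just this one input, not HTML in general.

```python
import re

# New pattern from this PR (name is illustrative).
NEW_URL_REGEX = re.compile(
    r'(?=('
    r'https?://'                          # scheme
    r'(?:[A-Za-z0-9-]+\.)+[A-Za-z0-9-]+'  # domain
    r'(?::\d+)?'                          # optional port
    r'(?:/[^\\#\f\n\r\t\v]*)?'            # optional path and query
    r'))',
)

url = (
    "https://twitter.com/share?url=https://akaao.success-corp.co.jp"
    "&text=アカイイト&アオイシロ 公式サイト"
    "&hashtags=アカアオ,元祖百合,アカイイト,アオイシロ"
)

new_matches = NEW_URL_REGEX.findall(url)
for match in new_matches:
    print(match)
```

For this input the nested match is cut at the end of the domain, because the character after "jp" is "&" rather than "/". Note that the path class excludes only backslash, "#", and a few control whitespace characters, so the outer match runs through the plain space to the end of the string.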
