[GH-ISSUE #1222] Bug: If document's title tag is empty, title extractor sets the snapshot title to "</title" #3770

Closed
opened 2026-03-15 00:23:52 +03:00 by kerem · 2 comments

Originally created by @rmohns on GitHub (Aug 29, 2023).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/1222

Describe the bug

I saved a webpage which is terribly coded by hand, and has an empty `title` tag. (Literally: `<title></title>`.) The resulting snapshot is named "</title". Easy to change but odd.

TBH I'm not sure if you should care, since we may not care if horribly invalid documents create errors. But on the off chance that it's easy to check for and change this in the code, am filing bug report. (Perhaps such snapshots could be named "No document title found".)
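
For illustration, the kind of check suggested above could look roughly like the sketch below. This is not ArchiveBox's actual title extractor; `TitleParser` and `extract_title` are hypothetical names, and the sketch only assumes Python's standard-library `html.parser`. It collects the text of the first `title` element and falls back to a placeholder when that text is empty or missing.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are between <title> and </title>
        self.done = False       # True once the first <title> element has closed
        self.parts = []         # text chunks seen inside the title

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(html, fallback=None):
    """Return the page title with whitespace collapsed, or `fallback` if it is empty."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.parts).split())
    return title or fallback


if __name__ == "__main__":
    page = "<html><head><title></title></head><body></body></html>"
    print(extract_title(page, fallback="No document title found"))
```

With a check like this, an empty `<title></title>` yields the fallback string rather than a fragment of the markup.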

Steps to reproduce

  1. Saved this page to ArchiveBox: http://wildwestcycle.com/f_oiltempdegradation.html
  2. Snapshot title is `</title`

Screenshots or log output

![Screenshot 2023-08-29 at 5 05 12 PM](https://github.com/ArchiveBox/ArchiveBox/assets/2131133/ac6d6364-3177-467d-a27e-24e57589d806)

ArchiveBox version

ArchiveBox v0.6.2
Cpython Linux Linux-4.4.302+-x86_64-with-glibc2.28 x86_64
IN_DOCKER=True DEBUG=False IS_TTY=True TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /usr/local/bin/archivebox                                                   
 √  PYTHON_BINARY         v3.9.5          valid     /usr/local/bin/python3.9                                                    
 √  DJANGO_BINARY         v3.1.10         valid     /usr/local/lib/python3.9/site-packages/django/bin/django-admin.py           
 √  CURL_BINARY           v7.64.0         valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.20.1         valid     /usr/bin/wget                                                               
 √  NODE_BINARY           v15.14.0        valid     /usr/bin/node                                                               
 √  SINGLEFILE_BINARY     v0.3.16         valid     /node/node_modules/single-file/cli/single-file                              
 √  READABILITY_BINARY    v0.0.2          valid     /node/node_modules/readability-extractor/readability-extractor              
 √  MERCURY_BINARY        v1.0.0          valid     /node/node_modules/@postlight/mercury-parser/cli.js                         
 -  GIT_BINARY            -               disabled  /usr/bin/git                                                                
 -  YOUTUBEDL_BINARY      -               disabled  /usr/local/bin/youtube-dl                                                   
 √  CHROME_BINARY         v90.0.4430.93   valid     /usr/bin/chromium                                                           
 √  RIPGREP_BINARY        v0.10.0         valid     /usr/bin/rg                                                                 

[i] Source-code locations:
 √  PACKAGE_DIR           22 files        valid     /app/archivebox                                                             
 √  TEMPLATES_DIR         3 files         valid     /app/archivebox/templates                                                   
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            5 files         valid     /data                                                                       
 √  SOURCES_DIR           136 files       valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           141 files       valid     ./archive                                                                   
 √  CONFIG_FILE           81.0 Bytes      valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             1.1 MB          valid     ./index.sqlite3                                                             

@i-am-pluto commented on GitHub (Oct 24, 2023):

Hey, was this resolved?


@pirate commented on GitHub (Oct 25, 2023):

Yeah should be, try the latest dev build https://github.com/ArchiveBox/ArchiveBox#install-and-run-a-specific-github-branch or v0.7.2

comment back if it's still happening and I'll re-open it
