[GH-ISSUE #141] Add link from lifehacker.com destroys index.html due to wrong title detection #96

Closed
opened 2026-03-01 14:40:37 +03:00 by kerem · 3 comments

Originally created by @Strubbl on GitHub (Feb 13, 2019).
Original GitHub issue: https://github.com/ArchiveBox/ArchiveBox/issues/141

Describe the bug
When I archive a page from Lifehacker, e.g. https://lifehacker.com/stop-recycling-amazons-plastic-packaging-1832536576

I only see garbage in the log output. Here it is in shortened form:

[+] [2019-02-13 15:14:54] "storybreak stars<\/title><path d="M5.146 9.01l-.19-3.623 3.057 1.985.693-1.197-3.213-1.67 3.213-1.638-.693-1.197-3.056 1.953L5.147 0H3.76l.158 3.623L.893 1.67.2 2.867l3.214 1.638L.2 6.175l.693 1.197 3.025-1.985L3.76 9.01m21.386 0l-.19-3.623 3.057 1.985.693-1.197-3.2...

I would have expected it to be something like this:

[+] [2019-02-13 15:14:54] "https://lifehacker.com/stop-recycling-amazons-plastic-packaging-1832536576"

So I guess title detection fails for this page.

Steps to reproduce
Steps to reproduce the behavior:

  1. Add the above link to the archive.
  2. Open the index.html page, where this link should appear in the list.
  3. Scroll down to the link and see the garbage.
  4. Look at the archive log and see the output above.

Software versions (please complete the following information):

  • ArchiveBox version: e6d5cd4432
  • Python version: the one in the docker image
  • OS: Docker container on Arch Linux
  • Chrome version: the one in the docker image
kerem closed this issue 2026-03-01 14:40:37 +03:00

@Strubbl commented on GitHub (Feb 13, 2019):

Yeah, it's because of multiple <title> tags in their HTML source :(

@Strubbl commented on GitHub (Feb 13, 2019):

A workaround could be to remove all content between <script> tags before title detection. Alternatively, one could simply discard the title if it contains a left angle bracket and use the default (the URL) instead. What do you think?
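
The two workarounds suggested above could be sketched roughly as follows. This is only an illustration, not ArchiveBox's actual title detection: the function name `extract_title`, the two regular expressions, and the sample HTML (including its page title) are all made up for the example, and regex-based HTML handling is a simplification.

```python
import re

# Hypothetical patterns for illustration; not taken from ArchiveBox.
SCRIPT_RE = re.compile(r"<script\b[^>]*>.*?</script\s*>", re.IGNORECASE | re.DOTALL)
TITLE_RE = re.compile(r"<title\b[^>]*>(.*?)</title\s*>", re.IGNORECASE | re.DOTALL)


def extract_title(html: str, url: str) -> str:
    """Return the page title, or the URL when no clean title is found."""
    # Workaround 1: drop <script> blocks so that <title> strings embedded
    # in JavaScript (e.g. inline SVG markup) are never considered.
    without_scripts = SCRIPT_RE.sub("", html)

    match = TITLE_RE.search(without_scripts)
    title = match.group(1).strip() if match else ""

    # Workaround 2: a "title" containing "<" is probably leftover markup,
    # so discard it and fall back to the default (the URL).
    if not title or "<" in title:
        return url
    return title


# Made-up sample: a script embeds an SVG <title> before the real one.
sample_url = "https://example.com/some-article"
sample_html = (
    r'<html><head>'
    r'<script>var icon = "<svg><title>storybreak stars<\/title><path d=\"M5.146 9.01\"/><\/svg>";</script>'
    r'<title>Example page title</title>'
    r'</head><body></body></html>'
)

print(extract_title(sample_html, sample_url))
```

Either check alone would probably cover this case. Combining them means a stray tag that survives script removal still ends up as the URL rather than as garbage in index.html.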

@pirate commented on GitHub (Feb 19, 2019):

I think I've fixed it, try the latest master (1b36d5b) and let me know if you have any issues.
