[PR #1020] [MERGED] Attempted to warn on #984 and #1014 #4318

Closed
opened 2026-03-15 01:38:14 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1020
Author: @turian
Created: 9/11/2022
Status: Merged
Merged: 11/2/2022
Merged by: @pirate

Base: dev ← Head: feature/kludge-984-UTF8-bug


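📝 Commits (1)

2b58cce (https://github.com/ArchiveBox/ArchiveBox/commit/2b58cce43fca64865292ccb967b8800a421e05cd) Attempted to warn on #984 and #1014
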
📊 Changes

1 file changed (+16 additions, -0 deletions)

📝 archivebox/extractors/__init__.py (+16 -0)

📄 Description

Summary

This is a kludgy workaround that implements the suggestion from the #984 discussion: "what we can do for now is just add an exception catcher that skips trying to index those files if they throw encoding errors + display a warning like (> Warning: Skipped adding some files to full-text index as they are not in UTF-8 format)."
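The idea quoted above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code merged in this PR: the function name `index_files_skipping_non_utf8` and the `add_to_index` callback are hypothetical stand-ins and are not ArchiveBox APIs. Only the warning text comes from the description.

```python
import sys
from pathlib import Path
from typing import Callable, Iterable, List


def index_files_skipping_non_utf8(
    paths: Iterable[Path],
    add_to_index: Callable[[Path, str], None],  # hypothetical indexing callback
) -> List[Path]:
    """Index each file's text; skip files that are not valid UTF-8.

    Returns the list of skipped paths so the caller can report them.
    """
    skipped: List[Path] = []
    for path in paths:
        try:
            # Decoding raises UnicodeDecodeError for non-UTF-8 content.
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            skipped.append(path)
            continue
        add_to_index(path, text)

    if skipped:
        # Warn once instead of aborting the whole run.
        print(
            "Warning: Skipped adding some files to full-text index "
            "as they are not in UTF-8 format",
            file=sys.stderr,
        )
    return skipped
```

Catching only `UnicodeDecodeError` keeps unrelated failures (missing files, permission errors) visible instead of silently swallowing them.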

Related issues

https://github.com/ArchiveBox/ArchiveBox/issues/984

https://github.com/ArchiveBox/ArchiveBox/issues/1014

Adapted from https://github.com/ArchiveBox/ArchiveBox/issues/984#issuecomment-1150541627, since this bug is a showstopper for me as well as @jgoerzen.

Changes these areas

  • [X] Bugfixes
  • [ ] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [ ] Snapshot data layout on disk

Notes

Ideally, we would have a config option that enables or disables a hard stop on UTF-8 errors.
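One way such a toggle might look is sketched here. The environment variable name `HARD_STOP_ON_UTF8_ERROR` and both function names are hypothetical; nothing in this PR adds such an option, and this is not an existing ArchiveBox setting.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def hard_stop_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Read the hypothetical HARD_STOP_ON_UTF8_ERROR flag (default: off)."""
    value = env.get("HARD_STOP_ON_UTF8_ERROR", "false")
    return value.strip().lower() in ("1", "true", "yes")


def read_text_for_index(path: Path, hard_stop: bool) -> Optional[str]:
    """Return the file's text, or None if it is not UTF-8 and hard_stop is off."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        if hard_stop:
            raise  # strict mode: propagate the error and stop
        return None  # lenient mode: caller skips this file
```

With the flag off the caller skips the file and carries on; with it on, the original error propagates as it did before the workaround.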

I don't understand ArchiveBox well enough to know whether, if my workaround gets halfway through and a later release (ArchiveBox > 0.6.3) fixes this bug, it will complete the rest of the pipeline.


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
