[PR #1020] Attempted to warn on #984 and #1014 #2815

Closed
opened 2026-03-01 18:00:50 +03:00 by kerem · 0 comments
Owner

Original Pull Request: https://github.com/ArchiveBox/ArchiveBox/pull/1020

State: closed
Merged: Yes


Summary

This is a kludgy workaround implementing the following suggestion: "what we can do for now is just add an exception catcher that skips trying to index those files if they throw encoding errors + display a warning like (> Warning: Skipped adding some files to full-text index as they are not in UTF-8 format)."
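The idea in that quote can be sketched roughly as follows. This is only an illustration of the "catch the encoding error, skip the file, warn once" pattern, not the code merged in this PR; the function name `read_texts_skipping_bad_encoding` and its return shape are hypothetical and do not correspond to any ArchiveBox API.

```python
import sys
from pathlib import Path


def read_texts_skipping_bad_encoding(paths):
    """Read each file as UTF-8; skip files that fail to decode.

    Hypothetical helper for illustration only, not an ArchiveBox function.
    Returns (texts, skipped): the decoded contents and the skipped paths.
    """
    texts = []
    skipped = []
    for path in paths:
        try:
            texts.append(Path(path).read_text(encoding="utf-8"))
        except UnicodeDecodeError:
            # The file is not valid UTF-8: leave it out instead of aborting.
            skipped.append(Path(path))
    if skipped:
        # One warning for the whole batch, using the wording from the quote.
        print(
            "> Warning: Skipped adding some files to full-text index "
            "as they are not in UTF-8 format",
            file=sys.stderr,
        )
    return texts, skipped
```

Returning the skipped paths alongside the decoded texts leaves room for a caller to retry those files later, which is relevant to the open question in the notes about resuming after a proper fix.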

Related issues

https://github.com/ArchiveBox/ArchiveBox/issues/984

https://github.com/ArchiveBox/ArchiveBox/issues/1014

Adapted from https://github.com/ArchiveBox/ArchiveBox/issues/984#issuecomment-1150541627, since this bug is a showstopper for me as well as for @jgoerzen.

Changes these areas

  • [X] Bugfixes
  • [ ] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [ ] Snapshot data layout on disk

Notes

Ideally, we would have a configuration option that enables or disables a hard stop on UTF-8 errors.
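A minimal sketch of what such a toggle might look like, under stated assumptions: the environment variable `FULLTEXT_STRICT_UTF8` and both function names are invented here for illustration and are not existing ArchiveBox settings or functions.

```python
import os
import sys
from pathlib import Path


def utf8_errors_are_fatal(env=None):
    """Return True if the (hypothetical) strict flag is switched on.

    FULLTEXT_STRICT_UTF8 is an assumed name, not a real ArchiveBox option.
    """
    env = os.environ if env is None else env
    value = env.get("FULLTEXT_STRICT_UTF8", "false")
    return value.strip().lower() in ("1", "true", "yes")


def read_for_index(path, strict):
    """Read a file as UTF-8 for indexing.

    With strict=True a decode failure propagates (hard stop);
    with strict=False the file is skipped with a warning and None is returned.
    """
    try:
        return Path(path).read_text(encoding="utf-8")
    except UnicodeDecodeError:
        if strict:
            raise
        print(f"> Warning: skipped {path}: not in UTF-8 format", file=sys.stderr)
        return None
```

Defaulting the flag to off matches the behavior of this workaround (warn and continue), while turning it on would restore the hard stop.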

I don't understand ArchiveBox well enough to know whether, if my workaround gets halfway through and a later release (ArchiveBox > 0.6.3) then fixes this bug, it will complete the rest of the pipeline.
