mirror of
https://github.com/ArchiveBox/ArchiveBox.git
synced 2026-04-25 17:16:00 +03:00
Reference
starred/ArchiveBox#4318
📋 Pull Request Information
Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/1020
Author: @turian
Created: 9/11/2022
Status: ✅ Merged
Merged: 11/2/2022
Merged by: @pirate
Base: dev ← Head: feature/kludge-984-UTF8-bug
📝 Commits (1)
2b58cce Attempted to warn on #984 and #1014
📊 Changes
1 file changed (+16 additions, -0 deletions)
📝 archivebox/extractors/__init__.py (+16 -0)
📄 Description
Summary
This is a kludgy workaround that implements the suggestion from the linked issue comment: "what we can do for now is just add an exception catcher that skips trying to index those files if they throw encoding errors + display a warning like (> Warning: Skipped adding some files to full-text index as they are not in UTF-8 format)."
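The idea in the quoted suggestion can be sketched roughly as follows. This is only an illustration of the "catch the encoding error, skip the file, warn once" pattern, not the code in this PR: the function name `read_texts_for_index` and its signature are hypothetical and do not correspond to anything in `archivebox/extractors/__init__.py`.

```python
import sys
from pathlib import Path
from typing import Iterable, List


def read_texts_for_index(paths: Iterable[Path]) -> List[str]:
    """Hypothetical helper: decode each file as UTF-8, skipping files that fail."""
    texts: List[str] = []
    skipped = 0
    for path in paths:
        try:
            texts.append(path.read_text(encoding="utf-8"))
        except UnicodeDecodeError:
            # Not valid UTF-8: leave it out of the index instead of aborting.
            skipped += 1
    if skipped:
        # Warn once, using the wording proposed in the issue comment.
        print(
            "> Warning: Skipped adding some files to full-text index "
            "as they are not in UTF-8 format",
            file=sys.stderr,
        )
    return texts
```

With this shape, one badly encoded file costs only its own entry in the full-text index; the remaining files are still read and returned.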
Related issues
https://github.com/ArchiveBox/ArchiveBox/issues/984
https://github.com/ArchiveBox/ArchiveBox/issues/1014
Adapted from https://github.com/ArchiveBox/ArchiveBox/issues/984#issuecomment-1150541627 since this bug is a showstopper for me as well as @jgoerzen.
Changes these areas
Notes
Ideally, we would have a config option that enables or disables a hard stop on a UTF-8 error.
I don't understand ArchiveBox well enough to know whether, if my workaround gets halfway through and a later release (archivebox > 0.6.3) then fixes this bug, the rest of the pipeline will be completed.
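For the config toggle mentioned in the notes, one possible shape is sketched below. Everything here is an assumption made for illustration: the setting name `FULLTEXT_STRICT_ENCODING` and both helper functions are hypothetical, and ArchiveBox has no such option as far as this PR is concerned.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def strict_encoding_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read the hypothetical FULLTEXT_STRICT_ENCODING flag (default: off)."""
    env = os.environ if env is None else env
    value = env.get("FULLTEXT_STRICT_ENCODING", "False")
    return value.strip().lower() in ("1", "true", "yes")


def read_for_index(path: Path, strict: bool) -> Optional[str]:
    """Return the file's text, or None if it is not UTF-8 and strict is off."""
    try:
        return path.read_text(encoding="utf-8")
    except UnicodeDecodeError:
        if strict:
            # Hard stop: let the error propagate to the caller.
            raise
        # Lenient mode: signal "skip this file" to the caller.
        return None
```

Defaulting the flag to off matches the behaviour of this workaround (skip and continue), while setting it would restore the hard stop for users who prefer to be told immediately about non-UTF-8 files.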
🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.