[PR #378] [MERGED] fix: Use w3lib to improve the encoding extraction #2646

Closed
opened 2026-03-01 18:00:16 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/ArchiveBox/ArchiveBox/pull/378
Author: @cdvv7788
Created: 7/22/2020
Status: Merged
Merged: 7/22/2020
Merged by: @pirate

Base: django ← Head: hotfix/#257


📝 Commits (2)

  • 949f78a fix: Use w3lib to improve the encoding extraction
  • aa45f9c fix version tag

📊 Changes

5 files changed (+787 additions, -11 deletions)

View changed files

📝 archivebox/util.py (+8 -9)
📝 setup.py (+1 -0)
📝 tests/mock_server/server.py (+3 -1)
➕ tests/mock_server/templates/shift_jis.html (+769 -0)
📝 tests/test_util.py (+6 -1)

📄 Description

Summary

Detecting the correct encoding is a problem not only for RSS feeds (which already had a fix) but for pages in general. In this PR I generalize that fix, so the encoding is now looked up in this order:

  • Headers
  • Body
  • Guess it (requests default behavior)
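
The lookup order above can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from this PR (which delegates the work to w3lib): the helper names `encoding_from_header`, `encoding_from_body`, and `decode_page` are hypothetical, and the fixed `fallback` codec stands in for the guessing step that requests performs by default.

```python
import re
from typing import Optional

# Hypothetical helpers for illustration only; the PR itself relies on w3lib.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)
_META_CHARSET = re.compile(rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE)


def encoding_from_header(content_type: str) -> Optional[str]:
    """Step 1: charset declared in the Content-Type header, if any."""
    match = _HEADER_CHARSET.search(content_type or "")
    return match.group(1).lower() if match else None


def encoding_from_body(body: bytes) -> Optional[str]:
    """Step 2: charset declared in a <meta> tag near the top of the body."""
    match = _META_CHARSET.search(body[:4096])
    return match.group(1).decode("ascii").lower() if match else None


def decode_page(body: bytes, content_type: str = "", fallback: str = "utf-8") -> str:
    """Try the header, then the body, then fall back (assumed stand-in for guessing)."""
    encoding = encoding_from_header(content_type) or encoding_from_body(body) or fallback
    try:
        return body.decode(encoding, errors="replace")
    except LookupError:
        # The page declared a codec name Python does not know.
        return body.decode(fallback, errors="replace")
```

For example, a Shift_JIS page served without a charset in its headers would still decode correctly, because the `<meta>` declaration in the body is picked up in the second step.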

image

All of the problematic links now get the correct title.

Related issues: #257

Changes these areas

  • [X] Bugfixes
  • [ ] Feature behavior
  • [ ] Command line interface
  • [ ] Configuration options
  • [ ] Internal architecture
  • [ ] Archived data layout on disk

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
Reference
starred/ArchiveBox#2646