[GH-ISSUE #46] UTF-8 error while uploading file #34

Closed
opened 2026-02-25 21:31:03 +03:00 by kerem · 5 comments

Originally created by @jhf2442 on GitHub (Jul 22, 2020).
Original GitHub issue: https://github.com/ciur/papermerge/issues/46

The Docker image was downloaded and started 5 minutes ago, so it should be the latest version.

The error occurs while uploading a 180 kB, 20-page PDF document (which opens without problems in Okular):

papermerge_service | upload for f=2020-07-17_AGB_A02092019.pdf user=admin
papermerge_service | Internal Server Error: /upload/
papermerge_service | Traceback (most recent call last):
papermerge_service |   File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/exception.py", line 34, in inner
papermerge_service |     response = get_response(request)
papermerge_service |   File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 115, in _get_response
papermerge_service |     response = self.process_exception_by_middleware(e, request)
papermerge_service |   File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 113, in _get_response
papermerge_service |     response = wrapped_callback(request, *callback_args, **callback_kwargs)
papermerge_service |   File "/usr/local/lib/python3.7/dist-packages/django/contrib/auth/decorators.py", line 21, in _wrapped_view
papermerge_service |     return view_func(request, *args, **kwargs)
papermerge_service |   File "/usr/local/lib/python3.7/dist-packages/django/views/generic/base.py", line 71, in view
papermerge_service |     return self.dispatch(request, *args, **kwargs)
papermerge_service |   File "/usr/local/lib/python3.7/dist-packages/django/views/generic/base.py", line 97, in dispatch
papermerge_service |     return handler(request, *args, **kwargs)
papermerge_service |   File "/opt/papermerge/papermerge/core/views/documents.py", line 373, in post
papermerge_service |     page_count = get_pagecount(f.temporary_file_path())
papermerge_service |   File "/usr/local/lib/python3.7/dist-packages/pmworker/pdfinfo.py", line 63, in get_pagecount
papermerge_service |     lines = compl.stdout.decode('utf-8').split('\n')
papermerge_service | UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfe in position 101: invalid start byte
kerem 2026-02-25 21:31:03 +03:00
  • closed this issue
  • added the bug label

@ciur commented on GitHub (Jul 23, 2020):

Thank you for your feedback.

The pdfinfo utility produced unexpected output :(
pdfinfo (part of poppler) is used internally to determine the number of pages in the document.

Could you please run pdfinfo on the PDF document 2020-07-17_AGB_A02092019.pdf and paste the output here?
Example:

eugen@dell-xps:Scans$ pdfinfo brother_003962.pdf
Creator:        Brother Scanner System
Producer:       Brother Scanner System Image Conversion
CreationDate:   Wed Jun 24 12:22:17 2020 CEST
ModDate:        Wed Jun 24 12:22:17 2020 CEST
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          3
Encrypted:      no
Page size:      610.56 x 1074.24 pts
Page rot:       0
File size:      876133 bytes
Optimized:      no
PDF version:    1.4

@jhf2442 commented on GitHub (Jul 24, 2020):

Here we go :

pdfinfo 2020-07-17_AGB_A02092019.pdf 
Title:          Versicherungsbedingungen
Producer:       M/TEXT CS version 6.7.0.476
CreationDate:   ��
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          20
Encrypted:      yes (print:yes copy:yes change:no addNotes:no algorithm:RC4)
Page size:      595.276 x 841.89 pts (A4)
Page rot:       0
File size:      181193 bytes
Optimized:      no
PDF version:    1.4

-> it's the creation date field that contains some strange data ! (yes it's two diamonds)
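
As a cross-check, the byte offset in the traceback (position 101) can be compared with this output. The sketch below rebuilds the first lines of the pdfinfo output as raw bytes and shows that offset 101 is exactly where the CreationDate value starts. It rests on assumptions: the field names are padded to 16 columns as displayed above, and the second "diamond" is taken to be 0xFF (which together with 0xFE would form the UTF-16 big-endian byte order mark that PDF text strings may begin with). Only 0xFE is confirmed by the traceback; the helper `field` is hypothetical.

```python
def field(name: bytes, value: bytes) -> bytes:
    # Assumption: pdfinfo pads the field name to 16 columns, as in the output above
    return name.ljust(16) + value + b"\n"

# Everything that precedes the CreationDate value in the output
prefix = (
    field(b"Title:", b"Versicherungsbedingungen")
    + field(b"Producer:", b"M/TEXT CS version 6.7.0.476")
    + b"CreationDate:".ljust(16)
)

# Assumption: the two diamonds are 0xFE 0xFF; only 0xFE is known for certain
raw = prefix + b"\xfe\xff\n" + field(b"Pages:", b"20")

print(len(prefix))  # offset of the first byte of the CreationDate value

error = None
try:
    raw.decode("utf-8")  # same strict decode as in pdfinfo.py line 63
except UnicodeDecodeError as exc:
    error = exc
    print(exc.start, hex(raw[exc.start]), exc.reason)
```

If the assumptions hold, the strict decode fails on the first byte of the CreationDate value, which matches the position reported in the upload traceback.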


@jhf2442 commented on GitHub (Jul 24, 2020):

here the header of the PDF

%PDF-1.4
%\252\253\254\255
4 0 obj
<<
/Title (^S\325ޥ\\\365\202\276\240^]K@\b\324諪\213"\237^GO\276\303!\262%\335\362Ӓz\\\3443iC\357^RR\256\265'\251x\344K\2260\232)
/Producer (^S\325\336\276\\\277\202\230\240+Kq\b\343\350㪭"\276^G^Z\276\333!\260%\334\362\302\222v\\\3573nC\241^R^C\256\356'\360x\255K\3030ڹ\2474نK)
/CreationDate (^S\325\336\267\\\252\202\376\240^K^[\b\207\350\363\252\335"\337^G^O\276\234!\343%\234\362\205\222.\\\2653+C\261^R^D\256\347'\367x\263K\324)
>>
endobj

@ciur commented on GitHub (Jul 24, 2020):

> it's the creation date field that contains some strange data ! (yes it's two diamonds)

I think those two diamonds cause the issue (in German: "sind schuldig", i.e. they are to blame), as they might be encoded in something other than UTF-8 (just guessing).
Does the document contain sensitive information?
If it is just a generic AGB (general terms and conditions, i.e. no sensitive data), could you send me a copy of it (my email is at the very bottom of the README page)? Otherwise I have no other means of troubleshooting the issue.


@ciur commented on GitHub (Jul 25, 2020):

I received your document and fixed the encoding issue: https://github.com/papermerge/mglib/commit/8a8835d24368e7559e702729e35593b9c0812b63

The fix will be available in 1.4.0 (in about 2 weeks).

Thank you again for providing useful feedback!
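
The contents of the linked commit are not reproduced in this thread. As a hedged illustration only, one way to make a page count robust against undecodable bytes is to decode the pdfinfo output with a lenient error handler and then look for the `Pages:` line. The names `parse_pagecount` and `get_pagecount_tolerant` below are hypothetical and are not claimed to match mglib's actual code.

```python
import re
import subprocess


def parse_pagecount(raw: bytes) -> int:
    """Extract the page count from raw pdfinfo output without strict decoding."""
    # errors="replace" maps undecodable bytes to U+FFFD instead of raising
    text = raw.decode("utf-8", errors="replace")
    for line in text.split("\n"):
        match = re.match(r"Pages:\s+(\d+)", line)
        if match:
            return int(match.group(1))
    raise ValueError("no 'Pages:' line found in pdfinfo output")


def get_pagecount_tolerant(path: str) -> int:
    # Hypothetical wrapper: runs the pdfinfo CLI and parses its stdout as bytes
    compl = subprocess.run(
        ["pdfinfo", path],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    return parse_pagecount(compl.stdout)


# Sample modelled on the output in this thread; 0xFF is an assumed second byte
sample = b"CreationDate:   \xfe\xff\nPages:          20\nEncrypted:      yes\n"
print(parse_pagecount(sample))
```

Keeping the parsing step separate from the subprocess call makes it possible to test the decoding behaviour on captured bytes without having pdfinfo installed.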