[GH-ISSUE #545] No error on non UTF8 input in NormalizerFormatter #190

Closed
opened 2026-03-04 02:13:01 +03:00 by kerem · 5 comments
Owner

Originally created by @nicolashohm on GitHub (Apr 20, 2015).
Original GitHub issue: https://github.com/Seldaek/monolog/issues/545

I'm using the `LogstashFormatter`. When I log non-UTF-8 data, `NormalizerFormatter::toJson` returns false, which ultimately results in just a `\n` in the log and no error message at all. Is that expected? Would it not be better to throw an error in `toJson()`?
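The failure mode described here comes straight from PHP's JSON encoder: `json_encode()` refuses input containing malformed UTF-8 and returns `false`. A minimal reproduction:

```php
<?php
// A log record whose message contains a byte sequence that is not valid UTF-8:
$record = ['message' => "caf\xE9"]; // \xE9 is é in ISO-8859-1, malformed as UTF-8

// json_encode() returns false instead of a string...
var_dump(json_encode($record)); // bool(false)

// ...and if that false is written out unchecked, only the trailing
// newline appended by the formatter ever reaches the log file.
echo json_last_error_msg(), "\n"; // e.g. "Malformed UTF-8 characters, possibly incorrectly encoded"
```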

kerem closed this issue 2026-03-04 02:13:01 +03:00

@bd808 commented on GitHub (Nov 10, 2015):

This decision is causing fatal runtime errors for Wikipedia: https://phabricator.wikimedia.org/T118057

The choice to raise a `RuntimeException` for a logging failure seems a bit short-sighted. @Seldaek, would you be open to a patch making this behavior configurable (preferably with the default being no errors)?


@Seldaek commented on GitHub (Nov 10, 2015):

Reopening while we discuss. I'm not sure I see it the same way you do, though. I see that it sucks if user input randomly blows up the logging, but before, you would just lose those logs silently, which isn't ideal either.

As a quick fix for now (and I'd argue it might be a good thing to do in any case in a context such as Wikipedia), you can wrap all your handlers in a `WhatFailureGroupHandler`. It will ignore any exception thrown by child handlers, which should get you back to where you were before (dropping log records, but not failing hard).

There are many handlers that can fail while logging for various reasons (backend down and such), and therefore I think being strict by default, rather than hiding errors, is a good thing.
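The swallow-child-failures behavior of `WhatFailureGroupHandler` can be illustrated with a minimal sketch (hypothetical stand-in classes below, not the real Monolog API):

```php
<?php
// A child handler whose backend is down: every call throws.
class FailingHandler
{
    public function handle(array $record)
    {
        throw new RuntimeException('backend down');
    }
}

// Minimal sketch of the WhatFailureGroupHandler idea: forward each
// record to every wrapped handler and swallow anything they throw.
class SwallowingGroup
{
    private $handlers;

    public function __construct(array $handlers)
    {
        $this->handlers = $handlers;
    }

    public function handle(array $record)
    {
        foreach ($this->handlers as $handler) {
            try {
                $handler->handle($record);
            } catch (Throwable $e) {
                // Drop the failure silently; the record is lost but the
                // application keeps running.
            }
        }
        return true;
    }
}

// The failing backend no longer takes the request down:
$group = new SwallowingGroup([new FailingHandler()]);
$group->handle(['message' => 'user-supplied data']); // no exception escapes
```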


@bd808 commented on GitHub (Nov 10, 2015):

> There are many handlers that can fail while logging for various reasons

We've gone to pretty great lengths to keep logging failures from causing application failures in our configuration, but I understand that others may have different goals. I like the idea of not silently losing events, but again, I think the exception route is a bit heavy-handed.

It seems that you at least partly agree with this, as can be seen in https://github.com/Seldaek/monolog/commit/63e3bfdf7eba6492eced7ac651c3772b7f1d2689, where you introduced the `$ignoreErrors` flag to `NormalizerFormatter::toJson`.

For my use case the frequency of non-UTF-8 data in the log record is pretty low. We mostly see it creep in from user-agent strings and referrer URLs that are sometimes embedded in the log record. My first thought was to create a filter that did UTF-8 normalization of every record, but I realized that detecting the encoding of strings is error-prone and unnecessary most of the time. I've got a [mostly working subclass](https://gerrit.wikimedia.org/r/#/c/251679/5/includes/debug/logger/monolog/LogstashFormatter.php,unified) of `LogstashFormatter` put together that tries to recover the event by scrubbing the encoding of its string elements. It isn't passing our local PHP 5.3 tests yet, which I need to spend more time looking at.

If I can get this error-recovery-by-sanitizing-encoding approach working at least semi-reliably, would you be interested in having it upstreamed? It would probably be nicer to put it into `NormalizerFormatter::toJson` than to have it only benefit `LogstashFormatter`.
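The recovery idea described here — attempt a normal encode first, and only scrub string encodings when it actually fails — could be sketched like this (`encodeWithScrub` is a hypothetical helper for illustration, not actual Monolog code, and it assumes invalid bytes are ISO-8859-1):

```php
<?php
// Hypothetical sketch of encoding recovery inside a toJson()-style method:
// try json_encode() as-is, and only if it fails on malformed UTF-8, walk
// the record and re-encode the offending strings before retrying.
function encodeWithScrub(array $record)
{
    $json = json_encode($record);
    if ($json === false && json_last_error() === JSON_ERROR_UTF8) {
        array_walk_recursive($record, function (&$value) {
            if (is_string($value) && !preg_match('//u', $value)) {
                // Assume the invalid bytes are ISO-8859-1 and re-encode
                // them into valid UTF-8 rather than dropping the record.
                $value = iconv('ISO-8859-1', 'UTF-8', $value);
            }
        });
        $json = json_encode($record);
    }
    return $json;
}

// A record with a malformed user-agent string now encodes instead of failing:
echo encodeWithScrub(['ua' => "Mozilla \xE9"]), "\n";
```

Records that are already valid UTF-8 pay only the cost of the first `json_encode()` call, which keeps the common path fast.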


@Seldaek commented on GitHub (Nov 11, 2015):

Yeah, I think that would be nice. Avoiding failures is of course better than failing hard, which is better than failing silently, IMO.

Would you have a chance to store a few of those bad inputs (maybe using serialize + base64_encode, so we have strings that aren't charset-dependent and can be shared around)? It would be good to have them for the test suite and for trying to find a workaround. I'm thinking an iconv from ISO-8859-15 to UTF-8//IGNORE might help get rid of those invalid sequences and preserve the data in most cases.
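The iconv suggestion, sketched out: treat the raw bytes as ISO-8859-15 and convert to UTF-8, with `//IGNORE` on the target charset dropping anything unrepresentable (for this source charset every byte maps, so nothing is actually lost):

```php
<?php
// A string with a raw ISO-8859-15 é byte, which PCRE rejects as UTF-8:
$raw = "caf\xE9 au lait";
var_dump(preg_match('//u', $raw)); // bool(false): not valid UTF-8

// Re-encode from ISO-8859-15; //IGNORE silently drops any bytes that
// cannot be represented in the target charset instead of erroring out.
$clean = iconv('ISO-8859-15', 'UTF-8//IGNORE', $raw);
var_dump(preg_match('//u', $clean)); // int(1): valid UTF-8 now
echo $clean, "\n";                    // "café au lait"
```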


@bd808 commented on GitHub (Nov 11, 2015):

This may not be the sexiest conversion function, but it seems to be fairly fast and capable of making any random binary input string I throw at it safe for `json_encode()`:

```php
/**
 * Scrub an input string to ensure that it only contains valid UTF-8
 * characters.
 *
 * @param string $str
 * @param bool $iso885915 Convert as ISO-8859-15 rather than ISO-8859-1
 * @return string
 */
function scrubUTF8($str, $iso885915 = false) {
    // Only touch strings that PCRE rejects as UTF-8.
    if (!preg_match('//u', $str)) {
        // Re-encode runs of high bytes as though they were ISO-8859-1.
        $str = preg_replace_callback(
            '/[\x80-\xFF]+/',
            function ($m) { return utf8_encode($m[0]); },
            $str
        );
        if ($iso885915) {
            // Fix up the eight code points where ISO-8859-15 differs
            // from ISO-8859-1.
            $str = str_replace(
                array('¤', '¦', '¨', '´', '¸', '¼', '½', '¾'),
                array('€', 'Š', 'š', 'Ž', 'ž', 'Œ', 'œ', 'Ÿ'),
                $str
            );
        }
    }
    return $str;
}
```

It also doesn't require any special support in the PHP install other than PCRE UTF-8 support, which I _think_ is fairly common these days.

There are also quite a few UTF-8-related libraries on Packagist that could be used instead of something simple like this, including [Wikimedia's own utfnormal library](https://github.com/wikimedia/utfnormal).

I'll see if I can come up with a good way to gather some real error data and make it clean enough to share as a test set. For my initial testing I've just been using `openssl_random_pseudo_bytes` to generate random binary strings.
