[GH-ISSUE #1433] DeduplicationHandler passes duplicated entries to the next handler #601

Open
opened 2026-03-04 02:16:19 +03:00 by kerem · 13 comments

Originally created by @wskorodecki on GitHub (Feb 26, 2020).
Original GitHub issue: https://github.com/Seldaek/monolog/issues/1433

Monolog version: 1.25.3

I'm using DeduplicationHandler to avoid duplicated deprecation messages, but it still passes all entries to the next handler (RotatingFileHandler) via `$this->handler->handleBatch($this->buffer);`. I think this is because of the following line:

`$passthru = $passthru || !$this->isDuplicate($record);`

The first call to the `isDuplicate()` method returns `false` because the duplication store file does not exist yet, so `$passthru` becomes `true` and can never become `false` again.
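To make the latch concrete, here is a small standalone sketch of the behaviour described above (the record shape and the `$seen` array are simplified stand-ins for Monolog's buffer and deduplication store, not its actual internals):

```php
<?php

// Once the first record is found to be a non-duplicate, $passthru
// latches to true for the rest of the loop.
$buffer = [
    ['message' => 'User Deprecated: foo()'],
    ['message' => 'User Deprecated: foo()'], // duplicate
    ['message' => 'User Deprecated: foo()'], // duplicate
];

$seen = [];
$isDuplicate = function (array $record) use (&$seen): bool {
    return isset($seen[$record['message']]);
};

$passthru = null;
foreach ($buffer as $record) {
    $passthru = $passthru || !$isDuplicate($record);
    if ($passthru) {
        // Plays the role of $this->appendRecord($record).
        $seen[$record['message']] = true;
    }
}

// With an empty store the very first record is not a duplicate, so
// $passthru is true and the *entire* buffer, duplicates included,
// would be handed to handleBatch().
var_dump($passthru); // bool(true)
```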

I'm not sure, but I think the code should be modified a bit, to something like this:

```php
$passthru = null;
$deduplicatedBuffer = [];

foreach ($this->buffer as $record) {
    if ($record['level'] >= $this->deduplicationLevel) {
        $isDuplicate = $this->isDuplicate($record);
        $passthru = $passthru || !$isDuplicate;

        if ($passthru) {
            $this->appendRecord($record);
        }

        // Collect only the records that are not already known duplicates.
        if (!$isDuplicate) {
            $deduplicatedBuffer[] = $record;
        }
    }
}

if (true === $passthru) {
    // At least one new record: forward the deduplicated buffer only.
    $this->handler->handleBatch($deduplicatedBuffer);
} elseif (null === $passthru) {
    // No record reached the deduplication level: forward everything.
    $this->handler->handleBatch($this->buffer);
}
```

My config:

```yaml
deprecations:
  type: fingers_crossed
  # Include messages with level >= "info".
  action_level: info
  channels: ["php"]
  handler: deprecation_filter
deprecation_filter:
  type: filter
  # Exclude messages with level higher than "info".
  max_level: info
  channels: ["php"]
  handler: deduplicated
  nested: true
deduplicated:
  type: deduplication
  deduplication_level: info
  time: 60
  handler: deprecations_rotated
  nested: true
deprecations_rotated:
  type: rotating_file
  path: "%kernel.logs_dir%/%kernel.environment%.deprecations.log"
  max_files: 2
  nested: true
```

@Seldaek commented on GitHub (May 11, 2020):

The deduplication is only happening/effective between requests; within one request, duplicate messages will still all be sent through. So on the first request, yes, the file is created and everything passes through to the nested handler; on subsequent requests it should stop sending any logs unless there is a new deprecation.

@Seldaek commented on GitHub (May 11, 2020):

So IMO not a bug, unless I misunderstand what you are saying. At least it seems to work for me as described.

@M0nik commented on GitHub (May 14, 2020):

I have the same problem. Within one request duplicate messages should not be sent.

@Seldaek commented on GitHub (May 14, 2020):

What's the use case you are trying to achieve exactly?

@M0nik commented on GitHub (May 14, 2020):

I use async PHP (walkor/Workerman). Messages accumulate over time. The flush method is called by a timer (e.g. every 10 seconds). By the time the flush is called, many identical messages may have accumulated.

@Seldaek commented on GitHub (May 14, 2020):

OK, for long-running processes running many jobs, it is recommended to call `$logger->reset()` in between every job. It essentially makes Monolog behave as if a request ended and a new one started: buffers get flushed, etc. It avoids memory leaks and, generally speaking, makes job processing behave closer to a per-request execution model.
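A minimal sketch of that pattern in a worker loop; `fetchNextJob()` and `process()` are hypothetical placeholders, not Workerman or Monolog APIs:

```php
<?php

use Monolog\Handler\DeduplicationHandler;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;

require __DIR__ . '/vendor/autoload.php';

$logger = new Logger('worker');
$logger->pushHandler(new DeduplicationHandler(new StreamHandler('php://stderr')));

// Hypothetical job loop.
while ($job = fetchNextJob()) {
    process($job, $logger);

    // Flush buffers and clear handler state, as if a web request had
    // just ended and a new one were about to begin.
    $logger->reset();
}
```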

@M0nik commented on GitHub (May 14, 2020):

This does not solve the current problem, as within the same iteration, there can still be many identical messages.

@Seldaek commented on GitHub (May 14, 2020):

Yeah, maybe that's still a feature request then, for DeduplicationHandler to have an option to deduplicate even within its own buffer. Feel free to send a PR if you can add that.
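A rough sketch of what such in-buffer deduplication could look like, as a subclass overriding `flush()`; the class name and the level-plus-message key are illustrative assumptions against Monolog 1.x record arrays, not an actual Monolog API:

```php
<?php

use Monolog\Handler\DeduplicationHandler;

// Illustrative sketch only: drop duplicates inside a single buffer
// before delegating to the parent's cross-request deduplication.
class InBufferDeduplicationHandler extends DeduplicationHandler
{
    public function flush()
    {
        $unique = [];
        foreach ($this->buffer as $record) {
            // Key on level + message; a real implementation might also
            // take the channel or parts of the context into account.
            $key = $record['level'] . ':' . $record['message'];
            if (!isset($unique[$key])) {
                $unique[$key] = $record;
            }
        }
        $this->buffer = array_values($unique);

        parent::flush();
    }
}
```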

@puleeno commented on GitHub (Aug 8, 2020):

I have the same problem. The log entry is duplicated in version 2.x.

@Tobion commented on GitHub (Jan 14, 2021):

This is really a problem. Messages should be deduplicated within the handler's own buffer, as there can be thousands of deprecation messages in a worker, or even in a single CLI command / web request. E.g. https://github.com/doctrine/orm/pull/7901 triggered a lot of deprecations for us, which overwhelmed Graylog.
Another problem is that the deduplication handler does not support specifying the buffer_size and $flushOnOverflow. This means the deduplication buffer can grow endlessly and is a memory leak. Other buffered handlers like FingersCrossed support this option.
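For comparison, a minimal sketch of the cap that the plain BufferHandler (which DeduplicationHandler builds on) exposes; the constructor arguments follow Monolog 1.x/2.x, and the limit of 1000 is an arbitrary example:

```php
<?php

use Monolog\Handler\BufferHandler;
use Monolog\Handler\StreamHandler;
use Monolog\Logger;

// Keep at most 1000 records; on overflow, flush the buffer instead of
// letting it grow without bound.
$buffered = new BufferHandler(
    new StreamHandler('php://stderr'),
    1000,          // $bufferLimit
    Logger::DEBUG, // $level
    true,          // $bubble
    true           // $flushOnOverflow
);
```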

@Tobion commented on GitHub (Jan 14, 2021):

Another problem is that the deduplicationStore file can grow indefinitely:

  • If there is a non-duplicate message in the buffer, all messages are stored in the deduplicationStore, even the duplicate ones. It would be better to only store the unique ones.
  • If there are messages older than yesterday, the outdated messages get removed (why yesterday?).
  • So the deduplicationStore grows with the number of messages in a day, not with the number of unique messages within the given `$time`, as I would expect.
@Seldaek commented on GitHub (Jan 14, 2021):

@Tobion this was a quick-fix solution that already served a purpose, but it's definitely not perfect. PRs are definitely welcome, at least to deduplicate within the buffer and allow setting the buffer size, etc. As for the deduplicationStore issues you describe, I'm not sure they're fixable within a reasonable amount of added complexity, but if so, that'd also be good to fix.

@bravik commented on GitHub (Nov 5, 2021):

> What's the use case you are trying to achieve exactly?

I have a GraphQL API. When I query some problematic field in a collection item, it triggers the same error for each collection element but does not stop execution. So I get numerous identical log messages within one request, which I tried to avoid by using the deduplication handler. But looking at the code, I came to the same conclusion as the author of this issue.
