[GH-ISSUE #1108] Memory leak when fetching specific page #462

Open
opened 2026-02-25 23:34:16 +03:00 by kerem · 1 comment

Originally created by @rnkn on GitHub (May 24, 2025).
Original GitHub issue: https://github.com/go-shiori/shiori/issues/1108

Data

  • Shiori version: v1.7.4
  • Database Engine: SQLite
  • Operating system: Docker
  • CLI/Web interface/Web Extension:

Describe the bug / actual behavior

I can reliably reproduce a memory leak when fetching this page: https://leahroseman.substack.com/p/lawrence-english-interview

Shiori's memory/CPU usage spikes until the server must be restarted.

Expected behavior

Shiori fetches page with consistent memory usage.

To Reproduce

Steps to reproduce the behavior:

  1. Add https://leahroseman.substack.com/p/lawrence-english-interview

Notes

I'm running via Docker on PikaPods.


@sakaru commented on GitHub (Jun 5, 2025):

I thought I'd look into this as a first issue.

First, I can replicate this with a memory limit of 100Mi. Once I increase the limit, I see it use roughly 500Mi and then succeed. I can also replicate the issue with any other post on the same Substack.

However, using `shiori add --no-archival ...` avoids the problem. Naturally, unticking the "Create Archive" checkbox in the web UI also avoids it.

In trying to nail down where this memory usage comes from, I found that `processing.go`'s call to `warc.NewArchive` is what starts the memory growth. Also notably, the boltdb for the archive is roughly 104MB, which seems really quite large.

Inspecting the boltdb shows that warc also downloads the feed XML and the embedded audio file:

```bash
boltdb=/tmp/archive499789963
for bucket in $(bbolt buckets $boltdb); do
  size=$(bbolt get --format bytes $boltdb $bucket content | wc --bytes)
  echo $bucket " " $size
done | sort --numeric-sort --key=2 | tail -5
https-substackcdn.com-image-fetch-w_1200,h_600,c_fill,f_jpg,q_auto-good,fl_progressive-steep,g_auto-https-substack-post-media.s3.amazonaws.com-public-images-cf50f807-049c-4a14-ad2c-5429866730ab_3000x3000.png   123580
https-substackcdn.com-image-fetch-f_auto,q_auto-best,fl_progressive-steep-https-leahroseman.substack.com-api-v1-post_preview-161727456-twitter.jpg-version=4   195555
https-substackcdn.com-image-fetch-w_720,c_limit,f_auto,q_auto-good,fl_progressive-steep-https-substack-post-media.s3.amazonaws.com-public-images-cf50f807-049c-4a14-ad2c-5429866730ab_3000x3000.png   247064
https-leahroseman.substack.com-feed   384941
https-leahroseman.substack.com-api-v1-audio-upload-922eb306-a950-42b8-828c-f35dcede2b28-src   89737965
```

So warc is downloading several large files to be part of the archive.

I also see #353 exists which aims to remove the warc dependency, which should solve this issue.

Finally, it's not a leak in the sense that the memory usage stays high. I saw the memory usage return to normal after the archival process.
