[GH-ISSUE #1108] Memory leak when fetching specific page #462

Open
opened 2026-02-25 23:34:16 +03:00 by kerem · 1 comment

Originally created by @rnkn on GitHub (May 24, 2025).
Original GitHub issue: https://github.com/go-shiori/shiori/issues/1108

Data

  • Shiori version: v1.7.4
  • Database Engine: SQLite
  • Operating system: Docker
  • CLI/Web interface/Web Extension:

Describe the bug / actual behavior

I can reliably reproduce a memory leak when fetching this page: https://leahroseman.substack.com/p/lawrence-english-interview

Shiori's memory/CPU usage spikes until the server must be restarted.

Expected behavior

Shiori fetches page with consistent memory usage.

To Reproduce

Steps to reproduce the behavior:

  1. Add https://leahroseman.substack.com/p/lawrence-english-interview

Notes

I'm running via Docker on PikaPods.


@sakaru commented on GitHub (Jun 5, 2025):

I thought I'd look into this as a first issue.

First, I can replicate this with a memory limit of 100Mi. Once I increase the limit, I see it use roughly 500Mi and then succeed. I can also replicate the issue with any other post on the same Substack.

However, using `shiori add --no-archival ...` avoids the problem. Naturally, unticking the "Create Archive" checkbox in the web UI also avoids it.

In trying to nail down where this memory usage comes from, I found that `processing.go`'s call to `warc.NewArchive` is what starts the memory growth. Also notably, the boltdb for the archive is roughly 104MB, which seems really quite large.

Inspecting the boltdb shows that warc also downloads the feed XML and the embedded audio file:

```bash
boltdb=/tmp/archive499789963
for bucket in $(bbolt buckets $boltdb); do
  size=$(bbolt get --format bytes $boltdb $bucket content | wc --bytes)
  echo $bucket " " $size
done | sort --numeric-sort --key=2 | tail -5
https-substackcdn.com-image-fetch-w_1200,h_600,c_fill,f_jpg,q_auto-good,fl_progressive-steep,g_auto-https-substack-post-media.s3.amazonaws.com-public-images-cf50f807-049c-4a14-ad2c-5429866730ab_3000x3000.png   123580
https-substackcdn.com-image-fetch-f_auto,q_auto-best,fl_progressive-steep-https-leahroseman.substack.com-api-v1-post_preview-161727456-twitter.jpg-version=4   195555
https-substackcdn.com-image-fetch-w_720,c_limit,f_auto,q_auto-good,fl_progressive-steep-https-substack-post-media.s3.amazonaws.com-public-images-cf50f807-049c-4a14-ad2c-5429866730ab_3000x3000.png   247064
https-leahroseman.substack.com-feed   384941
https-leahroseman.substack.com-api-v1-audio-upload-922eb306-a950-42b8-828c-f35dcede2b28-src   89737965
```

So warc is downloading several large files to be part of the archive.

I also see #353 exists which aims to remove the warc dependency, which should solve this issue.

Finally, it's not a leak in the sense that the memory usage stays high. I saw the memory usage return to normal after the archival process.
