[PR #556] [MERGED] Improve memory usage and execution time of listing objects with file system backend #712

Closed
opened 2026-03-03 12:31:16 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/fsouza/fake-gcs-server/pull/556
Author: @ironsmile
Created: 8/13/2021
Status: Merged
Merged: 8/14/2021
Merged by: @fsouza

Base: main ← Head: do-not-store-all-files-in-memory-on-object-list


📝 Commits (2)

  • 3eb4422 Do not store all objects in memory on list commands
  • f2340d3 List objects: filter files by prefix before reading them

📊 Changes

15 files changed (+519 additions, -304 deletions)


📝 fakestorage/bucket_test.go (+14 -14)
📝 fakestorage/example_test.go (+10 -6)
📝 fakestorage/object.go (+98 -57)
📝 fakestorage/object_test.go (+185 -100)
📝 fakestorage/response.go (+8 -8)
📝 fakestorage/server_test.go (+10 -10)
📝 fakestorage/upload.go (+50 -40)
📝 fakestorage/upload_test.go (+7 -5)
📝 internal/backend/backend_test.go (+26 -5)
📝 internal/backend/fs.go (+13 -5)
📝 internal/backend/memory.go (+30 -10)
📝 internal/backend/object.go (+11 -5)
📝 internal/backend/storage.go (+1 -1)
📝 main.go (+8 -6)
📝 main_test.go (+48 -32)

📄 Description

I am using the GCS Fake Server to develop locally and it is mostly great, but I've noticed it is completely unable to list the objects in my file system bucket, even when I give it a prefix which ensures only one object will match. It consumes all of the machine's memory and never finishes, presumably because it spends all of its time swapping. For reference, my bucket contains 53017 files with an overall size of 20.3G. Sadly, the nature of my work is such that this is a relatively small data set.

So I went in and started poking around the code. It quickly became evident that two things are happening:

  • All the files of the bucket are loaded into memory, all at once, for every object list command.
  • When filters are used (such as "prefix"), files which do not match the filter are still parsed and loaded into the process memory.
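The fix for the second issue boils down to comparing the prefix against the file name before the file is ever opened. Here is a minimal sketch of that idea in Go; the function and file names are hypothetical illustrations, not fake-gcs-server's actual API:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// listObjectNames walks bucketDir and collects object names matching prefix.
// The prefix check is a cheap string comparison on the file name, so the
// (potentially huge) contents of non-matching objects are never read.
func listObjectNames(bucketDir, prefix string) ([]string, error) {
	var names []string
	err := filepath.Walk(bucketDir, func(path string, info os.FileInfo, err error) error {
		if err != nil || info.IsDir() {
			return err
		}
		name, relErr := filepath.Rel(bucketDir, path)
		if relErr != nil {
			return relErr
		}
		if strings.HasPrefix(name, prefix) {
			names = append(names, name)
		}
		return nil
	})
	return names, err
}

// demo builds a throwaway bucket directory with two objects and lists only
// those whose name starts with "logs-".
func demo() []string {
	dir, err := os.MkdirTemp("", "bucket")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	os.WriteFile(filepath.Join(dir, "logs-2021.json"), []byte("{}"), 0o644)
	os.WriteFile(filepath.Join(dir, "img-1.json"), []byte("{}"), 0o644)
	names, err := listObjectNames(dir, "logs-")
	if err != nil {
		panic(err)
	}
	return names
}

func main() {
	fmt.Println(demo())
}
```

With 53017 files and a selective prefix, this skips the expensive read-and-parse step for all but a handful of objects, which is where the memory savings come from.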

This PR fixes those two issues in its two commits. Previously the list object command was taking all of my machine's 32GB of RAM and was not finishing even after I had waited on it for half an hour. Now such list commands use almost no memory (in the range of a few KB) and finish instantaneously.

While the above is great, I suspect there are many more places where the emulator is now significantly faster; I just haven't measured them. Off the top of my head, deleting a bucket will now require no extra memory, where before it had the same problem as listing objects.

Further Improvements

It would be great if it were possible to read only the metadata for blobs stored on the file system. Unfortunately, with the current JSON encoding I don't see how that would be possible: one has to load the entire file contents in order for the JSON parser to do its thing. This is unfortunate, considering that in many situations we want only the blob metadata.

I think the only way to achieve this cleanly would be to drop the JSON format altogether and find another way of storing the metadata. Possible approaches are file headers, similar to the nginx file cache, or separate ".attrs" files, like what gocloud.dev/blob does.
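The ".attrs" sidecar idea can be sketched as follows: write the payload to one file and a small JSON metadata record next to it, so a metadata read never touches the content file. This is an illustrative sketch only; the struct fields and file layout are assumptions, not fake-gcs-server's or gocloud.dev's actual schema:

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// ObjectAttrs is a hypothetical metadata record for one stored object.
type ObjectAttrs struct {
	Name        string `json:"name"`
	ContentType string `json:"contentType"`
	Size        int64  `json:"size"`
}

// writeObject stores the payload and a tiny ".attrs" sidecar beside it.
func writeObject(dir, name, contentType string, data []byte) error {
	if err := os.WriteFile(filepath.Join(dir, name), data, 0o644); err != nil {
		return err
	}
	attrs := ObjectAttrs{Name: name, ContentType: contentType, Size: int64(len(data))}
	raw, err := json.Marshal(attrs)
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(dir, name+".attrs"), raw, 0o644)
}

// readAttrs loads only the sidecar, never the payload file.
func readAttrs(dir, name string) (ObjectAttrs, error) {
	var attrs ObjectAttrs
	raw, err := os.ReadFile(filepath.Join(dir, name+".attrs"))
	if err != nil {
		return attrs, err
	}
	err = json.Unmarshal(raw, &attrs)
	return attrs, err
}

// demo writes a 1 MiB object, then reads back just its metadata.
func demo() ObjectAttrs {
	dir, err := os.MkdirTemp("", "bucket")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	if err := writeObject(dir, "report.bin", "application/octet-stream", make([]byte, 1<<20)); err != nil {
		panic(err)
	}
	attrs, err := readAttrs(dir, "report.bin")
	if err != nil {
		panic(err)
	}
	return attrs
}

func main() {
	fmt.Printf("%+v\n", demo())
}
```

The cost of a metadata read then scales with the size of the sidecar (tens of bytes) rather than the size of the blob, which is exactly the property the JSON-wrapped layout lacks.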


🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
