[GH-ISSUE #682] [FBref] Change how data is cached to reduce the space required #140

Open
opened 2026-03-02 15:56:08 +03:00 by kerem · 2 comments

Originally created by @lorenzodb1 on GitHub (Aug 19, 2024).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/682

Currently, data is cached by storing the HTML page of the FBref website, which is then re-parsed whenever the data is needed. However, this comes at a huge cost in storage space. In my case, I ended up with a cache of over 20GB, which isn't ideal by any means.

To improve this, we could store the `DataFrame` parsed from these pages as a CSV file and just load that when needed. This would reduce the required space roughly tenfold.
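For illustration, here is a minimal sketch of the proposed parse-once-then-cache-CSV approach using only the standard library. The helper names and the sample rows are hypothetical, not soccerdata's actual API:

```python
import csv
import io


def save_parsed_cache(header, rows, buf):
    """Write parsed table rows to a CSV buffer instead of caching raw HTML."""
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)


def load_parsed_cache(buf):
    """Reload previously parsed rows from the CSV cache; no HTML re-parsing."""
    reader = csv.reader(buf)
    header = next(reader)
    return header, list(reader)


# Hypothetical parsed stats from an FBref page (illustrative data only).
header = ["player", "goals", "assists"]
rows = [["A. Example", "12", "7"], ["B. Sample", "9", "11"]]

buf = io.StringIO()
save_parsed_cache(header, rows, buf)
buf.seek(0)
loaded_header, loaded_rows = load_parsed_cache(buf)
assert loaded_header == header and loaded_rows == rows
```

The trade-off is exactly the one debated below: the CSV holds only the columns the current parser extracts, so a later parser fix or extension cannot recover anything that was dropped at scrape time.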


@probberechts commented on GitHub (Aug 20, 2024):

Thanks for your suggestion.

I follow the "Save First, Parse Later" paradigm and I am not really eager to change this. [This blog post](https://betterprogramming.pub/save-first-parse-later-in-defense-of-a-different-approach-to-web-scraping-9edfe65adf04) lists a number of its advantages, FYI.

We could probably save some storage space by discarding irrelevant parts of the HTML page (e.g., the `<head>`, `<script>`, etc.) and by storing the pages in a compressed format. However, this is not a high priority for me. My storage requirements are well below 500MB. I assume you must have scraped a huge part of the FBref website to end up with 20GB. That is a use case that I (ethically) actually prefer not to support at all.
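The space-saving idea in this comment (dropping irrelevant tags such as `<head>` and `<script>`, then compressing) can be sketched with the standard library. The regex-based tag stripping is a crude illustration of my own, not soccerdata's cache code; a real implementation would use an HTML parser:

```python
import gzip
import re


def shrink_html(html: str) -> bytes:
    """Drop <head> and <script> blocks, then gzip-compress what remains."""
    # Crude illustration only: regexes are not a robust way to process HTML.
    stripped = re.sub(r"(?s)<head>.*?</head>|<script>.*?</script>", "", html)
    return gzip.compress(stripped.encode("utf-8"))


# A repetitive page, like a large stats table, compresses very well.
page = "<html><head><title>t</title></head><body>" + "<td>stat</td>" * 1000 + "</body></html>"
compressed = shrink_html(page)
assert len(compressed) < len(page.encode("utf-8"))

restored = gzip.decompress(compressed).decode("utf-8")
assert "<head>" not in restored and "<td>stat</td>" in restored
```

Unlike the CSV approach, this keeps the full page body, so the cache can still be re-parsed after a parser bug fix.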

@lorenzodb1 commented on GitHub (Aug 20, 2024):

Thank you for sharing that blog post!

I think what's written there makes sense while a scraper is under active development, when how the data is parsed is still subject to change. In soccerdata's case, an SFPL approach seems a bit less justified, as the end user of the library works with the data produced by soccerdata and won't be changing how the data is parsed.

I guess, though, that if there's a bug in one of the functions responsible for parsing the data, the SFPL approach would still spare the repeated network calls. So I do understand your perspective after all.

Would you consider maybe adding a flag to specify how one would want the data to be cached and adding the ability to cache it as CSV?
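The flag idea could look something like the sketch below. The parameter and class names are entirely hypothetical and do not correspond to an existing soccerdata interface:

```python
from pathlib import Path


class CacheWriter:
    """Hypothetical cache layer that picks a storage format via a flag."""

    def __init__(self, cache_dir: Path, cache_format: str = "html"):
        # "html" keeps the raw (optionally compressed) page for re-parsing;
        # "csv" keeps only the parsed rows, trading flexibility for space.
        if cache_format not in {"html", "csv"}:
            raise ValueError(f"unknown cache_format: {cache_format!r}")
        self.cache_dir = cache_dir
        self.cache_format = cache_format

    def path_for(self, key: str) -> Path:
        """Map a cache key to a file path whose extension encodes the format."""
        ext = ".html.gz" if self.cache_format == "html" else ".csv"
        return self.cache_dir / f"{key}{ext}"


writer = CacheWriter(Path("/tmp/fbref-cache"), cache_format="csv")
assert writer.path_for("match123").name == "match123.csv"
```

Defaulting to `"html"` would preserve the current SFPL behaviour, while users with tight storage budgets could opt into `"csv"`.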
