[GH-ISSUE #682] [FBref] Change how data is cached to reduce the space required #140

Open
opened 2026-03-02 15:56:08 +03:00 by kerem · 2 comments

Originally created by @lorenzodb1 on GitHub (Aug 19, 2024).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/682

Currently, data is cached by storing the HTML page of the FBref website, which is then re-parsed whenever the data is needed. However, this comes at a huge cost in storage space. In my case, I ended up with a cache of over 20GB, which isn't ideal by any means.

To improve this, we could store the `DataFrame` parsed from these pages as a CSV file and just load that when needed. This would reduce the required space roughly tenfold.
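For illustration, here is a minimal sketch of the proposed parse-once-then-cache-CSV approach using only the standard library. The helper names and the sample rows are hypothetical, not soccerdata's actual API:

```python
import csv
import io


def save_parsed_cache(header, rows, buf):
    """Write parsed table rows to a CSV buffer instead of caching raw HTML."""
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)


def load_parsed_cache(buf):
    """Reload previously parsed rows from the CSV cache; no HTML re-parsing."""
    reader = csv.reader(buf)
    header = next(reader)
    return header, list(reader)


# Hypothetical parsed stats from an FBref page (illustrative data only).
header = ["player", "goals", "assists"]
rows = [["A. Example", "12", "7"], ["B. Sample", "9", "11"]]

buf = io.StringIO()
save_parsed_cache(header, rows, buf)
buf.seek(0)
loaded_header, loaded_rows = load_parsed_cache(buf)
assert loaded_header == header and loaded_rows == rows
```

The trade-off is exactly the one debated below: the CSV holds only the columns the current parser extracts, so a later parser fix or extension cannot recover anything that was dropped at scrape time.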


@probberechts commented on GitHub (Aug 20, 2024):

Thanks for your suggestion.

I follow the "Save First, Parse Later" paradigm and I am not really eager to change this. [This blog post](https://betterprogramming.pub/save-first-parse-later-in-defense-of-a-different-approach-to-web-scraping-9edfe65adf04) lists a number of its advantages, FYI.

We could probably save some storage space by discarding irrelevant parts of the HTML page (e.g., the `<head>`, `<script>`, etc.) and by storing the pages in a compressed format. However, this is not a high priority for me. My storage requirements are well below 500MB. I assume you must have scraped a huge part of the FBref website to end up with 20GB. That is a use case that I (ethically) actually prefer not to support at all.
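The space-saving idea in this comment (dropping irrelevant tags such as `<head>` and `<script>`, then compressing) can be sketched with the standard library. The regex-based tag stripping is a crude illustration of my own, not soccerdata's cache code; a real implementation would use an HTML parser:

```python
import gzip
import re


def shrink_html(html: str) -> bytes:
    """Drop <head> and <script> blocks, then gzip-compress what remains."""
    # Crude illustration only: regexes are not a robust way to process HTML.
    stripped = re.sub(r"(?s)<head>.*?</head>|<script>.*?</script>", "", html)
    return gzip.compress(stripped.encode("utf-8"))


# A repetitive page, like a large stats table, compresses very well.
page = "<html><head><title>t</title></head><body>" + "<td>stat</td>" * 1000 + "</body></html>"
compressed = shrink_html(page)
assert len(compressed) < len(page.encode("utf-8"))

restored = gzip.decompress(compressed).decode("utf-8")
assert "<head>" not in restored and "<td>stat</td>" in restored
```

Unlike the CSV approach, this keeps the full page body, so the cache can still be re-parsed after a parser bug fix.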

@lorenzodb1 commented on GitHub (Aug 20, 2024):

Thank you for sharing that blog post!

I think what's written there makes sense while a scraper is under active development, when how the data is parsed is still subject to change. In soccerdata's case, an SFPL approach seems a bit less justified, as the end user of the library works with the data produced by soccerdata and won't be changing how the data is parsed.

I guess, though, that if there's a bug in one of the functions responsible for parsing the data, the SFPL approach would still spare the repeated network calls. So I do understand your perspective after all.

Would you consider maybe adding a flag to specify how one would want the data to be cached and adding the ability to cache it as CSV?
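The flag idea could look something like the sketch below. The parameter and class names are entirely hypothetical and do not correspond to an existing soccerdata interface:

```python
from pathlib import Path


class CacheWriter:
    """Hypothetical cache layer that picks a storage format via a flag."""

    def __init__(self, cache_dir: Path, cache_format: str = "html"):
        # "html" keeps the raw (optionally compressed) page for re-parsing;
        # "csv" keeps only the parsed rows, trading flexibility for space.
        if cache_format not in {"html", "csv"}:
            raise ValueError(f"unknown cache_format: {cache_format!r}")
        self.cache_dir = cache_dir
        self.cache_format = cache_format

    def path_for(self, key: str) -> Path:
        """Map a cache key to a file path whose extension encodes the format."""
        ext = ".html.gz" if self.cache_format == "html" else ".csv"
        return self.cache_dir / f"{key}{ext}"


writer = CacheWriter(Path("/tmp/fbref-cache"), cache_format="csv")
assert writer.path_for("match123").name == "match123.csv"
```

Defaulting to `"html"` would preserve the current SFPL behaviour, while users with tight storage budgets could opt into `"csv"`.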
