mirror of
https://github.com/probberechts/soccerdata.git
synced 2026-04-25 10:05:53 +03:00
[GH-ISSUE #682] [FBref] Change how data is cached to reduce the space required #140
Originally created by @lorenzodb1 on GitHub (Aug 19, 2024).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/682
Currently, data is cached by storing the HTML page of the FBref website, which is then re-parsed to extract the data it contains. However, this comes at a huge cost in storage: in my case, I ended up with a cache of over 20GB, which isn't ideal by any means.
To improve this, we could store the `DataFrame` with the data obtained from these pages as a CSV file and just load that when needed. The space required would be reduced tenfold.

@probberechts commented on GitHub (Aug 20, 2024):
Thanks for your suggestion.
I follow the "Save First, Parse Later" paradigm and I am not really eager to change this. This blog post lists a number of advantages, fyi.
We could probably save some storage space by discarding irrelevant parts of the HTML page (e.g., the `<head>`).
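The pruning idea above could be sketched with Python's stdlib `HTMLParser` (this is not soccerdata's actual parsing stack, and the `HtmlPruner`/`prune_html` names are hypothetical): re-emit the page while dropping the `<head>`, `<script>`, and `<style>` subtrees, so the data tables survive intact for later re-parsing.

```python
from html.parser import HTMLParser


class HtmlPruner(HTMLParser):
    """Re-emit an HTML document minus <head>, <script>, and <style> subtrees."""

    SKIP = {"head", "script", "style"}

    def __init__(self):
        super().__init__()
        self.out = []
        self.depth = 0  # nesting depth inside skipped elements

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1
        elif self.depth == 0:
            attr_str = "".join(
                f' {k}="{v}"' for k, v in attrs if v is not None
            )
            self.out.append(f"<{tag}{attr_str}>")

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)


def prune_html(html: str) -> str:
    """Return the page with head/script/style content stripped."""
    pruner = HtmlPruner()
    pruner.feed(html)
    return "".join(pruner.out)
```

Because the table markup is preserved verbatim, this keeps the "Save First, Parse Later" property: the cached file can still be re-parsed if the extraction logic changes.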
@lorenzodb1 commented on GitHub (Aug 20, 2024):
Thank you for sharing that blog post!
I think what's written there makes sense in the context of the ongoing development of a scraper, where the parsing of the data obtained from the website is subject to changes. In this case, I think an SFPL approach is a bit less justified as the end user of the library will work with the data obtained by soccerdata and won't be changing how the data is parsed.
I guess, though, that in the case where there's a bug in one of the functions responsible for parsing the data, then the SFPL approach would still save those calls. I guess I do understand your perspective after all.
Would you consider maybe adding a flag to specify how one would want the data to be cached and adding the ability to cache it as CSV?
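If such a flag were added, the dispatch might look like this minimal sketch (the `write_cache` name, its signature, and the `fmt` values are hypothetical illustrations, not soccerdata's API): raw HTML keeps the "Save First, Parse Later" guarantee, while CSV stores only the parsed rows.

```python
import csv
from pathlib import Path


def write_cache(path: Path, fmt: str, html: str, rows: list[dict]) -> None:
    """Hypothetical cache writer.

    fmt='html' stores the raw page (save-first-parse-later);
    fmt='csv' stores only the parsed rows (assumes at least one row).
    """
    if fmt == "html":
        path.with_suffix(".html").write_text(html, encoding="utf-8")
    elif fmt == "csv":
        with path.with_suffix(".csv").open("w", newline="", encoding="utf-8") as f:
            writer = csv.DictWriter(f, fieldnames=list(rows[0]))
            writer.writeheader()
            writer.writerows(rows)
    else:
        raise ValueError(f"unknown cache format: {fmt!r}")
```

The trade-off the thread describes is visible here: the CSV branch is much smaller on disk, but a bug fix in the parsing code cannot be replayed against it, whereas the HTML branch can always be re-parsed.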