mirror of
https://github.com/probberechts/soccerdata.git
synced 2026-04-25 10:05:53 +03:00
[GH-ISSUE #68] [FBref] Flatten the index and columns #12
Originally created by @andrewRowlinson on GitHub (Jul 26, 2022).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/68
Hi,
Your package looks great! I am thinking of making a couple of FBref examples using soccerdata for mplsoccer. I was wondering whether you would consider flattening the multi-level index/columns. I have previously flattened the columns like this:
https://github.com/andrewRowlinson/outliers-football/blob/master/scrape_utils.py#L34-L40
Thanks,
Andy
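To make the request concrete, here is a minimal sketch of one way to flatten two-level columns with pandas. It is not the code behind the link above; the toy table, the `flatten_columns` helper and the `_` separator are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical two-level columns resembling an FBref stats table.
columns = pd.MultiIndex.from_tuples(
    [("Performance", "Gls"), ("Performance", "Ast"), ("Per 90 Minutes", "Gls")]
)
df = pd.DataFrame([[10, 4, 0.5], [7, 9, 0.3]], columns=columns)


def flatten_columns(frame: pd.DataFrame, sep: str = "_") -> pd.DataFrame:
    """Return a copy whose column levels are joined into single strings."""
    flat = frame.copy()
    flat.columns = [
        # Skip empty level values so names do not end with a dangling separator.
        sep.join(str(part) for part in col if str(part) != "")
        for col in flat.columns.to_flat_index()
    ]
    return flat


flat_df = flatten_columns(df)
print(list(flat_df.columns))
```

The resulting names (for example `Performance_Gls`) are unique but not especially readable, which is part of the trade-off discussed in this thread.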
@probberechts commented on GitHub (Jul 26, 2022):
I think the multi-level index is very convenient here because the data is logically hierarchical — meaning that a league contains multiple seasons, each season contains multiple teams, ... Also, I think the multi-level index has three main benefits:
And of course, if you do not like the multi-level index, all you need is a simple `df.reset_index()`, but I think using the multi-level index is a good default.
I do not have a strong preference for multi-level columns or flat columns. I assume the latter is a bit more convenient for novice users, but the former has some advantages too. Again, you have logical groups of stats here, and with the multi-level columns, you can easily select one of these groups. For example, if you are only interested in the "per 90" stats, you can simply do `df["Per 90 Minutes"]`. With a flat index that would require something like `df[['Gls_p90', 'Ast_p90', ...]]`. Eventually, I stuck to the default because I do not see a convincing reason to change it, but I am curious why you would like to flatten them.
Actually, I would drop all columns that can be derived from combining other columns. For example, all per 90 stats can be computed by dividing the "performance" and "expected" groups by ("Playing time", "Min"). That would make the multi-level columns obsolete too, but I expect many people to complain that some columns are missing if I did that 🙃.
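The points in this comment can be tried on a small hand-made table. The sketch below is only illustrative: the index names, column labels and numbers are assumptions, not actual soccerdata output, and the per 90 derivation assumes that "per 90" means the stat divided by minutes played over 90.

```python
import pandas as pd

# Hypothetical hierarchical index: league > season > team.
index = pd.MultiIndex.from_tuples(
    [
        ("ENG-Premier League", "2122", "Arsenal"),
        ("ENG-Premier League", "2122", "Chelsea"),
    ],
    names=["league", "season", "team"],
)
# Hypothetical grouped columns; the values are made up.
columns = pd.MultiIndex.from_tuples(
    [
        ("Playing Time", "Min"),
        ("Performance", "Gls"),
        ("Performance", "Ast"),
        ("Per 90 Minutes", "Gls"),
        ("Per 90 Minutes", "Ast"),
    ]
)
df = pd.DataFrame(
    [[900, 20, 10, 2.0, 1.0], [900, 30, 15, 3.0, 1.5]],
    index=index,
    columns=columns,
)

# Selecting one top-level group returns a frame with only its sub-columns.
per90 = df["Per 90 Minutes"]

# The same group derived from the raw counts and the minutes played.
nineties = df[("Playing Time", "Min")] / 90
derived = df["Performance"].div(nineties, axis=0)

# Moving the index levels into ordinary columns.
flat_index = df.reset_index()
print(per90)
print(derived)
```

On this toy data the derived frame matches the stored "Per 90 Minutes" group, which is why those columns could in principle be dropped.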
@andrewRowlinson commented on GitHub (Jul 27, 2022):
No worries, I agree indexes are easy to change with reset_index.
With columns, I guess all the standard stuff is harder, e.g. renaming a column or loading it into a database.
I expect the main use case is also for making pizza charts and radars. People will need to select stats from several of the top-level columns, and I wonder how they would do this in practice. It might get tricky quickly.
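As a rough illustration of the selection problem raised here, the sketch below picks stats from several top-level groups by passing a list of (group, stat) tuples and then assigns flat names for plotting. The table, the player labels and the new column names are hypothetical.

```python
import pandas as pd

# Hypothetical player table with grouped columns; the values are made up.
columns = pd.MultiIndex.from_tuples(
    [
        ("Performance", "Gls"),
        ("Performance", "Ast"),
        ("Expected", "xG"),
        ("Per 90 Minutes", "Gls"),
    ]
)
df = pd.DataFrame(
    [[10, 4, 8.2, 0.5], [7, 9, 6.1, 0.3]],
    index=["Player A", "Player B"],
    columns=columns,
)

# Each tuple names one column as (top-level group, stat).
wanted = [("Performance", "Ast"), ("Expected", "xG"), ("Per 90 Minutes", "Gls")]
radar = df[wanted]

# Assign flat, chart-friendly names in the same order as `wanted`.
radar.columns = ["assists", "xG", "goals_per90"]
print(radar)
```

The selection itself is short, but every stat has to be spelled out as a tuple, so a long list of radar parameters can become verbose.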
@probberechts commented on GitHub (Jul 28, 2022):
Well, I guess hierarchical columns have some slight advantages in some use cases and some slight disadvantages in other use cases. For all of your use cases, you could also claim that hierarchical columns are an advantage. For example, to rename the same column in each group of stats you can do `team_season_stats.rename(columns={"Gls": "goals"}, level=1)` and when creating a radar chart you probably want to select a single group of stats most of the time (e.g., you do not want to mix "per 90" stats with aggregated stats in a single chart).
Another issue that I see is that it will require quite a few manual overrides to get meaningful column names. If flattened, I would prefer just "goals" instead of "Performance_gls". These manual overrides come with an additional maintenance cost.
Hence, I think it is better to stick to the default here. After all, this is a scraper, not a data manipulation library.
I'll leave the issue open for a while and reconsider if others can provide convincing arguments.
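For reference, the level-based rename mentioned in this comment can be checked on a toy table. Only the `rename` call comes from the discussion; the contents of `team_season_stats` below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for a team season stats table with grouped columns.
columns = pd.MultiIndex.from_tuples(
    [("Performance", "Gls"), ("Performance", "Ast"), ("Per 90 Minutes", "Gls")]
)
team_season_stats = pd.DataFrame([[10, 4, 0.5]], columns=columns)

# level=1 applies the mapping to the second column level only,
# so "Gls" is renamed inside every top-level group at once.
renamed = team_season_stats.rename(columns={"Gls": "goals"}, level=1)
print(renamed.columns.tolist())
```

The top-level group names are left untouched, so the renamed column still appears once per group.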