[PR #281] [CLOSED] Fixes for issues affecting the FBref scraper #427

opened 2026-03-02 15:57:53 +03:00 by kerem · 0 comments

📋 Pull Request Information

Original PR: https://github.com/probberechts/soccerdata/pull/281
Author: @lorenzodb1
Created: 7/5/2023
Status: ❌ Closed

Base: master ← Head: lorenzodb1-fixes

📝 Commits (4)

7494536 Moved pretty-error to dev dependency and fixed bug making FBref tests fail
383d904 Fixed issue affecting cached team season stats
6c1e69a Fixed issue in read_schedule by moving the Top 5 Leagues optimisation in read_leagues
ec15e08 Fixed IndexError in _fix_nation_col

📊 Changes

5 files changed (+382 additions, -430 deletions)

📝 poetry.lock (+335 -409)
📝 pyproject.toml (+2 -2)
📝 soccerdata/_config.py (+0 -1)
📝 soccerdata/fbref.py (+42 -16)
📝 tests/test_FBref.py (+3 -2)

📄 Description

This PR fixes the following issues:

1. The filename for the cached team season stats always had `"all"` appended, regardless of the type of stats queried. As a result, the cache might not have contained the table needed. These tables are now cached in separate files.
2. Every `n` rows, the website inserts a row that repeats the table header. This caused `read_schedule` to fail, because `df_table` ended up with more rows than the list of match URLs obtained (see https://github.com/probberechts/soccerdata/issues/277). I added logic to remove these repeated header rows when found.
3. The website has no specific `Scores & Fixtures` page for the `Big 5 European Leagues Stats` page, so the scraper would go to the generic `Scores & Fixtures` page, which shows games currently being played. Because of this, I had to move the optimisation that combines the top five leagues under that label in `read_leagues`, as `read_schedule` needs the five top leagues separately rather than in their combined form.
4. The method `_fix_nation_col` throws an `IndexError`, presumably when no flag is present. I fixed this by switching the logic to regular expressions, so that no error is thrown when the flag is missing.
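
The fixes above can be illustrated with a minimal, dependency-free sketch. This is not the code from the PR: the function names, the plain-list table representation (the PR works on a pandas `df_table`), the assumed cell format `"eng ENG"` for the nation column, and the cache filename pattern are all hypothetical and only show the general idea.

```python
import re


def drop_repeated_headers(header, rows):
    """Return the rows of a scraped table without rows that repeat the header.

    Hypothetical helper: `header` is the list of column names and `rows`
    is a list of rows, each a list of cell strings.
    """
    return [row for row in rows if list(row) != list(header)]


# Assumption: a nation cell holds an optional lowercase flag code followed
# by a three-letter uppercase country code, e.g. "eng ENG" or just "ENG".
_NATION_CODE = re.compile(r"\b[A-Z]{3}\b")


def extract_nation(cell):
    """Return the three-letter country code, or None if there is none.

    Unlike splitting on whitespace and indexing the second element,
    a regex search does not raise IndexError when the flag is missing.
    """
    if not isinstance(cell, str):
        return None
    match = _NATION_CODE.search(cell)
    return match.group(0) if match else None


def cache_filename(league, season, stat_type):
    """Build a cache filename that includes the stat type.

    Hypothetical naming scheme: the point is only that different stat
    types map to different files instead of a shared "all" file.
    """
    return f"teams_{league}_{season}_{stat_type}.html"


if __name__ == "__main__":
    header = ["Date", "Home", "Away"]
    rows = [
        ["2023-08-11", "A", "B"],
        ["Date", "Home", "Away"],  # repeated header row inserted by the site
        ["2023-08-12", "C", "D"],
    ]
    print(drop_repeated_headers(header, rows))
    print(extract_nation("eng ENG"), extract_nation("ENG"), extract_nation(""))
    print(cache_filename("ENG-Premier League", "2223", "shooting"))
```

With the repeated header rows removed, the number of table rows should again line up with the number of match URLs, which is the mismatch described in item 2.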

Additionally, this PR moves `pretty-error` to the dev dependencies group, as it would otherwise be installed in any project that depends on this library (which should not be the case). I'm not sure I've done this correctly, and I had to remove some imports, so please let me know if this breaks previous behaviour and advise me on what I should do instead. It also updates `pandas` to v2.0.

🔄 This issue represents a GitHub Pull Request. It cannot be merged through Gitea due to API limitations.
