[GH-ISSUE #277] [FBref] Non-data rows in the table body should be removed #57

Closed
opened 2026-03-02 15:55:23 +03:00 by kerem · 3 comments
Owner

Originally created by @lorenzodb1 on GitHub (Jun 29, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/277

read_player_match_stats throws ValueError: Length of values (131) does not match length of index (132) at fbref.py#L641 (https://github.com/probberechts/soccerdata/blob/master/soccerdata/fbref.py#L641) because df_table contains an additional row that shouldn't be there (see line 127 in the attached image).

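[Attached image: https://github.com/probberechts/soccerdata/assets/16848175/c23ed2ec-cff2-476c-a3a3-d96d362f0ff0]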
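For illustration, the same kind of error can be reproduced in isolation. This is only a minimal sketch: it assumes the failing line assigns a separately scraped list of values to a column of df_table, and the column names and values below are hypothetical stand-ins rather than the actual soccerdata code.

```python
import pandas as pd

# Hypothetical stand-in for df_table: three parsed rows, one of which is a
# repeated header row that sits inside the table body.
df_table = pd.DataFrame({"Player": ["Player A", "Player", "Player B"]})

# Hypothetical stand-in for values scraped separately from the data rows only.
player_ids = ["id_a", "id_b"]

try:
    # Assigning 2 values to a 3-row frame raises the ValueError from the report.
    df_table["player_id"] = player_ids
except ValueError as exc:
    message = str(exc)
    print(message)
```

Dropping the non-data row before the assignment makes the two lengths agree again.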
kerem 2026-03-02 15:55:23 +03:00
  • closed this issue
  • added the bug and FBref labels

@lorenzodb1 commented on GitHub (Jun 30, 2023):

The issue affects every method that calls read_schedule. It's quite annoying, as it prevents downloading any data for many leagues, including major ones such as the UCL.


@probberechts commented on GitHub (Jul 7, 2023):

Apart from these header rows, I noticed that FBref also added "spacer" rows to the fixtures table. The header rows can be removed with:

# remove thead rows in the table body
# (".//" keeps the search inside html_table rather than the whole document)
for elem in html_table.xpath(".//tbody/tr[contains(@class, 'thead')]"):
    elem.getparent().remove(elem)

Maybe we should add the following helper method and call it everywhere before passing the HTML to Pandas.

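def _clean_table(html_table):
    """Remove non-data elements from an lxml table element, in place."""
    # Relative paths (".//") keep the search inside html_table;
    # a leading "//" would match elements anywhere in the document.
    # remove icons
    for elem in html_table.xpath(".//span"):
        elem.getparent().remove(elem)
    # remove spacer rows
    for elem in html_table.xpath(".//tbody/tr[contains(@class, 'spacer')]"):
        elem.getparent().remove(elem)
    # remove thead rows in the table body
    for elem in html_table.xpath(".//tbody/tr[contains(@class, 'thead')]"):
        elem.getparent().remove(elem)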
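
As a rough usage sketch, the helper could be exercised on a small hand-written table like the one below. The HTML is made up for illustration and only mimics the 'spacer' and 'thead' row classes mentioned above. The helper here is a variant that uses relative XPath expressions (".//"), which is an assumption of this sketch so that only the given table is touched.

```python
from lxml import html


def _clean_table(html_table):
    """Remove icons, spacer rows and in-body header rows, in place."""
    for elem in html_table.xpath(".//span"):
        elem.getparent().remove(elem)
    for elem in html_table.xpath(".//tbody/tr[contains(@class, 'spacer')]"):
        elem.getparent().remove(elem)
    for elem in html_table.xpath(".//tbody/tr[contains(@class, 'thead')]"):
        elem.getparent().remove(elem)


# Made-up fixtures table: two data rows, one spacer row, one repeated header row.
table = html.fromstring("""
<table>
  <thead><tr><th>Date</th><th>Home</th></tr></thead>
  <tbody>
    <tr><td>2023-06-29</td><td><span class="icon">x</span><a href="/a">Team A</a></td></tr>
    <tr class="spacer"><td></td><td></td></tr>
    <tr class="thead"><th>Date</th><th>Home</th></tr>
    <tr><td>2023-06-30</td><td><a href="/b">Team B</a></td></tr>
  </tbody>
</table>
""")

rows_before = len(table.xpath(".//tbody/tr"))
_clean_table(table)
rows_after = len(table.xpath(".//tbody/tr"))
print(rows_before, rows_after)
```

Only the two data rows remain in the table body, so a parser that reads the cleaned table and a scraper that walks the data rows should see the same number of rows.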

@lorenzodb1 commented on GitHub (Jul 7, 2023):

The spacer rows don't cause issues: when we scrape the URLs, we get an empty value for each of those rows, which lines up with the empty rows in the table. That being said, I see no problem with the _clean_table method you suggested. Let me know if you want me to add that in #284.
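
To make the point about spacer rows concrete, here is a small sketch of one way a per-row URL scrape can stay aligned with the table. The HTML and the extraction expression are illustrative assumptions, not the actual soccerdata implementation.

```python
from lxml import html

# Made-up table body: two data rows with a link, separated by a spacer row.
table = html.fromstring("""
<table>
  <tbody>
    <tr><td><a href="/matches/1">Match Report</a></td></tr>
    <tr class="spacer"><td></td></tr>
    <tr><td><a href="/matches/2">Match Report</a></td></tr>
  </tbody>
</table>
""")

rows = table.xpath(".//tbody/tr")
# One entry per row: the first link in the row, or None when the row has no link.
urls = [(row.xpath(".//a/@href") or [None])[0] for row in rows]
print(urls)
```

Because every row contributes exactly one entry, the spacer row yields an empty value and the list of URLs keeps the same length as the list of rows.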
