[GH-ISSUE #889] [SoFIFA] Read_player_ratings return only 1 record #192

Open

opened 2026-03-02 15:56:33 +03:00 by kerem · 1 comment

kerem commented

2026-03-02 15:56:33 +03:00

Owner

Originally created by @mttam on GitHub (Sep 18, 2025).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/889

Describe the bug
The method read_player_ratings returns only the last player. The cause is incorrect indentation: the XPath extraction and ratings.append() sit outside the player loop, so only the last player's scores are processed and appended.
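
The effect of the misplaced indentation can be reproduced without any scraping. The following is a minimal sketch, not the library's code: `collect_buggy` and `collect_fixed` are hypothetical helpers, and the integer player IDs are placeholders.

```python
def collect_buggy(players):
    """Mimic the bug: the append sits outside the loop body."""
    ratings = []
    scores = None
    for player in players:
        scores = {"player": player}
    # Dedented by mistake: this runs once, after the loop has finished,
    # so only the scores of the last player are kept.
    if scores is not None:
        ratings.append(scores)
    return ratings


def collect_fixed(players):
    """Append inside the loop so every player contributes one row."""
    ratings = []
    for player in players:
        scores = {"player": player}
        ratings.append(scores)
    return ratings


players = [1, 2, 3]  # placeholder player IDs
print(len(collect_buggy(players)))  # 1
print(len(collect_fixed(players)))  # 3
```

With three players, the dedented version yields a single row for the last player, which matches the one-row output reported under "Error output".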

Python Version
Python 3.11.4

Affected scrapers
This affects the following scrapers:

SoFIFA

Code example

import soccerdata as sd
sofifa = sd.SoFIFA(leagues="ENG-Premier League", versions="latest")
print(sofifa.read_player_ratings(team="Arsenal"))

Error message

no error message

Error output

                  fifa_edition        update overallrating  ... gk_kicking gk_positioning gk_reflexes
player                                                      ...
Takehiro Tomiyasu        FC 25  Jul 17, 2025            78  ...          6              5          11

[1 rows x 38 columns]

Additional context
I fixed the problem with GPT-5 mini, but I'm not sure it is the correct approach (or even a real issue), because I only download the collection.
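
One way to gain some confidence in a fix like this is to compare the number of returned rows with the number of (version, player) pairs the method iterates over. The sketch below assumes each pair yields exactly one row; `expected_row_count` and `check_row_count` are hypothetical helpers, not part of soccerdata.

```python
from itertools import product


def expected_row_count(versions, players):
    """Number of (version, player) pairs, mirroring the product() call in the method."""
    return len(list(product(versions, players)))


def check_row_count(rows, versions, players):
    """Raise if the result does not hold one row per (version, player) pair."""
    expected = expected_row_count(versions, players)
    if len(rows) != expected:
        raise AssertionError(f"expected {expected} rows, got {len(rows)}")
    return True


# Placeholder inputs: one version and three player IDs.
versions = ["latest"]
players = [1, 2, 3]
rows = [{"player": p} for p in players]
print(check_row_count(rows, versions, players))  # True
```

Applied to the report above, a single version and a full squad should give many rows, so a result with exactly one row would fail this check.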

Code fix sofifa.py

    def read_player_ratings(
        self,
        team: Optional[Union[str, list[str]]] = None,
        player: Optional[Union[int, list[int]]] = None,
    ) -> pd.DataFrame:
        """Retrieve ratings for players.

        Parameters
        ----------
        team: str or list of str, optional
            Team(s) to retrieve. If None, will retrieve all teams.
        player: int or list of int, optional
            Player(s) to retrieve. If None, will retrieve all players.

        Returns
        -------
        pd.DataFrame
        """
        # build url
        urlmask = SO_FIFA_API + "/player/{}/?r={}&set=true"
        filemask = "player_{}_{}.html"

        # get player IDs
        if player is None:
            players = self.read_players(team=team).index.unique()
        elif isinstance(player, int):
            players = [player]
        else:
            players = player

        # prepare empty data frame
        ratings = []

        # define labels to use for score extraction from player profile pages
        score_labels = [
            "Overall rating",
            "Potential",
            "Crossing",
            "Finishing",
            "Heading accuracy",
            "Short passing",
            "Volleys",
            "Dribbling",
            "Curve",
            "FK Accuracy",
            "Long passing",
            "Ball control",
            "Acceleration",
            "Sprint speed",
            "Agility",
            "Reactions",
            "Balance",
            "Shot power",
            "Jumping",
            "Stamina",
            "Strength",
            "Long shots",
            "Aggression",
            "Interceptions",
            "Positioning",
            "Vision",
            "Penalties",
            "Composure",
            "Defensive awareness",
            "Standing tackle",
            "Sliding tackle",
            "GK Diving",
            "GK Handling",
            "GK Kicking",
            "GK Positioning",
            "GK Reflexes",
        ]

        iterator = list(product(self.versions.iterrows(), players))
        for i, ((version_id, version), player) in enumerate(iterator):
            logger.info(
                "[%s/%s] Retrieving ratings for player with ID %s in %s edition",
                i + 1,
                len(iterator),
                player,
                version["update"],
            )

            # read html page (player overview)
            filepath = self.data_dir / filemask.format(player, version_id)
            url = urlmask.format(player, version_id)
            reader = self.get(url, filepath)

            # extract scores one-by-one
            tree = html.parse(reader, parser=html.HTMLParser(encoding="utf8"))

            # get player name safely
            node_player_name_nodes = tree.xpath("//div[contains(@class, 'profile')]/h1")
            if node_player_name_nodes:
                node_player_name = node_player_name_nodes[0]
                # Extract what is before <br>
                before_br = node_player_name.xpath("string(./text()[1])").strip()
                # Extract what is after <br>
                after_br = node_player_name.xpath(
                    "string(./br/following-sibling::text()[1])"
                ).strip()
                player_name = before_br if before_br else after_br
            else:
                player_name = None

            scores = {"player": player_name, **version.to_dict()}

            # Try each XPath until one returns a result
            for s in score_labels:
                value = None
                xpaths = [
                    f"//p[.//text()[contains(.,'{s}')]]/span/em",
                    f"//div[contains(.,'{s}')]/em",
                    f"//li[not(self::script)][.//text()[contains(.,'{s}')]]/em",
                ]
                for xpath in xpaths:
                    nodes = tree.xpath(xpath)
                    if nodes:  # If at least one match is found
                        text = nodes[0].text
                        value = text.strip() if text is not None else None
                        break  # Stop checking other XPaths once we find a valid value

                scores[s] = value  # will be None if not found

            ratings.append(scores)
        # return data frame
        return pd.DataFrame(ratings).pipe(standardize_colnames).set_index(["player"]).sort_index()
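
The guard on the node text in the code above matters because an element with no text content has its `text` attribute set to None, and calling `.strip()` on it would raise. The sketch below illustrates this with the standard library's `xml.etree.ElementTree` rather than lxml, purely to stay self-contained; `safe_text` is a hypothetical helper and the markup is made up for the example.

```python
import xml.etree.ElementTree as ET


def safe_text(node):
    """Return the stripped text of a node, or None if the node has no text."""
    text = node.text
    return text.strip() if text is not None else None


# Made-up markup: one <em> with a value, one empty <em>.
root = ET.fromstring("<ul><li><em> 78 </em></li><li><em/></li></ul>")
filled, empty = root.findall(".//em")

print(safe_text(filled))  # 78
print(safe_text(empty))   # None
```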

Contributor Action Plan

I’m unsure how to fix this, but I'm willing to work on it with guidance.


kerem added the bug label 2026-03-02 15:56:33 +03:00

kerem commented

2026-03-02 15:56:33 +03:00

Author

Owner

@crossin commented on GitHub (Sep 29, 2025):

I also encountered this issue. Adding one more level of indentation to this part of the original code resolves it, from the line
for s in score_labels:
through the line
ratings.append(scores)

    def read_player_ratings(
        self,
        team: Optional[Union[str, list[str]]] = None,
        player: Optional[Union[int, list[int]]] = None,
    ) -> pd.DataFrame:
        """Retrieve ratings for players.

        Parameters
        ----------
        team: str or list of str, optional
            Team(s) to retrieve. If None, will retrieve all teams.
        player: int or list of int, optional
            Player(s) to retrieve. If None, will retrieve all players.

        Returns
        -------
        pd.DataFrame
        """
        # build url
        urlmask = SO_FIFA_API + "/player/{}/?r={}&set=true"
        filemask = "player_{}_{}.html"

        # get player IDs
        if player is None:
            players = self.read_players(team=team).index.unique()
        elif isinstance(player, int):
            players = [player]
        else:
            players = player

        # prepare empty data frame
        ratings = []

        # define labels to use for score extraction from player profile pages
        score_labels = [
            "Overall rating",
            "Potential",
            "Crossing",
            "Finishing",
            "Heading accuracy",
            "Short passing",
            "Volleys",
            "Dribbling",
            "Curve",
            "FK Accuracy",
            "Long passing",
            "Ball control",
            "Acceleration",
            "Sprint speed",
            "Agility",
            "Reactions",
            "Balance",
            "Shot power",
            "Jumping",
            "Stamina",
            "Strength",
            "Long shots",
            "Aggression",
            "Interceptions",
            "Positioning",
            "Vision",
            "Penalties",
            "Composure",
            "Defensive awareness",
            "Standing tackle",
            "Sliding tackle",
            "GK Diving",
            "GK Handling",
            "GK Kicking",
            "GK Positioning",
            "GK Reflexes",
        ]

        iterator = list(product(self.versions.iterrows(), players))
        for i, ((version_id, version), player) in enumerate(iterator):
            logger.info(
                "[%s/%s] Retrieving ratings for player with ID %s in %s edition",
                i + 1,
                len(iterator),
                player,
                version["update"],
            )

            # read html page (player overview)
            filepath = self.data_dir / filemask.format(player, version_id)
            url = urlmask.format(player, version_id)
            reader = self.get(url, filepath)

            # extract scores one-by-one
            tree = html.parse(reader, parser=html.HTMLParser(encoding="utf8"))
            node_player_name = tree.xpath("//div[contains(@class, 'profile')]/h1")[0]
            # Extract what is before <br>
            before_br = node_player_name.xpath("string(./text()[1])").strip()
            # Extract what is after <br>
            after_br = node_player_name.xpath("string(./br/following-sibling::text()[1])").strip()
            scores = {
                "player": before_br if before_br else after_br,
                **version.to_dict(),
            }

            # Try each XPath until one returns a result
            for s in score_labels:
                value = None
                xpaths = [
                    f"//p[.//text()[contains(.,'{s}')]]/span/em",
                    f"//div[contains(.,'{s}')]/em",
                    f"//li[not(self::script)][.//text()[contains(.,'{s}')]]/em",
                ]
                for xpath in xpaths:
                    nodes = tree.xpath(xpath)
                    if nodes:  # If at least one match is found
                        text = nodes[0].text  # Take only the first match
                        value = text.strip() if text is not None else None
                        break  # Stop checking other XPaths once we find a valid value

                scores[s] = value  # None if no XPath matched
            ratings.append(scores)
        # return data frame
        return pd.DataFrame(ratings).pipe(standardize_colnames).set_index(["player"]).sort_index()


kerem referenced this issue

2026-03-02 15:57:33 +03:00

[PR #192] [MERGED] Update dependency pre-commit to v3.2.0 #351