[GH-ISSUE #252] Data differs from WhoScored #51

Closed
opened 2026-03-02 15:55:20 +03:00 by kerem · 1 comment
Owner

Originally created by @REM4125 on GitHub (May 24, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/252

Hello,

Thank you for this great package!

Two small questions:

  1. I scraped the WhoScored data via the loader, but the data is not exactly what socceraction seems to expect (I'm thinking of the keypass column, for example). Is this a problem, and where do these differences come from?
  2. Some libraries work with Opta data but give slightly different results: for example, edd_webster uses ScraperFC and does not get the same keys in the "qualifiers" dictionary. Judging by the official Opta documentation he provides, his output appears to match it more closely. Do you know where these differences come from, and which would you recommend using? Thanks

@probberechts commented on GitHub (May 25, 2023):

Opta data is available in many formats (JSON and XML) and levels of detail (F24, MA3, ...). The data stream that can be scraped from WhoScored is based on Opta data, but WhoScored uses its own JSON format and level of detail. It mostly contains the same information as F24 streams but there are some minor differences.

The output of the WhoScored scraper can be directly used in socceraction. The two libraries are perfectly compatible. If you look at [the schema for Opta data in socceraction](https://github.com/ML-KULeuven/socceraction/blob/6c3428ca08cda6c958ab3fde501cee35f5db03e7/socceraction/data/opta/schema.py#L53) you'll notice that the "keypass" column is optional. This is one difference between the F24 streams and the WhoScored JSON format. The WhoScored JSON data does not have a "keypass" attribute at the root level, but you can probably derive it from the list of qualifiers of each pass event.
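
As a rough sketch of that derivation: the snippet below assumes raw events are dicts with a `type.displayName` field and a `qualifiers` list whose entries also carry `type.displayName`, and that a key pass is marked by a qualifier named `KeyPass`. Both the field layout and the qualifier name are assumptions to verify against your own raw output, and `derive_keypass` is a hypothetical helper, not part of soccerdata or socceraction.

```python
from typing import Any, Dict, Iterable

# Assumed qualifier name(s) marking a key pass; check against your own raw events.
KEYPASS_QUALIFIERS = frozenset({"KeyPass"})


def derive_keypass(
    event: Dict[str, Any], names: Iterable[str] = KEYPASS_QUALIFIERS
) -> bool:
    """Return True if a pass event carries one of the given qualifier names."""
    wanted = set(names)
    # Assumed layout: event["type"]["displayName"] holds the event type name.
    if event.get("type", {}).get("displayName") != "Pass":
        return False
    for qualifier in event.get("qualifiers", []):
        if qualifier.get("type", {}).get("displayName") in wanted:
            return True
    return False


# Hand-written sample events (illustrative only, not real match data).
sample_events = [
    {
        "type": {"displayName": "Pass"},
        "qualifiers": [{"type": {"displayName": "KeyPass"}}],
    },
    {
        "type": {"displayName": "Pass"},
        "qualifiers": [{"type": {"displayName": "Zone"}, "value": "Center"}],
    },
    {
        "type": {"displayName": "SavedShot"},
        "qualifiers": [{"type": {"displayName": "KeyPass"}}],
    },
]

# One boolean per event: only the first sample is a pass with the qualifier.
keypass_flags = [derive_keypass(event) for event in sample_events]
```

If your raw data uses a different marker for key passes, pass the relevant names via the `names` argument instead of editing the helper.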

I can't check the difference between the output of soccerdata and ScraperFC, as the WhoScored module was removed from ScraperFC. Nevertheless, I assume it is just a matter of how you [massage the JSON data to get a dataframe](https://github.com/probberechts/soccerdata/blob/5a3a904a8784e8e2d3ce968547739d34634a4ebe/soccerdata/whoscored.py#L765) and maybe some updates in the data stream itself. With soccerdata you can get the raw output JSON using:

```python
events = ws.read_events(match_id=1485184, output_fmt="raw")
```
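
To illustrate how the "massaging" step alone can yield different keys, here is a small sketch that flattens the same qualifier list two ways: keyed by display name and keyed by numeric id. The qualifier layout and the sample values are assumptions for illustration, and neither function is taken from soccerdata or ScraperFC.

```python
from typing import Any, Dict, List

Qualifier = Dict[str, Any]


def qualifiers_by_name(qualifiers: List[Qualifier]) -> Dict[str, Any]:
    """Flatten qualifiers into {displayName: value}; flag-only qualifiers map to True."""
    return {q["type"]["displayName"]: q.get("value", True) for q in qualifiers}


def qualifiers_by_id(qualifiers: List[Qualifier]) -> Dict[int, Any]:
    """Flatten qualifiers into {numeric id: value}; flag-only qualifiers map to True."""
    return {q["type"]["value"]: q.get("value", True) for q in qualifiers}


# Hand-written sample in the assumed layout (illustrative only).
sample_qualifiers = [
    {"type": {"value": 56, "displayName": "Zone"}, "value": "Center"},
    {"type": {"value": 1, "displayName": "Longball"}},
]

by_name = qualifiers_by_name(sample_qualifiers)
by_id = qualifiers_by_id(sample_qualifiers)
```

Both dictionaries hold the same information, yet their keys do not overlap at all, which is the kind of difference you would see between two libraries that post-process the same stream differently.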