[GH-ISSUE #332] [WhoScored] Errors in event data #66

Closed
opened 2026-03-02 15:55:27 +03:00 by kerem · 5 comments

Originally created by @ksbharaj on GitHub (Aug 18, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/332

Hi all,

I have been scraping EPL data from recent seasons to run some models on, but I've spotted quite a few errors in event data scraped from WhoScored. One common one is registering a cross/pass as a dribble and assigning it to the wrong player. Take this example from Fulham vs Arsenal on 12th March 2023 (screenshot attached). Trossard provided a cross to Martinelli, who scored a header. However, WhoScored registered it as a failed cross, and as a dribble into the box by Martinelli from where Trossard actually crossed it. I have seen several of these, so happy to share more. Curious to know if anyone has spotted this, and found a solution?

I have also spotted some dribbles not registered (again, Martinelli's 80m solo goal vs Chelsea in Jan 2020 is instead registered as an 80m clearance by Mustafi!).

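![image](https://github.com/probberechts/soccerdata/assets/34198137/5b7e1a41-c3dd-4244-acd4-d6db41793160)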


@probberechts commented on GitHub (Aug 21, 2023):

Event stream data is primarily collected manually. Therefore, there will always be some errors in the data. Particularly in live data, the urgency of real-time constraints contributes to more inaccuracies. While data providers engage in post-game processing to rectify these errors, I'm uncertain whether WhoScored updates the data post-game.

It's important to note that event stream data isn't commonly employed for analyzing individual actions; rather, it is used for creating aggregated metrics across multiple matches. Consequently, the presence of a few errors within the data doesn't typically have a significant impact on the analysis, as these discrepancies tend to balance each other out.

If you need superior data quality, you'll have to pay for the data. StatsBomb stands out for its commitment to upholding data integrity. Opta also might have higher-quality (and more detailed) feeds than the ones provided on WhoScored. If you do not need up-to-date data, StatsBomb also provides a substantial amount of data free of charge at https://github.com/statsbomb/open-data.

To confirm, it appears you're working with the SPADL rendition of the data. Do these errors also appear in the raw event data? It's plausible that the errors might originate from issues within the SPADL conversion process.


@ksbharaj commented on GitHub (Aug 21, 2023):

Hi, thank you for your response! Your explanation helps me make sense of it all.
In the example provided, the errors don't really appear in the raw event data. The main issue seems to be that the raw event data does not have any "carry" data (as StatsBomb defines it). The conversion to SPADL seems to stitch on this carry/dribble data and does so erroneously for some edge cases.

For the example provided, Trossard's cross is actually listed as "fail" in the event data because it takes a touch off Kenny Tete before landing on Martinelli's head. The SPADL conversion process seems to assume that Martinelli collected it from Tete's touch, dribbled toward the goal, and scored (image attached).
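To make the suspected failure mode concrete, here is a minimal sketch of gap-filling between consecutive same-team actions. It is not socceraction's actual implementation: the `Action` class, the `insert_carries` helper, the thresholds, the timestamps, and the pitch coordinates are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Action:
    team: str
    player: str
    kind: str
    start: tuple[float, float]  # (x, y) in metres, made-up coordinates
    end: tuple[float, float]
    time: float  # seconds, made-up timestamps


def insert_carries(actions, min_len=3.0, max_len=60.0, max_gap_s=10.0):
    """Naively insert a 'dribble' wherever two consecutive actions by the
    same team leave a spatial gap. Thresholds are illustrative assumptions."""
    out = []
    for prev, nxt in zip(actions, actions[1:]):
        out.append(prev)
        gap = hypot(nxt.start[0] - prev.end[0], nxt.start[1] - prev.end[1])
        same_team = prev.team == nxt.team
        if same_team and min_len < gap < max_len and nxt.time - prev.time < max_gap_s:
            # The gap is attributed to the player of the *next* action,
            # even if the ball actually travelled via an opponent's touch.
            out.append(Action(nxt.team, nxt.player, "dribble", prev.end, nxt.start, prev.time))
    out.extend(actions[-1:])
    return out


# A failed cross that ends where it was deflected, followed by a shot elsewhere.
events = [
    Action("Arsenal", "Trossard", "cross (fail)", (95.0, 10.0), (98.0, 25.0), 0.0),
    Action("Arsenal", "Martinelli", "shot", (100.0, 34.0), (105.0, 34.0), 2.0),
]
result = insert_carries(events)
for action in result:
    print(action.player, action.kind)
```

Under these assumptions, the gap between the end of the cross and the start of the shot gets filled with a dribble credited to Martinelli, which matches the pattern described above.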

I have attached an Excel file that shows the raw event data vs. the SPADL data for a Martinelli goal for Arsenal vs Chelsea. As mentioned earlier, SPADL registers it as an 80m clearance by Mustafi, followed immediately by a goal by Martinelli. However, the event data correctly shows a shorter clearance, then a recovery by Martinelli in his own half, and finally a goal a few seconds later.

Over the next few days, I can try to run this SPADL conversion on StatsBomb open data to see if the same errors are seen. I'll get back to you on this.

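[raw_vs_spadl_data.xlsx](https://github.com/probberechts/soccerdata/files/12397993/raw_vs_spadl_data.xlsx)

![image](https://github.com/probberechts/soccerdata/assets/34198137/e5fd6018-c6c8-47e2-8ee4-49cde581e921)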


@probberechts commented on GitHub (Aug 22, 2023):

I see, then it is probably related to this bug: https://github.com/ML-KULeuven/socceraction/issues/519


@probberechts commented on GitHub (Sep 10, 2023):

These edge cases were resolved in https://github.com/ML-KULeuven/socceraction/pull/585. You should upgrade socceraction to v1.4.2 to get these fixes.
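A quick way to check whether the installed version is at least 1.4.2 could look like the sketch below. The helpers `version_tuple` and `needs_upgrade` are hypothetical names introduced here, and the parsing only looks at the first three numeric components, so pre-release suffixes are ignored.

```python
import re
from importlib import metadata


def version_tuple(version):
    """Leading numeric components of a version string, e.g. '1.4.2' -> (1, 4, 2)."""
    return tuple(int(part) for part in re.findall(r"\d+", version)[:3])


def needs_upgrade(installed, required="1.4.2"):
    """True if the installed version sorts before the required one."""
    return version_tuple(installed) < version_tuple(required)


try:
    installed = metadata.version("socceraction")
except metadata.PackageNotFoundError:
    installed = None

if installed is None:
    print("socceraction is not installed")
elif needs_upgrade(installed):
    print(f"socceraction {installed} predates 1.4.2; consider upgrading")
else:
    print(f"socceraction {installed} should include the fixes")
```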

The SPADL representation of Trossard's cross now looks like this:
![deflection](https://github.com/probberechts/soccerdata/assets/2175271/d6225fc4-3c58-4301-8f1b-fde8873310a5)

The SPADL representation of Martinelli's goal now looks like this:
![recoveries](https://github.com/probberechts/soccerdata/assets/2175271/5532aa87-3da6-4388-964e-e751b4a68d31)

Thanks for flagging this!


@ksbharaj commented on GitHub (Sep 10, 2023):

Thank you very much!
I wasn't able to replicate the result locally due to the use of an "is" rather than "==" in one of the methods you created. However, I have created a pull request to address this.
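For readers unfamiliar with the distinction, here is a small generic illustration of why `is` can fail where `==` succeeds. The thread does not say which method or values were involved, so the names `EXPECTED`, `matches_with_is`, and `matches_with_eq` and the label value are hypothetical; the identity result shown is what CPython does for a string built at runtime.

```python
EXPECTED = "clearance"


def matches_with_is(label):
    # Identity check: true only if both names point at the very same object.
    return label is EXPECTED


def matches_with_eq(label):
    # Equality check: true whenever the values are equal.
    return label == EXPECTED


# A string assembled at runtime (e.g. parsed from a file or a DataFrame)
# has the same value but is a different object from the constant above.
label = "".join(["clear", "ance"])

print(matches_with_eq(label))
print(matches_with_is(label))
```

Because identity depends on how and where an object was created, a comparison written with `is` may appear to work in one environment and fail in another, which would be consistent with a result that cannot be reproduced locally.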
