mirror of
https://github.com/probberechts/soccerdata.git
synced 2026-04-26 02:25:51 +03:00
[GH-ISSUE #332] [WhoScored] Errors in event data #66
Originally created by @ksbharaj on GitHub (Aug 18, 2023).
Original GitHub issue: https://github.com/probberechts/soccerdata/issues/332
Hi all,
I have been scraping EPL data from recent seasons to run some models on, but I've spotted quite a few errors in event data scraped from WhoScored. One common one is registering a cross/pass as a dribble and assigning it to the wrong player. Take this example from Fulham vs Arsenal on 12th March 2023 (screenshot attached). Trossard provided a cross to Martinelli, who scored a header. However, WhoScored registered it as a failed cross, and as a dribble into the box by Martinelli from where Trossard actually crossed it. I have seen several of these, so happy to share more. Curious to know if anyone has spotted this, and found a solution?
I have also spotted some dribbles not registered (again, Martinelli's 80m solo goal vs Chelsea in Jan 2020 is instead registered as an 80m clearance by Mustafi!).
@probberechts commented on GitHub (Aug 21, 2023):
Event stream data is primarily collected manually. Therefore, there will always be some errors in the data. Particularly in live data, the urgency of real-time constraints contributes to more inaccuracies. While data providers engage in post-game processing to rectify these errors, I'm uncertain whether WhoScored updates the data post-game.
It's important to note that event stream data isn't commonly employed for analyzing individual actions; rather, it is used for creating aggregated metrics across multiple matches. Consequently, the presence of a few errors within the data doesn't typically have a significant impact on the analysis, as these discrepancies tend to balance each other out.
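As a rough illustration of the aggregation argument above, the sketch below simulates one hypothetical player's season. Each match has a true action count, and independent recording errors both drop and add actions. All of the numbers (match count, error rate, action counts) are assumptions chosen for illustration, not measurements of WhoScored data.

```python
import random


def simulate_season(n_matches=38, error_rate=0.05, seed=0):
    """Return (true counts, recorded counts) per match for one hypothetical player."""
    rng = random.Random(seed)
    true_counts, recorded_counts = [], []
    for _ in range(n_matches):
        true = rng.randint(20, 60)  # hypothetical number of actions in a match
        # Real actions that the annotator missed.
        missed = sum(rng.random() < error_rate for _ in range(true))
        # Actions that were recorded although they did not happen.
        spurious = sum(rng.random() < error_rate for _ in range(true))
        true_counts.append(true)
        recorded_counts.append(true - missed + spurious)
    return true_counts, recorded_counts


true_counts, recorded_counts = simulate_season()
per_match_rel_err = [abs(r - t) / t for t, r in zip(true_counts, recorded_counts)]
season_rel_err = abs(sum(recorded_counts) - sum(true_counts)) / sum(true_counts)

print(f"worst single-match relative error: {max(per_match_rel_err):.3f}")
print(f"season-total relative error:       {season_rel_err:.3f}")
```

The season-level error can never exceed the worst single-match error, and with errors in both directions it is usually much smaller. This only holds for unsystematic errors; a systematic bias, such as a conversion bug that always mislabels the same pattern, does not cancel out.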
If you need superior data quality, you'll have to pay for the data. StatsBomb stands out for its commitment to upholding data integrity. Opta also might have higher-quality (and more detailed) feeds than the ones provided on WhoScored. If you do not need up-to-date data, StatsBomb also provides a substantial amount of data free of charge at https://github.com/statsbomb/open-data.
To confirm, it appears you're working with the SPADL rendition of the data. Do these errors also appear in the raw event data? It's plausible that the errors might originate from issues within the SPADL conversion process.
@ksbharaj commented on GitHub (Aug 21, 2023):
Hi, thank you for your response! Your explanation helps me make sense of it all.
In the example provided, the errors don't really appear in the raw event data. The main issue seems to be that the raw event data does not have any "carry" data (as StatsBomb defines it). The conversion to SPADL seems to stitch on this carry/dribble data and does so erroneously for some edge cases.
For the example provided, Trossard's cross is actually listed as "fail" in the event data because it takes a touch off Kenny Tete before landing on Martinelli's head. The SPADL conversion process seems to assume that Martinelli collected it from Tete's touch, dribbled toward the goal, and scored (image attached).
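To make the suspected mechanism concrete, here is a simplified sketch of a gap-filling heuristic that inserts a dribble whenever a team's next action starts far from where its previous action ended. It is not socceraction's actual implementation; the function name, the threshold, the dictionary layout and the coordinates are all assumptions made for illustration.

```python
import math

MIN_DRIBBLE_LENGTH = 3.0  # metres; hypothetical threshold


def add_inferred_dribbles(actions):
    """Insert a synthetic dribble between two consecutive same-team actions
    when the second starts far from where the first ended (illustrative only)."""
    out = []
    for prev, nxt in zip(actions, actions[1:]):
        out.append(prev)
        gap = math.dist(
            (prev["end_x"], prev["end_y"]), (nxt["start_x"], nxt["start_y"])
        )
        if prev["team"] == nxt["team"] and gap >= MIN_DRIBBLE_LENGTH:
            out.append({
                "team": nxt["team"], "player": nxt["player"],
                "type_name": "dribble", "result_name": "success",
                "start_x": prev["end_x"], "start_y": prev["end_y"],
                "end_x": nxt["start_x"], "end_y": nxt["start_y"],
            })
    if actions:
        out.append(actions[-1])
    return out


# Made-up coordinates: a failed cross whose recorded end point is far from the header.
actions = [
    {"team": "Arsenal", "player": "Trossard", "type_name": "cross",
     "result_name": "fail", "start_x": 95.0, "start_y": 10.0,
     "end_x": 96.0, "end_y": 14.0},
    {"team": "Arsenal", "player": "Martinelli", "type_name": "shot",
     "result_name": "success", "start_x": 99.0, "start_y": 36.0,
     "end_x": 105.0, "end_y": 34.0},
]
converted = add_inferred_dribbles(actions)
print([a["type_name"] for a in converted])
```

Because the heuristic only looks at locations and team, it cannot tell a ball that travelled through the air from a player who carried it, so a failed cross followed by a header ends up with a dribble in between.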
I have attached an Excel file that shows the raw event vs. SPADL data for a Martinelli goal for Arsenal vs Chelsea. As mentioned earlier, SPADL registers it as an 80m clearance by Mustafi, followed immediately by a goal by Martinelli. However, the event data correctly shows a shorter clearance, a recovery by Martinelli in his own half, and finally a goal a few seconds later.
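One way to surface similar cases without checking every goal by hand is a plausibility filter over the converted actions. The sketch below flags clearances that travel implausibly far. The column names follow the SPADL convention (`start_x`, `start_y`, `end_x`, `end_y`, `type_name`), but the helper name, the 50 m cut-off and the sample rows are assumptions for illustration.

```python
import math


def flag_long_actions(actions, type_name="clearance", max_length=50.0):
    """Return (action, length in metres) pairs for actions of the given type
    that cover more than max_length metres."""
    flagged = []
    for action in actions:
        length = math.dist(
            (action["start_x"], action["start_y"]),
            (action["end_x"], action["end_y"]),
        )
        if action["type_name"] == type_name and length > max_length:
            flagged.append((action, round(length, 1)))
    return flagged


# Made-up rows: one suspicious clearance and one ordinary clearance.
sample = [
    {"player": "Mustafi", "type_name": "clearance",
     "start_x": 10.0, "start_y": 30.0, "end_x": 90.0, "end_y": 34.0},
    {"player": "Mustafi", "type_name": "clearance",
     "start_x": 12.0, "start_y": 30.0, "end_x": 35.0, "end_y": 40.0},
]
flagged = flag_long_actions(sample)
for action, length in flagged:
    print(action["player"], action["type_name"], length)
```

Flagged rows can then be compared against the raw events for the same match, as in the attached spreadsheet.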
Over the next few days, I can try to run this SPADL conversion on StatsBomb open data to see if the same errors are seen. I'll get back to you on this.
raw_vs_spadl_data.xlsx
@probberechts commented on GitHub (Aug 22, 2023):
I see, then it is probably related to this bug: https://github.com/ML-KULeuven/socceraction/issues/519
@probberechts commented on GitHub (Sep 10, 2023):
These edge cases were resolved in https://github.com/ML-KULeuven/socceraction/pull/585. You should upgrade socceraction to v1.4.2 to get these fixes.
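A quick way to check whether the installed socceraction is at least v1.4.2 is to read the package metadata. This is a minimal sketch: `parse_version` and `has_spadl_fixes` are hypothetical helpers written for this example, and the parser only handles plain dotted version numbers.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2); a suffix such as 'rc1' is ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = "".join(takewhile(str.isdigit, piece))
        parts.append(int(digits or 0))
    return tuple(parts)


def has_spadl_fixes(installed, required="1.4.2"):
    """True if the installed version is at least the required one."""
    return parse_version(installed) >= parse_version(required)


try:
    installed = version("socceraction")
    print(f"socceraction {installed}: fixes included = {has_spadl_fixes(installed)}")
except PackageNotFoundError:
    print("socceraction is not installed")
```

If the check reports an older version, upgrading the package with your usual installer should pull in the fixes.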
The SPADL representation of Trossard's cross now looks like this:

The SPADL representation of Martinelli's goal now looks like this:

Thanks for flagging this!
@ksbharaj commented on GitHub (Sep 10, 2023):
Thank you very much!
I wasn't able to replicate the result locally due to the use of an "is" rather than "==" in one of the methods you created. However, I have created a pull request to address this.
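For readers unfamiliar with the distinction: in Python, `is` tests whether two names refer to the same object, while `==` tests whether two values are equal. The sketch below is generic and does not reproduce the actual socceraction code; the helper names and sample values are made up.

```python
def matches_by_identity(value, expected):
    """True only if both names point at the very same object."""
    return value is expected


def matches_by_value(value, expected):
    """True if the two objects compare equal, even when they are distinct objects."""
    return value == expected


# Two equal lists built separately are always distinct objects.
recorded = ["dribble", "success"]
expected = ["dribble", "success"]

print(matches_by_value(recorded, expected))     # True
print(matches_by_identity(recorded, expected))  # False
```

With small integers or short strings, an interpreter may reuse one cached object, so an `is` comparison can appear to work in one environment and fail in another. That may be why a result is hard to replicate across machines, and it is the reason value comparisons should use `==`.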