Football (soccer in the US) is one of the most popular sports worldwide, capable of drawing millions of enthusiasts to a single game in the top leagues: millions of pairs of eyes fixated on the same images of 22 players fighting for possession of a ball.
Of course, that isn’t all there is to watching a football game, and if we consider how much data we can extract from a single match, we might get a hint as to why.
Soccer analytics does just that, and in this article I’d like to share my experience tackling a sub-problem of the field: extracting as much knowledge as possible from a video stream of a football match recorded by a single broadcast-style camera.
Also, make sure to take a look at the work of my friend @matteoronchetti, with whom I collaborated throughout this fun and challenging project!
The problem itself is rather ill-posed: extracting positional and semantic information from a single moving camera with sudden changes of perspective is not an easy framing. You could simplify the task by placing multiple fixed cameras around the field, but given obvious budget and permission constraints, you probably wouldn’t be allowed to do that in an actual stadium.
Nonetheless, there are multiple ways to process (at least approximately) such video data on a budget and without leaving your comfy chair.
We approached the task like any good textbook software engineer would: we decomposed the problem into smaller, more manageable, and more specific ones.
We came up with the following division:
- Reference system and homography estimation (how to project players’ position from camera-view to a 2D plane).
- Object detection (aka what and where are the players/ball/referee).
- Object tracking (aka how do I track entities across frames).
- Player Identification (aka how do I recognize the players across frames).
- Team Recognition (how do I figure out which team a player plays for).
Let us begin with the overall architecture of the system before going into the details of each task, in a “positional to semantic” order.
We get a sequence of frames as input and process each of them sequentially, running object detection (field and entities); once we have a series of nearly-consecutive detections, we can start tracking each entity. At the same time, we estimate the position of the field with respect to the camera and project the position of each entity from frame to pitch coordinates. Finally, we keep track of each player by identifying them and assigning them to a team.
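In code, the per-frame loop can be sketched roughly as follows. All function names here are hypothetical stand-ins for the components described in the rest of this article, not the actual implementation:

```python
from dataclasses import dataclass

@dataclass
class FrameResult:
    detections: list       # bounding boxes from the detector
    pitch_positions: list  # detections projected to pitch coordinates

def detect_entities(frame):
    # stand-in for the YOLO detection step
    return [(10, 20), (30, 40)]

def project_to_pitch(detections):
    # stand-in for the homography projection step
    return [(x / 10.0, y / 10.0) for x, y in detections]

def process_video(frames):
    results = []
    for frame in frames:
        detections = detect_entities(frame)
        positions = project_to_pitch(detections)
        # tracking, identification and team assignment would also run here
        results.append(FrameResult(detections, positions))
    return results

results = process_video(range(3))  # three dummy "frames"
```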
Overall architecture of the system. Image by the author.
Then we simply repeat this frame by frame until the end of the video. At that point, we enter a phase we called smoothing, in which we grant ourselves the possibility of looking back at all the knowledge extracted frame by frame so far and “making backward adjustments” so that trajectories and detections become more coherent throughout the whole sequence.
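As a toy illustration of what such a backward adjustment might look like (the actual smoothing logic is more involved than this), a centered moving average over a tracked coordinate already removes much of the frame-to-frame jitter:

```python
import numpy as np

def smooth_trajectory(xs, window=5):
    """Centered moving average over one coordinate of a trajectory."""
    kernel = np.ones(window) / window
    return np.convolve(xs, kernel, mode="same")

noisy_x = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
smoothed = smooth_trajectory(noisy_x)
# interior values hover around the mean instead of jumping between 0 and 1
```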
Let us now try to follow step by step what happens inside the system from the moment a frame is fed into it.
The first thing you might notice when approaching a problem like this from an ML perspective is that it is very hard to find labeled data of decent quality. Therefore, it is time to throw one of the most famous object detectors there is at the problem: YOLOv3.
YOLO net being fed a frame of the game with a sliding window approach. Image by the author.
You’ll soon find out that simply cropping the frame and expecting the pre-trained net to deliver good results just won’t do. Since we prioritized accuracy over speed here, we fed YOLO the original-resolution image using a sliding window, making the network process the whole frame piece by piece. The results you obtain this way are far better, allowing you to detect the players, the referee, and the ball consistently.
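A minimal version of that sliding-window pass might look like this. The tile size and stride are illustrative choices (416×416 happens to be YOLOv3’s default input size), and overlapping strides help avoid cutting players in half at tile borders:

```python
import numpy as np

def sliding_windows(frame, win=416, stride=208):
    """Yield (x, y, crop) tiles covering the frame with 50% overlap."""
    h, w = frame.shape[:2]
    for y in range(0, max(h - win, 0) + 1, stride):
        for x in range(0, max(w - win, 0) + 1, stride):
            yield x, y, frame[y:y + win, x:x + win]

frame = np.zeros((832, 1248, 3), dtype=np.uint8)  # dummy full-resolution frame
tiles = list(sliding_windows(frame))
# each crop goes through YOLO; detections are shifted back by (x, y) into
# frame coordinates, then merged (e.g. with non-maximum suppression)
```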
Yeah, the COCO dataset (the one YOLO is pre-trained on) has a “sports ball” class, in case you’re wondering.
As a brief consideration, we built almost the entire stack of the system on the assumption of (pretty) reliable detections, so accuracy was the top priority here. Luckily, this is probably the task that allows you the most freedom, as you have a wide range of options for obtaining a super reliable and efficient detector; the sky’s the limit. You might train your own lightweight detector from scratch, distill a pre-trained net, leverage spatial contiguity, maximize parallelization… We were heavily time-constrained, so we barely scratched the surface of these options.
This was probably the most challenging task of all, as it requires estimating the homography needed to project players’ positions from a frame-relative to a pitch-absolute coordinate system.
Pre-masked image of the pitch. Image by the author.
To do so, we masked the frame as depicted in the picture, removed all objects detected in the previous step, and matched the current view of the pitch against a pre-computed set of pitch images generated from a simple model of the field, rendered under different rotations and translations. To make the matching efficient, we built an index over this set and treated each input frame as a “query”.
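As a sketch of that “index as query” idea (the details of the actual index are omitted here), a brute-force nearest-neighbor search over flattened pitch masks already captures the mechanism; the poses and templates below are synthetic placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical pre-rendered binary pitch masks, one per candidate camera pose
poses = [{"rot": r, "tx": t} for r in (-5, 0, 5) for t in (0, 10)]
templates = rng.integers(0, 2, size=(len(poses), 64 * 64)).astype(np.float32)

def best_pose(query_mask):
    """Return the candidate pose whose mask is closest (L2) to the query."""
    dists = np.linalg.norm(templates - query_mask.ravel(), axis=1)
    return poses[int(np.argmin(dists))]

query = templates[3].reshape(64, 64)  # a query that exactly matches pose 3
print(best_pose(query))
```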
Projection of the players seen in the picture to a 2D rendered field, using a simple Flask server. Image by the author.
Finding the right match for the field lets you build the homography matrix, which takes the detection boxes you get from YOLO and projects them onto the pitch. To build the matrix, though, you also need to know where you’re looking at the picture from, i.e. the 3D coordinates of the camera. This is very hard to estimate in real time, but luckily each stadium has a more or less fixed position from which games are broadcast (mind that the camera can still move almost freely parallel to the field).
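Once the matrix is available, applying it is straightforward. Here is a minimal sketch with a toy homography; projecting the bottom-center of each box (the player’s feet) is a common convention, though the exact choice in our pipeline is not spelled out here:

```python
import numpy as np

def project_points(H, points):
    """Apply a 3x3 homography to Nx2 frame coordinates (homogeneous divide)."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

# toy homography: scale by 2, then translate; a real one also encodes perspective
H = np.array([[2.0, 0.0, 5.0],
              [0.0, 2.0, -3.0],
              [0.0, 0.0, 1.0]])
feet = [(100, 200), (50, 80)]  # bottom-center of two YOLO boxes, in pixels
print(project_points(H, feet))
```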
So far we have applied the same steps to each frame with no temporal connection. But what if we wanted to maintain the detection state through time, so that we could recognize each player’s trajectory and understand that consecutive detections refer to the same entity, while staying robust to occasional detector failures?
For that we need a tracker (a multi-object tracker, to be precise), and we opted to implement it around a Kalman filter.
Min-weight matching in bipartite graph, assign observations to predictions. Image by the author.
Player tracking with positional information. GIF by author.
The tracker is entirely position-based (it never sees the frame itself) and receives as observations the detection positions coming from YOLO, which are matched to the tracked objects (Kalman predictions) by formulating the assignment problem as minimum-weight matching in a bipartite graph. After that, a simple tracking logic defines each tracked object’s lifecycle (active/inactive) and updates the Kalman states with the observations.
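That assignment step can be sketched with `scipy.optimize.linear_sum_assignment`, which solves minimum-weight bipartite matching on a cost matrix; the Euclidean cost and the distance gate below are illustrative choices, not necessarily the ones we used:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match(predictions, observations, max_dist=50.0):
    """Match Kalman predictions to detections via min-weight bipartite matching."""
    cost = np.linalg.norm(
        predictions[:, None, :] - observations[None, :, :], axis=2
    )
    rows, cols = linear_sum_assignment(cost)
    # drop pairs that are too far apart: likely a new entity or a lost track
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]

preds = np.array([[0.0, 0.0], [100.0, 100.0]])  # predicted positions
obs = np.array([[98.0, 102.0], [1.0, -1.0]])    # detections this frame
print(match(preds, obs))  # [(0, 1), (1, 0)]
```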
This was one of those occasions in which a simple yet well-crafted implementation from scratch proves much more effective than any off-the-shelf solution you might experiment with.
From house numbers to jersey numbers, they can be made to be quite similar. Image by the author.
We trained a CNN to recognize the numbers on the backs of jerseys with pretty good accuracy and skew tolerance by augmenting the popular Street View House Numbers (SVHN) dataset. As mentioned above, finding decent, accessible labeled images of football games proved way too hard, so we had to get creative; we believe this is a nice example of applied domain adaptation.
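As an illustrative example of the kind of augmentation involved, a simple horizontal shear approximates the skew of numbers on a running player’s back; the real pipeline would combine several such transforms:

```python
import numpy as np

def shear_horizontal(img, k=0.3):
    """Shift each row horizontally in proportion to its distance from center."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    for y in range(h):
        dx = int(round(k * (y - h / 2)))
        for x in range(w):
            src = x - dx
            if 0 <= src < w:
                out[y, x] = img[y, src]
    return out

digit = np.eye(8, dtype=np.uint8)  # toy 8x8 "digit" image
skewed = shear_horizontal(digit)
```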
Once you can detect a player’s number, you can assign them an ID and leverage the same tracking logic described above to make the ID assignment more resilient (along with the smoothing mentioned earlier).
Also, keep in mind that the distribution of possible numbers can be further constrained once you know the player’s team (up next) or have additional external information (like team line-ups and rosters).
You have two teams of players wearing identical shirts (plus the referee) on the pitch: that seems like the perfect setup for the good old K-means algorithm.
You might go for something fancier here and move everything to the spectral domain, but you can already get good results by clustering the two teams in HSV color space using the players’ bounding boxes, finishing up with tf-idf weighting to filter out the green of the field from each bounding box.
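A minimal sketch of that clustering step: the features below are synthetic HSV-like triplets standing in for per-box color statistics after the field green has been filtered out (note that hue is circular, which a real implementation should account for):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# synthetic per-player color features: mean (h, s, v) inside each bounding box
red_kits = rng.normal([0.00, 0.8, 0.6], 0.05, size=(10, 3))
blue_kits = rng.normal([0.60, 0.7, 0.5], 0.05, size=(10, 3))
features = np.vstack([red_kits, blue_kits])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)
# players 0-9 land in one cluster and players 10-19 in the other
```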
Result of the team recognition algorithm. Image by the author.
P.S.: the referee can be detected as a special “outlier” relative to the two team clusters by applying any outlier-detection technique; DBSCAN, for instance, yielded good results.
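Sketching that outlier detection with DBSCAN, again on synthetic color features: the referee’s kit sits far from the team cluster, so DBSCAN marks it as noise (label -1). The `eps` and `min_samples` values here are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
team = rng.normal([0.0, 0.8, 0.6], 0.02, size=(10, 3))  # tight team cluster
referee = np.array([[0.15, 0.2, 0.1]])                  # distinct kit color
colors = np.vstack([team, referee])

labels = DBSCAN(eps=0.15, min_samples=3).fit_predict(colors)
print(labels[-1])  # -1: the referee is flagged as noise
```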
We close this post with a few visual results that will hopefully summarise all the steps described here. Unfortunately, we couldn’t cover every single detail in depth, but we hope you were able to follow along, get an intuition for the approach, and possibly enjoy the process.
For any suggestions, questions or just to have a chat, do not hesitate to contact me at email@example.com or at @nick_lucche on Twitter.
Thanks for reading and good luck on the pitch!