Visualize Soccer Data using Mplsoccer in Python

Football analytics for everyone.


Football analytics has become a trend in recent years. Many football clubs now recruit data scientists to join their staff. Even the BBC ran a headline calling data experts football’s best signings [1].

With that demand and exposure, more people are getting into football analytics. Plenty of open-source tools and data are available for getting started in this field. Mplsoccer is one such tool for creating visualizations of football data [2].

In this article, I will show you how to implement those visualizations by using libraries like mplsoccer and matplotlib. Without further ado, let’s get started!

About the library

Mplsoccer is a Python library for drawing soccer charts. Visualizing soccer data in Python is not as straightforward as drawing a scatter plot or a histogram.

Thanks to this library, we can draw soccer charts directly from the available data. Mplsoccer can create radar charts, heat maps, shot maps, and many more.

The library can also load StatsBomb data. In this article, we will not use those loading functions; instead, we will load the data from scratch.

Installing the library is simple. All you need is a pip command like this:

! pip install mplsoccer

After running the command, you can use the library to visualize the soccer data.

Load the data

Before we can visualize the data, we need to access it. In this article, we will use data from StatsBomb, which you can access through StatsBomb’s open-data repository on GitHub.

Unlike many other datasets, soccer data, and StatsBomb data in particular, is quite challenging to access and use.

There are three steps we need to take: finding the competition ID, finding the match ID, and finally loading the messy JSON file. So let’s get into it.

We need to reach the event data for the 2005 UEFA Champions League final between Liverpool and AC Milan.

Because the events folder contains many files, each named by match ID, we need to open the competitions.json file first.

We load the file as a data frame and keep the rows whose competition name is Champions League. Here is the code for doing that:

import pandas as pd

competitions = pd.read_json('open-data/data/competitions.json')
competitions[competitions.competition_name == 'Champions League']

Here is the preview of the data frame:

As you can see from the data frame, each row lists a season along with its season ID and competition ID. The game between Liverpool and AC Milan was played in 2005. Therefore, we take the competition ID of 16 and the season ID of 37.

Because a competition can contain a large number of matches, we need to look up the match ID for the game we want. To do that, you can run these lines of code:
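The same lookup can be done in code rather than by reading the table. The sketch below is only an illustration: the small data frame is a made-up stand-in for competitions.json, the `season_name` values are assumed to look like "2004/2005", and `find_ids` is a hypothetical helper, not part of pandas or mplsoccer.

```python
import pandas as pd

# Made-up stand-in for a few rows of competitions.json (not the real file)
competitions_demo = pd.DataFrame({
    "competition_id": [16, 16, 43],
    "season_id": [37, 4, 3],
    "competition_name": ["Champions League", "Champions League", "FIFA World Cup"],
    "season_name": ["2004/2005", "2018/2019", "2018"],
})

def find_ids(frame, competition, season):
    """Hypothetical helper: return (competition_id, season_id) for one season."""
    match = frame[(frame.competition_name == competition)
                  & (frame.season_name == season)]
    if match.empty:
        raise ValueError(f"No entry for {competition} {season}")
    row = match.iloc[0]
    return int(row.competition_id), int(row.season_id)

ids = find_ids(competitions_demo, "Champions League", "2004/2005")
print(ids)
```

With the real file, you would pass the data frame returned by pd.read_json instead of the stand-in.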

import json

with open('open-data/data/matches/16/37.json') as f:
    data = json.load(f)

for i in data:
    print(i['match_id'], i['home_team']['home_team_name'],
          i['home_score'], "-", i['away_score'],
          i['away_team']['away_team_name'])

That code returns only one match, the final, which is the only match StatsBomb provides for this season. Its match ID is 2302764. With that ID, we can access the event data for analyzing the game.

Like the competitions.json file, the event data is stored as JSON, and its fields are nested.

At first, it seems challenging to load such a file as a data frame. But we don’t have to worry, because the pandas library provides a function called json_normalize that flattens nested fields into columns.

Here is the code for doing that:

with open('open-data/data/events/2302764.json') as f:
    data = json.load(f)

df = pd.json_normalize(data, sep="_")
df.head()
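To see where column names such as type_name and team_name come from, here is a small sketch of json_normalize on two hand-made records. The records only imitate the nested shape of an event and are not real StatsBomb data.

```python
import pandas as pd

# Hand-made records imitating the nested shape of an event (not real data)
events_demo = [
    {"id": "a1", "type": {"id": 16, "name": "Shot"},
     "team": {"id": 1, "name": "AC Milan"}, "location": [100.0, 40.0]},
    {"id": "b2", "type": {"id": 17, "name": "Pressure"},
     "team": {"id": 2, "name": "Liverpool"}, "location": [55.0, 20.0]},
]

# Nested dictionaries become columns joined by the separator;
# lists such as 'location' stay as a single list-valued column
flat = pd.json_normalize(events_demo, sep="_")
print(sorted(flat.columns))
```

Each nested key is joined to its parent with the separator, which is why the later code filters on type_name and team_name, and why location still has to be split into x and y by hand.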

From this data frame, we can now create any visualization we like. For convenience, let’s divide the data into the first and second half. Here is the code for doing that:

first_half = df.loc[:1808, :]
second_half = df.loc[1809:3551, :]
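The row numbers above are specific to this match. As an alternative sketch, assuming the events carry a `period` column (1 for the first half, 2 for the second), the split can be written without hard-coded row labels. The toy table below is made up for illustration.

```python
import pandas as pd

# Toy events table; 'period' is assumed to be 1 for the first half, 2 for the second
df_demo = pd.DataFrame({
    "period": [1, 1, 2, 2, 2],
    "type_name": ["Pass", "Shot", "Pass", "Pressure", "Shot"],
})

# .copy() gives independent frames, so adding columns later does not warn
first_half_demo = df_demo[df_demo.period == 1].copy()
second_half_demo = df_demo[df_demo.period == 2].copy()
print(len(first_half_demo), len(second_half_demo))
```

Filtering on a column like this keeps working if you switch to another match, where the halves start and end at different rows.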

After getting the data, let’s create some visualizations from it. The first visualization that I want to show you is the shot map. But before doing that, let’s find out how to create the pitch first.

Drawing the pitch is an important step in visualizing football data. Before mplsoccer existed, people drew their soccer charts by hand, which was challenging because we had to paint every line ourselves.

As a result, visualizing soccer data was not for everyone until the mplsoccer library came along. To draw the pitch, all we have to do is add these lines of code:

from mplsoccer import Pitch

pitch = Pitch(pitch_type='statsbomb')
pitch.draw()

Here is the preview of the result:

We don’t have to add lines or specify the length of the pitch. All you need is an object, and boom, there you have it.

Because we want to visualize a shot map, we need a vertically oriented half pitch. To create it, modify the previous code like this:

from mplsoccer import VerticalPitch

pitch = VerticalPitch(pitch_type='statsbomb', half=True)
pitch.draw()

Here is the preview of the result:

Isn’t that simple?! Now let’s create the shot map.

Before creating the visualization, we need to prepare data that describes each shot: its location, which team and player took it, and how likely it was to become a goal.

Let’s prepare a dataframe that contains shots from AC Milan in the first half. Here is the code for doing that:

# Retrieve rows that record shots
shots = first_half[first_half.type_name == 'Shot']
# Filter the data that record AC Milan
shots = shots[shots.team_name == 'AC Milan']
# Select the columns
shots = shots[['team_name', 'player_name', 'minute', 'second', 'location', 'shot_statsbomb_xg', 'shot_outcome_name']].copy()
# The location is stored as a list (e.g. [100, 80]), so we extract the x and y coordinates with apply.
shots['x'] = shots.location.apply(lambda x: x[0])
shots['y'] = shots.location.apply(lambda x: x[1])
shots = shots.drop('location', axis=1)
# Divide the dataset based on the outcome
goals = shots[shots.shot_outcome_name == 'Goal']
shots = shots[shots.shot_outcome_name != 'Goal']
shots.head()

Here is the preview of the data:

After we have the data, the next step is to create the visualization. Let’s build the pitch first. Here is the code for doing that:

from mplsoccer import VerticalPitch

pitch = VerticalPitch(pitch_type='statsbomb', half=True, goal_type='box',
                      goal_alpha=0.8, pitch_color='#22312b', line_color='#c7d5cc')
fig, axs = pitch.grid(figheight=10, title_height=0.08, endnote_space=0, axis=False,
                      title_space=0, grid_height=0.82, endnote_height=0.05)
fig.set_facecolor("#22312b")

After that, let’s add the shot points. Add these lines of code below:

# Circles for shots that did not score
scatter_shots = pitch.scatter(shots.x, shots.y, s=(shots.shot_statsbomb_xg * 900) + 100,
                              c='red', edgecolors='black', marker='o', ax=axs['pitch'])
# Stars for goals
scatter_goals = pitch.scatter(goals.x, goals.y, s=(goals.shot_statsbomb_xg * 900) + 100,
                              c='red', edgecolors='black', marker='*', ax=axs['pitch'])
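The s argument ties each marker’s size to the shot’s expected goals (xG), so likelier chances appear larger. The sketch below isolates that formula; `xg_to_marker_size` is a hypothetical helper and the xG values are made up.

```python
import pandas as pd

def xg_to_marker_size(xg, scale=900, base=100):
    """Hypothetical helper: the size formula from the scatter calls.

    'base' keeps very unlikely shots visible; 'scale' controls how fast
    the marker grows with xG.
    """
    return xg * scale + base

xg_demo = pd.Series([0.02, 0.10, 0.50])  # made-up xG values
sizes = xg_to_marker_size(xg_demo)
print(sizes.tolist())
```

Changing the scale or base is a matter of taste; the point is that the ordering of the sizes follows the ordering of the xG values.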

After adding the points, let’s add the text that describes the visualization itself. Add these lines of code below:

axs['endnote'].text(0.85, 0.5, '[YOUR NAME]', color='#c7d5cc', va='center', ha='center', fontsize=15)
axs['title'].text(0.5, 0.7, 'The Shots Map from AC Milan', color='#c7d5cc', va='center', ha='center', fontsize=30)
# Double quotes because the text contains an apostrophe
axs['title'].text(0.5, 0.25, "The Game's First Half", color='#c7d5cc', va='center', ha='center', fontsize=18)

Finally, we add an arrow to indicate the attacking direction. Add this line of code:

pitch.arrows(70, 5, 100, 5, ax=axs['pitch'], color='#c7d5cc')

The complete code looks like this:

import matplotlib.pyplot as plt
from mplsoccer import VerticalPitch

# Build the pitch and the figure layout
pitch = VerticalPitch(pitch_type='statsbomb', half=True, goal_type='box',
                      goal_alpha=0.8, pitch_color='#22312b', line_color='#c7d5cc')
fig, axs = pitch.grid(figheight=10, title_height=0.08, endnote_space=0, axis=False,
                      title_space=0, grid_height=0.82, endnote_height=0.05)
fig.set_facecolor("#22312b")

# Plot the shots (circles) and the goals (stars), sized by xG
scatter_shots = pitch.scatter(shots.x, shots.y, s=(shots.shot_statsbomb_xg * 900) + 100,
                              c='red', edgecolors='black', marker='o', ax=axs['pitch'])
scatter_goals = pitch.scatter(goals.x, goals.y, s=(goals.shot_statsbomb_xg * 900) + 100,
                              c='red', edgecolors='black', marker='*', ax=axs['pitch'])

# Arrow showing the attacking direction
pitch.arrows(70, 5, 100, 5, ax=axs['pitch'], color='#c7d5cc')

# Title and endnote text
axs['endnote'].text(0.85, 0.5, '[YOUR NAME]', color='#c7d5cc', va='center', ha='center', fontsize=15)
axs['title'].text(0.5, 0.7, 'The Shots Map from AC Milan', color='#c7d5cc', va='center', ha='center', fontsize=30)
axs['title'].text(0.5, 0.25, "The Game's First Half", color='#c7d5cc', va='center', ha='center', fontsize=18)

plt.show()

In the end, the visualization of the shots will look like this:

Pressure Heat Map

The second visualization I want to show you is the pressure heat map. This heat map represents how often a team applied pressure at each location. The more pressure events at a location, the brighter its color.

Generating the heat map is much like creating the previous shot map. The only difference is that we visualize a statistical summary on the pitch. But before doing that, we prepare the data first. Here is the code for doing that:

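# Build the mask from first_half itself so its index matches the rows being filtered
pressure = first_half[first_half.type_name == 'Pressure']
pressure = pressure[['team_name', 'player_name', 'location']]
pressure = pressure[pressure.team_name == 'AC Milan'].copy()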
pressure['x'] = pressure.location.apply(lambda x: x[0])
pressure['y'] = pressure.location.apply(lambda x: x[1])
pressure = pressure.drop('location', axis=1)
pressure.head()

Here is the preview of the data:

Now let’s create the chart. The code looks like this:

from scipy.ndimage import gaussian_filter
import matplotlib.pyplot as plt
from mplsoccer import Pitch

# Build the pitch and the figure layout
pitch = Pitch(pitch_type='statsbomb', line_zorder=2, pitch_color='#22312b', line_color='#efefef')
fig, axs = pitch.grid(figheight=10, title_height=0.08, endnote_space=0, axis=False,
                      title_space=0, grid_height=0.82, endnote_height=0.05)
fig.set_facecolor('#22312b')

# Count pressure events in a 25 x 25 grid, then smooth the counts
bin_statistic = pitch.bin_statistic(pressure.x, pressure.y, statistic='count', bins=(25, 25))
bin_statistic['statistic'] = gaussian_filter(bin_statistic['statistic'], 1)
pcm = pitch.heatmap(bin_statistic, ax=axs['pitch'], cmap='hot', edgecolors='#22312b')

# Color bar styled to match the dark background
cbar = fig.colorbar(pcm, ax=axs['pitch'], shrink=0.6)
cbar.outline.set_edgecolor('#efefef')
cbar.ax.yaxis.set_tick_params(color='#efefef')
plt.setp(plt.getp(cbar.ax.axes, 'yticklabels'), color='#efefef')

# Endnote with the attacking direction
axs['endnote'].text(0.8, 0.5, '[YOUR NAME]', color='#c7d5cc', va='center', ha='center', fontsize=10)
axs['endnote'].text(0.4, 0.95, 'Attacking Direction', va='center', ha='center', color='#c7d5cc', fontsize=12)
axs['endnote'].arrow(0.3, 0.6, 0.2, 0, head_width=0.2, head_length=0.025, ec='w', fc='w')
axs['endnote'].set_xlim(0, 1)
axs['endnote'].set_ylim(0, 1)

# Title text (double quotes because the text contains an apostrophe)
axs['title'].text(0.5, 0.7, "The Pressure's Heat Map from AC Milan", color='#c7d5cc', va='center', ha='center', fontsize=30)
axs['title'].text(0.5, 0.25, "The Game's First Half", color='#c7d5cc', va='center', ha='center', fontsize=18)

plt.show()

Can you see the difference between this code and the previous one? Almost nothing!

The exception is that we count the pressure events in a 25 by 25 grid with bin_statistic and smooth those counts with the gaussian_filter function to get AC Milan’s pressure distribution in the first half. We then draw the heat map from that result.
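To make the counting step concrete, here is a rough NumPy sketch of binning points on a 120 by 80 pitch into a 25 by 25 grid. It illustrates the idea only and is not how mplsoccer implements bin_statistic; the locations are randomly generated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up pressure locations on a 120 x 80 StatsBomb-style pitch
x_demo = rng.uniform(0, 120, size=200)
y_demo = rng.uniform(0, 80, size=200)

# Count how many points fall into each cell of a 25 x 25 grid
counts, x_edges, y_edges = np.histogram2d(
    x_demo, y_demo, bins=(25, 25), range=[[0, 120], [0, 80]]
)
print(counts.shape, int(counts.sum()))
```

Raw counts like these tend to look blocky when drawn, which is why the article’s code smooths them with gaussian_filter before plotting.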

Here is the final result of the visualization:

Well done! You have learned how to create visualizations of soccer data using mplsoccer in Python.

I hope you learned something new, and that this guides you in analyzing matches, especially with the StatsBomb data. You can read more about the mplsoccer library in its documentation [2].

Thank you for reading my article!

References

[1] BBC. Data experts are becoming football’s best signings. https://www.bbc.com/news/business-56164159
[2] Mplsoccer Documentation. https://mplsoccer.readthedocs.io/en/latest/index.html
