02 Jun '21

Money (Foot)Ball – how will our virtual football team selected entirely by Machine Learning compete in the big leagues?

Introduction

At DTSQUARED, one of our recent graduates, as part of their training, showed a nearly threefold increase in accuracy when predicting which US counties would develop Covid-19 cases by using AutoML on census data. AutoML stands for Automated Machine Learning, and its tools allow users without a coding background to access the power of Machine Learning with relative ease. Innovations in emerging areas like this are at the heart of the ever-changing landscape of data strategy, and this real-life application made the possibilities even more fascinating. Given this, we were keen to explore where else AutoML could successfully be applied and the opportunities that might arise from a solution with such potential.

To consider this in more detail, we needed a new, more complicated AutoML project, so we decided to see if we could apply it to the Fantasy Premier League (FPL) game.  Could a virtual football team selected entirely by Machine Learning compete in our highly competitive in-house league?

For those not familiar with the FPL concept, all competitors select a virtual Premier League squad comprising 15 players (from the 713 available) and then, each week, select a team of just 11 players from their squad of 15.  Points are awarded to the virtual team based upon the performance of the actual players in the Premier League games played each week (e.g. goals scored, assists, clean sheets, penalties saved etc.). This real-life league gave us the perfect data-led experiment to really test our AutoML skills, with a healthy competitive edge against our DTSQUARED colleagues playing the league traditionally!

In this blog we outline the approach and tooling used and share the very promising results achieved over the last two months of the season.  We then use our ML model to select a possible England squad for this summer’s Euro ’21 tournament and finally, for a bit of fun, we see how our squad compares with the actual squad just announced by Gareth Southgate.

The Data

Machine Learning needs good data. The official Fantasy Premier League provides an API giving access to player and game data, and a GitHub repository managed by Vaastav makes the data available in CSV format each week.
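As a rough illustration of working with weekly CSV data of this kind, the sketch below sums points per player across gameweeks and ranks the players. The column names (name, GW, total_points) and the sample rows are assumptions made up for illustration; they are not taken from the real files.

```python
import csv
import io
from collections import defaultdict

# Made-up sample in the shape of a per-gameweek CSV export.
# Column names and values are assumptions for illustration only.
SAMPLE_CSV = """name,GW,total_points
Player A,1,6
Player B,1,2
Player A,2,9
Player B,2,1
"""


def season_points(csv_text):
    """Sum total_points per player across all gameweek rows."""
    totals = defaultdict(int)
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["name"]] += int(row["total_points"])
    return dict(totals)


totals = season_points(SAMPLE_CSV)
# Rank players from highest to lowest season total.
ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)
```

In practice the CSV text would come from the downloaded files rather than an inline string, and the data would be loaded into the warehouse rather than held in memory.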

The Tools

At DTSQUARED we have a fast-growing Cloud Data Warehousing practice in which we specialise in Snowflake.  As a place to store the data, Snowflake was just perfect.

Snowflake has invested in both DataRobot and Dataiku since its IPO last September.  As we used DataRobot for our Covid-based ML work, we tried Dataiku for this analysis.  In addition to Machine Learning, Dataiku lets you carry out data cleansing and engineering through a very user-friendly UI and deploy the whole process at the click of a button, while all your source and target data resides in your cloud data warehouse.

The Process

Just before the start of April we used the raw, unimproved data for the season to date to train an ML model, selected our first team, and left it unchanged throughout the April fixtures. This team would have placed 2nd in our work FPL league for the month; perhaps the process was going to work.

With a few further improvements in the process (e.g. including the opposing team details for the next game, renaming and recalculating columns), back-tested on April data, we finalised a new ML model for the May fixtures.  Each week in May we used the ML results to make one substitution (replacing the squad member with the lowest points prediction with another player) and to choose a new captain, who scores double points.
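The weekly routine described above can be sketched roughly as follows. This is a simplified illustration, not our actual Dataiku flow: it assumes the incoming player is simply the best-predicted player not already in the squad, that the captain is whoever has the highest prediction, and it ignores position limits and the budget. The function name, player names and predicted values are all hypothetical.

```python
def weekly_changes(squad, pool, predicted):
    """Return (player_out, player_in, captain) for one gameweek.

    squad:     list of player names currently owned
    pool:      list of player names available to bring in
    predicted: dict mapping player name to predicted points
    """
    # Drop the squad member with the lowest predicted points.
    player_out = min(squad, key=lambda p: predicted[p])

    # Assumption: bring in the best-predicted player not already owned.
    candidates = [p for p in pool if p not in squad]
    player_in = max(candidates, key=lambda p: predicted[p])

    new_squad = [p for p in squad if p != player_out] + [player_in]

    # Assumption: captain the highest prediction, since those points double.
    captain = max(new_squad, key=lambda p: predicted[p])
    return player_out, player_in, captain


# Hypothetical predictions for a tiny three-player squad.
predicted = {"A": 5.1, "B": 2.0, "C": 3.4, "D": 6.2, "E": 1.5}
result = weekly_changes(["A", "B", "C"], ["D", "E"], predicted)
print(result)
```

A fuller version would restrict candidates to the same position as the outgoing player and to those affordable within the budget.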

For all the geeks out there, the best performing model type for this project was XGBoost.
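The back-testing on April data mentioned above amounts to a time-based split: train on earlier gameweeks, then measure the error on later ones. The sketch below shows the idea under stated assumptions. A simple per-player average stands in for the XGBoost model, and the function name, data layout and sample numbers are hypothetical.

```python
from collections import defaultdict


def backtest_mae(rows, cutoff_gw):
    """Train on gameweeks before cutoff_gw and score on the rest.

    rows: list of (player, gameweek, points) tuples
    Returns the mean absolute error on the held-out gameweeks.
    """
    history = defaultdict(list)
    for player, gw, points in rows:
        if gw < cutoff_gw:
            history[player].append(points)

    # Stand-in model: each player's average points in the training window.
    model = {p: sum(v) / len(v) for p, v in history.items()}

    # Score only players the stand-in model has seen in training.
    errors = [
        abs(model[player] - points)
        for player, gw, points in rows
        if gw >= cutoff_gw and player in model
    ]
    if not errors:
        raise ValueError("no held-out rows to score")
    return sum(errors) / len(errors)


# Made-up rows: two players over three gameweeks.
rows = [
    ("A", 1, 4), ("A", 2, 6), ("A", 3, 8),
    ("B", 1, 2), ("B", 2, 2), ("B", 3, 1),
]
mae = backtest_mae(rows, cutoff_gw=3)
print(mae)
```

Splitting by gameweek rather than at random keeps later results out of the training data, which is what makes the test resemble a real prediction.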

The Team

Our starting XI going into the first games in May is shown below:

[Image: our starting XI for the first games in May]

And the result?  We came 2nd again.  

(With April and May combined, our team fell short of the overall lead by just FOUR points!!)

Various factors made our job of winning just a little harder. Established teams in the league had risen in value and so had more than the £100m budget we were restricted to.  We were a little unlucky in that the cancelled Manchester United v Liverpool game hit us hard and we couldn’t field a full team.  We also didn’t use any of our ‘chips’ (wildcard, free hit, and bench boost) whereas the overall leader did. Plus, the overall leader is placed in the top 0.5% of over 8 million entrants globally, so just coming close to his score was an achievement in itself.

Data Driven England Squad

Our plan for the next steps had been to continue to refine the model and to enter a team into the league for the entirety of the next season. However, first, with the Euros approaching we realised there was another way to have a little fun with our AutoML.  If we predicted points scored per player regardless of the opposition, we would have a way to see who is playing well and should continue to play well.  If we shortlist this just to England players, we can pick our theoretical best team and squad for England in the competition.

We trained the model on the whole season’s data to see which players rank the highest and selected the squad based purely on each player’s performance and underlying stats over the whole season.

There are a few caveats to this team. We only have access to English Premier League data, but some notable English players are playing abroad. Therefore, we decided that Sancho and Bellingham (both won the German domestic cup this year with Borussia Dortmund) and Trippier (won the Spanish league with Atletico Madrid this year) would be selected automatically.

One final thing before we go into a comparison of the squads: we set a few rules for each position, listed below:

  • Goalkeepers: Only taking 3
  • Defenders: 4 centre backs and 3 full backs on each side
  • Midfielders: 6 players that play in central midfield
  • Forwards: 2 out-and-out strikers and 5 wingers
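The rules above, together with the automatic selections, can be expressed as a simple quota-based pick: automatic players take their slots first, then the remaining places in each role go to the highest-ranked players. The sketch below is illustrative only. The role codes, function name and toy data are assumptions, and it assumes automatic picks count towards their role's quota.

```python
# Quotas per role, following the position rules listed above.
# The role codes themselves are labels assumed for this sketch.
QUOTAS = {"GK": 3, "CB": 4, "LB": 3, "RB": 3, "CM": 6, "ST": 2, "W": 5}


def pick_squad(players, quotas, automatic=()):
    """Fill each role up to its quota.

    players:   list of (name, role, predicted_points) tuples
    quotas:    dict mapping role to number of places
    automatic: names selected regardless of predicted points
    Returns a dict mapping role to the list of selected names.
    """
    squad = {role: [] for role in quotas}

    # Automatic selections take up places in their role first.
    for name, role, _ in players:
        if name in automatic and role in squad:
            squad[role].append(name)

    # Remaining places go to the best-predicted players in each role.
    ranked = sorted(players, key=lambda p: p[2], reverse=True)
    for name, role, _ in ranked:
        if name in automatic or role not in squad:
            continue
        if len(squad[role]) < quotas[role]:
            squad[role].append(name)
    return squad


# Toy example with smaller quotas and made-up players.
toy_players = [
    ("Keeper 1", "GK", 4.0),
    ("Keeper 2", "GK", 5.0),
    ("Mid 1", "CM", 6.0),
    ("Mid 2", "CM", 3.0),
    ("Mid 3", "CM", 1.0),
]
toy_squad = pick_squad(toy_players, {"GK": 1, "CM": 2}, automatic={"Mid 3"})
print(toy_squad)
```

Here "Mid 3" is selected despite the lowest prediction because it is an automatic pick, which leaves only one open midfield place for the ranking to fill.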

The Squad & Comparison

Goalkeepers       Defenders                 Midfielders          Forwards
Sam Johnstone     Harry Maguire             Jude Bellingham      Harry Kane
Aaron Ramsdale    Tyrone Mings              James Ward-Prowse    Jadon Sancho
Alex McCarthy     Ben Chilwell              Mason Mount          Marcus Rashford
-                 Luke Shaw                 Declan Rice          Raheem Sterling
-                 Trent Alexander-Arnold    James Maddison       Phil Foden
-                 Kieran Trippier           Ashley Westwood      Jack Grealish
-                 Michael Keane             -                    Patrick Bamford
-                 Aaron Wan-Bissaka         -                    -
-                 Ezri Konsa                -                    -
-                 Aaron Cresswell           -                    -

At the time of writing, only a 33-man provisional squad has been announced, and maybe Gareth Southgate did use data to support some of his decisions, as the two squads are not too dissimilar. In the table above, the players in green are included in both the real squad and our data-driven one. Those highlighted in blue were selected in our squad but left out by Gareth Southgate completely.

Now to discuss the differences in more detail.

Goalkeepers: 

Nick Pope was the best performing goalkeeper according to our model but, with knee surgery imminent, he wasn’t included. Dean Henderson and Jordan Pickford, both of whom were included in the real England squad, were nowhere near the pace required when it came to our model.

Defenders: 

Trent is included in both squads. Interestingly, even after rough patches through the season, he was top of the list of English defenders and 2nd in the league overall, behind only Stuart Dallas, according to our model. I don’t think you need data to justify the inclusion of Cresswell. Gareth Southgate is showing his affection for right backs once again.

Midfielders: 

The biggest shock is in this position: Jordan Henderson is left out of our data-driven team! An injury-hampered season really affected him; however, James Maddison was also out for large periods and still made the cut. Ashley Westwood’s inclusion was a surprise and shows how much of an unsung hero he is for Burnley. Maybe he is the kind of player we need in those tougher games. Food for thought, Gareth?

Forwards: 

Patrick Bamford has been unfortunate not to make the real squad at all; he was clear of Greenwood, Watkins and Calvert-Lewin when it came to the data. Finally, even after missing half the season, Grealish still squeezes in. He’s just that good.

Over to you Gareth Southgate.

Conclusion

This project has demonstrated the impressive power of AutoML: using statistical analysis alone, a team of players picked by the model was able to beat all but the very best managers relying on human decision making.

As much as this was a fun side project, the process is applicable to so many ‘real world’ scenarios and, most importantly, can be carried out by anyone, even without extensive, granular expertise in Machine Learning and Artificial Intelligence. Going forward, at DTSQUARED we will be taking our learnings and continuing to grow our expertise in the AI/ML space, applying it to our projects and using it to enhance the value of our clients’ data.

Will we keep running this analysis to help with our own personal Fantasy Premier League teams? Of course, who wouldn’t?  And if England don’t bring it home this year, then maybe we’ll get in contact with Gareth to help out for the World Cup in 2022. Watch this space…

Thank you for reading, see you at the top end of the FPL table next year! 

If you would like to find out more, or speak to one of our experts about how Machine Learning and Artificial Intelligence can help you and your business, please click here.
