Coming Up With The Largest Public Clash Royale Dataset To Date (37.9M matches)

I am a long-time Clash Royale player and a software developer by profession, and I am slowly dipping my toes into data science and cloud computing (via Microsoft Azure). I searched for public Clash Royale datasets, but the ones I found were too small for my purposes, so I decided to create one for the whole community. Early this month, I signed up at https://developer.clashroyale.com/ and created a new account so I could pull the data and get insights from it.

The timing was also good, as it almost coincided with the Season 18 ladder reset, the release of the Mother Witch legendary card, and new balance changes.


At first, I explored using Azure Functions in Python (serverless). However, the developer website requires you to list the allowed IP addresses when you create an API key, and the Azure Function’s outbound IP addresses kept changing, even though I followed this article. Because I needed to tweak the function regularly, the IP address changed so often that I had to abandon this approach. It is definitely possible, but I did not want to spend too much time making it work.

Need To Supply IP Addresses For Security Token

I eventually decided to use an Azure Machine Learning compute that runs a Jupyter notebook. This compute is basically an Ubuntu VM with a constant IP address, which solved the issue I had with the Azure Function’s changing outbound IP address. As I am only using the monthly Azure free credits, I selected the cheapest VM, STANDARD_DS1_V2, which costs around USD 0.06 an hour. This is just a lightweight worker VM, so I had to tweak the pulling of match data to do it in batches, and I also put a time limit and a cap on the number of battles extracted per member. Doing it all in one fell swoop causes the Jupyter notebook to throw out-of-memory exceptions. There is also a risk of getting throttled by the Supercell servers, so I added some logic that pauses the execution after every X clans fetched.
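The batching and pausing described above could look roughly like the sketch below. It is not the notebook code from this project: `fetch_members`, `fetch_battles`, and `save_batch` are hypothetical callables standing in for the API calls and the storage step, and the batch size, battle cap, and pause length are assumed values, not the ones actually used.

```python
import time
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def pull_in_batches(
    clan_tags: List[str],
    fetch_members: Callable[[str], Iterable[str]],   # hypothetical: clan tag -> player tags
    fetch_battles: Callable[[str], Iterable[dict]],  # hypothetical: player tag -> battles
    save_batch: Callable[[List[dict]], None],        # hypothetical: persist one batch
    clans_per_batch: int = 50,         # assumed value for "X clans"
    max_battles_per_member: int = 25,  # assumed cap per member
    pause_seconds: float = 30.0,       # assumed pause length
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Fetch battles clan by clan, flushing and pausing after each batch."""
    total = 0
    for batch in batched(clan_tags, clans_per_batch):
        rows: List[dict] = []
        for clan_tag in batch:
            for player_tag in fetch_members(clan_tag):
                # Cap how many battles are kept per member.
                rows.extend(islice(fetch_battles(player_tag), max_battles_per_member))
        # Flush the batch so memory use stays bounded on a small VM.
        save_batch(rows)
        total += len(rows)
        # Pause between batches to lower the risk of being throttled.
        sleep(pause_seconds)
    return total
```

Writing each batch out before starting the next one is what keeps the memory footprint small; the pause length would need tuning against whatever rate limits the API enforces.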

Getting Clan Members’ Ladder Data/ Battles

Eventually, I had 2 VMs running a total of 14 processes, and I divided a pool of 300k+ clans into the same number of groups, one per process.

Using HTOP To See VM Utilization

This went on 24/7, non-stop, for the duration of Season 18. Each process randomizes the list of clans it is assigned, iterates through each clan, and gets the ladder data of that clan’s members. As mentioned, there is a time limit and a cap on the number of battles extracted per member.
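As a rough illustration of splitting the clan pool across processes and randomizing each process's list, here is a small sketch. The function names and the round-robin split are my assumptions for the example; the original notebooks may partition the pool differently.

```python
import random
from typing import List, Optional


def split_into_groups(clan_tags: List[str], n_groups: int) -> List[List[str]]:
    """Deal clan tags round-robin into n_groups lists of near-equal size."""
    return [clan_tags[i::n_groups] for i in range(n_groups)]


def shuffled_work_list(group: List[str], seed: Optional[int] = None) -> List[str]:
    """Return a shuffled copy of one process's group, leaving the input untouched."""
    rng = random.Random(seed)  # a seed is only useful for reproducible tests
    work = list(group)
    rng.shuffle(work)
    return work


# Example: 14 groups, one per process; process 0 works through its own shuffled list.
all_clans = [f"#CLAN{i}" for i in range(100)]  # placeholder tags
groups = split_into_groups(all_clans, 14)
work_for_process_0 = shuffled_work_list(groups[0])
```

Because every clan lands in exactly one group, two processes never fetch the same clan from this pool, and the shuffle keeps any single pass from always favoring the clans at the front of the list.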

It is important to note that I also have a pool of 470 hand-picked clans that I always get data from, as these clans were the starting point that eventually enabled me to discover the 300k+ clans. Some clans have minimal ladder data, while others have A LOT.

One Of The 9 Notebooks (in 2 VMs) That Perpetually Gets Ladder Data

The notebooks save the ladder data into an Azure container in tab-delimited format. I also gzip the files before writing them to the container, to save space and to make the files easier to move around.

Compressing and Saving Pandas Dataframe Into An Azure Container
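The compression step can be sketched with the standard library alone. Note that this is a swapped-in illustration: the notebooks work with a pandas DataFrame, whereas this example takes a list of dicts, and the column names are made up. With pandas, `DataFrame.to_csv(sep="\t", index=False)` called without a path returns comparable text that could be compressed the same way. The upload to the Azure container is left out here.

```python
import csv
import gzip
import io
from typing import Dict, List


def rows_to_gzipped_tsv(rows: List[Dict[str, object]], columns: List[str]) -> bytes:
    """Serialize rows as tab-delimited text with a header, then gzip the result."""
    text = io.StringIO()
    writer = csv.DictWriter(
        text,
        fieldnames=columns,
        delimiter="\t",
        lineterminator="\n",
        extrasaction="ignore",  # silently drop keys that are not in `columns`
    )
    writer.writeheader()
    writer.writerows(rows)
    return gzip.compress(text.getvalue().encode("utf-8"))


# Hypothetical rows; the real column set is whatever the notebook extracts.
payload = rows_to_gzipped_tsv(
    [{"battle_time": "20201201T000000.000Z", "winner_tag": "#AAA"}],
    ["battle_time", "winner_tag"],
)
# `payload` is a bytes object that can be handed to a blob upload call.
```

Compressing in memory avoids writing a temporary file on the VM before the upload.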

After a whole day is done, I then combine the previous day’s files and do some simple cleaning and data engineering on them, such as removing duplicate battles, computing card levels, flattening nested data, and discarding unwanted data to name a few.

Combining Multiple Files And Doing Some Data Engineering And Cleaning
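Two of the cleaning steps, removing duplicate battles and flattening nested data, might look like the sketch below. It assumes each battle is a dict with a `battleTime` field and `team`/`opponent` lists whose entries carry a player `tag`; that layout and the deduplication key are my assumptions for illustration, and the real pipeline works on pandas DataFrames rather than plain dicts.

```python
from typing import Dict, Iterable, List, Tuple


def flatten(record: dict, prefix: str = "") -> Dict[str, object]:
    """Turn nested dicts and lists into one flat dict with dotted keys."""
    flat: Dict[str, object] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, f"{name}."))
        elif isinstance(value, list):
            for i, item in enumerate(value):
                if isinstance(item, dict):
                    flat.update(flatten(item, f"{name}.{i}."))
                else:
                    flat[f"{name}.{i}"] = item
        else:
            flat[name] = value
    return flat


def battle_key(battle: dict) -> Tuple[str, Tuple[str, ...]]:
    """Assumed identity of a match: its timestamp plus the sorted player tags.

    The same match can show up once per participant with team and opponent
    swapped, so sorting the tags makes both views map to the same key.
    """
    tags = [p["tag"] for side in ("team", "opponent") for p in battle.get(side, [])]
    return battle["battleTime"], tuple(sorted(tags))


def dedupe_and_flatten(battles: Iterable[dict]) -> List[Dict[str, object]]:
    """Keep the first copy of each match and flatten it into a single row."""
    seen = set()
    rows: List[Dict[str, object]] = []
    for battle in battles:
        key = battle_key(battle)
        if key in seen:
            continue
        seen.add(key)
        rows.append(flatten(battle))
    return rows
```

Keeping only the first copy means the surviving row reflects whichever player's view of the match was fetched first.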

With the latest dataset, the amount of data has ballooned to around 37.9M distinct ladder matches that were (pseudo)randomly pulled from a pool of 300k+ clans. If you think this is A LOT, it is probably only a percent of a percent (or even less) of all the ladder battles actually played. It may also not reflect the whole population, as the majority of my data are matches between players with 4000 trophies or more. But this dataset is surely enough to get anyone started.

The Latest Dataset With A File Size Of 8.4 GB

I’ve hosted the datasets on Kaggle, as I don’t see any reason not to share them with the public. The data is now large enough that working on it and producing insights would take more than just a few hours of “hobby” time.

The Kaggle Page For My Datasets

Feel free to use it in your own research and analysis, but please don’t forget to credit me. I’m also interested in seeing the results and insights that you extract from the data!

Also, if you have specific questions on the implementation that I did, feel free to drop a comment.

Also, please don’t monetize this dataset.

Stay safe.

Stay healthy.

Happy holidays!

Data Architect dipping his toes into Data Science
