By Harald Carlens, independent machine learning researcher.
If you’ve ever tried your hand at a Kaggle-style competition, you’ve probably seen how competitive they can be. Each one can draw thousands or even tens of thousands of competitors, including academic and corporate teams.
In running MLContests.com, I’ve been keeping track of all of these competitions since the end of 2019 and recently set out to look at the data and answer the question: what makes a winning entry?
I enlisted Eniola Olaleye, a top-5 competitor on the Zindi platform, to painstakingly go through the winning solution for every competition that took place in 2020 and figure out what made it win. What frameworks did they use? How much experience did they have? What type of modelling did they use? Did they use deep learning? How big was their team? In this article, I’ll highlight some of the findings that might help you increase your chances of winning a competition.
Which competition should I enter?
The first, and probably most important, decision is which competition to enter in the first place! Clearly, you’ll have a better chance of winning if you focus your efforts on one great entry for one competition at a time, rather than having mediocre entries for several competitions.
One thing that likely increases your chances of winning is to have less competition. It’s clear from our data that the number of teams entering varies a lot across different competitions and platforms! HackerEarth, DrivenData, Unearthed, Kaggle, and CodaLab competitions have more than a thousand competitors on average.
On the other end of the spectrum, we have one-off competitions like Waymo’s Open Dataset Challenges or IARAI’s Traffic4Cast Challenge, which had only a few dozen competitors. While these can be more domain-specific and don’t have the convenience of Kaggle Kernels and Notebooks, if you have relevant expertise in that area, it probably makes sense to go for one of these competitions instead. On that note, this year Waymo is once again running its Open Dataset Challenges: four competitions, each with a $22k prize pool. For other independent competitions, check MLContests.com. If you see any that aren’t listed there, you can submit a pull request on GitHub to share them with the ML Contests community.
Which deep learning framework is best?
Unsurprisingly, deep learning is extremely common in competitive ML, especially excelling in NLP and vision-based tasks.
Probably due to its relative maturity and increased ease of deployment, TensorFlow has generally been the more popular framework in ‘production’ industry environments, as measured by job postings or GitHub stars.
In research, on the other hand, PyTorch has gone very quickly from irrelevance to the clear preference for most researchers. Its pythonic approach, relatively stable API, and ease of customisation have propelled it to being used in around 80% of academic deep learning papers (only counting those where either TensorFlow or PyTorch are mentioned).
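To see what that “pythonic approach” and “ease of customisation” look like in practice, here is a minimal, illustrative PyTorch training loop on synthetic data (the model, data, and hyperparameters are made up for the example, not taken from any winning solution):

```python
import torch
import torch.nn as nn

# Illustrative only: a tiny regression model trained on synthetic data.
# Every step of training is explicit plain Python, which is what makes
# PyTorch easy to customise for research.
torch.manual_seed(0)
X = torch.randn(64, 10)  # 64 samples, 10 features (synthetic)
y = torch.randn(64, 1)   # synthetic regression targets

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(50):
    optimizer.zero_grad()        # clear gradients from the previous step
    loss = loss_fn(model(X), y)  # forward pass
    loss.backward()              # backpropagate
    optimizer.step()             # update weights
    losses.append(loss.item())
```

Because the loop is ordinary Python, swapping in a custom loss, gradient clipping, or an unusual sampling scheme is a one-line change rather than a framework extension.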
Source: Horace He, “The State of Machine Learning Frameworks in 2019,” The Gradient, 2019.
Machine learning contests clearly sit on the research side of the research-production continuum, and this is reflected in the choice of deep learning framework: over 70% of winning solutions used PyTorch, with just under 30% using TensorFlow.
And here we have our answer: while successful ML competition entrants, just like most other researchers, prefer PyTorch, if you’re comfortable with TensorFlow, there’s no reason that should stop you from winning.
You don’t need to be an expert at everything
It’s easy to be put off entering a competition because you feel like others have more experience or because you’re new to a particular area. Maybe you’re competent with Pandas and statistics but new to deep learning. Or maybe you’re a domain expert and understand ML concepts but aren’t much of a programmer.
Well, you’re in luck! The first bit of good news is that, in our data set, over 80% of winners had never won a competition before. So while there are certainly some repeat winners, in general they don’t dominate competitions, and newcomers as a group have a much better chance of winning.
The second bit of good news is that you don’t need to understand the full depth of every component you’re using in order to win.
Taking deep learning as an example, around 15% of deep learning solutions used very high-level APIs such as Keras for TensorFlow, or fast.ai for PyTorch. As long as you’re comfortable with general machine learning and careful not to overfit, it turns out that these APIs can be enough to win. You don’t necessarily need to write your own models and training loops from scratch, and using the higher-level APIs can allow you to focus your efforts on gaining an edge elsewhere.
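As an illustration of how little code a high-level API can require, here is a hedged sketch of a small Keras classifier; the data, architecture, and hyperparameters are invented for the example and not drawn from any particular winning entry:

```python
import numpy as np
from tensorflow import keras

# Illustrative sketch: a small binary classifier built entirely with
# Keras's high-level API -- no hand-written model or training loop.
np.random.seed(0)
X = np.random.rand(200, 20).astype("float32")  # synthetic features
y = (X.sum(axis=1) > 10).astype("int32")       # synthetic binary labels

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
# fit() handles batching, shuffling, and progress logging for you
model.fit(X, y, epochs=5, batch_size=32, verbose=0)
```

The time saved by not writing the training loop yourself is time you can spend on feature engineering, validation strategy, or ensembling — the places where competitions are usually won or lost.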
Another interesting example is in competitions where domain knowledge is important. Let’s consider the Gawler Challenge for predicting areas with mineral deposits in South Australia, with a $250k prize pool. We contacted the winning team to find out how they managed to bag the $100k top prize. It turns out that they are expert geologists who mainly relied on their knowledge in the field to get their edge. They combined geomechanical modelling with random forests. The interesting part is that they used Weka, a GUI tool for the random forests implementation. This allowed them to focus on the area where they could excel, and it clearly paid off.
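The winning team worked in Weka’s point-and-click interface, but for readers more at home in Python, the same kind of random-forest model is only a few lines of scikit-learn. This sketch uses synthetic stand-in data, not the Gawler Challenge dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative only: synthetic stand-ins for geological survey features
# and deposit / no-deposit labels.
rng = np.random.default_rng(0)
X = rng.random((500, 8))                  # e.g. geophysical measurements
y = (X[:, 0] + X[:, 1] > 1).astype(int)   # e.g. mineral deposit present?

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
```

Whether you reach for Weka, scikit-learn, or anything else, the point is the same: the model was the easy part, and the team’s edge came from the domain knowledge that went into the features.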
So there you have it. If you want to win:
- Pick the right competition. Consider where you might have an edge and which competitions might be overlooked by others.
- Focus on where your edge is. If you’re doing a competition where you have some domain knowledge, it might be worth spending more time on fully using that than on tuning other parts of your solution.
For more stats on 2020 ML contest winners, see this summary.