Marathon, Chapter 1
So you’ve scaled up your mobile infrastructure to support the ever-growing number of tests for your application, and you think everything will be fast now, right?… Wrong! In the next several minutes I’ll explain why scaling infrastructure doesn’t necessarily lead to faster test execution and what you can do about it.

The approaches and the test runner described in this article have been battle-tested by several companies and have executed millions of UI tests. All the source code is available on GitHub. I humbly ask you to star, share and spread the word.
If you’re interested in scaling up your infrastructure, check out the approach in a previous article, Android CI with Kubernetes.
A quick intro to the history of mobile testing
By default, Apple and Google provide you with simple means of executing your tests. This is enough if you have a simple application, but as soon as you write more tests you’ll start facing issues. Apart from the default test runners, there are third-party solutions that try to optimize performance, like Spoon, Fork and fbxctest. The problem is that none of them addresses all the issues we face in the real world of testing at scale.
Gathering requirements
The biggest problem with large projects and large test suites is flakiness. It comes in all shapes and sizes: the infrastructure might fail, some test might mishandle shared state and affect other tests, and so on.
Development cycle duration plays a crucial part in a software developer’s life. What do I mean by development cycle? Well, you’ve pushed your source code and you’re waiting for feedback on whether you broke something or not. This, in turn, depends on the time it takes to execute your tests, so test execution performance is critical.
Last but not least: you don’t want to use a different approach for each platform. Wouldn’t it be nice to use the same testing logic when you switch from Android to iOS? Basically, unify the testing platform code as much as possible.
Assumptions
I assume that testing happens on multiple devices at the same time and is controlled by a test runner that executes separately. For example, you have your laptop, and you connect it to multiple Android devices, such as emulators or real phones, via USB or TCP/IP. Let’s call these devices execution units for the rest of the article.
Flakiness
If you don’t know what flakiness is: it’s executing the same test without any changes and getting a different result. Sounds familiar? It should:
Insanity is doing the same thing over and over again but expecting different results
So how do we address this? Well, for starters, we need to measure flakiness with something tangible, like numbers: we can track every test execution and store its data, such as duration and success status. Given this data, we can calculate the probability of a test passing at a specific moment in time, for example right now. If the test failed 95 times out of 100 during the last 24 hours, chances are it will fail when we execute it again.
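As a rough sketch of how such a probability could be computed from stored execution history (the data model and function names here are illustrative, not Marathon’s actual implementation):

```python
from dataclasses import dataclass
import time

# Hypothetical record of a single historical test execution.
@dataclass
class TestRun:
    timestamp: float   # unix seconds when the test ran
    success: bool      # whether this run passed

def pass_probability(history, window_seconds=24 * 3600, now=None):
    """Estimate the probability that a test passes right now,
    based on its runs within the last `window_seconds`."""
    now = time.time() if now is None else now
    recent = [r for r in history if now - r.timestamp <= window_seconds]
    if not recent:
        return 1.0  # no data: optimistically assume the test passes
    return sum(r.success for r in recent) / len(recent)
```

With 95 failures out of 100 runs in the window, this returns 0.05, matching the intuition above.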
This probability captures problems in your source code as well as issues in communication with external dependencies. But what happens to the test execution if there is an infrastructure failure? Well, we shouldn’t assume that our execution units are always there, so the test runner must constantly check that each execution unit is healthy and immediately take action if one is dead.
Another kind of problem is related to the non-uniformity of execution units. Even if they are always spawned from a clean image, like a VM or a container, they are still affected by what happened just before the test execution (corrupted shared state, for example). In this case, we should retry the same test on a different unit than the one it ran on before.
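A minimal sketch of how a runner might pick a unit for such a retry, preferring healthy units the test has not run on yet (the function and its signature are hypothetical, for illustration only):

```python
def pick_unit_for_retry(healthy_units, previously_used):
    """Prefer a healthy execution unit the test has not run on yet;
    fall back to any healthy unit if every one has already been used."""
    fresh = [u for u in healthy_units if u not in previously_used]
    candidates = fresh if fresh else list(healthy_units)
    if not candidates:
        raise RuntimeError("no healthy execution units left")
    return candidates[0]
```

Note that only healthy units are considered at all, which is the other half of the requirement: dead units must already have been filtered out by the runner’s health checks.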
Now that we know what we’re up against let’s see what we can do about the performance.
Performance
To improve performance, we need to understand what happens during a test run. Let’s look at a report that one of the current test runners produced:
I want you to put yourself in the shoes of someone trying to optimize the performance of this test run. Do you have enough information here? Can you make useful suggestions on how to improve it?
A curious reader will immediately notice that test performance directly relates to time, yet this report tells us almost nothing about it. Our current report shows each execution unit’s timeline as follows:

To optimize performance, we basically want to get rid of these gaps in each timeline as much as possible. The gaps can happen for multiple reasons, but the worst-case scenario is when only one unit is working and everything else is just sleeping, as at the end of the execution here:

Why do these gaps happen, you ask?
- The duration of tests is completely ignored by the test runner. Every test is assumed to have the same duration, or it is assumed that in the general chaos of the test run things will balance out. They won’t: 5 tests 1 minute long are not the same thing as 5 tests 5 seconds long. This leads to an imbalance in test execution.
- The success rate of tests is also completely ignored. This rate is something we have to live with, because we can’t get rid of flakiness completely. It leads to unaccounted-for retry executions, which we could balance between devices if we knew about them before the test run.
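To make the duration point concrete, here is a minimal sketch of duration-aware scheduling: the classic greedy longest-processing-time heuristic, which always hands the next longest test to the least-loaded unit. This is an illustration of the idea, not Marathon’s actual algorithm:

```python
import heapq

def schedule(tests, unit_count):
    """Greedy longest-processing-time scheduling: always assign the next
    longest test to the execution unit with the least total work so far.
    `tests` is a list of (name, expected_duration_seconds) pairs."""
    heap = [(0.0, i, []) for i in range(unit_count)]  # (load, unit, assigned)
    heapq.heapify(heap)
    for name, duration in sorted(tests, key=lambda t: -t[1]):
        load, unit, assigned = heapq.heappop(heap)
        assigned.append(name)
        heapq.heappush(heap, (load + duration, unit, assigned))
    return sorted(heap)  # (total_load, unit_index, test_names) per unit
```

With one 60-second test and five 5-second tests on two units, this puts the long test alone on one unit (60 s) and all the short ones on the other (25 s), instead of letting a naive split stack short tests behind the long one.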
So what can we do about it? We can start by setting the goal of all execution units finishing at almost the same time. This reduces the gaps at the end of the execution, where a small number of execution units are still running when everything else is already done. Secondly, if we know that a particular test failed before, we want to add preemptive retries for this test before we even start the execution. Why? Because we can parallelise these executions instead of doing them sequentially and losing our chance to use more execution units¹. For example, say we have a test with a probability of passing of 0.5, but we want at least 0.8. Executing this test 3 times in parallel gives us a probability of 1 − (1 − 0.5)³ = 0.875, which meets our target.
Another important aspect of executing tests is how to sort them in the queue for execution. Each use case is individual, but the general suggestion is to take into account the success rate and the duration:
- Sort by success rate (ascending and descending order).
The logic for ascending order is that retries are almost non-existent at the end of the execution, so we start with the unpredictable (= sometimes failing) tests first and then proceed to the stable ones.
On the other hand, experience shows that most unstable tests fall in the probability bucket [0.9–0.99]. These produce yet another unexpected retry at the end of the run even though the test is considered stable. So, in fact, mostly failing tests are much more predictable than those rarely failing ones. When you scale up your test executions, “rarely failing” turns into failing every day or even every hour, and since the number of tests is always growing, this problem becomes a very prominent one. To battle this, we can sort the tests by probability in descending order, so that stably failing tests are at the end, where we know their retry count for certain, and the sometimes-failing tests are executed at the beginning, when we can still adapt to unexpected retries without huge sacrifices in performance.
- Sort by duration: longest tests first
This approach minimizes the performance impact at the end of the execution: retrying the longest test you have affects performance much more than retrying the shortest one.
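Both suggestions can be combined into a single queue ordering: probability descending first, with duration descending as a tie-breaker (the tuple layout and function name are illustrative):

```python
def order_queue(tests):
    """Order tests by pass probability descending (stably failing tests
    land at the end, where their retry count is known for certain),
    breaking ties by duration descending (longest first).
    `tests` is a list of (name, pass_probability, duration) tuples."""
    return sorted(tests, key=lambda t: (-t[1], -t[2]))
```

For example, a fully stable test runs first, two sometimes-failing tests follow with the longer one ahead, and a mostly failing test goes last where its retries are predictable.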
So far, everything we’ve talked about is platform-independent, so it should be possible to reuse the same code for multiple platforms.
What’s next
This wraps up the approach used in the Marathon test runner. In the next parts, I’ll explain the basics of Marathon’s implementation, such as Batch and Shard, and walk you through different testing strategies and when to use them.
Please share this article and GitHub repo to help reach people who are struggling with testing performance and stability.
Links
https://github.com/Malinskiy/marathon
[1] There is an assumption in testing that if a test passed at least once after multiple executions, then it is considered passed.