Flaky tests are poisoning your productivity

I freaking HATE flaky tests.

The first time I worked in an environment that had real Continuous Integration with Actual Automated Tests that Actually Ran, it was like... freedom. We literally got the green light that our new code was working as expected, and that any changes we made hadn't broken anything. And refactoring... before then, I don't think I had ever really refactored anything. Even a simple rename was fraught with danger: you never knew whether reflection or some sort of odd log-file parsing was dependent upon specific class or method names. With a comprehensive suite of unit, acceptance and performance tests, we had this blissful safety net that would tell us "Everything Is OK" after we'd done simple or extensive refactoring.

Except.

The tests were never all green at the same time. Individually, every test went green regularly throughout the day, but the information radiator always had at least one build segment that was red, because some test somewhere had failed.

While it was to be expected that tests would sometimes fail, especially the acceptance tests (they took too long to run them all locally before committing), for every single commit to cause at least one failure? That couldn't be right.

Welcome to the world of Flaky Tests.

Integration tests, particularly UI tests, are very prone to failing intermittently due to some sort of time-out or resource contention. I found this out the hard way. I spent the next four years helping to identify flaky tests, fixing the ones I understood, and encouraging people smarter than me to figure out what was happening with the ones I couldn't. I mean, I did Actual Development too, lots of it, but my goal was to get that Information Radiator green.

Fifteen years later and I'm still fighting the battle, only now I'm hoping to help you with your own flaky test journey. In my experience, this journey looks a bit like:

  1. We don't have no stinking flaky tests
  2. Oh, we do? Let's find them and fix them
  3. We should write tests that are less likely to be flaky

If you've made it as far as Step 3, well done!! In my experience, very few organisations make it that far.

If you are on this journey and you need some encouragement for yourself, your colleagues or your management, there may be something helpful in this video I did for Dave Farley's Continuous Delivery YouTube Channel. Transcript below.


Transcript

Are your tests lying to you? When they go red, when they fail, are they really failing? Or when you rerun them, do they go green? Flaky tests are the bane of my existence. And I'm going to talk about why that is and what you can do about your flaky tests.

So, flaky tests. Flaky tests, intermittent tests, non-deterministic tests. They're all different names for the same thing. A test that, under the same circumstances, sometimes passes and sometimes fails.

I find flaky tests so frustrating. We spend all this time, effort, and energy writing automated acceptance tests, and then time and money running them, only for them to, you know, sometimes pass and sometimes fail. They don't tell us the information that we need. They don't tell us, does it work? Yeah, I mean, sometimes.

That's not helpful. Why are flaky tests so bad? Well, they erode our trust in the tests. Not only in the flaky tests themselves, because we learn that this one is perhaps flaky and we stop paying attention to it. And that's bad enough on its own: it's failing, and there might actually be something wrong. Or it might fail for a different reason, like something really is broken, and we ignore it because we assume it's just being flaky.

But it's not just about the flaky tests that we're ignoring. We start to ignore all of our tests. Well, this build just has some flaky tests and sometimes it just goes red. And we start to ignore all these tests that we've put so much time and effort into writing. But worse than that, it has a bad impact on team morale.

We start to think, well, these tests don't really matter. And maybe quality doesn't really matter. And maybe the work I'm doing doesn't really matter. Like, what am I doing here? None of this makes any difference. And that's bad for us as a team. So what can we do?

The first thing we need to do is we need to find the intermittently failing tests.

I've spoken to engineers about the failures in their test suite, and asked them, do you have intermittently failing tests? And the answer is... we don't know. Sometimes as engineers we understand "Oh, I see this test fail quite a lot. It seems to sometimes pass and fail". We get a feeling for which ones are intermittently failing, but we don't have a list of them.

So the first thing to do is identify your flaky tests. We could do this the hard way, manually. You could have a look at every single failing test, rerun them, and if they pass, make a list of flaky tests versus actually failing tests. This is something I used to do in my day job years ago. Or we can do it the smarter way, which is what my colleagues did for me in that job, and automate this.

One thing you can use is the automatic retry functionality of your build tool. For example, Maven and Gradle both support retrying failed tests a set number of times. So if you run a test, say, three times and it passes once and fails twice, that's a flaky test.
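As a rough illustration, here's what that might look like with the Gradle test-retry plugin in a Kotlin build script (the plugin version is just a placeholder; Maven Surefire has a similar `surefire.rerunFailingTestsCount` option). This is a minimal sketch, not a drop-in config for your build:

```kotlin
// build.gradle.kts: a minimal sketch using the Gradle test-retry plugin.
// The version number is illustrative; check the plugin portal for the current one.
plugins {
    java
    id("org.gradle.test-retry") version "1.5.8"
}

tasks.test {
    retry {
        maxRetries.set(3)                  // rerun a failed test up to 3 times
        maxFailures.set(20)                // stop retrying if lots of tests fail (probably a real breakage)
        failOnPassedAfterRetry.set(true)   // a pass-on-retry still fails the build, so the flakiness stays visible
    }
}
```

With `failOnPassedAfterRetry` set, a test that passes on a retry is reported as flaky rather than quietly waved through, which is exactly the list you're trying to build.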

You can also write your own tool (or buy one) which looks at the tests across your various different builds. It can figure out that if a test ran under the same circumstances on different builds but sometimes passed and sometimes failed, it's probably a flaky test. This is a kind of "cross-build" flaky test detection.
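To make that concrete, here's a minimal sketch of the idea rather than any particular product. It assumes you can export each build's test outcomes as simple (commit, test name, passed) records, for example from JUnit XML reports or your CI server's API; the `TestResult` type and `findFlakyTests` function are names I've made up for illustration:

```kotlin
// Sketch of cross-build flaky-test detection: a test that both passed and
// failed for the same commit is a flaky-test suspect.
data class TestResult(val commitSha: String, val testName: String, val passed: Boolean)

fun findFlakyTests(results: List<TestResult>): Set<String> =
    results
        .groupBy { it.commitSha to it.testName }    // same test, same code under test
        .filterValues { runs -> runs.any { it.passed } && runs.any { !it.passed } }
        .keys
        .map { it.second }                          // keep just the test name
        .toSet()
```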

The next step is to stop these tests poisoning your test suite. Take the flaky tests and move them out of your standard suite: run them on a different build, a different agent, a different something, or park them for now. Or you can write an annotation to temporarily disable the test. We wrote an @IgnoreUntil annotation, which we would put on a flaky test along with a date on which to start re-running it. So for the next two weeks, or whatever, the test wouldn't run, giving you time to fix it. If it wasn't fixed in two weeks' time, it would pop back up and start failing again. That's one way to temporarily quarantine your tests.
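I don't have the original @IgnoreUntil code to share, but as a rough sketch of the idea, a JUnit 5 ExecutionCondition gives you the same behaviour: skip the test until a given date, then let it run (and fail) again so it can't be quietly forgotten. The annotation and class names here are just illustrative:

```kotlin
import org.junit.jupiter.api.extension.ConditionEvaluationResult
import org.junit.jupiter.api.extension.ExecutionCondition
import org.junit.jupiter.api.extension.ExtendWith
import org.junit.jupiter.api.extension.ExtensionContext
import org.junit.platform.commons.support.AnnotationSupport
import java.time.LocalDate

// Quarantine a flaky test until the given date (ISO format, e.g. "2024-06-01").
@Target(AnnotationTarget.CLASS, AnnotationTarget.FUNCTION)
@Retention(AnnotationRetention.RUNTIME)
@ExtendWith(IgnoreUntilCondition::class)
annotation class IgnoreUntil(val date: String, val reason: String = "flaky, quarantined")

class IgnoreUntilCondition : ExecutionCondition {
    override fun evaluateExecutionCondition(context: ExtensionContext): ConditionEvaluationResult =
        AnnotationSupport.findAnnotation(context.element, IgnoreUntil::class.java)
            .map { annotation ->
                if (LocalDate.now().isBefore(LocalDate.parse(annotation.date))) {
                    ConditionEvaluationResult.disabled("Quarantined until ${annotation.date}: ${annotation.reason}")
                } else {
                    // Past the date: the test runs again, and pops back up if it's still flaky
                    ConditionEvaluationResult.enabled("Quarantine expired on ${annotation.date}")
                }
            }
            .orElse(ConditionEvaluationResult.enabled("Not quarantined"))
}
```

A test annotated with, say, @IgnoreUntil(date = "2024-06-01", reason = "flaky on CI, tracking ticket raised") would be skipped until that date and then start running, and failing, again.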

I would argue sometimes for deleting the tests. If you're ignoring the result, if you don't understand what it's doing when it fails, that test isn't doing anything for you. You should probably delete it. Or, if that feels too drastic, make a note of the test's title and what you think it's checking, delete it, and schedule time to write a new one. Either way, it's not giving you any value, so get that test out of there.

Now, a very important step. Fix your flaky tests. I mean, if you didn't delete them! But fix your flaky tests. It's not enough to just put them over here and start ignoring them. You also have to actively fix them. These tests were written for a reason. Somebody thought that these things needed to be checked. Figure out what it is that the test is doing.

Figure out the root cause of the problem with the intermittently failing test and fix that. I'm not going to go into the details of the root causes of intermittently failing tests; Dave's already done an excellent video on that on this channel, so go ahead and have a look at that.

What I do want to talk about is when to fix them. Ideally, as soon as possible, but that's not always practical when you're working on other things. But there are several ways to make sure that you do spend time fixing intermittently failing tests, rather than leaving them for "later", the way we address tech debt "later".

Something you could do, for example, is create a card with the names of the tests you want to actually fix, put it in the backlog so that it gets done as part of your normal work. Something else you can do is schedule a fixed amount of time every couple of weeks, every month, to work on intermittently failing tests.

There's an advantage to having the whole team swarm on intermittently failing tests: you may find there are root causes common across the tests, and with the whole team working on it, you're more likely to identify them. Something else you could do, which is less than ideal, is slot in some time for working on intermittently failing tests when you're waiting for other stuff. Waiting for some long batch process, waiting for someone to get back to you, waiting in between two meetings: you might want to spend an hour or so diving into the possible cause of your test intermittency.

"But Trisha", I hear you say, "some tests are just inherently flaky". Yes, it's true. We can't control a whole environment. There are some types of tests, ones which are waiting for external services, or ones which require a database to come up, or ones which have a complicated set of dependencies to manage, which just don't always respond in time for us to complete the test.

That is the reality of the situation. What we want to do with those tests which are inherently flaky is put them somewhere else, run them somewhere else, and know that they are flaky, so they're not sitting in our normal test suite poisoning its results.
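One lightweight way to do that, sketched here with JUnit 5 tags and Gradle (the "flaky" tag name and the flakyTest task are just examples I've made up): tag the inherently flaky tests, exclude that tag from the main test task, and run the tagged tests in their own task on their own schedule or agent.

```kotlin
// build.gradle.kts (sketch, assuming the java plugin and JUnit 5 are already set up).
// Tests annotated with @Tag("flaky") are kept out of the main suite...
tasks.test {
    useJUnitPlatform {
        excludeTags("flaky")
    }
}

// ...and run separately, where a red result doesn't block everyone else.
val flakyTest by tasks.registering(Test::class) {
    useJUnitPlatform {
        includeTags("flaky")
    }
    testClassesDirs = sourceSets["test"].output.classesDirs
    classpath = sourceSets["test"].runtimeClasspath
}
```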

However, just because some types of tests are inherently flaky doesn't mean we have to accept that all of our tests are flaky, and it doesn't mean we should write all of our tests against the dependencies that introduce that flakiness. Not every test needs to be an end-to-end test. Not every test needs to be a UI test. Not every test needs Testcontainers to start up databases or whatever. Many of the things we are checking in these bigger tests can be separated out into smaller, more reliable, more controllable tests. That's what we should be aiming for. Yes, some tests will be intermittent, a small number of them, and they should be run elsewhere. But for our key tests, we should be able to use stubs, test harnesses, something else that gives us a stable framework and consistent results for the tests that are most important to us.
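As a small illustration of that last point, assuming a made-up ExchangeRateService and PriceCalculator: if the business logic depends on an interface, the important test can use a deterministic stub instead of a real external service, and the slow, flaky end-to-end version becomes one of a handful of tests rather than the only way to check this behaviour.

```kotlin
import org.junit.jupiter.api.Assertions.assertEquals
import org.junit.jupiter.api.Test

// Hypothetical production code: the logic depends on an interface,
// not on a live external service.
interface ExchangeRateService {
    fun rate(from: String, to: String): Double
}

class PriceCalculator(private val rates: ExchangeRateService) {
    fun convert(amount: Double, from: String, to: String): Double =
        amount * rates.rate(from, to)
}

class PriceCalculatorTest {
    // Deterministic stub: no network, no timeouts, the same answer every run.
    private val stubRates = object : ExchangeRateService {
        override fun rate(from: String, to: String) = 2.0
    }

    @Test
    fun `converts using the supplied exchange rate`() {
        val calculator = PriceCalculator(stubRates)
        assertEquals(20.0, calculator.convert(10.0, "GBP", "USD"))
    }
}
```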

I accept there is a reality where there are organizations and teams where they have a lot of end-to-end tests. A lot of their tests are big and complicated, and they are inherently unreliable. And these teams might be listening to me thinking, I don't have time to rewrite all of my tests so they're faster and more reliable.

It's true, you have other stuff to do. But I do think you should be investing time in making your big, clumsy, unreliable tests smaller, faster and more reliable. Where would you rather invest your time? Do you want to spend it debugging flaky tests every time something fails that has nothing to do with your change? Do you want to spend it re-running tests and waiting for the results, only to find out that, oh, it was nothing to do with the thing that you did? Or do you want to invest that time in making these tests faster and more reliable, so that you don't have to worry about them anymore?

Ask yourself, is it worthwhile having a test suite which lies to you? Or are you going to invest time finding flaky tests and scheduling time to fix those flaky tests?

Author

  • Trisha Gee

    Trisha is a software engineer, Java Champion and author. Trisha has developed Java applications for finance, manufacturing and non-profit organisations, and she's a lead developer advocate at Gradle.

