Risk-based testing 2: Donald Rumsfeld and the Martian that didn’t make it

Donald Rumsfeld has been widely ridiculed for his statement about unknown unknowns. Unfairly, if you ask me. When you move into testing, unknown unknowns are your biggest concern.

In my previous post I wrote about asking the right questions. I’d like to illustrate why that is so important, with a story about Martians.

In 1998, NASA sent out a probe called the Mars Climate Orbiter. It was going to orbit Mars and collect data on the red planet’s climate, preparing for Matt Damon’s future expedition. Unfortunately, unlike Matt Damon, the probe never made it. It dipped too deep into the atmosphere and disintegrated.

What happened?

Now, NASA are pretty meticulous about their testing. When NASA make a mistake of this magnitude, we’re talking about hundreds of millions of wasted dollars. If there’s anything they know they don’t know, they will find out before sending their toy into space.

Of course, there are always some risks you have to mitigate in ways other than testing. If you know about something that could go wrong, and you know you cannot physically or practically test the exact conditions, you can take action to reduce the risk. If you’re at sea in a small rowing boat, you don’t know whether you’ll be hit by a wave and thrown into the water. Waves are unpredictable. That’s why you wear a life vest. In the same way, NASA’s spacecraft have plenty of systems designed to handle events with uncertain outcomes.

Sometimes these systems fail too. These are calculated risks. At some point we have to draw the line and say that the cost of further mitigation is so high that we will accept the remaining risk. In the rowing boat, the life vest is good enough – you don’t put on a survival suit and sling a bag of emergency flares over your shoulder before you go out to pull in the fishing net.

The tricky part, though, is what you don’t know that you don’t know. That’s what happened to the Mars Climate Orbiter. It wasn’t a deliberately calculated risk.

One of NASA’s subcontractors delivered a subsystem which used imperial units. NASA, like the rest of the scientific world, uses metric SI units. And so it happened that a series of events occurred during the probe’s flight, leading to a situation during descent to orbit where the measured values from one subsystem and the calculated values from another gave conflicting data. After its nine-month journey from Earth to Mars, the probe, as Wikipedia puts it, “unintentionally deorbited”.
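
To make the failure mode concrete: the ground software reported thruster impulse in pound-force seconds, while the navigation software expected newton-seconds. Here is a deliberately simplified sketch of how such a mismatch sails through each side’s own tests – the function names and numbers are invented for illustration; only the conversion factor is real.

    # Simplified illustration of a unit mismatch at an interface.
    # Names and values are invented; only the conversion factor
    # (1 lbf*s = 4.448222 N*s) is real.

    def thruster_impulse_lbf_s() -> float:
        """Subcontractor side: reports impulse in pound-force seconds."""
        return 100.0  # looks perfectly sensible in its own unit tests

    def update_trajectory(impulse_newton_seconds: float) -> float:
        """Agency side: expects newton-seconds."""
        return impulse_newton_seconds / 1000.0  # some downstream calculation

    LBF_S_TO_N_S = 4.448222

    # Each side passes its own tests. Put together, the result is off by a
    # factor of 4.45, and nothing in the code ever complains.
    wrong = update_trajectory(thruster_impulse_lbf_s())
    right = update_trajectory(thruster_impulse_lbf_s() * LBF_S_TO_N_S)
    print(wrong, right)  # silently different answers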

You can be pretty sure that no engineer in NASA at any point during the project said in a risk evaluation meeting, “Well, we don’t know for sure whether all the subsystems use the correct units of measurement”. And it is equally certain that no manager answered, “I understand, but we can’t be bothered to test that before we launch. It will take too much time and we can’t spare the resources”. To NASA, the subsystem’s use of pound-seconds was an unknown unknown. It was something they didn’t know that they didn’t know.

You need a combination of creativity, method and experience to find those unknown unknowns. Even then, you’re never going to find them all, but as a skilled tester you should never stop hunting. Even if your unknowns cost significantly less than a space probe.

Risk-based testing 1: Why?

Testing, at its core, is the act of figuring out things you didn’t know.

A test is to ask yourself a question, and then to perform actions designed to find the answer to that question.

The answers will tell you whether the product is ready for release – but only if you asked the right questions in the first place.

So how do you know whether you are asking the right questions?

Risk-based testing is a methodical way to find the relevant questions, to prioritize them, and to challenge your beliefs and intuitions by exposing them both to yourself and to others.

An experienced tester will always have an intuitive sense of the relevant risks and priorities while testing, even without a formal identification process. For a small project, that might be enough. However, even the experienced tester is prone to cognitive bias. Making yourself think through a topic by making an explicit list of risks is a good way to overcome bias. Inviting others to join in the process is even better.

For a larger project, a formal risk analysis process is vital. On the one hand, it is your tool for increasing the likelihood that you are asking the right questions. You know you won’t have time to test everything, so you had better make sure you are spending your limited time on the most important things.
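
One common way to make that prioritization explicit – this is a generic sketch, not a prescribed format – is a plain risk list where every item gets a likelihood and an impact estimate, and the testing effort follows the product of the two. The risks and numbers below are invented for illustration.

    # Minimal risk list sketch; risks, numbers and scales are invented.
    from dataclasses import dataclass

    @dataclass
    class Risk:
        description: str
        likelihood: int  # 1 (rare) .. 5 (almost certain)
        impact: int      # 1 (cosmetic) .. 5 (showstopper)

        @property
        def score(self) -> int:
            return self.likelihood * self.impact

    risks = [
        Risk("Data loss during upgrade from the previous version", 2, 5),
        Risk("Slow search against large customer databases", 4, 3),
        Risk("Misaligned labels in the print preview", 3, 1),
    ]

    # Spend your limited testing time from the top of this list downwards.
    for risk in sorted(risks, key=lambda r: r.score, reverse=True):
        print(f"{risk.score:>2}  {risk.description}")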

On the other hand, the documentation you get out of the risk analysis process is an important tool for learning from the project after it’s done. It will let you look back at a decision you made and discuss whether it was the right one – and if it turned out not to have been, then you’ll know why you made that decision at the time, and what you need to change in order to make a better decision next time.

 

A techie in heels is still a techie

In a discussion forum a while back, someone posted a link to an interesting article about the oil industry. I can’t remember the exact context or the details of the article. Something about the volatility of the market, I think. The article was a good read, and relevant to the discussion at hand.

Then a guy in the discussion forum followed up with an annoyed post: Sure, great article, but why was the article illustrated with a pretty woman in a short skirt and high heels?

This person was genuinely looking at this from a feminist point of view, which I can appreciate. It is annoying when pretty women are mindlessly used as eye candy to draw attention to something completely unrelated.

However, the next poster quickly noted that the woman in the photo was Thina Saltvedt, one of Norway’s most highly qualified oil analysts, and that she was the main source for the article. That puts the situation in a different light. If a woman doesn’t dress in a masculine power suit, the automatic assumption is that she has been placed there because she’s pretty, not because she has relevant qualifications.

Why do I tell this story here? After all, it has nothing to do with testing, and very little to do with technology.

Well, it’s March 8. Among the links being shared on the web today, there was a story about how programmers who look stereotypically feminine are automatically assumed to be less competent than programmers who look less stereotypically feminine. Even if they are both women.

Even as more women are joining IT, and more people are starting to catch on to the idea that women can actually code, we’re still looking at a gap where women who look like women are not taken seriously. Wear jeans and a geeky t-shirt, and you might just be a real techie, even if you’re a woman. Wear a cute dress, however… you must be the designer, or the project manager, or at best a non-technical tester with no clue about anything beyond the buttons in the UI.

IT folks are pretty logical people. I think we can all see that when we make these assumptions – with a tribute to the late Leonard Nimoy – we are being highly illogical.

 

Is ISO 29119 useful?

On a regular basis, I get questions from customers or our sales folks about whether we have any software testing certifications. The answer is no, and I usually follow up by explaining that we do not plan on acquiring any such certifications either, unless it becomes an unavoidable business requirement. And in that case, the explicit purpose will be to satisfy contracts, not to improve our testing.

Multiple standards cover software testing, some general and some in specific areas. The latest thing is ISO 29119, which aims to replace several older ones.

I’m a big advocate for standards. Standards are awesome, in scenarios where they are useful. A good standard ensures interoperability between products created by different vendors. A bad standard fails at ensuring such interoperability. I have not worked closely enough with ISO 29119 to decide whether it’s a good or a bad standard. My argument is that the point is moot, because interoperability between vendors is not necessary in testing.

Thus, I argue that ISO 29119 is not useful.

There’s certainly a lot of useful content in ISO 29119. Many of the standard’s requirements are things I consider good practice in most contexts. But as a standard for testing, the only thing it is really good for is discouraging thinking. It lets the testers, test managers and the customers off the hook by letting them ask “do we follow the standard” rather than “do we test well enough”.

When a customer asks about our testing practices before entering into a contract (as they should!), I’d much rather reply by explaining what we do and how we do it than by pointing to a set of instructions we follow. And if the customer would rather have an ISO number than engaged testers, it is not quality they care about.

When do you know that the test passed?

New testers tend to be preoccupied with the motions of the test. They’ve studied methods for identifying boundaries, and know the importance of negative tests. If they have been diligent, they even know to test error handling and recovery. Still, the bright but inexperienced tester often stops a step short of actually knowing whether the test passed or failed.

Let’s look at an example: Testing that an element can be saved to a database. You prepare the element, save it, and the application displays a happy message saying that the element was saved. Done, test passed! Right?

The experienced tester, of course, would not think of stopping there. All you have tested so far is that you get feedback when saving to the database. And you haven’t even tested that it’s the correct feedback. If the save happened to fail behind the scenes, you’d actually have a much more serious issue – the dreaded silent fail. And, of course, you haven’t tested what you said you would test: That the element was saved to the database.

For every test you perform or design, whether manual or automated, the most important question you can ask yourself is: “Does this really prove that the application did the thing I set out to test?”

For the database example, there are multiple ways to complete the test. You can simply reopen the saved element, or you can continue by using it in a new operation that has to read the element from the database in order to work. Or you can inspect the element directly in the database. What you choose will depend on the application itself – for example, if it caches elements, reopening the element may not be proof enough. It’s up to you to know what is proof enough.
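
As a minimal sketch of the idea – using an in-memory SQLite database and invented names, so your application and your checks will look different – the point is that the test reads the element back through an independent path instead of trusting the happy message:

    # Sketch: verify that the element actually reached the database,
    # not just that the application said so. Names are invented.
    import sqlite3
    import unittest

    def save_element(conn, name, value):
        """Stand-in for the application's save operation."""
        conn.execute("INSERT INTO elements (name, value) VALUES (?, ?)", (name, value))
        conn.commit()
        return "Element saved"  # the happy message the UI would show

    class TestSaveElement(unittest.TestCase):
        def setUp(self):
            self.conn = sqlite3.connect(":memory:")
            self.conn.execute("CREATE TABLE elements (name TEXT, value TEXT)")

        def test_element_is_actually_persisted(self):
            message = save_element(self.conn, "probe", "MCO")
            self.assertEqual(message, "Element saved")  # the feedback...
            # ...and, crucially, the element itself, read back independently
            row = self.conn.execute(
                "SELECT value FROM elements WHERE name = ?", ("probe",)
            ).fetchone()
            self.assertIsNotNone(row)
            self.assertEqual(row[0], "MCO")

    if __name__ == "__main__":
        unittest.main()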

Guest post: Thoughts on designing unit and system tests

This week in Tech and Test brings a guest post from my esteemed colleague and minion, excuse me, Minion, Tony Meijer, on the topic of automated testing:


Why do we write unit tests? A simple question, right? Think about it for a few minutes.

Most people I ask answer ‘to avoid regressions’ or ‘to find bugs’. Let’s examine that. Unit tests are built to test one independent unit of code, and most regressions are due to subtle compatibility issues between many units of code, so that seems to be an incorrect assumption. However, when we are refactoring, unit tests are actually a very good defense against bugs, since refactoring means restructuring code without changing its behavior.

So, how about bugs then? Again, every single component may behave exactly as you expect it to, and it will not matter. Most bugs, at least most severe bugs, are due to a sum of many incremental quirks across a series of code units that adds up to faulty behavior, at least in my experience.

So, why do we write unit tests then?

I would like to say that we do it because, when done correctly, it creates higher quality code through cleaner interfaces. And higher quality code is a worthwhile cause indeed because it decreases the number of bugs.

But that brings us to how we do avoid regressions and bugs, and to what I think is the one way to do that (apart from continuous refactoring and continuous code reviews): system tests and integration tests. System tests and their often-avoided cousins, integration tests, are automated tests that exercise a group of code units and their combined behavior.

So, what constitutes a well-written unit, system and integration test then?

For unit tests, these practices usually lead to reasonable tests:

For each functionality in the code unit, test a basic value that should work (to see that it works) and a value that should not work (to see that it handles bad input correctly).

Also, mock out everything not in the code unit. If you cannot do that, then the code is most likely too interdependent on other pieces of code.

Avoid unnecessary asserts like the plague. I know what you are thinking (but it cannot hurt!). In my opinion, unit tests are part of the design specification, created to test a very specific piece of functionality. If you push in a bunch of extra checks, that commonly means that you do not know what you are testing.
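
To make these three points concrete, here is a minimal Python sketch – the DiscountService and its rate provider are invented for the example: one value that should work, one that should not, and the only external dependency mocked out.

    # Invented example: a unit test with one good value, one bad value,
    # and the code unit's single dependency mocked out.
    import unittest
    from unittest.mock import Mock

    class DiscountService:
        def __init__(self, rate_provider):
            self.rate_provider = rate_provider

        def discounted_price(self, price):
            if price < 0:
                raise ValueError("price must be non-negative")
            return price * (1 - self.rate_provider.current_rate())

    class TestDiscountService(unittest.TestCase):
        def setUp(self):
            self.rates = Mock()                          # not the real provider
            self.rates.current_rate.return_value = 0.10
            self.service = DiscountService(self.rates)

        def test_valid_price_is_discounted(self):
            self.assertAlmostEqual(self.service.discounted_price(100.0), 90.0)

        def test_negative_price_is_rejected(self):
            with self.assertRaises(ValueError):
                self.service.discounted_price(-1.0)

    if __name__ == "__main__":
        unittest.main()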

For system and integration tests I recommend the following:

For each functionality in a system or integration test, use good, reasonable data (to see that it works), data that should not work (to see that it reacts to problems sensibly), as many boundaries as you can find (this is usually where the bugs are), and, if you are dealing with networks, as weird a load as you can easily simulate (this does not only mean a high load; sending data in the wrong order or unevenly is easy to do and a tough enough test).

Toss in as many asserts as you can think of; these are system tests and should be treated as a fishing expedition, so see what you can find.

Test with different configurations, and change them on the fly.

Avoid mocks as far as you can at this stage.
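
For contrast, here is a sketch of the integration style – again with invented classes: the service and a real repository exercised together, no mocks, several asserts, and good data, a boundary value and bad data in one flow.

    # Invented example: an integration-style test of two units together,
    # with no mocks and asserts sprinkled along the way.
    import unittest

    class OrderRepository:
        def __init__(self):
            self._orders = {}

        def save(self, order_id, quantity):
            self._orders[order_id] = quantity

        def load(self, order_id):
            return self._orders[order_id]

    class OrderService:
        MAX_QUANTITY = 1000

        def __init__(self, repository):
            self.repository = repository

        def place_order(self, order_id, quantity):
            if not 1 <= quantity <= self.MAX_QUANTITY:
                raise ValueError("quantity out of range")
            self.repository.save(order_id, quantity)

    class TestOrderFlow(unittest.TestCase):
        def setUp(self):
            self.repo = OrderRepository()
            self.service = OrderService(self.repo)

        def test_order_flow(self):
            # Good data goes in and can be read back
            self.service.place_order("A-1", 5)
            self.assertEqual(self.repo.load("A-1"), 5)
            # Boundary value is accepted
            self.service.place_order("A-2", OrderService.MAX_QUANTITY)
            self.assertEqual(self.repo.load("A-2"), OrderService.MAX_QUANTITY)
            # Bad data is rejected and must not end up in the repository
            with self.assertRaises(ValueError):
                self.service.place_order("A-3", 0)
            with self.assertRaises(KeyError):
                self.repo.load("A-3")

    if __name__ == "__main__":
        unittest.main()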

Author: Tony Meijer

User acceptance testing done wrong

If you’re a test manager for a software product that has business customers, you’ve gotten it: The request for a suitable set of your test cases that the customer can run on site to do their UAT, or User Acceptance Test. Usually, their managers demand that they do UAT before they accept that the contract has been fulfilled.

This way of doing UAT is wrong on so many levels. I have usually responded to such requests with an explanation of why this is a bad idea both for my own company and for the customer.

Here’s why it’s a bad idea:

User acceptance testing has two aspects, tied to two different understandings of the word “user”. The aspect typically intended in a UAT-prescribing software delivery contract understands the “user” as “the company buying the software”, and the testing is aimed at checking that functional and performance requirements are fulfilled.

The other aspect refers to the actual end users of the software, where the intention is to verify that the software does what the users need it to do, in a way that makes sense to them. There is also a psychological aspect: letting end users become familiar with the new software, and feel that they have a say in the process of building something they will be made to use in the future.

So why is it so bad for the vendor to provide the test cases they have written anyway, saving the customer the time and hassle of creating their own?

It is bad for the customer, because the bugs that slip through internal QA are usually the ones that are triggered by customer workflows or data that the internal QA team had no idea would be used. By repeating the same set of tests in UAT, the bugs continue to go undiscovered.

It is also, mostly, a waste of time. While there is some merit to running test cases in a production or near-production environment instead of in a test lab, the actual gain from repeating the exact same tests is likely to be low. If you are going to spend time and resources doing testing, you want to maximise the value you get out of it.

It is also bad for the vendor. This may seem counterintuitive, at least to sales folks. After all, if you give the customer a set of tests that you know will pass, the contract is fulfilled and the vendor gets their money. Everyone’s happy, right?

Wrong. The sales guy is happy, because he gets his quota fulfilled. Everyone else is miserable. Those bugs I mentioned further up, that go undiscovered in UAT, won’t stay undiscovered forever. Their discovery has just been delayed to where you really don’t want it: In production.

Fixing bugs in production is expensive. The fixes are often urgent, which means that developers who were done with that project and have moved on to the next one have to be pulled out, causing delays in their current project. They require patching or updating the production environment, which may mean downtime for the customer, and usually means meticulous planning and overtime for the support staff. And, of course, the users’ first impression of your buggy product will take a long time to mend.

The next time your friendly sales or project manager comes to you and asks for UAT test cases, politely explain to them, and ask them to explain to the customer, why that is not in either party’s best interest. Offer to supply all your documentation of the testing your team has done, while explaining why it is a really good idea for the customer to design their own tests for UAT.

If they still insist, company policy may require you to go along with the request. If that happens, however, I strongly suggest you take the policy up with your managers for future reference. Learn to use catchphrases like ROI when you discuss this; they really like that.

Test better: Do customer support

Earlier today, I had a conversation with one of my excellent, new colleagues from a small company my employer recently acquired. New colleagues bring new perspectives. This particular company has practiced something that a lot of software companies would benefit from: A large portion of their developers and project folks have started out in customer support.

Customer support is great. Everyone should do it on a regular basis. Developers, certainly, and definitely testers. There is no better way to learn about all the weird ways customers configure and integrate the systems you are developing, the workflows they employ, and which things tend to go wrong out there in the real world. All of it is information that should feed into the testers and the test process far more than it usually does.

Too many R&D folks like us hardly ever meet the customers. Perhaps during the planning phase, or on a guided tour of the customer’s facilities, observing users over their shoulders. Until you’ve been in there trying to solve an actual, complicated problem the customer is facing, you have seen nothing. Thus, I find myself on the phone with an overseas customer in the evening, remoting into their systems and running Procmon to nail a strange problem that only occurs on some users’ computers and never in our labs – and enjoying it.

If you are a manager of testers and developers and don’t want them to spend time doing work that the regular support folks do anyway, typically for less money, think again. There is no amount of training, conferences and courses that will teach them what they learn doing support for their own product. In the end, this is the best customer service you can provide – an organization that really understands what your customers need, top to bottom.

The five whys and the which

After the product ships and the bug reports come trickling in, the question that always comes back to haunt the testers is “Why did this bug slip through?”

There are some obvious answers. “We didn’t have time to test this” is common, as is “We didn’t have the equipment”.

“We had no idea the product could even do that” is another, particularly for the more complex applications with tricky integrations and sneaky business logic.

The meticulous tester will then go on to create a specific regression test covering the case in the bug report. Too often, in my experience, we stop there.

A standard quality improvement process to use in this case is the Five Whys. Why did we not know the product could do that? The documentation was missing. The requirements were missing. The testers didn’t know the business domain well enough to realize that this was something the customers would want to do. The feature had low discoverability. Continue following the whys all the way down until it’s no longer useful, then try to do something about what you find. Books have been written on this topic, so I won’t go into details.

What I wanted to bring up, though, is an important question to ask that is NOT a why, but a “which”:

Which other tests did we fail to do?

Just like bugs, blind spots in testing rarely appear alone. If a bug report triggers only a single new regression test, be wary. There’s almost certainly some other, related functionality you missed the first time around. The whys above can help you find the extent of the blind spot. Make sure it is covered for the next release!

This process is also called root cause analysis, but I prefer the whys, because “root cause analysis” has the connotation to me of following a step-by-step formula to end up with a specific answer that logically follows from the premises, while “why” sparks creativity. Your connotation mileage may vary.

My favorite tool: Wireshark

Early in my career as a tester, a developer gave me a ten-minute introduction to Ethereal (as it was called then). I don’t remember why – we were probably trying to debug something. But ever since, Wireshark (as it is called now) has been my most trusted companion. Hardly a day, and never a week, goes by without me poring over TCP dumps.

Curiosity is an important characteristic in a skilled tester. And there is really nothing that can satisfy curiosity like raw network logs, at least not when you’re working with any kind of server-client or other networked architecture. The error in application A – does it happen because B sent the wrong data to A? Or maybe A sent a wrong request to B in the first place? Or perhaps B is sending the correct data but A is interpreting it incorrectly? Wireshark will tell you!
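
To give a flavor of what that looks like in practice – the addresses and ports are made up for the example – these are the kinds of Wireshark display filters I find myself typing:

    Only the conversation between A and B:
        ip.addr == 10.0.0.5 && tcp.port == 8080

    Did B answer with an HTTP error?
        http.response.code >= 400

    Is something getting lost and retransmitted on the wire?
        tcp.analysis.retransmission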

Application logs can only take you so far. Even if the application offers ridiculously verbose debug log levels (as some do, which I love), you still have to trust the application’s presentation of the data. Sometimes that’s enough, but even then it is often faster to whip out Wireshark than to configure the required log levels, restart the applications, and then set everything back to normal afterwards to be free of the disk-space-eating log file monster.

And network tracing is not just good for debugging! It is an excellent way to get to know and understand the APIs and communication protocols used by the applications you’re testing, and spot opportunities for new tests. Wireshark and similar tools can seem daunting at first, but they’re really not that hard to get started with, provided you have a basic grasp of networking concepts. Check out http://www.wireshark.org/docs/ to get started!