We Need Some New Categories Of Testing Software For Humans
During my research for the just-published document "For Developers, Dog Food And Champagne Can't Be The Only Items On The Menu," I had an interesting conversation with the team at Adobe that handles internal pilots, which in their case entails more than just putting the next version of an Adobe product on the network for people to try. Instead of the typical "spaghetti against the wall" approach to "eating your own dog food" (to mix food metaphors), the Adobe team actively looks for use cases that fit the product. If a product like Flex or Photoshop is a tool, then it should be the right tool for some job. (And if you can't find any use for the software, you're definitely in trouble.)
This approach requires more work than the "spaghetti against the wall" approach, but it pays dividends for many different groups. The product team identifies functionality gaps or usability flaws. Marketers and salespeople have a much easier time figuring out what to demo. As a result, Adobe stands a better chance of both building technology that's compelling, and then explaining what's compelling about it to potential customers.
The Adobe approach stands out because most internal pilots don't have very clear objectives or rigorous approaches to achieving them. The phrases "eating your own dog food" and "drinking your own champagne" imply a measure of discipline, since they describe an extra step that teams are not strictly obliged to take. Unfortunately, internal pilots are usually anything but disciplined, especially when compared to other forms of testing.
For example, if an internal development team builds a custom wiki application, we expect the team to subject the code to a battery of tests. Functional testing, unit testing, regression testing, acceptance testing – we know exactly what these terms mean. We know how to measure them, along dimensions like test coverage. As soon as we get into testing that involves real human beings using the code, we accept an enormous drop in specificity and rigor. If the team uses the wiki for its own purposes, there's no standard by which we judge the right amount of time for this testing, or how much "coverage" of the functionality is required. We also don't get too worried that there may be a huge difference between how IT people might use a wiki, and how someone in finance, customer support, legal or shipping would use it. We shouldn't be surprised, therefore, that no matter how generous a helping of dog food the dev team consumes, the technology delivered doesn't completely satisfy users' tastes.
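To make the contrast concrete, here is a minimal sketch of the kind of automated test we take for granted on the code side. The wiki helper function and its markup rules are hypothetical, invented only for illustration; the point is that the expectations are explicit, repeatable, and countable toward coverage, which is exactly what the human side of the pilot lacks.

```python
import re
import unittest


def render_wiki_links(text: str) -> str:
    """Replace [[Page Name]] markup with an HTML link (hypothetical wiki helper)."""
    def to_link(match: re.Match) -> str:
        page = match.group(1).strip()
        slug = page.replace(" ", "_")
        return f'<a href="/wiki/{slug}">{page}</a>'

    return re.sub(r"\[\[([^\]]+)\]\]", to_link, text)


class TestRenderWikiLinks(unittest.TestCase):
    # Each case pins down one concrete, repeatable expectation.
    def test_simple_link(self):
        self.assertEqual(
            render_wiki_links("See [[Home Page]]."),
            'See <a href="/wiki/Home_Page">Home Page</a>.',
        )

    def test_text_without_markup_is_unchanged(self):
        self.assertEqual(render_wiki_links("No links here."), "No links here.")

    def test_multiple_links(self):
        out = render_wiki_links("[[A]] and [[B]]")
        self.assertIn('href="/wiki/A"', out)
        self.assertIn('href="/wiki/B"', out)


if __name__ == "__main__":
    unittest.main()
```

Nothing comparable exists for the questions that matter once real people start using the wiki.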
The few methods that have any degree of rigor, such as usability testing, are a good first step, but they're hardly sufficient. I've sat through a lot of usability lab experiments, and while I've been grateful for the feedback, I was always aware of how limited the scope of the testing was. Usability testing does exactly what it says, and no more: you learn how hard it is for someone to figure out what to click to finish a task, and how many clicks the task takes. Usability testing does not tell you whether the software solves a real problem, or, in the case of packaged applications, whether someone is willing to pay for the privilege.
We accept an approach to Real Human Testing (RHT) that's inferior to Poking The Code (PTC) because we often don't know exactly what we want from RHT. What do software vendors expect from a beta program? Some mix of bug hunting, design feedback, deployment and configuration testing, and reference customer recruitment, to be sure, but which is the top priority? The different groups in the company that have a vested interest in the beta program – development, support, marketing, etc. – usually don't share the same view of what the priorities are, or should be. The beta program checklist that fellow blogger Saeed Khan developed is a great start, especially since few companies have a beta program strategy as clear in both ends and means as what he describes.
Since all new disciplines start with categorization, I'll propose a few categories of RHT that we need (a rough sketch of how they might be tracked follows the list):
- Usefulness. Does the software solve a real problem?
- Scope. In how many use cases does the software solve a problem?
- Comprehensibility. Is it clear what problem it solves?
- Usability. This is a fairly unambitious measure. You could have a very usable application that fails to impress or entice people to use it. Therefore, we need an additional measure.
- Coolness. Does the software appeal to people? The "gotta have it" factor requires a lot more than just usability.
- Client compatibility. How well does the software work with the devices and other applications that people are likely to use?
- Perceived value. For IT, this is a measure of ROI. For tech vendors, it's a measure of market potential.
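To show what a first step toward rigor could look like, here is a purely illustrative sketch of how pilot feedback might be scored against these categories. The 1-to-5 scale, the PilotFeedback structure, and the averaging are my own assumptions, not an established methodology; the point is only that these dimensions can be tracked and measured like any other form of testing.

```python
from dataclasses import dataclass, field
from statistics import mean

# Category names come from the list above; everything else is assumed for illustration.
RHT_CATEGORIES = [
    "usefulness",
    "scope",
    "comprehensibility",
    "usability",
    "coolness",
    "client_compatibility",
    "perceived_value",
]


@dataclass
class PilotFeedback:
    """One pilot participant's scores (1-5) against the RHT categories."""
    participant: str
    role: str                      # e.g. finance, legal, shipping, IT
    scores: dict = field(default_factory=dict)

    def rate(self, category: str, score: int) -> None:
        if category not in RHT_CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if not 1 <= score <= 5:
            raise ValueError("scores run from 1 (poor) to 5 (excellent)")
        self.scores[category] = score


def category_averages(responses: list) -> dict:
    """Average each category across participants, skipping unanswered categories."""
    averages = {}
    for category in RHT_CATEGORIES:
        values = [r.scores[category] for r in responses if category in r.scores]
        if values:
            averages[category] = mean(values)
    return averages


if __name__ == "__main__":
    alice = PilotFeedback("Alice", "finance")
    alice.rate("usefulness", 4)
    alice.rate("usability", 2)

    bob = PilotFeedback("Bob", "IT")
    bob.rate("usefulness", 5)
    bob.rate("coolness", 3)

    # Rolls up pilot feedback the way a test report rolls up coverage numbers.
    print(category_averages([alice, bob]))
```

Even something this crude would force an internal pilot to state what it is measuring, and for whom.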
I'm sure that I've missed some other possible categories of RHT, but it should be clear from even this cursory list how little rigor we apply once we "test" software by putting it in the hands of prospective users or customers. Checklist standards like ITIL and CMMI don't impose this type of discipline either, even though, at the end of the day, the real test of software quality is whether someone uses it, or spends money to use it.