One of the core lessons from decades of HCI research and user experience design is that, even with the most experienced designers and meticulous processes, the rich complexity of real people means we do not get things right first time. This chapter looks at the evaluation of potential designs and deployable systems, including the techniques that can be applied and the different purposes for which evaluation is used. In some cases the purpose is ‘summative’, an acceptance test before deploying or delivering a system to a client. More often it is ‘formative’, finding potential problems in order to make things better. Testing with real users is usually the gold standard, but it is not always possible or sensible. We will discuss options for testing with real users, including more controlled experiments ‘in the lab’ and more realistic evaluation ‘in the wild’. We will also look at tools and techniques for expert evaluation, including heuristics and walkthrough methods, as well as semi-automated evaluation, especially for accessibility. Evaluation does not stop when a system is deployed. Logs of real usage can be analysed to tune systems or prompt major cycles of re-design, and minor variants may be deployed simultaneously and compared using A/B testing.
Contents
- Role of evaluation
- End-user testing
- Lab vs in the wild
- Control in the wild
- Ecological validity in the lab
- Novel technology, lash-ups and Wizard of Oz
- Online evaluation
- Friends and fun
- Measuring and recording — quantitative vs qualitative
- The metric is the goal
- Quantitative methods
- Qualitative methods
- Mixed methods — strength through diversity
- Evaluation without users
- Existing knowledge
- Expert evaluation
- Automated tools
- Long-term evaluation
- Post-deployment evaluation
- Chapter Keypoints
- Additional reading
Glossary items referenced in this chapter
actor-network theory, agile software development, AI-based systems, AI-based tools, alt text, attention, human, augmented reality, Awen Institute, coding in inductive methods, cognitive explanations, colour blindness tools, confabulation, constructive learning theory, control, experimental, convenience sample, cussed participants, dialectic (re)coding, direct observation, eating your own dogfood, end-to-end measures, episodic interactions, exploratory evaluation, formative evaluation, GPS, grounded theory, halo-effect, iterative development, lash-up technology, long-term benefits, long-term tests, mechanism, observable behaviour, online evaluation, online participant platforms, Perceptual Experience Laboratory, physiological measures, Post-it notes, post-task interview, post-task reflection, post-test questionnaire, prototype, questionnaires, reaction times, Sea Hero Quest, simulated user, summative evaluation, surveys, think aloud, triangulation, usability lab, user satisfaction, user testing, validation, video annotations, virtual reality, VR cave, W3C accessibility guidelines, within-subjects, Xerox Star