
In the fall of 2012, I seized the opportunity to do some research I’ve wanted to do for a long time. Millions of users would be available and motivated to take part. But I needed to figure out how to do a very large study in a short time. By large, I’m talking about reviewing hundreds of websites. How could we make that happen within a couple of months?

Do election officials and voters talk about elections the same way?

I had BIG questions. What were local governments offering on their websites, and how did they talk about it? And, what questions did voters have?  Finally, if voters went to local government websites, were they able to find out what they needed to know?

Brain trust

To get this going, I enlisted a couple of colleagues and advisors. Cyd Harrell is a genius when it comes to research methods (among other things). Ethan Newby sees the world in probabilities and confidence intervals. Jared Spool came up with the cleverest twist, which kept us from falling back on evaluation techniques we were prone to use purely out of habit. Great team, but I knew we weren’t enough to do everything that needed doing.

Two phases of research: What first, then whether

We settled on splitting the research into 2 steps. First, we’d go look at a bunch of county election websites to see what was on them. We decided to do this by simply cataloging the words in links, headings, and graphics on a big pile of election sites. Next, we’d do some remote, moderated usability test sessions, asking voters what questions they had and then observing as they looked for satisfactory answers on their local county websites.

Cataloging the sites would tell us what counties thought was important enough to put on the home pages of their election websites. It also would reveal the words used in the information architecture. Would the labels match voters’ mental models?

Conducting the usability test would tell us what voters cared about, giving us a simple mental model. Having voters try to find answers on websites close to them would tell us whether there was a gap between how election officials talk about elections and how voters think about elections. If there was a gap, we could get a rough measure of how wide the gap might be.

When we had the catalog and the usability test data, we could look at what was on the sites and where it appeared against how easily and successfully voters found answers. (At some point, I’ll write about the usability test because there were fun challenges in that phase, too. Here I want to focus on the cataloging.)

Scoping the sample

Though most of us only think of elections when it’s time to vote for president every four years, there are actually elections going on all the time. Right now, at this very moment, there’s an election going on somewhere in the US. And, contrary to what you might think, most elections are run at the county or town level. There are a lot of counties, boroughs, and parishes in the US. And then there’s Wisconsin and New England, where elections are almost exclusively run by towns. There are about 3,057 counties or equivalents. If you count all the towns and other jurisdictions that put on elections in the US and its territories and protectorates, there are over 8,000 voting jurisdictions. Most of them have websites.

We decided to focus on counties or equivalents, which brings us back to roughly 3,000 to choose from. The question then was how to narrow the sample to be big enough to give us reliable statistics, but small enough to gather the data within a reasonable time.

So, our UX stats guy, Ethan, gave us some guidance: 200 counties seemed like a reasonable number to start with. Cyd created selection criteria based on US Census data. In the first pass, we selected counties based on population size (highest and lowest), population density (highest and lowest), and diversity (majority white or majority non-white). We also looked across geographic regions. When we reviewed which counties showed up under which criteria, we saw several duplicates. For example, Maricopa County, Arizona is highly populated, densely populated, and majority non-white. When we removed the duplicates, we had 175 counties left.
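
(If you’re curious what that selection pass looks like mechanically, here’s a minimal sketch in Python. The census file, the column names, and the per-criterion cut-off are all hypothetical; it only illustrates the pattern of pulling the extremes of each measure and then collapsing the overlap.)

```python
# Minimal sketch of the county-selection pass. The census extract, column
# names, and the per-criterion cut-off (N) are hypothetical.
import pandas as pd

census = pd.read_csv("county_census_extract.csv")   # hypothetical extract
N = 25                                               # illustrative cut-off

candidates = pd.concat([
    census.nlargest(N, "population"),                # most populous
    census.nsmallest(N, "population"),               # least populous
    census.nlargest(N, "density"),                   # most dense
    census.nsmallest(N, "density"),                  # least dense
    census[census["pct_nonwhite"] > 50.0].sample(N, random_state=0),  # majority non-white
])

# Counties like Maricopa show up under several criteria; keep each one once.
sample = candidates.drop_duplicates(subset=["state", "county"])
print(len(sample), "counties after removing duplicates")
```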

The next step was to determine whether they all had websites. Here we had one of our first insights: Counties with populations somewhere between 7,000 and 10,000 are less likely to have websites about elections than counties that are larger. We eliminated counties that either didn’t have websites or had a one-pager with the clerk’s name and phone number. This brought our sample down to 147 websites to catalog. Insanely, 147 seemed so much more reasonable than 200.

One more constraint we faced was timing. Election websites change all the time, because, well, there are elections going on all the time. Because we wanted to do this before the 2012 Presidential election in November, we had to start cataloging sites in about August. But with just a few people on the team, how would we ever manage that and conduct usability test sessions?

Crowd-sourced research FTW

With 147 websites to catalog, if we could get helpers to do 5 websites each, we’d need about 30 co-researchers. Could we find people to give us a couple of hours in exchange for nothing but our undying gratitude?

I came to appreciate social networks in a whole new way. I’ve always been a big believer in networking, even before the Web gave us all these new tools. The scary part was asking friends and strangers for this kind of favor.

Fortunately, I had 320 new friends from a Kickstarter campaign I had conducted earlier in the year to raise funds to publish a series of little books called Field Guides To Ensuring Voter Intent. Even though people had already backed the project financially, many of them told me that they wanted to do more, to be directly involved. Twitter and Facebook seemed like options for sources of co-researchers, too. I asked, and they came. Altogether, 17 people cataloged websites.

Now we had a new problem: We didn’t know the skills of our co-researchers, and we didn’t want to turn anyone away. That would just be ungrateful.

A good data collector, some pilot testing, and a little briefing

Being design researchers, we all wanted to evaluate the websites as we were reviewing and cataloging them. But how do you deal with all those subjective judgements? What heuristics could we apply? We didn’t have the data to base heuristics on. And though Cyd, Ethan, Jared, and I have been working on website usability since the dawn of time, these election websites are a category of their own: not like e-commerce sites, and not exactly like information-rich sites either. Heuristic evaluation was out of the question. As Jared suggested (and here’s the twist), we would let the data speak for itself rather than evaluate the information architecture or the design. After we got over the idea of evaluating, the question was how to proceed. Without judgement, what did we have?

Simple data collection. It seemed clear that the way to do the cataloging was to put the words into a spreadsheet. The format of the spreadsheet would be important. Cyd set up a basic template that looked amazingly like a website layout. It had different regions that reflected different areas of a website: banner, left column, center area, right column, footer. She added color coding, instructions, and examples.

I wrote up a separate sheet with step-by-step instructions and file naming conventions. It also listed the simple set of codes to mark the words collected. And then we tested the hell out of it. Cyd’s mom was one of our first co-researchers. She had excellent questions about what to do with what. We incorporated her feedback in the spreadsheet and the instructions, and tried the process and instruments out with a few other people. After 5 or 6 pilots, when we thought we’d smoothed out the kinks, we invited our co-researchers to briefing sessions through GoToMeeting, and gave assignments.
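
(For readers who think in data structures: each catalog entry boiled down to a very small record. Here’s a hypothetical sketch of one such record; the region names come from the template described above, but the element codes and field names are illustrative, not our actual spreadsheet.)

```python
# Hypothetical sketch of one cataloged item; the real instrument was a
# color-coded spreadsheet template, not code.
from dataclasses import dataclass

REGIONS = {"banner", "left column", "center area", "right column", "footer"}
ELEMENTS = {"link", "heading", "graphic"}   # illustrative code list

@dataclass
class CatalogItem:
    county: str    # e.g., "Example County, XX"
    region: str    # where on the home page the words appeared
    element: str   # what kind of element carried the words
    words: str     # the label text, copied verbatim, no judgement applied

    def __post_init__(self):
        # Keep entries consistent without asking collectors to evaluate anything.
        assert self.region in REGIONS, f"unknown region: {self.region}"
        assert self.element in ELEMENTS, f"unknown element code: {self.element}"

item = CatalogItem("Example County, XX", "center area", "link", "Find my polling place")
```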

To our delight, the data that came back was really clean and consistent. And there were more than 8,000 data items to analyze.
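
(Once the rows are that simple, the first pass of analysis is mostly counting. A sketch of the kind of tally you could run over a combined export of the catalog, using the same hypothetical field names as above:)

```python
# Count how often each label appears, and in which page regions, across the
# cataloged sites. Field names match the hypothetical record sketched above.
import csv
from collections import Counter

label_counts = Counter()
region_counts = Counter()

with open("catalog.csv", newline="") as f:           # hypothetical combined export
    for row in csv.DictReader(f):
        words = row["words"].strip().lower()
        label_counts[words] += 1
        region_counts[(words, row["region"])] += 1

# The 25 most common labels across all the home pages
for words, count in label_counts.most_common(25):
    print(f"{count:4d}  {words}")
```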

Lessons learned: focus, prepare, pilot, trust

It’s so easy in user research to just say, Hey, we’ll put it in front of people and ask a couple of questions, and we’ll be good.  I’ve been a loud voice for a long time crying, Just do it! Just put your design in front of users and watch. This is good for some kinds of exploratory, formative research where you’re early in a design.

But there’s a place, too, for a specific, tightly bounded, narrowly scoped, and thoroughly designed research study. We wanted to answer specific questions at scale. This takes a different kind of preparation from a formative study. Getting the data collection right was key to the success of the project.

To get the data collection right, we had to take out as much judgement as possible, for 2 reasons:

• we wanted the data to be consistently gathered

• we had people whose skills we didn’t know collecting the data

Though the findings from the study are fascinating (at least to me), what makes me proud of this project is how we invited other people in. It was not easy letting go. But I just couldn’t do it all. I couldn’t have done it all even with the help of Cyd and Ethan. Setting up training helped. Setting up office hours helped. Giving specific direction helped. And now 17 people own parts of this project, which means 17 people can tell at least a little part of the story of these websites. That’s what I want out of user research. I can’t wait to do something like this with a client team full of product managers, marketers, and developers.

If you’d like to see some stats on the 8,000+ data items we collected, check out the slide deck Ethan Newby created, which lays out when, where, and how often key words that might help voters answer their questions appeared on 147 county election websites in November 2012.

When I say “usability test,” you might think of something that looks like a psych experiment, without the electrodes (although I’m sure those are coming, as teams think that measuring biometrics will help them understand users’ experiences). Anyway, you probably visualize a lab of some kind, with a user in one room and a researcher in another, watching either through one-way glass or on a monitor.

It can be like that, but it doesn’t have to be. In fact, I’d argue that for early designs it shouldn’t be like that at all. Instead, usability testing should be done wherever and whenever users normally do the tasks they’re trying to do with a design.

Usability testing: A great tool
It’s only one technique in the toolbox, but in doing usability testing, teams get crisp, detailed snapshots of user behavior and performance. As a bonus, gathering data from users by observing them do tasks can resolve conflict within a design team or assist in decision-making. The whole point is to inform the design decisions that teams are making already.

Lighten up the usability testing methodology
Most teams I know start out thinking that they’re going to have a hard time fitting usability testing into their development process. All they want is to try out early ideas, concepts, and designs or prototypes with users. But reduced to its essence, usability testing is simple:

  • Develop a test plan and design
  • Find participants
  • Gather the data by conducting sessions
  • Debrief with the team

That test plan/design? It can be a series of lists or a table. It doesn’t have to be a long exposition. As long as the result is something that everyone on the team understands and can agree to, you have written enough. After that, improvising is encouraged.

The individual sessions should be short and focused on only one or two narrow issues to explore.

But why bother to do such a quick, informal test?
First, doing any sort of usability test is good for getting input from users. The act of doing it gets the team one step closer to supporting usable design. Next, usability testing can be a great vehicle for getting the whole team excited about gathering user data. There is nothing like seeing a user use your design without intervention.

Most of the value in doing testing – let’s say about 70% – comes from just watching someone use a design. Another valuable aspect is the team working together to prepare for a usability test. That is, thinking about what Big Question they want answered and how to answer it. When those two acts align, having the team discuss together what happened in the sessions just comes naturally.

When not to do testing in the wild: Hard problems or validation
This technique is great for proving concepts or exploring issues in formative designs. It is not the right tool if the team is facing subtle, nuanced, or difficult questions to answer. In those cases, it’s best to go with more rigor and a test design that puts controls on the many possible variables.

Why? Well, in a quick, ad hoc test in the wild, the sample of participants may be too small. If you have seized a particular opportunity (say, with a seatmate on an airplane or a bus, as I have been known to do – yeah, you really don’t want me to sit next to you on a cross-country flight), a sample of one may not be enough to instill confidence with the rest of the team.

It might also happen, because the team is still forming ideas, that the approach to conducting sessions is not consistent from session to session. When that happens, it isn’t necessarily bad. It just means it can be difficult to draw meaningful inferences about what the usability problems are and how to remedy them.

If the team is okay with all that and ready to say, “let’s just do it!” to usability testing in the wild, the remedy is simple: just do more sessions.

So, there are tradeoffs
What might a team have to consider in doing quick, ad hoc tests in the wild rather than a larger, more formal usability test? If you’re in the right spot in a design, doing usability testing in the wild is, for me, a total win:

  • You have some data, rather than no data (because running a larger, formal test is daunting or anti-Agile).
  • The team gets a lot of energy out of seeing people use the design, rather than arguing among themselves in the bubble of the conference room.
  • Quick, ad hoc testing in the wild snugs nicely into nearly any development schedule; a team doesn’t have to carve out a lot of time and stop work to go do testing.
  • It can be very inexpensive (or even free) to go to where users are to do a few sessions, quickly.

Usability testing at its essence: something, someone, and somewhere
Just a design, a person who is like the user, and an appropriate place – these are all a team needs to gather data to inform their early designs. I’ve seen teams whip together a test plan and design in an hour and then send a couple of team members to go round up participants in a public place (cafes, trade shows, sporting events, lobbies, food courts). Two other team members conduct 15- to 20-minute sessions. After a few short sessions, the team debriefs about what they saw and heard, which makes it simple to agree on a design direction.

It’s about seizing opportunity
There’s huge value in observing users use a design that is early in its formation. Because it’s so cheap and so quick, there’s little risk of drawing the wrong inferences from the observations: a team can compensate for the informality of the format by doing more testing, either more sessions or another round of testing as follow-up. See a space or time and use it. It only takes four simple steps.

Last winter I worked with a team that wanted to find out whether a prototype they had designed for a new intranet worked for users. Their new design was a radical change from the site that had been in place for five years and in use by 8,000 users. Going to this new design was a big risk. What if users didn’t like it? Worse, what if they couldn’t use it?

We went on tour. Not to show the prototype, but to test it. Leading up to this moment we had done heaps of user research: stakeholder interviews, field observations (ethnography, contextual inquiry – pick your favorite name), card sorting, taxonomy testing. We learned amazing things, and as our talented interaction designer started translating all that into wireframes, we got pressure to show them. We knew what we were doing. But we wanted to be sure. So we made the wireframes clickable and strung them together to make them feel like they were doing something. And then we asked (among other things):

  • How well does the design support the tasks of each user group?
  • How easily do users move through the site for typical tasks?
  • Where do they take wrong turns? What trigger words are missing? What trigger words are wrong?

Validating the research
In some ways, you could look at this as a validation test – not validating the design necessarily, but instead validating the user research we had done. Did we interpret our observations correctly and draw the right inferences, the ones that got us to this design?

What was possible: where the design might break
To find out, we had to answer those Big Questions. What were the issues within them that we wanted to investigate? Let’s take an example: How easily do users move through the site for typical tasks? We wanted to know whether users took the same path we wanted them to take, and if they didn’t, why not. On a task to find forms to open a brokerage account, we listed the possible issues. Users might

  • start at the wrong place in the site
  • get lost
  • pick the wrong form
  • not recognize they’ve reached the right place

From that discussion of the disasters that we could imagine came a list of behaviors to observe for, or as my friends at Tec-Ed say, issues to explore:

  • Where do participants start the task?
  • How easily do participants find the right form? How many wrong turns do they take on the way? Where in the navigation do they make wrong turns?
  • How easily and successfully do they recognize the form they need on the gallery page?
  • How well do participants understand where they are in the site?

What we saw
From these questions, we learned that we got the high-level information architecture right – most participants recognized where to enter the site to find the forms. We also learned that there were a couple of spots in the task path that had a combination of weak trigger words and other distractions that drew attention away from the things that would have gotten participants to the goal more quickly. But the groupings on the gallery page were pretty successful; most participants picked the right thing the first or second time. It was easy to see all of this in the way participants performed, but we also heard clues from them about what they were looking for and why.

And, by the way, the participants loved it. We knew because they said so.

In a clear and thoughtful article in the May 2007 issue of the Journal of Usability Studies (JUS), put out by the Usability Professionals’ Association, Rich Macefield blasts the popular myths around the legendary Hawthorne effect. He goes on to explain very specifically how no interpretation of the Hawthorne effect applies to usability testing.

Popular myth – and Mayo’s (1933) original conclusion – says that human subjects in any kind of research will perform better just because they’re aware they’re being studied.

Several researchers have reviewed the original study that generated the finding, and they say that’s not what really happened. Parsons (1974) was the first to say that the improvement in performance of subjects in the original study was more likely due to feedback they got from the researchers about their performance and what they learned from getting that feedback.

Why it doesn’t apply to usability tests

Macefield convincingly demonstrates why the Hawthorne effect just doesn’t figure in to well designed and professionally executed usability tests:

  • The Hawthorne studies were longitudinal; most usability tests are not.
  • The Hawthorne subjects were experts at their work; most usability test participants are novices at whatever they’re using, because it’s new.
  • The metrics used in the Hawthorne studies were different from those used in most usability tests.
  • The subjects in the Hawthorne studies had horrible, boring jobs, so they may have been motivated to perform better because of the attention they got from researchers; in usability tests, it’s just as possible that participants experience being included as an unwanted interruption, or that they’re only doing the test to get paid.
  • The Hawthorne subjects may have thought that taking part in the study would improve their chances for raises or promotions; the days of usability test participants thinking that their participation in studies might help them get jobs are probably over.

What about feedback and learning effects?

We want feedback to be part of a good user interface, don’t we? Yes. And we want people to learn from using an interface, don’t we? Again, yes. But, as Macefield says, let’s make sure that all the feedback and learning in a usability test comes from the UI and not from the researcher/moderator. To get at the causes of problems, rely instead on qualitative data, such as the verbal protocol from participants’ thinking aloud, to see how they’re thinking about the problem.

Look at effects across tasks or functions

Macefield suggests that if you’re getting grief, add a control group to compare against and then look at performance across tasks. For example, you might expect that the test group (using an “improved” UI) would be more efficient or effective than the control group on every element of a test. But it’s possible that the test group did better on one task while both groups had a similar level of problems on a different task. If that happens, it’s unlikely that the moderator gave feedback or prompted learning that created the effect of improved performance, because such an effect should be global, across tasks and across groups.
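
(Here’s a rough illustration of that comparison, not from Macefield’s article: lay out per-task success rates for the two groups and look for task-specific differences. The group labels, task names, and numbers are made up.)

```python
# Rough illustration (made-up numbers): per-task success rates for a control
# group and a test group. A task-specific difference argues against a global
# Hawthorne-style boost, which would lift every task for the test group.
from statistics import mean

# 1 = task completed, 0 = not completed, one value per participant
results = {
    ("control", "find form"):   [1, 0, 0, 1, 0, 1, 0, 0],
    ("test",    "find form"):   [1, 1, 1, 0, 1, 1, 1, 0],   # test group clearly better
    ("control", "update info"): [0, 1, 0, 0, 1, 0, 0, 1],
    ("test",    "update info"): [1, 0, 0, 1, 0, 0, 1, 0],   # both groups struggle here
}

for (group, task), outcomes in sorted(results.items(), key=lambda kv: kv[0][1]):
    print(f"{task:12s} {group:8s} success rate: {mean(outcomes):.0%}")
```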

Macefield closes the article with a couple of pages that could be a lesson out of Defense Against the Dark Arts, setting out very specific ways to argue against any assertion that your findings might be “contaminated.” But don’t just zoom to the end of the piece. The value of the article is in knowing the whole story.

I get a lot of clients who are in a hurry. They get to a point in their product cycle where they’re supposed to have done some usability activity to exit the development phase they’re in, and now they find they have to scramble to pull it together. How long can it take to arrange and execute a discount usability test, anyway?

Well, to do a usability test right, it does take a few steps. How much time those steps take depends on your situation. Every step in the process is useful.

The steps of a usability test
Jeff Rubin and I think the process for conducting a usability test has these steps:

  1. Develop a test plan
  2. Set up the testing environment and plan logistics
  3. Find and select participants
  4. Prepare test materials
  5. Conduct the sessions
  6. Debrief participants and observers
  7. Analyze data and observations
  8. Create findings and recommendations

Notice that “develop a test plan” and “prepare test materials” are different steps.

It might seem like a shortcut to go directly to scripting the test session without designing the test. But the test plan is a necessary step.

Test plan or test design?
There’s a planning aspect to this deliverable. Why are you testing? Where will you test? What are the basic characteristics of the participants? What’s the timing for the test? For the tasks? What other logistics are involved in making this particular test happen? Do you need bogus data to play with, user IDs, or other props?

To some of us, a test design would be about experiment design. Will you test a hypothesis or is this an exploratory test? What are your research questions? What task scenarios will get you to the answers? Will you compare anything? If so, is it between subjects or within subjects? Will the moderator sit in the testing room or not? What data will you collect and what are you measuring?

It all goes together.

Why not just script the session without writing a plan?
Having a plan that you’ve thought through is always useful. You can use the test plan to get buy-in from stakeholders, too. As a representation of what the study will be, it’s like reviewing the blueprints and renderings before you give the building contractor approval to start building.

With a test plan, you also have a tool for documenting requirements (a frozen test environment, anyone?) for the test and a set of unambiguous details that define the scope of the test. Here, in a test plan, you define the approach to the research questions. In a session script, you operationalize the research questions. Writing a test plan helps you know what you’re going to collect data about and what you’re going to report on, as well as what the general content of the report will be.

Writing a test plan (or design, or whatever you want to call it) will give you a framework for the test in which a session script will fit. All the other deliverables of a usability test stem from the test plan. If you don’t have a plan, you risk using inappropriate participants and getting unreliable data.