Presented by Michelle White, QualityMetric, now part of Optum
Learn more about the crucial decisions in the design and analysis of mode equivalence studies and how these decisions can affect study conclusions. Decisions include study design (basic design, target population, recruitment, choice of modes, crossover period, etc.), analytic approaches, and principles for interpretation of results.
So when I was thinking about this and talking with Paul about what to present, I thought, oh my gosh, there are people in this room that I know have much more experience than I do in conducting mode equivalence studies. But I also thought there might be some people in the room that were less experienced, or had conducted a few studies and, like me, had tried to implement some of the suggestions from ISPOR guidelines and other places, and maybe found that, no matter how much care you took in making decisions along the way, ultimately at the end there’s just this one question that people have, which is, well, is it equivalent or is it not. And you think back to the different decisions you made and what might have made a difference in one thing or another. So, quite selfishly, I wanted to actually learn from people in the room and came up with this topic.
Our goal, as it’s more of a workshop format here, is: what are the crucial decisions made when conducting mode equivalence studies in terms of research design, analytic approaches, and interpretation? And how do the decisions made at each of these phases affect your study conclusions? The spirit of the conversation today is one of curiosity; it’s not as much instructional as it is a support group for me, kind of. And I also wanted to say that I had to narrow the focus down to what we could manage in an hour and still do in a workshop format. So I’m going to go with something I have more experience in, which is generic multi-dimensional PROs and the paper-versus-electronic discussion. I also want to do that because I feel like the multi-dimensional, generic case provides more questions and odd directions to go compared to more straightforward PROs. I’m not going to be focusing much on multiple modes or BYOD because those are being covered in other sessions.
So what is discretion? It’s actually the right to choose what should be done in a particular situation. And when I thought about that, I’m thinking, well who has discretion in a mode equivalence study. It’s really more than just you the researcher. If you think of yourself, if I think of myself, training for a race or my kids studying for a test, and you as a researcher have a lot of decisions and things that you come up with in advance in a research design and your statistical analysis plan and everything else that seems very straightforward. But other people also have discretion over your study, whether you like it or not. And I like to think of this as when you go speeding past a police officer and you’re just hoping that the officer will use that discretion and not pull you over and wait for the next fastest person, or maybe he stops and gives you a warning or the judge shows discretion. And in mode equivalence studies, that can be in a cognitive debriefing, an interviewer being faced with a question that you didn’t think of in advance, and they have to make on-the-spot decisions on how to handle that. So there’s lots of different situations we’ll talk about today that fall into that other category. Why does it matter? I think it matters because we’re all really interested in scientific integrity and ensuring that the answer we come to is true. Also, if you come to the wrong answer you could compromise drug development efforts and ultimately patients’ lives, and you have a big loss of time and money.
So I’m going to break this into two different sections. First we’re going to talk about research design and some of the discretion involved with that. And then I’m going to talk about analysis and interpretation, just to make it a little more manageable. And this is actually taken from Sonya’s mixed mode paper that she just discussed. But I like it for this because we’re talking about, how do you know whether you need to do a study or not, right? And she starts with the question, will PROs be used for regulatory submission or a labeling claim? If no, you still really should do what you would do for a labeling claim anyway. But ultimately they may not. If yes, I like that it goes to, is there published evidence of equivalence already? And unfortunately a lot of things don’t get published. But I think sometimes people don’t start with what has already been done before they get into planning mode. And that’s important.
So you have a decision first to say, will you need to do a study at all? And you should be doing some background research into that. So if we’re saying it’s going to be used for a label claim, maybe that’s more important to have it structured: what does the FDA say? Well, the PRO guidance from 2009 says that the FDA will review the original paper version and screenshots of the ePRO. And in the section that talks about any modification, it says that sponsors should provide evidence to confirm a new instrument’s adequacy, and one of the things that it talks about there is changing an instrument from paper to electronic format. But it does specify that not every small change in application or format necessitates extensive studies to document the final version’s measurement properties. And then, as Sonya discussed, it also has a comment about multiple data collection modes within a single study, which I’m not going to cover in my presentation. But you know, this is fairly vague here. Not every small change necessitates a full study. Okay, well what does? Well, when I look at what’s been done with the SF-36, which is a generic health-related quality of life instrument that I think most of you in the room are familiar with, I’ve counted so far at least 23 separate published studies of mode equivalence, using several different types of ePRO formats and different modes, that have all shown equivalence. Also, we now have three published meta-analyses of paper and ePRO mode equivalence showing that if you have a faithful migration with minor changes, you tend to get equivalence regardless of the PRO used. And in a great session at ISPOR that Paul did with Jason Lundy and Dr. Coons, they had a really great debate about how much really is necessary, or whether you should do a study at all. And they did bring up a few extra points about publication bias: there may be studies that aren’t getting out there.
And we know that all migrations are not faithful and well implemented. And BYOD presents some additional challenges. So when I have a pharmaceutical company call me and talk about what’s the evidence for the SF-36 in a particular mode, and I can say well there’s 23 different—I can say all of this about all these studies, and the question that I get is just, where is the stamp of approval. Where is your, as the developer, stamp of approval. They don’t care what else has already been done. So to me, the first question in discretion really is making that decision about should a study be done or not and, you know, do you have enough arguments to say that maybe it shouldn’t.
Then if you are going to decide to have a study, you have to think about what type of study you will do. And of course we have the great recommendations from the ePRO task force, from Dr. Coons and colleagues, with a table that goes through what types of studies should be done. We’ve got minor changes—which is sort of the thing we talked about right before lunch, in terms of changing from circling a response to checking a response on a screen or pushing a button, or moving to one item per screen rather than multiple items on a page—which would be minor and include cognitive testing and usability testing. Moderate would have some changes in item wording that might alter interpretability, and that would require equivalence testing and usability testing. And then substantial changes I’m not really going to deal with today, because I feel like that’s a full modification and goes beyond the scope. And since usability testing is required for both minor and moderate, I’m not going to talk about that either, because I think it just has to be done either way.
So that seems simple enough, right? So for me, working with the SF-36, but also with other, different PROs, most of the migrations that are done today have minor changes and they’re appropriately classified that way. People have learned a lot from past lessons and they’re doing a more faithful migration. And you say, well that’s pretty simple. So with the SF-36, the developers took lessons from past published research and formal discussions with ePRO companies and scientific experts when migrating the paper form and creating a tablet version and a handheld version—unfortunately not one version, because the tablet version is so old that it was created before people envisioned handheld-type things. And there really isn’t that much change on it. There’s a slight modification: instead of saying mark the one box on the paper version, it says select the one response. So that’s fairly flexible to use across different electronic formats; it doesn’t matter what you’re marking or checking, you’re just selecting the one response. It also changes from a grid format to a single item per screen, and it displays the response choices horizontally instead of vertically. And with the table that we just looked at, you could pretty easily say that this change makes it minor, and this change is probably still minor; it’s formatting.
But I want to talk a little bit about the grid format to single item. Because in the table we just talked about, changing from multiple items on a page to one should be minor. And you all have one page on—there’s a handout on your table that you can grab that has this, you can see it more clearly. These are examples from CRF Health’s handheld device. And you could see, the example on the far right. “During the past four weeks, how much of the time have you accomplished less than you would like as a result of your physical health?” And there’s some response choices there. But then when you compare it to the paper version, you can see, we talked about with the directionality of the response choices, but then you also have a little bit of unpacking of the words, you’re not taking the words exactly as they are, because it’s not just a list of multiple items in a row, they’re in this grid. So here you have a stem on the paper version: “During the past four weeks, how much of the time have you had any of the following problems with your work or other daily activities as a result of your physical health?— Accomplished less than you would like.” And when you move it over to ePRO, we actually sort of unpack it to make it a single question, a little less complicated. So here’s a moment of discretion here, if you’re going to do a study. Is this a minor change or is it a moderate change. Are the changes in the wording different enough to make this moderate. I’m not going to answer that yet, because you guys are.
So when you’re thinking about it, let’s think about what you would have to do, based on the decision you’re going to make. One thing that you would do if you decided it was minor would be cognitive debriefing tests. And that type of study is used to explore the ways in which members of a target population understand, mentally process, and respond to the items on a questionnaire. Some of the things that you have to think about, or have discretion in, when you’re designing a study for cognitive debriefing: what kind of population are you going to have? Well, if you were working with a sleep tool, you know that you would choose people who have some sort of sleep disorder. Maybe different types of sleep disorders. But one of the reasons I wanted to talk about what you do with a generic instrument is that it’s not quite so clear cut, right, because the SF-36 is used across hundreds of different disease populations, and with the general population, and with those covered by different types of insurance in different locations, etc. But if you tested all of them, your sample size would be huge, tremendous. So you really have to think about what is appropriate to test to answer your question about equivalence. You have to think about how much of the interview you are going to focus on the think-aloud process and having the spontaneous responses come to you, and how much verbal probing you will then do.
We’re going to assume for the case today that we’re doing individual interviews, which is usually the case for cognitive debriefing, and that it’s in person, although I know some people have done webcam. And we’re going to assume that you’re going to try to recruit people with equal gender and other demographic characteristics. But you probably need to think about how many interviews you’re going to do. You have to think about what modes you’re going to do. And by this I don’t mean what modes you’re going to do in your forthcoming clinical trial that you may be doing this in preparation for. What I mean is, when you sit down in the cognitive debriefing, are you going to just present the paper, just present the ePRO, present both, present both and vary the order, present one for some people and one for others? That sort of thing. And how are you going to design an interview guide to be able to distinguish between understanding problems with the survey generally and mode effects? And I know Sonya touched on this a little bit before, and others have too, but it’s a really important issue, because you may have things reported to you on a survey that has been around for 25 years, and you have to ask: is it preference, or is it something else? But it’s really kind of unrelated to mode. So that’s one type of study that you might do.
If it’s moderate, if you’re saying the SF-36 warrants having some sort of study due to moderate changes, you would need to do equivalence testing, which is designed to evaluate the comparability between PRO scores from electronic mode and paper and pencil. And again, you have the same types of questions about population, recruitment, that sort of thing. You additionally have a question of whether or not you’re going to do a randomized parallel groups design or a randomized crossover design, and this is on the back of your handout. They’re not very pretty but I did the best I could.
So you see in a parallel groups design, you have a randomization, but then each group only takes one survey in one mode. So you're limited in the types of analyses that you can do and the interpretations you can make. The randomized crossover design, at least from what we see in meta-analyses, is the more popular route, in which case you’re going to randomize the participant. In the first assessment, one group would do the paper first and the other would do electronic. In this example I have handheld. And then in the second assessment, they would switch: those that did the handheld first would do paper, and vice versa. And in between you have distraction activities. Those distraction activities may be for five minutes, an hour, two hours, or two weeks. And there are implications for each of those decisions that you would make. For example, there are memory effects. If you do it right away again, five or ten minutes later, you may say the person probably remembers everything they just said. Possible, but it’s also remarkable how many people answer differently. If you wait a whole week or two, you could have health changes. If you have changes in your health status you’re not measuring the same thing, so you then have to figure out how you’re going to determine the level of change that actually occurred. If you are going to do it in the same day you have to think about what type of distraction activity you’re going to do: are you going to do something more passive, like have them watch a movie or read a book or go on a walk, or are you going to have them do something that makes them use their brain, like a puzzle, or even answering other survey questions, maybe about their preferences about electronic and paper? But that could bias your results for the next part of the study. So you really have to think about what you are going to do in that time.
I should also say that there are pretty big cost implications if you do have them not do it in the same day. And you have loss to follow-up that you don’t have if you do it in the same day. So those are some other things that you have to think about when you’re making these decisions.
So in addition to your design, you have to think again about modes to be tested. For the paper group, are you going to do double data entry, just partial checking, or are you going to have some sort of scanning or faxing of forms to ensure that you don’t compromise the data quality? And I already talked about passive or active. And one other thing that I didn’t put on this figure is that, with the randomized crossover, ideally you would also have a group that starts with paper and ends with paper and one that starts with electronic and ends with electronic, because then you would be able to compare how much variability is natural even when you don’t change mode. And this is fun for some people and not for others.
So for cognitive debriefing, again, we talked a little bit before about how you determine what is a mode effect and what are issues across modes, like some of you noticed the paper version was a little blurry. Should you set a threshold in advance to detect the mode effect in a qualitative study. This is a contentious point among qualitative researchers, some who say you never should, and others who recognize the distinction between concept elicitation studies and cognitive debriefing and think that maybe you should have an idea of how many participants have to report the same issue in order to consider it something worthy of change. If so, what’s an appropriate threshold. And in this case, for threshold I mean the percent of participants that report or show a problem understanding or completing any portion or item in the survey, could be a response choice, could be anything, not just the item.
You need to determine what would be considered a problem versus a minor issue. I’ve done several cognitive debriefing studies, and a lot of people report different things that really aren’t issues; but until you’ve seen quite a few of them, you have to think about what’s the difference here. If there is an issue that truly is a problem and truly is due to mode, could you change it to make it better, or should you? We’ve all sort of thought about: what if the electronic version makes it better, so you’re still going to maybe have a difference and maybe you shouldn’t change it? And would you go back and do anything to change the paper after twenty-some years of using it the same way, which could affect external validity of comparisons across other literature? So there are a lot of things to think about, especially when you’re using an established, not a newly developed, PRO.
Then I want you guys to have, sort of in your minds—and this time we’re not going to necessarily do group work but I have a couple scenarios that I want people to think about and shout out.
So the SF-36 has 36 items in eight domains, two summary scores, and a rich history of use. You have conducted a cognitive debriefing study with 15 participants using a handheld device, and three of the participants, or 20%, read one item two or three times before answering; but when you probe them on it later they say that they understood it, they just needed to read it a couple of times. Is this an issue to address, and if so, how?
AUDIENCE MEMBER: Smarter patients.
[laughter] Smarter patients, okay. Okay, Sonya?
AUDIENCE MEMBER: I think you could look into was it the font size, was it too hard to read. Is that why they had to read it two or three times, and if that was, if there’s room on the screen to increase the font size, you might consider increasing the font size.
Okay. Next one. You’re conducting a study, and one of the inclusion criteria is to have either Type 2 diabetes or major depressive disorder. Even though the SF-36 instructions state that it’s about your health in general, participants remember these screening items and believe they’re in a study about depression or a study about diabetes. So when they’re sitting there with you, they may look up and say, is this supposed to be just about my diabetes, or is it about everything about me? How do you address that in qualitative research without biasing the results?
Ooh, I stumped the room. Yes, Greg?
AUDIENCE MEMBER: Well I mean, this seems kind of obvious. You tell them. You just tell them, what is it, you answer the question. Because you know, if somebody gives you a form, is it about my whole body or is it about a piece of my body, tell them what context I’m filling the form out. So if you don’t tell them it’s your fault. I don’t see why this is a big issue, you just—yeah, tell them.
And hopefully you have that in your instructions at the beginning, but I think you also could ask them first: why don’t you read the instructions and tell me what you think it means, and go from there. And let them fill out the whole survey and then maybe probe on it later and bring it up later.
Last one. On an SF-36 item with three response choices, one of 15 participants wants a fourth response choice. I heard someone say earlier, if even one person says it, we should take that seriously and consider it. So now I want you to think about this and say, okay would you really consider the one person or would you say meh, it’s just a preference for one person.
AUDIENCE MEMBER: Yeah, I think we have to consider that this may be an issue to address. Because in his perspective, he’s looking at things differently than the other 14. That we need to know.
Okay. Does anyone want to argue that? Jason.
AUDIENCE MEMBER: I think one of 15 is not all that indicative. But I think the first thing we need to do is say, would you have trouble answering this question, if there were not an additional response choice.
That’s right. That’s why the probing is so important. As you watch them do the think-aloud process and they go through the whole survey and do all of that, and if you have a very experienced interviewer that’s working with them, they’re going to go back to each of these questions and say, okay what is it about this, they say I want another one. Okay, if you didn’t have another one, would you understand it and know how to answer it. Would your answer have changed if you had another response choice. And oftentimes they’re going to say actually no it wouldn’t, I just think some people might have used a different one, or something like that. So very good point.
Okay. We could go on with cognitive debriefing scenarios forever, but I’m going to move over to a crossover study. And I’m actually going to skip parallel groups. When I thought about again how to get this to an hour, we’re just going to go with crossover studies. And I’m only going to focus on two parts of this, which is mean score comparison and ICCs. I realize there’s DIF. I’m going to talk a little bit about Kappa but we’re not going to go really in depth in every possible way that you could approach analysis.
So when we talk about mean score comparison—and I’m taking this more broadly from the ISPOR ePRO group recommendations—you’re comparing scores from two modes from the same person, or comparing scores from two groups. The difference should not exceed the instrument’s set minimally important difference, what we call the MID. There are also many articles about what that could or should be called, and definitions, but today we’ll go with MID. Or an estimate of what it should be, if there’s not one set by the developer already. In the case of the SF-36 there are MIDs set. Mean score differences are usually interpreted using Cohen’s effect size. And the mean difference between modes should be interpreted relative to an estimate of the mean difference within mode in repeated administrations; you can also analyze whether the mode difference is robust, and whether it’s independent of order of administration.
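As a rough sketch of that mean-comparison step, here is how you might compute the paired mean difference and its 95% confidence interval before holding it up against the MID. This is my own illustration with made-up scores, not data or code from any of the studies discussed:

```python
import numpy as np
from scipy import stats

def paired_mode_difference(electronic, paper, alpha=0.05):
    """Mean difference (electronic minus paper) with a CI and paired t-test.

    Hypothetical helper: the comparison of the result against the MID
    is left to the caller, per the interpretation principles above.
    """
    d = np.asarray(electronic, dtype=float) - np.asarray(paper, dtype=float)
    n = d.size
    mean = d.mean()
    se = d.std(ddof=1) / np.sqrt(n)              # standard error of the mean diff
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
    ci = (mean - t_crit * se, mean + t_crit * se)
    _, p_value = stats.ttest_rel(electronic, paper)
    return mean, ci, p_value

# Hypothetical 0-100 scale scores for ten participants in both modes
paper_scores = [55, 60, 70, 45, 80, 65, 50, 75, 60, 70]
epro_scores  = [57, 58, 72, 47, 78, 66, 52, 74, 63, 71]

mean_diff, (lo, hi), p = paired_mode_difference(epro_scores, paper_scores)
print(f"mean diff = {mean_diff:+.2f}, 95% CI = ({lo:.2f}, {hi:.2f}), p = {p:.3f}")
```

The point of returning the confidence interval, and not just the p-value, is that the equivalence judgment rests on where that interval sits relative to the MID, not on statistical significance alone.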
So what I decided to do is go through some of those 23 studies and come up with just a few examples here. So here’s a mean comparison scenario from one published article. This is back from 2002, which is before we actually had electronic versions, so I can’t vouch for exactly how they migrated it. It’s from 79 healthy participants and 36 chronic pain patients who completed the paper and electronic SF-36, the first version. And for simplicity I only show five scale scores, with mean differences between the electronic and paper versions.
So if you look here, I’ve got physical functioning, role functioning, bodily pain, social functioning, and role emotional. And we’re showing electronic minus paper and P values. Who wants to interpret this for me?
You can take a couple minutes. I know you have to look at it.
AUDIENCE MEMBER: What specifically are you seeking?
I am seeking in this case what you would say about these results, whether or not you would—what you would say. Based on if you just had this, would you say there’s equivalence or there’s not. And I realize there’s not an MID here, but in this paper they didn’t actually do that. So that’s part of my purposeful setup of this.
AUDIENCE MEMBER: So I would say that on the social function scale, there’s a statistically significant difference between the electronic and the paper.
Okay. All right. Thanks for pointing that out. There is, that’s true. They did in this case do a correction for multiplicity, which would mean that this actually isn’t significant, but it’s very hard to see that at the bottom there. In this case, in this article, they determined that there was equivalence between modes, although they also noted that the paper version had 44 percent missing data and the electronic didn’t, so they actually felt that the electronic was better. But this one was interesting because there were some differences that were statistically significant, but they didn’t reach the MID. But then when I looked at what they used for the MID to see if it matched what’s in the manual, it didn’t: they cited four papers, one of which was actually for the RAND-36, and none of which matched the manual. So I’m not sure that I would take this as something that I would incorporate, say, in a meta-analysis, because I was a bit concerned about how it was done.
I’m going to do one more example here. This one is actually fake data that I made up. And it’s showing means for groups of electronic minus paper for five scales again; I just do that because eight scales plus two summary scores would be hard to view here. They did a paired t-test. How would you interpret the results? So we’ve got mean, standard error, significance. And here I’ve listed the MID in the manual for each of the scales. I’ve also got your confidence intervals here. And I just want you to think about it. You don’t have to report back. I’m going to give you a minute or so to look at it.
Okay, so if you’re thinking about what you’re going to do here, you’re going to take the mean, and then you’re going to look at your confidence intervals. You’ll also note the P value, but you’re going to compare it to the MID, or the difference from 0. And this is what you would look at. This is an example from Walker & Nowacki from 2010. If there were no difference, you would be at 0, right; they would be the same. And then you would want to look at: here is your mean, with confidence intervals around it. And this interpretation says that if you’re within the MID—and in this case, these are just random bars, this isn’t actual SF-36 MIDs here—either above or below, you have established equivalence. If your mean is within it but the confidence intervals go above or below, you have not established equivalence, but you also have not proven that it’s not equivalent. If it’s completely outside the MID, you would say you did establish that it’s not equivalent. So what’s interesting to me here is that this approach isn’t used a lot. But statistically speaking, and if you do say that everything should be taken in the context of what’s an important difference, not just what’s statistically significant, which I think is important, we should be looking at this, right, and thinking about it.
AUDIENCE QUESTION: Just out of curiosity. If it extends above the top MID but it also extends below the bottom MID, would it be considered in or out?
It would still be out. It would be out, but you didn’t establish that it’s not equivalent either. It’s unclear. If it was completely under, like this one is only under, you would have established non-equivalence also, just in the other direction. The margin is the minimally important difference level set by the developer, or sometimes a standard deviation or half a standard deviation is used; it depends.
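That interpretation logic, confidence interval versus the ±MID margin, can be sketched as a tiny classifier. This is my own summary of the rules just described; note that the case where the interval spans both margins comes out as inconclusive, matching "out, but non-equivalence not established either":

```python
def classify_equivalence(ci_low, ci_high, mid):
    """Classify a between-mode score difference by where its confidence
    interval falls relative to the equivalence margin [-mid, +mid]."""
    if -mid <= ci_low and ci_high <= mid:
        return "equivalent"        # CI entirely inside the margin
    if ci_low > mid or ci_high < -mid:
        return "not equivalent"    # CI entirely outside, in one direction
    return "inconclusive"          # CI crosses a margin (or spans both)

# A margin of 3.0 points is purely illustrative, not an actual SF-36 MID
print(classify_equivalence(-1.5, 1.8, mid=3.0))   # equivalent
print(classify_equivalence(3.2, 5.9, mid=3.0))    # not equivalent
print(classify_equivalence(-3.5, 3.4, mid=3.0))   # inconclusive
```

Writing the rule down this explicitly is exactly the kind of thing that belongs in the statistical analysis plan, so that the call at the end isn’t left to discretion after the data are in.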
So something to think about, right? Which is: exactly what are you going to put in your statistical analysis plan if you’re going to do an equivalence test study? And that’s going to affect your sample size requirements too. So another approach would be a Kappa coefficient. I’m not going to have an example of this, but it’s based on the idea that simple agreement may be high just due to chance. The Kappa coefficient corrects for this by examining the proportion of responses in agreement in relation to the proportion of responses that would be expected by chance alone. And a weighted Kappa can actually look at partial agreement, not just exact agreement, which is important. And I’ve got some thresholds here. But usually we don’t see this approach used very much; I have seen it a couple of times. What’s more often used is the ICC, which can assess both covariance and degree of agreement between score distributions, and assess reliability of scores given on multiple occasions or across multiple raters. It takes into account both the relative position in groups of scores and the amount of deviation above or below the group mean, and typically a score of about 0.7 is considered acceptable. This is probably the most commonly reported statistic in mode equivalence study publications, but it should be used along with the mean difference comparison, not in and of itself.
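For what those two statistics actually compute, here is a minimal sketch using the standard textbook formulas. These are my own implementations, not code from any of the published equivalence studies: a quadratically weighted kappa over ordinal response choices, and ICC(2,1), the two-way random effects, absolute agreement, single-measures form often reported in these papers:

```python
import numpy as np

def weighted_kappa(ratings_a, ratings_b, n_categories):
    """Quadratically weighted kappa: chance-corrected agreement that
    gives credit for partial agreement between ordinal categories."""
    a = np.asarray(ratings_a)
    b = np.asarray(ratings_b)
    observed = np.zeros((n_categories, n_categories))
    for i, j in zip(a, b):
        observed[i, j] += 1 / a.size             # joint proportions
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights: 0 on the diagonal, growing with distance
    idx = np.arange(n_categories)
    disagree = ((idx[:, None] - idx[None, :]) / (n_categories - 1)) ** 2
    return 1 - (disagree * observed).sum() / (disagree * expected).sum()

def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single scores.
    `scores` is an (n subjects x k modes) array."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    msr = k * ((row_means - grand) ** 2).sum() / (n - 1)     # subjects
    msc = n * ((col_means - grand) ** 2).sum() / (k - 1)     # modes
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))           # error
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

With perfect agreement both statistics come out at 1.0. The 0.7 rule of thumb mentioned above would then be applied to the ICC, and, as noted, alongside the mean score comparison rather than on its own.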
Here’s another example that I have. It’s from an older group of men with prostate cancer. They did paper and web modes, the first version of the SF-36, all completed on their own, so they didn’t have any technical assistance. Sample sizes ranged from 205 to 208 by scale because there was some missing data for paper. And I’ve got five scales below with their ICCs. There’s a range because they did web-then-paper and paper-then-web, and it doesn’t say here which is which because I was trying to make it simple. So based on what I just said about the threshold, out of these scales that we have—physical functioning, role physical, role emotional, vitality, and mental health—how would you interpret this?
AUDIENCE MEMBER: It varies by scale, so I mean, if you want to talk about the scale as a whole, I don’t think that’s appropriate. You’ve got PF meets, RP suspect, RE suspect—though it’s kind of ugly—VT meets and MH meets.
So I think you hit on a point that I’m going for in this, which is: if you were doing a study on a PRO that has one construct, one scale, you could look at any one of these, right, and you could say, oh, I don’t know about role emotional here, because in one of the groups it was below 0.7. But first of all, you haven’t taken it in context with any other statistic, like the mean score comparison. And secondly, let’s assume that all the rest of the scales did meet: what would you say and conclude about the instrument as a whole, even if you saw this? Would you say yes, that’s equivalent? One out of ten different scores (eight scale scores and two summary measures), and this one does have at least one ICC above the threshold in one direction. Or would you say no, if one’s not good they’re all not good?
See, this is why you’re the support group. Come on, guys. These are the types of things that I’m looking at as I’m looking across these different studies, and I’m thinking: how did every single one of these publications, by the way, come up and decide that the survey was equivalent? So I’m kind of spoiling the ending by telling you what they said. But this is where I get to the discretion of, well, I don’t know, would everyone say that?
AUDIENCE MEMBER: Michelle, correct me if I’m wrong, but role emotional only has two items as well. So the variability is going to affect the calculation of the ICC, the lower variability will necessarily lead to lower scores and greater standard error amongst those scores. So comparing the raw ICCs of role emotional to all of the others may be inappropriate anyhow.
Right. Exactly. And that’s another point that I’m trying to make, which is: whatever you’re going to do in analysis and interpretation, it must be relative to your design, right? Are you doing the appropriate thing for what you intended to do?
So I’m at the end of my time. I’m actually about done, which was—I was just going to have you guys take a few minutes to say whether or not you would change any of your original decisions after we kind of went through some of these. But it might be better to do questions, whichever you prefer, for a few minutes.
[END AT 39:21]