Welcome back everybody. Hopefully the first session was informative and has given you some ideas or thoughts that you might not have heard or experienced before. During the break, I was chatting with somebody about how patient compliance can be affected by the simplest of everyday objects: spectacles. I actually can’t read this piece of paper in front of me without my glasses on. And we were talking about something as simple as this: if a pair of glasses can affect patient compliance, do we really understand the burden we’re expecting patients to take on when we ask them to enter all of that detail into their diaries? Just pause for thought, really.
So let’s move on to the agenda, and I do comply, wearing my glasses. I’d like to welcome up Michelle White. Michelle is Senior Scientist and Senior Director for Consulting Operations at QualityMetric, and we’ve asked her to present on equivalence across modes. I think this is the second or third time that Michelle has presented with us, so we’re very excited to welcome her to the stage. Michelle has a huge amount of experience in health-related quality-of-life patient-reported outcomes, from both a qualitative and a quantitative perspective, through the studies she’s been involved with, with a background in addiction treatment, and also as a representative of Illinois State University and the University of Illinois. Michelle.
Good morning. I get to go after the two Pauls, yaaay! Thanks everyone for being here.
Today I’m going to talk about researcher discretion in determining equivalence across modes in clinical outcome assessments, and I’m going to focus specifically on patient-reported outcomes. This is more of a workshop format. The goals today are to critically evaluate the crucial decisions that are made when you conduct mode equivalence studies, in terms of research design, analytic approaches, and interpretation, and how those decisions may affect your study findings. The spirit of our conversation today is really one of openness and sharing of information: instead of waiting until the end, there are going to be many times when I’ll be asking questions. I recently completed a mode equivalence meta-analysis of published SF-36 studies, and this is almost like a support group for me, for all the things I had to go through, looking through the publications and wondering about things. So I’m really going to value everyone’s input.
In order to keep this to an hour, one of the things I had to do was narrow the focus, and I’m going to concentrate on generic multi-dimensional PROs, although most of what we talk about will be applicable to disease-specific or single-scale PROs. And I’m going to look only at single comparisons of paper to electronic; I’m not going to include interactive voice response or anything like that. I’m also not going to talk much about multiple modes or bring-your-own-device, because I really need to keep it simple, and I think you’re going to find it’s pretty complicated even when I think I’m simplifying it a lot. I’m also going to talk about a PRO that’s already been migrated to an electronic format. There are a lot of other issues that come up if you are working with a paper PRO that hasn’t yet been migrated.
So the first thing I thought about is: what really is discretion, and what are we talking about that the researcher can do? Webster’s says it’s “the right to choose what should be done in a particular situation.” Who has discretion? In any study, you, the researcher, are not the only person making decisions. An example of having discretion yourself might be something as simple as studying for a test, where in daily life you really have most of the control over how much time you spend on it. Something else we encounter in the US a lot, where other people have discretion, is if you go speeding down the highway past a police officer: that officer has some discretion about whether to stop you or someone else. In the case of mode equivalence studies, we often find that the “other” holding discretion is time and money. There are ways in which researchers want to do a study, but there are constraints, and particularly with rare diseases and, really, any drug development, there are costly implications, and costs to patients, if you are holding up clinical studies while you do this type of research. So it’s important.
And that brought me to the question: Why does it matter how we use discretion here? And I think it matters to all of us because we do care about scientific integrity. While we have these constraints of time and money that could be lost while we do it, we want to make sure we do it right and that the findings truly represent what’s going on. And if you don’t do it right, you’re compromising drug development efforts and ultimately patients’ lives.
So I’m breaking this down into two parts: research design and then analysis and interpretation.
The first critical decision point that you have to make about a mode equivalence study is: do you do a study or not? And here I’m using work done by Sonya Eremenco and colleagues. This is specific to mixed modes, but I think the flow chart really helps and is applicable for all studies. It starts: will the PRO be used for regulatory submission or not? And as we kind of talked about before, what we should really care about is: is it a gold standard, is it something that needs to be done? So let’s assume today that the data collected in your study will possibly be used for regulatory submission. What I like about this chart is that it then goes to: are there published studies out there? Is there already evidence of mode equivalence for this PRO? So many times, people jump right to “let’s go do this study,” and they haven’t taken the time to see what’s already there.
So since we’re thinking about regulatory, what do the FDA and EMA say about the necessity of mode equivalence studies? And I know we’ve had many bright and engaging discussions about the necessity of these studies. In 2009, the FDA put out the PRO guidance, and it has something in there. It says that when a PRO instrument is modified, including changing from a paper to an electronic version, you need to provide evidence to confirm the new instrument’s adequacy. But it does say that’s not to say that every small change in application or format necessitates extensive studies to document the final measure’s properties. It also says the agency wants to review, in your submission, the paper version and screenshots of your electronic version, so they are going to see how it was migrated. But other than that, it’s really rather vague about what it is you have to do.
The EMA has not produced a specific PRO guidance that covers mode equivalence, but it did put out a reflection paper in 2010 on expectations for electronic source data and data transcribed to electronic tools in clinical trials. It’s a pretty long document, and the only thing I could find in it that touches on this is that the use of PROs should ensure data are at least as accurate as those recorded by paper means. Nothing about how that’s done, when you have to do it, or how much evidence is enough: does one study cover every use, or do you have to do it every time? So let’s just stamp that as: okay, let’s not worry too much about what they say, and think about what we should do with our discretion.
What can we learn from the past? Well, there have been three published meta-analyses of comparisons of paper-to-ePRO mode equivalence studies that have shown that a faithful migration with only minor changes generally equals equivalence, regardless of what kind of patient-reported outcome was used. And I want to be really clear here that across these three meta-analyses (the first one was Gwaltney’s, back in I think 2004 or so, and a couple of others followed) there were 480 different comparisons. Now, these are of scales, not necessarily of PROs, so the number is slightly smaller, and some of them may have been used in more than one of the three studies. But I think you can say, when you’re looking at hundreds of comparisons that all came up with this result, regardless of PRO, that it’s pretty strong evidence to say maybe we just don’t need to do these studies anymore. Maybe 20 years ago it was important, because the technology wasn’t as far along and we didn’t really know what the best practices were, but maybe it’s not necessary now.
The mode equivalence study that I have just finished conducting and am publishing looks at the SF-36 alone, because my thought was: well, there are a lot of differences between all these PROs; maybe certain ones do have problems and it gets worked out in the wash. The EQ-5D and the SF-36 happen to be two instruments used so broadly that there were actually enough published mode equivalence studies on them to look at; there were 25, in fact, that had adequate data.
And generally, our findings (I’m giving away the answer here) have concluded equivalence. So do we need to study this at all? Well, there are some reasons to conduct a study. One is, you know, people call me sometimes and they say: we want to know, is there equivalence when we use this version or not? What if all the people who ran these studies only published the good ones? What if there are lots of studies saying it’s not equivalent? So as instrument developers, there is some onus, some responsibility, for determining whether or not there is equivalence. We also don’t know whether all migrations have been faithfully done and well implemented. Many of the studies I looked at were completed before we even had a single-item version, before handhelds were what they are now; they were PDAs and kind of clunky devices, so you could have different findings that way. And in fact, only three of the 25 published studies actually showed what any of these migrations looked like. I happen to have access to six of them from talking with different authors, but that’s still a small, small number compared to the 25 published. We have no idea exactly what they did. Are all these mode equivalence studies well conducted? I can tell you there were lots of studies we didn’t include because they didn’t publish the statistics necessary to support the finding that they published. It’s sad to say. So then again, what is the ultimate answer regarding equivalence? It’s not so simple.
So I’ve kind of talked through this decision that you have to make, and didn’t really come up with an answer. What we’re going to look at today is actually the SF-36, and if you take the handout on your table, you may have to flip it over; you should have a flow chart. It’s going to be on the screen too, but I just find it might be easier for you to look at, and when we break into small groups you’re going to need it.
This is a CRF Health implementation, on a handheld device, of a couple of the items from the SF-36. I want you to look at the right-hand side: these are what the screens look like. You have your response choices laid out vertically, and it reads, you know, “During the past four weeks, how much of the time…” and so on. And you have a “Next” on the screen. Now, if you look at the paper version on the left side of your handout, you can see how it appears on paper, with a grid presentation. Obviously this has more than two questions on it, but you can see that here you have “During the past four weeks, how much of the time have you had any of the following problems,” and then just the item underneath the stem at the top, whereas on the handheld version you have to reword things to combine part of the stem with the item. Are the cognitive processes different when you read it that way compared to when you read it in a grid format? Let’s look at that.
So let me talk a little bit about how the migration was done from paper to the two single-item versions that exist. The developers took lessons from past published research, and from informal discussions with many ePRO companies and scientific experts in the field, in how it was migrated. There were actually two versions migrated: one single-item version for the tablet, and one for the handheld. The reason for that is that handhelds weren’t very popular back around 2006, when they started making the first single-item version. On the tablet you have all this screen space, so the tablet version actually has the entire stem at the top of every single screen, with every item separated out. So with the tablet version you do have a little more reading than you would have in the paper version. In the handheld version, as I showed you, things are packed together a little more; you don’t have the full instruction plus the full question on every screen. That said, the main changes were really these: slightly modifying the instructions from “mark an X in the one box” to “select the one response” (fairly simple; you’re going to have lots of people responding in different ways, and they’re not all checking a box at all); the change to one item per screen; and the response choices being vertical instead of horizontal, as you might have noticed.
So, let’s see, I talked about stem placement, and that brings us to decision point two: what types of changes were made? This is an important question for any type of PRO you’re working with, but for the SF-36, here I’m using work by Dr. Stephen Coons and colleagues published in the ISPOR ePRO Good Practices Task Force report. It has this handy table, which resulted from, I think, about two years of argument and discussion over what constitutes a minor, moderate, or substantial change to an instrument, and what type of testing you have to do to show equivalence based on the changes that were made.
And some of the things they said: well, those little changes in wording for the instructions, to say “select one response,” really don’t change the meaning of the survey, so that’s probably a minor change. They also concluded that going from items packed together on a page to a single item per screen was minor. That could be debatable, but it’s what they came up with. And for minor changes, they said all you have to do is usability testing and cognitive debriefing, and we’ll talk a bit more about what that is in a minute. Moderate changes are changes to wording that may possibly present a change in meaning, or something that would change interpretation, like interactive voice response, where you’re hearing something rather than reading it, but we’re not really covering that today. In that case you would want to do equivalence testing and usability testing. And according to them, there really is no reason to do full psychometric testing unless there are substantial changes, like going from two to six response choices or other major changes to the item response options. I’m not exactly sure why you would do that unless you were actually modifying the survey, but there it is. So today, to keep it simple, we’re going to focus on minor and moderate changes, not substantial ones. I’m also not going to focus on usability testing, because I know there’s another session on that, and it has to be done either way, so there’s not much debate about it.
What are some of the design decisions you make when you’re doing a cognitive debriefing study? So what is one, first of all? These types of studies are used to explore the ways in which members of a target population understand, mentally process, and respond to the items on a questionnaire. And I would say not just the items, the instructions, the response choices, the recall period. How do they understand these different components of the survey? They’re sometimes done in focus group format but usually they’re done in individual interview, and a lot of times—most of the time—they’re done in person, because whoever is doing the interviews wants to be able to pick up on non-verbal cues, facial expressions as they’re looking at the survey and filling it out and saying out loud what they’re thinking. So you really want to hopefully do it in person.
One of the questions you have, if you’re going to do a cognitive debriefing study, is what type of population you’re going to use. That seems pretty simple, right? If you’re doing a study on a sleep scale, you would just get people with sleep disorders. It’s not really that easy, because there are lots of types of sleep disorders; you might have to think more deeply about it. But with something like the SF-36 it’s even more complicated for a person like me to think about, because it’s used in hundreds of disease conditions, and it’s used in general populations, all over the place. If I were to try to test it in every single population, the sample size would be tremendous; it would be practically undoable. So what level of evidence is required to say that it’s equivalent? Who would I have to do this study with?
How are you going to do the study? In these interviews, you’re sitting there with a person who is normally doing what we call a think-aloud cognitive process, which means they’re given the survey, asked to read it out loud, and then to say what they’re thinking as they fill it out. That’s what I mean by doing it in person: so you can actually see when somebody pauses, or looks quizzically at something and says, oh, I don’t know what that means. But you’re not interfering, you’re not answering their questions; you get all the way through the survey and then go back and probe on the different things they had questions about. And you have to think about how much time is reasonable, given patient burden, to sit there with someone, and, given the length of your survey, how much probing you’re going to be able to do and what the important things to probe on are.
In the case of the SF-36, you have a survey that’s been around for 30 years. There are over 28,000 published studies in our database that use the SF-36, and probably many more out there. So it’s been used a lot. If you end up making a change to it for some reason, you’re affecting your ability to compare your results to what’s already out there in the literature, so you have to think about some of these things.
So how many people do you need in your study? How are you going to recruit them? How are you going to ensure, for a small study, that the people you involve really have whatever condition you’re interested in? Because there are cost and time implications of going through, say, a doctor or clinician, as opposed to going through a patient advocacy group or something like PatientsLikeMe. What mode will be used in the interviews? And by that I don’t mean are you comparing paper and electronic; I mean, are you going to sit down with someone with the version on the electronic device that will actually be used in the clinical trial, or with the paper version? Are you going to have half of the participants do one and half the other, or all of them do both, comment on which is easier, and randomize who does what first? There’s a lot of discretion here in the decisions you’re making about how you do this. And maybe it’s not you doing it: some of you are actually hiring someone else to do it, and you want to make sure they are thinking about all of these things.
And then you also have to design your interview guide, because usually you use some kind of structured interview guide that has the basic questions you want to go through but allows you to probe and go off the guide, to understand what the problems are with the survey and what the problems are with the mode effects. Both can occur, especially with something that’s been around for 20 years: language changes, people may understand certain words differently in different cultures, things are understood differently. You have to make sure that whoever is doing the interviews pays careful attention and is able to probe and find out: okay, so I understand that you didn’t understand this question, or you think it should be done differently. Would you still understand it the way it is, though, or would you not?
Okay, so that’s cognitive debriefing. The other thing we talked about is what happens if the changes aren’t minor. We’ve already said that for minor changes, cognitive debriefing with a small number of participants seems like a fairly small lift. If the changes are moderate, you actually have to do equivalence testing, and usually you do both cognitive debriefing and equivalence testing. Equivalence testing is designed to evaluate the comparability between PRO scores from an electronic mode and from paper-and-pencil. This is more traditionally what people think of when they’re thinking of mode equivalence tests; they often don’t even think of cognitive debriefing.
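The score-comparability check described here can be made concrete with a small sketch. This is purely illustrative and not from the session: it runs a TOST-style equivalence check on hypothetical paired paper/electronic scores, using a large-sample normal approximation and an assumed equivalence margin of five points on a 0-100 metric.

```python
import statistics

# Hypothetical paired scores for the same participants (0-100 scale).
paper      = [72, 65, 80, 55, 90, 68, 74, 61, 83, 70]
electronic = [70, 66, 78, 57, 88, 69, 73, 60, 85, 71]

margin = 5.0  # assumed equivalence margin, in score points

diffs = [e - p for e, p in zip(electronic, paper)]
n = len(diffs)
mean_d = statistics.mean(diffs)
se = statistics.stdev(diffs) / n ** 0.5

# 90% CI on the mean mode difference (large-sample z approximation).
# TOST logic: declare equivalence only if the whole CI sits inside the margin.
z = 1.645
ci_low, ci_high = mean_d - z * se, mean_d + z * se
equivalent = (ci_low > -margin) and (ci_high < margin)

print(f"mean diff = {mean_d:.2f}, 90% CI = ({ci_low:.2f}, {ci_high:.2f})")
print("equivalent within margin?", equivalent)
```

With a small real sample one would use a t critical value rather than the z of 1.645, but the logic (confidence interval versus pre-specified margin) is the same.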
So here are some of the types of decisions you’re going to have to make when doing equivalence testing. Again, you have to think about the population you’re going to use, and about what type of design. And if you turn that paper over to the other side, you’ll see the same thing that’s up here: the two most common types of designs. There are certainly several others, but if you look across all of the meta-analyses, almost every study used either a randomized parallel groups design or a randomized crossover design of some kind.
The parallel groups design does have a randomization aspect, but then the people in the study only take either the paper or the electronic survey. That limits a lot what types of analyses you can do and what you can say later. A randomized crossover design, on the other hand, randomizes so that about half of the participants take the paper first, then there’s some sort of activity, or there may be an interlude of several days, and then they take the other mode. So some have paper first and then electronic, and others do electronic and then paper. Ideally you would actually have four groups, because you would also have a group that takes electronic first and second, and one that takes paper first and second. Why would you want to do that? Because sometimes there are changes that happen just because people take it twice, and they’re not really related to the mode. And the only way to separate that out is to have the four groups, take the differences from, let’s say, handheld-to-paper, and compare them to the differences you found when you did paper-to-paper. Otherwise you may overestimate the mode differences.
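The four-group logic (comparing mode differences against paper-to-paper retest differences) can be sketched numerically. The change scores below are invented for illustration; the point is just that subtracting the stable arm’s retest drift shrinks the apparent mode effect, which is why the naive single-arm estimate can overstate it.

```python
import statistics

# Hypothetical within-person change scores (time 2 minus time 1)
# for two arms of the four-group crossover design.
paper_then_electronic = [1.5, -0.5, 2.0, 0.5, 1.0, 2.5, 0.0, 1.5]   # crossed arm
paper_then_paper      = [1.0, -1.0, 1.5, 0.5, 0.5, 2.0, -0.5, 1.0]  # stable arm

crossed_change = statistics.mean(paper_then_electronic)
retest_change  = statistics.mean(paper_then_paper)

# The naive estimate attributes ALL change in the crossed arm to mode...
naive_mode_effect = crossed_change
# ...but some of that change is just retest drift, visible in the stable arm.
adjusted_mode_effect = crossed_change - retest_change

print(f"naive: {naive_mode_effect:.4f}, adjusted: {adjusted_mode_effect:.4f}")
```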
So the most popular of these two is the crossover design, and that’s what our group exercises are going to focus on instead of the parallel groups.
So what are some of the other design decisions you have to make in equivalence testing? Again, you have to think about how many participants you need, and ideally we’d be doing a sample size analysis, but we’re not going to do that today. What mode are you going to test? And here, is it important to test all the different modes that could possibly be used? If you think about an instrument developer, there are going to be people who use all different types of modes. So do you have to test all of them, or can you use a lowest-common-denominator approach, where you say the handheld is the smallest screen size possible, so if it’s equivalent there, we can extrapolate that it would be equivalent everywhere? Yes or no? Would you do a combination? But remember that every time you add a mode, you’re adding to the number of groups you have, you’re adding to the sample size, you’re adding to the time, and the cost.
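As a purely illustrative aside on the sample size analysis just mentioned, here is one common back-of-the-envelope approximation for a paired equivalence (TOST) design. Every input here is an assumption supplied for illustration, not a figure from the talk: an SD of paired differences of 8 points, a 5-point margin, a true mode difference of zero, one-sided alpha 0.05, and 80% power.

```python
from math import ceil
from statistics import NormalDist

def tost_paired_n(sd_diff: float, margin: float,
                  alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate n for a paired TOST equivalence test, assuming the
    true mean mode difference is zero (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha)            # per one-sided test
    z_b = NormalDist().inv_cdf(1 - (1 - power) / 2)  # beta split over both tests
    return ceil(((z_a + z_b) ** 2) * sd_diff ** 2 / margin ** 2)

# Assumed, purely illustrative values.
n = tost_paired_n(sd_diff=8.0, margin=5.0)
print(n)
```

A tighter margin or a larger SD of differences drives the required n up quickly, which is part of the time-and-money discretion discussed earlier.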
Some of the other things you need to think about is, for the paper surveys, how are you going to get that data in? Are you going to scan them and somehow have them automatically go into a database, are you going to have double data entry done, partial checks, how are you going to ensure the validity of that data.
And then, if it’s a crossover study, you need to think about the time between administrations. There has been great debate in the literature over whether it should be done like a test-retest study, where participants come back maybe a week or two later. There are positives and negatives to that. One of the negatives is that your health may actually change in that period of time, especially depending on what type of condition you have. So how do you account for what was really a mode effect versus what was really a change in health? The other thing you have, if people come back later, is loss to follow-up, and the question of whether there are differences between the people who did show up for the follow-up and those who didn’t. So there are some legitimate problems with that. But there are also legitimate problems with doing it the same day. How much time do you put in between? In the 25 studies I looked at, it ranged from five minutes in between, where they literally just went to another room, got some instruction on how to fill it out, and did the other one, all the way up to, I think, about four hours in between as the longest on the same day. And what do you have them do in between? Are they just sitting there in a waiting room, left to their own devices? Are they watching a movie or doing something that’s supposed to take their mind off it? Because what about memory effects: are they just going to remember what they filled out? Is it even a problem if they remember? And the other thing is, are you going to have them do something in between that you’re really interested in, like asking a whole bunch of questions about whether they prefer electronic or paper? Well, that could bias what they say, because you’re asking in the middle of the study and they still have the other mode to do. So you really don’t want to do that until after you’ve collected all the data.
So as I said, this is sort of like a support group. Here’s all these different things I just threw at you. It’s like augh! What do we do?
So this is where you come in. I’d like you to do a small group exercise; you’re going to have about ten minutes. Just at your tables. A couple of you only have three people, and I’d kind of like six per group, so you might want to combine over there. And I want you to talk about the SF-36; that’s why you have the handout, so you can look at it again and think about what we need. Just in case you’re not too familiar: the SF-36 has eight scales and two summary measures, the physical component summary and the mental component summary. I want you to talk about whether you need to do a study at all, based on the types of things I talked about that were done to migrate it from paper to electronic. If you do have to do a study, would you consider the changes that were made (I can put them back up on the screen so you can remember them easily) minor, so you would do cognitive debriefing, or would you do equivalence testing? And then I want you to explain why. Or both, I guess you could do both. Provide some of the details of what you would do. For cognitive debriefing: what would you test with, paper, ePRO, both? What type of population might you use? How many people would you interview? If equivalence testing: would you use a randomized crossover design or parallel groups? How much time between the administrations? What activity would you use in between? Are you going to do that four-group thing I talked about, with the stable groups, or not? That sort of thing.
And I want you to keep in mind reality, okay? Cost and time. There are all sorts of things we would love to do. So I’ll be walking around, answering questions, and I’ll have this up most of the time, but just before we start, I’ll put the slide back up to remind you of the changes that were made. And will someone help me keep track of the ten minutes? Okay, wonderful. And go.
[Break for Workshopping 31:05 to 31:10]
All right. So hopefully each group has quickly picked someone who might want to report back. There are a lot of tables, so instead of forcing each group to say something, I’m going to ask first for a volunteer to talk about what your group concluded or thought of. If no one raises their hand, then I’ll pick on people. Anyone?
We said that if we assume this is the first ever migration to electronic for this instrument—because obviously now we know with looking back we have loads and loads of examples where we’ve done this—we think that the changes are more in line with a moderate change than a minor change. We felt the minor changes were more applicable to things like changing “please circle the answer” rather than “please select” you know, those sorts of minor wording changes, whereas we really have made some quite big changes to kind of the root of the question and the way that it’s structured, even though we think that those changes have been made really well.
So we’re going to say it’s moderate. And therefore we would do a single equivalence study, we only want to do one, get it published, and after that any other electronic implementation should be able to use that evidence, so long as we design that study really well. And, as far as we got with this: we felt a typical equivalence study is probably about 50 patients, and we’d do a two-way crossover. We’d obviously do a proper sample size calculation. We’d have a one- to two-hour washout period, with a distraction task in the middle of that washout period, but nothing too intensive that would fatigue the patient for the second period. We wouldn’t go for parallel groups. We’d try to do this in a population that was fairly generic, so we wouldn’t necessarily go for a particular indication; what we would like is a spread of disease severities, so that we’re getting some patients responding at the top end of the scale and some lower down, so we’ve got a nice range of responses across our sample. Oh, and also we’d want a nice spread of ages in there; we’d want some elderly patients as well as some younger adults. And that was it, I think.
Wow, you guys agreed a lot. That’s awesome. Anyone want to argue any of those decisions or second them? Anyone? Yes? Okay. You want to share your thoughts?
I think the discussion we had was: is this minor or moderate? I think actually we leaned more towards minor. But that again comes back to the challenge in your presentation: how do you interpret this, there are sort of no strict rules, and how much of a change is this? But from that I also think we concluded that yes, we definitely need to do something. But hopefully there’s something we can refer to so we don’t need to do it ourselves, and that comes back to your point: if you’ve done all that work, you can refer to that, and that would save you the time and money of doing it yourself. If not, again, I think we were leaning towards minor, and that would lead us to cognitive debriefing being enough. And if we’re doing that for the specific study purpose we were given, we considered that ten patients would be enough; that’s kind of the experience from the legacy specialists in the area. And then, again, if we’re focusing on a specific study, a specific case, we would lean a little more towards that specific population and get the range within that population. But on a good day you have covered all those patient groups anyway, and then you can refer to the bigger study.
And then you’re done. Great.
Yeah, but I think there’s a good point in there. In terms of the industry and sponsors, and correct me if I’m wrong, I just get the impression that everyone is doing this kind of patchwork: I minimize mine and do my five patients, maybe seven or maybe ten. So I think the concept of doing it a bit more properly and thoroughly, more generally, and then referring to that, would be really welcome, and it’s an area where sponsors could collaborate. It’s not where we compete, so let’s probably do it jointly and then all of us can use it.
I think you’re really right. I’ve looked before at the literature on cognitive debriefings for mode equivalence, and there’s not much out there, but lots of people are doing it, often together with usability testing, and they’re just not stopping to publish it, because they did it for their own purpose and they’re moving on. So that’s a really good point.
Let’s take one more perspective from the room before we move to the next section. Anyone else have different thoughts, or want to second what was said already? I know I heard arguments all throughout the room. Did any table not actually agree with each other? I have liars in the room. I heard it.
Maybe, would it make sense to start with the cognitive debriefing with a small number of people, and then later on, depending on the result, move to equivalence testing in case there is, let’s say, particular feedback from the cognitive debriefing?
People have done that, yeah, especially when it’s on the fence between minor and moderate. And given the time involved, you take a risk there: you might end up taking longer doing both than going one way or the other, but if it works out with the cognitive debriefing, you may say, oh, this seems good enough. And again, my sense is that different people in the room will come to a different conclusion every time about what should be done. In this room, I would guess that if I made everybody write down on a piece of paper what they decided, I would get a pretty even split of decisions, at least on minor versus moderate and some of the other decisions being made.
So I guess my point here is that there are a lot of decisions that may affect what you come out with. And that is important, because if you don’t pay careful attention to the decisions that are made, what you get at the end may not have been worth doing at all. So thanks for all of your active discussion on that.
Now we’re going to move on to the next part, where we finish the research design and talk about analysis and interpretation. In some rooms this is the fun topic for people, and in some rooms it is not. I don’t know exactly who is in this room, so I don’t know if this is a part people really like to talk about. So I’m going to go through it in two parts again: first cognitive debriefing, and then equivalence testing.
So what are some things to think about in analysis and interpretation for cognitive debriefing studies? Cognitive debriefing studies, as we talked about, are more qualitative. Participants do fill out the survey, but you’re not really concerned with that data; it’s more of a one-on-one, in-depth discussion with people. So one of the things I already talked about a little is determining what is a mode effect and what are issues across modes. And something that Paul and I were just talking about over the break is that it might not even be something inherent in the survey itself but also in translatability, which I know will come up in another session, so I’m not going to go into detail on it, but there could be issues that way too.
One of the major debates in qualitative research, among people who do these studies on a day-to-day basis, is: should you a priori set a threshold to detect an effect? Which seems kind of funny. A hardline qualitative researcher would say you would never, never say in advance, I need to hear four out of ten people say something to consider it a problem and something I need to change in the survey. There may be only one person who says it, and I have to think about that. On the other hand, cognitive debriefings are exploratory but not as exploratory as most qualitative research. They’re in a way confirmatory: you are confirming whether these things that are already made are adequate or not. You’re not trying to develop new things at that point; in the survey development process, that should already have been done in a separate qualitative process. So now you have your survey and you’re trying to test the validity of the items.
So in that case, would you maybe decide that you’re going to start with a threshold, and if so, is that threshold hard and fast? And is the threshold related to any kind of problem with that item or response choice, or do they all have to report the same exact problem? I think that’s an important thing to think about when you’re conducting these studies, and it’s something that will affect what you decide to do and whether or not you make a change to the survey later.
And I think it’s more of a problem for surveys like the SF-36 or the EQ-5D that have been in existence for over 25 years. As I mentioned, there are some 28,000 published studies that used the SF-36; if you make a change to the survey, maybe throw something new in there, you’re really affecting worldwide use of something, so it’s pretty important.
And here’s another thing. If there is a problem with something that you find, and you believe it’s truly due to mode effect, you truly believe that in the paper version people understand it this way, in electronic they do not, can you change it to make it better. Some things are just harder to understand. Some things involve very technical complex language in the healthcare field that there is not a simpler word for. So that’s a consideration.
All right, so now, instead of breaking into groups, I’m just going to ask people to help me with a couple of questions based on what we just talked about. First, as we mentioned, the SF-36 has 36 questions that make up eight different domains of health and two summary measures; four domains are more physical, four are more mental. So you have role emotional, physical functioning, vitality, and general health as some of the domains that are measured, and it has a rich history of use. So let’s assume you’re going to use the SF-36 and you’ve conducted a cognitive debriefing study with 15 participants, and instead of sitting them down with the paper version you sat them down with a handheld device. Three participants, or 20% of them, read one item three times before answering it. But when you go back and probe, they say they understood it just fine, and they won’t give you any more information about why they needed to read it three times. I’d like someone to help me understand: is this an issue that you need to address, and if so, how would you address it? I know I have qualitative people in the room. Someone?
So my name is Cornelis, I’m with Novartis. Just to be a little bit provocative, I think it’s a big issue, because if I would roll this out, let’s say, in a large trial population, 20% is quite a lot. So I would be alarmed, personally, and certainly look into this in more detail, maybe even do a retest in a slightly larger group.
Okay, so retest it. Anyone else?
I wonder, do you have the social demographic information about the patients. That may be helpful to understand if it could be linked to study level of the patients or this kind of thing, education level, this kind of thing, and potentially reassess the need to rephrase the question or something like that.
That’s a very good point, and this is a made-up scenario, so in this case I don’t. But yes, you do want to look at that. Were the three participants perhaps part of some group that might understand something differently? It might be cultural or some other factor that’s influencing them, and that could be important in different ways.
I actually would not say that there is anything critical at that point in time. When you look at the questions, they’re sometimes hard to understand anyway, so it’s, as you said before, a question of what the knowledge of the people is, how much they really know about their pain and these kinds of questions. And it depends on the kind of questionnaire; think of our tax forms, which we complete once a year, where you need to read some of the questions quite often to really understand them. So if at that point in time they really said afterwards, I’ve understood, then I think it’s fine.
Okay. So we’ve heard three different things, and I’m going to move on to the next one, but I’ll say that one of the good things about qualitative research is that you can vary your interview guide a little. So if you’re doing 15 participants and the three people are within the first six, that may be something you want to probe on more and ask everyone about in later interviews, to really find out: is it just three people who brought this up spontaneously, or, when you ask everyone else, how do they really interpret it? But you can see there are different ways of dealing with it. You could interview more people, saying three is a lot, let’s keep going, we maybe haven’t reached saturation in terms of the amount of evidence needed. Or you could say, eh, it’s not that much.
Okay, so now you’re conducting a study. One of the inclusion criteria is having been diagnosed with either type 2 diabetes or major depressive disorder. So in this study you’ve got people coming into the room for one day. You’re doing a screening on them first to make sure they actually have the condition, and then you’re doing the cognitive debriefing right after, so they remember the screening questions, which may have been about depression. And they start saying, well, I don’t know whether these questions are about my general health or about my depression, even though the instructions of the survey say this is about your general health. Do you think that’s an issue to address? What would you do? Would you actually change the survey over it? Would you add instructions before the survey that say, I want to really point out that this is about your general health and not specific to this condition? I’m going to keep going, so I won’t go through it, but I think it’s an important question to think about. The type of thing people are coming up with makes a difference: what type of problem are they having?
And here is another, more serious thing; it’s not just about the understandability of a word or an instruction. Let’s take an SF-36 item that has three response choices. One participant out of 15 says, this is not enough response choices, I fit in between these. This is actually a very common comment. So in this case, what would you say? Before, our scenario had three people out of 15 reporting something; now we have only one. So will people agree that, from a qualitative perspective, every person’s opinion counts? Or would you say, no, this is just one person’s opinion?
So I think this is an issue with the instrument itself; it has nothing to do with the modality. And this is what I sometimes observe with cognitive interview studies: there seems to be a misunderstanding, in my mind, that it’s about the validated instrument. And it isn’t. What we’re trying to do is show that patients will give the same answers on paper and on the electronic form. So this is really an instrument question; it’s got nothing to do with the mode, so I wouldn’t be concerned about this at all. And actually, it’s a great reason to use electronic, because you can’t scribble between the options as you can on paper.
So a really great point is that there are lots of different types of cognitive debriefing studies. Some are done as part of instrument development and validation in a particular population. Some are done, as other people will talk about, in terms of translatability and then some are done for mode equivalence. And really you do have to think about what are the type of changes being recommended.
And unfortunately, I was going to go through all of this in—I thought I had an hour. Did I go through a whole hour? I’m sorry, guys. So I’m not going to be able to do mean score comparison and how you determine equivalence, but just to say that when you do equivalence testing, there are many different statistics you can use. The main two that we find across the meta-analyses are the ICC and the mean difference. Ideally, if you had a choice, you would do both. And you really need to compare the mean difference to the instrument developer’s recommended minimal important difference, not just ask whether it is statistically significant, and then decide how you determine whether or not equivalence holds. So afterwards, I’d be happy to talk with anyone about how you interpret some of these things. But I have to stop now, I’m sorry.
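[Editor's note: to make the two statistics the speaker names concrete, here is a minimal Python sketch of a two-way random-effects, absolute-agreement ICC(2,1) across two modes, plus a mean-difference check against the developer's minimal important difference. The scores and the MID value are made up for illustration, and a real analysis would also place a confidence interval around the difference rather than comparing point estimates alone.]

```python
from statistics import mean


def icc_2_1(paper, electronic):
    """ICC(2,1): two-way random effects, absolute agreement, single scores,
    for n subjects measured under k = 2 modes."""
    n, k = len(paper), 2
    subj_means = [(p + e) / 2 for p, e in zip(paper, electronic)]
    mode_means = [mean(paper), mean(electronic)]
    grand = mean(paper + electronic)
    # Mean squares from the two-way ANOVA decomposition
    ms_rows = k * sum((m - grand) ** 2 for m in subj_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in mode_means) / (k - 1)
    sse = sum(
        (x - sm - mm + grand) ** 2
        for sm, row in zip(subj_means, zip(paper, electronic))
        for mm, x in zip(mode_means, row)
    )
    ms_err = sse / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )


# Hypothetical crossover scores on a 0-100 scale
paper = [50, 60, 70, 80]
electronic = [52, 58, 72, 78]
mid = 5.0  # hypothetical minimal important difference from the developer

diff = mean(electronic) - mean(paper)
print(f"ICC(2,1) = {icc_2_1(paper, electronic):.3f}")
print(f"mean difference = {diff:+.1f} vs MID = {mid}")
```

The point of the pairing is that the two statistics answer different questions: the ICC captures agreement of individual patients across modes, while the mean difference against the MID asks whether any systematic shift between modes is large enough to matter clinically.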
So thank you very much, Michelle, for a really thought-provoking discussion. The bit that really stuck out to me was how the decisions you make early on in the process have such a dramatic effect on its output. Real thought-provoking concepts there: the thinking that goes early on into the design of what you want to achieve has a practical, physical impact on the outcome. So, any thoughts or questions for Michelle at this point?
Michelle, you are going to be around at lunch, I presume. So Michelle will be around at lunch; those of you who want to continue this discussion, please head over to her then.