- Qualitative and Quantitative Equivalence: What are they and how do we measure them? (Paul Swinburn - Mapi Group)
- (00:27) Thoughts on Equivalence Testing: Can we do better? (Jason Lundy - Outcometrix)
- (00:48) How Much Evidence is Enough Evidence? (Paul O'Donohoe, CRF Health)
Okay. So I’ve been asked to talk a little bit about qualitative and quantitative approaches to equivalence, and why we should care. We should care, it’s a nice thing to do.
Okay, where to begin, where to begin.
We’re at a time now where technological innovation is really accelerating, and we’re seeing old-fashioned media giving way to arguably superior technologies at an ever-increasing rate. A good example would be vinyl giving way to CD giving way to digital downloads, or VHS giving way to DVD, that kind of thing. And so it’s easy to see that there are potential benefits to be had by embracing these new technologies.
And I’d like to just take a minute to think about what we’re doing when we’re completing pen and paper PROs. It does, to me, just feel a little bit antiquated. It does feel like something that I don’t do an awful lot of in other respects and parts of my life. Completing pen and paper forms maybe just feels a little bit of its time now.
So essentially we’re in this situation where most of the instruments that we’re working with, the vast majority of them, are really of another time compared to where the technology is. And we’re starting to realize that ePRO does potentially offer us all of these new benefits, and there’s a weight of evidence building up which supports this idea. But I wouldn’t call these instruments legacy instruments; that would be unduly harsh to a lot of the great work that’s gone into developing them. The exception, perhaps, is the sort of early-’80s, back-of-the-napkin, clinician-developed instrument that we all know and love and work with.
So what we need to be thinking about is how we can use these instruments effectively. But what are we saying in terms of equivalence? We could maybe refer to these instruments as vintage, in that they haven’t been developed with this specific medium in mind, although the term vintage does bring to mind ideas of a sort of skinny-jean-wearing hipster finding a copy of the SF-36 on eBay or something like that.
Okay, so we have this issue of taking existing measures that have been developed on pen and paper and migrating them to electronic format. Now, unfortunately, as we’re aware, it’s not simply a case of just scanning these in or exporting some files and being able to have them just appear on a website or a PDF. We need to think carefully about how we’re going to modify these instruments in order to maintain the original intent of the items and to hopefully end up with some kind of comparable measurement.
And so there is a question here about what we mean by equivalence. And this definition, I think, is as far as I’d be happy to go as a kind of generally accepted idea; I don’t know about other people. What I think we need to say, maybe, is that what we think holds true for the pen-and-paper version holds true for the electronic version, warts and all. That’s not to say there won’t be problems with the electronic version, but as long as those problems are roughly consistent with the problems that exist in the previous version, I don’t see the problem. Investigating the properties of the original instrument can itself raise issues if taken too far, so we need to think about a pragmatic idea of equivalence.
So there’s some evidence that’s been gathered over the years with respect to the comparability of pen-and-paper and computer-administered instruments, so I’ll just take a very quick look at this. Now, ePRO or eCOA can’t claim to be the first context in which there’s been interest in these research questions. Really it goes back to the proliferation of computers in the early ’80s: the days of Reagan, the Cold War, Charles and Di, happy times. What they found originally, largely, was that the impact of computer presentation on the ability to perceive and process information was negative. It was largely detrimental to the way in which people perceived the information and the way in which they processed it, even to speed of reading, recall, and in-depth comprehension. And this all led to an increased working load for people; it was quite an unpleasant experience.
But research from the last five to ten years has gone quite a long way to show this is no longer the case. And this is possibly due to quite a number of reasons. For one thing, the technology now is incredibly good: most people’s phones have resolutions that are way beyond the screens of yesteryear, computers’ processing power is far greater, there’s very little response lag anymore, and things like multiple input errors are far less of a concern. And I think we all have an increased familiarity with the use of computers and electronic devices. It’s not just the amount of contact we have with these devices, it’s the nature of our use of them. And the idea that inputting personal data into an electronic device is a novel thing doesn’t really carry anymore. I mean, I remember the original days of internet banking, and it seemed like magic to be able to transfer money between different accounts, like something from the future. But now you wouldn’t give it a second thought, and it’s just that kind of level of technological familiarity. The first time anybody used wifi it was a bit like magic, and now it’s like a human rights violation if you haven’t got 3G. It’s that kind of process; it just normalizes very quickly.
Yeah, there are always going to be little populations where people are less familiar. I mean the classic example is the granny who won’t touch computers, possibly because she’s skydiving. But you know, how can I put this nicely, that problem is going to go away, right. Nature will solve that one.
The way in which we approach computers, our relationship with computers, our relationship with technology is changing. And this situation that we have now is not forever. This is a situation, this is a problem of our time right now. What we’re doing is we’re designing to solve a problem now that probably isn’t going to exist in 10, 20 years time, perhaps. Even the nature of the way in which we interact with computers, we might well just be presented with a disembodied electronic voice asking us for input on how we feel at a given point. We just don’t know what it’s going to be like.
And you know, the research does seem to suggest that now we do have a preference for this kind of data input. I mean I don’t know about how often you fill forms in, but there’s quite a bit of research that suggests that people just have a personal preference for entering things into a computer, whether it’s that degree of anonymity that it presents, whether it’s the simple mechanical issue to do with people having a preference for typing over writing, or the convenience of being able to input data into different locations. There are numerous reasons why people might prefer it, but it does seem to be a recurring trend in the data.
And this idea of inputting personal data, it just doesn’t seem like what it was. I mean, a lot of people waste enormous amounts of time filling in those little quizzes on Facebook, the ones that say, which character are you from Game of Thrones, that kind of thing. People just give away information like anything; it feels like a natural thing. And I think we shouldn’t underestimate the impact of that. With respect to ePRO particularly, there’s a reasonable amount of research that’s been produced over the last ten years. Obviously, the most quoted paper is probably the Gwaltney review from ’08, which I think used about 46 studies, with about 200 different individual comparisons between electronic and pen-and-paper administration, and came to the conclusion that there’s really rather little evidence to support a meaningful difference between administration modes.
Okay, so let’s say in principle that there is potential for equivalence between electronic and pen-and-paper administration. The question then becomes: what do we need to do in order to adequately convince those who are charged with reviewing evidence that has been produced using these different methods?
There’s a table that many of us know and love from the ePRO Task Force paper, which tries to set out the level of evidence required to support equivalence, based upon the degree of modification that occurs during the migration process. As you can see, for a minor level of modification, the recommendation is to deploy cognitive debriefing methods. For moderate modification, there’s equivalence testing, which in this context is described as a quantitative study. And for substantial modification, the recommendation is full testing, effectively validation of a new instrument.
So what do these different methodologies imply? What are we trying to achieve, and what does it mean? If you’re not really familiar with it, the basic assumption of the kind of qualitative research we’re discussing is that the meaning lies within the words and not necessarily within numbers. This doesn’t hold true for all forms of qualitative research, but we’re talking about a very specific kind. Qualitative research is really just an umbrella term for a large number of hugely variable methodologies which attempt to provide insight from an individual perspective. Now, historically these kinds of methods have been viewed with a degree of suspicion by scientists, and that’s probably due to the logical positivist underpinnings of 20th-century science. But I personally find it quite refreshing that the FDA in particular are not only willing to entertain this type of research but actively encourage it. Because it really is the best mechanism for getting any kind of detailed understanding from the perspective of patients. It would be all too easy to rely on retrospective assumptions drawn from statistical methods, but those have very limited explanatory power, and I think this is where qualitative research really comes into its own in this context.
So what does qualitative research in equivalence really mean? Well, in this context we’re largely talking about cognitive debriefing or interview-type approaches. And these are very good for everybody concerned, really. They are very fast, efficient, and they give very good levels of information, particularly where there isn’t a problem. And this is one of the points that I think we need to consider, is many people don’t think there is a problem here, okay, or a problem that’s meaningful. And this kind of cognitive debriefing approach very quickly can show you whether there is any indication that there will be a problem. And so this is a good thing for us to be doing. And it’s quick to do, it’s cheap to do. Where there is a very trivial migration, very straightforward, you can conduct this kind of work quite quickly with as few as five patients sometimes. And where there is a problem, you may end up conducting several rounds of cognitive debriefing. And that, you know, you could end up conducting a couple of dozen interviews. But it’s a very effective mechanism for providing detailed individual understanding.
What we’re trying to do is have participants view the items and give us their opinion as to whether the intent of the item is true to the intent of the original item as we see it, whether the nature of the task is the same, and whether they would ultimately give the same response. Now, there are different approaches you can take to cognitive debriefing, and this is one of the problems: there’s a huge lack of standardization in the way these kinds of methods are applied, and I don’t think that variability is strictly necessary. I think we could do more as an industry to create a more standardized approach. And this is important: if we want to convince regulators that this is a non-issue, then we need evidence that’s easily assessable, rather than these myriad, disparate approaches. So I think a more structured approach to the use of cognitive debriefing could give us a better degree of evidence.
So there are different approaches you can use to cognitive debriefing. We can have participants complete both versions of the instruments and give us their impressions as to the differences between them in terms of the items. We can have them scored, look at the items where there is disparity and investigate those more in depth. Or, if you’re very fortunate and you’ve got a very well documented instrument that you’re migrating, for the item definitions you can simply have them complete the electronic version and see if it’s consistent with their approach. And you know, approaches like think-aloud, where you have people articulate their thought processes as they’re completing questionnaires, they can provide enormous amounts of insight.
So quantitative research, as most people would be aware, takes a kind of alternate approach based very much on the idea that the story is within the numbers. And that’s trying to give you a kind of robust and objective account of the empirical evidence. And this approach is better suited really to hypothesis testing. And this raises some questions, which I think we’ll come onto. But you know, ultimately what we want to know is, does this data look like this data. And beyond that, would we draw a different conclusion on the basis of the available evidence, were we to use one mode rather than the other.
Okay, so equivalence studies are suggested as being the kind of way to approach the issue with the sort of moderate modification. And these fall very much into the kind of classical experimental design kind of approaches. Either sort of crossover or parallel groups. With a crossover design, which is the usual kind of preferred design, due to the kind of efficiencies that it offers, we can get a good idea about whether there is a meaningful impact on the scoring, on the ultimate overall scores of groups on the basis of the mode of administration. And parallel groups tend to be kind of the fallback option, I think, for most people or in situations where you’re using more than one different mode. And what we need to ask here is, is systematic bias being introduced into the results by the migration process.
Now, with respect to the nature of the investigation that takes place here, there are different things that can be done. I think what’s recommended in the paper is quite reasonable: the use of ICCs and tests of mean difference gives us a reasonable indication of the comparability of the ultimate scoring. Using approaches like DIF from IRT gives us the ability to uncover different trends that exist within the responses. And coming back to the hypothesis-testing idea, we need to think about what exactly the hypothesis is, what the question is. Are we saying that comparability holds true for overall scores, or are we saying that it holds true for women, for young women, or for young women with serious disease? Half the time, the level of understanding that exists about the original instrument won’t extend anywhere near this far. So we’d essentially be creating our own problems by mining the data to an extent that undermines our own efforts to establish equivalence. I think we need to be very careful about what it is we’re stating; we need a good, consistent narrative for what we’re saying when we say this meets this requirement.
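To make that concrete, here is a minimal sketch in Python of one such ICC, the two-way random effects, absolute agreement form often labelled ICC(2,1), computed from one total score per subject per mode. The function name and the scores are illustrative assumptions, not from the talk.

```python
from statistics import mean

def icc_2_1(paper, electronic):
    """ICC(2,1): two-way random effects, absolute agreement, single
    measures; one total score per subject per mode (k = 2 modes)."""
    n, k = len(paper), 2
    rows = list(zip(paper, electronic))
    grand = mean(paper + electronic)
    row_m = [mean(r) for r in rows]          # per-subject means
    col_m = [mean(paper), mean(electronic)]  # per-mode means
    msr = k * sum((m - grand) ** 2 for m in row_m) / (n - 1)  # subjects
    msc = n * sum((m - grand) ** 2 for m in col_m) / (k - 1)  # modes
    mse = sum((x - row_m[i] - col_m[j] + grand) ** 2
              for i, r in enumerate(rows)
              for j, x in enumerate(r)) / ((n - 1) * (k - 1))  # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical total scores for ten subjects under each mode
paper      = [34, 41, 55, 62, 48, 70, 29, 66, 51, 58]
electronic = [36, 40, 54, 63, 47, 71, 30, 64, 52, 57]
print(round(icc_2_1(paper, electronic), 3))
```

In practice you would report this alongside a confidence interval, and justify the acceptance threshold against the instrument’s own test-retest reliability rather than picking a number out of the air.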
Okay, so what does this mean? Well, essentially I think the qualitative and quantitative approaches are giving you two different answers to that same question about whether we would draw different conclusions on the basis of the available data depending on the mode in which we’ve administered the questionnaire. And this use of mixed methods, marrying up in-depth, detailed qualitative research about individuals’ approaches to individual items with the overall statistical approach to establishing equivalence, is a good place for us to be. I think it’s a reasonably defensible position. But where we meet a real challenge is in terms of the pragmatism of these approaches. We need a better idea of exactly how we want to implement them, and to what extent, in order to be able to argue that this is sufficient, because we run the risk of creating an insurmountable barrier by trying to establish equivalence between every possible platform and every possible device.
You know, there are different contexts of use, different populations of interest. And this all-encompassing idea that you can simply tick a box and say this is absolutely equivalent under all circumstances for everybody is unlikely to happen; realistically, there are always going to be exceptions. So what we need, I think, is a kind of core equivalence, an area within which we feel comfortable saying that equivalence exists between these different modes and devices. And we can build on that: we can conduct certain study designs to elaborate on particular areas that are potentially problematic, using both qualitative and quantitative methods. But if we don’t do that, we’re in real trouble, because we possibly provide an active disincentive for the inclusion of ePROs. And that would be a travesty, really, because we’re talking about a small amount of uncertainty that might be introduced into the data as a result of the methods, uncertainty we think is probably not a big issue. By overreacting to it, we risk denying ourselves potentially very important information that would come directly from patients about treatment benefit and treatment burden. And that would be a rather perverse thing to do.
And there’s always going to be a degree of uncertainty; there’s a degree of uncertainty in every single action. A patient might report being chronically depressed because they came into the clinic and their dog just died. There’s always going to be a certain degree of uncertainty in every sphere. And the question becomes: is the level of uncertainty we’re going to have in certain situations meaningful, is it important? That’s the difficult thing to assess. But we need to build up a body of evidence, using qualitative and quantitative approaches, to say that’s probably not the case. And I think perhaps we should have a generic body of evidence concerning equivalence across a range of different studies, so that the evidence is cumulative. There are certain issues we can see, using qualitative approaches, that we’re going to face again and again; we’re not doing something radically different every time we do this, and certain things cause problems. If we can document those things, and see what the best solutions to those problems are, then I think the hard work is all done.
And so yeah, I mean I think there’s a place for both qualitative and quantitative research methods in advancing equivalence, but we need to understand what it is that we mean by equivalence before we can really go about having this kind of statement, this broad statement that we say this is equivalent. It’s equivalent enough under what circumstances and in what context.
Okay, I think I’m out of time.
Paul did a great job of setting the stage here for us in terms of what we generally mean when we’re talking about equivalence. I want to take a maybe slightly more provocative angle, but one echoing some of Paul’s sentiments around the lack of standardization: about what we mean by the terms, what we mean by the methods. Really, I think that’s the limitation we have right now in coming to a consensus where we can finally say that when we adapt instruments to electronic modes we don’t need to go through this process, because the preponderance of evidence demonstrates that we still have good measurement.
So at the risk of being a bit redundant (and Paul laid this out very nicely), assessing equivalence has been recommended, and in the context of the FDA and the clinical trial at least, it’s probably a really good idea to collect some of this evidence to make sure that if you did a migration from paper, you haven’t introduced bias into the scores produced by that instrument. And again, we know that the level of equivalence evidence required is based on the changes or modifications you make to the instrument. As I said, there’s a lack of consensus, and I’m going to dig into this a little deeper, around both the qualitative and the quantitative methods. So the same table here, but there are a couple of things I want to point out.
I think it’s our job as—my job—as a measurement scientist to say, we don’t want to end up in the very bottom category. That’s a really bad outcome, and so for me, I want to make sure that that’s not where we’re headed. The other thing that I’ll say is that we’re typically in a situation where we’re not really making drastic changes. We’re usually in the minor change category if we conduct a good migration.
I think the last thing I’ll say is that you’ll see usability testing listed out here separately from the other methods, and you see it at all three levels. I rarely see usability testing being conducted separately; it’s typically rolled up into the conceptual equivalence testing that we do in cognitive interviewing. Paul has seen instances where it has been conducted as a stand-alone test, but because it’s not complex and doesn’t take a lot of time, it’s easy to just add it in. So I get confused when we start parsing out these different types of what I consider essentially pilot testing. And now the new mixed modes task force report has introduced yet another type of testing, feasibility testing. I don’t really know what any of these labels mean, except that I do know what our intent is whenever we go into a cognitive interview or a quantitative study. So I just focus on that and try not to use the labels, because I think they mean different things to different people.
So again, with qualitative study design considerations, a lot of questions come up. Do we need to have the subjects answer both versions of the instrument? Is the comparison between formats really that helpful? Is it even necessary? How do you truly assess the responses if you only have them complete one version? And, as I’m going to illustrate on this slide, there’s no consensus. Paul touched on this a little bit, but I have found that there are basically three approaches being used by measurement scientists in our field. Option one: subjects complete both instruments, we assess where there might be differences, and then we probe them about the cause of that variation. This is a qualitative approach, of course, but I think we can improve on it: increase the sample size, and now we can do some quantitative analysis on it as well, a truly mixed-methods study. Option two: you give them the new mode only, and you map it back to an item definition table. And option three: we just ask them about the changes we made to the instructions or the instrument, and we assume that if we transferred the content faithfully, none of it is actually going to matter to them.
Similarly, in quantitative studies, when we’re trying to ensure that the migration hasn’t produced bias in the scores, we have other issues. We have sample size and the choice of statistical methods, which differ. If you choose a crossover study, what is the optimal time between administrations? That’s a debate that had been going on for a number of years when the task force report came out. The third point, which I think is one of the things that makes this incredibly difficult, is that paper is a really poor gold standard. Most of the time, when we take a paper instrument and move it to an electronic mode, we are actually improving the data collection, so we’re comparing back to what I consider to be the inferior version, and using that as the gold standard. That creates some conceptual problems for us. And then, what do we do about diaries? The reason we give diaries is that we expect variation in the symptoms someone experiences. So you actually want to see that variation; you wouldn’t expect to see equivalence. And that creates some challenges as well.
So, digging a little deeper into the sample size issue: when you look into the literature for sample size calculations for a crossover design, you have a choice of a few different approaches, and typically for a two-administration crossover you end up in the 50-60 subject range. Other approaches have also been recommended. With parallel groups, obviously, the sample size is going to go up. And if you go into the IRT realm, the sample size really increases.
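As a rough sketch of where numbers like 50-60 come from, here is a back-of-the-envelope normal-approximation sample size for a two-period crossover equivalence (TOST) test, assuming the true mode difference is zero. The function name, margin, and SD values are illustrative assumptions, not from the talk.

```python
from statistics import NormalDist
from math import ceil

def n_crossover_equivalence(sd_diff, margin, alpha=0.05, power=0.80):
    """Total N for a 2x2 crossover TOST, normal approximation,
    assuming a true mode difference of zero.
    sd_diff: SD of within-subject (paper minus electronic) differences.
    margin:  equivalence margin, on the same score scale."""
    z_a = NormalDist().inv_cdf(1 - alpha)            # one-sided alpha
    z_b = NormalDist().inv_cdf(1 - (1 - power) / 2)  # beta/2: both halves
    return ceil((z_a + z_b) ** 2 * (sd_diff / margin) ** 2)

# e.g. a margin of 0.2 SD units and a difference SD of 0.5 SD units
print(n_crossover_equivalence(0.5, 0.2))  # lands in the 50-60 range
```

Tightening the margin or demanding more power pushes N up quickly, which is one reason the choice of margin matters so much.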
The point that I really want to make here is that we’re not powering these studies for a P value. Equivalence testing reverses the roles of alpha and beta, and that’s a fancy way of saying we don’t care about the P value; we care about the confidence interval and, in the case of IRT, the stability of the parameter estimates. For statisticians, that might be something difficult to get their heads around; psychometricians might be very comfortable with it. So just keep that in mind if you’re having one of these conversations.
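A minimal sketch of that confidence-interval logic: two one-sided tests (TOST) on paired paper-minus-electronic differences, concluding equivalence only when the 90% CI sits entirely inside a pre-specified margin. The data and margin are made up, and a real analysis would use the t distribution rather than this normal approximation.

```python
from statistics import NormalDist, mean, stdev
from math import sqrt

def tost_equivalent(diffs, margin, alpha=0.05):
    """Conclude equivalence when the (1 - 2*alpha) CI for the mean
    paper-minus-electronic difference lies entirely within +/- margin."""
    n = len(diffs)
    se = stdev(diffs) / sqrt(n)
    z = NormalDist().inv_cdf(1 - alpha)  # 1.645, giving a 90% CI
    lo = mean(diffs) - z * se
    hi = mean(diffs) + z * se
    return (lo, hi), (-margin < lo and hi < margin)

# Hypothetical differences from 20 subjects, margin of 2 score points
diffs = [1, -2, 0, 1, -1, 2, 0, -1, 1, 0,
         -2, 1, 0, 1, -1, 0, 2, -1, 0, 1]
(lo, hi), ok = tost_equivalent(diffs, margin=2)
```

Note that a nonsignificant difference test would not be evidence of equivalence here; only the CI falling inside the margin is.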
Again, on the retest interval, there’s a lot of debate: what’s too short, what’s too long? With a shorter interval we worry about the memory effect; are we seeing this high level of equivalence because we re-administered the questionnaire within ten minutes? Longer intervals give us more confidence that we’re washing out that memory effect, but they’re burdensome: the subject either has to come back to the site, or you have to send them home with a device, or they’re mailing paper back to you. And it gets messy.
And so, good old paper. I tell people this all the time: if you implement a poor paper measure on an electronic platform, you have a poor electronic measure. There’s nothing we do when we put items on an ePRO device that is going to improve the measurement you get out of them. They were either developed well or they weren’t, and in some of the cases we have with legacy instruments, we don’t necessarily know a lot about their development history. The literature might be a little light in terms of what evidence supports their content validity and their performance. So you end up in situations where you have items that don’t perform as one might expect, or as has been advertised. Actually, Keith Wenzel and I were involved in a study of the QLQ-C30, and we had Neil Aaronson, one of the developers, as part of that study. We were getting really low ICC coefficients on some of the single items in that measure, and we said, Neil, we don’t have equivalence here. He said, but those items never performed well. And we were like, why are they in the instrument? He said, well, because we thought they were important. And I said, okay, but do you have any ICCs we can use to support the stance that we’re holding the equivalence test to a higher standard than the original paper version was held to? He said, well, they’re not published; we weren’t able to get this into our manuscript, but we have some data, and we sort of anecdotally know that these items don’t perform very well. So this is something that is mentioned in the task force report and that I want to reiterate: we are often in situations where we don’t have a lot of experience and we don’t actually have the data we need to compare our new electronic mode back to the paper.
And so we either have older data from different samples or, as I mentioned, we don’t have ICCs or mean differences to compare back to at all. So then you ask: as part of our testing, should we just generate new data? Should we generate new retest data on that old paper measure? You could do this by extending the crossover design to a three-period design: essentially you would be able to have the retest of the paper as well as the electronic mode within the three periods, and you could also compare between the modes. So that might be an approach. I don’t know that I’ve seen people use it, but it might be an approach if, for instance, you need this data to take to the FDA.
Electronic diaries. If you’re moving from a paper diary to an electronic diary, which I highly recommend, this is almost surely an improvement. I would go a step further and say the reason this poses an issue for equivalence is that we’re not actually interested in equivalence anymore. We don’t want it to be equivalent to the paper diary, and I would argue that you probably have a new measure now, which you should go out and test to see how that electronic diary performs. While that may not be the preferred option, because it introduces a lot of new expense and timelines and testing, I think it’s probably a very wise choice if you find yourself in this situation.
Okay. We touched on this a little bit yesterday. I think this is kind of hot button—maybe not a hot button issue—but this is something that people get really passionate about. And it’s this issue of are we talking about measurement improvements or are we talking about measurement comparability. The task force—and I’m complicit in this—co-opted the term measurement equivalence, that has that sort of well understood longstanding meaning in psychometrics to mean measurement invariance. And if you—as I mentioned yesterday during one of the discussions—if you talk to a roomful of psyshometricians about this, we’re going to start talking to you about DIF, and all the different ways that we can detect differential item functioning. And so I cite this—and I mentioned this, Lori McLeod wrote the editorial to the new mixed modes task force that appeared in Value in Health in July last year. And she’s sort of—she’s a psychometrician so she’s bringing up this point again. She’s also taking the task force to task, actually, a little bit, on their definition of what measurement equivalence is. They have a very light definition of that—that the scores between measures are comparable. And so she’s also touching on this issue that—I think Paul was mentioning this—when we go down this path and we start doing IRT and DIF detection, we’re not just going to be looking at are there items that function differently on paper and on electronic measures, because we’re going to get a lot more of a picture, and you might actually start to uncover things that you didn’t necessarily want to know about how poorly your favourite measure works. So I’m of the opinion that in most situations it’s really hard to fail an equivalence study if you do a good migration. I think that we should spend our time conducting good migrations, and that’s where your efforts should be. 
Spend them there, rather than trying to get something done quickly and rushed into an equivalence study. I haven't seen many cognitive debriefing studies that come out and say, well, these two measures are equivalent. They come out and say, well, we need to make a couple of tweaks to the way the system interacts with the subject, and so we go, we do that, and we go on with our lives. In the same way, with the quantitative studies, I've seen very few that haven't demonstrated equivalence. The ones that I have seen, we were actually able to correct by just using the right statistics. And that's another interesting point, because there are ten different ICCs that show up in the literature, and about four others that you could consider quasi-ICCs. So, you know, you have about a 10% chance of getting it right, so good luck. But it's not really that hard; we just have to be careful about what we do.
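To make the choice-of-ICC point concrete, here is a minimal sketch in Python, using made-up scores rather than data from any study mentioned here. It computes two of the standard Shrout-Fleiss ICCs from the same n-subjects-by-2-modes score matrix: when the electronic mode sits a constant 2 points above paper, the consistency form ICC(3,1) is a perfect 1.0 while the agreement form ICC(2,1) falls below the usual .75 cutoff.

```python
def anova_mean_squares(x):
    """Two-way ANOVA mean squares for an n-subjects x k-modes score matrix."""
    n, k = len(x), len(x[0])
    grand = sum(v for row in x for v in row) / (n * k)
    row_means = [sum(row) / k for row in x]
    col_means = [sum(x[i][j] for i in range(n)) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return n, k, ms_rows, ms_cols, ms_err

def icc_consistency(x):
    """Shrout-Fleiss ICC(3,1): blind to a constant shift between modes."""
    n, k, ms_r, ms_c, ms_e = anova_mean_squares(x)
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)

def icc_agreement(x):
    """Shrout-Fleiss ICC(2,1): penalises a constant shift between modes."""
    n, k, ms_r, ms_c, ms_e = anova_mean_squares(x)
    return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

# Hypothetical scores: the electronic column is exactly 2 points above paper.
scores = [[1, 3], [2, 4], [3, 5], [4, 6], [5, 7], [6, 8]]
print(icc_consistency(scores))          # 1.0
print(round(icc_agreement(scores), 3))  # 0.636
```

The same data "passes" or "fails" equivalence depending purely on which ICC you pick, which is exactly why the choice has to be justified up front.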
I mentioned this earlier, but I think this is something we can improve on, to make the process a little more robust: collect the qualitative and quantitative data in the same sample. I didn't insert the reference here, but there was a recent article in Quality of Life Research about the sample sizes we typically use in cognitive debriefing—you see five to fifteen as the standard recommendation. That study concludes this is not enough, that 15 subjects don't give us enough information to really know anything about what we're asking, and it recommends samples of a minimum of 30. Well, when I saw that, my experience with C-Path, developing the new measures in the PRO Consortium, supported it as well. We were trying to make decisions about our items, and we had a cognitive debriefing study of 15, which is not enough—it doesn't give us enough evidence to really say, yes, we're going to change this item, or no, we're not. So you end up in these situations where you're carrying forward a lot of baggage, and if we had just collected a little bit more data we would know with a little bit more certainty.
So when I saw that paper, first of all, I was happy, because now I have a citation I can throw back at the folks who say we only need five. But I also started thinking, okay, now we're in the realm where I become a little more comfortable conducting some quantitative analysis on that data. So I feel that if we set up the study in an intelligent way, administering both modes at the beginning, before we've potentially influenced responses by asking a bunch of questions about the items, we can still conduct our interview, we can do some quantitative equivalence testing, and truly, as I mentioned, have this mixed modes approach. I feel that would give us a much more robust set of evidence to take forward and say: there's nothing funny going on here, and we feel pretty comfortable that we either did a good migration or developed a good new electronic instrument.
As it relates to BYOD: this surely is going to require post-hoc analysis of equivalence. If you were to use BYOD in a clinical study, I would be remiss not to recommend that you conduct some test of DIF in that large sample—you're going to have enough people to do it, hopefully. And I think this is a question the FDA is going to ask you. So in BYOD, I would almost certainly recommend that you do that. As for preparing on the front end, I've heard the small-medium-large recommendation: test a small, a medium, and a large screen. I think that's a little overkill. I'm more concerned about the smaller screen size, particularly with certain types of scales like the 11-point numeric response scale, so I would want to know how that smaller screen is performing. I would hope that it looks and feels the same on the larger screen as it does on the small one, and you're just scaling it up. And so I don't feel it's necessary to test those different iterations. But I do feel it's necessary to do it on the back end.
And then, what if you don't have equivalence? What if, say, you're in a rare disease, and you had no choice: you had to administer paper at one site or country and ePRO in another, and you need every shred of data that you've got. It's not all lost. There are ways we can equate the scores, and this gets complicated. It probably makes some people uncomfortable, and it's not going to be a fun conversation to have with the FDA: you're going to have to walk them through equating and IRT. You may also have to add mode as a variable in the statistical model. So that's a situation I hope you don't find yourself in, but you may not have a choice, as I mentioned. So think about that if you're approached with something where you're not sure you have the confidence to say you have equivalence. There are ways we can get around it.
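Adding mode as a variable in the statistical model can be as simple as one extra column in the design matrix. Here is a minimal sketch, assuming a plain linear model and hypothetical, noise-free data (none of these numbers come from a real study): score is regressed on treatment arm plus a paper/ePRO mode indicator via ordinary least squares, so the mode effect is estimated explicitly and the treatment effect is adjusted for it.

```python
def solve(a, b):
    """Solve a small linear system a x = b by Gaussian elimination with pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y."""
    n, p = len(X), len(X[0])
    xtx = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)] for r in range(p)]
    xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(p)]
    return solve(xtx, xty)

# Hypothetical, noise-free data: columns are intercept, treatment arm, and
# mode (0 = paper, 1 = ePRO); the true mode effect here is 0.8 points.
X = [[1, t, m] for t in (0, 1) for m in (0, 1) for _ in range(2)]
y = [2 + 1.5 * t + 0.8 * m for (_, t, m) in X]
b0, b_treat, b_mode = ols(X, y)
```

With real data you would of course use an established package rather than hand-rolled least squares; the point is only that the mode effect becomes an explicit, estimable term rather than an unmodelled nuisance.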
And I will now turn it over to Paul.
Hey guys. How are we all holding up? Second day. Thank you for sticking with us. I think Paul and Jason provided some really good insight into the intricacies and challenges of the issue we face here, and into some of its complexity. But I'm going to take a step back and basically ignore all of that wonderful detail they provided, because in thinking about this, I was trying to get my head around, in layman's terms, exactly what we're worried about. And I think no matter which of these wonderful and confusing methods that Paul and Jason were discussing we take, what we're really trying to do—or what we're really worrying about—is that there's some kind of fundamental difference between answering a question on paper versus answering it on a screen, which will change how participants respond to such a degree that the data is no longer really comparable. And I think that's really kind of—in layman's terms—the fundamental thing we're worrying about.
And what is this fundamental difference between the paper and the screen? I don't think it's content, because we're not really changing the content beyond some very, very minor potential updates, for example from "tick" to "select" or "tap". So I don't think it's anything to do with the content. Possibly we might have some concerns about physical layout. We know physical layout can have an impact on how participants respond to questions. In the example on the screen, on the lefthand side, participants are more likely to choose response options from the top row, as compared to when the response options are laid out linearly. We also know issues around the visual midpoint can play a role. For example, on the lefthand side, where the non-substantive options are separated from the substantive options, the visual midpoint tends to line up with the conceptual midpoint, so participants are more likely to answer at the conceptual midpoint, compared to the righthand side, where everything is merged into one and the visual midpoint has worked its way down a bit. And we can see this in horizontal layouts as well, where the visual midpoint is shifted when you have uneven spacing: so with the top option, for example, you see participants more likely to respond "possible" as compared to the bottom option. So physical layout can have an impact on how participants are responding.
But I don't think that's really what we're worrying about here, because we're worrying about changes such as what we have on the screen, where you're going from this very nice paper version to a very, very similar electronic version. Even with more substantial changes, where you're going from paper to a handheld with only a single item on the screen, we're not introducing any of those systematic sources of variation I was showing you before in regards to layout. In fact, there's quite a strong argument to be made that by simplifying and focusing patients on a single item, you might actually be getting better quality data from them, because they're not being distracted by everything that's going on around a specific question on a piece of paper.
We also seem to be worrying about changes like going from ticking a box with a pencil to tapping on the screen. And I really feel like this is not something that’s going to fundamentally affect how participants are responding to a question. But of course, you’re all wonderful scientists in here and you’re naturally going to ask, but how do we know. And I think the simple answer is because we’ve tested it again and again and again, using all these various methods that are being discussed. And there’s definitely discussion to be had about exactly what it is we’re testing there, but as I said overall I think we’re just looking at, is there some kind of fundamental difference going on here.
In fact we've tested this so much that we're now able to do meta-analyses. As has already been discussed, there was the Gwaltney paper which, as everyone is aware, covered 65 studies, saw very high correlations, and came to the conclusion that there is really limited difference between paper and electronic administration of patient-reported outcomes. But because you're all good-quality scientists in here, you might say you're still not convinced, that you want more evidence. That paper was back in 2008; technology moves quickly, and things might have changed in the meantime. I think that's a very good point. So how about another meta-analysis, which basically took the Gwaltney approach (this is work we did with ICON) and extended it, looking at all published equivalence studies from where Gwaltney cut off, in 2007, all the way up to the end of last year, to see whether we're seeing any kind of difference in published equivalence testing. And unsurprisingly, we found 72 studies, and we're not seeing any differences compared to what Gwaltney was reporting.
Particularly interesting from my point of view in this study: all you great scientists out there might raise the very fair point that only equivalence tests that succeed are going to make it into the publications. Any time you run an equivalence study and you don't get equivalence, you're just not going to publish the paper. Very fair point. So we also took a look at any potential for publication bias. I'm not going to even attempt to describe the statistics, but the long and the short of it is, we didn't find any particular evidence of publication bias. And I think one of the most interesting findings was this: even if you assume the lowest correlation found in the meta-analysis, which was .65, for all the missing studies, you'd still need 123 such studies to lower the overall average correlation to below .75, which is the kind of accepted cutoff.
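That fail-safe argument is just arithmetic on a weighted mean, and can be sketched in a few lines of Python. Note the pooled study count of 137 is the 65 (Gwaltney) plus 72 (the ICON update) mentioned above, but the assumed mean correlation of 0.84 is an illustrative value chosen to reproduce the figure quoted in the talk, not a number reported here.

```python
def failsafe_k(k_obs, mean_r, r_missing, threshold):
    """Number of unpublished studies, all at correlation r_missing, needed to
    drag the pooled mean correlation below the threshold. Solves
    (k_obs * mean_r + k * r_missing) / (k_obs + k) = threshold for k."""
    return k_obs * (mean_r - threshold) / (threshold - r_missing)

# 137 pooled studies = 65 (Gwaltney) + 72 (update); mean r of 0.84 is assumed.
print(round(failsafe_k(137, 0.84, 0.65, 0.75)))  # 123
```

In other words, the file drawer would have to hold well over a hundred uniformly worst-case studies before the pooled correlation dipped below the .75 cutoff.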
So there's a lot of evidence out there for the comparability of paper and electronic. But again, your great scientist brains might say: that's all well and good for a study with a single device, which is what the meta-analyses were investigating, paper compared to a single device, with all the participants using the same device. As we've touched on in numerous presentations throughout the two days, we're now moving into this wonderful world of BYOD, which means we're facing a situation where there are potentially going to be tens, if not hundreds, of different devices being used within a single study. And as Jason and others have said, we can't possibly test all devices and all potential screen sizes for all questionnaires; it's impossible. But despite Jason maybe thinking this is a bit of overkill, we can look at the range of screen sizes from small to large. And I think it's important to bear in mind that the meta-analyses I discussed were looking at paper versus electronic, and there's a very strong argument to be made that paper to electronic is a much bigger step than a screen to a slightly different sized screen. So I think we can carry over the confidence we got from the paper-to-electronic evidence and think there's probably not going to be that much difference across a range of screen sizes.
But of course we can test these things, and this is something we have started to do, and have done with our friends at MAPI, to really investigate any kind of conceptual impact of varying screen sizes on how participants interpret and respond to the questions. We did a qualitative study with 20 participants across a wide range of ages and levels of comfort with technology, and basically we assigned them to interact with a vaccine symptom diary across three different devices—small, medium, and large. We provided the devices, and we also got them to use their own device where possible. And basically we wanted to look at how the participants interacted with the app in the first place, but also how the different screen sizes affected how they were interpreting the questions and whether they felt it would affect how they would respond. I think it should be noted that this is probably a more extreme scenario than we'll actually face in real life, where, depending on the approach you're taking to BYOD, a single participant within a study is largely only going to be using a single device across the entire study. Here we were looking at single participants using a whole range of devices, so if we can show there are limited differences in this quite rigorous scenario, then one would hopefully be able to feel quite confident that in the more restricted scenario, where they're only using a single device in the study, you could get quite comparable results.
And thankfully, the feedback we got in the study was that, really, the screen sizes weren’t impacting how participants were interpreting and responding. A couple of nice quotes: “They all look exactly the same. I’m really comfortable with smaller phones but would be happy to use all three and it wouldn’t affect the answers.” “There would be no differences in answers. I could comfortably do it on any device.” In fact there were only three out of the 20 participants who suggested that there could potentially be some kind of impact of the different screen sizes. One said, “I would probably answer the same on all devices, but if you are not used to a small phone you could miss something and answer differently.” Another interesting one said actually, “You may concentrate more on the big screen if it was flat on the table, so could possibly give different answers.” And another participant felt that they may go into more detail on a device that was easier to use, because typing might be easier.
I think it’s probably important to point out that at least two of these points would be overcome with familiarity, so if the patient was using their own device, two of these points are kind of moot.
And 18 out of the 20 participants also reported they would answer the same on the app as they would on paper. But wonderfully, all expressed a preference for the app.
So what does any of this mean? I think we have substantial evidence of the equivalence of paper to electronic administration. I don’t think that’s—to my mind it’s not an argument, really, anymore. As Paul touched on, I think there’s probably specific cases to bear in mind. I think that probably more relates to specific participants’ ability to interact with the device rather than necessarily how they respond to the questions and the content themselves. So I think that’s a slightly different argument than the broader one we’re having here.
I think we have strong reason to believe that allowing participants to answer on their own device will produce consistent data, but that's something I happily admit we need to look at in a bit more detail. Jason gave some good suggestions for how we might start doing that. But it really raises this question of: how much evidence is enough evidence? I have some concerns that we are being somewhat stifled by entrenched feelings about how PROs should be developed and how high quality paper PROs are. I have concerns that people, to a degree, pick on electronic because it's a big, obvious, shiny difference, while kind of ignoring more fundamental issues, larger sources of variance within the questionnaires themselves, independent of the platform they're administered on. And I think it's a responsibility of us as an industry—certainly in the cases where we're using provisioned devices—to start shifting the burden of proof onto the people who are saying you need to demonstrate equivalence for this case, by asking: why do you think I need to demonstrate equivalence, when we have all these wonderful studies and all this data showing that, unless you're doing something pretty weird, you're going to be getting the same kind of data? That's a conversation we really need to be driving, and I think we just need to keep driving home this message of: how much evidence is enough evidence?
[END at 01:01:45]