Presented by Sonya Eremenco, Evidera
So what I wanted to do today was to give a very quick overview of the ISPOR Mixed Modes Task Force Report that was published in 2014 (some of you may be familiar with it, some may not) and to go into a little bit more depth in a couple of areas from that report where we didn’t really have room in the report, or we didn’t really have enough evidence. And I think there are some areas that have since come out. And since I’ve been working on this mixed modes topic, I’ve really come to realize how complex it is, and, you know, what I think sponsors need to be thinking about when it comes to the potential of using mixed modes. And I also wanted to present a few case studies just to look at what happens in kind of the real world when this happens, because it’s one thing to talk about mixed modes in kind of a hypothetical situation, but then when you see what really happens, it opened my eyes to, again, how can we minimize risk and then what do we do with the data when we get it. Because what I’m starting to realize is, not that it’s completely inevitable, but there’s a very strong possibility that mixed modes could happen in a clinical trial. And I think sponsors need to be prepared to deal with it.
So just again, I’m going to try to do a quick overview of the task force report for those who may not be familiar with it. It was an effort that took place between 2010 and 2014. It was done under the PRO SIG at ISPOR, but it did really focus mostly on electronic, mixing paper and electronic, mixing between electronic. And the goal was to develop a good research practices report to address more than one mode of data collection for a specific instrument. So we didn’t really touch on mixing different instruments, that’s a whole different issue, not really our concern. We wanted to provide recommendations to ensure the quality and comparability of the PRO data and to look at analytical approaches for evaluating and pooling mixed modes data.
And one of the other really important pieces of this report was that we were focusing on PROs being used to support label claims. So we were not talking about use of PROs in general and we’re not talking about use of, you know, PROs in clinical practice, but very strictly, how are we going to, you know, use mixed modes in this regulatory context. Again, mixing the same instrument via different collection modes in a trial. And our concern was that non-equivalence between modes could mean the difference between success and failure for the endpoint in a trial, because treatment effects could be attenuated by the mode differences. And I’m going to look at an example later on about how this can really happen. And one of the other really important points in this report is that mixing modes should really be a decision that you make intentionally, it shouldn’t be something that happens by accident because a site decided they didn’t like ePRO and they wanted to use paper. And that has happened in the past. I don’t know if it’s still happening, but it has happened in the past. So it’s really something that, because of the potential consequences, should be approached with intention, it should be done in a methodical way.
So our overall recommendations were to select the appropriate modes for the trial. And there are pros and cons, there’s a whole host of different modes, including paper. And we go into a lot of reasons, you know, and we didn’t necessarily discourage any particular modes, but we tried to illustrate the differences between them and when certain ones may be better or worse. We also talked about performing a faithful migration, because you need to migrate before you mix. And there’s a whole lot of recommendations in the report that I won’t go into today about what that means and how you do it and how that’s really the key to leading to equivalence, which could then support mixing the modes. The report talks at length about equivalence, which I’m not going to go into a whole lot today. And we talked about different study designs to use to evaluate equivalence. But we really felt strongly that before you mix you should evaluate, do you have equivalent modes so you can feel confident in using mixed modes to pool the data. And then, if you meet the above conditions, then it may be acceptable to implement the mode or modes in the trial, with a very strong recommendation against mixing paper and electronic diaries; I’ll talk about that a little bit more later. But you know, assess the risks of the other combinations that you might be considering, and if you’re deciding to mix appropriate modes, plan it, implement it carefully, and mix at the country level or higher; I’ll talk about that a little bit more as well. And you need to plan to assess statistical issues and poolability of the data. Don’t just assume you can do it. So those were the high-level recommendations.
But I kind of want to take a step back into why are we concerned about this, like why did we bother doing this task force report in the first place. Because, believe it or not, when we were forming (we started off as a working group, and as we were forming, we had to present a proposal to the ISPOR Health Science and Policy Board to get approval to become a task force) we actually got pushback. We were told, oh this is a settled issue, you know, we see all these equivalence studies, why do you need to do this report? And we really had to present a case for, no this is not a settled thing, there could be consequences in mixed modes, it’s something that affects industry, it affects eCOA providers, it’s something that we really need to look at. So we were approved, and we became a task force, and we moved on to work on the report. And one of the factors that led to the report was the first task force report, the report on measurement equivalence in PRO; that report did not explicitly address mixing modes. They kind of put it to the side and said, we’re just focusing on equivalence. The other factor was that the PRO guidance in 2009 explicitly states that they, the FDA, intend to review the comparability of data obtained when using multiple data collection methods or administration modes within a single trial to determine whether the treatment effect varies by methods or modes. So we felt that this alone, this statement in the updated guidance, was enough to say we need to help sponsors be prepared to address this expectation. Because the FDA didn’t say don’t mix modes, they didn’t say you can’t do it, they just said we want to see that the treatment effect doesn’t vary. And it wasn’t apparent that anyone was paying attention to this.
So that’s kind of what led to the development of the report. And of course, as I’ve probably hinted at, the biggest problem in mixing modes is when you’re dealing with paper.
So again, another reason that we thought this report was important was to really break down what do we mean by mixing modes. And it’s, again, very complex because there’s so many different modes at play, there’s different (I call these levels) of mixing, which have different potential consequences. But I really want to focus on within the clinical trial mixing, because that’s I think where the consequences are the greatest, potentially. And this can happen very easily between the countries if a certain country isn’t ready with the, you know, the ePRO and they need to get started because patients are ready to enrol. We’ve seen sites within a country decide, I can’t deal with this ePRO, I’m just going to stick with paper, I don’t think my patients can use ePRO, I’m going to stick to paper. We see subjects on a site, something could happen, a subject says they don’t want to do it and we don’t want to lose that subject so let them do paper. Within a subject, this is unfortunately more common than I’d like to believe, where for various reasons the subject may start out with one mode and end up in a different mode. And the time points within a trial is the biggest example of that. And I’m still seeing unfortunately quite a number of examples where a baseline might start on paper, for various reasons, and then the endpoint measurement is done in a different mode. And I don’t think a lot of attention is paid to what happens during that change.
So this is a table, it is taken from the task force report, where we attempted to equate sort of the level of risk with the different levels of mixing. The higher levels, where it’s between product development programs or different clinical trials within a program, at that level you’re probably more likely to compare the data between the trials. You may not necessarily be pooling the data together. But it’s really within the clinical trial where the goal is to pool, you really want to use as much of the data as possible to evaluate your treatment effect. And our recurrent point in this report is, is it safe to do so when you're mixing modes.
And one of the questions that did come up a lot in our efforts with this task force report was, well if you found equivalence in the instrument between modes—I’ll go back to this slide—what’s the problem of mixing within a subject? Why is that such a big deal? Aren’t the modes equivalent? Isn’t it okay? And we always have this—or I always had this sense that that’s not good enough, like there’s something about within-subject mixing that could compromise the data. And it’s really difficult to determine, if you’re in a trial setting where you’re expecting the patient to change, what’s driving that change if you’re mixing modes. So we strongly discourage that within the subject, but we didn’t really have a clear cut explanation beyond that.
And really, part of our rationale is that we want to avoid measurement error in clinical trials, that’s why they’re so controlled. And as stated before, measurement error reduces statistical power and attenuates the ability to detect real change. So even though it can be feasible to mix modes operationally speaking, we really don’t want to add any unnecessary measurement error to the trial design by using modes that are not sufficiently equivalent.
One of the areas of pushback that we got from the ISPOR board was that, well what about all those other sources of measurement error. And we agreed to acknowledge them, we were like, that’s true, that’s absolutely right. There are a number of other sources of measurement error in clinical trials that you can’t avoid. Translation and cultural adaptation. You know, we do our best with that but there are issues there. You could have cultural biases due to different experiences with the condition culturally. And variability in patients’ ability to reflect and provide a response, just the nature of the PRO measurement. And our point was, mixing modes is a preventable or potentially avoidable source of measurement error, so why wouldn’t you take that out of the equation when you know there are other sources that are going to affect the data.
But on the other side of the equation, the other side of the spectrum, there might be reasons why you do want to include additional modes. And one of the biggest issues is in dealing with missing data, and especially in areas like oncology where missing data is likely to be not at random. And there’s a lot of push to have either a backup mode in oncology or something to allow patients who can’t come to the clinic another way to complete their assessments from home. So we’re seeing that as an area where mixed modes could be beneficial to prevent that type of missing data when the patient’s getting too ill to come in. And it may actually provide a broader and more representative sample, because you can include populations that may or may not have internet access, so you’re not excluding anyone, and there’s the potential for a lot of patients to actually choose the mode that they prefer and are most comfortable with. And there may be cases of hearing impairment, or patients who are not computer literate, where they might have difficulty using the chosen mode for the study. So by mixing you could actually increase the data and allow for comparison of results that did not use the same mode for data collection. So there are some pros. I don’t want to say it’s all bad, but you know.
So because of this, we really wanted to be pragmatic in this report, we wanted to say okay, what should we be thinking about if you are going to mix modes. And getting back to the equivalence point, there’s quite a bit of literature now on equivalence. A lot of it tends to focus on comparisons at the group level. And I want to get into why that could be misleading. But again, this published literature may not generalize to all PROs and all clinical trial contexts. And it tends to focus on what I call the formatting differences and not the procedural differences. And procedural differences are more related to how using paper and electronic is different, how you know, on paper there’s nothing restricting you from answering anywhere on the page, or answering multiple items, multiple responses to an item, you can leave things blank. And on electronic, all that is restricted, so you can only answer one response to a question, your answer is mandatory, you can’t leave things blank, so there's just those kind of differences that do lead to differences in response that are legitimate. So our overall conclusion was, do not vary data collection modes within a trial that you seek to pool or compare data without prior evidence of measurement equivalence. You know, avoid it unless it’s absolutely necessary.
And just a couple of quick points about the within-mode issues or things to think about. Paper and electronic is the most risky because of the procedural differences I just talked about. We strongly discouraged it with paper and electronic diaries because we know the FDA discourages paper diaries anyway. So it’s just really too risky. In some circles, they are not even considered equivalent anymore because there are just so many differences between paper and electronic diaries. And from a feasibility standpoint, you have to actually develop a separate data entry system for that paper data entry, and that’s going to add cost and add complexity.
So we’re seeing more and more examples of mixing screen-based modes. This could be less problematic because the interface is more similar, the procedures are more similar, it may be easier to implement across modes. And you know, I have concerns about web-based just because of such great variability in screen size and issues with connectivity that so many patients may have, not to mention you just really can’t control the screen in the same way that you can in other modes. And the third option that still may happen is mixing visual and auditory, which would be an example of IVR with a web-based or visual mode. There are still some questions around whether IVR and visual are equivalent modes, so I think this is an area where quantitative equivalence study evidence really needs to be demonstrated to really feel comfortable mixing those modes.
So some of our more operational recommendations were plan ahead to the extent possible, don’t do this by accident, don’t do this by default. Allow time to conduct the measurement equivalence evaluation to support your decision to mix if it’s not yet available. And defaulting is very risky, mainly because you don’t have that measurement equivalence, you don’t even know for sure, you know, in a lot of cases if the modes are going to be equivalent and if it’s actually acceptable to mix them.
So prior to the trial, things you can do to kind of mitigate the risk. Again, evaluate measurement equivalence if you don’t have that data. Assess the risks of which combinations you are considering using. You may need to consider powering the study according to the results of equivalence evaluation so that you can adjust for the presumed error in a sample size calculation. So what this means is, you may have to increase your sample size to accommodate this increased risk due to mixed modes, which is kind of contrary to what you really want to be doing in sample size calculations. It’s really important to have appropriate training for both modes or as many modes as you have in the trial. I think that’s a place that when people think about mixed modes they overlook that. And there needs to be criteria for which countries, regions, sites, subjects, are permitted to mix, so that investigators are not just doing it, sort of willy-nilly and making up their own rules, that there’s actually a process and rules and a rationale for this mixing approach.
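As a rough illustration of the sample size point above, here is a minimal sketch, assuming a two-arm comparison of mean change scores, a normal approximation, and mode-related error that simply adds independent variance to the outcome. The function names and all numbers are hypothetical and are not taken from the task force report; a real calculation should be worked out with a biostatistician.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(delta, sd, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-arm comparison of means (normal approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided type I error
    z_beta = z.inv_cdf(power)
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / delta ** 2)


def inflated_sd(sd, mode_sd):
    """Outcome SD after adding independent mode-related error (assumed additive variance)."""
    return (sd ** 2 + mode_sd ** 2) ** 0.5


# Hypothetical inputs for illustration only.
base = n_per_arm(delta=5.0, sd=15.0)
mixed = n_per_arm(delta=5.0, sd=inflated_sd(15.0, mode_sd=5.0))
print(f"single mode: {base} per arm, mixed modes: {mixed} per arm")
```

Under these assumptions, the required sample size grows in proportion to the total variance, which is why extra mode-related error translates directly into more patients.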
During the trial, when you’re implementing the trial, it’s really important to minimize the types of issues that can lead to this accidental or default mixing, such as inadequate training or infrastructure. Problems with infrastructure that can lead to defaulting to paper. Because that’s I think one of the biggest things that we’re trying to avoid in this report was, try not to default to paper, try to avoid it if possible. If it is planned, which is our recommendation, managing when and where the modes are used. And again, try to minimize that within-site or within-patient because it’ll just be so much harder to actually tease out what’s driving differences at that level. There is—you know, I hate to say it, we might hear about this more tomorrow—the potential for technology failure. And so you do have to plan for contingencies in that case. If the patient has a diary that dies, you know, what are you going to do to replace that diary as quickly as possible or some kind of recovery. In diary studies, we did caution, try to use another option besides paper as a backup because of the many issues with paper diaries. But it’s just something that you do need to think about and plan for, because again that helps avoid this ad hoc mixing. And small proportions of data may not impact study results, but sensitivity analyses are necessary. So people wanted us to give them sort of a cut-off, when is it okay, when is mixing okay. And we kind of threw this in there, but less than 10%, but I’m still not 100% sure how comfortable I am with that, but you know, I think it just reflects the reality that the likelihood of mixing is, as I said, higher than I would like it to be so we have to be pragmatic.
And then finally, you know, you need to be thinking about your statistical analysis plan, how you’re going to address analysis of mixed modes a priori to evaluate the treatment effect. And I think that’s an area where I don’t see sponsors thinking about this. I’m seeing some better attempts to manage, but very little forethought into how am I going to deal with this data. So once you have your data, what do you do with it. So we recommended comparing results by mode, similar to how you test translations for poolability, because, like I said before, translations can be a factor in measurement error and so you generally want to test them for poolability before you pool the data together. Same thing with modes. Assess mode as a variable for analysis, similar to site comparisons. So again, just making sure that the pooling is acceptable. Consider conducting sensitivity analysis to evaluate the effect on data and treatment effect of including or excluding the alternative mode data. And this can be really helpful when you have such small numbers at the site or subject level that you can’t really do sufficient comparisons of them as a group.
And we couldn’t really go into much more detail in the report, and part of the reason was there are just so many different options and we recommended working with a biostatistician to determine the appropriate statistical techniques.
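To make the include/exclude sensitivity analysis a bit more concrete, here is a minimal sketch. It assumes each patient record carries a treatment arm, a collection mode, and a change score, and it uses a plain difference in mean change as the treatment effect. The record layout, the arm labels, and the data are all hypothetical; an actual SAP would specify a proper model with mode as a covariate.

```python
from statistics import mean


def treatment_effect(records):
    """Difference in mean change score, active minus control."""
    active = [r["change"] for r in records if r["arm"] == "active"]
    control = [r["change"] for r in records if r["arm"] == "control"]
    return mean(active) - mean(control)


def sensitivity_by_mode(records, primary_mode):
    """Treatment effect with all data, with primary-mode data only, and per mode."""
    result = {
        "all_data": treatment_effect(records),
        "primary_only": treatment_effect(
            [r for r in records if r["mode"] == primary_mode]
        ),
    }
    for mode in sorted({r["mode"] for r in records}):
        subset = [r for r in records if r["mode"] == mode]
        arms = {r["arm"] for r in subset}
        # Only estimate an effect when both arms are present in this mode.
        both_arms = {"active", "control"} <= arms
        result[f"mode={mode}"] = treatment_effect(subset) if both_arms else None
    return result


# Hypothetical records for illustration only.
records = [
    {"arm": "active", "mode": "tablet", "change": 8.0},
    {"arm": "active", "mode": "tablet", "change": 6.0},
    {"arm": "control", "mode": "tablet", "change": 2.0},
    {"arm": "control", "mode": "tablet", "change": 4.0},
    {"arm": "active", "mode": "paper", "change": 3.0},
    {"arm": "control", "mode": "paper", "change": 3.0},
]
results = sensitivity_by_mode(records, primary_mode="tablet")
for label, effect in results.items():
    print(label, effect)
```

If the estimate moves noticeably when the alternative mode data are dropped, or the per-mode estimates disagree, that is a signal to look harder at poolability before relying on the pooled result.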
I’ve added the last point, it wasn’t in the report, but what I’ve realized since then is that even if you did put something in your SAP about the mixed modes analysis you’re planning to do, you still may need to revisit it and possibly revise it because your assumptions may change. You may have planned to do certain things that might not have been what happened in the trial, and you would need to adjust and accommodate that.
So our best case scenario for mixing modes in a clinical trial is doing it in a planned way with the appropriate choice of modes. And I really want to stress that because we didn’t mean to say (I don’t know if anyone ever got this impression) that if you plan to do it with paper diaries, that’s okay, that’s not what we’re saying at all. But planned, controlled, try to have prior evidence of measurement equivalence, if you can, at the individual level. And have your analytic methods specified in the SAP.
So that was the recap of the report, so what’s happened since then. So I have a couple of case studies based on some situations that I’ve encountered in the past year or so that I thought would be interesting to look at. So what happens in the real world when you mix modes.
So this is the case of a site-based assessment. So this is an HRQOL instrument administered at baseline and week 24 and your endpoint is change in score between these time points. That’s nothing unusual. In this case the study started off with the tablet but various issues arose that I can’t get into, and some sites had to switch to paper. In fact, quite a number had to switch to paper. And for this instrument there had been qualitative comparisons done, so there was this qualitative equivalence available, but there wasn’t any published evidence of statistical equivalence between the modes. And the mixing of modes had occurred after randomization. So that’s one of the things that people are always saying, well doesn’t randomization take care of it, and very often the mixing happens so much later that you have no idea where anyone is, so it’s not going to take care of the problem here. So this is what ended up happening. So we had paper to tablet, about 250 patients, which is kind of an unusual situation because we figured once the tablet was having problems, no one would go back to it later, but that’s apparently what the largest group did. And we have tablet to paper, 120. Paper to paper, 210. And then tablet to tablet, 20. So unfortunately, that tablet to tablet group is so small that there’s very little that we can actually do with it. So what we recommended in this case was to start off with looking at equivalence between the tablet and paper modes at baseline. So we just group, you know, the tablet group and the paper group. And then we suggested assessing treatment effect by mode to ensure that it’s the same regardless of mode with the larger treatment groups. And if the treatment effect looks different, then you would be able to actually do further analyses within the different treatment groups, with the three larger samples, to probe further into what’s going on, is there one group that’s driving the difference more than the others.
But you’re limited at that point in what you can do with that tablet to tablet group.
So the second scenario was a mixing with electronic diaries. This is a symptom diary completed for 12 weeks, and in this case the average score over a week is compared between baseline and week 12. And in this case the study started with a tablet, but for various reasons the sponsor decided to move to a smartphone device. And they actually had two studies, so one of them allowed mixing within the patient, and the other study had a little bit more control over their enrolment and they were able to have everyone complete on the tablet, the ones that started on it, and then the newer patients were enrolled on the smartphone, so you had two discrete groups. Similarly to the first one, we had another case where we had some qualitative evidence of comparability between the modes but there was not enough data in that study to compare them statistically, so we’re still in this, you know, not quite sure if these two modes are equivalent, although because they're both electronic there's a better chance of that. So our sample sizes are pretty small again. We have in the first study 10 who did all tablet, 20 who went from tablet to smartphone and then 120 all smartphone. The second study, a larger study but a smaller sample doing all tablet, 50, and then all smartphone, 360. So in this case we took a similar approach of first looking at baseline data to compare the two groups and see if the difference is comparable to the responder definition, if it may exceed it, because that’s really the concern that we have. And it was possible, if you had patients who had switched the modes close enough to and prior to receiving treatment, where they’re still in that stable state, you might be able to calculate an intraclass correlation coefficient, which is typically what we use to look at equivalence. And then again, assessing treatment effect by mode to ensure that it’s the same regardless.
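For readers less familiar with the intraclass correlation coefficient mentioned here, the following is a minimal sketch of one common form, the two-way random effects, absolute agreement, single measurement ICC, often written ICC(2,1). The talk does not say which ICC form was intended, so that choice is an assumption, and the paired scores are invented for illustration.

```python
def icc_absolute_agreement(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    `scores` is a list of rows, one row per subject, one score per mode.
    """
    n = len(scores)      # subjects
    k = len(scores[0])   # modes
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Hypothetical stable patients measured on both modes: [tablet, smartphone].
identical = [[90, 90], [95, 95], [100, 100], [105, 105], [110, 110]]
shifted = [[90, 96], [95, 101], [100, 106], [105, 111], [110, 116]]
print(icc_absolute_agreement(identical))
print(icc_absolute_agreement(shifted))
```

The second data set ranks patients identically on both modes but carries a constant six-point offset; the absolute agreement form penalizes that offset, which is the behaviour you want when the question is whether scores from two modes can be pooled.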
So I don’t have the actual results of either of these two case studies, they’re still in the discussion stages, but I’m hoping in the future to be able to present some of that to see what actually happened with these mixed mode situations.
But one of the biggest concerns is really the impact of mixed modes on responder analysis. And what we do keep seeing in the literature is that small differences in means between modes have been found, typically so small that they’re within the minimally important difference, and they seem to be not meaningful because the means are analyzed at the group level. What we’re also seeing is the FDA is interested in change at the individual level. The responder definition is typically meaningful change at the individual level, and that’s very different from the equivalence that we’re finding at the group level. And my question was, can differences between modes that are even less than the responder definition impact the identification of responders when mixing occurs within a patient. And I happened to come across (this was actually quite accidental) this article that was published quite a while ago. It was a comparison of the FACT-L on paper to a handheld, and it was a cross-sectional study so they didn't do it in a diary format, it wasn’t a take home, published in 2008, and there were some very interesting results. They essentially said they found equivalence at the scale level and that for the majority of patients (it was FACT-L and EQ-5D but I’m more interested in the FACT-L results here) at the individual item level the responses were within one point of each other between the two methods, so that sounds like that’s not a big difference, that’s small. However, when you add them up, the total score differences were a lot bigger.
So even though they said they didn't find a significant difference between the mean total scores between the two methods, they found that 29% of the patients had a difference greater than +/- 6 points for the FACT-L total score, and for the TOI, which is the Trial Outcome Index, a smaller set of items, 19%, and then for the lung cancer subscale, 40% had what’s considered a potentially clinically meaningful difference at the individual patient level. And what the authors of the article said was that it’s really unclear if this difference is due to the mode or due to test-retest variability within an instrument, and then they said their study design did not allow them to evaluate that. But this is what concerns me about mixing at the individual patient level, that there actually can be what’s considered a significant difference.
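The gap between group-level and individual-level agreement can be shown with a small sketch. The paired scores below are invented for illustration and are not data from the 2008 article; only the six-point threshold comes from the discussion above. The function name is hypothetical.

```python
from statistics import mean


def group_vs_individual(paper, electronic, threshold):
    """Return the mean paired difference and the share of patients whose
    individual difference exceeds the threshold in either direction."""
    diffs = [e - p for p, e in zip(paper, electronic)]
    mean_diff = mean(diffs)
    share_beyond = sum(abs(d) > threshold for d in diffs) / len(diffs)
    return mean_diff, share_beyond


# Hypothetical paired total scores for ten patients.
paper = [97, 102, 88, 110, 95, 105, 92, 99, 101, 90]
electronic = [105, 95, 89, 109, 103, 98, 93, 100, 100, 91]

mean_diff, share_beyond = group_vs_individual(paper, electronic, threshold=6)
print(f"mean difference: {mean_diff:.1f} points")
print(f"patients beyond +/- 6 points: {share_beyond:.0%}")
```

Positive and negative individual differences cancel in the mean, so a near-zero group difference says little about how many individual patients would be scored differently by the two modes.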
So just to try to illustrate this point a little bit better, I wanted to do some—these are simulations, this is not real data—but the scores in this report showed around 97 for the paper and around 100 for the electronic. So I kind of used that as a starting point, and I just wanted to do some like playing around with the data, what does this really look like, what can the impact be. And so what the different bars are, the green bar is the baseline score, and PE is paper to electronic, EP is electronic to paper. So baseline versus the end of study assessment. The yellow or gold bar is the ePRO adjustment, what the score was because of the ePRO. And the end of study score is what the score was on the device at the end of the study. So in the first example, it’s a case where if you’re just looking at the baseline and end of study, it looked like you had a ten-point difference. But because of the ePRO adjustment—which again, I just, you know, using that six-point difference—that when they answered it on ePRO, the ePRO made them answer six points higher. The real difference is only four points, that’s actually not a meaningful difference. But it looked like it was.
Second example, where it went from electronic to paper, the reduction in score was more due to paper, but the total score looks the same—or you know, the score difference, there was no difference in score—looks the same, but it actually was a score increase if it was done in the same mode.
Third example is more decrease in score. Again, it looks like it might have not really—you know—because of ePRO the score difference is actually much greater than it looks like in the paper to electronic, and then vice versa with the electronic to paper, the score difference was actually—looks like it’s eight, but it’s really a little bit less than that. So just an illustration to think about, can the impact of these score differences at the individual level really impact what we’re seeing and then actually change our conclusions at the end of a trial.
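The arithmetic behind these illustrations can be written out as a short sketch. It assumes, as in the simulation described above, that the electronic mode reads a fixed six points higher than paper and that higher scores mean improvement. The responder threshold of six points is an assumption made here for illustration, and the function names are hypothetical.

```python
def same_mode_change(observed_change, path, mode_shift):
    """Change the patient would have shown had both assessments used one mode.

    path: "PE" (paper at baseline, electronic at end of study),
          "EP" (electronic at baseline, paper at end of study),
          "PP" or "EE" (no switch).
    mode_shift: assumed points by which electronic scores exceed paper scores.
    """
    if path == "PE":
        # The end-of-study score is inflated by the electronic mode.
        return observed_change - mode_shift
    if path == "EP":
        # The baseline score is inflated by the electronic mode.
        return observed_change + mode_shift
    if path in ("PP", "EE"):
        return observed_change
    raise ValueError(f"unknown path: {path}")


def is_responder(change, threshold):
    """Responder if the improvement meets the (assumed) threshold."""
    return change >= threshold


SHIFT = 6      # assumed mode effect, from the simulation described above
THRESHOLD = 6  # hypothetical responder definition

# First example: an observed 10-point gain, paper to electronic.
first = same_mode_change(10, "PE", SHIFT)
print(first, is_responder(10, THRESHOLD), is_responder(first, THRESHOLD))

# Second example: no observed change, electronic to paper.
second = same_mode_change(0, "EP", SHIFT)
print(second, is_responder(0, THRESHOLD), is_responder(second, THRESHOLD))
```

In the first case the patient looks like a responder only because of the switch; in the second, a real improvement is hidden by it. Either direction of misclassification is possible when the mode changes within a patient.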
So conclusion of this part of the session is that we do see evidence from equivalence studies that can point to potential issues of variability within subject. I see it sometimes reported but not always reported in the literature. And it may point to the need for actually a little more research to look at using a longitudinal study design to look at, is this an issue of the patient—you know, is it a sustained change over time or is it something more about the instrument itself regardless of the modes. And there are lots of different ways to look at that, but that was something that was made—repeatedly stated in the Ring 2008 article was that their study design didn’t answer that question. But to me, this is just further evidence of the concern that if your responder definition rests on change at the individual subject level, which I think it’s most likely to be, that we again, very strongly recommend avoiding mixing modes within a subject because it’s going to be so difficult to tell what’s really going on and whether you have type one or type two error or whether there’s a legitimate change.
So I just wanted to provide the link to the task force report, it’s freely available from ISPOR. And that’s all for my talk. Thanks so much for listening.
[END AT 31:43]