Speaker: Rebecca Prince, Corporate Translations Inc.
I’d like to bring up Rebecca Prince from CTI, as the General Manager there, to talk about assessing the efficiencies of cognitive debriefing in ePRO usability testing. Rebecca has had a lot of experience in this, she’s had over ten years of experience ranging in project management, patient interviews, process development, and heading up various production teams, so she comes with plenty of experience, and also one of her specialties is overseeing the methodologies for testing eCOA, all the measures of eCOA on various devices. So please welcome Rebecca.
Hi everyone. Firstly I would just like to say thank you very much to CRF for inviting me to talk. I’m here to talk about eCOA testing, which is a process of testing new eCOA with patients before it’s used in a trial. It’s used to assess the effects of migration from paper-based measures to an eCOA version. I’m very lucky to be following Michelle’s talk on that because I think by now everyone should have a really great understanding of the process and the background behind what we do.
At CTI, we follow ISPOR’s guidelines. They recommend cognitive interviewing and usability testing for measures that have undergone minor adaptations, which effectively is mostly paper to eCOA. So at the moment, there’s really no consensus on cognitive interviewing methods. CTI’s process, we use full cognitive debriefing, and that’s one of three common approaches listed in the ISPOR guidelines.
We’re always really interested in refining and strengthening our process, so we recently conducted a study to look at the effectiveness of cognitive debriefing within eCOA testing, and that’s what I’m here to talk to you about today. So what we’re going to cover is a brief overview of the testing process, an analysis of the debriefing data from some past eCOA testing studies, a review of the effectiveness of the process versus the input that it needs, and then some ideas for an adapted approach based on what we found.
So I’ll give you a quick overview of the eCOA testing process that we use first off. So the aim of the testing is to establish whether the migration from paper to eCOA has led to any changes in understanding on the patient’s behalf. What we do is, first in house, we review the new eCOA against the source paper version, and there we’re looking for any discrepancies between the two, any differences and potential issues, and we prepare our interview schedule based on that. We then do the actual testing, and that’s generally with 5-10 patients. So we use a three-step process. So first the patient fills in the measure on the eCOA device, and that’s typically tablet or handheld, or occasionally a computer if we’re doing a web-based version. And they do that using a think-aloud process. Then we do cognitive debriefing. And our definition of this, just to be clear on what we’re talking about here, is it’s comprehension-based probing so it’s very much centered around trying to find out what does this phrase mean to you, and then other probes as needed, so what do you understand by “past week,” for example, really looking into the core contents. So we’re seeking paraphrases or summaries or over-comments on every single line, it’s a standard blanket process for the whole measure. And once we’ve gathered all our data, we compare the results against the approved item definitions that we have in a concept elaboration guide that would either be created at the beginning of the process or be obtained from the developer. The third step is usability questions, and that’s both general to the whole measure and then tailored to particular elements and features.
So that was our processes. It gives us really strong results, and as I said, we’re looking to assess things and see how it’s really working for us. So we’ve carried out this review of the process, and the aim of the review is to assess the effectiveness of the debriefing process independently of the types of testing, and as I mentioned we do usability testing, really wanted to put that to one side and just look at this core procedure.
So analysis, we looked at the results from ten past projects. And as I said, we were just looking at the cognitive debriefing data only. So the data points that that gave us were all of the debriefing paraphrases, comments, descriptions given to us by the patients, and then any issues of things that they’d raised themselves, and also any observations and issues raised by the interviewers. We included all feedback in this summary, whether we would consider it necessarily valid or not. So I didn’t get into an analysis of which feedback resulted in changes to the eCOA, because in practice occasionally we see eCOA vendors are limited by software, or there are other things to consider in whether something ultimately gets changed. So we’ve just left that to one side because our initial focus in the interviews is really digging out as much data as we can and finding as many issues as we can.
Once we had our data, we calculated positive or confirming responses and then negative or critical feedback. And then we took that critical feedback and we’d break that down into three more categories. So looking at anything that was related to the source text in particular, and by source text I mean any content that’s unchanged between the paper version and the eCOA version. Then second category, anything eCOA-related, so specifically usability and functionality, anything that doesn’t owe to the understanding of the source. And then third is migration-related changes in understanding, which is a key thing that we’re looking for in this process.
These are the initial results of what we found. We looked at ten projects, and that covered 16 measures that we had tested, giving us 657 items and four and a half thousand data points. So we had a fair amount of data to work with. What we found is that 93% of those data points were either positive or confirming or neutral. And by that what I mean is that the paraphrases or descriptions or comments that they were giving us indicated that the patients’ understanding matched that approved item definition that I mentioned, so showing that they have a good understanding that is the same as we would expect it to be for the paper version.
That left us with 7% of what I’m going to call critical or negative feedback. And that’s a broad term, we’ve lumped in any indications of altered comprehension or difficulties in understanding, any usability issues, any other critical comments, we were really looking to cover as much as we could. So we had a really high level of positive data from this process, and that’s great. It’s also to be expected in this context because of course the migration process aims to change as little as possible. So what we’re really interested in is that 7% of critical feedback, because our aim is we’re really trying to avoid a box-ticking exercise, we’re not just looking to confirm, yes it’s equivalent, but really we’re looking to seek out as many issues as we can possibly find and take a really critical approach to it.
So as I mentioned we broke it down into three categories, the first being feedback on the source text. We found that 64% of all of the critical feedback that we had actually related to the source text. And again, that’s anything that’s unaffected by migration. Breaking that down further, 40% of that were indications that the content was unclear or confusing for the patients. So for example, terms like “disease activity” sometimes cause confusion. We had one patient commenting on the question that was asking about shopping and saying, well what kind of shopping are you referring to, because actually that will change my answer. A question about back pain at night raised some issues as to whether that meant after dark, after I’ve gone to bed, and so on.
A lot more comments also referred to the phrasing and the style of the source text. So people thought the questions were too long and wordy in places or didn’t like the grammar or just the overall style. We had a few, we had 13% of comments suggesting that the content, they thought, was unsuitable for some reason. So for example we had a uterine fibroids questionnaire that we were testing, where some patients thought that the past-three-months time frame was too short to see a pattern. We had a questionnaire that we tested in COPD patients with a list of rather strenuous activities that some patients told us were just well beyond the reach of people with COPD.
So all kinds of issues being raised, and the remaining feedback related to content that seemed redundant because of overlapping questions. Some seemed offensive to some people, issues with response options, and a few suggestions for additional questions.
So it was all really interesting stuff, but as you might expect, we can’t actually implement any of these suggestions in our process. It’s really not the aim of our process because we’re working with validated measures. The feedback might be of relevance in some cases to the developers, although that may or may not be pursued, and really it’s not the intended focus of the interviews.
Just to make a side note on this point, I’d like to raise feedback on the source text as a potential issue for sponsors. So the final reports that we put together for eCOA testing projects document all of the feedback that we received, and then that gets lumped in with information that’s submitted about the measures. And so it’s possible that in some cases, any unaddressed feedback on the source measures can actually raise concerns with regulators. And so we’re just seeing a very few instances where some sponsors don’t actually want cognitive debriefing included in the process.
So to summarize, just on this section on feedback on the source text, it comprises the majority of our critical feedback. It can be useful information but just not in this context, and ultimately it detracts slightly from the focus of the interviews.
Moving on to our next category of feedback on the eCOA version, 35% of all the critical feedback related to usability and functionality in some way. What I mean by that is any feedback on content or features that are specific to the eCOA version and that don’t affect comprehension of the original content of the measure. So you can see here a large amount, about 62%, related to instructions, and that’s the only thing existing in the measure that’s been adapted. And pop-ups, meta-text, and also quite a few cases where the instructions actually haven’t been adapted for eCOA but perhaps should have been. Twenty-two percent related to layout and formatting and that was things like buttons being too small or not being very clearly labeled, and difficulty in using a reformatted body map. And also then we’re finding feedback on navigation, the device itself, and on forced responses.
Admittedly, results of this type are not particularly the main aim of cognitive debriefing, because they’re covering aspects that would be looked at in the third stage of our process, which are the usability questions. However, we’ve generated really useful feedback. Feedback on the instructions really illustrates comprehension for us and gives us some idea on usability of the measure as a whole. And because we are looking at it line by line, there is a possibility that we’re actually capturing specific information that might not be covered by the other techniques.
So some points for consideration really at this stage are: would we get the same level of feedback if we were using another interviewing technique without the line-by-line focus that debriefing requires? And then also, could we obtain this information in another way?
Looking at our third category, which is migration-related changes in understanding, this is essentially the main aim of the cognitive debriefing process. Anyone who has had an eye on the figures at this point may have noticed we are left with 1% of all of our critical feedback relating to migration-related changes in understanding.
Let me show you some examples of the kind of things that we’re looking for in this category and the kind of thing that we’ve found. The first example here is an instance where some extra space in the layout caused a perceived disconnect between the lead question and the following item. So on the left here, this is the source paper version, and you can see that the lead phrase, so this “During the last four weeks, how often because of your endometriosis have you…” sits nice and closely above the first of the questions and so there is an obvious link there. Whereas in the eCOA version on the right, there’s a lot more space. This led one patient to actually overlook the lead phrase because of the distance, they just didn’t realize that they were connected.
I have mocked up this second example here just for clarity, although it does relate to a real-life piece of feedback that we had, where one patient had difficulty seeing the difference between similar items such as these when they were shown on separate screens. So these were very similar but just differentiated by the time frames and also the underlined qualifiers there. So I put those up just to illustrate how it’s quite easy to distinguish between them when they’re on the same page, but actually when you’re looking at them one item per screen you have to use memory a bit more.
Now I should point out that we may not necessarily recommend a change for all of these things when points of this type are raised. Again our focus is really just to find as much of that initial information as we can and then assess and make recommendations from there.
So wrapping up this initial section, we can see we get a high proportion of positive data. And then our critical data consists of large amounts of feedback on the source text, which we can’t actually implement in this process. Lots of useful feedback on eCOA-specific items, and then limited feedback on the target area which is changes in understanding due to migration. So we can see cognitive debriefing is capturing useful information but only fairly low levels of the target information.
How am I doing for time. I’m going to skip over a tiny bit just to keep us on track.
So some points for consideration are actually, the main thing is it’s quite difficult to assess the effectiveness here because it’s hard to know if there’s more data to be found. We’re getting low levels of targeted critical data, so that change in understanding between modes, but this is to be expected to some degree. So obviously eCOA versions are developed painstakingly following best practice guidelines that specifically aim to minimize that change in understanding. So effectively we’re looking for something that shouldn’t be there. Really what would be useful at this stage and what I think will be the next thing we look at is a more side-by-side comparison of different techniques together that will give us a better oversight.
Now we’ve seen the results of what we’re actually getting from debriefing, I’d like to just put them in context and just briefly touch on the practical considerations. So as has been mentioned before, earlier today, debriefing is highly time consuming. Just testing one or two measures can take up to three hours. And that can actually limit the time available for other types of techniques that you might want to use in the interview. We’ve already heard about patient burden, and of course it’s quite an intensive exercise, it’s a challenging task. And if you look at the reports you can start to see patients actually get a little bit tired towards the end if it’s a long interview, and they find it harder to interpret things and explain their interpretations. So it’s not really optimal for encouraging good engagement with remaining tasks. And then the volume of input required to get to the data that displays any issues, so our 319 points of data that we’ve been looking at, as I said, came from four and a half thousand original data points, boiling down to what I’m going to call useable critical feedback criticizing anything not related to the source text, which is just 2.5%. Now every single one of those four and a half thousand data points has been definitely useful and has contributed to our understanding, but I think in this context it’s worth reviewing the efficiency as well as the effectiveness of the process.
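As a rough check on the figures quoted above, the short sketch below reworks the arithmetic. It assumes the total is exactly 4,500 data points (the talk gives only "four and a half thousand") and that the rounded 64% source-text share applies to the 319 critical points; the variable names are illustrative only.

```python
# Figures quoted in the talk; the total is approximate ("four and a half thousand").
total_points = 4500        # assumed exact for this sketch
critical_points = 319      # critical or negative feedback
source_text_share = 0.64   # rounded share of critical feedback on the source text

# Share of all data points that were critical.
critical_share = critical_points / total_points

# Critical feedback left once source-text comments are set aside.
usable_points = critical_points * (1 - source_text_share)
usable_share = usable_points / total_points

print(f"critical share of all data points: {critical_share:.1%}")
print(f"usable critical points (approx.):  {usable_points:.0f}")
print(f"usable share of all data points:   {usable_share:.1%}")
```

Because the inputs are rounded, the result lands close to, rather than exactly on, the 7% and 2.5% quoted in the talk.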
In summary so far, we can say that standardized cognitive debriefing is an effective process in eCOA testing because it gives us a really high volume of data, and we get a significant amount of feedback on the instructions and usability which really helps us understand patients’ understanding. And as I mentioned I think more research is needed to judge the effectiveness against other methods. We can also argue from what we’ve seen that actually it’s not an efficient process in this context because of the high input of time and resources required. And it really limits the time available for other types of testing. So what we’re left with here is, how can we make this process more efficient? Do we need to actually focus on all the items, because it’s the number of items that makes it a long process? And could we create a more targeted approach?
One of the arguments for doing blanket debriefing of the whole measure in the first place is that we don’t know what parts of the text might throw out problems. So we look at every single line to make sure that we don’t miss anything, and that’s why it’s so long. Now I was really keen to find out would we actually need to do this. So I went back to the data and did a second round of analysis and that’s where we get some really useful information. Everything that I’ve shown you up until now has been based on the type of feedback we’ve had, but when you look instead at where that data comes from, which part of the text, you get a much better idea of how that process is working and where the strengths are.
On this chart I’ve broken down the measure into six key elements of the text. So we’ve got titles and headers, instruction lines, lead phrases by which I mean the “how often have you…” type of phrase, standard questions by which I mean anything that’s not had wording changed within it, response options, and adapted questions so anything where the actual wording or layout or something of the question has been changed. This shows you, in blue, the number of items that we actually tested; and then, in orange, the number of data points of critical or negative feedback that they generated. And as usual I’ve taken out anything that’s related to the source text.
So this shows us really the ratio of effort in to useful data out, and it really lets us zone in on the areas where debriefing is most effective. Looking at the top row, you can see we’ve tested 44 titles and headers, and got zero points of critical feedback back from that. And that’s fairly predictable, I think; as standalone items we wouldn’t really expect them to be affected by the context. The second line shows we’ve tested 132 instruction lines, and got back 89 useable data points. So comparatively it’s an area that generates lots of feedback, and it shows us it should be a point of focus for us. Similarly with lead phrases: you don’t get that many lead phrases, so there’s only 14 that have been tested there, and it’s given us, I think, yeah, it’s given us three, but that’s still a relatively large proportion, and it helps us pick out any issues there. So what’s interesting is when you get to the standard questions and response options. It goes right down. So we’ve tested, for the questions, 379 for 9, and for the response options, 84 for 4. So what it’s showing us is that the testing of the questions and of the response options is giving us lots of positive feedback, it’s showing the same understanding between paper and eCOA. Or in other words, these particular items are quite unlikely to be affected by the migration process. For adapted questions the ratio goes right up, and I think that’s fair enough: whichever method we’re using, if you have a question that’s actually been changed you really want to focus on that.
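The "effort in to useful data out" comparison can be sketched as a simple ratio of critical data points to items tested. The counts below are the ones quoted in the talk; adapted questions are left out because their counts are not stated, and the function and variable names are illustrative assumptions rather than part of CTI’s analysis.

```python
# (items tested, critical data points) per text element, as quoted in the talk.
# Adapted questions are omitted because the talk does not state their counts.
counts = {
    "titles and headers": (44, 0),
    "instruction lines": (132, 89),
    "lead phrases": (14, 3),
    "standard questions": (379, 9),
    "response options": (84, 4),
}


def yield_per_item(tested, critical):
    """Critical data points returned per item tested."""
    return critical / tested


yields = {name: yield_per_item(*pair) for name, pair in counts.items()}

# List the elements from highest to lowest yield.
for name, value in sorted(yields.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {value:.3f}")
```

On these numbers, instruction lines and lead phrases return far more critical feedback per item than standard questions or response options, which is the pattern described above.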
So the data suggests that the items that are most likely to be affected by migration as we’ve seen are those instruction lines, those lead phrases, those adapted questions. And I would propose, based on that, that we could reasonably limit standardized debriefing to target just those key items. We can see that certain items are unlikely to be affected by migration and that’s again those title headers, standard questions, and response options. And so this indicates to me that we could actually reduce or skip debriefing on those items with low risk of missing any useful data.
So I’m suggesting that instead of doing blanket debriefing on the whole measure, we instead just focus it on the parts of the measure where it’s really likely to be an effective tool. And what that does is, using a reduced process actually opens up more time for targeted questioning and other techniques to really explore those differences in understanding. And a reduced form of cognitive debriefing can actually be used alongside complementary methods, so either a comparative review of responses between paper and eCOA or a comparative review of any sections where the layout or format changes, for example. Essentially it gives us a lot more scope of what we can do with those interviews.
So to summarize, we’ve seen that CD is effective, and it’s most effective on items that are directly affected by migration. We’ve seen that blanket cognitive debriefing of the whole measure appears to be fairly inefficient because it needs a high input. And based on this, we would advise that a reduced form of cognitive debriefing, so focusing just on fewer items, would really maximize the benefits and make the most of cognitive debriefing while facilitating the use of complementary techniques.
Thank you so much for listening and please fire away if you have any questions.
Rebecca, thank you very much, that was really really interesting, and to me a new topic if I’m honest. I haven’t come across this topic in depth beforehand. So understanding the relevance of the data and how that data is used is an eye opener for me. So thank you for presenting it in such a way that makes it, for those of us that haven’t come across it, simple to understand. Thank you.
Any questions at all, please, for Rebecca.
[Q&A section starts at 26:25]
I have one. Thanks, that was a great presentation. I’m going to guess this might affect source data more, but I wondered if you had enough data to look at all at the difference between cognitive debriefings of more standardized settled instruments compared to new instruments, or sets of items that you might be testing that aren’t really part of a PRO, and if there was a difference in the amount of problems that you saw?
The measures that we’ve tested all tend to be quite well established PROs, actually.
Just a quick general question. You were saying that it’s quite burdensome on the subjects, it’s quite long. So was there any correlation between how far they got through the interview compared to how many useable items you got? So for example, at the start, do you get more feedback, useable items, than you did as it got towards the end of the interview where maybe they’re more tired or they get a bit bored with all the questions?
Well, where you really see the difference is in the actual paraphrasing itself. So we would just see in a few instances, because we would report everything back verbatim, that you actually have comments of people saying, I can’t quite think how to rephrase that but I understand what it means. What it doesn’t seem to affect is people spontaneously pointing out issues. So although we do see some decline in the actual paraphrasing side of things, I think people are still really good, they stick through it to the end even though it’s kind of a challenging process. And yes, so it’s not a significant enough decline to stop us getting good quality data. But if an interview gets really long we would break it down into two sections, perhaps over a couple of days, to try and limit that effect.
Maybe if I can just add, I really really love that presentation and not just because it kind of provides data to a lot of my preconceived notions. I have kind of a couple of comments on it. I think it’s really important to highlight the fact that 64% of the feedback was on the source text itself as well as something you touched on briefly around some sponsors having concerns around maybe getting that kind of feedback. But the fact you are receiving that feedback raising questions about the suitability of the questionnaires we’re using in the patient population and we’re just kind of choosing to ignore that somewhat. I don’t have the answer to that, I’m just raising that as that you’re seeing so much of that feedback in what are meant to be validated questionnaires, I think is—worrying is too strong a word but certainly should give us all pause for thought, I think.
Yeah, and the thing is that what I’ve got really is the figures and what we haven’t done is really look into whether you would consider that feedback valid. So for example in the same questionnaire you might see one person saying well five response options is not enough and another person saying that it’s too many to choose from. But I agree it’s something that doesn’t really get addressed. We kind of have all this data floating around, and most of the time we’ll try and include the developer in the process where at all possible so that even if nothing gets changed as a result of our process—because of course that would have implications for how it’s being used in a trial—that at least they have that for any following implementations and if they would make changes at a later stage.
So I did have one question. You touched on the fact that, within what you were referring to as the cognitive debriefing process, you tended to get a lot of feedback that we could probably classify as usability feedback. Do you actually still find it useful to distinguish between cognitive debriefing and usability, particularly considering that, at least in my experience, we tend to do usability in almost a cognitive debriefing kind of way, inasmuch as you’re getting people to think aloud as they’re using the device? I’ve long kind of struggled with whether that is a useful distinction that we’re making.
Yeah, I mean there is a lot of overlap. And the two processes, although I kind of mentioned three specific processes in the beginning, they really do merge, and that’s been the interesting thing that’s come out of breaking this down, is that we do more usability testing in the debriefing section than we really realized, and I think the way that gets generated is firstly, it’s done first. And secondly we look at the whole measure, so people are going through it and they’re picking up things and saying oh yeah when I got to this question I had this difficulty. And like I said it kind of often relates to navigation or to kind of trying to hit the end of the VAS properly. So yeah, there’s a lot of overlap and it’s useful to look at the data as a whole. But generally I would distinguish between what’s really looking at comprehension and what’s looking at the usability side.
I was just wondering, you know when you talked about selectively debriefing certain aspects rather than the whole measure, I just wondered if you got any information from a regulatory perspective in terms of their acceptance of that kind of method and not debriefing the whole instrument.
Yeah, sure, well actually not from a regulatory perspective but going back to the original ISPOR guidelines. So there was a set of 2009 ISPOR guidelines that mention specifically cognitive debriefing and it said in it, you know, it listed out the “what does this mean to you” phrasing. When you come to the more recent measures, the 2014 task force report, the mixed modes group, they’re actually just listing debriefing as one of three measures, so one of three ways of doing things. So our way of doing things is the only one that looks at the whole text. The other two were, firstly, establishing where you get differences in responses. So you test the paper and the eCOA, look at the differences in responses, and then probe on those and interview on those. And the other one was looking at the source paper and the eCOA version and looking for differences between those and then exploring based on that. So actually it would seem that it is fairly common to not necessarily be wed to the idea of having to go through every single line.
Any other questions at all? Okay, thank you once again, that was really great.
[END AT 33:30]