Presented by Michael Bass, Northwestern University's Feinberg School of Medicine
Thanks for inviting me. And I reiterate that technical perspective. So I’m not necessarily the developer of the instrument, I’m more of a technical implementer of the instrument. So a couple of keywords that I want to present before I begin. The system we’re talking about is PROMIS, the Patient-Reported Outcomes Measurement Information System. Just a show of hands, how many people have heard about PROMIS before? So okay, there’s a couple of people who have heard about that. What I want to really focus on with PROMIS is the scoring aspect of PROMIS. PROMIS is based on the concept of item response theory, and I’ll get into more detail about that. As well as scoring, PROMIS has a unique feature in that it’s an adaptive test. The participants taking the test will be receiving different items. That’s called computer adaptive testing. And then I’m going to start getting into the actual implementation, the software implementation. An API, application programming interface, is a common term in software development for exposing software to other software, so it’s one system communicating with another, and that’s the approach that we’ve taken to implement PROMIS in a software package so it can be used by other systems.
So my objectives for this talk are really to give everyone an overview of the PROMIS measurement system, highlighting the scoring and the administrative aspects of that that are based on item response theory and computer adaptive testing. Then I’ll go into a software workflow for administering PROMIS content. How does this computer adaptive test actually work while it’s running in front of a patient? And I’ll finish by showing a demo app that I’ve written that can be used as a prototype for how a possible solution for a bring your own device type of solution would work.
So the mission statement of PROMIS. It aims to provide clinicians and researchers access to efficient, precise, valid, responsive, adult- and child-reported measures of health and wellbeing. That’s the overall aim of PROMIS. And it offers a lot of very common domains and metrics across conditions which then makes it useful for a lot of comparisons across diseases.
So with that said, PROMIS is a symptom- or function-based measurement system, it’s not a disease- or condition-based system. It’s really focusing on symptoms like depression, anxiety, physical functioning, fatigue. So you won’t find items that are condition-based. You won’t find specific items in general about cancer, or about COPD, or about certain types of replacement, like knee replacement, or certain types of devices. It’s really focusing on the symptom, not the disease or condition. So we get a lot of questions such as, has PROMIS been validated for condition A or condition B. And sometimes it has, but in an attempt to address that, PROMIS has requested a drug development tool number from the FDA for physical functioning, fatigue, and pain, so that it could be used as a clinical outcome assessment. I won’t talk a lot about that because that’s really outside of my expertise. I know later today someone’s going to be talking more about the FDA and labeling, which I’m actually quite interested in. Where I usually use PROMIS is mostly in clinical settings or behavioral research, so the clinical trial field is actually kind of new to me.
As far as the coverage of PROMIS, it has a hierarchy of three different domains: physical health, mental health, and social health. Within each one there are core domains and there are subdomains. I’ll really focus above the line, on the core domains. Some of the other banks came in later. They’re just as valid as the other ones, but there’s a lot less work that’s been done on them because they were introduced later in the development process of PROMIS. In physical health, there’s physical functioning, a couple of different types of pain measurements, fatigue, and sleep. The core mental health domains include depression and anxiety, and social health is the ability to participate in social roles and activities. So that’s kind of the coverage of the PROMIS instruments. At the bottom, that’s the home website for PROMIS and a couple of other measurement systems that have been developed at Northwestern, and it has a lot more information about the scope and the different classifications of the instruments.
So one question is: how is PROMIS used in a clinical setting? There are usually two different approaches. It can be used as a screening. For example, the first publication is a review of the Lurie Cancer Center at Northwestern, where three days before an appointment people would receive an email or reminder to answer the assessment. It was a battery of emotional distress screenings, depression and anxiety, as well as, I think, fatigue and some other instruments. And based on a certain cut-score, if a person had received an abnormal result or a low result, they would get a referral to see a psychologist. That’s one approach that’s been used a lot in a clinical setting with PROMIS content. Another one is sort of a MID study over time. This one is a reference to a cancer study, but another group that uses PROMIS a lot for longitudinal studies is orthopedics, for replacements like a knee or hip replacement. The issue is not three days or one week after the replacement, when things kind of go wrong or problems occur. They want to be measuring how this knee, the pain and the physical functioning of this participant, is doing three to five years down the road. And that’s where the orthopedics group has really latched onto PROMIS. They’re one of the main users of the instruments.
So what does a PROMIS item look like? It has a seven-day recall period, followed by the focal point of the stem. In this case, this is an anxiety item, so “I feel uneasy.” And it’s based on a five-point Likert scale graded response model for the choice options. So as you can see, it’s very easy, there are not many issues related to user interface, user presentation, so it can be displayed on a mobile device very readily, compared to a lot of the other instruments, sort of like what we’ve seen. There are not as many issues that may come up as in some other instruments. I’ll also just add, because of the adaptive nature, you have to display one item on the screen at once because the next item really does depend on your response, so you can’t create even a grid-like type of presentation mode for PROMIS.
The way PROMIS is administered is there’s really kind of two options. You have a short form or static approach, where all the items in the instrument are administered to the participant. Or you have more of an item bank or a dynamic approach, where there’s an algorithm behind it, which picks what’s the next item. Regardless if you use the static short form or the item bank, the scoring is done the same way, the same approach is used for scoring. It’s based on the statistics that have been calculated, collected, and analyzed, and called the calibrations for these items. These calibrations are kind of what is the fundamental data for the statistics that are used for item response theory. We’ll be getting into that next.
So item response theory. Item response theory, you know, the objective is to measure a person’s latent trait or ability. And so what we’re really talking about here is, we want to be able to measure a person’s depression or anxiety for symptoms, or the physical functioning, the ability to function and get through the day. It uses an explicit model, a very probabilistic model, where there’s parameters or data for every possible response. And the model is dependent both on the statistical properties of the test, as well as the person taking the test. And with that, you get very nice characteristics that you can do based on that, which we’ll get into.
Before I go further, I wanted to just point out that the equation at the bottom is what we use. I won’t go into many details mathematically, but I will have a lot of graphical representations of what the equations are doing. But I just wanted to show the equation at least once. It’s a two-parameter logistic model. And what that means is that there are two aspects that are used to create the parameters for these items. So the a sub j, that is called the discrimination parameter, and that is really associated with the stem, “I feel uneasy.” What I’m getting at is that all items are not created equal. Since this is an anxiety item, we’re trying to measure anxiety, and we’re trying to measure it based on an item that says, “I feel uneasy.” Well, there are a lot of different items that measure anxiety, but some of them are better at it than others. Those have a higher discrimination level. And so that is one of the parameters that goes into this probabilistic model. The other parameter is the b sub j sub k. That is called a location parameter. What we’re trying to do is find the points along the response options where you have an equal probability to answer, you know, the thresholds between the never and the rarely, the rarely and the sometimes, the sometimes and the often, the often and the always. So how you interpret this graph is that on the x-axis we have the scale of anxiety. Negative four would be a person who has no anxiety whatsoever. On the other end of the spectrum we have a 3.8 or 4, a person who has a lot of anxiety. So if I presented this question to a person, “In the past seven days, I felt uneasy,” a person with no anxiety would have, if you look at the blue line, almost 100% probability, asymptotically approaching 100%, of endorsing never. 
As you increase a person’s anxiety, at a certain point, somewhere around -1.6, the probability of this person answering “never” starts to decrease, while the probability of them answering “rarely” starts to increase, until you hit a point where the red line crosses the blue line, and so that’s the point that I’m talking about, the inflection point. You continue that process and you see there is an inflection point between the red line and the green line, and that’s the point where a person at that level would have the same probability of answering “rarely” as “sometimes.” So this is how you read these characteristic curves for an item. And it’ll come into play a little bit later when we talk about computer adaptive testing.
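As a rough illustration of the curves just described, here is a minimal Python sketch of category probabilities under a graded response model. The discrimination value `A` and the thresholds `B` are made-up numbers chosen only for illustration, not real PROMIS calibrations, and the function name is hypothetical. In this formulation each threshold is the trait level at which a person has a 50% chance of responding in that category or higher.

```python
import math

def grm_category_probs(theta, a, thresholds):
    """Return the probability of each response category at trait level theta.

    a          -- discrimination parameter for the item
    thresholds -- ordered location parameters, one per boundary between categories
    """
    # Cumulative probability of answering in category k or higher.
    # By definition it is 1 for the lowest category and 0 past the highest.
    cum = [1.0]
    cum += [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in thresholds]
    cum += [0.0]
    # Each category probability is the difference of adjacent cumulative curves.
    return [cum[k] - cum[k + 1] for k in range(len(thresholds) + 1)]

# Hypothetical parameters for an item like "I feel uneasy" (illustration only).
A = 2.0
B = [-1.0, 0.0, 1.0, 2.0]
LABELS = ["never", "rarely", "sometimes", "often", "always"]

for theta in (-4.0, 0.0, 4.0):
    probs = grm_category_probs(theta, A, B)
    print(theta, dict(zip(LABELS, (round(p, 3) for p in probs))))
```

With these assumed numbers, a person at -4 is almost certain to answer “never,” and a person at +4 is very likely to answer “always,” which matches the shape of the curves on the slide.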
So to set the stage for that, let’s take the example of physical functioning. We wanted to create a bank that measures physical functioning, but we also wanted to include a wide range of people. We wanted to be able to measure people who cannot get out of bed, who have low physical functioning, up to people who are athletes, who could run. These people all reside somewhere along the physical functioning spectrum. So what we wanted to do is create a bank of items that have been calibrated and that lie along this continuum. By doing that, we’re able to ask questions that are more appropriate to the individuals. We can ask people on the lower end of the spectrum questions like, are you able to get in and out of bed, are you able to stand. On the high end of the spectrum, we can ask people questions like, can you run five miles, can you jog two miles.
Before I do an example of a computer adaptive test, I want to make a distinction between the static and the dynamic administration. I mentioned we want to measure physical functioning across the whole spectrum. Let’s say that we had, for example, a short form that has ten items. We would have to put those items roughly equally spaced along the continuum in order to be able to measure people on the low end of the spectrum as well as the high end of the spectrum. With that, what you get is a wide-range instrument, but you have low precision because you’re asking people questions that may not be appropriate to them. In a computer adaptive test, we have 125 items to choose from, so we have many items throughout the spectrum. So what we can do is handpick those items that are more appropriate to those individuals in order to get a higher precision score.
So this leads us to computer adaptive testing. It’s based on a large set of items. We pick the first item and then we estimate the score. After that, the CAT selects the next item, which is targeted to this person, and then we recalculate that score. We continue this until we hit a stopping rule. And then we end the test. So it’s really not simple branching, where it’s like a go-to-if statement, it really is statistical calculations that determine the workflow.
So as an illustration, graphically this is—I’m going to use depression as the example. So imagine I wanted to give the depression bank. Because I don’t know anything about the person, I assume there’s a normal distribution, the person is centered in the beginning. I’ll present an item, “I feel depressed.” The person responds. They respond “rarely.” So going back to the previous slide, we see the highlighted area of the response, that’s the distribution of the response. We take that response, we merge it with the previous knowledge, and we end up with a new estimation of their trait. Initially, they were centered at 0, now we see by answering the first choice, they moved a little bit to the right, and we notice the confidence interval shrunk from like a +/- 1 to .4, so we’re getting more confidence in our answer. We continue the process. So based on that, we present another question, and the person responds. Based on the response, we update their probability or their distribution. Here, they answer “never,” so we shift the curve a little bit to the left. And then the standard—anyway, you can see that it’s starting to shrink. Our confidence is growing in our answer, we’re at .35 right now. We continue the process until we reach a point where we think we have enough confidence. PROMIS uses a .3 standard error as their confidence interval. Once we reach that, the test ends. We have enough confidence in the estimate that we’ve provided that we can stop the test. So after four items, we’re able to provide an estimate of a person’s depression. So the answers, that person’s estimate is a -.24 with a +/- .28.
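The loop just walked through can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the PROMIS implementation: the item bank `BANK` is invented, the score is a posterior mean computed on a grid with a standard normal starting distribution, and the next item is chosen by maximum Fisher information at the current estimate, which is one common selection rule and is assumed here rather than taken from the talk. The 0.3 stopping value is the one mentioned above.

```python
import math

GRID = [i / 10.0 for i in range(-40, 41)]  # trait values from -4 to 4

def cumulative(theta, a, bs):
    # Probability of responding in category k or higher, padded with 1 and 0.
    return [1.0] + [1.0 / (1.0 + math.exp(-a * (theta - b))) for b in bs] + [0.0]

def category_probs(theta, a, bs):
    cum = cumulative(theta, a, bs)
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

def item_information(theta, a, bs):
    # Fisher information of a graded response item at theta.
    cum = cumulative(theta, a, bs)
    info = 0.0
    for k in range(len(bs) + 1):
        p = cum[k] - cum[k + 1]
        dp = a * (cum[k] * (1 - cum[k]) - cum[k + 1] * (1 - cum[k + 1]))
        if p > 1e-12:
            info += dp * dp / p
    return info

def posterior_summary(post):
    # Mean and standard deviation of an unnormalized posterior on GRID.
    total = sum(post)
    mean = sum(t * p for t, p in zip(GRID, post)) / total
    var = sum((t - mean) ** 2 * p for t, p in zip(GRID, post)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer, se_stop=0.3, max_items=12):
    """Administer items until the standard error drops below se_stop."""
    post = [math.exp(-0.5 * t * t) for t in GRID]  # start from a normal curve
    theta, se = posterior_summary(post)
    remaining = dict(bank)
    asked = []
    while remaining and len(asked) < max_items:
        # Pick the unused item that is most informative at the current estimate.
        name = max(remaining, key=lambda n: item_information(theta, *remaining[n]))
        a, bs = remaining.pop(name)
        k = answer(name)  # index of the chosen response option
        # Merge the response with what we already knew about the person.
        post = [p * category_probs(t, a, bs)[k] for t, p in zip(GRID, post)]
        theta, se = posterior_summary(post)
        asked.append(name)
        if se < se_stop:
            break
    return theta, se, asked

# Hypothetical bank: name -> (discrimination, thresholds). Not real calibrations.
BANK = {
    "item_1": (2.5, [-0.5, 0.3, 1.0, 1.8]),
    "item_2": (3.0, [-1.0, -0.2, 0.6, 1.4]),
    "item_3": (2.0, [0.0, 0.8, 1.6, 2.4]),
    "item_4": (2.8, [-1.5, -0.7, 0.1, 0.9]),
    "item_5": (2.2, [-0.2, 0.5, 1.2, 2.0]),
    "item_6": (3.2, [-0.8, 0.0, 0.8, 1.6]),
}

# Simulate a participant who always picks the second option ("rarely").
theta, se, asked = run_cat(BANK, lambda name: 1)
print(round(theta, 2), round(se, 2), asked)
```

The point of the sketch is the shape of the loop: select, ask, update, check the stopping rule. How many items it takes depends entirely on the assumed bank and responses.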
So with those numbers, what we want to do is give a person an interpretation on a 0-100 scale. IRT is based on a logistic metric, so it’s the log odds of endorsing a response versus not endorsing it. In order to transform this to a T-score, we multiply the score estimate by ten and add 50 to center it at 50, so that we have a range of 0-100. So the final interpretive score is 47.6 +/- 2.8.
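The arithmetic for that conversion is small enough to show directly. This is a minimal sketch; the function name `to_t_score` is hypothetical, and the input values are the estimate and standard error from the depression example in the talk.

```python
def to_t_score(theta, se):
    """Convert an IRT estimate and its standard error to the T-score metric."""
    # Multiply by ten and center at 50; the error only gets the scale factor.
    return 50.0 + 10.0 * theta, 10.0 * se

t_score, t_se = to_t_score(-0.24, 0.28)
print(round(t_score, 1), round(t_se, 1))  # 47.6 2.8
```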
Now when dealing with PROMIS there are two things to remember. One is a symptom-based score as opposed to functioning. So when measuring symptoms, high symptoms are bad—high depression, high anxiety. So for symptoms, low numbers are better. When dealing with physical functioning, it’s just the reverse. You want someone who has high physical functioning. So high is better than low. So that’s one thing, when interpreting or presenting results, just to keep track of, whether you’re providing symptoms or functioning.
I’ll start now talking about the software implementation, how this system works within the software. Initially, during the development of PROMIS, we had to develop a system to administer the content as well as manage the studies that were run using it. So we created a system called Assessment Center. That was our first implementation of computer adaptive testing within the research community. It worked, but the one problem is, it wasn’t very portable. It wouldn’t necessarily work in a healthcare setting because it was a web-based system and, you know, for regulatory reasons, you couldn’t just have patients leave the patient portal of an EHR and go directly to our website. So what we did four or five years ago is try to package what we were interested in distributing, the algorithms plus the content, in a way that other systems could embed into their products, so that they could then start administering PROMIS content to their users. And so that’s what we’re going to be talking about.
There were four areas that we needed to address in order for it to be properly embedded into another system. The end user, or the consumer of this API, needed to be able to ask what instruments were available. They needed to know how to order these instruments. They needed to know how to administer the assessment: how to ask for the next question so they could present it to the user, get the response, and then put it back into the algorithm to calculate the next item. And then they also needed to get to the score.
So here is kind of a workflow, more from a clinical setting, not necessarily a clinical trial setting, of how we’ve implemented the software. I mentioned previously that we had implemented this at the Lurie Cancer Center, and the idea of how this is done is that, number one, a person would order an emotional health panel or assessment. Now this could be an actual person, or this could be an automated process. And they would order it through this sort of clinical interface. Just because it’s pictured as a server, that doesn’t mean it has to be one. This whole area in blue represents software boundaries as a model, and all of this could actually be running on a mobile device. I’m just trying to segregate the functionality of the software into areas. So an order would take place in the clinical interface and it would be sent to the API. Once the order was received by the API, the end user, jumping over to the mobile phone, would have to know to log in to start taking the assessment. In this case, a reminder was sent to their email for them to log into the EHR portal to take the assessment. Once they did that, the patient interface would communicate with the API for the administration, to show questions and to get responses. This would be going back and forth between the API and the patient interface. At a certain point the API would say, I have enough information, I’m done with the assessment. It would send a signal back to the patient portal, which would then tell the clinician that this person is done and that they could then ask the API for the results. So this, again, the one thing that I want to mention is, this is kind of a workflow, it could still be implemented as a clinical trial. 
Right now it’s just that middle piece which is the whole encapsulation of the PROMIS content that I’m trying to focus in on, and what type of entry points you would need in order to implement this into a technical solution.
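To make those four entry points concrete, here is a small in-memory stand-in written in Python. To be clear about the assumptions: the class `MockAssessmentService`, its method names, and the item stems are all invented for illustration and are not the actual PROMIS API. The mock also hands out items in a fixed order and returns raw responses, where a real service would run the adaptive algorithm and return a score.

```python
import uuid

class MockAssessmentService:
    """Hypothetical stand-in exposing the four entry points described above."""

    def __init__(self, instruments):
        self._instruments = instruments  # instrument name -> list of item stems
        self._sessions = {}              # order token -> session state

    def list_instruments(self):
        # Entry point 1: which instruments are available?
        return sorted(self._instruments)

    def order(self, instrument):
        # Entry point 2: order an instrument and get back a token for it.
        token = uuid.uuid4().hex
        self._sessions[token] = {
            "items": list(self._instruments[instrument]),
            "responses": [],
        }
        return token

    def next_item(self, token):
        # Entry point 3a: ask for the next question, or None when finished.
        session = self._sessions[token]
        index = len(session["responses"])
        if index < len(session["items"]):
            return session["items"][index]
        return None

    def submit(self, token, response):
        # Entry point 3b: hand the participant's response back to the service.
        self._sessions[token]["responses"].append(response)

    def results(self, token):
        # Entry point 4: fetch the outcome once the assessment is done.
        session = self._sessions[token]
        return {
            "complete": len(session["responses"]) == len(session["items"]),
            "responses": list(session["responses"]),
        }

def administer(service, token, ask):
    """Patient-interface loop: show a question, collect a response, repeat."""
    item = service.next_item(token)
    while item is not None:
        service.submit(token, ask(item))
        item = service.next_item(token)
    return service.results(token)

# Example run with made-up stems and a participant who always picks option 2.
service = MockAssessmentService({"Depression": ["I felt depressed", "I felt hopeless"]})
token = service.order(service.list_instruments()[0])
print(administer(service, token, lambda stem: 2))
```

The back-and-forth inside `administer` is the part that corresponds to the patient interface talking to the API; the clinical interface would only call `order` and `results`.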
I’ll finish by giving a sort of a bring your own device type of sample of how this could possibly look. So this was a product that I created as a prototype. And I had some sort of system requirements that I needed to make this happen. I wanted the content to be authored outside of the mobile device, so I had to communicate with some server tool that could author or create content. It needed a place to store data, eventually the data would have to get off the device onto the server. It had to administer PROMIS content, of course. And more from a patient perspective, maybe not necessarily suited for a clinical trial, but I wanted to be able to have the device give patients feedback, and the patient would have some control over the data flow. So those were sort of requirements maybe more for like a behavioral research study, maybe not necessarily applicable for a clinical trial.
So then I needed a back-end solution for the authoring as well as the data storage. So I chose REDCap because it’s a system that I happen to be familiar with for other projects, and it happens to use the PROMIS API to administer PROMIS content natively. So I knew that it could administer PROMIS content.
So what I did is, I created a study in REDCap, and it was emotional distress screening where I added depression/anxiety instruments to that. And that was my protocol, be able to administer depression/anxiety to a participant on a phone. Once I had the study created, I needed a way to grab that protocol that existed someplace on the server and then download it to a mobile phone. So my approach was to use the camera on a smartphone to read QR codes. So I created another website where a person can enter the pertinent information about the study they have been creating—what is the name, what is some text the patient would see, and what is any—where is the location of your data, of the REDCap instance, as well as the token, the security token needed to get that. Once you do that, then you can click “generate a QR code” and you will get the QR code. At that point, as a researcher you can either print those out and put them on the wall, you could send them to individual participants, you can distribute however you want. The participant would then download the app onto their phone, scan the QR code, and then the mobile phone would communicate with the server to download the protocol and the instruments so then everything at this point in time could be administered over the phone.
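One way to picture what goes into such a QR code is a small text payload carrying the fields just listed. The sketch below is an assumption, not the format the prototype actually used: the field names, the function names, and the example URL and token are all hypothetical. It only builds and validates the text; turning that text into a QR image would need a separate library and is not shown.

```python
import json

# Hypothetical field names for the information the researcher enters.
REQUIRED_FIELDS = ("study_name", "patient_text", "redcap_url", "api_token")

def build_payload(study_name, patient_text, redcap_url, api_token):
    """Serialize the study details into compact JSON text for a QR code."""
    payload = {
        "study_name": study_name,
        "patient_text": patient_text,
        "redcap_url": redcap_url,
        "api_token": api_token,
    }
    return json.dumps(payload, separators=(",", ":"))

def parse_payload(text):
    """Decode the scanned text on the phone and check nothing is missing."""
    data = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if not data.get(field)]
    if missing:
        raise ValueError("QR payload is missing: " + ", ".join(missing))
    return data

# Placeholder values only; not a real server or token.
scanned = build_payload(
    "Emotional distress screening",
    "Please complete the depression and anxiety questions.",
    "https://redcap.example.org/api/",
    "EXAMPLE-TOKEN",
)
print(parse_payload(scanned)["study_name"])
```

After parsing, the app would use the server location and token to download the protocol and instruments, as described above.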
The other functionality that I wanted to enable in the system was to allow reminders for when a person should take the assessment. I wanted the user to have some control over the privacy of their data as well as whether they even wanted to send it—again, maybe not applicable for a clinical trial study, but sort of a feature that could or could not be installed into an app. Ability to remove or update the protocol in case it changed or they wanted to get that data off their phone, and ability to review the results. So under the setting button, you can set these type of properties and change those. To begin the assessment you would click the “start assessment” button. And here are some examples of what the PROMIS content looks like on a phone. So it’s a very standard five-point Likert scale, not many user interface challenges.
At the end of the assessment, a person can then click a button to review their results and see the longitudinal data that they’ve collected on the phone. So the purpose of the app is really a proof of concept of what could be done with a mobile phone, collecting PROMIS data offline.
So at that I’ll end. And thank you very much for listening, and I’ll entertain any questions.
[END AT 28:25]