## Clinician Comments

I think this is a really solid first effort at setting a reasonable
benchmark -- and the ACORN page on the subject is well constructed.
My first thought is that we will need to come up with a better definition for "good enough". Using 0.6 may be exactly right, but we will need to support it (or something close to it) with data -- or at least with a well thought out argument and acknowledgement that this may change as the project evolves.

I also think we will need to debate and justify the 50% return rate.
Perhaps a review of the clinical trial data (to assess the typical drop out rates for studies) and a poll of ACORN users (to compare the
proportion of clients that only have one session versus the number of
clients that have multiple sessions but do not have two points of
measurement) would be a useful place to start.

I think that most of our work group activity can be coordinated and
completed using the ACORN site, but it may be useful to start with a
conference call. If people agree, I'll throw out some tentative dates
and times to get things scheduled.

--

ScottWilliams - 24 Oct 2007

0.6 - "good enough": Tak is familiar with the literature on effect size benchmarks. I will talk to him about helping us document the rationale with references. We can pull almost everything we need from his articles on benchmarking and the lit review he did for those.

50% return rate: I don't think we can look to clinical trials for help in setting a data collection standard for real-world data. Part of the problem is that some sites choose not to administer the questionnaire at every session. For example, if you are administering the questionnaires at every 4th session and your median length of treatment is 4 sessions, then it becomes almost impossible to get a second data point on 50% of cases, because you would have to be almost perfect. In this case, the site might have to collect data at the first and third sessions to make 50%. Other sites that use the questionnaire at every session have no problem reaching the 50% threshold. I will do some analyses to see how many clinicians meet that threshold now. When I have looked at large data sets involving several thousand clinicians, I have found that about 50% meet the 50% threshold. That's why I suggested this as a starting point, though I hope in the future we can move it upwards.

--

JebBrown - 24 Oct 2007

Hi Scott and Jeb-
I have a few thoughts about this. First, since this is a pretty simple formula being suggested (no severity adjustment), I think it's very important to have as level a playing field as possible. A couple of issues are that first-episode clients are likely to start with lower intake scores than return-episode clients. A fairer estimate of effectiveness wouldn't penalize therapists who have a higher proportion of returnees, and reward therapists whose clients don't return. So because of that, I suggest that therapists should code for "first-episode" or "return" client, based on some fixed (8-week, 90-day, etc.) gap to define separate treatment episodes. And only first-episode clients should be counted toward the ES.
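
A gap-based episode rule like the one Jason proposes is straightforward to implement. The sketch below is illustrative only - the 90-day gap, the function name, and the data shape are my own choices, not an ACORN specification:

```python
from datetime import date

def assign_episodes(session_dates, gap_days=90):
    """Number each session's treatment episode: a new episode starts
    whenever the gap since the previous session exceeds gap_days.
    session_dates must be sorted in ascending order."""
    episodes = []
    current = 1
    previous = None
    for d in session_dates:
        if previous is not None and (d - previous).days > gap_days:
            current += 1
        episodes.append(current)
        previous = d
    return episodes

# A 4.5-month gap between the second and third session starts episode 2:
sessions = [date(2007, 1, 5), date(2007, 1, 19), date(2007, 6, 1)]
assign_episodes(sessions)  # -> [1, 1, 2]
```

Under Jason's proposal, only clients whose episode number is 1 would then count toward the ES.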

I also don't think that the minimums are rigorous enough. 50% of clients leaves an awful lot of room for cherry-picking clients to improve outcomes, and biases upward for clients who stay longer in the example Jeb gives. I suggest a more stringent minimum data-collection requirement of 70% of first-episode clients who have >1 session. My collection rate is 95%, so I don't think 70% is too much of a stretch. In terms of your comment, Jeb, about some centers only collecting the data every 4th session (which will drop their percentage lower than 50% because of the number of very-short-term clients), I don't think that's an argument for lowering the threshold. Rather, I think it's an argument that such a practice will not yield accurate outcome data about a clinician or a clinic!

Also, if only first-episode clients are used (to allow a fairer assessment), then I think that the ES category levels may be set a bit too low, and I would also prefer a third level, since there are a number of therapists who will score substantially higher than 0.8, and these high-quality therapists would be of great interest to everyone. So, if 0.5 = "effective," and 0.8 = "highly effective," then I'd add 1.2 = "exceptional." But if calculating the ES based on first-episode clients only, I might bump each of those up by 0.1.

Lastly, I think that the minimum N of 15 for an individual therapist is way too low. My modeling shows **huge** fluctuations in ES up until about 25-30 cases when randomly ordering the first 60 completed cases. So my vote would be for a minimum of 25, but definitely no lower than 20.

I know that all of this may make it a bit more difficult for practitioners to reach these levels, but this is definitely not out of reach for the majority, and the certification ought to be reasonably rigorous--especially when one of the big factors (case mix) won't be known.

--

JasonSeidel - 29 Nov 2007

Hi Jason! Thanks for your input.

I agree with your thoughts on case mix adjustment. Tak is working on the document to cover the technical details of how to calculate a "severity adjusted effect size" for purposes of ACORN reports which will shortly be circulated among work group members. You can view an abbreviated discussion of the methodology by visiting the topics:

ACEOutcomesEvaluationMethod and SeverityAdjustedEffectSize.

With regard to the minimum sample size per clinician, this threshold is in large part a function of the reliability of the outcome measure. In practice this means that longer measures yield a more precise measurement of the therapist. Bruce Wampold and I did a series of analyses using PBH data to determine the minimum sample size for the Honors for Outcomes program. In the end, we settled on 10, though my first impulse was to hold out for 15.

My impression in working with data from ultra brief measures of 4 or 5 items is that you do need larger sample sizes to tease out the signal from the noise when you are trying to detect therapist effects. However, since the ACORN measures are longer (at least 10 items, and usually 15-20 items) I'm expecting that the minimum sample size of 15 is pretty accurate. We might want to consider a higher minimum for briefer questionnaires, depending on the reliability of the measure. For example, in the data I've worked with, outcome measures with 4 or 5 items have a reliability of approximately .75 rather than the estimated average of .9 for the ACORN measures.

In any event, this is an empirical question to be evaluated as we gain data on the ACORN measures. A number of large organizations besides Regence are already using these measures or are likely to do so within the next year. I can't give specifics, but the fact that the ACORN measures are in the public domain and part of a collaborative effort to build a normative database has obvious appeal to organizations that are interested in sophisticated measurement strategies that do not depend on the use of "Questionnaires". For more on this topic see ItemsOrQuestionnaires.

--

JebBrown - 30 Nov 2007

Hi Jeb-
Re: minimum sample size needed per clinician to have a reliable ES.
I may be missing something, but it seems to me that reliability (either test-retest or internal consistency) and the number of items in the measure aren't really that relevant for establishing the minimum sample size. I'll try to show why and then see what your critique of my logic/methods looks like. These outcome scales are sensitive to change, and individual clients vary substantially in their change scores (termination - intake). So for a given scale (whether 4 items or 45 items) we can expect a lot of variation in how much each client changes from pre-test to post-test, and that fluctuation will be driven much more by the treatment effect than by the reliability of the measure. That means that if a clinician has 10 clients, and then another 10 clients, the client-to-client variability in the amount and direction of change makes it likely that the mean standardized change for the first 10 clients will differ substantially from that for the second 10.

Let me tell you how I ran my little test for this. I took the change scores for my first 30 completed clients, randomly ordered them 15 different times, and ran a cumulative ES with cumulative SDs (i.e., 1st and 2nd clients; 1st, 2nd, and 3rd; etc.); so this yielded 15 different patterns of change scores, with increasingly stable SDs and ESs as each additional client score was added within each iteration. The range of ESs for the 15 iterations of the first 10 clients was 0.95-3.00. The range for the first 15 clients was 0.88-2.75. The range for the first 20 clients was 1.22-2.45. The range for 25 clients was 1.37-1.87. At 30 clients it was 1.6-1.7. Many moons later, a more reliable measure of my ES turns out to be about 1.5.
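
Jason's resampling check can be reproduced in a few lines. This is a sketch under my own assumptions (the effect size is computed as the mean change divided by the SD of the change scores within each reordered subsample, which appears to match the procedure described above):

```python
import random
import statistics

def cumulative_es_range(change_scores, n, iterations=15, seed=0):
    """Shuffle the change scores repeatedly, compute the effect size
    of the first n scores in each ordering, and return the (min, max)
    of those effect sizes -- i.e., how unstable an ES based on only
    n cases can be."""
    rng = random.Random(seed)
    scores = list(change_scores)
    estimates = []
    for _ in range(iterations):
        rng.shuffle(scores)
        subset = scores[:n]
        estimates.append(statistics.mean(subset) / statistics.stdev(subset))
    return min(estimates), max(estimates)
```

With the full sample the range collapses to a single value, and the spread at n = 10 is typically far wider than at n = 25, which is the pattern reported above.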

I ran the same thing on another clinician. At 10 clients: ES=0.5-1.27. At 15 clients: 0.49-1.12. At 20 clients: 0.61-0.94. So you can see how big a difference it makes for the ORS (at least in this small sample of clinicians) to have at least 20 clients upon which to establish a reasonably reliable ES (give or take 0.30 to 1.20!!!). At 10 to 15 clients, the obtained ES could be significantly off-base. At 30 clients it's pretty tight.

Just for the heck of it, I just flattened all the SDs so they're all the same constant for every calculation. Reliability still looks terrible at N=15 in my sample (ES range is 1.01-2.10), better at 20 (1.32-2.01) and substantially better at 25 (1.44-1.81).

Whaddya think?

--

JasonSeidel - 30 Nov 2007

Hi Jason,

We are deep "into the weeds" of measurement theory. Hopefully someone more knowledgeable and articulate than myself will weigh in.

The reliability of the outcome measure does make a difference in measuring therapist effects and evaluating clinician outcomes. The lower the reliability, the more error you have. The higher the error, the larger the sample size you need to detect significant differences. This is why I always advise against the use of ultra brief measures if the ultimate goal is also to give therapists a means of demonstrating their effectiveness convincingly to skeptical consumers (employers, health plans, etc.).

The reliability also makes a difference in the % of cases that you can report as improved. If we take the ACORN measures as an example, I can illustrate this point. The 20 item version has a tested coefficient alpha of .92 in a clinical sample (.94 in a community sample). A ten item version has a coefficient alpha of .88 in a clinical sample (.9 in the community sample).

Both the 10 item version and the 20 item version will produce identical effect sizes, with a very similar distribution. However, if we use the coefficient alpha from the clinical sample to calculate the Reliable Change Index, the RCI for the 10 item version is a .96 effect size, while the RCI for the 20 item version is a .78 effect size (assuming a standard deviation of .52). This means that if you use the RCI criteria for rating patients as improved, a higher percentage will be rated improved on the 20 item version than on the 10 item version, even though the effect sizes are the same. If the ACORN measures were 5 items, the reliability would be in the .75-.8 range (comparable to results I see with instruments that resemble the ORS). At 5 items, the RCI for the ACORN measures is in the range of a 1.3-1.4 effect size. The topic ACEOutcomesEvaluationMethod has a link to an Excel workbook with tools for calculating the RCI and SEM from the standard deviation and coefficient alpha of the outcome measure. You can play around with the tools and see what I'm talking about.
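
For reference, the RCI arithmetic Jeb describes works out as follows: expressed in standard-deviation (effect size) units, the Jacobson-Truax index reduces to 1.96 × √2 × √(1 − alpha), so the raw-score SD drops out. A minimal sketch:

```python
import math

def rci_in_sd_units(alpha, z=1.96):
    """Reliable Change Index expressed in effect-size units, using
    coefficient alpha as the reliability estimate:
    z * sqrt(2) * sqrt(1 - alpha)."""
    return z * math.sqrt(2 * (1 - alpha))

round(rci_in_sd_units(0.88), 2)  # -> 0.96, the 10-item figure above
round(rci_in_sd_units(0.92), 2)  # -> 0.78, the 20-item figure above
```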

The AcornOutcomeQuestionnairesManual contains a discussion of the relationship between item count and reliability and calculates the standard error of measurement and RCI based on item count.
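
As a rough illustration of that item-count/reliability relationship, the Spearman-Brown prophecy formula predicts how alpha changes when a scale is shortened. (This is the standard textbook formula; whether the manual uses exactly this method is an assumption on my part.)

```python
def spearman_brown(alpha, length_ratio):
    """Predicted reliability after changing test length by
    length_ratio = (new item count) / (old item count).
    Standard Spearman-Brown prophecy formula."""
    return (length_ratio * alpha) / (1 + (length_ratio - 1) * alpha)

# Starting from a 20-item scale with alpha = .92:
round(spearman_brown(0.92, 10 / 20), 2)  # -> 0.85 predicted for 10 items
round(spearman_brown(0.92, 5 / 20), 2)   # -> 0.74 predicted for 5 items
```

These predictions are broadly consistent with the observed .88 for the 10 item version and with the .75-.8 range Jeb reports for ultra brief measures.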

I discourage using the reliability estimates published in questionnaire manuals unless the sample sizes are large and the information detailed (nature of the sample, method for calculating the coefficient alpha, etc.). For example, some statisticians will calculate the coefficient alpha using all questionnaires available, even when you have multiple assessments per patient. I used to do this also until I realized that it had the effect of artificially elevating the coefficient alpha. The effect is not apparent in measures with 30 or more items, but becomes highly apparent in measures with fewer than 10 items. When I began to calculate the coefficient alpha using only one record per patient, the reliability dropped noticeably for the 4 item measures I was looking at.
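
The one-record-per-patient point can be made concrete. A minimal sketch, with a hypothetical data layout of my own (one `(patient_id, item_scores)` tuple per administration, in chronological order):

```python
import statistics

def first_record_per_patient(records):
    """Keep only each patient's first assessment so that repeated
    administrations do not inflate alpha. records: a chronologically
    ordered list of (patient_id, item_scores) tuples."""
    seen, kept = set(), []
    for pid, scores in records:
        if pid not in seen:
            seen.add(pid)
            kept.append(scores)
    return kept

def cronbach_alpha(rows):
    """Coefficient alpha: k/(k-1) * (1 - sum of item variances /
    variance of total scores). rows: one list of item scores per
    respondent."""
    k = len(rows[0])
    item_vars = [statistics.variance([r[i] for r in rows]) for i in range(k)]
    total_var = statistics.variance([sum(r) for r in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)
```

Computing alpha on `first_record_per_patient(records)` rather than on all records is the correction Jeb describes.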

I always encourage my client organizations to evaluate the psychometric properties of an outcome questionnaire using their own data. One of the advantages of the ACORN measures is that nothing is assumed to be true just because it is in a manual. The measures are constantly being evaluated and modified as necessary by the various organizations using the questionnaires.

As a practical matter for reporting effect sizes for ACORN questionnaires using the ACE criteria, I suggested assuming an average reliability of .91 for the adult measures, and then calculating the percentage improved, worsened, etc. based on the SEM and RCI thresholds of change. These work out to .3 and .83 effect size, respectively.
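
The .3 and .83 figures follow directly from alpha = .91: SEM = √(1 − alpha) and RCI = 1.96 × √2 × √(1 − alpha), both in SD units. A sketch (the classification function and its labels are my own illustration, not the ACE specification):

```python
import math

def change_thresholds(alpha, z=1.96):
    """(SEM, RCI) in effect-size units for a given coefficient alpha."""
    sem = math.sqrt(1 - alpha)
    rci = z * math.sqrt(2 * (1 - alpha))
    return sem, rci

def rate_change(change_es, alpha=0.91):
    """Rate a standardized pre-post change against the RCI
    (positive change = improvement)."""
    _, rci = change_thresholds(alpha)
    if change_es >= rci:
        return "improved"
    if change_es <= -rci:
        return "worsened"
    return "no reliable change"

change_thresholds(0.91)  # -> roughly (0.30, 0.83)
```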

I applaud your efforts to understand the stability of your effect sizes using your own data. However, it isn't possible to estimate the predictive value of the effect size for a single therapist using cases for that therapist only. There will always appear to be a lot of random variation around the mean for that therapist.

If you have access to a sample of multiple clinicians, each of whom has multiple cases, you can use hierarchical linear modeling to estimate the % of variance due to the therapist. There is another way to approach the problem using a simple cross validation methodology. In this case, you divide each therapist's sample into two separate groups, depending on when the cases started treatment. Look at the correlations between outcomes for the two different samples for the same therapists.

Imagine now that you have two samples of 10 cases each for every therapist. You can then run the correlations between samples for the same therapist and see if the correlation is statistically significant. How much variance is explained?

Now imagine running the same experiment again - except this time you use two different outcome measures. One is a 4 item measure with a reliability of .75 and the other a 30 item measure with a reliability of .93. Now when you run the correlations, you will probably find that the correlations between the two samples will be higher with the 30 item measure than with the 4 item measure. Again, this is a result of measurement error, which is directly related to the reliability of the outcome measure. You might need 20 cases with the 4 item measure to reliably identify above average therapists, while 10 would be sufficient with the 30 item measure.
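
The cross-validation idea above can be sketched as follows (the data shape is hypothetical, and the Pearson correlation is computed by hand to keep the sketch self-contained):

```python
import math
import statistics

def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def split_half_therapist_correlation(outcomes_by_therapist):
    """Split each therapist's cases (in start-date order) into two
    halves and correlate the per-therapist mean outcomes of the
    earlier and later halves."""
    first, second = [], []
    for scores in outcomes_by_therapist.values():
        half = len(scores) // 2
        first.append(statistics.mean(scores[:half]))
        second.append(statistics.mean(scores[half:]))
    return pearson(first, second)
```

The square of that correlation estimates how much of the between-therapist variation in one half is predictable from the other; with a less reliable measure the same true therapist effect yields a lower observed correlation, which is Jeb's point.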

The reliability of clinicians' "effect sizes" will always be low if what you mean by reliability is a very high correlation between effect sizes at different points in time. Even with 30 cases, the correlations aren't very high: around .5 or so. The real question is whether this is better than chance, and the answer is yes. The question we need to ask is whether a clinician with a sample of 15 cases and a mean severity adjusted effect size above .5 at one point in time has a better than 50% chance of having an effect size of .5 or higher with cases in the future. If 15 cases is enough to make a prediction that is better than chance, then the interests of patients and effective therapists are best served by making this information available.

If we set too high a threshold before a therapist can be rated as "effective," we postpone the potential benefit to patients of having access to this information, because fewer therapists will be able to report their outcomes using the ACE criteria. When this happens, no one benefits (except perhaps the least effective therapists). When Bruce and I evaluated the PBH ALERT data, we saw that 10 cases gave us a statistically significant prediction of future outcomes. Not a great prediction, but much better than chance. This is why we used 10 as the minimal threshold. I suggested a threshold of 15 for the ACORN measures to account for their slightly lower reliability compared to the OQ-30 used by PBH. The OQ-30 has a coefficient alpha of .93 in a clinical setting, compared to the average coefficient alpha of .91 for the ACORN measures.

I do think it is unfortunate that many clinicians seem to have gotten the idea that an ultra brief measure will measure outcomes as well as a longer, more reliable measure. There are definite trade-offs, particularly when you get down to 5 items or fewer.

I believe it is important that we use outcome measures that come as close to .9 reliability as is practically possible. We can get there with 10-15 items. As a clinician, I believe the added measurement precision is worth it, not to mention the simple fact that 15-20 relatively specific items give us much better information on our patients than 4-5 very global items.

--

JebBrown - 30 Nov 2007

Tak was very helpful in clarifying some of the issues for me. Following are his main points:

1. Reliability is not a uniform construct, as internal consistency and
test-retest measure different aspects of it. Internal consistency is
just that--"coefficient of internal consistency"--whereas test-retest
would be considered a "coefficient of stability." From our standpoint,
what we are trying to construct is a measure with high internal
consistency, i.e., a test that has high intercorrelations among items.
In particular, we are estimating internal consistency using the
coefficient alpha (also known as Cronbach's alpha). To be clear,
internal consistency implies nothing about the stability of the test
scores over time. Thus, a test with very high internal consistency
could in fact have a low test-retest correlation coefficient. Rather, a
high internal consistency coefficient means that the test is pretty
coherent about what it's measuring. Whether or not that is indeed what
you intended to measure (i.e., validity) is a totally different thing.

2. The test-retest method has several problems. First, what is the
appropriate time between assessments? There is obviously no simple
answer to this. Second, if a low test-retest coefficient is obtained,
does it imply that (a) the test is unreliable or (b) the trait is
unstable? This is also theoretical; when measuring a trait that is
considered stable such as personality, a low test-retest coefficient
with one week apart clearly suggests an unreliable test. However, if
measuring a state such as feelings at the moment, then a low test-retest
coefficient says nothing--it could well be that the test is simply
unreliable, rather than it being sensitive to change. Thus, test-retest
provides NO support for reliability of a scale when measuring states,
and rather, to substantiate reliability, it is necessary to use a
different coefficient such as internal consistency. Obviously here I am
falsely dichotomizing trait and state, which is on a continuum.
However, I assume that we are conceptualizing clinical symptoms as
something closer to state rather than trait.

3. From our standpoint of wanting to assess clinical symptoms,
therefore, reliability based on internal consistency is highly relevant,
while reliability based on test-retest is not. For a measure of
clinical change, we thus want a measure with high internal consistency
coefficient. Whether or not this scale would have a high or low
test-retest coefficient is, based on point 2, both an empirical and a
theoretical matter. Under the assumption that the test has high
internal consistency, a high test-retest cannot distinguish between
arguments that the test (a) measures a trait rather than a state and (b)
is insensitive to change. For example, if we were to ask people, "How
hungry are you?" you would, overall, expect a low test-retest due to its
sensitivity. What if it resulted in high test-retest? For example, it
could be that (a) you screwed up your test-retest intervals (e.g.,
measured consistently at 1:30 in the afternoon) and/or (b) you are in
fact measuring a "trait" because you took your measurements with those
who never have enough food to eat. In the former case, the high
coefficient is exactly the result you want because of its sensitivity;
however, you could never tell just from one retest that this is indeed
due to sensitivity. If it resulted in low test-retest? Well, it still
doesn't support that the measure is indeed sensitive if you asked in
English where you should've used Spanish. Thus, whether something is a
trait or state (i.e., to what degree something should fluctuate) is
rather a theoretical matter, and how you obtain your test-retest is an
empirical matter that should be theoretically driven. In the context of
clinical symptoms, if there is high test-retest, the test could indicate
that it is (a) still accurately measuring clinical symptoms or (b) is
measuring a trait rather than a state. In the case of low test-retest
reliability, it is unclear as to whether the scale is (a) unreliable or
(b) accurately capturing the fluctuations. In other words, this
argument is not bidirectional, in that poor test-retest reliability does
not provide any support for a test's sensitivity to change (again,
because it could simply be due to an unreliable measure).

4. As a fundamental point, reliability is a necessity to claiming any
validity of a measure. This is also not a bidirectional argument;
reliability comes before validity. In other words, a scale with poor
reliability can never be valid. So, if a scale cannot coherently
measure something (i.e., unreliable), it is certainly difficult to claim
what that something is (i.e., invalid).

Takuya Minami, Ph.D.

--

JebBrown - 01 Dec 2007

Case mix issues:

Jason raises several valid issues with regard to case mix adjustment that we need to think through in the documentation for how we will calculate the residual change scores used when computing the severity adjusted effect size, including how to deal with "repeat customers" with multiple episodes of care.

One of the problems we face in specifying a case mix model is that different organizations have different variables available to them. At the very least, we will always have a first score, which explains the vast majority of the variance we are ever going to explain. However, we may or may not have session numbers and episode of care numbers, not to mention diagnosis, age, sex, medical status, etc.

With regard to the question of episodes of care, all of the SAS code I develop for my client organizations tracks episodes of care and utilizes this variable in the case mix model. However, this generally doesn't factor in until you have a couple of years' worth of data and multiple returning cases. Many of the sites using the ACORN measures capture a variable regarding prior treatment, which can also be used for case mix. Likewise, session number is used when available.

Since the ACE method needs to be generic, I suggest we specify that the analyst evaluating the outcomes make use of all the variables available in the data set for case mix adjustment and evaluate these variables as part of the general linear model used to calculate the residuals. We can include a set of recommended variables, with the knowledge that they may not be available for all (or any) of the patients. The analyst can include in their technical notes information on the inclusion criteria for cases and the variables used in the case mix model. The analyst should also document any caveats, limitations in the data, etc.
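
A minimal version of that residualization step, with intake score as the only case-mix predictor (the full ACE model would add whatever other variables the data set supports, and the function name here is my own):

```python
import statistics

def residual_change_scores(intake, change):
    """Residualize change scores on intake scores via simple OLS.
    A therapist's severity-adjusted effect size can then be based on
    the mean of these residuals rather than on raw change scores."""
    mx = statistics.mean(intake)
    my = statistics.mean(change)
    slope = (sum((x - mx) * (y - my) for x, y in zip(intake, change))
             / sum((x - mx) ** 2 for x in intake))
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(intake, change)]
```

Because the residuals average to zero across the whole sample by construction, a therapist's mean residual directly expresses how their clients fared relative to similar-severity clients.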

It seems to me that our goal is to create a standard for reporting that is sufficiently flexible to be broadly applicable, by giving the analyst the discretion to make technical adjustments in the case mix model consistent with the overall goal of providing a fair, reasonable and methodologically sound report of treatment outcomes.

--

JebBrown - 01 Dec 2007

Hi Jeb-
Back to some of the points about the two forms of reliability we're talking about (Cronbach's alpha AKA internal consistency; and test-retest):

Jacobson & Truax (1991) use test-retest reliability in their formula for the Reliable Change Index (see "r-sub-xx" in Table 1 on p. 14 of their article). I'm just curious whether there's a "paper trail" in the literature for the move to an RCI based on Cronbach's alpha (i.e., someone going through the arguments the way that you and Tak have done here) or if it was just a shift in convention.

I understand from your comments why Cronbach's alpha is important as a foundation (of reliability) for establishing validity, though isn't it the case that if alpha approaches unity (e.g., .98 or .99), there ceases to be a good argument about what the additional items are adding as a predictor for future outcome (in comparison to what the items "cost" in time, etc.), and some might be dropped as long as alpha is, say, .90 or above? So, if someone were choosing between a 20-item scale and a 15-item scale, and Cronbach's alpha were .98 and .97 respectively, would it make sense (all other things being equal) to use the 15-item scale if efficiency were a concern?

Also, from Tak's and your comments about test-retest reliability, it sounds as if under a few conditions we can have more faith in what a test-retest coefficient can show us: (1) if the coefficient is high during an appropriately brief time-span that also allows for enough "independence" (so to speak) between measurements; (2) if the coefficient is replicated in different samples; and (3) if the measure is also sensitive to change, in that from a baseline to some future timepoint (past the test-retest pairing) there is a significant difference between a control and a treatment group, or a standardized change score in a treatment group alone. Then we have some basis for calling that reliability coefficient a useful tool for establishing a kind of stability in the measure that also shows (with the other stats) that the measure is sensitive to change. Does that sound right?

--

JasonSeidel - 02 Dec 2007

I am not concerned so much with a "paper trail" in the literature as I am with doing something that makes sense and is psychometrically sound. There is much I could critique about the "paper trail" and conventions that have been used, but I don't want to waste time on this right now. For example, there are a host of problems with using the RCI without adjusting for severity as we have done in our standards, but you won't find any discussion of this in many of the articles citing the RCI. Tak has summarized very well the issues regarding forms of reliability and why we have chosen to use the coefficient alpha. I will ask Warren Lambert to review the issue, but I am comfortable with our rationale.

You will never have a coefficient alpha that approaches 1 unless you are asking the same item all of the time. The documentation in the AcornOutcomeQuestionnairesManual discusses in detail the relationship between item count and the coefficient alpha. I will note that I have observed what appear to be abnormally high coefficient alphas in some samples using ultra brief measures with items that are very global in nature and scaled on a visual-analog-like 10 or 11 point scale with very vague anchors at either end. In these cases, it appears that the format of the items may produce a response pattern that makes it look as though you are asking the same item several times. I tested several different item formats in community and clinical samples before settling on the 5 point Likert scale used on the ACORN measures.

I think you have perhaps confused some of the issues with regards to sensitivity to change. For example, establishing a difference between control groups and treatment groups is irrelevant to our purposes. Another error I find in questionnaire manuals is claiming that the fact that an outcome measure shows change in a treatment group and no change in a non-treatment community sample is evidence of validity. A community sample is unlikely to show change because they have scores close to the mean, just as patients with scores in the normal range average little change. A difference in change scores for a treatment sample compared to a community sample only demonstrates the reality of regression artifacts, not the validity of the questionnaire as being particularly sensitive to change in a treatment population.

A more interesting question perhaps is whether or not the change we observe exceeds what we would expect from regression to the mean. I rarely see this reported, but it is an interesting and worthwhile question to ask. For more on this topic, see RegressionToMean.

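
Jeb's regression-artifact point is easy to demonstrate by simulation. In the sketch below there is no true change at all; the apparent "improvement" in the high-intake group is pure regression to the mean (the reliability, cutoff, and sample size are arbitrary choices of mine):

```python
import random
import statistics

def regression_artifact(n=10000, reliability=0.75, cutoff=1.0, seed=1):
    """Observed score = true score + noise, measured twice, with zero
    true change. Returns mean (time1 - time2) 'improvement' for cases
    selected on a high time-1 score vs. for the unselected sample."""
    rng = random.Random(seed)
    noise_sd = ((1 - reliability) / reliability) ** 0.5  # true-score SD = 1
    selected, everyone = [], []
    for _ in range(n):
        true = rng.gauss(0, 1)
        t1 = true + rng.gauss(0, noise_sd)
        t2 = true + rng.gauss(0, noise_sd)
        everyone.append(t1 - t2)
        if t1 > cutoff:
            selected.append(t1 - t2)
    return statistics.mean(selected), statistics.mean(everyone)
```

The selected group shows a sizable mean drop with no treatment whatsoever, while the unselected sample changes by roughly zero - which is exactly why a treatment-vs-community difference in change scores does not establish sensitivity to change.
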
Again, I do not believe that test-retest reliability is the issue at all, except to note that in the absence of a high coefficient alpha, your test-retest will be lower. More importantly, your case mix model will explain less of the variance if you have a lower coefficient alpha. I have certainly seen this phenomenon when working with ultra brief measures with reliability below .9.

Without a good coefficient alpha, you do not have a reliable, unidimensional measure. Everything depends on the coefficient alpha.

--

JebBrown - 02 Dec 2007

I have edited the topics, trying to take into account input from this discussion and others. I also uploaded the technical document Tak and I prepared to specify the methodology for reporting outcomes.

--

JebBrown - 09 Dec 2007

I spoke with Warren Lambert this morning re the issue of minimum sample size necessary to report a clinician's outcomes. He reports that his own analyses suggest a minimum sample size of 14, but that this depends on a number of variables, including the reliability of the measure, the % of variance due to the clinician, and the intraclass correlations of patients seen by the same therapist. Warren - feel free to edit this if I'm not getting it right!

Tak is already working on some code to let us use multilevel modeling, also called hierarchical linear modeling (HLM), with any sample of clinicians using the ACORN measures. Use of HLM permits correct modeling of the variance at the clinician level.

Warren will forward some code he uses to do the same sort of analyses. There are a couple of different approaches we could take to calculate the intraclass correlations and the % of variance due to the therapists. Tak and Warren will compare notes and work out the technical details.
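
Until the HLM code arrives, the simplest version of "% of variance due to the therapist" is the one-way random-effects intraclass correlation. A balanced-design sketch (equal caseloads assumed for brevity; HLM relaxes that):

```python
import statistics

def icc_oneway(groups):
    """ICC(1) from a one-way random-effects ANOVA: the proportion of
    outcome variance attributable to the therapist. groups: one list
    of (severity-adjusted) outcomes per therapist, equal sizes."""
    k = len(groups[0])                    # cases per therapist
    m = len(groups)                       # number of therapists
    grand = statistics.mean([x for g in groups for x in g])
    ms_between = k * sum((statistics.mean(g) - grand) ** 2
                         for g in groups) / (m - 1)
    ms_within = sum((x - statistics.mean(g)) ** 2
                    for g in groups for x in g) / (m * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
```

HLM generalizes this to unbalanced caseloads and lets the case-mix covariates enter at the patient level at the same time.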

--

JebBrown - 12 Dec 2007