Evidence-based assessment/Instruments/General Behavior Inventory

The General Behavior Inventory (GBI) is a questionnaire designed by Richard Depue and colleagues to measure manic and depressive symptoms in adults, as well as to assess for cyclothymic disorder. It is one of the most widely used psychometric tests for measuring the severity of bipolar disorder in research, and it also can track the fluctuation of symptoms over time. The GBI was first designed for use with adults; however, it has been adapted into versions that allow parents to rate their children. Multiple short versions are available that are often more convenient to use. Current versions include the original full length GBI (self-report), the Parent GBI (P-GBI), the Parent GBI-10-item Mania Scale (P-GBI-10M) and two Parent GBI-10-item Depression Scales (Form A & B), and the 7 Up-7 Down Inventory, as well as a Sleep scale. A version was tested as a teacher-report version, but it proved not to be practical for general use.

Access and Use
The General Behavior Inventory is free to use in both research and clinical work. Some of the short forms have been formally CC-BY-SA licensed. The original author, Richard Depue, asks that you contact him to let him know about the project (rad5@cornell.edu). The GBI has been translated into several languages (in short forms as well as full length, and parent as well as self-report); see the External Links section for a link to an extensive repository.

Suggested citations are:

Self report, full length version
Depue, R. A., Slater, J. F., Wolfstetter-Kausch, H., Klein, D. N., Goplerud, E., & Farr, D. A. (1981). A behavioral paradigm for identifying persons at risk for bipolar depressive disorder: A conceptual framework and five validation studies. Journal of Abnormal Psychology, 90, 381-437. https://doi.org/10.1037/0021-843X.90.5.381

and if using specifically in teens:

Danielson, C. K., Youngstrom, E. A., Findling, R. L., & Calabrese, J. R. (2003). Discriminative validity of the General Behavior Inventory using youth report. Journal of Abnormal Child Psychology, 31, 29-39.

Parent report about youth symptoms (full length version):
Youngstrom, E. A., Findling, R. L., Danielson, C. K., & Calabrese, J. R. (2001). Discriminative validity of parent report of hypomanic and depressive symptoms on the General Behavior Inventory. Psychological Assessment, 13, 267-276.

Teacher report (not recommended for clinical use):
Youngstrom, E. A., Joseph, M. F., & Greene, J. (2008). Comparing the psychometric properties of multiple teacher report instruments as predictors of bipolar disorder in children and adolescents. Journal of Clinical Psychology, 64, 382-401. http://dx.doi.org/10.1002/jclp.20462

Sleep scale
Meyers, O. I., & Youngstrom, E. A. (2008). A Parent General Behavior Inventory subscale to measure sleep disturbance in pediatric bipolar disorder. Journal of Clinical Psychiatry, 69, 840-843. https://doi.org/ej07m03594

Parent report
Youngstrom, E. A., Van Meter, A. R., Frazier, T. W., Youngstrom, J. K., & Findling, R. L. (2018). Developing and validating short forms of the Parent General Behavior Inventory Mania and Depression Scales for rating youth mood symptoms. Journal of Clinical Child & Adolescent Psychology. https://doi.org/10.1080/15374416.2018.1491006

Self report
Youngstrom, E. A., Perez Algorta, G., Youngstrom, J. K., Frazier, T. W., & Findling, R. L. (2020). Evaluating and Validating GBI Mania and Depression Short Forms for Self-Report of Mood Symptoms. Journal of Clinical Child & Adolescent Psychology, 1-17.

7 Up-7 Down
Youngstrom, E. A., Murray, G., Johnson, S. L., & Findling, R. L. (2013). The 7 Up 7 Down Inventory: A 14-item measure of manic and depressive tendencies carved from the General Behavior Inventory. Psychological Assessment, 25, 1377-1383. https://doi.org/10.1037/a0033975

Note that the 7 Up has less content coverage, and small but significant differences in reliability and validity compared to the 10 item mania scale. Another practical (and sometimes ethical) consideration is that the 7 Down includes the suicidal ideation item, whereas the 10 item depression short forms do not ask about suicidal ideation.

Other short forms have not been extensively published or replicated.

GBI scoring
The current GBI questionnaire includes 73 Likert-type items that reflect symptoms of different moods, plus six additional validity items at the end. The original version of the GBI used "case scoring," in which items were coded as "threshold" or "not at threshold": symptoms rated as 1 or 2 were considered absent, and symptoms rated as 3 or 4 were considered present. However, Likert scoring is a better option. The items on the GBI are now rated from 0 to 3: 0 (never or hardly ever present), 1 (sometimes present), 2 (often present), and 3 (very often or almost constantly present). All versions that we are circulating now use the 0 to 3 anchors. The scoring is the same for the self-report and parent report versions across the full length, sleep, and 10 and 7 item short forms.

The Depression Scale consists of the sum of items:

01, 03, 05, 06, 09, 10, 12, 13, 14, 16, 18, 20, 21, 23, 25, 26, 28, 29, 32, 33, 34, 36, 37, 39, 41, 44, 45, 47, 49, 50, 52, 55, 56, 58, 59, 60, 62, 63, 65, 67, 68, 69, 70, 71, 72, 73.

The Hypomanic/Biphasic scale sums these items:

02, 04, 07, 08, 11, 15, 17, 19, 22, 24, 27, 30, 31, 35, 38, 40, 42, 43, 44, 46, 48, 51, 53, 54, 57, 61, 64, 66. (Note that Depue's scoring includes item 44 on both the depression and the hypomanic/biphasic scale).

To compare scores to data offered by Youngstrom et al. (any publication), just add the items (if scored 0 to 3), or add the items and subtract the number of items on the scale (if scored 1 to 4). To compare scores to college student and adult data (published by Depue, Klein, and others), check carefully whether to use the 0/1 case scoring method versus a form of Likert scoring, as appropriate.
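The scoring rule described above can be sketched in Python. This is a minimal illustration, not the official scoring syntax: the function name `score_scale` and the dictionary input format (item number mapped to response) are assumptions made for this example, and the sketch makes no attempt to handle missing items (a missing item raises a `KeyError`).

```python
# Item numbers copied from the scale definitions above (73-item GBI).
DEPRESSION_ITEMS = [
    1, 3, 5, 6, 9, 10, 12, 13, 14, 16, 18, 20, 21, 23, 25,
    26, 28, 29, 32, 33, 34, 36, 37, 39, 41, 44, 45, 47, 49, 50, 52,
    55, 56, 58, 59, 60, 62, 63, 65, 67, 68, 69, 70, 71, 72, 73,
]
HYPOMANIC_BIPHASIC_ITEMS = [
    2, 4, 7, 8, 11, 15, 17, 19, 22, 24, 27, 30, 31, 35,
    38, 40, 42, 43, 44, 46, 48, 51, 53, 54, 57, 61, 64, 66,
]


def score_scale(responses, items, coded_1_to_4=False):
    """Return the scale total on the 0-3 metric.

    responses: dict mapping item number -> integer response
               (hypothetical input format for this sketch).
    items: item numbers belonging to the scale.
    coded_1_to_4: True if responses use the original 1-4 anchors; the
                  number of items is then subtracted from the sum.
    """
    total = sum(responses[i] for i in items)
    if coded_1_to_4:
        total -= len(items)
    return total


# The same answers in the two codings give the same total.
answers_0_to_3 = {i: 2 for i in range(1, 74)}
answers_1_to_4 = {i: 3 for i in range(1, 74)}
dep_new = score_scale(answers_0_to_3, DEPRESSION_ITEMS)
dep_old = score_scale(answers_1_to_4, DEPRESSION_ITEMS, coded_1_to_4=True)
```

Because item 44 appears in both lists, it contributes to both totals, which matches Depue's scoring as noted above.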

GBI Short Forms
There are several short forms that have been carved from the GBI. These include the 10-item mania and depression scales, the 7-item versions (the 7 Up-7 Down scale), and a sleep scale.

For all of these, the score is simply the sum of the items. For example, if the 10 item scales are given as standalone forms, each score is the sum of its ten items.

If the 73-item version is given, then these are the items to extract for each short form:


 * Sleep (7 items): 5, 15, 18, 25, 31, 37, 52.
    * Items 5 and 18 load on a lassitude factor, and item 52 crossloads on it.
    * Items 15, 25, 31, and 37 load on the larger insomnia factor, as does the main loading for item 52.
 * 10 item Mania: 53, 54, 4, 11, 22, 40, 27, 19, 64, 31.
 * 10 item Depression Form A: 3, 45, 68, 16, 56, 13, 5, 20, 50, 59.
 * 10 item Depression Form B: 34, 14, 63, 72, 62, 9, 23, 6, 32, 18.
 * 7 Down: 23, 34, 63, 47, 56, 62, 73.
 * 7 Up: 22, 31, 30, 64, 43, 46, 38.


 * Hypomania (Depue's scoring): 4, 7, 8, 11, 15, 17, 22, 27, 30, 31, 38, 42, 43, 44, 46, 51, 54, 57, 61, 64, 66.
 * Mixed (Depue's scoring): 2, 19, 24, 35, 40, 48, 53.
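The item lists above can be applied to a completed 73-item form with a short routine like the following. This is a hedged sketch: the dictionary keys (`mania_10`, `seven_up`, and so on) and the function name `score_short_forms` are labels invented for this example, and the code assumes responses are already coded 0 to 3.

```python
# Short-form item numbers copied from the list above (73-item numbering).
SHORT_FORMS = {
    "sleep": [5, 15, 18, 25, 31, 37, 52],
    "mania_10": [53, 54, 4, 11, 22, 40, 27, 19, 64, 31],
    "depression_10a": [3, 45, 68, 16, 56, 13, 5, 20, 50, 59],
    "depression_10b": [34, 14, 63, 72, 62, 9, 23, 6, 32, 18],
    "seven_down": [23, 34, 63, 47, 56, 62, 73],
    "seven_up": [22, 31, 30, 64, 43, 46, 38],
}

# Depue's subsets of the hypomanic/biphasic scale.
DEPUE_HYPOMANIA = [4, 7, 8, 11, 15, 17, 22, 27, 30, 31, 38, 42, 43, 44,
                   46, 51, 54, 57, 61, 64, 66]
DEPUE_MIXED = [2, 19, 24, 35, 40, 48, 53]


def score_short_forms(responses):
    """Sum the items of each short form.

    responses: dict mapping item number (1-73) -> response coded 0-3
               (hypothetical input format for this sketch).
    """
    return {
        name: sum(responses[i] for i in items)
        for name, items in SHORT_FORMS.items()
    }


# A respondent who endorses every item at the maximum rating of 3
# reaches the ceiling of each form (30 for the 10 item scales).
ceiling = score_short_forms({i: 3 for i in range(1, 74)})
```

Keeping the item numbers in one table makes it easier to check them against the version of the GBI actually administered, since item order can differ between versions found online.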

Note that researchers have used other sets of items as short forms.

A sortable table below shows the overlap between items across the different forms, along with the item content:

Additional research versions:


 * LAMS 12 Item self report: 52, 40, 44, 59, 19, 29 (factor 1); 11, 7, 31, 38, 22, 4 (factor 2).
 * Lewinsohn used a 12 item version in the Oregon epidemiological study: 4, 8, 11, 15, 30, 44, 51, 64 (all from Depue's hypomanic set); 2, 19, 24, 48 (from Depue's mixed set). Items 4, 11, 19, and 64 overlap with the 10M scale and could be used in a calibration study.
 * Jensen et al. rationally derived a 7 item impulsive aggression scale: 27, 42, 44, 51, 14, 39, 53, 54.

Reliability
For all versions of the GBI, the full length scales have exceptionally high internal consistency reliability. This is due to a combination of the scale length (28 or 46 items) and the items asking about related symptoms, often in blends. The length of the scales and the high reading level make them less useful in many clinical settings. Item Response Theory (IRT) provides a different way of estimating the reliability of test scores that is not tied to the length of the scale. The IRT approach also has the advantage of seeing how reliable scores are across the range of the underlying trait. For the GBI, IRT would show whether the reliability stays at acceptable levels even at low levels of depression or manic symptoms (as would often be seen if using the scale in a general community setting or screening), as well as at the high end of mood symptom severity (as might be encountered in a hospital).

These figures compare the reliability for several of the short forms based on self report (i.e., teenagers using the GBI to describe themselves).

The GBI or P-GBI for assessing the probability of mood disorders
The diagnostic accuracy of the test depends on the base rate of disorders for your sample. The positive and negative predictive values are directly influenced by base rate. However, the sensitivity and specificity of a test also can vary from sample to sample (Kraemer, 1992). For this reason, the cut scores published in any article cannot be assumed to be equally valid in new contexts. Please refer to the GBI Manual (available from Depue) and the monograph published in the Journal of Abnormal Psychology for additional information about the measure. Two meta-analyses have included the GBI, one in youths under age 18, and the other as a self-report measure in adults. The GBI was in the top tier of measures in terms of diagnostic accuracy in both meta-analyses. However, the short forms have not had their diagnostic accuracy published in adult samples (i.e., all published work used the 73 item version).

Bipolar disorder is rare in most clinical settings (e.g., prevalence of less than 10% in outpatient and private practices, and 2-4% in the general population). Because of the low “base rate,” most people scoring high on any screening test are likely to not have the condition. Put another way, the “false positives” will outnumber the “true positives” in most situations unless bipolar disorder is fairly common where one is using the test. The preferred method for using these tools would be to focus on the change in likelihood of a bipolar diagnosis based on high and low scores. Low scores on a good test decrease the odds that a given youth has a bipolar disorder, just as high scores should increase the odds. It is possible to formally combine (1) the change in odds associated with a test score and (2) the prior probability that the youth had a bipolar diagnosis to obtain a new estimate of the probability that the child has bipolar disorder. This can be done visually (using a “nomogram”), mathematically, by use of a table containing the posterior probabilities for a fixed prevalence, or using an online calculator. There are several excellent sources for clinicians who are interested in learning more about using changes in odds as a way of refining diagnosis, including the Prediction section in the materials about Evidence Based Assessment on Wikiversity.
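The mathematical route described above is Bayes' theorem in odds form: convert the prior probability to odds, multiply by the diagnostic likelihood ratio (DLR) attached to the score range, and convert back to a probability. The Python sketch below shows the arithmetic. The function name is an assumption made for this example, and the likelihood ratios used in the usage lines are illustrative placeholders, not published values for any GBI form.

```python
def posterior_probability(prior, likelihood_ratio):
    """Combine a prior probability with a diagnostic likelihood ratio.

    prior: probability of the diagnosis before testing (0 < prior < 1),
           for example the base rate at the clinic.
    likelihood_ratio: DLR associated with the observed score range.
    Returns the revised (posterior) probability of the diagnosis.
    """
    if not 0 < prior < 1:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood ratio cannot be negative")
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)


# Illustrative values only: a 5% base rate combined with a hypothetical
# DLR of 7 for a high score and a hypothetical DLR of 0.1 for a low score.
after_high_score = posterior_probability(0.05, 7.0)
after_low_score = posterior_probability(0.05, 0.1)
```

A likelihood ratio of 1 leaves the probability unchanged, which is a quick sanity check on any implementation.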

The changes in odds (or diagnostic likelihood ratios) associated with scores on six different tests (the P-GBI, the P-YMRS, the Achenbach CBCL, TRF, and YSR, and the self-report GBI) are available based on a large sample of outpatients, and an update based on a more recent review is also available. We also are including a table here that is based on these likelihood ratios, estimating the probability that a child has bipolar disorder assuming a base rate of 5% in combination with a test score in the particular range. We chose the 5% base rate estimate for three reasons: (1) because other colleagues are estimating that 5% of the youths evaluated at outpatient academic research centers meet criteria for a bipolar spectrum disorder (e.g., 6-7% of outpatient cases evaluated in the TEAM multi-site NIMH grant; Geller et al., 2002); (2) because 5% is low enough to serve as a reminder that bipolar disorder is likely to be rare in community mental health, outpatient, and private practice settings, yet high enough to act as a reminder that the disorder can occur and should be assessed; (3) because a 5% base rate will be reduced to negligible probabilities by low or moderate scores on good tests, and raised to intermediate probabilities (30% to 50% range) by high scores on the same tests.

If bipolar disorder is substantially more rare or more common at your site than 5%, we strongly recommend using a rate benchmarked against similar settings as the starting point instead.

Table note: a = Flesch-Kincaid estimate of grade level.

Using the GBI to measure treatment response
The GBI has been used in several treatment studies, and it shows good sensitivity to treatment effects. The 10 item versions in particular are brief enough to be repeated during the course of treatment, but show similar effect sizes to interview-based ratings in research studies. The 7 Up-7 Down scales have not been tested in an extracted, standalone format in treatment studies yet.

Here are benchmarks for evaluating change during treatment:

Note: The benchmarks are based on clinical and nonclinical norms, following the "clinically significant change" model by Jacobson and colleagues.

Interpretive example for measuring treatment progress and outcome
Juan's mother fills out a PGBI-10M and PGBI-10Da as part of an evaluation. Both of these have raw scores that range from 0 to 30. Juan initially scores a 21, which is in a high risk range for potential bipolar disorder. After the feedback and first therapy session, the score comes down to a 17 (4 point drop). This is larger than the "Minimally Important Difference" (MID) of 3, suggesting that the change is large enough for the person to believe that treatment might be helping some.

However, the amount of change needed to be confident that treatment is actually contributing to reliable change would need to be larger: The 95% confidence in change target is 6 points for this measure (equating to a reliable change index > 1.96 in Jacobson's approach).

After several months of treatment, Juan's score according to his mother's report is down to a 7. This is enough to be confident that treatment is helping. The 14 point reduction (21-7 = 14 point difference) exceeds the targets for MID and reliable change, and the new score of 7 is also lower than the "Back" into the normal range threshold of 9. The Back threshold is the 95th percentile for a reference group without bipolar disorder (in this case often with other mild or moderate clinical issues, as there is no nonclinical standardization sample for the PGBI, like most clinical symptom assessments). Scores this high are likely to still be noticeable and may be concerning to others, but they are also within the range of what could occur for other reasons besides having a bipolar disorder, including problems in daily living as a youth or adult. The Back threshold is the most liberal of the "clinically significant change" definitions proposed by Jacobson and colleagues.

Reducing the score to a 6 or lower would satisfy Jacobson's "Closer" definition -- reliable change combined with a score more typical of the nonbipolar than bipolar reference groups (operationally defined as the weighted mean of the two groups). Again, scores of 5-6 may be noticeable and sometimes irritating, but they also are a marked improvement compared to where Juan started. This would be an even more impressive example of clinically significant change.

If treatment continued and succeeded in getting his score down to a 0 or 1, that would not only show near complete elimination of the symptoms, but it also would satisfy Jacobson's most stringent definition of clinically significant change -- getting the score Away from the clinical reference group (e.g., below the 2.5th percentile of the clinical reference group). This is an exceptionally stringent definition, and impossible to achieve with many outcome measures, where two SDs below average would require negative raw scores.
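The benchmarks used in Juan's example can be collected into a small checker, sketched below in Python. The numeric thresholds are the ones quoted in the example (MID of 3, reliable change of 6 points, Back threshold of 9, Closer at 6 or lower, Away at 0 or 1); they apply to the parent-reported 10 item mania scale in this example and should not be assumed for other forms. The function name and the handling of scores that land exactly on the MID, reliable change, or Back cutoffs are assumptions made for this sketch.

```python
# Benchmarks quoted in the interpretive example above.
MID = 3                # minimally important difference, in points
RELIABLE_CHANGE = 6    # drop needed for 95% confidence in change
BACK_THRESHOLD = 9     # score must fall below this ("Back" into normal range)
CLOSER_THRESHOLD = 6   # score at or below this meets "Closer"
AWAY_THRESHOLD = 1     # score at or below this meets "Away"


def describe_change(baseline, current):
    """Compare two scores against the example benchmarks.

    Following Jacobson's approach, the Back, Closer, and Away definitions
    require reliable change in addition to crossing the score threshold.
    Treating a drop exactly equal to MID or RELIABLE_CHANGE as meeting the
    target is an assumption of this sketch.
    """
    drop = baseline - current
    reliable = drop >= RELIABLE_CHANGE
    return {
        "drop": drop,
        "meets_mid": drop >= MID,
        "reliable_change": reliable,
        "back": reliable and current < BACK_THRESHOLD,
        "closer": reliable and current <= CLOSER_THRESHOLD,
        "away": reliable and current <= AWAY_THRESHOLD,
    }


# Juan's scores from the example: 21 at intake, 17 after the first
# session, and 7 after several months of treatment.
after_first_session = describe_change(21, 17)
after_several_months = describe_change(21, 7)
```

In this sketch the first comparison meets the MID but not reliable change, and the second meets reliable change and the Back definition but not Closer, mirroring the narrative above.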

Peer Reviewed Research
The first paper published on the GBI was in 1981, and research has appeared steadily since then. The GBI consistently has exceptional evidence of reliability, due to its combination of length and well-written (but complex) items. It has shown excellent evidence of discriminative validity in two meta-analyses, one focused on self-report in adults and the other looking at performance with children and teenagers. Miller et al. (2009) noted, in their review of assessment instruments for adult bipolar spectrum disorders: "As a diagnostic screening tool, the scale with the best support is the GBI, as it has consistently demonstrated sensitivity of approximately .75 and specificity above .97. Readers should be cautious, however, because multiple versions of the scale exist, and cutoffs for a positive screen have not been firmly established."

PubMed Search: Click here for a current search on PubMed, a free database that covers medicine (so some articles published in psychology journals might be missing). The entries will usually include abstracts, and sometimes will include a version of full text (especially if the project was grant funded). The search is designed to be highly specific (i.e., not including lots of irrelevant articles), but it might miss some articles.

Languages Available
The GBI has been translated into multiple languages. Some of the different languages available are linked here.

There is a repository that includes many of these here. An older version of this subset is hosted on Trello here. If you are looking here, note that there are separate columns for the 7 Up-7 Down, General Behavior Inventory (self-report), and parent report versions.

Research Resources
Scoring syntax to make all of the above scales is available in SPSS in the OTOPS project. We are working on making a version available in R as well.

The code will work with any informant or language, provided that the variable names are the same. Because there are different versions of the GBI available on the Internet, please be careful to check that the content and order of the items is the same in the version you are using as in the 73 (+6 validity items) version that we used as the basis for the code. A second caveat is to check whether item level scores are typed in as 0 to 3 (the newer format) versus 1 to 4 (the original format).

If your institution has a Qualtrics license, you can import these .QSF files and have the survey ready to launch out of the box, or you can customize it for your project.

Set of QSF files here.

Supplemental Materials
Two papers that tested several short forms when used as parent report and as adolescent self report included supplemental materials that provided more detail about methods and results. These supplemental materials are published here so that they are freely accessible and archived (rather than having them only behind a publisher's paywall).

Factor Structure of the Short Forms
Tables

Rationale for the Ranking of Expected Criterion Correlations
Both papers had two samples, an academic clinic and a community mental health center, along with a large set of variables that could be used to examine the criterion validity of the short forms compared to the full length GBI scales.

Here is the detailed description of how the authors ranked the criterion correlations from what they expected to be largest to smallest:

The GBI scales were expected to show the highest correlation with the cognate rating scale on the Youth Self Report (YSR) because they were converging measures of the same trait, they were completed by the same informant (i.e., they shared method variance) (Podsakoff, MacKenzie, & Podsakoff, 2012), and they were continuous scales (not categorical variables, which would shrink the size of the observed correlation even when measuring the same construct) (Cohen, 1988).

The YSR Internalizing score was expected to show the highest correlation because of the shared method variance: both it and the GBI were completed by the same person. They would be expected to correlate r ~.3 to .4 even if they measured different constructs, due to response set, mood congruent biases, and other factors unrelated to the trait (Podsakoff et al., 2012). Further, Internalizing and Externalizing correlate r ~.6 in the standardization sample (Achenbach & Rescorla, 2001) and also in our samples. Finally, the 28-item and 10-item GBI versions included some “mixed” items, and so they had depression content embedded in them. The 7 Up, in contrast, was “purer” and showed lower correlations with Internalizing in both samples (though still > .4).

The YSR Externalizing score was the best available converging measure for the mania scales in the two samples, but it was not expected to show quite as high criterion correlations as the depression-Internalizing coefficients. A meta-analysis (Youngstrom, Genzlinger, Egerton, & Van Meter, 2015) of diagnostic accuracy shows that Externalizing is not as strongly associated with bipolar disorder as the GBI is: The effect size was r ~.45 for parent ratings on measures like the GBI, versus r ~.34 for measures such as the CBCL Externalizing; r ~.26 for GBI versus r ~ .13 for YSR correlations with diagnoses. The Externalizing score does not include items asking about grandiosity, inflated self-esteem, elevated or expansive mood, or decreased need for sleep without fatigue – the “handle” symptoms that are more specific to hypomania and mania (Craney & Geller, 2003; Youngstrom, Birmaher, & Findling, 2008). Put simply, Externalizing is not as good a measure of the mania construct as the GBI scales are, so the criterion correlation with it is not going to be stellar.

Next, the youth and parent correlations use different sources, eliminating the shared method variance component. Meta-analyses find that parent-youth agreement about the same trait in the youth hovers in the r ~.2 to .3 range (Achenbach, McConaughy, & Howell, 1987; De Los Reyes et al., 2015), exactly what we see in the Academic sample and similar to the estimates in the Community sample.

For the correlations with the diagnoses and interview-based severity ratings: Meta-analyses have established that parent report is significantly more strongly related to youth diagnoses than youth self-report is (Stockings et al., 2015; Youngstrom et al., 2015). Converting the effect sizes from Youngstrom et al. 2015 into correlations yields an estimate of r ~ .45 for parent ratings and corresponding youth diagnoses, versus .26 for youth ratings and their own diagnoses. The same pattern will hold for the YMRS and CDRS-R as the diagnoses – they were based on the same interview as the KSADS diagnoses, and so they correlate with the diagnosis r > .9. Because of attenuation artifacts when using a categorical variable (i.e., diagnosis) instead of a continuous one (i.e., severity on the YMRS or CDRS-R), we would expect the correlations with diagnosis to be about 80% of the size of the correlation with the severity rating (Cohen, 1988).

Depression scores were expected to show a small to moderate correlation with age as well as with female sex based on normative data (e.g., patterns in Internalizing scores in the standardization sample for the ASEBA; Achenbach & Rescorla, 2001). Anxiety diagnoses were expected to show a small to moderate correlation with depression scales due to overlapping symptoms (e.g., the tripartite model of depression and anxiety) (Chorpita & Daleiden, 2002; Watson, Clark, et al., 1995; Watson, Weber, et al., 1995).

Last in the rankings were some demographic variables (e.g., race) and unrelated diagnoses that were expected to have near-zero correlation coefficients.
