Evidence-based assessment/BipolarYouthMeta

=Multivariate Meta-Analysis of the Discriminative Validity of Caregiver, Youth, and Teacher Rating Scales for Bipolar Disorder in Youths: Mother Still Knows Best about Mania=

Eric A. Youngstrom, Joshua A. Langfus, Caroline Vincent, Jacquelynne Genzlinger

University of North Carolina at Chapel Hill

Avery Loeb (Chapel Hill High School or NCSSM?), X. Nicholas Fogg (Loyola), Emma G. Choplin (U of Miami), TBD....

Gregory Egerton

University at Buffalo, The State University of New York

Anna Van Meter

Northwell Hospital/Zucker Hillside

First published version: (2015) Archives of Scientific Psychology.

Author Note

Eric A. Youngstrom, Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill; Jacquelynne E. Genzlinger, Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill; Gregory A. Egerton, Department of Psychology, University at Buffalo, The State University of New York; Anna R. Van Meter, xxxx

This work was supported in part by a grant from the Lindquist Foundation. We thank Camille Sowder, Mian-Li Ong, Karen Bourne, and Ericka McKinney for their help with coding and contributions to preliminary analyses. We thank Angela Bardeen, Ph.D., for consultation with the search strategy and tracking of coding. Thanks also to Wolfgang Viechtbauer, Ph.D., for consultation around the metafor software package and technical aspects of multivariate meta-analysis.

Correspondence concerning this article should be addressed to Eric A. Youngstrom, Department of Psychology and Neuroscience, University of North Carolina at Chapel Hill, CB #3270, Davie Hall, Chapel Hill, NC 27599.

Email: eay@unc.edu

Plain English Abstract
The past two decades have seen a rapid increase in the amount of research on bipolar disorder in children and adolescents, including studies that look at the accuracy of symptom checklists as a way of telling if a youth might have bipolar disorder. How accurate are these checklists? Does accuracy change if they are completed by the youth or a teacher instead of the primary caregiver? Are checklists that focus specifically on symptoms of mania more accurate than checklists with more general content, as is typical of older measures? How much does the performance of checklists change depending on whether the sample only includes youths seeking treatment, versus including a healthy comparison group? We addressed these research questions by systematically searching major publication databases (PsycINFO, PubMed, and Google Scholar) and looking at the 4,094 hits from our search. We looked for studies that reported enough information to (1) estimate the size of the difference in checklist scores (“effect size”) between cases with versus without research diagnoses of bipolar disorder, (2) for youths 18 years of age or younger, (3) including at least 10 cases with bipolar disorder. Because we wanted to compare caregiver, teacher, and youth report on the same measures, we used a newer statistical technique, multivariate meta-analysis, to combine and compare results within as well as across studies. We found 63 effect sizes from 8 checklists used in 27 separate samples, including 11,941 youths, of whom 1,834 had diagnoses of bipolar disorder. Overall, checklists did a good job separating cases with bipolar disorder from other youths, with an effect size of 1.05, meaning that bipolar cases scored more than a standard deviation higher. Caregiver report was the most accurate across all checklists, performing significantly better than youth or teacher report. Scales focusing on manic symptoms also outperformed general symptom checklists.
Sample composition also changed the accuracy of the checklists a great deal: Many studies either included healthy children or excluded youths with diagnoses that are difficult to tell apart from bipolar disorder. These studies gave an overly optimistic sense of how well the checklist might do at identifying youths with bipolar disorder in most clinical settings. Three checklists have shown validity in multiple studies and appear accurate enough to be helpful in improving diagnosis in clinical practice.

Abstract
Objective: To meta-analyze the diagnostic efficiency of checklists for discriminating pediatric bipolar disorder (PBD) from other conditions. Hypothesized moderators included (a) informant – we predicted caregiver report would produce larger effects than youth or teacher report; (b) scale content – scales that include manic symptoms should be more discriminating; and (c) sample design – samples that include healthy control cases or impose stringent exclusion criteria are likely to produce inflated effect sizes.

Methods: Searches in PsycINFO, PubMed, and Google Scholar generated 4,094 hits. Inclusion criteria were (1) sufficient statistics to estimate a standardized effect size, (2) participants aged 18 years or younger, (3) at least 10 cases with PBD, and (4) PBD diagnoses based on a semi-structured diagnostic interview. Multivariate mixed regression models accounted for nesting of multiple effect sizes from different informants or scales within the same sample.

Results: Data included 63 effect sizes from 8 rating scales across 27 separate samples (N=11,941 youths, 1,834 with PBD). The average effect size was g=1.05. Random effect variance components within study and between study were significant, ps<.00005. Informant, scale content, and sample design all explained significant unique variance, even after controlling for design and reporting quality.

Discussion: Checklists have clinical utility for assessing PBD. Caregiver reports discriminated PBD significantly better than teacher report and youth self-report, although all three showed discriminative validity. Studies using “distilled” designs with healthy control comparison groups, or stringent exclusion criteria, produced significantly larger effect size estimates; applying the scales in clinical practice as described in those studies could lead to inflated false positive rates.

Keywords: Bipolar disorder; children and adolescents; sensitivity and specificity; meta-analysis; mania

= Multivariate Meta-Analysis of the Discriminative Validity of Caregiver, Youth, and Teacher Rating Scales for Bipolar Disorder in Youths: Mother Still Knows Best about Mania =

The diagnosis of bipolar disorder in children and adolescents has been one of the most contentious issues in child mental health over the past two decades. The questions of whether bipolar disorder could manifest before puberty, whether the same criteria should be used for children as for adults, and the validity and importance of collateral reports about mood and behavior by caregivers and teachers have guided a growing body of research. Given these longstanding debates and the growing literature on the topic of pediatric bipolar disorder, it is opportune to undertake a quantitative review of the literature on assessment of pediatric bipolar disorder. A meta-analysis also can address larger themes of cross-informant validity and fundamental research design issues that cut across the wider domains of clinical assessment.

Importance of Accurate Identification of Bipolar Disorder
A substantial portion of mood disorders fall along the spectrum of bipolar disorders, which includes not only bipolar I, but also bipolar II, cyclothymic disorder, and bipolar not otherwise specified (now “other specified bipolar and related disorders”). Both longitudinal and epidemiological studies indicate that at least a third of serious mood disorders follow a bipolar course. Bipolar disorder also needs different treatment strategies. It is not just the difference between bipolar and unipolar depression that matters for prognosis or treatment prescription. Disruptive behavior disorders (oppositional defiant disorder, conduct disorder, and the new diagnosis of disruptive mood dysregulation disorder) also are challenging to distinguish from bipolar disorder, and they would indicate substantially different approaches to treatment. The same is true of attention-deficit/hyperactivity disorder (ADHD), which has been the nexus of extensive debate both because of overlapping symptoms and because of concerns about possible iatrogenic effects of using stimulants when the person has bipolar disorder.

Despite the risks associated with misdiagnosis, clinical practice often does an exceptionally poor job of recognizing bipolar disorder. A meta-analysis comparing clinical diagnoses of children and adolescents to diagnoses from structured or semi-structured interviews found an average kappa of .27. Dismayingly, the kappa was even lower for bipolar disorder, κ = .08. Similarly, comparisons of the accuracy of clinical diagnoses versus research consensus diagnoses (arguably even more valid than semi-structured interviews alone) have found that bipolar diagnoses are among the least accurate, particularly among ethnic minority groups. Misdiagnosis reduces the likelihood of appropriate intervention.

Potential Role of Rating Scales for Discriminating Between Bipolar and Other Diagnoses
Rating scales and checklists can potentially improve diagnosis: they are inexpensive, depend less on training to be implemented consistently across settings, and often have good psychometric properties within the populations and settings where they are used. Some also offer age-based norms, providing an empirical method for comparing behavior and emotions against milestones of normative development. If diagnostic efficiency statistics, such as sensitivity and specificity or diagnostic likelihood ratios, are available, then it is possible to combine information from checklist scores with estimates of baseline probability and other risk factors to arrive at a revised probability of diagnosis. The assessment methods advocated by Evidence Based Medicine (EBM) use Bayesian techniques, packaged in a way that is accessible to clinicians, to integrate the information from test results with other available clinical data.
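
As a minimal sketch of the Bayesian updating described above, the following Python example converts a baseline probability to odds, multiplies by a diagnostic likelihood ratio, and converts back to a probability. The function names and all numeric values (base rate, sensitivity, specificity) are hypothetical illustrations, not estimates from this meta-analysis.

```python
def positive_likelihood_ratio(sensitivity: float, specificity: float) -> float:
    """Diagnostic likelihood ratio for a score above the cutoff."""
    return sensitivity / (1.0 - specificity)


def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Update a prior probability using the odds form of Bayes' theorem."""
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood_ratio must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


# Hypothetical values chosen only for illustration.
base_rate = 0.10  # assumed prevalence of bipolar disorder in the clinic
dlr = positive_likelihood_ratio(sensitivity=0.80, specificity=0.90)
revised = posterior_probability(base_rate, dlr)
print(f"Likelihood ratio: {dlr:.1f}; revised probability: {revised:.3f}")
```

Under these assumed values, a score above the cutoff raises the probability of bipolar disorder from 10% to roughly 47%, which would still leave considerable diagnostic uncertainty.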

The application of these methods to the specific problem of diagnosing pediatric bipolar disorder has already shown large effect sizes for changing clinical practice by making estimates more accurate, eliminating a bias towards overestimating the probability of a bipolar diagnosis, and improving the consistency of agreement (i.e., reducing the range of opinion between clinicians). The cumulative effect is enhanced agreement about the next clinical action to recommend for a given case.

Various instruments are now available that assess clusters of symptoms related to bipolar disorder. Some of these, such as the Achenbach System of Empirically Based Assessment, do not include a mania scale, but contain subscales measuring other symptom principal components, such as attention problems, aggressive behavior, and anxious/depressed symptoms, that bipolar disorder influences. Several other checklists were originally written for adults and then tested for use with adolescents, or adapted for parents to report about their child’s mood and behavior. A few were originally conceived and designed for use with pediatric samples. Though the development of new measures is progress for the field, it also complicates the instrument selection process. With few head-to-head comparison studies, it is difficult to compare the performance of these instruments. It is timely to do a meta-analysis to compare measure performance, and to identify conceptually and clinically meaningful moderators of measure performance.

Potential Moderators of Diagnostic Accuracy of Measures Used with Youths
Several design issues likely complicate interpretation of the literature on PBD assessment.

Differences in Informant
It is axiomatic that assessment of youths should involve multiple informants: parents and teachers observe the youth in different, important developmental contexts, and they have different implicit expectations for typical youth behavior. Youths have their own perspective on their lives, and have privileged access to their internal states; but they also show large developmental changes in verbal ability, metacognition, and their degree of psychological mindedness, all of which change the reliability and validity of their responses to rating scales. For all of these reasons, the correlation between caregiver, teacher, and youth ratings tends to be only moderate (e.g., r ~.2 to .3) across a broad range of psychopathology constructs. The moderate degree of agreement has a big impact on the definition of clinical “caseness”—a clinical elevation according to one informant will typically be linked with only modest elevations according to other observers. A practical consideration for both researchers and clinicians is deciding how to proceed when informants do not agree. Requiring unanimity among caregivers, teachers, and youths (e.g., using the AND rule) identifies the most impaired cases and sharply reduces false positive rates; but it also identifies only a quarter as many cases as would meet the definition of caseness according to any one of the three informants (e.g., using the Boolean OR rule). This was directly pertinent to the debates during the DSM-5 revision process about whether to require impairment in multiple settings as part of the criteria for establishing a manic episode.
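
The trade-off between the AND and OR rules can be sketched with a small simulation. This is an illustration under stated assumptions, not a reanalysis of any study: ratings from three informants are drawn as standard normal scores with a common pairwise correlation r (set to .25, within the range quoted above), and the cutoff of 1 SD for "clinical elevation" is hypothetical. The sketch is not intended to reproduce the specific proportions reported in the literature.

```python
import random


def simulate_caseness(n_youths=20000, r=0.25, cutoff=1.0, seed=0):
    """Count cases flagged under the AND rule and the OR rule.

    Each informant's rating is a shared factor plus unique noise, which
    gives every pair of informants a correlation of r.
    """
    rng = random.Random(seed)
    shared_weight = r ** 0.5
    unique_weight = (1.0 - r) ** 0.5
    and_cases = 0
    or_cases = 0
    for _ in range(n_youths):
        shared = rng.gauss(0.0, 1.0)
        ratings = [
            shared_weight * shared + unique_weight * rng.gauss(0.0, 1.0)
            for _ in range(3)  # caregiver, teacher, youth
        ]
        elevated = [rating > cutoff for rating in ratings]
        and_cases += all(elevated)  # every informant reports an elevation
        or_cases += any(elevated)   # at least one informant reports one
    return and_cases, or_cases


and_cases, or_cases = simulate_caseness()
print(f"AND rule: {and_cases} cases; OR rule: {or_cases} cases")
```

With only modest cross-informant correlations, requiring unanimity flags far fewer youths than accepting any single informant's elevation, which is the pattern the paragraph above describes.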

Informant issues are especially salient in the area of pediatric bipolar disorder. Clinicians often give more credence to youth report of internalizing problems, because youths have more direct access to their own subjective mood states. However, whenever studies have compared youth and caregiver report in the same sample, caregiver report has produced larger effect sizes for discriminating cases with bipolar disorder from other conditions. This might be due to bipolar disorder also creating substantial externalizing problems, which collateral informants often notice earlier and find more bothersome than the youth does. Mania and hypomania include other symptoms, such as pressured speech or flight of ideas, that others find worrisome sooner than the person experiencing them. Parents notice irritable mood at significantly lower levels of mania than the youth, who may instead notice symptoms of increased energy, hypersexuality, and decreased need for sleep sooner. There is evidence with both youths and adults that hypomania and mania compromise people’s insight into their behavior and how it is perceived by others, possibly further undermining the credibility and validity of youth report.

Beyond caregiver and youth report, some experts argue that teacher report is important for “corroborating” mania—that manic symptoms are more credible and likely to be more impairing when observed by multiple informants across multiple settings. Conversely, others assert that teacher report should not be included in the decision-making algorithm for diagnosing pediatric bipolar disorder, as it has less validity than caregiver and possibly than youth report and has failed to demonstrate incremental validity for the purpose of predicting diagnosis. Unfortunately, less work has evaluated teacher report (relative to caregiver and youth report) with regard to pediatric bipolar disorder. Several studies found that the Achenbach Teacher Report Form (TRF) is significantly elevated on multiple scales in the presence of pediatric bipolar disorder compared to ADHD or to healthy controls. However, the effect sizes tend to be smaller for teacher report than caregiver report, and the effect sizes shrink further when the comparison group is also treatment seeking instead of healthy controls. Although they have not yet been compared head-to-head in the same sample, teacher report on manic symptom scales also produced smaller effect sizes than caregiver report on the same instruments.

Scale Content
Another potential moderator is the item content of the scale. Widely used measures such as the Achenbach System of Empirically Based Assessment do not include a mania scale, and they often do not have items assessing symptoms that might be specific to mania, such as elated mood or grandiosity. The omissions reflect the time period when the item pool was written, predating consideration that bipolar disorder might manifest in childhood. Subsequent research using these scales found that youths with bipolar disorder showed elevations on multiple clinical syndrome scales. These scales are often elevated in the context of other diagnoses besides bipolar disorder, indicating that they are not specific to bipolar. Although there was initial enthusiasm for a “bipolar profile” consisting of elevations on multiple scales, subsequent research found that many cases showing the profile did not meet criteria for bipolar disorder. Other analyses found that the Externalizing score captured most of the diagnostic information from the ASEBA with regard to potential bipolar disorder, and there was no incremental value in adding the syndrome scores after looking first at Externalizing.

In contrast, other scales focus on symptoms of mania, either using the DSM symptoms as the basis of the items, or even expanding the item pool to include other clinical features that might be associated with hypomania or mania in addition to the canonized DSM symptoms. These scales are likely to be more diagnostically sensitive to bipolar disorder because they ask directly about the relevant symptoms. They also may be more diagnostically specific to bipolar disorder inasmuch as they also include distinctive symptoms.

Differences in Interview Strategy
In addition to the question of who completes the rating scale to describe the youth’s emotions and behavior, it also is vital to consider how we arrive at our diagnoses. Mental health lacks the equivalent of an autopsy or pathology report that can conclusively establish a diagnosis. In a field where a “gold standard” diagnosis is impossible, perhaps the best we can do is a “LEAD” standard: the Longitudinal, Expert evaluation of All Data, including history of development, prior treatment and response, family history of pathology, and integration of collateral informant perspectives as well as direct observation of behavior. Many research studies approximate the LEAD standard by combining a semi-structured interview with expert clinician review and sometimes unstructured interviewing to fill in gaps or probe alternate hypotheses. Semi-structured interviews are much less likely to be used in clinical practice because of their length, and because practitioners value autonomy. As we include fewer additional sources of information in the diagnostic process, we must place greater weight on the remaining ones.

The least common denominator in clinical diagnoses of children is an interview with the primary caregiver. The caregiver is most likely to initiate the referral for outpatient services, and young children are unlikely to have the patience, focus, or meta-cognition needed to complete many semi-structured interviews. If the interview is redesigned to be developmentally appropriate for young children, it is difficult to connect with adolescent and adult interviews or diagnostic nosologies. However, if the diagnostic formulation is based solely on the caregiver interview, then there is no source of potentially disconfirming information. Several factors can undermine the validity of caregiver report, including the caregiver’s own stress or psychopathology, seeking disability or educational accommodations (“secondary gains”), or complex interactions around issues with the juvenile justice system or child custody. Even if the youth does not complete a semi-structured interview, direct interaction and observation provide key data about mental status, the presence or absence of stereotypic behavior, and a variety of other factors that can change diagnoses.

The issue of interview informants has prompted much discussion within the field of pediatric bipolar disorder research. Although most research groups gravitated towards using some version of the Kiddie Schedule for Affective Disorders and Schizophrenia as the core semi-structured diagnostic interview, some groups relied primarily or solely on parent interviews when the youth was younger than 12 years. Others insisted on also interviewing the youth (Findling, Youngstrom, et al., 2005). When groups reported different rates of comorbid pervasive developmental disorders, anxiety disorders, or family histories of antisocial personality and other parental diagnoses (for reviews, see Kowatch, Youngstrom, Danielyan, & Findling, 2005), it became important to isolate the source of the differences. Beyond differences in the content and organization of mood items in the interviews used, differences in training, and differences in ascertainment and referral patterns, interviewing only the parent may inflate the association between caregiver-reported checklists and diagnoses, even when the diagnosis is blind to the checklist and based on semi-structured interview. Unlike factors reviewed above, interviewing only the parent likely affects both sensitivity and specificity by exaggerating the degree of separation between the distributions for those with versus without the diagnosis. Leaning heavily on the caregiver for both the criterion and the predictor will exaggerate the apparent effect size.

Study Design Features Especially Potent in Diagnostic Efficiency Studies
Experts have developed standardized guidelines for reporting and critically evaluating the design features of studies evaluating diagnostic tests (e.g., STARD) as well as general reports of empirical studies. Here we will focus on factors that (a) affect the severity of the target condition, thus altering the diagnostic sensitivity; and (b) affect the composition of the comparison group, thereby changing the diagnostic specificity.

Design factors changing the diagnostic sensitivity of a measure
The more severe the illness, the easier it is to distinguish from other conditions. Factors affecting the severity of the target condition include the stage of illness, the severity of the presentation, and the use of broad or narrow target definitions. For bipolar disorder, the variability in mood states further complicates the picture because the same illness may manifest with periods of euthymia, hypomania, mania, dysthymia, depression, or mixed mood presentations. Additionally, unlike most diagnoses, bipolar diagnoses persist even after the person recovers from an episode, technically being coded as “in remission.” Therefore, bipolar disorder is heterogeneous—spanning from high functioning people in remission all the way to severely disorganized behavior requiring psychiatric hospitalization. In practice, the severity of illness correlates with participants’ recruitment setting: inpatient samples have the highest average degree of mania, community samples the lowest average, and outpatient samples usually fall in between. All else being equal, mania will be easier than hypomania to tell apart from ADHD or depression. Diagnostic sensitivity of tests will vary as a direct function of the severity of the illness.

Similarly, the “broad” versus “narrow” definition of diagnosis plays a prominent role in pediatric bipolar disorder. The narrowest research operational definitions require the presence of elated mood and/or grandiosity, whereas irritable mood would be sufficient using DSM-IV and DSM-5 criteria (sometimes characterized as the “intermediate” phenotype). At the other extreme, some groups may have relaxed the requirement of distinct episodes of change in mood or energy, potentially stretching the “broad phenotype” to include cases that do not share core features of bipolar illness. Consequently, youth diagnosed using “broad” definitions of PBD are likely to be more difficult to distinguish from other cases than youth diagnosed using “narrow” criteria.

Another wrinkle within bipolar disorder comes from the diagnoses of cyclothymic disorder and bipolar Not Otherwise Specified (NOS – the term used in DSM-IV) or Other Specified Bipolar and Related Disorders (OS-BRD – the DSM-5 parlance; American Psychiatric Association, 2013). Cyclothymic disorder is rarely used in clinical practice in the USA yet is more common than bipolar I in epidemiological samples (Van Meter, Moreira, & Youngstrom, 2011) and is associated with a high degree of impairment in youths (Van Meter, Youngstrom, Demeter, & Findling, 2013; Van Meter, Youngstrom, Youngstrom, Feeny, & Findling, 2011). Similarly, bipolar NOS appears more common than bipolar I in outpatient youth samples, and is associated with a high degree of impairment (Findling, Youngstrom, et al., 2005). However, both cyclothymic disorder and bipolar NOS, by definition, have less severe manic symptoms than bipolar I or II and may be harder to identify using manic symptom checklists. Further complicating the matter, both cyclothymia and bipolar NOS progress to bipolar I or II at high rates during prospective follow-up, raising questions about whether these are prodromes or early stages of illness rather than distinct disorders (Van Meter, Youngstrom, & Findling, 2012). This creates tension between internal versus external validity: samples focusing on acutely manic bipolar I presentations will produce higher sensitivity estimates, but the results will generalize less well to applications where the goal is to identify bipolar spectrum disorder or earlier, milder stages of bipolar disorder.

All of these considerations change the diagnostic sensitivity of the test because they change the distribution of scores among the target group. Sensitivity is defined as the percentage of cases having the target diagnosis that also score above a designated threshold on the test of interest. The average score on the scale will be higher if the severity of presentation is more extreme. As the distribution shifts towards higher scores, a larger percentage of people will score above any given threshold, increasing the sensitivity of the test. For our purposes, studies with greater rates of bipolar I, more cases with current manic episodes, or drawing larger percentages from inpatient settings are all likely to have higher average scores on scales intended to detect mania. More subtly, samples including broader definitions of bipolar disorder, or enrolling people in varying states of illness, will tend to have more variation in scores. In addition to altering the sensitivity of the scale, the greater variance within the bipolar group also increases the overlap in score distribution with the comparison group, reducing the scale’s diagnostic accuracy.
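
The dependence of sensitivity on the score distribution in the target group can be sketched with a simple normal model. The example below is illustrative only: it assumes normally distributed scores expressed in comparison-group SD units, and the means, SDs, and cutoff are hypothetical. The second function uses the standard binormal formula for the area under the ROC curve to show how greater variance in the bipolar group increases overlap with the comparison group.

```python
from math import sqrt
from statistics import NormalDist


def sensitivity(target_mean: float, target_sd: float, cutoff: float) -> float:
    """Proportion of the target group scoring above the cutoff."""
    return 1.0 - NormalDist(target_mean, target_sd).cdf(cutoff)


def binormal_auc(mean_diff: float, target_sd: float, comparison_sd: float) -> float:
    """Probability that a random target case outscores a random comparison case."""
    return NormalDist().cdf(mean_diff / sqrt(target_sd**2 + comparison_sd**2))


cutoff = 1.0  # hypothetical threshold, in comparison-group SD units

# A more severe sample (higher mean) yields higher sensitivity at the same cutoff.
milder = sensitivity(target_mean=1.0, target_sd=1.0, cutoff=cutoff)
severe = sensitivity(target_mean=2.0, target_sd=1.0, cutoff=cutoff)
print(f"Sensitivity: milder sample {milder:.2f}, more severe sample {severe:.2f}")

# A more heterogeneous bipolar group (larger SD) overlaps more with the comparison group.
narrow = binormal_auc(mean_diff=1.05, target_sd=1.0, comparison_sd=1.0)
broad = binormal_auc(mean_diff=1.05, target_sd=1.5, comparison_sd=1.0)
print(f"AUC: homogeneous group {narrow:.2f}, heterogeneous group {broad:.2f}")
```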

Design factors changing the diagnostic specificity of a measure
The composition of the comparison group directly affects the diagnostic specificity of the measure. Anything that lowers the mean or decreases the variability of scale scores in the comparison group will increase the effect size and decrease the amount of overlap between the bipolar and nonbipolar score distributions. One common design element that would have this effect is the inclusion of healthy controls. Healthy controls, by definition, will have low scores on any symptom measure. Adding them to the sample will shift the mean lower. More subtly, because healthy controls tend to show a floor effect on clinical measures, they bunch together at the lower end of the scale and increase skew.
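
The effect of comparison group composition on the standardized effect size can be sketched by computing Hedges' g for the same bipolar group against two different comparison groups. All means, SDs, and sample sizes below are hypothetical and chosen only to illustrate the direction of the bias; the small-sample correction uses the common approximation 1 - 3/(4df - 1).

```python
from math import sqrt


def hedges_g(mean_1, sd_1, n_1, mean_0, sd_0, n_0):
    """Standardized mean difference with the usual small-sample correction."""
    df = n_1 + n_0 - 2
    pooled_sd = sqrt(((n_1 - 1) * sd_1**2 + (n_0 - 1) * sd_0**2) / df)
    d = (mean_1 - mean_0) / pooled_sd
    correction = 1.0 - 3.0 / (4.0 * df - 1.0)
    return d * correction


# Hypothetical checklist scores for a bipolar group.
bipolar = dict(mean_1=20.0, sd_1=8.0, n_1=50)

# Treatment-seeking comparison group: elevated mean, wide spread.
g_clinical = hedges_g(**bipolar, mean_0=12.0, sd_0=8.0, n_0=150)

# Healthy controls: low mean and compressed spread (floor effect).
g_healthy = hedges_g(**bipolar, mean_0=4.0, sd_0=4.0, n_0=150)

print(f"g versus clinical comparison: {g_clinical:.2f}")
print(f"g versus healthy controls:    {g_healthy:.2f}")
```

In this sketch the same bipolar scores produce a much larger g against healthy controls, because the comparison mean drops and the pooled SD shrinks at the same time.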

Another design element that can affect specificity is excluding cases with diagnoses that mimic aspects of bipolar disorder. Unipolar depression and bipolar depression look quite similar, for example. ADHD has multiple symptoms and features that overlap with symptoms of hypomania and mania, including high activity, distractibility, and impulsivity. Oppositional defiant disorder and conduct disorder also entail high degrees of irritable mood, aggressive behavior, and rule-breaking that can look like the mood or impulsive risky behavior of mania. Post-traumatic stress disorder and schizophrenia can produce symptoms that overlap with mania, shading into the more psychotic presentations. Symptom overlap raises the average score on scales where the content includes symptoms that multiple disorders “share.” Endorsing the symptoms due to other disorders raises the average score in the comparison group, increasing the percentage of “false positive” results and directly reducing the specificity of tests. Bias may be stronger for scales that mostly contain nonspecific items, such as irritability and distraction.

If the research design exaggerates diagnostic specificity, a weak test could appear better than a stronger one evaluated in a more generalizable sample. Subsequent applications of the test under more clinically realistic conditions would then produce systematically higher rates of false positives (the direct consequence of lower diagnostic specificity), creating upwardly biased posterior probabilities. Flawed research designs could reintroduce the same bias that rating scales were intended to fix.

Nested Effect Sizes and Technical Issues in Establishing Relative Superiority
Another issue is trying to determine whether some measures perform significantly better than others, after accounting for differences due to informant or research design. It is possible to test the difference between effect sizes from different samples, comparing the discrepancy to what would be expected under the null hypothesis of no difference beyond sampling error. More statistically powerful tests are possible when the effect sizes come from the same sample, and a few reports have already directly tested performance differences between measures in the same sample. These analyses control for illness severity, comparison group composition, interviewer training, and a host of other factors that could differ between studies. Unfortunately, there is no “master linking sample” that compares all of the contending measures against each other head to head.
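
The test of a difference between effect sizes from two different samples can be sketched as a simple z test. This is a minimal illustration that assumes the two samples are independent and uses the common large-sample approximation for the sampling variance of a standardized mean difference; the effect sizes and sample sizes are hypothetical. It does not apply to effect sizes nested within the same sample, where the covariance between estimates must also be modeled.

```python
from math import sqrt
from statistics import NormalDist


def g_variance(g: float, n_1: int, n_0: int) -> float:
    """Approximate large-sample sampling variance of a standardized mean difference."""
    return (n_1 + n_0) / (n_1 * n_0) + g**2 / (2.0 * (n_1 + n_0))


def compare_independent_effects(g_a, var_a, g_b, var_b):
    """Two-sided z test of the difference between two independent effect sizes."""
    z = (g_a - g_b) / sqrt(var_a + var_b)
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical effect sizes from two separate samples (50 cases, 150 comparisons each).
g_a, g_b = 1.2, 0.6
var_a = g_variance(g_a, n_1=50, n_0=150)
var_b = g_variance(g_b, n_1=50, n_0=150)

z, p = compare_independent_effects(g_a, var_a, g_b, var_b)
print(f"z = {z:.2f}, two-sided p = {p:.3f}")
```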

However, a mixed effects meta-analysis model can account for the fact that some measures may be confounded with design features in the available literature (e.g., if only one research group has published results with a particular measure, then it will be harder to tease apart characteristics of the measure from the set of design factors used by the group). Multivariate meta-analyses can disentangle the effects of design artifact from the differences between measures, allowing direct comparison of measure performance (Viechtbauer, 2010). Clinicians would want to know which measure to use for high stakes decisions, and researchers could enhance studies by switching to the more valid measure.

Research Questions/Hypotheses
We expected the average effect size to be large, reflecting a big standardized difference in mean scores for those with bipolar versus other conditions. However, we also expected the effect sizes to show significant heterogeneity, and we had specific hypotheses about moderating variables. Based on the few prior within-sample comparisons, we predicted that caregiver report would show larger average effect sizes than youth or teacher report. We also hypothesized that scale content would matter: measures that ask about symptoms more specific to mania should show larger overall effect sizes than measures that focus on externalizing symptoms in general, or that combine components originally designed to assess depression, attention problems, and aggressive behavior. A third hypothesis was that studies that only directly interviewed the caregiver would produce larger effect size estimates than those that included direct interview and observation of the youth as part of the criterion diagnosis, due to shared source variance. A fourth hypothesis was that the use of more “distilled” samples that identify more homogeneous and symptomatic cases of bipolar, exclude diagnoses that frequently are difficult to distinguish from bipolar, or that include healthy controls in the comparison group, would yield much larger effect sizes.

We hypothesized that all moderators would remain significant when entered together in the regression models. If “distilled design” was a significant moderator, we would give primacy to the “nondistilled” estimates of effect size as more clinically generalizable, and treat them as the main focus of discussion. Similarly, if interview strategy (including the youth versus relying only on the caregiver) moderated results, then the results based on integrated interviews would take precedence, as they would be less affected by shared source variance. We explored whether there were significant differences between scales after controlling for moderators, but anticipated that the variability between samples, combined with the number of multiple comparisons, would make those results tentative. Sensitivity analyses examined whether results changed substantively after controlling for quality of design (following the scheme used in Kowatch et al., 2005) or quality of reporting (using the recommended QUADAS-2 tool developed to operationalize the STARD Guidelines).

Inclusion and Exclusion Criteria
Studies were included if they reported (a) cases with a diagnosis of a bipolar spectrum disorder made via a structured or semi-structured interview, (b) a comparison group, (c) both groups completing the same checklists assessing manic, hypomanic, or externalizing symptoms, and (d) data for participants 18 years or younger. Cases could be drawn from clinical or community samples. Exclusion criteria were having fewer than 10 cases with bipolar diagnoses (per Kraemer, 1992; to provide reasonably stable estimates of diagnostic sensitivity) (examples of excluded cases are cited), not including a rating scale, not publishing results in English (note that we did not find any studies with usable effect sizes that had been published in other languages), having data only for the bipolar group with no comparison group, or reporting only clinical diagnoses based on chart review or unstructured interviews. We limited the search period to 1993 and later so that the DSM-IV criteria would be available and used. Functionally, no group comparison studies on the topic were published before then anyway, only case reports. There were no geographical or cultural restrictions. Studies with adult samples were included only if they reported sufficient information about the subset of cases 18 years and younger. This resulted in exclusion of several studies that were reviewed in Waugh et al. (exclusions here:   ). We excluded effect sizes and studies where the groups were defined not by diagnostic interviews but by proxy definitions of bipolarity based on rating scales, such as the CBCL proxy, elevated scores on a parent-reported mania scale, or “corroborated” mania reported by multiple informants on the same rating scale. These scenarios involve criterion contamination, where the scores on the measure contributed directly to the determination of the criterion “diagnosis” definition.
Per DSM-IV, bipolar spectrum diagnoses could include bipolar I, bipolar II, cyclothymic disorder, and bipolar Not Otherwise Specified (NOS). All studies included in the analysis reported that they used DSM-IV criteria, but the publications did not report effect sizes separately for the different bipolar diagnoses, so it was not possible to estimate effect sizes for each type of bipolar disorder.

Moderator Definitions
We created a coding manual in Microsoft Excel, where the variable names, definitions, value labels, and examples were in rows or comment boxes next to the coding area. In addition to publication year, country of data collection, clinical setting (epidemiological/general community, outpatient, acute tertiary setting), and variables necessary for coding study design and reporting quality (detailed below), we also coded several potential moderator variables.

Informant
For each effect size, we coded whether the informant completing the checklist or scale was the caregiver (including foster parents or custodial relatives, although in the vast majority of cases across all samples it was the biological mother), the teacher, or the youth. Analyses used dummy codes with caregiver as the reference category.

Type of Scale
For each effect size, we coded whether the scale contained symptoms specific to mania versus comprising items or subscales originally designed to measure other pathology. For example, the “bipolar profile” from the Achenbach System of Empirically Based Assessment (ASEBA) instruments consists of a combination of the Aggressive Behavior, Attention Problems, and Anxious/Depressed scales. There is no “mania” scale on the ASEBA; the manic items it contains are those that overlap with other disorders, and thus factor analyses assigned them to other subscales. The meta-analyses used a dummy code that defined nonspecific scales as the reference category (testing whether there was an advantage in using scales with more mania-specific content).

Interview Strategy
For each sample, we coded whether the criterion diagnoses derived from interviews solely with the primary caregiver versus also involving direct interview of the proband youth. One study also included interview with the teacher on an inpatient/residential unit as an additional source. We included this study in the “not relying solely on the caregiver” category.

Distilled Sample Design
This dichotomous variable coded whether the original study used a design likely to inflate the observed effect sizes. This was coded “yes” if the sample included healthy controls as part of the comparison group, lowering the mean score for the comparison group and also potentially lowering the standard deviation. It also was coded “yes” if the design excluded diagnoses likely to share symptoms similar to those characteristic of bipolar disorder, such as unipolar depression, ADHD, conduct disorder, or psychosis. Many studies of phenomenology of bipolar disorder relied on healthy controls or groups with ADHD but excluded comorbid mood disorder as comparison conditions.

Search Strategies
As recommended in PRISMA, we consulted with a social sciences reference librarian while designing and revising the search strategy. Reference and citation databases searched included PubMed, PsycINFO, SSCI, ERIC, and GoogleScholar. We piloted the search protocol and then implemented the revised version. Either PubMed or PsycINFO indexed all of the published reports that met inclusion criteria. The search string was: (Pediatric OR juvenile OR child* OR adolescen*) AND (“bipolar disorder” OR mani* OR cyclothymi*) AND [(Sensitivity AND Specificity) OR comparison] (PubMed search here). Review articles and chapters were checked for additional sources. The search generated 1342 hits in PsycINFO and 4094 hits in PubMed when it was updated on September 1, 2014. We pulled hits into a RefWorks database, where we could sort and annotate them to track disposition. Four relevance judges completed the search training (including review of guidelines and a session of orientation and consultation with a reference librarian about search optimization) and then conducted and reviewed the searches. A content expert (EAY) reviewed all ambiguous cases and instances of disagreement. After reviewing titles and abstracts, and after initial elimination of multiple publications using the same dataset, we retrieved 69 articles for detailed review and coding. We examined the reference lists in all studies that met inclusion criteria, and we scrutinized the bibliographies of recent reviews (Geller & DelBello, 2003) (***Put in 2020 chapters here!). The Mick et al. (2003) paper identified two additional samples meeting inclusion criteria, and a chapter in one edited volume provided sufficient information to add another sample and effect size. A review of papers found five datasets where the article captured by the search did not include sufficient information, but a second paper by the same group included the necessary information.
We did not locate any primary reports published in languages other than English, although some reports published in English language journals gathered data using translated versions of measures into Korean/Hangul, French and Dutch. The final dataset included 25 distinct reports reporting 27 samples (two reports published data on two samples). The initial PubMed search identified 12 of 25 usable sources (48% search sensitivity) and indirectly identified 3 more samples (60% search sensitivity, broadly defined); PsycINFO identified 11 sources directly (44% search sensitivity) and 5 more indirectly (64% search sensitivity, broadly defined). The low search sensitivity is partly an artifact of our decision to include studies that reported sufficient statistics even if they did not report diagnostic sensitivity and specificity in the article, as few research groups have used receiver operating characteristic analyses in this literature until recently. In three cases our group obtained access to the primary data and estimated effect sizes directly from the raw data. Figure 1 shows the flow diagram for the search process.

Coding Procedures
Coders were undergraduate psychology majors, doctoral students, and the senior investigator. Training included reading methodology papers (QUADAS; PRISMA; STARD), sample meta-analyses focused on pediatric bipolar disorder (Kowatch et al., 2005; Van Meter et al., 2011), orientation to diagnostic efficiency statistics, and then coding two articles and comparing scores to those of a content expert, resolving discrepancies and clarifying concepts. We double-coded all studies for effect sizes, moderator variables, and reporting quality. Some articles reported sufficient statistics to estimate the effect size several different ways, contributing to small discrepancies in effect size estimates if the coders used different methods. The content expert reviewed all discrepancies and assigned a final code after perusing the source material, using the method that made the fewest distributional assumptions to estimate the effect size. In three cases, the raw data were available and analyzed to provide more extensive information than provided in the primary publication, which often was a preliminary report.

Quality Ratings
We used two systems to code the quality of the study design and reporting. The first was based on a prior meta-analysis of pediatric bipolar disorder (Kowatch et al., 2005). It assigned points for adequate sample size (N>30), interviewing both caregiver and youth (versus one informant only), using a formal consensus process, following DSM criteria, including spectrum diagnoses (e.g., cyclothymic disorder, bipolar NOS), recording comorbid diagnoses, and systematically asking about lifetime episodes. Higher scores indicated more comprehensive assessment of the bipolar phenotype.

Rater Reliability
Inter-rater agreement was good (ICC for absolute agreement > .87 for demographics and moderator variables, > .95 for effect size metrics, and > .80 for quality ratings). The most likely source of disagreement was when raters selected different formulae for estimating effect sizes, or when one coder was aware of algebraic methods that could transform reported information into something that could be extracted and coded, and the other rater had coded the parameter as missing.

Statistical Methods
We used Hedges’ g, a standardized mean difference that corrects Cohen’s d for a slight upward bias in small samples, as our summary effect size. There are three advantages to using a standardized mean difference for the purposes of meta-analysis: (a) the studies reviewed more often reported Cohen’s d than AUC; (b) meta-analytic techniques are more highly developed for standardized mean differences than for combining AUCs; and (c) analysis of sensitivity and specificity creates technical challenges that are avoided by focusing on other metrics. We used standard formulae to convert sufficient statistics into g (see Lipsey & Wilson, 2001, for list and formulae). AUC converts directly to Cohen’s d, and then to g. If only sensitivity and specificity were reported, these could be converted to an AUC estimate, then to a d, and finally to a g. Sample means, standard deviations, and n for the bipolar and comparison groups also were sufficient for direct estimation of g. Study variance estimate calculations followed standard methods (Viechtbauer, 2010). All estimates used inverse variance weighting, and we report 95% CIs for the weighted effect sizes.
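The conversions described above can be illustrated with a minimal sketch. It is not the coding procedure used in this review: the function names are hypothetical, the variance formula is a common large-sample approximation, and the AUC and sensitivity/specificity conversions assume normally distributed scores with equal variances in both groups, which may differ from the specific formulae the coders applied.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, used for the z transformations


def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Return (g, var_g) from group means, SDs, and sample sizes.

    Group 1 is the bipolar group and group 2 the comparison group.
    """
    df = n1 + n2 - 2
    sd_pooled = sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    d = (m1 - m2) / sd_pooled              # Cohen's d
    j = 1 - 3 / (4 * df - 1)               # small-sample correction factor
    var_d = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    return j * d, j ** 2 * var_d


def auc_to_d(auc):
    """Convert an ROC AUC to d, assuming equal-variance normal scores."""
    return sqrt(2) * _STD_NORMAL.inv_cdf(auc)


def sens_spec_to_d(sens, spec):
    """Convert one sensitivity/specificity pair to d (same assumption)."""
    return _STD_NORMAL.inv_cdf(sens) + _STD_NORMAL.inv_cdf(spec)
```

For example, `hedges_g(70, 10, 20, 60, 10, 20)` starts from d = 1.0 and shrinks it slightly because the total sample is small; the resulting variance is what inverse variance weighting would use.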

Most studies reported multiple relevant effect sizes. The nesting of several effect sizes in the same sample could occur because of the use of multiple informants (e.g., caregiver, youth, or teacher report), comparison of multiple scales in the same study, or re-analysis of data to examine the influence of sampling design (distilled or not) on effect size estimates. When studies reported multiple scales from the same measure, analyses used a single estimate: Externalizing was the preferred CBCL scale because it has tied or outperformed “bipolar profiles” in multiple samples. We used brief versions instead of full-length versions when both were reported because they are more likely to be used in practice. The metafor package (Viechtbauer, 2010b) in R was the platform for all analyses, as it is one of the few meta-analysis programs that currently handles nested effect sizes within the same sample (Viechtbauer, 2010a), allowing us to test key moderator variables.

Analyses used mixed meta-regression models. We had several hypothesis-driven moderators of interest, but also wanted to preserve generalizability, so a mixed approach was best (Viechtbauer, 2010a). We examined each moderator separately, but also created a fully augmented model to test whether each moderator showed a unique incremental effect. Cochran’s Q tested homogeneity of effect sizes, along with graphical methods (e.g., forest plots; see Figure 2). Nonsignificant Q values indicate little heterogeneity beyond sampling error. We used a mixed model extension of Egger’s test for publication bias, although we expected publication bias to be low because diagnostic efficiency requires large effect sizes, making statistical significance a relatively low bar to exceed. We examined standardized residuals from the fitted models, instead of funnel plots, as a way of testing for influential outliers while accounting for the nested structure of the data (Viechtbauer, 2010a).
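To make the logic of Cochran’s Q concrete, the sketch below computes an inverse-variance weighted mean and Q for a handful of effect sizes. It is a simplified illustration only: it treats the effect sizes as independent, whereas the analyses here modeled nesting within samples using metafor, and the numbers in the usage note are made up.

```python
def weighted_mean_and_q(effects, variances):
    """Return (weighted mean, Cochran's Q, degrees of freedom).

    Each effect size is weighted by the inverse of its sampling variance.
    Q sums the weighted squared deviations from the weighted mean and is
    compared to a chi-square distribution with k - 1 degrees of freedom.
    """
    if len(effects) != len(variances) or len(effects) < 2:
        raise ValueError("need at least two effect sizes with matching variances")
    weights = [1.0 / v for v in variances]
    mean = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    q = sum(w * (y - mean) ** 2 for w, y in zip(weights, effects))
    return mean, q, len(effects) - 1
```

With the hypothetical inputs `weighted_mean_and_q([1.2, 0.7, 0.5], [0.04, 0.05, 0.10])`, the most precise estimate (variance .04) pulls the weighted mean toward 1.2, and Q is evaluated on 2 degrees of freedom.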

Similarly, the Meta-Analysis Reporting Standards (MARS) suggest estimating power when conducting meta-analyses. Power exceeded 99.9% to reject the null hypothesis of g = 0, because effect sizes need to be g > .5 to begin to provide diagnostically useful information, and preferably much larger. To account for nesting, we bracketed power estimates by using the number of independent samples (27) as the low end and the number of effect sizes (63) as the high end. Power to detect moderate heterogeneity (e.g., values of .67) was between .64 and .91 based on 27 independent samples and 63 disaggregated effects, respectively. Power was between .86 and .99 for large heterogeneity. We used outlier diagnostics to identify influential cases (Viechtbauer, 2010a), and we conducted robustness sensitivity analyses to examine their effects on parameter estimates.

Results
Figure 1 presents a flow diagram showing the search process. We identified 27 distinct samples from 25 reports published between 1995 and 2014, contributing 63 effect sizes. Of the effect sizes, 38 used caregiver report on a total of 10,232 youths between the ages of 5 and 18 years: 1719 with research interview diagnoses of bipolar disorder, 3150 healthy controls or youths from the general community, and 5363 with other disorders besides bipolar spectrum diagnoses. Youth report generated 14 effect sizes (based on 448 youths with bipolar diagnoses, 1028 healthy youths, and 1542 with other diagnoses), and teacher report had 11 effect sizes (based on 377 cases with bipolar diagnoses, 58 healthy youths, and 855 with other diagnoses). All youth and teacher effect sizes were nested within subsets of caregiver data with the exception of the Lewinsohn et al. (2003) chapter, which only included youth report. In terms of other candidate moderator variables, 14 of the 27 samples (52%) used distilled sample designs, and 31 of 63 effect sizes (49%) were based on scales with mania symptom content. Effect sizes came from samples in seven countries, most from the USA, but with each informant contributing effect sizes in at least two countries (see Table 1 for a summary of sample-level characteristics). Eight different checklists contributed effect sizes: the ASEBA contributed 25; the General Behavior Inventory (GBI) contributed nine; the Mood Disorder Questionnaire (MDQ) added eight; the Conners (1999) had seven; the questionnaire version of the Young Mania Rating Scale (YMRS) added six; the Child Mania Rating Scale (CMRS) provided four; and the Child and Adolescent Symptom Inventory (CASI) and Child Bipolar Questionnaire (CBQ) had two each. Table 2 reports the effect sizes, along with the moderator variables and other statistics at the level of the effect size.
Because some effect sizes used full-length scales and others used short forms, Table 2 also reports the number of items constituting the scale for each effect size. If effect sizes were based on different sample sizes, due to changes in informant or missing data, then we used the N for each effect size rather than a single estimate or weight for the whole sample. Figure 2 displays the forest plot for the raw effect sizes, sorted by informant and magnitude of effect, and using shading to show whether the sample used a distilled design.

Assessment of study quality
We used two a priori measures of study quality. The Kowatch system rated design features important for the investigation of pediatric bipolar disorder. All studies reported sufficient information to code all of the Kowatch criteria (except for one item from Hazell et al., 1998). Scaled as percent of maximum possible score, study quality ranged from 50% to 100%, with an average of 83%. The overall quality of the studies included was good in terms of using semi-structured interviews, implementing DSM criteria, capturing comorbid and confounding diagnoses, and other features that enhance confidence in the robustness of findings.

In terms of the quality of reporting results, scores ranged from 45% to 95%, with an average of 73% on the QUADAS-2. Published reports often omitted QUADAS-2 elements: only 12 samples clearly reported the time interval between the diagnostic interview and gathering the rating scales; only 20 clearly specified whether or not the diagnoses were blind to the rating scales; only 21 made clear whether the rating scale was interpreted without prior knowledge of the diagnosis; and only 9 of 25 studies reported all suggested elements. No study included a flow diagram.

Overall Summary of Effect Sizes
We used a multivariate meta-regression (rma.mv in metafor), modeling the nesting of the effect sizes in the 27 samples and treating both the within-sample and between-sample variance estimates as random effects. The overall estimate of effect size was g = 1.05. There was tremendous heterogeneity, Cochran’s Q(62 df) = 738.25, p < .00005. There were substantial variance components both for the within-sample nesting of effect sizes (level 1 in a hierarchical linear model framework), sigma2 = .13, and between samples (level 2), sigma2 = .20. This became the baseline model for exploration of moderators and covariates. Table 3 reports the variance estimates and Cochran’s Q for this and the subsequent augmented multivariate meta-regression models.

Informant: Caregiver versus Youth or Teacher Report
Our primary moderator of conceptual interest was the informant who completed the scale. Multivariate meta-regression used two dummy codes, comparing youth versus caregiver and teacher versus caregiver. The multivariate framework allowed simultaneous inclusion of all effect sizes and studies in the analysis, rather than running analyses separately by informant on different subsets (thus the number of studies and number of cases are consistent across all moderator analyses).

Informant type explained a significant amount of the heterogeneity, Q (2 df) = 53.84, p < .00005. Also, the within-study variance estimate dropped to sigma2 = .04 when including informant in the model (see Table 3). The parameter estimates indicated that caregiver report produced the largest effect size, g = 1.20, with youth report averaging 0.48 g units lower (b = -0.48) and teacher report averaging 0.65 g units lower (b = -0.65; all p < .00005).

Mania scale content
Scales with mania-specific content should reduce false positive response rates in other diagnostic groups, increasing the effect size. Multivariate meta-regression using dummy-coded mania scale content as the sole moderator did not produce significant improvement in fit, Q(1 df) = 1.98, p = .159. However, mania scale content made a significant incremental contribution after controlling for any of the other moderators (including in the fully augmented model, below). Because this was a hypothesized moderator, we retained it in subsequent analyses.

Parent-only diagnostic interview
Another candidate moderator was whether the diagnostic interview relied solely on the parent, without the interviewer also talking directly to the youth in question. This occurred in six of the 27 samples, all of which only reported effect sizes using caregiver-rated scales. Consistent with expectations about shared source variance inflating the predictor-criterion association, interviewing only the parent produced significantly higher g estimates, b = 0.62, Q (1 df) = 5.80, p = .016.

Distilled versus Clinically Representative Samples
The final moderator of interest focused on the impact of sampling design. A dummy code contrasting “distilled” versus clinically generalizable designs accounted for significant variance in the effect sizes. Entered as the sole moderator in a multivariate meta-regression, it earned a Q(1 df) of 6.64, p = .010, with distilled samples averaging g values .50 higher than the nondistilled, more generalizable samples.

Fully augmented model
A fully augmented model included all the moderators of interest simultaneously. This model accounted for substantial variance, Q (5 df) = 84.03, p < .00005. It also reduced the random effect variance components at both level 1 (within samples), sigma2 = .03 versus .13 for the model with no moderators, and level 2 (between samples), sigma2 = .12 versus .20 in the initial model (see Table 3). There still was significant remaining heterogeneity, Cochran’s Q (57 df) = 279.55, p < .00005. The profile likelihood plots indicated that the model provided accurate estimates, and the intra-class correlation between the estimated and true effects was 0.22.

Table 5 presents the regression weights and confidence intervals for the fully augmented model. The intercept was b = 0.72, p < .00005, meaning that the average effect size for caregiver report from a non-distilled, generalizable sample, using a measure that did not include specific manic symptom content, and interviewing both the youth and the caregiver, would have a g ~ .7. All moderators remained significant in the augmented model. Teacher report was associated with significantly lower effect sizes, b = -0.61, p < .00005, as was youth report, b = -0.46, p < .00005. Scales with specific mania item content generated moderately larger effect sizes, b = 0.28, p = .004. Relying only on the caregiver during the diagnostic interview significantly inflated effect sizes, b = 0.42, p = .002. As hypothesized, the use of distilled samples produced much larger effect sizes than generalizable samples, b = 0.54, p = .002. Figure 2 shows the distribution of effect sizes broken down by informant and also whether or not the sample used a distilled design, graphically illustrating the effects of the two most potent moderators.
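Because the fully augmented model is additive, the expected effect size for a given design can be sketched by summing the relevant regression weights. The helper below is a hypothetical illustration that uses only the point estimates quoted in the text; it ignores the confidence intervals and random effects, so it should be read as a rough guide rather than a reproduction of the fitted model.

```python
# Point estimates quoted in the text for the fully augmented model.
COEFFICIENTS = {
    "intercept": 0.72,                 # caregiver, nonspecific scale,
                                       # youth also interviewed, nondistilled
    "teacher": -0.61,
    "youth": -0.46,
    "mania_content": 0.28,
    "caregiver_only_interview": 0.42,
    "distilled": 0.54,
}


def predicted_g(informant="caregiver", mania_content=False,
                caregiver_only_interview=False, distilled=False):
    """Sum the regression weights that apply to one study configuration."""
    if informant not in ("caregiver", "teacher", "youth"):
        raise ValueError("informant must be caregiver, teacher, or youth")
    g = COEFFICIENTS["intercept"]
    if informant != "caregiver":       # caregiver is the reference category
        g += COEFFICIENTS[informant]
    if mania_content:
        g += COEFFICIENTS["mania_content"]
    if caregiver_only_interview:
        g += COEFFICIENTS["caregiver_only_interview"]
    if distilled:
        g += COEFFICIENTS["distilled"]
    return g
```

For instance, `predicted_g(mania_content=True)` adds 0.28 to the intercept of 0.72, giving an expected g of about 1.00 for caregiver report on a mania-specific scale in a generalizable sample. Combinations that did not occur in the literature (e.g., youth report with a caregiver-only interview) are extrapolations.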

Outlier analyses
Preliminary analyses identified one study as a highly influential outlier. Re-examining the article found that the authors had reported a standard error as if it were a standard deviation (with the T-score SD being less than 3, instead of close to 10). Correcting this and recalculating the effect size, the study no longer was an outlier. We corrected this before running the models reported above.

Standardized residuals flagged two studies as potential outliers in the multivariate analyses: Marchand et al. (2005) reported an effect size 1.04 g units larger than would be predicted based on the meta-regression model, z = 2.60, p < .01. Papachristou et al. (2013) reported an effect size 1.21 g units smaller than predicted based on the model, z = -3.35, p < .01. Marchand et al. (2005) used the YMRS, which has shown highly variable results across other samples. The article did not explicitly report whether the diagnoses were blind to the scale, and details about the diagnostic procedures also were sparse. Papachristou et al. used the CBCL. Re-running the model with those two studies excluded did not change the substantive pattern of findings; all moderators remained significant with similar coefficient sizes (results available upon request from the author). Controlling for year of publication, comorbidity, sample size, and a variety of other parameters that meta-analysis guidelines recommend testing also did not alter the significance or substantive pattern of results.

Publication Bias
All of the analyses described above checked for influential outliers, and examined the effects of omitting outliers on sensitivity analyses. We also used the random effects, mixed model extension of Egger’s regression test of publication bias (Viechtbauer, 2010a). There was no evidence of publication bias, p > .81 in the fully augmented model, and p > .50 in the model with all moderators also including ratings of reporting and design quality.

Because the MARS reporting guidelines ask for fail-safe N, we estimated it using two methods, separately for each informant to reduce the effects of nesting within sample. Table 4 reports the results. By every method, it appears highly unlikely that publication bias threatens conclusions about the validity of the effect sizes. In summary, all three informants produced statistically significant differences between the score distributions for bipolar versus comparison groups, with teacher report producing a medium effect size in Cohen’s (1988) rubric, youth report yielding a large effect size, and caregiver report an effect size more than 50% larger than what conventionally is considered “large.”

Other Sensitivity Analyses
Neither design quality (as operationalized by the Kowatch scoring) nor the reporting quality (operationalized as QUADAS-2 Total) moderated the observed effect sizes, either in isolation or after controlling for the other moderators (all p > .20); nor did including them change the significance of any of the other moderators. We also checked whether the number of scale items, the year of publication, the percentage of cases with ADHD in the sample, or whether the study had sponsorship from a pharmaceutical company had any association with the observed effect; none did, either alone or after controlling for the a priori hypothesized moderators.

Exploratory Comparison of Specific Measures
We ran an exploratory version of the multivariate regression model to see if the literature supported any generalizations about the relative performance of measures. We used the effect size based on the ASEBA Externalizing score as the comparator in a set of dummy codes that tested all scales contributing at least two independent effect sizes: the Child and Adolescent Symptom Inventory (CASI), the Child Bipolar Questionnaire (CBQ), the Child Mania Rating Scale (CMRS), the Conners (1999), the General Behavior Inventory (GBI), the Mood Disorder Questionnaire (MDQ), and the rating scale version of the Young Mania Rating Scale (YMRS). We chose the ASEBA Externalizing score as the comparator because (a) the ASEBA is the most studied scale in this review, (b) the ASEBA contributed the most effect sizes across levels of the informant and distilled moderator variables, (c) the Externalizing scale is easier for clinicians to use than other putative “bipolar” profiles, because it is a standard part of the ASEBA scoring algorithm, (d) Externalizing is not subject to concerns about overfitting to a particular sample, whereas putative “bipolar” profiles necessarily were developed post hoc on the initial sample, and (e) in all samples reporting effect sizes based on both Externalizing and multi-scale profiles, the effect size based on Externalizing was larger in all but two cases. The Externalizing score essentially is an “incumbent” measure that any challenger would need to defeat to supplant it in research or practice. We did not include the dummy code for whether or not the scale contained specific manic item content, because it would have been highly collinear with the scale dummy codes.

This model accounted for substantial variance, Q (11 df) = 111.17, p < .00005. It also reduced the random effect variance components both at level 1 – sigma2 = .02 versus .13 for the model with no moderators, as well as level 2 – sigma2 = .14 versus .20 in the initial model. There still was significant remaining heterogeneity, Cochran’s Q (51 df) = 253.85, p < .00005.

The model intercept was b = 0.72, 95% CI [.44, 1.00]; it was the estimated true effect size for the caregiver CBCL Externalizing score from a non-distilled, generalizable sample, including both the youth and caregiver in the diagnostic interview. All four moderators remained significant in the augmented model even after controlling for scales used. Adjusting for all variables, teacher report was associated with significantly lower effect sizes, b = -0.60, 95% CI [-.77, -.44], p < .00005, as was youth report when compared to caregiver report, b = -0.48, 95% CI [-.63, -.33], p < .00005. Three scales demonstrated significant differences compared to the ASEBA Externalizing score: the MDQ averaged g = 0.39 higher than the Externalizing effect size, p = .003. The CMRS averaged g = 0.33 higher after controlling for all variables, p = .024. The GBI averaged g = 0.31 higher, p = .002. See Table 6 for the full model.

Clinical Interpretability
To provide a more clinically meaningful description of results, we saved the predicted values from the meta-regression, and then converted the g into an estimated ROC AUC (using formula #4 from Hasselblad & Hedges, 1995). We also report the predicted sensitivity that could be expected for a threshold chosen to have specificity = .90 (Hasselblad & Hedges, 1995, formula #13), along with the corresponding diagnostic likelihood ratio for scoring above the specificity = .90 threshold. We report the upper and lower bounds based on the confidence intervals of the mixed model regressions. See Table 7. Figure 3 shows the estimated ROC curves for caregiver, teacher, and youth report for clinically generalizable (nondistilled) designs. The figure also includes three reference curves as benchmarks. The diagonal line represents chance performance (AUC = .50). If the base rate of bipolar were 10%, then just randomly diagnosing cases “betting the base rate” would get 10% of the bipolar cases correct (sensitivity = 10%), and 90% of the nonbipolar cases correct (specificity = 90%).
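The conversion from g to clinically interpretable statistics can be sketched as follows. This sketch assumes normally distributed scores with equal variances in the bipolar and comparison groups; it is a stand-in for, and may not match exactly, the Hasselblad and Hedges (1995) formulae cited above. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def g_to_auc(g):
    """Estimated ROC AUC for a standardized mean difference g."""
    return _STD_NORMAL.cdf(g / sqrt(2))


def sensitivity_at_specificity(g, specificity=0.90):
    """Sensitivity when the cut score is set to hit the given specificity.

    Comparison scores are treated as N(0, 1) and bipolar scores as N(g, 1).
    """
    cut_score = _STD_NORMAL.inv_cdf(specificity)
    return _STD_NORMAL.cdf(g - cut_score)


def positive_likelihood_ratio(g, specificity=0.90):
    """Diagnostic likelihood ratio for scoring above the cut score."""
    return sensitivity_at_specificity(g, specificity) / (1 - specificity)
```

When g = 0 the functions return the chance benchmarks described above: an AUC of .50, sensitivity of .10 at specificity of .90, and a likelihood ratio of 1.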

The shaded gray space marks the performance of clinical diagnosis as usual, with the accuracy of clinical diagnoses of bipolar disorder pegged at kappa ~ .1 based on the converging estimates from both meta-analysis as well as recently published data. Combining the kappa with the estimated base rate of the target disorder allows estimation of the “diagnosis receiver operating characteristic curve”. With a base rate of 10% -- consistent with reports from many outpatient settings – clinical diagnoses would deliver sensitivity of ~ .19, specificity of ~ .91, and an AUC of .55. The sensitivity almost doubles chance performance if clinicians were “betting the base rate,” but it still is far from adequate.

Because studies do not have an objective gold standard, and even semi-structured diagnostic interviews are not perfect, the reliability of the criterion diagnosis also creates a ceiling that limits test performance. If a “perfect” predictor of bipolar disorder existed, it would still appear to be “wrong” when it disagreed with the results of the (imperfect) semi-structured interview. Using kappa of .80, close to the nominal rate of inter-rater reliability reported in those studies that included reliability information, the diagnosis curve for semi-structured interviews would yield sensitivity of .82, and specificity of .98, with an AUC of .90.
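The sensitivity, specificity, and AUC values quoted for kappa = .1 and kappa = .80 can be reproduced with a short calculation. The sketch below assumes that diagnoses are assigned at the same rate as the true base rate (equal marginals in the 2 x 2 agreement table) and that the AUC for a single operating point is the average of sensitivity and specificity. Both assumptions are ours for illustration, and the function name is hypothetical.

```python
def diagnosis_operating_point(kappa: float, base_rate: float):
    """Return (sensitivity, specificity, auc) implied by kappa and a base rate.

    Assumption: the rate of positive diagnoses equals the true base rate.
    """
    p = base_rate
    # Agreement expected by chance when both marginals equal p.
    chance_agreement = p * p + (1 - p) * (1 - p)
    # Invert kappa = (observed - chance) / (1 - chance).
    observed_agreement = chance_agreement + kappa * (1 - chance_agreement)
    # With equal marginals, TN = 1 - 2p + TP, so observed = 1 - 2p + 2 * TP.
    true_pos = (observed_agreement - (1 - 2 * p)) / 2
    sensitivity = true_pos / p
    specificity = (1 - 2 * p + true_pos) / (1 - p)
    # Area under the two-segment ROC curve through this single point.
    auc = (sensitivity + specificity) / 2
    return sensitivity, specificity, auc


if __name__ == "__main__":
    for kappa in (0.10, 0.80):
        sens, spec, auc = diagnosis_operating_point(kappa, base_rate=0.10)
        print(f"kappa={kappa:.2f}: sens={sens:.2f}, spec={spec:.2f}, AUC={auc:.2f}")
```

With a 10% base rate this gives sensitivity .19, specificity .91, and AUC .55 for kappa = .1, and sensitivity .82, specificity .98, and AUC .90 for kappa = .80, matching the figures in the text.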

Table 1. Summary of sample-level characteristics of studies included in meta-analysis
Note. #Effect sizes estimated using raw data for these samples.

Table 2. Effect size level characteristics and moderators
Note. Hedges’ g is an effect size that adjusts Cohen’s d to correct for upward bias in small samples.

Table 3. Tests of homogeneity and estimates of random effects variances between effect sizes within study (level 1) and between samples (level 2) for multivariate meta-regression models using maximum likelihood estimation
Note. The augmented model including all moderators, but not the quality ratings, produced the best fit. Although “mania scale content” was not a significant moderator by itself, it became significant after controlling for other moderators. Because quality ratings did not improve model fit, subsequent analyses are based on the model with all other moderators, but not quality. Results controlling for quality did not change substantively. * p<.05, ** p<.005, *** p<.0005, **** p<.00005, two-tailed.

Table 4. File drawer estimates for disaggregated effect size estimates grouped by informant
Note. The Rosenberg (2005) method estimates the number of unpublished studies with null findings that would be necessary to reduce the average effect size to non-significance, i.e., no longer able to reject the null hypothesis that overall g = 0 at p < .05. The Orwin (1983) method estimates the number of unpublished studies with null results needed to reduce the average effect size to an a priori target magnitude. We selected g = .20 as the target because it is generally considered a “small” effect size and would produce negligible performance in diagnostic applications (corresponding AUC = .56, “poor”).

Table 5. Multivariate meta-regression estimates of the effects of moderators entered together in the model

Note. * p<.05, ** p<.005, *** p<.0005, two-tailed.

Discussion
The goal of the present study was to meta-analyze the effect sizes for scales used to distinguish youth with pediatric bipolar disorder from other youth. The topic also provided an opportunity to investigate potential moderators of broad conceptual interest. These include informant effects, such as seeing if there were significant differences between youth or teacher report as compared to the responses of the primary caregiver for the youth. We also investigated design features such as whether or not the scale included specific symptoms of mania (Geller et al., 1998), whether the diagnostic interview only included the parent versus directly interviewing parent and youth, and also the effects of sampling design on the observed effect sizes. These design issues also apply more broadly to investigations of psychological assessment and diagnostic efficiency in general, not only in the context of pediatric bipolar disorder.

Informant Effects
Caregiver report produced substantially larger effect sizes than teacher or youth report, whether the model included no other variables or all hypothesized moderators, and the pattern held across all sensitivity analyses. The gap between caregiver and youth report was g = -0.46 after controlling for all other moderators. The gap between caregiver and teacher report was slightly larger: g = -0.61 after controlling for all other moderators. The confidence intervals for youth and teacher report overlapped to a large extent, indicating that they performed similarly. The finding that caregiver report yielded larger effect sizes than youth or teacher report is consistent with the few prior studies that directly tested differences between the accuracy of informants. Among all the samples that contained effect sizes based on both caregiver and youth report, the caregiver effect size was larger in every instance except for the sample from South Korea. Similarly, in every sample providing both caregiver and teacher rating effect sizes, the caregiver effect size was larger.

The high validity for caregiver report is reassuring in many ways. Caregivers are more likely to initiate seeking services for the child, rather than the child or adolescent self-referring, and interviewers tend to perceive caregivers as a more credible source of information about the youth’s functioning, especially for younger children. Caregivers also are a primary source of information about developmental history, family mental health history, and other factors crucial to the diagnostic and evaluation process. The greater validity for caregiver report also makes sense; caregivers are likely to have better reading ability and be more psychologically minded than young children or adolescents. Plus, caregivers notice symptoms such as irritable mood at lower levels of mania, whereas the mania needs to be markedly more severe before youths endorse the same symptom. Many other symptoms of hypomania and mania also are likely to distress other people before they bother the person expressing the symptom.

Youth report also consistently produced statistically significant differences between bipolar and non-bipolar comparison groups, but the effect size would conventionally be considered “moderate,” and it translates into modest performance in terms of diagnostic efficiency. Some general reasons that youth report might show lower validity include factors undermining the reliability of youth report, such as poorer reading ability or lack of motivation to complete scales thoughtfully. The content of items might be less developmentally appropriate, although that is unlikely to be a factor for instruments designed specifically for use with children and adolescents, such as the Achenbach System for Empirically Based Assessment or newer mania scales such as the CBQ and CMRS. In the context of evaluating potential bipolar disorder, youth report may show diminished validity to the extent that loss of insight into one’s behavior is a feature of hypomania and mania, as has been observed with adults with bipolar disorder or psychosis. A practical consideration is that reading level precludes using most of these scales with youths younger than age 11 years, and normative data for instruments such as the ASEBA and Conners do not extend below age 11, either.

Teacher report produced effect sizes substantially smaller than caregiver report, and somewhat lower than youth report. Though teacher report still produced statistically significant differences between the bipolar and comparison group, the effect size was small to medium. Translated into an area under the curve, teacher report produced AUC values of ~.60 for mania scales in nondistilled samples, often considered the “poor” range for clinical utility. Some of the symptoms more specific to mania, such as a decreased need for sleep, are difficult to observe in a classroom setting. Many behaviors readily seen in the classroom are also symptoms that overlap with ADHD, reducing the diagnostic specificity of teacher report.

Overall, the differences in performance across informants are consistent with the patterns that have emerged in the few head-to-head comparisons within the same sample. The correlation between caregivers, youths, and teachers about the youth’s behavior problems tends to be modest, and some studies have found that the degree of problems reported by teachers or youths is actually significantly higher in cases with bipolar disorder than would typically be predicted based on caregiver report. However, the general level of problems endorsed by teachers or youths is much lower than by caregivers across almost all of the samples (with the exception of the data from South Korea). It is possible that a combination of selection bias and regression artifacts is contributing to the results: because most of the effect sizes come from clinical outpatient samples, the caregiver concerns were the driving factor for the bulk of referrals. Effectively, outpatient samples select participants on the basis of high levels of caregiver-reported problems. Because caregiver and youth or teacher report usually correlate less than r ~ .3, regression to the mean dictates that the average youth or teacher scores will be much lower. The attenuation will shrink the scores for the bipolar group more, inasmuch as the caregiver reported higher concerns for them, thus reducing the effect sizes comparing the bipolar to other groups.
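The regression artifact described above can be made concrete with a small calculation. The sketch assumes a simple linear regression in population z-score units, in which the expected youth or teacher score equals the cross-informant correlation r times the caregiver score, with the same regression line applying to both groups. The group means are hypothetical values chosen for illustration and are not estimates from the meta-analysis; the result is a gap in z units, not a Hedges' g.

```python
def expected_second_informant_z(caregiver_z: float, r: float) -> float:
    """Expected z score from a second informant, given the caregiver z score.

    Assumption: standardized linear regression with cross-informant correlation r.
    """
    return r * caregiver_z


def expected_gap(bipolar_caregiver_z: float, comparison_caregiver_z: float, r: float) -> float:
    """Expected between-group gap on the second informant's scale (z units)."""
    return (expected_second_informant_z(bipolar_caregiver_z, r)
            - expected_second_informant_z(comparison_caregiver_z, r))


if __name__ == "__main__":
    # Hypothetical caregiver-reported means in a referred sample.
    bipolar_z, comparison_z, r = 2.0, 1.0, 0.3
    print("caregiver gap:", bipolar_z - comparison_z)
    print("expected youth/teacher gap:", expected_gap(bipolar_z, comparison_z, r))
```

In this toy example a one-SD gap on caregiver report shrinks to an expected gap of about 0.3 SD on the second informant, and the group with the higher caregiver scores is pulled down further in absolute terms. As the text notes, some studies find youth or teacher scores in bipolar cases higher than this baseline would predict, so the sketch is a lower-bound illustration of the artifact alone.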

It also is possible that youth behavior changes between school and home settings. Some argue that mania should be pervasive and observable across multiple settings. Certainly severely disorganized behavior would be easy to note, but other symptoms, such as irritability, difficulty concentrating, or energy changes may not be remarkable in a classroom setting. Additionally, there are likely to be circadian fluctuations in mood and energy. To the extent that bipolar disorder is associated with an evening chronotype or delay of sleep phase, the attendant mood changes are likely to be most pronounced outside of school hours. The emotional significance of relationships also may contribute to differences in behavior. If rejection sensitivity and a sense of intimacy are salient issues for youths with bipolar disorder, then the home life is more emotionally significant than school, at least until peers become more important. A third process that could lead to more conflict at home is the testing of developmental boundaries and the youth’s push for greater control and autonomy. At present, there are no published studies using direct observation of interactions at home involving youths with bipolar disorder. Actimetry and other objective methods for measuring youth behavior could be informative about the extent to which behavior changes between settings or dyads.

The consistent finding that caregiver report produces larger effect sizes than youth or teacher report indicates that the revisions to the psychiatric nosology were justified in not requiring elevated teacher report of manic symptoms to confirm a diagnosis of pediatric hypomania or mania. The validity of parent-reported mood symptoms also is supported by sensitivity to treatment effects in double-blinded clinical trials (e.g., Findling et al., 2007; Findling, McNamara, et al., 2005), as well as by brain imaging studies, where parent report produces larger correlations with patterns of activation in the youth’s brain than diagnostic categories or youth ratings do.

Mania Scale Content
Results also supported our hypothesis that scales including symptoms more specific to mania would show larger effect sizes. The ASEBA and Conners were written at a time when mania was considered an “adult only” phenomenon, and so many relevant items were not included in the pool. Both include manic symptoms, but they are ones that are not diagnostically specific and thus often attributable to other conditions, as indicated by the factor structures in the measures’ normative samples. More recent versions of some broad checklists, such as the BASC and the CASI, have added a mania scale to at least some informant versions. Thus they may perform more similarly to instruments such as the MDQ, CMRS, and CASI that inquire directly about DSM symptoms, or the GBI and CBQ, which also cover associated features beyond the core DSM symptoms.

The source articles did not report sufficient statistics for us to examine directly whether the inclusion of mania symptoms improved the specificity more than the sensitivity of the scales. However, the greater effect on specificity is plausible because the more sensitive symptoms of bipolar disorder tend to be nonspecific features such as irritable mood and difficulty concentrating. These symptoms are well represented on both general, broad-coverage instruments as well as mania-oriented scales. In contrast, more specific symptoms such as decreased need for sleep and elated mood tend to only be included on scales purpose-built to investigate manic symptoms.

With the exception of the CASI, the commercially distributed broad-coverage instruments do not have a mania scale containing diagnostically specific symptoms, or there are not yet published data indicating the diagnostic performance of any new scale that they have added in the most recent revisions. The CBCL Externalizing score, or the putative bipolar profiles of subscales, show good sensitivity but poorer specificity. This positions the broad measures to be good at ruling out cases of bipolar disorder, but makes them prone to high false positive rates if used alone to screen for bipolar. Positive results on tests with moderate specificity are ambiguous because there is a high false alarm rate (the other side of the low specificity coin). Combined with the base rate of bipolar being low in most settings, the result is low positive predictive values. Put bluntly, most cases “testing positive” for bipolar disorder on measures that do not focus on diagnostically specific symptoms will not actually have bipolar disorder unless the setting is an inpatient unit or similar venue where the base rate of bipolar is high.
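The effect of base rate on positive predictive value follows directly from Bayes' theorem. The sketch below uses hypothetical sensitivity and specificity values (they are not estimates for any scale in the meta-analysis) to show how the same test yields very different predictive values in a low base rate setting versus a high base rate setting.

```python
def positive_predictive_value(sensitivity: float, specificity: float, base_rate: float) -> float:
    """Probability that a case scoring positive actually has the disorder."""
    true_pos = sensitivity * base_rate
    false_pos = (1.0 - specificity) * (1.0 - base_rate)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical screening profile: good sensitivity, moderate specificity.
    sens, spec = 0.80, 0.70
    for base_rate in (0.05, 0.10, 0.50):
        ppv = positive_predictive_value(sens, spec, base_rate)
        print(f"base rate {base_rate:.2f}: PPV = {ppv:.2f}")
```

With these illustrative numbers, fewer than one in four positive results is a true case at a 10% base rate, whereas most positive results are true cases when the base rate is 50%.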

Interview Strategy: The Importance of Laying Eyes on the Child
Relying solely on the primary caregiver during the semi-structured interview also changed the average effect size significantly. Six of the 27 samples relied solely on the caregiver for the diagnostic interview. Consistent with concerns about shared source variance, relying only on one person’s perspective inflated the association between the caregiver rating and the diagnosis, increasing the observed effect size by g ~.6. The six samples only reported effect sizes for caregiver-reported scales, where the influence of shared source variance would be largest. It is interesting to note that these six studies ran the gamut in terms of youth ages, with the average age extending from 8.7 years to 16.0 years, so results were not driven by youth age. Findings reinforce the value of directly interviewing the youth, even when they may be too young to complete a full semi-structured diagnostic interview. In young children, direct observation during the interview may provide an opportunity to observe behaviors that might indicate the presence of a pervasive developmental disorder or other alternate explanations for child behavior. The credibility and sophistication of youth perspective increases steadily with age and verbal ability, making the older youth’s input more valuable. Integrating multiple sources of information is likely to combat factors that would undermine the validity of any single informant, such as demoralization, malingering, or attempting to minimize problems. Inasmuch as diagnoses that synthesize multiple information sources are likely to be more valid, estimates of diagnostic accuracy that are pegged to such diagnoses are likely also to have better validity, even if the size of the coefficient itself appears more humble.

Distilled Sample Enrollment
The final moderator variable of interest was the design of the sample. Whereas diagnostic sensitivity and specificity were once thought to be intrinsic properties of a measure, methodologists now realize that these parameters can change markedly as a result of design features. The present findings show that this is not an abstract concern. Differences in sampling design changed the observed effect sizes by g ~ .56, even after controlling for other moderators. This ~.5 bias is consistent with the findings of prior work that compared the effects of distilled versus more generalizable sampling inclusion criteria in the same data set. Comparing the effect sizes observed in distilled versus nondistilled designs for the eight scales examined in that study (reported in Table 5 of Youngstrom et al., 2006) yields an average upward bias of g = +.43. The PBD literature includes many studies focused on phenomenology and careful descriptive validation of the syndrome. These studies often included healthy control comparison groups, and they also often had stringent exclusion criteria that reduced the amount and types of clinical heterogeneity observed. Although these designs were internally valid for their intended purpose, they are less generalizable to typical clinical practice. The results of the meta-analysis underscore that the reduced generalizability produces systematic bias that exaggerates the diagnostic efficiency of scales. Distilled samples create a rising tide that lifts all boats, inflating the effect sizes of all scales and potentially boosting less valid tests even more than others. This is especially true for scales with nonspecific items, such as externalizing or attention problems, which also would be elevated in groups with other diagnoses but not in healthy controls.

The positive bias is pernicious for two reasons. First, it obscures differences between scales. The boost from a distilled design is larger than the difference between most of the valid measures. Design effects swamp the relative differences between scales, potentially making weak scales look better than a more valid scale tested under more generalizable conditions. Second, the exaggerated effect sizes in a distilled design will bias the clinical interpretation of the scales. Inflated sensitivity and specificity estimates lead to more extreme diagnostic likelihood ratios, and more extreme predictive values. If clinicians rely on a simplistic “positive test result” interpretation, the false positive rate will be higher than it appears. Bad designs, in terms of validity for studying diagnostic efficiency, will contribute to overdiagnosis of bipolar disorder. More rigorous and generalizable designs produce more humble estimates of effect size, but these are essentially “pre-shrunk” to fit their application across a broader range of clinical settings.

The clinical “take home” messages from this moderator analysis reinforce the dicta from Evidence Based Medicine to consider both the validity of the study design, and also whether the participants look like the patients with whom the clinician is working. As receiver operating characteristic analyses become more popular, test consumers should cultivate healthy skepticism: if a test claims to produce excellent results at a challenging task, check carefully to see if design flaws are swelling the effect size. It also behooves researchers to consider the potential biasing effects of design features ahead of time before conducting secondary analyses of data originally gathered for a different purpose than studying diagnostic efficiency. The STARD Guidelines and QUADAS-2 checklist were intended in part to provide a convenient checklist for this sort of rapid evaluation of reports.

Exploratory Comparison of Different Scales
We also ran exploratory analyses comparing the performance of measures, adjusting for the moderator variables, and accounting for correlated effects due to nesting within sample. These analyses should be considered tentative, as the available literature has gaps in coverage. Furthermore, the mixed model meta-regression would have less statistical power for direct comparison of measures than would be possible by applying best methods directly to the raw data. However, even with these caveats, the analyses indicated that the GBI, MDQ, and CMRS all performed significantly better than the ASEBA, reflecting that their item sets include the symptoms more specific to bipolar disorder. Reassuringly, the results of the meta-analysis align with the results of the head-to-head comparisons that have been done using raw data in the past. The CASI is likely to also perform better than the ASEBA or other nonspecific measures, based on item content and observed effect size, but the confidence interval around the observed effect is wide because the analysis included only two effect sizes for the CASI.

Reporting Quality
Assessed against the criteria developed by Kowatch et al. (2005), the design of the studies tended to be good. The samples included large numbers of cases, used semi-structured diagnostic interviews, and routinely assessed common comorbidities. The quality of reporting of results was less strong. The bulk of the publications predated the dissemination of the STARD and other reporting guidelines, so it is not surprising that there were some omissions. Key places for improvement include adding flow diagrams, documenting the length of time between checklists and diagnoses, and clarifying whether all participants were included in the analyses. The bad news is that there was no significant trend for the reporting quality to improve in more recent studies, though hopefully that will change. The good news is that reporting quality was not related to the effect sizes, nor did any of the tests of moderators change when adjusting for quality. This does not mean that reporting quality is unimportant, merely that variations in quality did not add significant bias to the observed effect sizes.

Consideration of Alternative Explanations
A standard concern with meta-analysis is whether the published literature diverges from results that were never published: the “file drawer problem.” The literature reviewed here is less prone to publication bias, for two reasons. First, many of the studies included in the review were descriptive, phenomenological studies or epidemiological studies. For these papers, statistical significance or rejection of a null hypothesis was not a major consideration for “publishability,” making it unlikely that non-significant results would be censored from the published literature. The second reason is that measures used for screening and diagnosis require large effect sizes in order to achieve respectable accuracy rates for classification. Consequently, statistical significance is an easy bar to surpass, even with relatively small sample sizes. Consistent with these scenarios, Egger’s test found no evidence of publication bias in the multivariate models. We also ran file drawer analyses separately for caregiver, youth, and teacher report. This is conservative, because it split the overall sample into pieces with fewer effect sizes and constituent cases. Teacher report offered the worst-case scenario, as it included the fewest effect sizes, the fewest participants, and the smallest average effect size. Even for it, the file drawer would need to be loaded with 10 null studies before the mean effect size for teacher report would decrease to a “small” effect size (g ~ .2), and 70 null studies before teacher report would no longer be significant at p < .05. For caregiver report, the number of unpublished studies would need to be more than 19,087 to push the p value greater than .05. Publication bias does not seem to be a major threat here.
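For readers unfamiliar with the Orwin (1983) approach, the sketch below shows the arithmetic in its simplest, unweighted form: the number of missing studies, each with an assumed mean effect (zero here), needed to pull the average of k observed effects down to a target value. The inputs are hypothetical and are not the values from Table 4; the analyses reported here may also have used weighting that this sketch omits.

```python
def orwin_failsafe_n(k: int, mean_effect: float, target_effect: float,
                     missing_effect: float = 0.0) -> float:
    """Unweighted Orwin fail-safe N.

    k              -- number of observed effect sizes
    mean_effect    -- unweighted mean of the observed effects
    target_effect  -- the a priori "trivial" effect size (e.g., g = .20)
    missing_effect -- assumed mean effect of the unpublished studies
    """
    if target_effect <= missing_effect:
        raise ValueError("target_effect must exceed missing_effect")
    return k * (mean_effect - target_effect) / (target_effect - missing_effect)


if __name__ == "__main__":
    # Hypothetical example: 8 effects averaging g = .50, target g = .20.
    print(orwin_failsafe_n(k=8, mean_effect=0.50, target_effect=0.20))
```

In this hypothetical case about 12 null studies would be needed to reduce the mean effect to g = .20.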

A conceptually more interesting issue is potential circularity between the information source for the scale and the criterion diagnosis. This is not the same thing as “criterion contamination,” where the diagnostician would have access to the scale scores when formulating the diagnostic impression. The older concept of shared method variance is closer to the mark. If the same informant fills out two rating scales about different constructs, the resulting scores will correlate with each other due to shared method variance. For the studies reviewed in the meta-analysis, one of three different informants completed the rating scales, and a subset of the same informants contributed to the diagnostic interview. In the most extreme scenario, the caregiver might be the only person interviewed about the youth’s diagnosis, and she or he also completed the rating scale. On one hand, the interview still involves additional information beyond that gleaned from a scale—there are opportunities for probing, follow-up questions, interpretation of nonverbal cues, and clinical judgment. On the other hand, caregiver reported scales might have an unfair advantage if the caregiver’s perspective heavily influences the diagnostic interviews. The apparent advantage for caregiver report in terms of larger effect size could be due in part to the shared source variance contributing to both predictor and criterion diagnosis.

This concern is allayed somewhat by the fact that caregiver report showed large effects across all studies, regardless of whether the other informants participated in the diagnostic interview. Parent report’s validity also is supported by sensitivity to treatment effects in double-blinded clinical trials (e.g., Findling et al., 2007; Findling, McNamara, et al., 2005): the double-blind design controls for expectancy effects, as they would contribute to perceived improvement in the placebo arm. Furthermore, in brain imaging studies parent report produces larger correlations with patterns of activation in the youth’s brain than diagnostic categories or youth ratings do. It is also noteworthy that the source variance artifact would inflate agreement with caregiver report across all diagnoses. However, there are published examples where youth report shows the same validity estimates as caregiver report for anxiety diagnoses (Van Meter et al., 2014) or significantly higher validity for reports of posttraumatic stress disorder. These were secondary analyses of some of the same data used in the meta-analysis here, in which caregiver report showed larger effects for predicting bipolar diagnoses. This indicates that the validity of the caregiver report is higher for bipolar disorder than for some other diagnoses, showing a degree of specificity that argues against a general methodological artifact. Converging evidence also suggests that manic symptoms are associated with a loss of insight that undermines self report, and many symptoms are likely to be noted earlier by collateral informants, although the weaker performance of teacher report still needs to be reconciled with the higher validity of caregiver report in this regard. Situational specificity in behavior, and dyadic patterns of interaction, remain key topics for further exploration.

Generalizability of Conclusions
The literature on the assessment of pediatric bipolar disorder has grown rapidly. At this point it spans a range of measures, multiple research groups, seven different countries drawn from four continents, and five languages of administration. Samples came mostly from outpatient settings, with some high risk offspring samples and an epidemiological sample also contributing. Within the limited range represented in the meta-analysis, clinical setting appears less important than diagnostic inclusion and exclusion criteria.

The data from Korea stand as the notable exception to this overall pattern. It is a single sample, but it is intriguing that the Korean study used two of the most valid rating scales (the MDQ and the GBI) and found that youth report exceeded parent report on both. Cultural factors, including high levels of stigma against mental health concerns, tremendous parental emphasis on educational excellence, perfectionism, and differences in familial patterns of communication all deserve exploration. Rapid economic and cultural change in Korea also may create age cohort effects, where the adolescents may have more Westernized attitudes towards mood and mental health, and greater awareness and acceptance of mental health issues. These hypotheses would best be addressed by a combination of qualitative studies and replications in other cultures with a high degree of Confucian values, such as Taiwan or Hong Kong, as well as countries with rapidly developing economies but different cultural traditions, such as Chile or India. Present results indicate a good degree of generalizability across groups within the United States and other Westernized countries, with a big asterisk qualifying the validity of parent report in Asian cultures.

Limitations
No meta-analysis can be stronger than the literature upon which it is based. Because many of the primary articles predated the publication of recent reporting guidelines, it is not surprising that they did not include many of the suggested elements, such as flow diagrams. The quality of design (aside from distilled samples) and the quality of reporting did not significantly moderate the results of these meta-analyses. Still, it will be helpful for the field to make a conscious effort to improve clarity of reporting about key design features. Several other factors potentially influencing the distribution of scores in cases with bipolar disorder were not reported frequently enough to be included in the meta-analysis. These included the rates of bipolar I, bipolar II, cyclothymic, and NOS cases, and the current mood state of cases. As noted in the introduction, cases with more severe current presentation will on average have higher scores on the scales, making the diagnostic sensitivity higher. Conversely, the published reports also did not include enough details about the diagnostic composition of the comparison group to support investigation of effects on diagnostic specificity. These types of questions could be fruitful topics for a “mega-analysis” pooling raw data from multiple samples.

Although all three of our hypothesized moderators produced significant effects, there still remained significant heterogeneity in effect sizes both within and between samples. We tested standard candidate moderators, such as publication year, and did not identify other significant moderators. However, the heterogeneity suggests that other factors exist that were not coded or reported in the literature. Rates of comorbidity or inclusion of different cognate diagnoses (such as ADHD, depression, or conduct disorder) are likely contenders for explaining some of this variance. Differences in rater training also may be a source of variation in the definition of the criterion measure. Semi-structured interviews give greater latitude to clinical judgment, making the differences due to training potentially sizeable. Conversely, more structured approaches to interviewing may increase the consistency, falling along a continuum to a place where all that differs is whether the items are read by the participant or the interviewer when no alternative phrasing or clinical judgment is allowed. Taken to that extreme, the increased associations could be a form of pseudo-validity if produced by shared source variance rather than greater content validity.

Implications for Theory, Policy, and Practice
The results of the meta-analysis confirm several important theoretical points. The first is that caregiver report produces larger effects for assessing bipolar spectrum disorder than self-report or teacher report. Research studies investigating the correlates of youth mood symptoms, such as genetic or imaging studies, would do well to include caregiver-reported measures of mood symptoms.

At a policy level, these findings support the DSM-5 decision not to require impairment in multiple settings or across multiple informants for diagnosing bipolar disorder in youths. Practice guidelines should (a) emphasize gathering caregiver report when possible to clarify assessment questions around potential bipolar disorder, and (b) not require elevated scores from multiple raters on manic symptoms, recognizing that cross-informant agreement tends to be low in general. Evaluation pathways for mood disorder also should incorporate caregiver-reported measures. The fact that several free and brief measures produced some of the best effect sizes in the meta-analysis improves the feasibility of adoption.

With regard to clinical practice, several measures are now well established as tools for evaluating bipolar symptoms in youth. In the treatment literature, having at least two published reports conducted by independent research groups offers protection from allegiance effects as well as quirks of an individual study. Using a similar criterion of two independent replications, five caregiver report measures have established clinically meaningful effect sizes: the GBI, MDQ, CMRS, ASEBA, and CBQ. The CASI and Conners scales have been evaluated in one sample each, and the YMRS performance is too varied and poor to endorse when so many better alternatives are available. The MDQ and GBI also appear adequate as self-report measures.

How do the checklists compare to the other alternatives available for assessing potential pediatric bipolar disorder? Benchmarking the effect sizes from this meta-analysis against those reported in other reviews shows that caregiver report on the best measures yields effect sizes larger than those generated by neurocognitive tasks comparing cases with bipolar disorder to ADHD or healthy controls. Recent studies have applied machine learning algorithms to fMRI data as a way of discriminating bipolar disorder. These studies involve both overfitting, where the algorithm is optimized for the sample at hand, and “distilled designs” of the kind that produced significantly larger effect sizes in this meta-analysis. Even with both of those upward biases, the algorithm produced an AUC of .78. The fMRI methods will produce smaller effects when used under clinically realistic conditions with greater diagnostic heterogeneity, especially given the transdiagnostic nature of brain regions involved in affect regulation. Even when fMRI or neurocognitive algorithms evolve to produce comparable or larger effect sizes in generalizable designs, issues of greater cost and more limited accessibility will remain. For the foreseeable future, checklists appear to be the frontrunner in an important niche of clinical assessment despite their imperfections.
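
To place the standardized mean differences from the checklists and the AUC reported for the fMRI algorithm on a common scale, one can convert between the two metrics. The sketch below uses the standard conversion AUC = Φ(d/√2), which assumes normally distributed scores with equal variances in both groups. That assumption and the function names are ours for illustration; they are not part of the meta-analytic method.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution (mean 0, SD 1) used for both conversions.
_STD_NORMAL = NormalDist()


def d_to_auc(d: float) -> float:
    """Convert a standardized mean difference to an AUC.

    Assumes normal scores with equal variances in both groups (binormal model).
    """
    return _STD_NORMAL.cdf(d / sqrt(2))


def auc_to_d(auc: float) -> float:
    """Inverse conversion, under the same binormal equal-variance assumption."""
    return sqrt(2) * _STD_NORMAL.inv_cdf(auc)


if __name__ == "__main__":
    # 1.05 is the overall checklist effect size; .78 is the fMRI AUC cited above.
    print(f"d = 1.05 corresponds to AUC of about {d_to_auc(1.05):.2f}")
    print(f"AUC = .78 corresponds to d of about {auc_to_d(0.78):.2f}")
```

The conversion is only as good as the normality and equal-variance assumptions, so it is best treated as a rough benchmarking aid rather than a substitute for AUCs estimated from raw data.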

Guidelines for Future Research
Present findings limn several guidelines and priorities for future research.

Avoid artifacts – especially criterion contamination and distilled designs
Consult the STARD guidelines when designing studies of diagnostic measures. Although the STARD guidelines were published in multiple journals and endorsed by multiple editorial boards, it is sobering that there was no difference in the average quality of reporting between studies published before and after the STARD guidelines were promulgated.

Similarly, researchers need to consider design issues when deciding whether to run ROC analyses as a secondary analysis. Using biased designs is not an abstract problem: it was pandemic in the literature that we reviewed, and it produced a large bias in the observed results. Comparisons to healthy controls are clinically trivial, and make all measures look good. These designs produce exaggerated effect sizes, which will translate into excessive pseudo-accuracy of clinical decisions. The exclusion of competing diagnoses that are clinically common, such as unipolar depression or conduct disorder, will reduce the number of false positive cases in the research report, biasing the apparent diagnostic specificity upwards. Using the same threshold in a clinical setting that does not exclude these cognate diagnoses will lead to much higher rates of false positives. Put simply, using results from distilled designs will contribute to overdiagnosis when applied in most clinical contexts. Distilled sampling was the single most potent moderator we identified in the meta-regressions. Please do not publish results if based on a biased design; these data are not helpful. Findings with criterion contamination, where the same measure is used both to define the proxy diagnosis and then to predict it, are equally problematic. They contribute noise to the literature and risk confusing consumers about choice of measure and interpretation. Peer reviewers should treat these as serious, perhaps fatal, flaws when reviewing manuscripts. On the STARD list of 25 design and reporting considerations, these are 800-pound gorillas, and they have been running amok through the mood assessment literature. We excluded a half-dozen studies that used proxy definitions with criterion contamination, and 52% of the samples (44% of the effect sizes) included in the meta-analysis used distilled designs, creating a gargantuan artifact in the literature.
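
The effect of distilled sampling on specificity can be illustrated numerically. The following sketch assumes normally distributed scores with unit variance and purely hypothetical group means (healthy controls at 0, clinically referred youths without bipolar disorder at 0.7, and youths with bipolar disorder at 1.5). None of these values come from the meta-analysis. The sketch sets a cut score that achieves 90% specificity against healthy controls and then applies the same cut score to the clinical comparison group.

```python
from statistics import NormalDist

# Hypothetical score distributions in z-score units; illustrative values only.
healthy = NormalDist(mu=0.0, sigma=1.0)               # distilled comparison group
clinical_non_bipolar = NormalDist(mu=0.7, sigma=1.0)  # realistic comparison group
bipolar = NormalDist(mu=1.5, sigma=1.0)               # target cases


def specificity(comparison: NormalDist, cut: float) -> float:
    """Proportion of non-bipolar youths scoring below the cut score."""
    return comparison.cdf(cut)


def sensitivity(target: NormalDist, cut: float) -> float:
    """Proportion of bipolar cases scoring at or above the cut score."""
    return 1.0 - target.cdf(cut)


# Cut score chosen to give 90% specificity in the healthy-control design.
cut_score = healthy.inv_cdf(0.90)

spec_distilled = specificity(healthy, cut_score)
spec_clinical = specificity(clinical_non_bipolar, cut_score)
sens = sensitivity(bipolar, cut_score)

if __name__ == "__main__":
    print(f"Cut score: {cut_score:.2f}")
    print(f"Sensitivity (same in both designs): {sens:.2f}")
    print(f"False positive rate vs. healthy controls: {1 - spec_distilled:.2f}")
    print(f"False positive rate vs. clinical comparison: {1 - spec_clinical:.2f}")
```

With these made-up means, the false positive rate in the clinical comparison group is more than double the 10% observed against healthy controls, even though the cut score and the sensitivity are unchanged.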

Add objective criterion measures besides diagnosis
The next generation of studies examining the validity of mood ratings should use objective, heteromethod measures to explore differences in validity of teacher, caregiver, and youth report. Benchmarking these against actimetry, gene expression, brain imaging of core constructs, or ecological validators such as high-risk behavior will help triangulate the relative weight clinicians and policy makers need to assign each perspective. These data also will inform future nosological revisions, and they will map the connections between levels of functioning that the NIMH Research Domain Criteria (RDoC) initiative aims to organize conceptually.

Explore culture as a moderator
Another goal is to prioritize understanding cultural differences contributing to performance. Within the USA, the few studies that have investigated differential item functioning or structural invariance tend to find that bipolar mood scales show little bias. This contrasts sharply with the disparities in clinical diagnoses and service utilization data. The burden of depression and bipolar disorder is equally serious in the developed and developing world. If these assessment tools are similarly valid across cultures, they offer an inexpensive method for early identification and intervention that can be implemented on a much larger scale than semi-structured interviewing or biological/neurocognitive assays. Part of the discrepancy between assessment and diagnostic findings could be closed by greater use of evidence-based assessment methods, using the best of the rating scales and interpreting them within an EBM/Bayesian approach.

However, cultural differences in beliefs about the causes of emotional and behavioral issues, as well as in attitudes towards treatment, are also likely to play a role. The Korean data in the meta-analysis are an intriguing anomaly that suggests a potentially powerful yet nuanced role for culture. More research needs to include groups with traditionally Confucian and other non-Western views, not just because it is an interesting academic question, but because most of the human population lives in these cultures, and these factors change the dynamic of medical communication between practitioner and patient.
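
The EBM/Bayesian approach to interpreting rating scale scores, noted earlier in this subsection, amounts to combining a prior probability with a diagnostic likelihood ratio (DLR). The sketch below is a minimal illustration of that arithmetic. The base rate and likelihood ratio are hypothetical placeholders, not values estimated in this meta-analysis, and the function name is ours.

```python
def posterior_probability(prior: float, likelihood_ratio: float) -> float:
    """Update a prior probability with a diagnostic likelihood ratio.

    Works on the odds scale: posterior odds = prior odds * likelihood ratio.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Hypothetical inputs: a 10% base rate in the clinic and a DLR of 5
    # attached to a high score on a caregiver checklist.
    base_rate = 0.10
    dlr_high_score = 5.0
    print(f"Posterior probability: {posterior_probability(base_rate, dlr_high_score):.3f}")
```

Because the update depends on the local base rate, the same checklist score can imply quite different probabilities in settings where bipolar disorder is more or less common.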

Study collateral informants in adulthood and late life
Whereas including collateral informants is routine with children and adolescents, it is the exception with adults. The difference in perception of behaviors by self versus other is a robust phenomenon, including the “fundamental attribution error” in social psychology. DSM-5’s new emphasis on energy in the mania A criterion also flowed from recognition that memory may be more accurate for changes in energy than for changes in mood. The larger effect size for caregiver report of manic symptoms deserves exploration in other age groups, especially given the loss of insight tied to hypomania and mania and the tremendous interpersonal consequences of manic episodes.

Present results also emphasize the importance of prioritizing dissemination of assessment tools and teaching evidence-based assessment methods. The scales that performed best in the meta-analysis are in the public domain, but they are not heavily advertised or widely taught in training programs. The meta-analysis has established that there is a set of caregiver report measures that delivers clinically meaningful effect sizes even when evaluated under externally valid designs. These could produce large improvements in clinical decisions, potentially even more so with under-served minority groups.