Epidemiological studies have provided growing evidence that eating disorders (ED) are relevant illnesses in young females and adolescents (Becker, Thomas, Franko, & Herzog, 2005). The prevalence rate in western countries is around 4.5% (Cotrufo, Barretta, & Monteleone, 1998; Favaro, Ferrara, & Santonastasio, 2003; Morandé, Celada, & Casas, 1999; Pérez-Gaspar et al., 2000; Rojo et al., 2003; Ruíz-Lazaro et al., 1998; Steinhausen, Winkler, & Meier, 1997). The rate may be even greater, however, if all subclinical cases, the most difficult to detect, are included (increasing the rate to a possible 8%).
Although anorexia nervosa (AN) and bulimia nervosa (BN) are well-known disorders, in community studies they are not so prevalent. Most cases belong to the category of Eating Disorders Not Otherwise Specified (EDNOS) (Eddy et al., 2010) of the DSM-IV-TR (APA, 2002). Although EDNOS can be even more severe and enduring, often they are not diagnosed until the disorder is well established, delaying the diagnosis and appropriate treatment. Above all, because the symptoms are presented first to non-specialists such as family doctors, they are not immediately recognized as related to an ED (Ogg, Millar, Pusztai, & Thom, 1997). Thus, it is crucial to identify the subpopulation at risk in order to encourage a higher awareness of the availability of treatments (Becker et al., 2005; Herpertz-Dahlmann, Wille, Hölling, Vloet, & Ravens-Sieberer, 2008). There is great interest in the public health sector in improving effective screening and the strategies of secondary prevention for ED in adolescents. High-quality tools are required for effective screening in epidemiological studies or in primary care. Because ED has a relatively low prevalence in the general population, community studies that assess the diagnostic efficacy of screening tests need large samples to ensure adequate power and accuracy.
One of the tests most often recommended is the SCOFF questionnaire, which was developed as a quick and reliable tool for the screening of ED. Morgan, Reid, and Lacey (1999) designed the SCOFF as a test that could be administered by non-specialists able to detect symptoms of ED. According to the authors, the proposed cut-off allows for a clinically appropriate balance between false positives and false negatives. This cut-off is a cue for suspicion that an ED exists and it should be followed by additional questioning about the patient’s weight and their eating attitudes and behaviors (Hill, Reid, Morgan, & Lacey, 2010).
Several studies have examined the psychometric characteristics of the SCOFF, both in the original version and in its translation into several other languages (see Table 1). In those studies acceptable trade-offs between sensitivity and specificity have been found, along with high levels of reliability. Today there are enough assessment studies available to conduct a quantitative synthesis. A meta-analysis allows the integration of the results of several studies that assess the diagnostic efficacy of a tool (Botella & Huang, 2012; Devillé et al., 2002; Gatsonis & Paliwal, 2006; Zweig & Campbell, 1993), which can also be done for other psychometric characteristics (Botella, Suero, & Gambara, 2010). The results are expressed mainly as pooled estimates of sensitivity and specificity that combine the estimates provided by the primary studies. Our own meta-analysis also attempted to partially account for the variability observed, analyzing the role of several moderator variables.
Table 1. Identification and main characteristics of the 15 studies included in the meta-analysis
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20170408221714-12090-mediumThumb-S1138741613000929_tab1.jpg?pub-status=live)
a German; b Spanish; c English; d French; e Finnish; f Catalan; g Italian.
I Psychometric test; II interview.
* The three values are the frequencies of cases diagnosed with the diagnostic reference being AN/BN/EDNOS.
** The mean has been estimated as the average between the higher and lower limits, given that the interval is very narrow.
In short, our aim was to perform a systematic review and meta-analysis to summarize the literature on assessments of the diagnostic efficacy of the SCOFF for detecting ED.
Method
Procedure
The SCOFF
The acronym refers to the SCOFF’s questions, which are related to five key features in ED (Sick, Control, One, Fat and Food; see appendix A). Individuals must answer Yes/No to each question, scoring one point for each positive answer (Morgan et al., 1999). The authors proposed a cut-off of two or more points for recommending that the clinician follow up with a deeper and more rigorous assessment.
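The scoring rule just described can be sketched in a few lines. This is only an illustration: the function names and the example answers are hypothetical, and the item labels are used merely as placeholders for the five questions.

```python
# Placeholder labels for the five Yes/No items (Sick, Control, One, Fat, Food).
SCOFF_ITEMS = ("Sick", "Control", "One", "Fat", "Food")
CUTOFF = 2  # cut-off proposed by the test's authors: two or more "Yes" answers


def scoff_score(answers):
    """Return the number of positive ("Yes") answers among the five items."""
    if len(answers) != len(SCOFF_ITEMS):
        raise ValueError("the SCOFF has exactly five items")
    return sum(1 for answer in answers if answer)


def scoff_positive(answers, cutoff=CUTOFF):
    """True when the score reaches the cut-off, i.e. follow-up is suggested."""
    return scoff_score(answers) >= cutoff


# Hypothetical respondent answering "Yes" to the first and third questions.
example = [True, False, True, False, False]
print(scoff_score(example), scoff_positive(example))  # 2 True
```

A positive result here is only a cue for further assessment, not a diagnosis.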
Literature search
We performed a systematic search of the literature using five international databases: Medline, PsycINFO, EMBASE, Web of Science (Science Citation Index Expanded and Social Sciences Citation Index) and Cochrane Library. Two of the authors (ARS and HG) searched for all the papers written in English, German or Spanish, published in peer-reviewed journals between 1999 and August 2011. The list of key words for the search included: SCOFF questionnaire combined with each of the terms ‘eating disorder’, ‘anorexia nervosa’, ‘bulimia nervosa’, ‘screening’, ‘primary care’, ‘validation’, ‘psychometrics’ and ‘prevalence’. Each of those combinations was also combined with the term ‘screening’. More than 50 combinations of words were searched in each of those databases. We also did manual searches of the references cited in the selected papers.
Once we had read the abstracts we recovered copies of the relevant papers. Most of the retrieved studies reported the use of the SCOFF in a sample of participants in intervention studies or prevention programmes, but they did not provide relevant information related to diagnostic efficacy.
Inclusion and exclusion criteria
After the process described in the previous section we still had 30 papers for possible inclusion in the meta-analysis. Studies were finally selected if they reported the use of the SCOFF in a sample of ED and another comparison sample, together with a measurement on a diagnostic reference. We excluded studies as follows: (a) those written in languages other than English, German or Spanish; (b) those not published in peer-reviewed journals; (c) those which reported results for the SCOFF as a recommended tool for screening, including psychometric studies; (d) those which compared two forms of administration (oral versus written); (e) those in which, although sensitivity and specificity were reported, the size of both samples was not reported, so that proper statistical treatment was not possible; (f) those which related to a sample voluntarily participating in intervention programmes, as the data were likely to be biased.
No restrictions were placed on the gender and age of the participants, the type of sample, or the type of reference employed as a diagnostic reference. A total of 15 studies met all the inclusion criteria and were finally selected. They were performed in eight different countries and in seven different languages. Most of them are assessments of new translations to a different language. This gives us an opportunity to assess generalizability. The advantage of a tool with high generalizability is that it allows for comparisons and synthesis of the results from studies performed in different languages and countries.
The decision criteria for exclusion were independently applied by two of the authors (JB and ARS), with a high inter-rater agreement (coefficients equal to 1 for most variables, and all above .85).
Diagnostic references for the Gold Standard
Studies that assess the SCOFF’s efficacy for screening have employed a large variety of tools as the diagnostic reference (gold standard). In 60% of cases individual diagnostic interviews are used which employ several tools (CIDI, DSM-IV, MINI, SCAN). Although they are not 100% reliable, their performance is good enough for them to be accepted as a gold standard.
Some studies, however, have chosen as diagnostic references specific cut-offs in psychometric tools, such as EAT-40, EAT-26 and EDI-2. Given that these procedures are less appropriate for the role of a gold standard, lower levels of diagnostic efficacy are to be expected compared with studies based on interviews.
Extraction of basic data
The data from each paper were systematically and independently extracted by two of the authors using a structured database; they reached perfect agreement for the four raw frequencies in the studies. The data related to the tool were transformed, when necessary, to reach the four frequencies: true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN). When the frequencies were reported separately according to the disorder (AN, BN, EDNOS) they were summed to reach the totals. In most cases the papers provided sensitivity and specificity, plus the size of each group, so that the frequencies were obtained from those quantities.
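The last step, recovering the four frequencies from the reported sensitivity, specificity and group sizes, can be sketched as follows. The function name and the numbers in the example are hypothetical and do not come from any of the primary studies.

```python
def frequencies_from_rates(sensitivity, specificity, n_cases, n_controls):
    """Recover (TP, FP, FN, TN) from reported rates and group sizes.

    Reported rates are usually rounded, so the products are rounded to the
    nearest integer count.
    """
    tp = round(sensitivity * n_cases)      # cases correctly flagged
    fn = n_cases - tp                      # cases missed by the test
    tn = round(specificity * n_controls)   # controls correctly cleared
    fp = n_controls - tn                   # controls wrongly flagged
    return tp, fp, fn, tn


# Hypothetical study: 50 cases, 200 controls, sensitivity .80, specificity .90.
tp, fp, fn, tn = frequencies_from_rates(0.80, 0.90, 50, 200)
print(tp, fp, fn, tn)  # 40 20 10 180
```

When a paper reports frequencies separately for AN, BN and EDNOS, the same totals are obtained by summing the per-disorder counts before any further analysis.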
The following fields were coded: year of publication, country, language of the version employed, size of the samples, mean age, gender (percentage of women) and the tool employed for the diagnostic reference. We abandoned some initial fields because the number of studies providing the relevant information was very low (e.g. duration of the illness, mean body mass index).
Table 1 identifies each of the studies, with the values of the most relevant variables analyzed in the present research. The sources are marked with an asterisk in the references.
Statistical analyses
We have fitted hierarchical models of the summary ROC curve (Gatsonis & Paliwal, 2006; Macaskill, 2004) employing the NLMIXED procedure of SAS (2008). These models provide combined estimates of the parameters that reflect diagnostic efficacy, the criterion for classification and the scale parameter. The parameters allow combined estimates of the performance indexes of the test to be derived, within the framework of Random Effects models. Random Effects models assume that the individual data sets are drawn from a distribution of populations. This is more credible than the assumption that all of them come from a single population, as in Fixed Effect models (Borenstein, Hedges, Higgins, & Rothstein, 2010).
Besides the basic model we have explored several models that include as covariates the moderators for which we had enough information (mean age, gender, type of reference). These are Mixed models, as they involve a main Random Effect component, whereas the moderators are included as a Fixed Effect.
We have also obtained descriptive summaries and several types of figures from Review Manager (2008) and METADISC (Zamora, Abraira, Muriel, Khan, & Coomarasamy, 2006).
Results
Characteristics of the studies
The set of 15 studies aggregated 882 cases with an ED and 4350 healthy controls, according to the diagnostic references employed. The main characteristics of the selected studies are described in Table 2.
Table 2. Primary studies’ characteristics
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20170408221714-08670-mediumThumb-S1138741613000929_tab2.jpg?pub-status=live)
Women predominate in the composition of the samples (between 50% and 100%). The average age is reported in only two-thirds of the studies (although all report the range).
A wide variety of procedures were employed for classification by the diagnostic reference. In 60% of the studies these are interviews, often structured by several well-known tools (CIDI, DSM-IV, MINI, SCAN). In the other 40% the classification was done by means of a cut-off in another test (EAT, EDI, EDEQ).
Diagnostic efficacy of the SCOFF
First of all we checked for any threshold effect. This effect is produced by shifts of the criteria for classification between different studies. It is reflected in a negative correlation between sensitivity and specificity. Although this effect is typical in tests with an implicit criterion for classification, it can also arise with an explicit criterion that reflects a latent variable (such as the SCOFF). The Spearman correlation between the sensitivities and the specificities in our 15 studies was positive and not statistically significant (r_s = .377; p = .166), so we ruled out this effect in our set of studies.
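As an illustration of this check, the following minimal sketch computes a Spearman rank correlation between per-study sensitivities and specificities using only the standard library. The data are invented to show what a strong threshold effect would look like; they are not the values from the 15 studies, and the significance test is not reproduced here.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical studies in which sensitivity rises as specificity falls.
sensitivities = [0.70, 0.80, 0.85, 0.90, 0.95]
specificities = [0.95, 0.93, 0.90, 0.88, 0.80]
rho = spearman(sensitivities, specificities)
print(round(rho, 3))  # -1.0
```

A clearly negative coefficient, as in this invented example, is the pattern that would signal a threshold effect.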
We fitted hierarchical models of the summary ROC curve (Gatsonis & Paliwal, 2006; Macaskill, 2004) by means of the NLMIXED procedure (SAS, 2008). Assuming logistic distributions, this model estimates several parameters, although there are really just two that characterize the performance of a binary classification tool. The first one, alpha, reflects the ability of the test for discrimination. The second one, theta, reflects how conservative or liberal the threshold for classification is. The model estimates the parameters jointly, so that it takes into account the joint variation inherent in any potential threshold effect (even if it is small and non-significant). Then the estimates of the parameters are converted into pooled estimates of the sensitivity and specificity.
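The conversion from parameters to sensitivity and specificity can be sketched as below. The equations are not printed in the text, so this assumes the parameterization commonly used for HSROC models, in which the logit of the probability of a positive test is (theta + alpha·x)·exp(−beta·x), with x = 1/2 for cases and x = −1/2 for controls. The function names and the parameter values in the example are hypothetical.

```python
import math


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def hsroc_to_sens_spec(alpha, theta, beta):
    """Convert HSROC parameters into (sensitivity, specificity).

    Assumed model: logit P(test positive) = (theta + alpha * x) * exp(-beta * x),
    with x = +0.5 for cases and x = -0.5 for controls.
    """
    sensitivity = expit((theta + 0.5 * alpha) * math.exp(-0.5 * beta))
    false_positive_rate = expit((theta - 0.5 * alpha) * math.exp(0.5 * beta))
    return sensitivity, 1.0 - false_positive_rate


# Hypothetical values: symmetric curve (beta = 0), neutral threshold (theta = 0).
sens, spec = hsroc_to_sens_spec(alpha=2.0, theta=0.0, beta=0.0)
print(round(sens, 3), round(spec, 3))  # 0.731 0.731
```

Under this assumed form, a larger alpha pushes sensitivity and specificity up together, whereas a larger theta trades specificity for sensitivity; when beta is zero, alpha equals the log of the diagnostic odds ratio.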
The basic hierarchical model (without covariates) fitted with the 15 studies provides combined estimates of the parameters that involve a sensitivity of .801 and a specificity of .934. The diagnostic odds ratio for these estimates is 56.96. The scale parameter, beta, is statistically different from zero (p = .046). This means that the summary curve is asymmetric (alpha and theta are not independent). The values for sensitivities show higher heterogeneity than those for specificities. This is a consequence of the fact that the size of the samples of targets is generally lower than the size of the samples of controls.
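The diagnostic odds ratio follows directly from the pooled sensitivity and specificity, as the ratio between the odds of a positive test among cases and the odds of a positive test among controls. A minimal sketch (the function name is illustrative):

```python
def diagnostic_odds_ratio(sensitivity, specificity):
    """DOR = [sens / (1 - sens)] / [(1 - spec) / spec]."""
    odds_positive_in_cases = sensitivity / (1.0 - sensitivity)
    odds_positive_in_controls = (1.0 - specificity) / specificity
    return odds_positive_in_cases / odds_positive_in_controls


# Pooled estimates reported for the basic model with the 15 studies.
dor = diagnostic_odds_ratio(0.801, 0.934)
print(round(dor, 2))  # 56.96
```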
The homogeneity test for the values of sensitivity and specificity in the 15 studies shows that in both indexes the observed variability is larger than expected [sensitivity, Q(14) = 139.98, p < .001; specificity, Q(14) = 202.55, p < .001]. Figure 1 shows the combined forest plot of both indexes.
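The text does not specify how the Q statistics were computed, so the following is only a generic sketch of a Cochran-type homogeneity statistic for a set of proportions, worked on the logit scale with inverse-variance weights. The 0.5 continuity correction applied to every study is an assumption of this sketch, and the counts are hypothetical.

```python
import math


def cochran_q(events, totals):
    """Cochran-type Q for proportions on the logit scale.

    events[i] out of totals[i] is the count behind one study's proportion
    (e.g. true positives out of all cases for sensitivity).
    Returns (Q, degrees of freedom).
    """
    logits, weights = [], []
    for e, n in zip(events, totals):
        a = e + 0.5          # continuity-corrected successes (sketch assumption)
        b = (n - e) + 0.5    # continuity-corrected failures
        logits.append(math.log(a / b))
        weights.append(1.0 / (1.0 / a + 1.0 / b))  # inverse of the logit variance
    pooled = sum(w * y for w, y in zip(weights, logits)) / sum(weights)
    q = sum(w * (y - pooled) ** 2 for w, y in zip(weights, logits))
    return q, len(events) - 1


# Hypothetical sensitivities: 45/50 in one study versus 20/50 in another.
q, df = cochran_q([45, 20], [50, 50])
print(round(q, 1), df)
```

Under homogeneity, Q is compared against a chi-square distribution with the returned degrees of freedom; values far above them, as in the results reported here, indicate more variability than sampling error alone would produce.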
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20170408221714-53661-mediumThumb-S1138741613000929_fig1g.jpg?pub-status=live)
Figure 1. Forest plot of the 15 studies included in the meta-analysis.
We fitted HSROC models that incorporate the moderator variables mean age, gender and type of diagnostic reference as covariates associated with the parameters alpha and/or theta. Whereas the models with the mean age did not show a significant association with alpha or theta, those including the gender and the type of diagnostic reference did show marginally significant associations with the parameter alpha, but not with theta. As regards gender, we did not have any a priori hypothesis for the direction of any eventual association. The observed association is positive. That is, the ability of the SCOFF to discriminate between cases and normals is greater the larger the percentage of women in the sample. The efficacy of the test is marginally higher for women than for men (p = .052).
As regards the type of tool employed as the diagnostic reference, we expected a greater ability for discrimination when tools based on individual interviews were employed as compared with psychometric tools. Actually, the psychometric tools are not a real gold standard, given that their results are not beyond question. They have an imperfect reliability that underestimates the efficacy for classification of the tool for screening, as they classify some individuals (cases and normals) in the wrong group. This covariate has been included in the model as a dummy variable. As expected, it is associated with the test’s capability for discrimination. The diagnostic efficacy is marginally greater when the diagnostic reference is based on an interview than when it is based on a psychometric test (p = .057).
Given that the diagnostic efficacy differs, albeit only marginally, according to the type of diagnostic reference and that the measurements based on interviews are clearly superior, the best estimates we can offer of the performance of the test are the estimates based on this last type of measurement. Fitting a new HSROC model only to the nine studies that employed diagnostic references based on interviews gives pooled estimates of .882 for sensitivity and .925 for specificity. The corresponding diagnostic odds ratio for those estimates is 92.19.
Some of the primary studies have highlighted that perhaps the sensitivity of the SCOFF could be different for AN, BN and EDNOS (García-Campayo et al., 2005). In order to study this issue we selected the studies that provide relevant data. Of course, those are only the studies in which the measure of the diagnostic reference is based on some form of interview, given that the psychometric tests do not discriminate between the three diagnoses. Eight of the nine studies that employ some form of interview provide frequencies of positive classification by the SCOFF, disaggregated according to the diagnosis (see Table 1).
After testing the frequencies of AN and BN as potential covariates, we concluded that they are not significantly different. When the proportion of EDNOS in the sample is included, however, a significant association with diagnostic efficacy is apparent. Specifically, the diagnostic efficacy is significantly better for EDNOS than for AN/BN (p = .011). This difference can be associated with a greater tendency of the AN/BN cases to deny the illness, hiding the symptoms when answering the SCOFF. It can also be related to a lower sensitivity in individuals with previous low weight. Specifically, question 3 of the SCOFF makes reference to a recent extreme weight loss. It is possible that patients with AN or BN consider that it has not happened recently, given that these patients reduce their body mass index (BMI) step by step over a long period. Furthermore, the SCOFF was developed as a tool for detecting patients with ED at an early stage in the general population, not for identifying whether the adolescent has a BMI of 17.5 kg/m², a key criterion for diagnosing AN.
Discussion
The joint analysis of 15 studies that assess the diagnostic efficacy of the SCOFF shows that the test is highly effective as a quick screening tool for ED. Its brevity and simplicity have allowed easy translation into multiple languages that have shown comparable values of performance. The pooled estimates from the present 15 studies are .801 for sensitivity and .934 for specificity.
The analysis of the moderator variables shows that the diagnostic efficacy is associated with the gender and the type of diagnostic reference. Specifically, the efficacy increases as the percentage of women in the sample increases and if the diagnostic reference is based on an interview. The best estimates from these data are those combining the nine studies based on interviews; they provide values of .882 for sensitivity and .925 for specificity. The comparison of these values with those provided by the combined estimates of the 15 studies reflects another interesting characteristic. When interviews instead of psychometric tests are used the specificity remains practically unchanged; however, the sensitivity rises from .801 to .882. This increase reveals the ability of a professional (even a non-specialist) to detect symptoms, without any associated increase in the false positives. That is, the effect is not based on a shift of the criteria, but on the genuine advantage of the interview over the psychometric tests in terms of discriminating between the cases and the controls.
Nevertheless, the effects associated with gender and the type of diagnostic reference have been detected with statistical tests that were only marginally significant (p < .06). Convergent evidence is needed to reinforce the role of both moderators.
In short, the results from the present meta-analysis reinforce the idea that the set of five questions of the SCOFF constitute a highly efficient tool for the detection of ED, even by a non-specialist, in several languages. Its use as a screening tool is highly recommended.
Appendix A The five questions of the SCOFF:
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary-alt:20170408221714-82685-mediumThumb-S1138741613000929_tabau1.jpg?pub-status=live)