Introduction
Approximately 50–85% of people with severe mental disorders receive no treatment (Patel et al. 2007; World Health Organization, 2011). People with mental illness face many barriers in accessing mental health care; among these, stigma against mental illness is a significant one, according to a recent systematic review of variables influencing mental health help-seeking (Gulliver et al. 2010).
The stigma of mental illness is ‘an attribute that is deeply discrediting’, one that reduces the bearer ‘from a whole and usual person to a tainted, discounted one’ (Goffman, 1963). Several conceptual frameworks have been created, including labelling theory (Goffman, 1963; Link et al. 1987), social attribution theory (Corrigan et al. 2003), cognitive behavioural modelling (Thornicroft, 2006) and social stigma modelling (Jones et al. 1984), both to help understand and evaluate stigma related to mental illness and to guide stigma reduction interventions. As a result, the dimensions of the stigma of mental illness vary from one theory to another, and so do the stigma measurement tools created under different theories. More recently, the mental health literacy framework (Kutcher et al. 2015a, b, 2016) considers stigma reduction one of its core constructs and stresses how stigma reduction and the improvement of mental health knowledge may enhance help-seeking behaviours. Research, including randomised controlled trials and longitudinal cohort studies (McLuckie et al. 2014; Kutcher et al. 2015a, b; Milin et al. 2016; Thornicroft et al. 2016), has demonstrated the effectiveness of interventions designed on this approach.
Under these frameworks, a plethora of measurement tools has been developed to evaluate the stigma of mental illness through different lenses. These include the evaluation of public/personal stigma (people's own attitudes towards people with mental illness); perceived stigma (the stigma people perceive as held by others towards people with mental illness); self-stigma (the stigma people with mental illness hold against themselves); and experienced stigma (the stigma people with mental illness have encountered at the individual, community and society levels) (Batterham et al. 2013). A recent scoping review (Wei et al. 2015), a systematic approach to mapping the literature in an area of interest and to accumulating and synthesising the available evidence, identified 65 stigma measures; a narrative review (Brohan et al. 2010) identified another 14 and categorised them according to different theoretical models. Another narrative review discussed more than 100 stigma measures informed specifically by labelling theory (Link et al. 2004). A further narrative review (Boyd et al. 2014) discussed 47 versions of one tool, the Internalized Stigma of Mental Illness scale, and summarised its reliability and validity. However, despite the abundance of stigma measurement tools, and of stigma impact research using them, little if any research has been identified that investigates the quality of currently available stigma measurement tools. Furthermore, no research has been identified that aggregates, analyses and compares stigma measurement tools developed under different stigma theoretical frameworks.
We conducted a systematic review to critically analyse the methodological quality of studies on the psychometrics of available stigma tools, and further to determine the level of evidence for the overall quality of these psychometric properties across studies. Based on our analysis, we then make recommendations for further stigma research and for the application or ongoing development of these tools.
Methodology
This review followed the protocol recommended by the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) (Moher et al. 2009) in reporting its findings. We conducted a risk of bias analysis with the adapted Consensus-based Standards for the Selection of Health Measurement Instruments (COSMIN) checklist (Terwee et al. 2012); assessed the quality of each individual psychometric property using criteria developed by the COSMIN group (Terwee et al. 2007); and then rated the level of evidence of overall quality. The COSMIN checklist is a consensus-based checklist used to evaluate the methodological quality of studies on the measurement properties of health status instruments (Terwee et al. 2012).
Search strategy
We searched the PubMed, PsycINFO, EMBASE, CINAHL, Cochrane Library and ERIC databases for relevant studies, without limits on publication date. The initial search was conducted between January and June 2015 and updated between April and May 2016, assisted by a local university health librarian. To ensure our search covered all dimensions of stigma as framed within the mental health literacy approach, regardless of the theoretical foundations with which they were affiliated, our search strategy covered all three outcomes of mental health literacy (knowledge, stigma and help-seeking), and we did not exclude studies that self-identified as focused on knowledge or help-seeking outcomes until the last stage of data extraction, because some mental health literacy measures include all three components. We applied the search strategy from the scoping review (Wei et al. 2015), which contained four sets of key words and phrases covering general mental health and mental disorders, the three outcomes of mental health literacy, assessment tools and study designs. Appendix 1 provides details of all search words and phrases applied in searching PubMed.
Two team members independently screened the citations identified from the database searches for relevant studies. Both members followed the same procedures to assess the potential relevance of studies: reviewing titles in general (stage 1), reviewing titles and scanning abstracts (stage 2), briefly scanning full papers (stage 3) and reading full papers for data extraction (stage 4). Following these stages, we checked the reference list of each included study for additional studies and further searched narrative reviews on stigma measurement tools for additional studies (Link et al. 2004; Brohan et al. 2010; Boyd et al. 2014). The two reviewers discussed the studies they identified and reached consensus on the final inclusion of studies. Three mental health professionals and/or research methodologists were available to resolve any discrepancies in the final decisions on included studies.
Selection criteria
We included any type of quantitative study assessing and reporting any psychometric property (reliability, validity or responsiveness) of a stigma measurement tool. Informed by the literature, we defined a stigma measurement tool as one that evaluates perceived stigma, experienced stigma, emotional responses to mental illness or self-stigma of mental illness. Our search focused on tools addressing stigma of mental illness in general or stigma against common specific mental illnesses: anxiety disorder, depression, attention deficit hyperactivity disorder (ADHD) and schizophrenia. An eligible study had to report not only the psychometrics of the tool, but also the statistical analysis of these psychometrics. We searched databases for studies published in English and did not limit the date of publication or study participant age.
We excluded studies that only provided the psychometrics of the tool applied but did not report the statistical analysis behind them. For example, many studies evaluating anti-stigma interventions reported the internal consistency of the tool applied but did not describe the related statistical analysis, and these were therefore excluded from our review. We did not include studies addressing stigma related to substance use and addictions, as these cover a wide range of domains that need independent evaluation.
Data extraction
We followed the COSMIN checklist manual (Terwee et al. 2012) and created a data extraction form a priori to document basic information about each included study, such as author information, tool content, the tool's response options, population, study location and study sample size. We further documented information about measurement properties as: (1) reliability (internal consistency, test–retest and intra-rater reliability, and measurement error); (2) validity (content validity, structural validity (factor analysis), hypothesis testing (construct validity), cross-cultural validity and criterion validity); and (3) responsiveness (sensitivity to change).
We considered adapted tools (adding/reducing items or changing original items) to be separate tools. However, if a tool was created in one study and its factor structure was assessed in another, with the number of final items adjusted from the original following the factor analysis, we considered them the same tool, as this is part of the usual ongoing process of finalising scales.
Methodological quality of included studies (risk of bias)
We rated the quality of a study for a particular measurement property as ‘excellent’, ‘good’, ‘fair’ or ‘poor’. As a study may assess more than one measurement property, it may receive multiple quality ratings, one for each property it assesses. The COSMIN checklist (Terwee et al. 2012) provides 7–18 criteria items for assessing methodological study quality for each measurement property, each item rated as ‘excellent’, ‘good’, ‘fair’ or ‘poor’. The final rating of study quality for each property takes the lowest rating across the criteria items (‘worst score counts’). For example, the COSMIN checklist contains seven criteria items for assessing study quality for structural validity; if a study's ratings across these items range from ‘poor’ to ‘good’, its final rating for structural validity is ‘poor’.
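To make the ‘worst score counts’ rule concrete, the following is a minimal sketch in Python; the rank values and the seven illustrative item ratings are hypothetical, not drawn from the review's data.

```python
# 'Worst score counts': the final methodological-quality rating a study
# receives for one measurement property is the lowest rating it earned
# across that property's criteria items.
RANK = {"poor": 0, "fair": 1, "good": 2, "excellent": 3}

def worst_score(item_ratings):
    """Return the lowest rating among the criteria-item ratings."""
    return min(item_ratings, key=lambda r: RANK[r])

# Hypothetical ratings on the seven structural-validity criteria items:
ratings = ["good", "good", "fair", "good", "poor", "fair", "good"]
print(worst_score(ratings))  # -> 'poor'
```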
Quality of measurement property and level of evidence of overall quality
In addition, the COSMIN group developed quality criteria for each psychometric property (except cross-cultural validity) (Terwee et al. 2007). Each property must reach a quality threshold to receive a positive rating (+); otherwise it receives a negative rating (−), an indeterminate rating (?) where data are lacking, or a conflicting rating (+/−) where findings are contradictory (Appendix 2). Based on both the methodological study quality and the quality of each psychometric property, we determined the level of evidence for the overall quality of a psychometric property. The ratings were determined by adapting and applying criteria from a systematic review of measures of continuity of care (Uijen et al. 2012) and the Cochrane Back and Neck Group's recommendations on the overall level of evidence of each assessed outcome (Furlan et al. 2015) (Appendix 3). As a result, the levels of evidence are: strong (S) (+++ or −−−), moderate (M) (++ or −−), limited (L) (+ or −), conflicting (C) (+/−), or unknown (U) (x). We considered measurement properties with strong positive evidence (+++) as ‘ideal’, moderate positive evidence (++) as ‘preferred’ and limited positive evidence (+) as ‘minimum acceptable’.
We defined the level of evidence as unknown (U(x)) if: (1) a property is assessed in one study only and the study quality is ‘poor’, or the psychometric property is indeterminate (?); (2) a property is assessed in two studies, and the study quality is poor or property is indeterminate (?) in both studies; (3) a property is assessed in more than two studies, and the study quality is poor or property is indeterminate (?) in ≥ half of the studies.
If a property was assessed in two studies of at least ‘fair’ quality and the quality of the measurement property was positive (+) in both, we used the ‘worst score counts’ approach for the level of evidence; otherwise, we determined the level of evidence as conflicting (C(+/−)). If a property was assessed in more than two studies and we found fair, good or excellent study quality in more than half of them, we rated the level of evidence as strong, moderate or limited, again using the ‘worst score counts’ approach. For example, if a measurement property is rated consistently as (+) or (−) in studies of mixed excellent, good and fair quality, the final rating is a limited level of evidence (L(+) or L(−)). In the remaining cases, the level of evidence is conflicting (C(+/−)).
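Expressed as a minimal Python sketch, the aggregation rules above might look like the following; the mappings, function name and example input are illustrative simplifications, not part of the COSMIN or Cochrane materials.

```python
# Combine study quality with property ratings into a level of evidence.
RANK = {"fair": 0, "good": 1, "excellent": 2}
LEVEL = {"fair": "limited (L)", "good": "moderate (M)", "excellent": "strong (S)"}

def level_of_evidence(studies):
    """studies: list of (study_quality, property_rating) pairs,
    e.g. [("good", "+"), ("fair", "+")]."""
    n = len(studies)
    usable = [(q, r) for q, r in studies if q != "poor" and r != "?"]
    # Unknown: every study is poor/indeterminate, or (with >2 studies)
    # half or more of them are.
    if not usable or (n > 2 and n - len(usable) >= n / 2):
        return "unknown (U(x))"
    ratings = {r for _, r in usable}
    if len(ratings) > 1:                      # mixed (+) and (-) findings
        return "conflicting (C(+/-))"
    worst = min((q for q, _ in usable), key=RANK.get)  # worst score counts
    return f"{LEVEL[worst]} ({ratings.pop()})"

print(level_of_evidence([("excellent", "+"), ("good", "+"), ("fair", "+")]))
# -> 'limited (L) (+)'
```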
Results
Study selection and characteristics
Figure 1 presents the flow chart of the study selection process. The data were first imported into RefWorks 2.0 reference management software (RefWorks-COS PL, ProQuest, 2001) and duplicates were removed. We then screened 21 089 studies through four screening stages, excluding studies that were not on the topics of interest (e.g., studies addressing HIV/AIDS stigma, CBT, resilience, social and emotional learning, or mental disorders outside the scope of this review). As a result, we identified 117 studies reporting and analysing the psychometric properties of 101 stigma measurement tools (Table 1). We classified tools according to what they measured (Table 1): perceived stigma against mental illness or the mentally ill; perceived stigma against mental health care (e.g., treatment, help-seeking, mental health institutions or psychiatry as a profession); emotional responses to mental illness; experienced stigma by people with mental illness or their relatives/caregivers; and self-stigma held by people with mental illness. We did not categorise tools under a specific stigma theory because most were developed with components combined from various theories or were based on interviews with the target population.
Fig. 1. Flow chart of search results.
Table 1. Study characteristics
A: stigma against mental illness or the mentally ill; B: stigma against help-seeking, treatment, mental health institutions or psychiatry; C: emotional responses to mental illness; D: experienced stigma; E: self-stigma; ?: not reported.
Ninety-one of the 101 tools applied a Likert-scale response format, asking participants to rate their level of agreement with items addressing stigma (Table 1). The other 10 tools applied formats such as multiple choice (e.g., yes/no/do not know), responses on a 100 mm visual analogue scale, error-choice responses, open-ended questions, or the prevalence and frequency of stigma experiences.
Study participants were mostly people with mental illness (n = 36) and their relatives and caregivers (n = 6), followed by community members/the general public (n = 20), health care providers and staff (n = 20), college students (n = 15), secondary school students (n = 8), and people from other professions such as educators (n = 2), police (n = 1), athletes (n = 1), employers (n = 1) and military personnel and veterans (n = 1). Some studies used multiple groups of the participants mentioned above (n = 8). Most studies took place in developed countries, with the USA the most studied site (n = 44), followed by the UK (n = 21), Canada (n = 8) and China (n = 8). The remaining studies were conducted in 19 other countries.
Methodological study quality
Table 2 summarises study quality as ‘excellent’, ‘good’, ‘fair’ or ‘poor’. Each study demonstrated mixed quality, from ‘poor’ to ‘good’, across the different measurement properties of a tool, except for one study on the Generalized Anxiety Stigma Scale (GASS), which demonstrated ‘good’ or ‘excellent’ quality for all measurement properties assessed (Griffiths et al. 2011).
Table 2. Methodological quality of included studies and the quality of each measurement property
Study quality: E = excellent, G = good, F = fair, P = poor. Quality of each measurement property: positive rating (+), negative rating (−), indeterminate rating (?), conflicting rating (+/−). Overall level of evidence: strong (S) (+++ or −−−), moderate (M) (++ or −−), limited (L) (+ or −), conflicting (C) (+/−), or unknown (U) (x). N/A = not applicable.
**: the 12 tools for which all assessed measurement properties met the criteria of limited positive (+) evidence (minimum acceptable) or above; ??: the 20 tools for which no measurement property met the criteria of minimum acceptable evidence (limited positive level of evidence) or above.
A total of five studies met the criteria for ‘excellent’ quality: studies measuring the internal consistency of the Stigma-Devaluation Scale (Dalky, 2012), the construct and structural validity of the GASS (Griffiths et al. 2011), and the content validity of the Opening Minds Scale for Health Care Providers, the Self-Stigma Scale and the revised Discrimination and Stigma Scale (Thornicroft et al. 2009; Mak & Cheung, 2010; Kassam et al. 2012).
Studies of ‘good’ quality were mostly those measuring internal consistency (n = 67) (Table 2), followed by five studies on content validity, one on test–retest reliability, one on hypothesis testing (construct validity) and one on structural validity.
‘Fair’ quality was found in most studies evaluating structural validity (89 of 93), construct validity (hypothesis testing) (85 of 92) and test–retest reliability (38 of 45), as well as in most studies evaluating cross-cultural validity (three of four) and in all studies evaluating criterion validity (n = 7). We further identified ‘fair’ quality in some studies evaluating internal consistency (n = 5) and content validity (n = 8).
No studies on structural validity or criterion validity were rated as ‘poor’ quality; however, the only two studies (Kassam et al. 2010; Modgill et al. 2014) on the responsiveness of related tools were rated as ‘poor’. We also found ‘poor’ quality in some studies evaluating internal consistency (n = 36), content validity (n = 10), test–retest reliability (n = 5), construct validity (hypothesis testing) (n = 5) and cross-cultural validity (n = 1).
Level of evidence on the overall quality of measurement properties of stigma tools
As described in the previous sections, study quality (excellent, good, fair or poor) and the quality of each measurement property (+, −, +/− or ?) were combined to determine the level of evidence as strong (S) (+++ or −−−), moderate (M) (++ or −−), limited (L) (+ or −), conflicting (C) (+/−) or unknown (U) (x), as shown in Table 2. The quality of each measurement property determined the direction of the level of evidence of overall quality as positive (+) or negative (−); these ratings are also presented in Table 2.
We found strong evidence (+++) for three tools: the content validity of the revised Discrimination and Stigma Scale (Thornicroft et al. 2009) and the Self-Stigma Scale (Mak & Cheung, 2010), and the internal consistency, structural validity (factor analysis) and construct validity of the GASS (Griffiths et al. 2011). Moderate levels of evidence (M(++); M(−−)) mostly concerned the internal consistency of related tools (55 tools in 63 studies), as well as the content validity of five tools (Table 2). We further found limited levels of evidence (L(+); L(−)) for the construct validity of 55 tools in 68 studies, the structural validity of 46 tools in 56 studies, the test–retest reliability of 23 tools in 29 studies, the content validity of eight tools, the criterion validity of seven tools and the internal consistency of one tool (Table 2).
We identified conflicting (C(+/−)) evidence for the test–retest reliability of nine tools, the internal consistency of six tools, the construct validity of five tools and the structural validity of three tools (Table 2). We were unable to determine the level of evidence (U(x)) for a number of measurement properties of some tools owing to the lack of information provided: the internal consistency of 29 tools (37 studies), the structural validity of 25 tools (26 studies), the content validity of 11 tools, the construct validity of 11 tools, the test–retest reliability of four tools and the responsiveness of two tools. Four tools addressing cross-cultural validity were also rated as U(x) because the COSMIN group has not developed quality criteria for this property.
Of the 101 tools, 12 met the criteria for a limited, moderate or strong positive level of evidence on all their assessed measurement properties (highlighted with ** in Table 2), and 69 tools reached these levels of evidence for some of their measurement properties. For the remaining 20 tools (highlighted with ?? in Table 2), none of the measurement properties reached even the minimum acceptable level of evidence (+).
Discussion
This review is the first of its kind to investigate the quality of studies on tools evaluating stigma against mental illness, and the level of evidence for the overall quality of their measurement properties. As indicated above, a total of 81 tools met the criteria for a minimum acceptable, preferred or ideal level of evidence, with positive ratings for all or some of their measurement properties. These results may be useful for researchers and community members considering which tools to apply in practice.
However, it is a challenge to conclude that one tool is better than another, for a number of reasons: (1) the included tools contained different items addressing various domains of stigma, even when developed under the same theoretical framework; (2) studies evaluated different measurement properties; and (3) study quality and level of evidence varied even within the same study, depending on the properties measured. For example, the Attitudes to Severe Mental Illness scale measured general attitudes of the general public and is one of the 12 tools whose measurement properties all reached a ‘limited’ or ‘moderate’ level of evidence (Madianos et al. 2012). Another tool, the Reported and Intended Behaviour Scale (Evans-Lacko et al. 2011), also measured general attitudes of the general public in multiple studies but had mixed levels of evidence, from ‘unknown’ (x) to ‘moderate’ (++). In these circumstances, when choosing a tool for application, the evidence for each individual property matters, and we should also consider whether the purpose of the chosen tool (e.g., its content, target population and setting) is consistent with the intended application, whether developing an anti-stigma intervention or measuring public stigma of mental illness.
Based on the current evidence, we recommend using the 12 tools whose evaluated measurement properties all reached at least a ‘limited’ level of evidence (highlighted with ** in Table 2), as well as tools reaching these quality levels (limited or above) for at least half of their evaluated measurement properties (Table 2). We do not recommend tools with negative ratings (−−−, −− or −), because the statistics for these measurement properties fell below the criteria thresholds, nor are we confident about the application of tools with conflicting (+/−) or unknown (x) evidence. However, we raise the caveat that future recommendations on the use of these tools may change: the validation of a tool is an ongoing process (Streiner & Norman, 2008), and as more studies with more appropriate designs are conducted, tools that currently do not meet our criteria may do so.
The finding that there are currently over 100 different stigma measurement tools raises concerns about the overall value of this body of research, as it is simply not possible to reach general conclusions about issues related to stigma in mental illness given the use of so many different tools to measure the concept. As such, we were unable to decide which tool is the ‘gold standard’ in this area, which is probably why only two (Vogel et al. 2009; Gibbons et al. 2012) of the seven studies measuring criterion validity showed significant correlations with the pre-defined ‘gold standard’ tools. Future research should focus on using a much smaller number of tools, those with the best psychometric properties, to help decrease the uncertainty arising from the application of so many different tools of varying quality. One important step towards this goal may be to reconstruct and synthesise the various stigma theories and reach consensus on what a measure of stigma against mental illness should entail.
The characteristics of the included validated tools are consistent with findings from the scoping review (Wei et al. 2015) that there are few tools (six) assessing people's emotional responses to mental illness. Further, most research was conducted in the USA, and it is not known whether tools applied to this population can be compared with those applied in other countries. Similarly, few tools were validated among secondary school students (n = 8) and teachers (n = 2), a substantial gap given that most mental disorders have their onset between the ages of 12 and 25 (Kieling et al. 2011) and most young people attend school during this period.
Measuring stigma against mental illness is challenging because of social desirability bias, whereby people tend to answer questions in a manner that will be viewed favourably by others (Maccoby & Maccoby, 1954). This bias may seriously jeopardise the validity of findings when a tool is applied. We found that only one of the 101 tools addressed this potential bias, by applying an error-choice response format (Hepperlen et al. 2002). Future applications of stigma tools may need to consider evidence-based approaches to reducing social desirability bias. Recommended techniques include integrating a social desirability scale into the stigma assessment tool, applying random response techniques, disguising the scale's intent or using an indirect questioning approach (Streiner & Norman, 2008).
Based on our findings and informed by the COSMIN checklist, we also have recommendations for researchers to consider. First, psychometric studies need to obtain an adequate sample size and to address missing items for the relevant measurement properties. In addition, checking the unidimensionality of items is as important as reporting Cronbach's alpha or KR-20 in determining the study quality of internal consistency. Further, in examining test–retest reliability, analyses of the independence of test administrations, the appropriateness of the interval between tests and the stability of test conditions are often ignored but matter in improving study quality. When assessing content validity, piloting the items with the target population (n ≥ 10) for comprehensiveness is as important as the item selection process. In analysing structural validity (factor analysis), it is essential that researchers report the variance explained by the factor analysis. When measuring construct validity, studies should formulate hypotheses in advance and pre-define the direction and magnitude of the expected mean differences or correlations to ensure the appropriateness of the analysis.
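For readers unfamiliar with the internal consistency statistic mentioned above, the following is a minimal sketch of Cronbach's alpha, computed as alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores); the function name and the participant responses are hypothetical, for illustration only.

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = scale items."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)     # variance of each item
    total_variance = items.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses of five participants to a four-item stigma scale:
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [5, 5, 4, 5],
    [1, 2, 2, 1],
])
print(round(cronbach_alpha(responses), 2))
```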
The most assessed measurement properties were internal consistency, structural validity and construct validity, while responsiveness was the least studied property and measurement error was not assessed by any included study. Arising from this analysis is the question of what and how many psychometric properties should be included in a psychometric analysis. Although the COSMIN checklist established criteria for nine properties, it is a modular framework that does not require the evaluator to analyse all nine. However, informed by the findings of this review, it is reasonable to propose that the validation of a tool should at least analyse whether: the tool items are appropriately related (internal consistency); the tool is reliable over time (test–retest reliability); and the tool's constructs are adequately established (structural and construct validity).
Additionally, when a tool is to be applied in a culturally different setting, its cross-cultural validity has to be evaluated prior to application. The lack of cross-culturally validated tools (only four) makes cross-cultural conclusions about stigma against mental illness difficult, if not impossible. To address cross-cultural validity, researchers should ensure the culturally adapted tool is an adequate reflection of the original. This can be achieved through a number of processes, including: multiple forward and backward translations of the tool, with a committee to review the final translation; a pre-test of the tool with the target population to check cultural relevance; and testing of the hypothesised factor structure with confirmatory factor analysis, as sketched below.
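As an illustration of the final step, a confirmatory factor analysis of a translated tool's hypothesised structure might look like the following minimal sketch. It assumes the third-party semopy package; the factor names, item names and data file are hypothetical.

```python
import pandas as pd
import semopy  # third-party structural equation modelling package (assumed installed)

# Hypothetical two-factor structure for a culturally adapted stigma scale;
# factor and item names are illustrative only.
model_desc = """
perceived_stigma =~ item1 + item2 + item3
self_stigma      =~ item4 + item5 + item6
"""

# Hypothetical file of pre-test responses to the translated items.
data = pd.read_csv("translated_scale_responses.csv")

model = semopy.Model(model_desc)
model.fit(data)
print(model.inspect())           # factor loadings and standard errors
print(semopy.calc_stats(model))  # fit indices such as CFI and RMSEA
```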
Limitations
Our review is limited in that it excluded non-English publications (25 potentially relevant non-English citations were identified at the title and abstract screening stages) and may therefore have missed some otherwise eligible studies. Secondly, the COSMIN checklist, although the only available critical appraisal approach of its kind, may not be the most appropriate, because it was originally designed for health status questionnaires.
Conclusions
This is the first systematic review to investigate the study quality and overall level of evidence of tools evaluating the stigma of mental illness. We categorised the included tools and provided rich evidence on the psychometric properties of current stigma measurement tools so that researchers and decision makers can choose the best available tools for use in practice. However, whatever tools researchers or decision makers choose, we recommend that they continue to validate tools in different settings to ensure that these tools can be appropriately used in numerous different contexts and populations.
Acknowledgements
We would like to acknowledge that this study is supported by Yifeng Wei's Doctoral Research Award – Priority Announcement: Knowledge Translation/Bourse de recherche, issued by the Canadian Institutes of Health Research. Dr McGrath is supported by a Canada Research Chair. In addition, we would like to thank Ms Catherine Morgan and Ms Michelle Xie for their help with data collection and analysis, and the health librarian, Ms Robin Parker, who helped design the search strategies for this review.
Conflict of Interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Ethical Standards
Ethics committee approval was not applicable to this review.
Availability of Data and Materials
Owing to the large amount of data (risk of bias analyses and quality ratings for each measurement property across 117 studies), we will share these materials upon request.
Appendix 1: Search strategies in PubMed
Appendix 2: Quality criteria of measurement properties (Terwee et al. 2007)
Appendix 3: Levels of evidence for the overall quality of the measurement property (Uijen et al. 2012; Furlan et al. 2015)