Introduction
A clinician from any discipline selects a treatment for a patient based on evidence from clinical trials. The clinician applies the evidence on the assumption that the patient has a given disease and that available treatments produce an outcome (response, remission or failure to respond) for that disease. We will argue that much of the current clinical trial evidence in psychiatry relies on the assumption that diagnosis is an adequate proxy for a disease or disorder, and that this leads us to use an inappropriate model of outcome (Joyce et al. 2017). The result is evidence that informs us only of the average response for a group of patients presumed to be homogeneous with respect to their categorical diagnosis. This may also explain the limited changes in prescribing practices after the publication of large trials (Berkowitz et al. 2012).
Figure 1-1 describes a model of the relationship between trial outcomes and the underlying disorder; a concrete example being chronic systemic inflammatory disorders such as rheumatoid arthritis, where a disease process (DP: autoimmune-mediated inflammation) is reflected in a disease state (S: pain symptoms, inflammatory changes in joints, biochemical changes) that can be quantified by instruments (Y: pain and activity function scales; serological erythrocyte sedimentation rate, rheumatoid factor and anti-cyclic citrullinated peptide; radiological evidence of joint changes) and for which outcomes can be defined as changes in those instruments (Z: differences in pain and function scales, changes in serological markers and reduction in joint injury). When a patient is treated, disease states (S) change, and if the instruments (Y) are sensitive to these changes, they can be subjected to statistical methods that establish treatment efficacy (response, or failure to respond) by e.g. defining thresholds on Z, and identifying which patient-specific factors, X, mediate response.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20171221012537181-0545:S0033291717001726:S0033291717001726_fig1g.gif?pub-status=live)
Fig. 1. (1, left) The typical model of a clinical trial in medicine. (2, right) The typical model as applied to psychiatry, where there is no clear link between disease state (S) and disease process (DP); consequently, both are usually replaced by a diagnostic category (Dx). Arrows indicate dependence relations between variables; for example, disease state (S) depends on the DP.
In the idealised model shown in Fig. 1-1, for a given disorder or DP there will be a disease state (S) that corresponds with that disorder, though not necessarily in a one-to-one relationship. The discovery of disease states and instruments that have predictive power to identify diagnoses is the domain of biomarker and psychometric research; see, for example, Marquand et al. (2016) for discussion of statistical methodology. Here, we require that one or more instruments (Y) quantify variables of the disease state (S) at a given time, and we take this collection of variables to be a vector identifying a location in a multidimensional space (discussed below; see also Joyce et al. 2017 for a more detailed discussion). It is, however, common practice in clinical trials to aggregate the variables in an instrument, Y, to obtain, e.g., a total ‘score’ measuring the severity of the patient's disease state at a given time (i.e. pre- and post-intervention). Finally, an outcome (Z) must capture changes measured by instruments (Y) that are clinically meaningful. We recognise that the terms ‘disorder’, ‘disease’ and so forth can be contentious in mental health, but we adopt them by convention throughout this paper. Further, as the analogy above suggests, we adopt a position consistent with biological realism (Kendler, 2016) regarding the nature of psychiatric disorders.
Psychiatry does not benefit from as clear a correspondence between disease state (S) and DP, nor are there instruments (Y) analogous to the erythrocyte sedimentation rate or the radiological evidence of joint changes used in rheumatoid disease. Consequently, psychiatry is faced with the model shown in Fig. 1-2, where a diagnostic category (Dx) such as schizophrenia or bipolar affective disorder replaces DP.
In psychiatry, the instruments (Y) that measure disease state are multivariate scales that capture the severity of signs and symptoms; for example, in psychotic disorders, the Positive and Negative Syndrome Scale (PANSS) or the Brief Psychiatric Rating Scale (BPRS) (Overall & Gorham, 1962; Kay et al. 1987). Psychiatric diagnoses represent constellations of signs and symptoms, but these can overlap between diagnoses: for example, psychotic features such as auditory verbal hallucinations are common to schizophrenia, bipolar disorder and major depressive disorder (Toh et al. 2015), as well as to borderline personality disorder (Nishizono-Maher et al. 1993; Barnow et al. 2010; Glaser et al. 2010; Schroeder et al. 2013). Bipolar disorder and schizophrenia also show similarities in non-verbal communication (Annen et al. 2012), affective symptoms (Keshavan et al. 2011) and cognitive deficits (Green, 2006; Jabben et al. 2009).
There is also consensus, for example, that the diagnosis of schizophrenia is not a single DP, but rather a categorical label for a syndrome with different aetiologies (Walker et al. 2002; Jablensky, 2006; Demjaha et al. 2009; Demjaha et al. 2012; Ripke et al. 2014; Reininghaus et al. 2016) and with genetic risk factors shared with other diagnoses such as bipolar disorder (Craddock et al. 2009; Lichtenstein et al. 2009; Purcell et al. 2009). There is progress in trying to parse diagnostic categories along phenotypes, endophenotypes, biomarkers and underlying cellular and molecular aetiologies (e.g. Insel et al. 2010; Morris & Cuthbert, 2012; Cuthbert & Insel, 2013; Schumann et al. 2014).
Currently, clinical trials in psychiatry have to contend with the lack of a clear relationship between disease state and process. Patients are therefore recruited into trials on the basis of their diagnostic category (Dx), and treatment efficacy is established using outcomes (Z), usually dichotomous, defined as threshold changes in aggregates of instruments (Y). For example, in mood disorders, remission of symptoms can be defined as a summed (total) Hamilton depression scale score of ⩽7 for at least 2 months (Frank et al. 1991); similarly, for schizophrenia, response can be defined as a 50% reduction in baseline PANSS or BPRS score (Leucht et al. 2009; Jakubovski et al. 2015). In the language of statistics, disease state and process (S and DP) are latent (hidden, unmeasured) variables that are largely ignored.
We will show that the model exemplified in Fig. 1-2 results in patients being assigned the same outcome (Z) if aggregates on instruments (Y) are defined without attending to differences in disease states (S). For example, using the PANSS instrument, patients with high positive but low negative symptom severity risk being equated with patients who have low positive but high negative symptom severity. Given the uncertainty in relationships between disease states and processes, ignoring how disease states differ between patients means we are effectively failing to identify groups of patients that may benefit from an intervention. Just as importantly, we may also be exposing patients to treatments, and their side-effects, that are not effective for their specific manifestation of the disorder.
Our proposal is that, instead of assuming the model of Fig. 1-2, adopting the model shown in Fig. 1-1 allows outcomes (Z) to be derived by directly attending to differing disease states as they are measured and represented by instruments (Y). For example, at a given time, the PANSS instrument measures 30 individual symptoms (a measure of disease state), which can change over time, for example in response to intervention. We note that a single instrument may not capture disease state with enough fidelity to have analytical utility; for example, we may augment PANSS with a measure of affective symptoms or with instruments measuring social and occupational functioning. This has the potential to expose treatments that are effective for some patients (e.g. those with a certain profile of positive, negative and general symptoms) and avoids measuring efficacy as the ‘average’ response for a presumed homogeneous diagnostic category.
Weak aggregate outcomes
To illustrate the inherent problems with current definitions of outcomes, consider the PANSS scale, which measures 30 individual variables. Used directly as an outcome, these are analytically intractable, because clinical trial statistical methodology requires a univariate (one-dimensional or scalar) measure of change, e.g. in response to treatment, with respect to multiple predictors (i.e. patient-specific factors, X, in Fig. 1). Tractability is obtained by ‘collapsing’ these 30 variables into a single aggregate by summation; the outcome measure, Z, then summarises clinically meaningful change (e.g. response or remission) as a threshold change in this sum. We refer to this approach as weak aggregation. For example, in mood disorders, remission of symptoms can be defined as a summed (total) Hamilton depression scale score of ⩽7 for at least 2 months (Frank et al. 1991). In schizophrenia, there are proposals for aggregating variables in a more structured way (Andreasen et al. 2005); these represent thresholds on selected combinations of scale variables that are believed to be clinically meaningful. However, clinical trials require a single primary outcome across the whole participant group, and secondary outcomes are used to measure subtler variation of response in sub-groups of the study population. Using a number of secondary outcomes carries the cost of making the analysis vulnerable to criticism for false positives (Type I error). The most common approach to statistical analysis of this scalar outcome Z is to use some variant of the generalised linear model (GLM) with a set of predictors X (McCullagh & Nelder, 1989).
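Weak aggregation as described above amounts to two steps: sum the item scores, then dichotomise the sum at a threshold. The sketch below shows this for a toy 7-item scale (hypothetical item scores, not a real Hamilton administration) with the ⩽7 remission cut-off; the duration criterion is deliberately ignored, and the function names are illustrative.

```python
import numpy as np

def weak_aggregate(items: np.ndarray) -> np.ndarray:
    """Weak aggregate: the plain sum of item scores for each patient (row)."""
    return np.asarray(items).sum(axis=-1)

def remission(items: np.ndarray, threshold: float = 7) -> np.ndarray:
    """Dichotomous outcome Z: True where the summed score is at or below threshold.

    A single assessment is used; the duration criterion
    (e.g. 'for at least 2 months') is ignored in this sketch.
    """
    return weak_aggregate(items) <= threshold

# Two hypothetical patients on a toy 7-item scale with very different profiles.
patients = np.array([
    [2, 2, 2, 1, 0, 0, 0],   # symptoms concentrated in the first items
    [0, 0, 0, 1, 2, 2, 2],   # symptoms concentrated in the last items
])
print(weak_aggregate(patients).tolist())  # [7, 7]
print(remission(patients).tolist())       # [True, True]
```

Both patients receive the same total and the same dichotomous outcome, even though their item profiles are mirror images of each other.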
One consequence of the relationships described above, and in Fig. 1, is that defining a primary outcome (Z) by thresholds on a weak aggregate of multiple variables (Y) ‘collapses’ information about patients’ disease state (S) into a single univariate, scalar value. This may obscure important discriminating information that (optimistically) speaks to the DP being treated by an intervention (as in the example given above of patients with opposing patterns of positive and negative symptoms). This is especially problematic in psychiatry, where disease states and processes stand in a many-to-many relationship and we have traditionally used diagnostic category (Dx) as a proxy for both.
As a more detailed illustrative example, consider disease state measured by the three domains of the PANSS instrument: the total positive (P) and negative (N) symptom scores each range from 7 (no symptoms) to 49 (severe), and the general psychopathology domain (G) ranges from 16 to 112. To derive an outcome Z, consider the weak aggregate that is the sum of the total positive and negative symptom scores, Z = P + N. Many combinations of P and N (each combination representing a discrete disease state) yield the same value of Z. For example, a patient with P = 23 and N = 44 has Z = 67, whereas another patient with P = 38 and N = 29 has the same aggregate outcome Z = 67, despite these measurements representing quite different disease states: the first patient has high negative (but low positive) symptom severity and the second has the opposite pattern. Ignoring these differences, as the weak aggregate sum Z does, results in an outcome measure that cannot differentiate between disease states that may be clinically distinct and meaningful. In this example, 32 combinations of values of P and N yield the sum 67 (P can take any value from 18 to 49, with N = 67 − P), and therefore 32 disease states would be assigned the same outcome under the weak aggregate Z = P + N. The problem grows combinatorially with the number of variables: defining the aggregate over the positive, negative and general scales (Z = P + N + G) yields 741 discrete combinations of P, N and G with a total score of Z = 67. Although some combinations assigned Z = 67 will be clinically similar (for example, patients with (P, N, G) = (18, 25, 24) and (18, 26, 23) are suitably alike), in general many will not be. Weak aggregation therefore blindly collapses variables from instruments such as PANSS into a single scalar variable, ignoring clear differences in disease state.
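The counts quoted above can be checked by brute-force enumeration over the domain ranges of the PANSS (P and N from 7 to 49, G from 16 to 112). This is a small verification sketch; the helper name `states_with_total` is illustrative.

```python
from itertools import product

# PANSS domain score ranges (inclusive), as given in the text.
P_RANGE = range(7, 50)    # positive: 7-49
N_RANGE = range(7, 50)    # negative: 7-49
G_RANGE = range(16, 113)  # general psychopathology: 16-112

def states_with_total(total, *ranges):
    """All combinations of domain scores whose sum equals `total`."""
    return [combo for combo in product(*ranges) if sum(combo) == total]

two_domain = states_with_total(67, P_RANGE, N_RANGE)
three_domain = states_with_total(67, P_RANGE, N_RANGE, G_RANGE)
print(len(two_domain))    # 32
print(len(three_domain))  # 741
print((23, 44) in two_domain, (38, 29) in two_domain)  # True True
```

Every one of these distinct disease states receives the identical weak aggregate Z = 67.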
Strong aggregate outcomes
There is an additional problem with weak aggregation in psychiatry, illustrated in Fig. 1-2, when DP and disease state are left unmodelled. In standard analyses using GLMs, it is by definition only the mean of the aggregate outcome (Z) that is modelled as a function of the predictors (X). All patients in the GLM analysis are therefore assumed to have effectively the same disorder, which responds according to a unimodal, average response over the whole trial population; yet we know that response to an intervention is rarely uniform across patients with psychiatric disorders. A relevant analogy is given in fibromyalgia (Moore et al. 2010; Moore et al. 2013), where response to treatment with pregabalin is bimodal: some patients have a clinically significant reduction in pain but others show little or no response at all.
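The consequence of modelling only the mean can be seen with a toy bimodal outcome. The change scores below are hypothetical, constructed to mimic two response modes; they are not drawn from any trial. For an intercept-only Gaussian GLM, the fitted value is simply the sample mean, and in this example no individual patient actually experiences a change close to it.

```python
import numpy as np

# Hypothetical change scores (post minus baseline; negative = improvement) for a
# bimodal population: 50 strong responders and 50 near-non-responders.
responders = np.linspace(-25, -15, 50)
non_responders = np.linspace(-5, 5, 50)
change = np.concatenate([responders, non_responders])

# An intercept-only Gaussian GLM would estimate this mean as "the" treatment effect.
mean_change = change.mean()

# Count patients whose own change lies within 4 points of that mean.
near_mean = np.abs(change - mean_change) <= 4
print(f"mean change {mean_change:.1f}; patients within 4 points of it: {near_mean.sum()}")
```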
Our proposal for strong aggregates is as follows. Firstly, to retain the strengths and tractability of current statistical methodology (e.g. GLMs), we must find univariate measures, but avoid simply ‘collapsing’ measurements of disease state into a single scalar that may not expose or reflect clinically meaningful differences (e.g. the problem exposed in the example above for positive and negative symptoms). This requires a way of representing differences in disease state, as measured by instruments Y, so that the outcome Z reflects clinically relevant differences in the inevitable variation in response between patients, i.e. different ‘modes’ of response. An additional (but less essential) requirement is that the outcome can also be used to index patients who share similar disease states. This might, for example, be suitable for use in N-of-1 trial designs (Schork, 2015) to collect naturalistic evidence for treatments in the absence of a complete understanding of DP.
Clinical example
To illustrate our proposal, consider Fig. 2 (colour versions can be found in the online Supplementary Information), which shows the disease states of 1459 individual patients as measured by the PANSS domain scores (positive, negative and general symptoms) at the baseline assessment of the Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) trial (Stroup et al. 2003). We use three variables purely to ease visualisation and exposition of the key concepts; we have summed the items in the PANSS to obtain three domains, but the principles generalise and do not require summed domains. Indeed, in practice it would be prudent to use factor-analytic decompositions of these domains (see, for example, Lindenmayer et al. 1995; Wallwork et al. 2012), but to keep our explanation simple and easy to visualise, we restrict ourselves to the three original groupings of signs and symptoms in PANSS. In our example, the PANSS domains form a three-dimensional space in which each patient is represented by a point located on orthogonal (perpendicular) axes representing low-to-high severity on the positive (P), negative (N) and general (G) domains. Visualising this space is difficult, so, following standard practice in multivariate analysis, we present three ‘views’ obtained by plotting each pair of P, N and G in two-dimensional planes: P × N, G × P and N × G, shown in Figs. 2-1, 2-2 and 2-3 respectively.
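Summing PANSS items into the three domains places each patient at a point in the (P, N, G) space. A minimal sketch follows, assuming the 30 item ratings (each 1 to 7) are stored in the order P1 to P7, N1 to N7, G1 to G16; that column order is an assumption about how the data are laid out, and `panss_domains` is an illustrative name.

```python
import numpy as np

# Assumed item order: P1-P7, N1-N7, G1-G16 (30 items, each rated 1-7).
POSITIVE = slice(0, 7)
NEGATIVE = slice(7, 14)
GENERAL = slice(14, 30)

def panss_domains(items: np.ndarray) -> np.ndarray:
    """Map an (n_patients, 30) array of PANSS item ratings to (P, N, G) domain sums."""
    items = np.asarray(items)
    if items.shape[-1] != 30:
        raise ValueError("expected 30 PANSS items per patient")
    if items.min() < 1 or items.max() > 7:
        raise ValueError("PANSS items are rated 1-7")
    return np.stack([
        items[..., POSITIVE].sum(axis=-1),
        items[..., NEGATIVE].sum(axis=-1),
        items[..., GENERAL].sum(axis=-1),
    ], axis=-1)

# Each row of the result is one patient's location in the 3-D (P, N, G) space.
lowest = np.ones((1, 30), dtype=int)
highest = np.full((1, 30), 7)
print(panss_domains(lowest).tolist())   # [[7, 7, 16]]
print(panss_domains(highest).tolist())  # [[49, 49, 112]]
```

The two-dimensional ‘views’ in Fig. 2 are then scatter plots of pairs of these three columns.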
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20171221012537181-0545:S0033291717001726:S0033291717001726_fig2g.gif?pub-status=live)
Fig. 2. 1459 patients represented as positive, negative and general psychopathology scores under a weak aggregation scheme.
We then define four prototypes, proposed to represent hypothesised disease states at clinically meaningful or interesting locations in this space. Prototype A represents disease states of low severity across positive, negative and general symptoms (i.e. a relatively well patient). Prototype B represents the opposite extreme: a patient who is globally unwell, with high symptom severity across positive, negative and general symptoms. Prototype C represents a patient with relatively high positive but low negative and relatively low general symptom severity. Prototype D represents a disease state with relatively low positive but high negative and general symptom severity. Note that in defining prototypes, we are specifying structure with respect to the actual study population that can be exploited to define an outcome that preserves (rather than collapses) clinically relevant information, as required by our definition of strong aggregates. Patients with disease states near prototype A (relatively well) are clearly very different from those near B (globally unwell); similarly, patients near C and D are far from ‘well’, but their disease states reflect different patterns of symptoms.
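Prototypes are simply chosen points in the (P, N, G) space, so patients can be related to them by distance. The coordinates below are hypothetical, since no numeric values are given here, and Euclidean distance is an assumed choice of metric. The sketch labels each patient with the nearest prototype, using the two patients from the earlier P + N = 67 example with an assumed general score of 40.

```python
import numpy as np

# Hypothetical prototype coordinates in (P, N, G) space; illustrative only.
PROTOTYPES = {
    "A": np.array([10, 10, 25]),   # relatively well
    "B": np.array([40, 40, 90]),   # globally unwell
    "C": np.array([35, 12, 40]),   # high positive, low negative
    "D": np.array([12, 35, 45]),   # low positive, high negative
}

def distances_to_prototypes(states: np.ndarray) -> dict:
    """Euclidean distance from each patient (row of `states`) to each prototype."""
    states = np.atleast_2d(states)
    return {name: np.linalg.norm(states - proto, axis=1) for name, proto in PROTOTYPES.items()}

def nearest_prototype(states: np.ndarray) -> list:
    """Label each patient with the name of the closest prototype."""
    d = distances_to_prototypes(states)
    names = list(d)
    stacked = np.stack([d[n] for n in names], axis=1)  # (n_patients, n_prototypes)
    return [names[i] for i in stacked.argmin(axis=1)]

# Same P + N total (67), assumed general score of 40 for both patients.
patients = np.array([[23, 44, 40], [38, 29, 40]])
print(nearest_prototype(patients))  # ['D', 'C']
```

Patients who are indistinguishable under Z = P + N fall near different prototypes once their full disease state is retained.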
Measures derived from weak aggregates
Figure 2-4 shows the resulting univariate (i.e. one-dimensional) scale of the summed weak aggregate (Z = P + N + G), with the location of each patient and of the four prototypes along this scalar measure given by their respective scores Z. Of note, the relatively well (A) and globally unwell (B) prototypes are well demarcated (i.e. distant) at either end of the univariate scale, but C and D are less so. This stands in contrast to their relative positions in the original space, where C, D and B are well separated along the positive and negative symptom dimensions (Fig. 2-1), and C and D are distinguished along both the general and positive dimensions (Fig. 2-2) and the negative and general dimensions (Fig. 2-3). The critical point is that a weak aggregate can only discriminate between well and unwell patients. This is emphasised in the original three-dimensional space, where each point (a single patient) in Figs. 2-1, 2-2 and 2-3 is shaded light-to-dark according to its weak (summed) aggregate score Z (matching the greyscale of the points and scale bar shown in Fig. 2-4). Notice how the gradient from light to dark (reflecting low to high Z) runs in a broadly uniform direction (bottom left to top right) in each of the three views, enabling distinction between A and B, but less so for C, and poorly for B and D. This demonstrates graphically how weak aggregates collapse and obscure meaningful distinctions between potentially different disease states unless those states remain well separated on the univariate scale of Z (Fig. 2-4).
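The collapse of C and D onto nearby values of Z can be quantified by comparing their gap on the weak aggregate with their Euclidean distance in the original space. The coordinates are hypothetical, chosen only to mimic the qualitative picture described for Fig. 2.

```python
import numpy as np

# Hypothetical prototype coordinates (P, N, G); illustrative only.
C = np.array([35, 12, 40])   # high positive, low negative
D = np.array([12, 35, 45])   # low positive, high negative

weak_gap = abs(C.sum() - D.sum())   # separation on the weak aggregate Z = P + N + G
true_gap = np.linalg.norm(C - D)    # separation in the original 3-D space

print(weak_gap)                     # 5
print(round(float(true_gap), 1))    # 32.9
```

Two disease states more than 30 units apart in the original space end up only 5 units apart on the summed scale, because their differences in P and N cancel in the sum.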
Measures derived from strong aggregates
We now consider how the structure displayed in the prototypes can be captured in a way that enables a univariate outcome measure to be derived while preserving meaningful distinctions between the prototypes, i.e. strong aggregation. The method we used was singular value decomposition (SVD), which is closely related to principal components analysis (Strang, 2004); it embeds a high-dimensional space into a lower-dimensional representation and, in this case, exposes properties of interest (e.g. the separation of clinically relevant prototypes). SVD is one of many possibilities for this embedding transformation; the key requirements of any chosen method are dimensionality reduction with distance preservation (isometry), and other approaches include multidimensional scaling (Krzanowski, 2000), isomap embedding (Tenenbaum et al. 2000), locally linear embedding (Roweis & Saul, 2000) and self-organising maps (Kohonen, 1995). Essentially, rather than ‘blindly’ summing, we combine or map the variables of Y (equivalently, the axes in Fig. 2) such that clinically relevant regions of that space (the prototypes in Fig. 2) are mapped onto sufficiently different values of the univariate aggregate Z. After applying SVD to the same patients and prototypes shown in Fig. 2, we are able to find a new, univariate strong aggregate that preserves the proposed clinically relevant differences in disease state exemplified by the prototypes (details are given in the online Supplementary Information).
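The general idea can be sketched with NumPy's `np.linalg.svd`; this is not the authors' exact procedure, which is given in the Supplementary Information. Assumptions: the data (patients plus prototypes) are mean-centred before the SVD; the prototype coordinates are hypothetical; the patient data are synthetic, not CATIE data; and the strong-aggregate axis is taken to be the singular vector along which prototypes C and D are furthest apart. With all components kept, the SVD embedding is a rotation, so distances are preserved before a single axis is selected.

```python
import numpy as np

# Hypothetical prototype coordinates in (P, N, G) space (illustrative only).
PROTOTYPES = {
    "A": [10, 10, 25], "B": [40, 40, 90],
    "C": [35, 12, 40], "D": [12, 35, 45],
}

def fit_svd(data: np.ndarray):
    """Centre the data; return (mean, vt), rows of vt being the right singular vectors."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt

def embed(states, mean: np.ndarray, vt: np.ndarray) -> np.ndarray:
    """Coordinates along every singular axis; keeping all axes is a pure rotation."""
    return (np.atleast_2d(states) - mean) @ vt.T

def strong_axis(proto_coords: dict, a: str, b: str) -> int:
    """Assumed rule: index of the axis along which prototypes a and b are furthest apart."""
    return int(np.argmax(np.abs(proto_coords[a] - proto_coords[b])))

# Synthetic stand-in for patient data (NOT trial data), seeded for reproducibility.
rng = np.random.default_rng(0)
patients = rng.uniform([7, 7, 16], [49, 49, 112], size=(300, 3))

# Fit on patients and prototypes together.
protos = np.array(list(PROTOTYPES.values()), dtype=float)
mean, vt = fit_svd(np.vstack([patients, protos]))
proto_coords = {k: embed(v, mean, vt)[0] for k, v in PROTOTYPES.items()}

k = strong_axis(proto_coords, "C", "D")
z_strong = embed(patients, mean, vt)[:, k]   # univariate strong aggregate per patient

gap_strong = abs(proto_coords["C"][k] - proto_coords["D"][k])
gap_weak = abs(sum(PROTOTYPES["C"]) - sum(PROTOTYPES["D"]))
print(f"C-D separation: strong {gap_strong:.1f}, weak {gap_weak}")
```

Because the full rotation preserves the C to D distance, the axis with the largest C to D separation must carry at least 1/√3 of that distance, which is why the strong aggregate separates C and D far better than the sum does.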
Figure 3-4 shows the shape of the new strong univariate aggregate Z (using a soft rather than hard threshold scheme), with the new values of the prototypes illustrated. Crucially, this aggregate separates the clinically relevant prototypes C and D (compare the weak aggregate shown in Fig. 2-4). In Figs. 3-1, 3-2 and 3-3 (colour versions are reproduced in the online Supplementary Information), the patients in Fig. 2 are assigned new values according to the strong aggregate Z (light grey = low Z, dark grey = high Z). Note how the greyscale gradient differs from that in Fig. 2: patients' scores now vary with their position along a line dividing C and D.
![](https://static.cambridge.org/binary/version/id/urn:cambridge.org:id:binary:20171221012537181-0545:S0033291717001726:S0033291717001726_fig3g.gif?pub-status=live)
Fig. 3. 1459 patients represented as positive, negative and general psychopathology scores under a strong aggregation scheme.
Discussion and conclusions
In this paper, we have discussed how clinical trials in psychiatry have to cope with uncertain relationships between the treatment, disease state and process, and how this has potentially hindered clinical trial research. We have further illustrated how relationships among patients can be anchored to prototypes of clinical interest in the space of disease states measured by clinical instruments. By exploiting this geometric structure, such anchoring may provide more useful information when defining outcomes. Our example used the PANSS as an exemplar instrument, chosen for its general clinical familiarity, but the principles extend to instruments for other disorder areas (e.g. affective disorders and the Hamilton Depression Scale).
We suggest increased use of strong aggregates because they capture important structure between regions of the measured disease state that weak aggregates ignore by blindly summing (or averaging), which, as we have shown, causes patients with different disease states to be mapped onto the same aggregate value. We used one specific method (SVD) to define a strong aggregate, but any similar method that captures clinically relevant structure over regions of the space of disease states, and then assigns a single univariate (scalar) variable that preserves the distances between these regions, would be suitable. Importantly, we defined prototypes only with respect to the actual patient population, remaining agnostic to the actual, unknown DP or classical diagnostic category, but specifying tentative clinically relevant disease states. The resulting strong aggregate is univariate and therefore compatible with current statistical methodology. We now consider specific implications for trial methodology, design and analysis.
Prototype and response definition
The crux of our proposal is that univariate strong aggregates expose differences between relevant prototypes. In the example provided earlier, we chose four prototypes in the three-dimensional space of positive, negative and general symptoms, representing globally well (A) and globally unwell (B) patients, as well as predominantly positive (C) and predominantly negative (D) symptoms. As a further example of a priori prototypes, the remission criteria defined by Andreasen et al. (2005) can be interpreted in our framework as follows: one dimension measures reality distortion (R, the sum of items P1, G9 and P3), another captures disorganisation (D, the sum of P2 and G5) and another captures negative symptoms (N, the sum of N1, N4 and N6). This similarly forms a three-dimensional space with axes R, D and N that can be visualised in the same way as the examples presented earlier. There are then two relevant prototypes, capturing the two extremes of full (best) and no (worst) remission. The participants and prototypes are transformed into a space V (e.g. by singular value decomposition; see online Supplementary Information, Fig. S3), which can then be inspected to find the single univariate dimension, V*, that best exposes the gradient between these two extreme prototypes. Prior to the trial intervention, each participant is assigned a value (the strong aggregate), defined as their location along this dimension V* (as in Fig. 3-4 above), which defines the participant's pre-intervention remission state. At the end of the trial, each participant's post-treatment values of R, D and N are transformed into V, and their positions along V* are ‘read off’, giving each participant's strong aggregate measure of remission state after intervention. Alternatively, a ‘hard’ threshold over V* can be defined, for example if the strong aggregate is to be used in a binary logistic or probit GLM analysis.
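The workflow just described (build R, D and N from items, fit V on baseline data plus prototypes, choose V*, read off pre- and post-treatment positions, optionally threshold) can be sketched as follows. Several choices here are assumptions rather than details from the text: the best and worst prototypes are taken as every item rated 1 and every item rated 7; V* is the singular vector with the largest separation between them, signed so that the worst prototype scores higher; the hard threshold is placed at the midpoint between the prototypes; and the participants are synthetic. Function names are illustrative.

```python
import numpy as np

# Items contributing to each dimension, as listed in the text.
DIMENSIONS = {
    "R": ["P1", "G9", "P3"],   # reality distortion
    "D": ["P2", "G5"],         # disorganisation
    "N": ["N1", "N4", "N6"],   # negative symptoms
}

def to_rdn(ratings: dict) -> np.ndarray:
    """Sum item ratings (1-7) into the (R, D, N) coordinates of one participant."""
    return np.array([sum(ratings[i] for i in items) for items in DIMENSIONS.values()], dtype=float)

# Assumed extreme prototypes: every item at its minimum (1) or maximum (7) rating.
ALL_ITEMS = [i for items in DIMENSIONS.values() for i in items]
BEST = to_rdn({i: 1 for i in ALL_ITEMS})    # full remission: (3, 2, 3)
WORST = to_rdn({i: 7 for i in ALL_ITEMS})   # no remission: (21, 14, 21)

def fit_v_star(baseline: np.ndarray):
    """Fit V on baseline participants plus prototypes; return (mean, unit axis V*)."""
    data = np.vstack([baseline, BEST, WORST])
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    gaps = (WORST - BEST) @ vt.T             # prototype separation on each axis
    k = int(np.argmax(np.abs(gaps)))
    return mean, vt[k] * np.sign(gaps[k])    # orient so WORST scores above BEST

def read_off(states, mean: np.ndarray, axis: np.ndarray) -> np.ndarray:
    """Strong aggregate: position of each (R, D, N) state along V*."""
    return (np.atleast_2d(states) - mean) @ axis

# Synthetic participants (NOT trial data): baseline and post-treatment (R, D, N).
rng = np.random.default_rng(1)
baseline = rng.uniform(BEST, WORST, size=(200, 3))
post = np.clip(baseline - rng.uniform(0, 6, size=(200, 3)), BEST, WORST)

mean, axis = fit_v_star(baseline)
z_pre = read_off(baseline, mean, axis)
z_post = read_off(post, mean, axis)

# Optional hard threshold (hypothetical choice: midpoint between the prototypes).
cut = read_off((BEST + WORST) / 2, mean, axis)[0]
remitted = z_post < cut
print(f"remitted after treatment: {remitted.sum()} of {len(remitted)}")
```

The continuous `z_pre` and `z_post` correspond to the ‘soft’ strong aggregate, while `remitted` is one way of producing a binary outcome for a logistic or probit GLM.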
We note that, in principle, disease states and prototypes need not be restricted to a three-dimensional space; three dimensions are used here only for ease of exposition. The key step is to identify the single dimension in the transformation (e.g. by SVD) that exposes the differences between prototypes.
Power and recruitment
Calculating a priori the sample size required for an adequately powered trial requires the distribution of the data, the expected effect size (difference in means) or response rate, and measures of variance. Therefore, just as for weak aggregates used as outcomes, previous studies or pilot data can inform these assumptions. Where such data are available, the assumed distribution, means or response rates, and variance should be justified by applying the proposed strong aggregate definition to them. Our proposal for strong aggregates is motivated by the idea that weak aggregates may fail to capture meaningful differences. To the extent that strong aggregates capture them, they have the potential to increase the power of a given study design.
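The calculation can be sketched in Python using only the standard library. The function below uses the standard normal-approximation formula for the per-arm sample size of a two-sided, two-sample comparison of means: n = 2(z_{1-α/2} + z_{1-β})²σ²/Δ². Here Δ and σ are estimated from pilot strong aggregate scores. The pilot values are invented. This is only one of several possible power calculations; a t-based or simulation-based calculation would be more accurate for small samples.

```python
from math import ceil
from statistics import NormalDist, mean, stdev

def pooled_sd(a, b):
    """Pooled standard deviation of two independent samples."""
    na, nb = len(a), len(b)
    return (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)) ** 0.5

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Per-arm sample size for a two-sided, two-sample comparison of means.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / delta^2.
    """
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) * sd / delta) ** 2)

# Hypothetical pilot data: change in strong aggregate (post minus pre) per arm.
treatment = [-2.1, -1.4, -3.0, -0.8, -2.5, -1.9]
control = [-0.9, -1.2, 0.3, -1.6, -0.4, -0.7]

delta = abs(mean(treatment) - mean(control))
sd = pooled_sd(treatment, control)
n_pilot = n_per_group(delta, sd)

print(n_per_group(0.5, 1.0))  # 63 per arm for a standardised effect of 0.5
```

With a pilot this small, the estimated Δ and σ are imprecise. The resulting `n_pilot` should therefore be treated as optimistic rather than as a design target.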
Recruitment to a prospective trial need not differ when using strong aggregates. However, they offer a potential advantage because participants are assigned a continuous univariate score based on their relationship to prototypes (see Fig. 3). The prototypes can define not only proposed or desired endpoints (e.g. two prototypes at the extremes of positive symptoms) but also ‘landmarks’ of interest (as we did for prototypes A, B, C and D in the above examples). Participants could then be stratified to treatment by their score in relation to prototypes: for example, those closer to prototype A may be assigned one treatment and those closer to prototype B another. We note a further application in N-of-1 trials, where diagnostic uncertainty is likely to be even more problematic and uncontrolled. In such instances, SVD makes it straightforward to index patients by their similarity to each other and to prototypes. In practice, a new patient attending a clinic with a given disease state (e.g. measured by PANSS) can be compared to previous patients and prototypes using an SVD model (see online Supplementary Information) and assigned a predicted outcome. The patient can then be stratified to treatment if, historically, there were treatments that worked for patients similarly proximal to certain prototypes.
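One way the stratification idea could be sketched is below, in Python with NumPy. All of the following are hypothetical:

- the prototype coordinates;
- the historical patients;
- the treatment labels ("T1", "T2") and outcomes;
- the helper names.

The sketch fits an SVD on historical disease states. It then assigns each patient to the nearest prototype in the first two SVD dimensions and tabulates historical response rates per (prototype, treatment) pair. For a new patient, it suggests the treatment with the best observed rate among patients nearest the same prototype. Keeping two dimensions is an arbitrary choice made for illustration.

```python
from collections import defaultdict

import numpy as np

# Hypothetical prototypes: (positive, negative, general) PANSS subscale totals.
PROTOTYPES = {
    "A": np.array([7.0, 7.0, 16.0]),     # globally well
    "B": np.array([49.0, 49.0, 112.0]),  # globally unwell
    "C": np.array([35.0, 10.0, 40.0]),   # dominantly positive
    "D": np.array([10.0, 35.0, 40.0]),   # dominantly negative
}

def fit_svd(history):
    """Centre historical disease states and return (centre, right singular vectors)."""
    centre = history.mean(axis=0)
    _, _, Vt = np.linalg.svd(history - centre, full_matrices=False)
    return centre, Vt

def embed(states, centre, Vt, k=2):
    """Coordinates of one or more disease states in the first k SVD dimensions."""
    return (np.atleast_2d(states) - centre) @ Vt[:k].T

def nearest_prototype(x, centre, Vt, k=2):
    """Name of the prototype closest to x in the reduced SVD space."""
    names = list(PROTOTYPES)
    P = embed(np.vstack([PROTOTYPES[n] for n in names]), centre, Vt, k)
    dist = np.linalg.norm(P - embed(x, centre, Vt, k), axis=1)
    return names[int(np.argmin(dist))]

def response_table(history, treatments, responded, centre, Vt, k=2):
    """Observed response rate for each (nearest prototype, treatment) pair."""
    counts = defaultdict(lambda: [0, 0])  # [responders, treated]
    for x, t, r in zip(history, treatments, responded):
        key = (nearest_prototype(x, centre, Vt, k), t)
        counts[key][0] += int(r)
        counts[key][1] += 1
    return {key: resp / n for key, (resp, n) in counts.items()}

def suggest(x, table, centre, Vt, k=2):
    """Best historical treatment (and its observed rate) for x's nearest prototype."""
    proto = nearest_prototype(x, centre, Vt, k)
    options = {t: rate for (p, t), rate in table.items() if p == proto}
    if not options:
        return proto, None, None
    best = max(options, key=options.get)
    return proto, best, options[best]

# Hypothetical clinic history: disease states, treatment given, whether it worked.
history = np.array([
    [33, 11, 42], [30, 12, 38], [36, 9, 45],  # positive-dominant
    [11, 33, 41], [12, 30, 39], [9, 36, 44],  # negative-dominant
    [8, 8, 18], [45, 44, 100],                # globally well / unwell
], dtype=float)
treatments = ["T1", "T1", "T2", "T2", "T2", "T1", "T1", "T2"]
responded = [True, True, False, True, True, False, True, False]

centre, Vt = fit_svd(history)
table = response_table(history, treatments, responded, centre, Vt)
new_patient = np.array([31.0, 13.0, 40.0])
print(suggest(new_patient, table, centre, Vt))  # ('C', 'T1', 1.0)
```

With groups this small the observed rates are very unstable. In practice, the predicted outcome would come from a model fitted to adequate historical data rather than from raw proportions.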
Trial data re-analysis
One compelling reason to use strong aggregates is that they guard against multiple secondary analyses in two ways. First, they require prototypes to be defined a priori to capture the proposed disease states (e.g. globally well, or dominant negative symptoms) relevant to the intervention. Second, they provide a univariate measure over these disease states that reflects a given participant's symptoms (or response to a treatment). We suggest this prevents the scenario where, after failing to find a desired result using a weak aggregate primary outcome, secondary analyses are then conducted on ‘subsets’ of participants. Defining prototypes necessarily forces us to consider clinically meaningful states, rather than looking for a global change in e.g. ‘total’ PANSS scores. It also provides a way to define a single measure that captures participants’ disease state relative to these states. There is considerable potential for re-analysis of existing data using this paradigm. To illustrate, we recently conducted a systematic review of trials for treating the cognitive symptoms of schizophrenia registered on ClinicalTrials.gov in the period 2004–2015 (Joyce et al. 2017). We identified a total of 114 studies, but when we specifically examined definitions of primary outcome and available results, only 18 were eligible for inclusion. We then examined how primary outcomes were defined on instruments (e.g. PANSS). Only 4 of the 18 studies considered specific combinations of variables or domain scores instead of using weak aggregates; such combinations are a necessary first step in defining prototypes for strong aggregates. Unsurprisingly, secondary outcomes were often used to understand the multi-dimensional (rather than univariate) measurement of disease state.
In summary, the implications of our proposal are two-fold. First, given our arguments for the importance of disease state, we should reorient clinical trials towards recruiting on the basis of specific symptoms rather than diagnoses. Second, clinical trial analyses should explore and then exploit the heterogeneity of disease states (measured by familiar clinical instruments). They should seek strong aggregates that focus outcomes on specific, relevant definitions of treatment response or failure. This is in place of assuming homogeneity, relying on weak aggregation and settling for the average response for a diagnostic category.
Supplementary material
The supplementary material for this article can be found at https://doi.org/10.1017/S0033291717001726
Acknowledgements
The authors gratefully acknowledge data used in the preparation of this article that resides in the NIH-supported NIMH Data Repositories (Clinical Antipsychotic Trials of Intervention Effectiveness). D.W.J. is funded by a National Institute for Health Research Integrated Academic Training Clinical Lectureship. S.S.S. is supported by a European Research Council Consolidator Award (grant number 311686) and the National Institute for Health Research (NIHR) Mental Health Biomedical Research Centre at South London and Maudsley NHS Foundation Trust and King's College London. The funders had no role in study design, data collection, data analysis, data interpretation or writing of the report.
Declaration of Interests
None.