Yarkoni highlights the disconnect between psychology's descriptive theories and its inferential tests – a problem we argue is exacerbated by inadequate measurement. The primacy of measurement in psychology's history has ebbed and flowed, from the absolute focus on what was observable and quantifiable that defined behaviorist approaches (Hayes & Brownstein, 1986; Skinner, 1963, 1976) to the overreliance on button presses and mouse clicks that characterizes some modern research (Baumeister, Vohs, & Funder, 2007). Today, digital trace data provide new opportunities for rich measurement that captures behavioral, situational, and environmental/contextual factors simultaneously (Lazer et al., 2020; Mischel, 2004). For instance, smartphones are a powerful data source – a collection of sensors and logging routines that we carry with us for large swathes of the day – that psychologists are utilizing to predict a variety of outcomes, from social interaction and personality to mood and general health (Davidson, 2020; Ellis, 2020; Harari et al., 2020; Miller, 2012; Piwek, Ellis, Andrews, & Joinson, 2016; Stachl et al., 2020).
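To make this concrete, the sketch below illustrates how raw smartphone logs might be aggregated into per-person behavioral features for a predictive model. It is a minimal illustration rather than any cited study's pipeline: the event names, durations, and outcome variable are all hypothetical.

```python
# Minimal sketch: turning raw smartphone event logs into per-person
# features for prediction. All data and variable names are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical event log: one row per logged smartphone event.
logs = pd.DataFrame({
    "user_id":    [1, 1, 2, 2, 2],
    "event":      ["call", "app_social", "call", "app_social", "screen_on"],
    "duration_s": [120, 300, 45, 900, 60],
})

# Aggregate raw traces into simple behavioral features per person.
features = logs.pivot_table(index="user_id", columns="event",
                            values="duration_s", aggfunc="sum", fill_value=0)

# Hypothetical self-report outcome (e.g., above-median loneliness).
outcome = pd.Series([1, 0], index=[1, 2], name="lonely")

# With real data one would report cross-validated performance rather
# than in-sample fit; two rows suffice only to show the structure.
model = LogisticRegression().fit(features, outcome)
print(model.predict(features))
```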
Improved methodology alone will not result in rapid progress for the behavioral sciences (see Kaplan, 1964; Uttal, 2001). For example, digital trace data have re-ignited problems with traditional operationalizations of latent variables. Research demonstrating associations between new and old measures often fails to articulate, in advance of an analysis, why a connection between a latent measure (e.g., mood disturbance) and a behavioral (digital) predictor (e.g., keystroke speed) should exist (Davidson, 2020; Zulueta et al., 2018). Without such specification or theory, the focus on prediction over explanation restricts generalizability further. A related challenge is the disconnect between subjective and objective measures (e.g., Taylor et al., 2021), where predictive studies find that their survey data predict an outcome but objective measures do not (Eisenberg et al., 2019). Here, the problem is an overreliance on subjective methodologies to measure both latent and observable constructs. For example, the gold standard for personality measurement relies on surveys (e.g., HEXACO, OCEAN, Big 5) and remains contested (Cattell, 1958; Kagan, 2001). Similarly, other measures, including estimates of everyday behavior, rarely align with reality (Parry et al., 2020). While latent measurement remains core to psychological science, many constructs are developed rapidly, with little standardization, and rely on face validity alone (e.g., “internet addiction,” despite being sardonic in origin, has spawned hundreds of technology addiction scales; Howard & Jayne, 2015). New digital sources need to avoid these issues if they are to prosper.
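The subjective–objective disconnect can be made concrete with a small simulation: even when self-reports correlate with logged behavior, they can carry substantial systematic bias. The sketch below uses simulated numbers chosen only to illustrate the pattern reported by Parry et al. (2020); none of the values come from real data.

```python
# Minimal sketch: comparing self-reported against logged behavior.
# Data are simulated; the bias and noise terms are assumptions.
import numpy as np

rng = np.random.default_rng(0)
logged = rng.normal(180, 60, size=200)            # minutes/day from device logs
reported = logged + rng.normal(45, 50, size=200)  # self-report: bias + noise

r = np.corrcoef(logged, reported)[0, 1]   # association between the measures
bias = np.mean(reported - logged)         # mean over-reporting in minutes

print(f"correlation r = {r:.2f}, mean bias = {bias:.0f} min/day")
# A sizeable correlation can coexist with substantial bias, so
# predictive agreement alone does not establish that the two
# measures capture the same construct.
```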
To illuminate the complex relationship between generalizability and measurement further: observations of behavior via digital traces will often explain (or predict) only part of a broad latent construct. At face value, predicting part of extraversion may appear straightforward from digital recordings of speech or time spent using social apps. However, there are other sub-components of extraversion that these data will struggle to explain (e.g., feeling indifferent to social activities). Other personality factors, such as openness and agreeableness, remain conceptually more challenging to map onto (a single) digital behavior (Hinds & Joinson, 2019; Stachl et al., 2020). Hence, it is critically important that psychology shift away from predictive validity alone as evidence for successful operationalization and parameterization, especially for new data sources (Boyd, Pasca, & Lanning, 2020). Any new digital measure has to be developed incrementally, with researchers first describing how it conceptually aligns with an existing latent construct (Glewwe & van der Gaag, 1990). Assuming that digital traces are behavioral expressions of latent variables, researchers should be able to qualitatively express links at a more general level first, across contexts, before moving to specifics, which would enhance generalizability.
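One way to operationalize this incremental approach is to document, before any analysis, which facets of a construct a given digital trace could plausibly express. The sketch below shows such a mapping for extraversion; the facet labels and candidate traces are illustrative assumptions, not validated measures.

```python
# Minimal sketch: an explicit construct-to-trace mapping declared
# before analysis. Facet names and candidate traces are illustrative
# assumptions, not validated operationalizations.
extraversion_facets = {
    "gregariousness":      ["time_in_social_apps", "colocated_bluetooth_devices"],
    "talkativeness":       ["ambient_speech_minutes", "outgoing_call_count"],
    "activity_level":      ["step_count", "location_entropy"],
    "social_indifference": [],  # no plausible trace identified in advance
}

# Documenting coverage up front makes explicit which parts of the
# construct a digital measure can, and cannot, speak to.
covered = [f for f, traces in extraversion_facets.items() if traces]
print(f"facets with candidate traces: {len(covered)}/{len(extraversion_facets)}")
```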
Of course, refocusing on actual behavior via digital traces will not be a panacea. Some digital traces may be “objective,” but they are rarely error-free (Sen, Floeck, Weller, Weiss, & Wagner, 2019). For example, a microphone-based audio classifier can detect whether ambient conversations are taking place around an individual, but it may not distinguish real conversations from someone watching television. Similarly, little consideration is given to how measurement variance might be reduced or maximized for a new digital source. For example, while some assessments in psychology (e.g., cognitive tasks) do not produce reliable individual differences, others (e.g., mood) purposefully reflect variation in individual responses (Hedge, Powell, & Sumner, 2018). Hence, it is critical to find ways to share raw data, processing pipelines, and analysis scripts for digital trace research: the researcher degrees of freedom are vast, which produces large variance in the conclusions drawn from the same data (Silberzahn et al., 2018; Towse, Ellis, & Towse, 2020). Validation procedures are likely to reflect the diversity of digital data sources, but combining small- and large-scale approaches (e.g., N = 1 case studies alongside larger samples) can successfully quantify the errors associated with smartphone sensing-based methods (Geyer, Ellis, Shaw, & Davidson, 2020; Sen et al., 2019; Szot, Specht, Specht, & Dabrowski, 2019). Only then can related work explore how signals from multiple systems may be combined to improve data efficiency. If this basic research is not completed, little progress will follow: research agendas risk shifting in the wrong direction when the grounding principles are weak, particularly in applied settings, such as security and health, that are increasingly interested in digital traces (Davidson, 2020; Guttman & Greenbaum, 1998).
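As a minimal illustration of such error quantification, the sketch below scores a hypothetical ambient-conversation classifier against N = 1 ground-truth labels (e.g., a participant diary). The labels are invented for illustration and the metrics are standard precision and recall, not any cited validation protocol.

```python
# Minimal sketch: quantifying a digital measure's error against
# N = 1 ground truth. Labels are invented for illustration.
from collections import Counter

# Hypothetical paired labels for hourly windows from one participant.
ground_truth = ["conversation", "tv", "silence", "conversation", "tv", "silence"]
classifier   = ["conversation", "conversation", "silence", "conversation", "tv", "silence"]

pairs = Counter(zip(ground_truth, classifier))
tp = pairs[("conversation", "conversation")]
fp = sum(v for (truth, pred), v in pairs.items()
         if pred == "conversation" and truth != "conversation")
fn = sum(v for (truth, pred), v in pairs.items()
         if truth == "conversation" and pred != "conversation")

precision = tp / (tp + fp)  # how often a detected conversation was real
recall = tp / (tp + fn)     # how many real conversations were detected
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
# Here the classifier mistakes television audio for conversation (one
# false positive), exactly the error mode described above.
```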
Moreover, we acknowledge that research in this space remains challenging to conduct because data derived from digital sources can be difficult to access, handle, and interpret (DeMasi, Kording, & Recht, 2017). This challenges the way psychologists are trained, and the way they are incentivized (not) to publish descriptive findings in an interdisciplinary landscape. However, we are hopeful that new methods and emerging forms of data will complement psychology's diverse measurement practices. With the growing web of connected devices, collectively termed the Internet of Things, the potential for data linkage to further leverage real-world research remains an exciting prospect. In the long term, taking the time to understand how behavioral, situational, and environmental/contextual factors can be extracted from objective digital data will allow psychology to develop robust, contextualized, and comprehensive theory (Lazer et al., 2020).
Our muse is people, and psychology should critically consider how it moves forward and merges old and new. Generalizability requires sound measures first, but there is still little agreement among psychologists about what is worth measuring.
Financial support
This work was part-funded by the Centre for Research and Evidence on Security Threats (ESRC Award: ES/N009614/1 to PJT; ANJ; DAE), www.crestresearch.ac.uk and by the National Science Foundation (SES-1758835 to CS).
Conflict of interest
None.