Outliers in L2 Research in Applied Linguistics: A Synthesis and Data Re-Analysis

Christopher Nicklin; Luke Plonsky

doi:10.1017/S0267190520000057

Published online by Cambridge University Press: 30 June 2020

Christopher Nicklin*: Affiliation:
Temple University, Japan and Northern Arizona University
Luke Plonsky: Affiliation:
Temple University, Japan and Northern Arizona University
*Corresponding author. E-mail: christophernicklin79@gmail.com

Abstract

Data from self-paced reading (SPR) tasks are routinely checked for statistical outliers (Marsden, Thompson, & Plonsky, 2018). Such data points can be handled in a variety of ways (e.g., trimming, data transformation), each of which may influence study results in a different manner. This two-phase study sought, first, to systematically review outlier handling techniques found in studies that involve SPR and, second, to re-analyze raw data from SPR tasks to understand the impact of those techniques. Toward these ends, in Phase I, a sample of 104 studies that employed SPR tasks was collected and coded for different outlier treatments. As found in Marsden et al. (2018), wide variability was observed across the sample in terms of selection of time and standard deviation (SD)-based boundaries for determining what constitutes a legitimate reading time (RT). In Phase II, the raw data from the SPR studies in Phase I were requested from the authors. Nineteen usable datasets were obtained and re-analyzed using data transformations, SD boundaries, trimming, and winsorizing, in order to test their relative effectiveness for normalizing SPR reaction time data. The results suggested that, in the vast majority of cases, logarithmic transformation circumvented the need for SD boundaries, which blindly eliminate or alter potentially legitimate data. The results also indicated that choice of SD boundary had little influence on the data and revealed no meaningful difference between trimming and winsorizing, implying that blindly removing data from SPR analyses might be unnecessary. Suggestions are provided for future research involving SPR data and the handling of outliers in second language (L2) research more generally.

Type: Research Article
Information: Annual Review of Applied Linguistics, Volume 40, March 2020, pp. 26–55

DOI: https://doi.org/10.1017/S0267190520000057
Open Practices: Open materials, Open data
Copyright: Copyright © The Author(s), 2020. Published by Cambridge University Press

Introduction

Self-paced reading (SPR) tasks involve the visual presentation of stimulus sentences on a computer screen. The speed with which the stimulus words, segments, or lines are presented is controlled by participants' keystrokes, the timing of which is recorded. Those ‘reading times’ (RTs) have been used to shed light on a wide variety of cognitive processes and mechanisms such as grammatical ambiguities (Dekydtspotter & Outcalt, 2005; Gerth, Otto, Felser, & Nam, 2017; Jackson & Roberts, 2010; Juffs, 1998; Kim & Christianson, 2017), anomalies (Dong, Wen, Zeng, & Ji, 2015; Juffs & Harrington, 1996; Sagarra & Herschensohn, 2011), distance dependency, such as wh-movement (Dussias & Piñar, 2010; Johnson, Fiorentino, & Gabriele, 2016; Juffs, 2005), and the processing effects of lexical qualities, such as frequency, case-marking information, and orthographic distinctiveness (Hopp, 2016; Jackson, 2008; Kim, Crossley, & Skalicky, 2018).

One challenge to research that involves SPR is the presence of statistical outliers, which are data points that are distinct from the other observations in the sample (Roth & Switzer III, 2004). Outliers are particularly prevalent in reaction-time data, which tend to be highly sensitive; even the slightest distraction or momentary hesitation on the part of participants can yield statistical outliers. Outliers are also of concern for quantitative researchers in language and psychological sciences because of their potential to increase error variance, reduce statistical power, bias estimates of interest, and lead to violations of test assumptions such as requirements for a normal distribution (e.g., Aguinis, Gottfredson, & Joo, 2013; Osborne & Overbay, 2004). In second-language (L2) research, outliers might be of particular concern owing to more pronounced variance in processing speeds across samples of L2 users than for their first-language (L1) counterparts, as evidenced by larger SDs in eye-tracking studies (Godfroid, 2020). However, the literature surrounding outliers is plagued by uncertainty regarding outlier treatment, particularly in applied linguistics, where methods tend to lag behind neighboring disciplines. In the present study, a body of L2 research involving SPR was synthesized with the initial aim of establishing how researchers in the field treat outliers. In Phase II of the study, a number of outlier treatments were implemented to re-analyze datasets from published L2 SPR studies in order to determine which might be best suited to L2 SPR research.

Literature Review

Quantitative researchers are often concerned about the presence of outliers, and not without good reason. Such data points can increase error variance, reduce statistical power, bias estimates of interest, and lead to violations of test assumptions, such as normality (Osborne & Overbay, 2004). In analyses that compare group scores, such as analysis of variance (ANOVA), outliers in one direction or another can also lead to Type I or Type II errors. More specifically, the presence of high and low outliers can artificially stretch the standard deviation (SD) and, therefore, increase the chances of a Type II error (Cousineau & Chartier, 2010). Despite these clear problems, L2 researchers do not always check for the presence of outliers (Marsden et al., 2018; Plonsky & Ghanbar, 2018). Moreover, there is a lack of clarity regarding what exactly constitutes an outlier and how outliers should be handled.

Outlier Detection and Treatment

One of the central issues in this domain is outlier detection (Aguinis et al., 2013). There are a number of approaches for doing so that can be used individually or in concert, such as the visual examination of histograms, boxplots, and Q-Q plots, as well as data-cleaning. Data-cleaning involves establishing boundaries to determine whether reaction times were too fast or too slow to be valid measures of the phenomenon of interest. However, a lack of agreed-upon standards for what comprises an outlier has led to widespread ambiguity and inconsistency in the published social science literature. Simmons, Nelson, and Simonsohn (2011), for example, reported cutoff points to determine outliers set to exclude the fastest 2.5% of values, results faster than 2.5 SDs from the mean, and results faster than 300, 200, 150, or even 100 ms. Boundaries to determine the slowest were set to exclude the slowest 2.5% of values, results slower than 2.5 SDs from the mean, and results slower than 1,000, 2,000, 3,000, or even 5,000 ms. This lack of consistency in boundary selection renders comparisons between studies problematic.

In a methodological synthesis of SPR tasks in L2 research (K = 74), Marsden et al. (2018) found SD-based cutoffs to be predominantly chosen over time- and percentage-based boundaries. However, this is potentially problematic for a number of reasons. First, if the data lying outside the SD-based boundaries are excluded, because the SD of the dataset is highly influenced by outliers, a paradox occurs whereby outliers are influencing the criterion that has been chosen to detect and eliminate them (Field, 2018). Second, blindly trimming datasets with arbitrarily determined SD-based boundaries potentially removes legitimate data (Baayen, 2008). Finally, if such trimming is administered to the general distribution, outliers within categories can remain unaffected (Lachaud & Renaud, 2011).
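The paradox described by Field can be illustrated with a minimal sketch. The RT values below are hypothetical, and the 2.5 SD criterion is simply one of the conventions mentioned above; the point is only that extreme values inflate the SD that is then used to judge them.

```python
from statistics import mean, stdev

def sd_flags(values, k=2.5):
    """Return the values lying more than k sample SDs from the mean."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) > k * s]

# Hypothetical RTs in ms (illustrative only, not from any study).
rts = [320, 350, 365, 380, 400, 410, 430, 455, 470, 500]
one_outlier = rts + [4000]
two_outliers = rts + [4000, 4200]

sd_clean = stdev(rts)          # SD without extreme values
sd_one = stdev(one_outlier)    # SD inflated by a single extreme value

flag_one = sd_flags(one_outlier)    # the single extreme value is detected
flag_two = sd_flags(two_outliers)   # two extreme values mask each other
print(round(sd_clean, 1), round(sd_one, 1), flag_one, flag_two)
```

In this toy sample, one extreme value is flagged, whereas a second extreme value inflates the SD enough that neither falls outside the 2.5 SD boundary.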

When time-based RT cutoffs have been utilized in SPR task research, an assortment of values and justifications have emerged. Lower boundaries for excluding unusually fast RTs in SPR research are generally around 200 ms for word-by-word presentations (Jegerski, 2016; Jiang, 2007; Sagarra & Herschensohn, 2010). However, they can be 100 ms (Kim, 2018; Leal, Slabakova, & Farmer, 2017; Litcofsky & Van Hell, 2017), or even as low as 40 or 50 ms (Cook, 2018; Frank, Trompenaars, & Vasishth, 2016), despite research suggesting 100 ms as a minimum to respond to the identity of a signal (Luce, 1991). Rather than rely on conventions, some researchers provide research-driven justifications to determine a cutoff. Sagarra and Herschensohn (2011, 2012) implemented a 200 ms cutoff to discard RTs, which was based on research suggesting that Anglophone college students require between 225–300 ms to process individual words (Rayner & Pollatsek, 1989). This is slightly higher than magnetoencephalography (MEG) research findings, which have suggested around 150–200 ms for lexical identification, although the findings are inconsistent (Hsu, Lee, & Marantz, 2011). However, these research-derived estimates are based upon L1 speaker norms and might not reflect language learners. Occasionally, researchers identify outliers based on other studies, and, in place of theoretical reasoning or citations, provide explanations such as “following standard practice” (e.g., Dussias & Piñar, 2010, p. 457; Marinis, Roberts, Felser, & Clahsen, 2005, p. 64).

Regardless of whether and how cutoffs are justified, none of the inconsistencies in outlier identification listed above are necessarily incorrect. Yet the fact that such a wide array of options exists makes them “potential fodder for self-serving justifications” (Simmons et al., 2011, p. 1360), and unscrupulous authors might be tempted to explore a variety of alternatives in their quest to obtain statistical significance or an otherwise desired result (i.e., “p-hacking”). Alternatively, if a significant result is discovered with a priori selected cutoffs, experimenting with a spread of cutoffs to make sure that a significant effect remains significant and is not merely a product of the chosen cutoffs could be considered good practice, provided all results are reported (Ratcliff, 1993).

Another issue addressed by Aguinis et al. (2013) was outlier treatment. A wealth of research exists that argues both for and against the deletion of outlying data, but, before considering whether or not to remove outliers from a dataset, it is important to consider their source. Orr, Sackett, and Dubois (1991) provided a list of outlier sources consisting of (a) observations from subjects who do not belong to the population being targeted (e.g., returnee students in a population of L2 learners), (b) erroneous data preparation (e.g., failing to remove unusually fast RTs for a participant), (c) the result of extreme values (e.g., guessing correctly on all items on a multiple-choice test), (d) erroneous observations (e.g., accidental responses), and, finally, (e) legitimate observations that allude to pertinent information regarding the construct and population of interest. Further sources of outliers include faulty assumptions regarding the distribution of the data (Osborne & Overbay, 2004) and inattentive participants (Ratcliff, 1993), which are both particularly relevant to reaction time research. If the source of an outlier is revealed to be anything but a legitimate observation, then the data point should be removed (Osborne & Overbay, 2004).

Once an outlier is judged to be a legitimate observation, the decision over whether or not to eliminate it can be made. Several researchers have advocated for outlier removal, providing examples of how extreme scores have a negative effect on accuracy estimates and make significant changes to correlation and t-test statistics (Barnett & Lewis, 1994; Judd & McClelland, 1989; Osborne & Overbay, 2004). Nevertheless, a number of options exist for handling outliers beyond simply deleting them. Whereas trimming involves removing outliers outside of SD- or time-based boundaries, winsorizing involves replacing outliers with a chosen boundary value (e.g., mean value plus 2 SDs), which is preferable to including outliers that bias the model (Field, 2018). Some researchers favor replacing outliers with the grand mean of the general distribution, or the mean of the outliers’ respective condition. However, as mentioned above, this practice is paradoxical, and it can also distort main effects, interactions, and relations between conditions (Lachaud & Renaud, 2011). A number of researchers favor setting a priori outlier treatment procedures, reporting results both with and without outliers, and providing justification of why any data points were removed (Aguinis et al., 2013; Aguinis & Joo, 2015; Baayen & Milin, 2010; Bakker & Wicherts, 2014; Kruskal, 1960; Ratcliff, 1993; Simmons et al., 2011; Streiner, 2018). This method is advantageous as a means to assure readers that results were not affected by outliers. Furthermore, clear justification for outlier removal assures readers that the motive for removal was not to achieve more favorable results.
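The difference between trimming and winsorizing can be sketched as follows. This is a minimal illustration under stated assumptions: the RTs are hypothetical, the boundaries are the mean plus or minus 2 sample SDs, and the function names are chosen here for illustration rather than taken from any package.

```python
from statistics import mean, stdev

def sd_bounds(values, lower_k=2.0, upper_k=2.0):
    """Boundaries at mean - lower_k*SD and mean + upper_k*SD."""
    m, s = mean(values), stdev(values)
    return m - lower_k * s, m + upper_k * s

def trim(values, lo, hi):
    """Trimming: drop every value outside [lo, hi]."""
    return [v for v in values if lo <= v <= hi]

def winsorize(values, lo, hi):
    """Winsorizing: replace values outside [lo, hi] with the nearer boundary."""
    return [min(max(v, lo), hi) for v in values]

# Hypothetical RTs in ms with one extreme value.
rts = [320, 350, 365, 380, 400, 410, 430, 455, 470, 500, 4000]
lo, hi = sd_bounds(rts)
trimmed = trim(rts, lo, hi)          # the extreme value is removed
winsorized = winsorize(rts, lo, hi)  # the extreme value is pulled in to hi
print(len(trimmed), len(winsorized), round(max(winsorized), 1))
```

Trimming shortens the sample, whereas winsorizing keeps the sample size constant and only alters the flagged values.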

Data Transformations

Data transformations are another well-established method of outlier control. Data transformations involve the application of a mathematical modification to each value in a dataset, reducing the influence of outliers and yielding a distribution closer to normal. Data transformations can be particularly useful for the analysis of reaction time data, because such distributions are often positively skewed with a long tail, which violates the normality assumption and makes them inappropriate for common parametric statistical analyses. The most common data transformation in reaction time research is the logarithmic, or log, transformation. A logarithm is the power to which a chosen base number must be raised in order to obtain the original data value. For example, if the base number is 10 and the reaction time is 1000 ms, the transformed value will be 3, because 10³ equals 1000. In addition to base values such as 2 and e (2.7182818, the base of the natural logarithm), a relatively large base value of 10 is frequently used with reaction time data that contain outliers, because higher base values pull extreme data values inwards (Osborne, 2002).
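The worked example above can be reproduced with a short sketch using Python's standard `math` module. The RT values are hypothetical; the final lines simply record the arithmetic relation between the base 10 and natural logarithms, which differ by a constant factor of ln 10.

```python
import math

# Hypothetical RTs in ms: a fast, a typical, and a very slow response.
rts_ms = [400, 1000, 4000]

log10_rts = [math.log10(rt) for rt in rts_ms]  # base 10 logarithm
ln_rts = [math.log(rt) for rt in rts_ms]       # natural logarithm (base e)

# A tenfold difference on the raw scale (400 vs. 4000 ms)
# becomes a difference of 1 on the log10 scale.
raw_ratio = rts_ms[2] / rts_ms[0]
log_gap = log10_rts[2] - log10_rts[0]

# ln(x) = ln(10) * log10(x) for every x, so the two log scales
# are constant multiples of one another.
ratios = [ln / l10 for ln, l10 in zip(ln_rts, log10_rts)]
print(log10_rts[1], raw_ratio, log_gap, ratios)
```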

Other common data transformations include the inverse transformation and the square root transformation. The inverse, or reciprocal, transformation is simply 1/x, in which x is the reaction time. For example, if the reaction time is 1000 ms, the inverse transformation will be 0.001. Because the inverse transformation reverses the order of scores, Baayen (2008) recommends using -1000/x, which facilitates interpretation by using a negative to re-reverse the order, while multiplication by 1000 avoids tiny numbers. Whereas log transformations attend to the skewness of a distribution, the inverse transformation reduces the impact of long reaction times in the tail. However, inverse transformations have been shown in data simulations to lower statistical power (Schramm & Rouder, 2019). Although rarely used in L2 studies, calculating the square root of a reaction time is another transformation that can address positive skew by pulling larger scores closer to the center (Field, 2018).
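The order reversal produced by 1/x, and its correction by -1000/x, can be seen in a brief sketch. The RTs are hypothetical and the function names are illustrative only.

```python
import math

def inverse(rt_ms):
    """Plain reciprocal transformation, 1/x."""
    return 1 / rt_ms

def scaled_negative_inverse(rt_ms):
    """The -1000/x variant attributed to Baayen (2008) in the text."""
    return -1000 / rt_ms

# Hypothetical RTs in ms, sorted from fastest to slowest.
rts_ms = [250, 500, 1000, 4000]

inv = [inverse(rt) for rt in rts_ms]                  # decreasing: order reversed
neg_inv = [scaled_negative_inverse(rt) for rt in rts_ms]  # increasing: order restored
roots = [math.sqrt(rt) for rt in rts_ms]              # square root keeps the order

print(inv)      # 1000 ms maps to 0.001
print(neg_inv)  # 1000 ms maps to -1.0
print([round(r, 2) for r in roots])
```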

When deciding among these three transformations, researchers in the social sciences have traditionally utilized the Box-Cox procedure (Box & Cox, 1964). This procedure involves obtaining the lambda (λ) value that corresponds to the highest correlation coefficient when the RT data are plotted on a Box-Cox normality plot. The λ value determines which of the three transformations should be used. For instance, an inverse transformation is preferable if λ = -1, a log transformation is more suitable if λ = 0, and a square root transformation should be implemented if λ = 0.5. At present, despite the common practice of applying transformations along with data cleaning methods such as winsorizing and trimming, there is no research exploring the effectiveness of such combinations, which the present study aims to address.
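For reference, the correspondence between λ and the three transformations follows from the standard one-parameter Box-Cox family, given here in its usual textbook form (the notation is not taken from the present study):

```latex
x^{(\lambda)} =
\begin{cases}
  \dfrac{x^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
  \ln x, & \lambda = 0,
\end{cases}
\qquad x > 0 .
```

Setting λ = -1 gives 1 - 1/x, a linear function of the inverse transformation, and setting λ = 0.5 gives 2(√x - 1), a linear function of the square root transformation, which is why these values of λ point to the inverse and square root transformations, respectively.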

Methodological Synthesis and Data Re-analysis

L2 research has seen a recent surge in methodological syntheses. These syntheses systematically examine and evaluate research in a given domain for strengths and weaknesses with the aim of advancing methodological practices in L2 research. Methodological syntheses in L2 research have been both broad and narrow in focus, with some attending to individual substantive domains (e.g., interaction in Plonsky & Gass, 2011; L2 writing in Liu & Brown, 2015) and others concerned with different aspects of research design, analyses, and reporting practices, such as instrument development (Derrick, 2016), factor analysis (Plonsky & Gonulal, 2015), multiple regression (Plonsky & Ghanbar, 2018), use of effect sizes (Norouzian & Plonsky, 2018; Plonsky & Oswald, 2014), reliability coefficients (Plonsky & Derrick, 2016), statistical assumptions (Hu & Plonsky, 2019), and others (e.g., Al-Hoorie & Vitta, 2019; Plonsky, 2013). Furthermore, methodological syntheses have also focused on individual research tools, such as grammaticality judgment tasks (Plonsky, Marsden, Crowther, Gass, & Spinner, 2019) and eye-tracking (Godfroid, 2020), as well as instructed second language development (Sok, Kang, & Han, 2019). Most pertinent to the present study is Marsden et al.'s (2018) review of SPR tasks, which demonstrated a number of inconsistencies in how SPR data are cleaned and handled. The authors called for greater standardization of SPR research and for empirical research to determine a set of norms for outlier detection in SPR data. Marsden et al. also concluded that the wide variation and opaqueness regarding data cleaning methods in SPR tasks affects and, in fact, hinders comparability of individual study results. In response to their study, the 74 studies gathered by Marsden et al., along with a further 45 studies published since, were synthesized to determine how outliers in SPR tasks are currently being treated by L2 researchers.

Another type of methodologically oriented study, and one that has featured less prominently in L2 research, is data re-analysis. As the name suggests, studies of this nature involve obtaining datasets from previously published research in order to re-examine the results using alternative methods of analysis. In L2 research, bootstrapping has been the subject of two data re-analyses: Larson-Hall and Herrington (2010) re-analyzed data from an unpublished study, while Plonsky, Egbert, and LaFlair (2015) solicited data from 255 studies for re-analysis using bootstrapping. However, only 37 (14.50%) datasets were shared, and only 26 (10.20%) were useable. Re-analysis of the shared data revealed Type I error in four out of 16 significant results, indicating that bootstrapping is a useful tool, particularly with small, quasi-experimental samples.

The reluctance of researchers to share data is not confined to L2 research, with similar issues observed in psychological research (Craig & Reese, 1973; Wicherts, Borsboom, Kats, & Molenaar, 2006; Wolins, 1962). However, researchers’ disinclination to share data is unfortunate, because data sharing has the potential to transform analytical practices, advance methodology, provide professional development opportunities, and make research more engaged, democratic, and practically relevant for any field that embraces it (Maienschein, Parker, Laubichler, & Hackett, 2019). Furthermore, the American Psychological Association (APA) (2017) instructs researchers to not withhold datasets from other competent professionals who wish to submit the results to re-analysis. In Phase II of the present study, 19 datasets from published SPR studies were re-analyzed using several different outlier treatment techniques in order to investigate their effects on RT distributions and determine the existence of an outlier treatment method particularly suited to L2 SPR data. With this goal in mind, the research questions for the present study are as follows:

• How are outliers generally treated in L2 studies involving SPR?
• How do the various outlier treatments utilized in L2 SPR research affect RT data?
• Are any methods of outlier treatment particularly suited to L2 SPR research?

Method

Phase I: Synthesis of Outlier Treatment

Study selection

To determine how outliers are typically handled in L2 SPR tasks, a sample of studies was compiled. The first stage involved obtaining all 67 studies analyzed in Marsden et al.'s (2018) methodological synthesis of SPR tasks in L2 research. As this sample was collected by March 2016, a further three searches were conducted to acquire studies published since. The same databases as Marsden et al. (Communication and Mass Media; Education Source; Education Resources Information Center [ERIC]; PsycARTICLES; and PsycINFO) were searched using the terms self-paced reading, subject-paced reading, and moving window by abstract, title, or subject. The search for self-paced reading resulted in 1,348 hits, of which 176 were published between March 2016 and July 2019. Of the 176 articles, 43 L2 studies were acquired. The search for subject-paced reading resulted in seven hits, all of which were L1 studies and were, therefore, discarded. The search for moving window resulted in 455 hits, with 57 published after March 2016, from which two more L2 studies were acquired. Overall, the three searches added a further 45 studies to Marsden et al.'s 67, making a total of 112 studies available for coding.

Coding

Once the sample was collected, all 112 studies were coded for their handling of outliers. The coding scheme (see Supplementary Materials A) began with the known set of techniques for handling outliers as described in Aguinis et al. (2013), Field (2018), and other discussions of outlier handling. However, as new techniques were encountered, new items and values for existing items were added to the coding scheme. During coding, eight studies were eliminated for utilizing “paper-based” SPR, lexical decision tasks, or cross-modal priming studies as opposed to SPR tasks. This resulted in a final sample of 104 studies for synthesis (see Supplementary Materials B). Once coding was completed, a research assistant (a doctoral student in applied linguistics) re-coded 10% of the sample. Agreement was perfect for 84.51% of the items (Cohen's Kappa median κ = 1, IQR = 0) (see Supplementary Materials A). Where disagreement occurred, the first author re-coded the item for a third time to settle the discrepancy.
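For readers unfamiliar with the agreement statistic reported above, Cohen's kappa for a single coded item can be computed as in the following sketch. The category labels and codes are hypothetical and do not come from the coding scheme in the Supplementary Materials.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders assigning nominal codes to the same studies."""
    n = len(coder_a)
    # Observed agreement: proportion of studies given the same code.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal proportions per category.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n ** 2
    if expected == 1:  # both coders used a single identical category throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Hypothetical codes for one item (type of outlier boundary) across six studies.
coder_a = ["sd", "sd", "time", "none", "sd", "time"]
coder_b = ["sd", "sd", "time", "none", "time", "time"]

kappa = cohens_kappa(coder_a, coder_b)
kappa_perfect = cohens_kappa(coder_a, coder_a)
print(round(kappa, 3), kappa_perfect)
```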

Phase II: Re-analysis of L2 SPR Studies

Collection of datasets for re-analysis

In order to obtain raw datasets for re-analysis, the corresponding authors of all 104 studies in the sample were contacted by email. The purpose of the study was explained to them, and it was made clear that a public critique of their work was not intended (see template email in Supplementary Materials C). Because a number of researchers were the contact author on multiple studies, a total of 69 authors were approached to share data. Of those 69 authors, 31 (44.93%) replied to the email, and 20 (28.99%) were able to share a total of 22 datasets for re-analysis, each representing a unique study.¹ This figure is slightly more than the 25.70% of authors who shared data with Wicherts et al. (2006) and more than the 14.50% who shared with Plonsky et al. (2015).

Outlier treatments

Once the datasets were collected, they were re-analyzed according to the procedures described in the original study. First, a single result representing the main finding of the study was chosen as the target of the re-analysis. A result was considered the main finding if it was represented by the title of the study or was focused upon in the abstract. For the majority of studies, missing data or insufficient details in the published report, regarding either the analysis or outlier treatment, meant that an exact replication of the published results was unachievable. In these cases, the results were analyzed to ensure that they followed the pattern of the published results, and the result closest to one of the published results was chosen for further re-analysis. When studies employed length-adjusted residual RTs, re-analysis was conducted upon the raw RTs for consistency and, also, because results between residual RTs and raw RTs are qualitatively identical (Fine, Jaeger, Farmer, & Qian, 2013). After this initial stage, two of the datasets were discarded for containing only aggregated data and another incomplete dataset was also withdrawn, which left a total of 19 datasets representing 19 unique studies. Of the remaining 19 studies, one investigated the construct validity of grammaticality judgment tasks (GJTs) (Vafaee, Suzuki, & Kachinske, 2017), while the pattern of the re-analyzed results in another two studies contradicted the published results. For these three studies, residual statistics were not recorded. Table 1 contains descriptive statistics for the final sample of 19 studies.

Table 1. Descriptive Statistics for the Re-Analyzed Studies

Note. n = number of participants per study. k = number of RTs per study. Med(k) = median RT length per study (raw data, 150 ms to 10,000 ms). IQR(k) = interquartile range of median RT length per study (ms). LMEM = linear mixed-effects model. GLME = generalized linear mixed-effects model.

The re-analysis required the creation of a new subdataset for each combination of outlier treatments, which consisted of the raw data, four transformations, nine sets of SD boundaries, winsorizing, and trimming (see Data transformations and Data cleaning sections below). This resulted in 95 subdatasets for each of the 19 useable datasets that were shared. The subdatasets for the first two re-analyzed studies, which equated to 190 subdatasets, were created manually using Microsoft Excel (Version 16.0.11929.20298). The subdatasets for the remaining 17 studies were generated using R (R Core Team, 2018). The manually produced subdatasets acted as checks to ensure that there were no errors with the R code for the generated subdatasets. In total, 1,805 subdatasets were created for re-analysis.

The first stage in creating the 95 subdatasets for each shared dataset consisted of removing extreme, time-based outliers from the raw dataset. Although time-based boundaries varied across studies, this variable was held constant at 150 ms for the lower boundary and 10,000 ms for the upper boundary. This decision was made in order to concentrate on the effects and interactions between data transformations and SD-based boundaries, which were the most common outlier treatments found in the synthesis of L2 SPR research (see Results section). The lower time-based boundary was theoretically driven by MEG research, where findings have suggested around 150 ms as a lower boundary for lexical identification (Hsu et al., 2011). The upper boundary of 10,000 ms was selected as this was the highest boundary used for single word presentation across the 104 synthesized studies. Using such a large upper boundary guaranteed that outliers would be included, which was important because the aim of the current study was to investigate outlier treatments both individually and in concert. All RTs falling outside of the 150 to 10,000 ms boundary were removed from the initial raw dataset before any transformations or other outlier treatments were performed. The 95 subdatasets required for the re-analysis of each of the 19 shared datasets comprised combinations of the following data transformations and data cleaning methods.
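This initial time-based step amounts to a simple filter, sketched below. The RTs are hypothetical, and treating values exactly equal to 150 ms or 10,000 ms as retained is an assumption of this sketch, since the text does not specify how boundary values were handled.

```python
LOWER_MS = 150      # lower time-based boundary held constant in the re-analysis
UPPER_MS = 10_000   # upper time-based boundary held constant in the re-analysis

def apply_time_bounds(rts_ms, lower=LOWER_MS, upper=UPPER_MS):
    """Keep RTs within [lower, upper]; inclusive bounds are an assumption here."""
    return [rt for rt in rts_ms if lower <= rt <= upper]

# Hypothetical raw RTs in ms, including an accidental keypress and a long pause.
raw_rts = [95, 150, 410, 380, 2750, 9800, 10_000, 14_200]
kept = apply_time_bounds(raw_rts)
removed = len(raw_rts) - len(kept)
print(kept, removed)
```

Applying this filter before any transformation matters because it is defined on the raw millisecond scale.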

Data transformations

Each dataset from the 19 shared studies was initially transformed using (a) base 10 logarithmic transformation, (b) e logarithmic transformation, (c) Baayen's (2008) inverse transformation (-1000/x), and (d) a square root transformation, resulting in five base datasets, including the original raw, untransformed set. Although the square root transformation was not implemented in any of the synthesized L2 SPR studies, it was included here, because it is one of the three Box-Cox procedure transformations. Some SPR studies involved centered RTs, or, in one case, RT z-scores for analysis, while a number of studies transformed RTs to residual RTs. These transformations were not implemented for re-analysis, because they are utilized for reasons other than outlier control.

Data cleaning

The five base datasets were subjected to a number of data cleaning methods involving SD-based boundaries. First, SDs for each participant or group were calculated, and a new dataset was created for each of the selected boundaries, -1.5 to 1.5, -2 to 2, -2.5 to 2.5, and -3 to 3, which represented the full range of boundaries used in the synthesized studies. To achieve as accurate a re-analysis as possible, the decision to use participant or group mean for determining the SDs for the boundaries followed the decision made by the original authors. Because reaction time distributions are positively skewed, asymmetrical boundaries, in which the lower (negative) boundary lies closer to the mean than the upper boundary, were also investigated. These boundaries were utilized because they treat the tail end of the distribution more leniently, thus including true data points that would otherwise be eliminated. Asymmetrical boundaries have also been shown to control for Type I errors and eliminate the effects of skewness (Keselman, Wilcox, Othman, & Fradette, 2002). The asymmetric boundaries consisted of -1.5 to 2, -1.5 to 2.5, -1.5 to 3, -2 to 2.5, and -2 to 3, resulting in a total of nine different boundary combinations and, therefore, nine new datasets. Each of these datasets was subjected to both winsorizing and trimming, resulting in 18 new datasets for each of the five base datasets from the 19 studies. These SD-based data cleaning techniques were implemented on the untransformed set and the four transformed sets, resulting in 95 subdatasets per original shared dataset (see Figure 1), and a grand total of 1,805 subdatasets for analysis.
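A per-participant version of SD-based cleaning with asymmetrical boundaries might look like the sketch below. It is an illustration only, not the R code used in the study: the data are hypothetical, the function name and the `(participant, RT)` row layout are assumptions, and SDs are computed per participant rather than per group.

```python
from collections import defaultdict
from statistics import mean, stdev

def clean_by_participant(rows, lower_k, upper_k, method="trim"):
    """Apply SD boundaries within each participant.

    rows: list of (participant_id, rt) pairs; each participant needs >= 2 RTs.
    Boundaries are mean - lower_k*SD and mean + upper_k*SD of that participant.
    Output rows are grouped by participant, not kept in the input order.
    """
    by_pid = defaultdict(list)
    for pid, rt in rows:
        by_pid[pid].append(rt)
    cleaned = []
    for pid, rts in by_pid.items():
        m, s = mean(rts), stdev(rts)
        lo, hi = m - lower_k * s, m + upper_k * s
        for rt in rts:
            if lo <= rt <= hi:
                cleaned.append((pid, rt))
            elif method == "winsorize":
                cleaned.append((pid, lo if rt < lo else hi))
            # method == "trim": the out-of-bounds RT is simply dropped
    return cleaned

# Hypothetical RTs (ms) for two participants; p1 has one very slow response.
rows = [("p1", rt) for rt in (300, 320, 340, 360, 380, 400, 2000)]
rows += [("p2", rt) for rt in (600, 650, 700, 750, 800, 850, 900)]

trimmed = clean_by_participant(rows, 1.5, 2.0, "trim")          # -1.5 to 2
winsorized = clean_by_participant(rows, 1.5, 2.0, "winsorize")  # -1.5 to 2
lenient = clean_by_participant(rows, 1.5, 2.5, "trim")          # -1.5 to 2.5
print(len(trimmed), len(winsorized), len(lenient))
```

In this toy example the slow response is affected under the -1.5 to 2 boundary but retained under -1.5 to 2.5, which shows how the more lenient upper boundary preserves data in the tail.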

Figure 1. Composition of 95 subdatasets created for each of the 19 shared, raw datasets.
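The composition of the 95 subdatasets can be checked by enumerating the treatment combinations described above. The following Python sketch is illustrative only; the labels are hypothetical names for the transformations, boundaries, and cleaning methods listed in the text.

```python
from itertools import product

bases = ["raw", "log10", "ln", "inverse", "sqrt"]
symmetric = [(1.5, 1.5), (2, 2), (2.5, 2.5), (3, 3)]
asymmetric = [(1.5, 2), (1.5, 2.5), (1.5, 3), (2, 2.5), (2, 3)]
methods = ["trim", "winsorize"]

# Each base dataset with no SD-based cleaning applied
treatments = [(base, None, None) for base in bases]

# Each base dataset crossed with nine boundaries and two cleaning methods
treatments += list(product(bases, symmetric + asymmetric, methods))

per_study = len(treatments)   # 5 + 5 * 9 * 2
total = per_study * 19        # across the 19 shared datasets
```

The count reproduces the figures reported in the text: 95 subdatasets per shared dataset and 1,805 overall.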

Statistics recorded

All subdatasets were re-analyzed with the same form of modeling as in the original study, which, in most cases, involved either factorial or repeated-measures ANOVA, or linear mixed-effects models (LMEMs). One set of SPR results was analyzed using a generalized linear mixed-effects (GLME) model, while another set was used to investigate the construct validity of GJTs (see Table 1). In the latter case, only the data removed and distribution statistics for the RTs, not the residual statistics, were recorded. For the studies involving ANOVA, re-analysis was conducted using R. For the studies involving LMEMs, re-analysis was conducted using the R package lme4 (Bates, Maechler, Bolker, & Walker, 2015). If even one subdataset resulted in a singular fit, the model was simplified until a model could be successfully fit to all 95 subdatasets. This usually involved removing random slopes from the model one at a time, leaving critical slopes intact and avoiding random intercepts-only models where possible (Barr, Levy, Scheepers, & Tily, 2013). In total, seven statistics were recorded from the re-analyses for further analysis. All of the results from the re-analyses and the R code used to produce them will be made available on the IRIS digital repository of instruments and materials for research into second languages (Marsden, Mackey, & Plonsky, 2016).

Distribution statistics

The assumption of a normal, Gaussian distribution is perhaps the most well-known assumption of parametric testing, yet it goes unaddressed in the majority of L2 research (Al-Hoorie & Vitta, 2019; Plonsky, 2013). Outliers in the upper tail of an RT distribution can influence the shape of the distribution, making it inappropriate for numerous analyses. For ANOVA, the assumption of normality refers to the distribution of the dependent variable within each group, although ANOVA has been shown to be somewhat robust against skewed data (Glass, Peckham, & Sanders, 1972). For regression methods utilizing continuous variables, including ANOVA, normally distributed residuals are a mathematical requirement (Baayen, 2008) and improve inferences made back to the population from which the sample was drawn (Pek, Wong, & Wong, 2017; Williams, Grajales, & Kurkiewicz, 2013). However, analyses are generally improved if all of the variables are also normally distributed (Tabachnick & Fidell, 2013). In the present study, normality was assessed for (a) the ungrouped RTs for all subdatasets, and (b) the residuals for all subdatasets but three (see Table 1). Analysis of ungrouped data enabled the ANOVA- and LMEM-based data to be more easily compared.

In order to assess the distribution of the RTs and residuals under each treatment and detect outliers, skewness and kurtosis z-scores were calculated. First, skewness and kurtosis statistics were obtained through R using the psych package (Revelle, 2019). Next, the standard errors (SEs) of the skewness and kurtosis statistics were generated from the sample size (n) using R code based upon the algorithms found in the SPSS manual (Seltman, n.d.). The skewness and kurtosis statistics were then divided by their respective SEs to produce z-scores for the RT distributions in all 1,805 subdatasets. However, for the one study that investigated GJT construct validity, the two studies that could not be accurately re-analyzed, and the inverse transformation data for the study that utilized GLME, residual z-scores could not be recorded, meaning only 1,501 subdatasets contained residual skewness and kurtosis z-scores for further analysis. A threshold of 3.29 (p < .001) was used as a gauge of normality, whereby skewness and kurtosis z-scores below -3.29 or above 3.29 are considered significantly different from zero (Tabachnick & Fidell, 2013). Histograms were also consulted to assess normality.
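The z-score calculation can be sketched as below. This is an illustration in Python, not the authors' R code. Two points are assumptions: the skewness and kurtosis statistics here are the simple moment-based estimators (statistical packages differ in their small-sample corrections, and the estimator variant used in the study is not specified in this section), and the SE formulas are the standard sample-size-based ones documented for SPSS. The function name and example data are hypothetical, and the data must contain more than three values that are not all identical.

```python
import math


def shape_z_scores(values):
    """Return (skewness z-score, kurtosis z-score) for a list of numbers."""
    n = len(values)
    m = sum(values) / n

    # Central moments of order 2, 3, and 4
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    m4 = sum((v - m) ** 4 for v in values) / n

    skew = m3 / m2 ** 1.5          # moment-based skewness
    kurt = m4 / m2 ** 2 - 3.0      # moment-based excess kurtosis

    # Standard errors that depend only on the sample size
    se_skew = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2.0 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))

    return skew / se_skew, kurt / se_kurt


# Hypothetical data: one symmetric set, one with a single extreme value
z_skew_sym, z_kurt_sym = shape_z_scores(list(range(1, 12)))
z_skew_out, z_kurt_out = shape_z_scores([1.0] * 20 + [100.0])

# A z-score outside +/-3.29 flags a significant departure from normality
flagged = abs(z_skew_out) > 3.29
```

Because the SEs shrink as n grows, large RT datasets can exceed the 3.29 threshold even when the departure from normality is visually modest, which is one reason to consult histograms as well.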

In total, four distribution statistics were calculated for the majority of the 1,805 subdatasets: skewness and kurtosis z-scores for the RTs, and skewness and kurtosis z-scores for the residuals. The effect of each of the 95 treatments on each of the four statistics was assessed individually using descriptive statistics and boxplots. If an outlier treatment displayed a strong effect on the z-scores, it could be considered as being either beneficial or detrimental for achieving a normal distribution.

Data affected

For each of the 1,805 subdatasets, the percentage of data points excluded or altered by the data cleaning methods was recorded. This resulted in three values, representing the lower and upper tails of the distributions along with the combined total.
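The three values can be computed directly from the boundaries. The following Python sketch is illustrative, with a hypothetical function name and example data; it assumes the lower and upper cutoffs have already been determined for the dataset in question.

```python
def percent_affected(values, lo, hi):
    """Percentage of values below lo, above hi, and in either tail."""
    n = len(values)
    lower = 100.0 * sum(v < lo for v in values) / n
    upper = 100.0 * sum(v > hi for v in values) / n
    return lower, upper, lower + upper


# Hypothetical example: ten values with cutoffs at 2 and 9
lower_pct, upper_pct, total_pct = percent_affected(
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], lo=2, hi=9
)
```

The same percentages apply whether the flagged points are trimmed (excluded) or winsorized (altered), because both methods act on the same out-of-bounds observations.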

Data Analysis

Once the re-analysis was completed, a preliminary investigation of the data revealed some critical information that resulted in a number of subdatasets being removed from further analysis. First, although the log e and log 10 transformations produced different RT values, the statistics produced by the re-analyses for the two transformations were identical. For the final analyses, the subdatasets representing the log 10 results were thus discarded, and the log e results were used to represent log transformations. Second, boxplots revealed no meaningful difference between trimming and winsorizing with regard to any of the outcome statistics recorded (see Figure 2). Although the trimming condition produced negligibly superior results in terms of normalizing RT and residual distributions, the winsorized results were used in the subsequent analyses because winsorizing keeps potentially legitimate observations in a dataset, whereas trimming blindly removes them (Baayen, 2008). Third, subdatasets utilizing asymmetric boundaries in conjunction with an inverse transformation were removed, because this type of transformation resulted in both positive and negative skew. With negatively skewed data, using the asymmetrical boundaries resulted in leniency on the negative side of the distribution, which contradicted the reason for using such boundaries. Finally, inverse transformation results were removed for the study that utilized a GLME model, because lme4 was unable to process negative values for this type of analysis. Once all of these datasets were culled, 665 subdatasets remained for analysis.

Figure 2. Boxplots comparing the effects of trimmed and winsorized data on distribution statistics.
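The equivalence of the two logarithmic transformations reported above is expected on mathematical grounds: log10(x) equals ln(x) divided by the constant ln(10), so the two transformed sets differ only by a linear rescaling, and any statistic computed on standardized values (SD-based boundaries, skewness, kurtosis) is unaffected. A minimal Python sketch with hypothetical RTs illustrates this; it is not drawn from the authors' code.

```python
import math
from statistics import mean, pstdev


def standardize(values):
    """Convert values to z-scores using their own mean and SD."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


# Hypothetical reading times in milliseconds
rts = [250.0, 400.0, 650.0, 1000.0, 2500.0]

z_ln = standardize([math.log(x) for x in rts])
z_log10 = standardize([math.log10(x) for x in rts])

# Largest discrepancy between the two sets of z-scores (floating-point noise only)
max_gap = max(abs(a - b) for a, b in zip(z_ln, z_log10))
```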

After these changes were made, the statistics listed above were analyzed using descriptive statistics, histograms, and boxplots. Because a number of the variables and distributions were nonnormal, medians and IQRs were reported as measures of central tendency and dispersion. Cohen's d_av effect sizes, which are appropriate for within-group comparisons (Lakens, 2013), were calculated to determine the magnitude of a treatment's effect in comparison with the raw data and other treatments. Cohen's d_av utilizes averaged SD values as a denominator; therefore, effect sizes were only calculated for distributions with skewness and kurtosis z-scores between -3.29 and 3.29 (see Supplementary Materials D, Tables 2–5). Effect sizes were interpreted according to Plonsky and Oswald's (2014) within-group guidelines, whereby 0.60, 1.00, and 1.40 represent generally small, medium, and large effects in L2 research. However, caution should be exercised when using such guidelines for RT data, because d values in psychology tend to be somewhat smaller (Brysbaert, 2019).
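As a reminder of how d_av is formed, the mean difference between two sets of values is divided by the average of their two SDs (Lakens, 2013). The Python sketch below is illustrative only; the function name and example values are hypothetical, and the sample SD (n - 1 denominator) is assumed.

```python
from statistics import mean, stdev


def cohens_d_av(a, b):
    """Mean difference divided by the average of the two sample SDs."""
    return (mean(a) - mean(b)) / ((stdev(a) + stdev(b)) / 2.0)


# Hypothetical values for two treatments applied to the same datasets
treatment_a = [2.0, 4.0, 6.0]
treatment_b = [1.0, 3.0, 5.0]
d_av = cohens_d_av(treatment_a, treatment_b)
```

Because the denominator is built from SDs, the statistic is only meaningful when the SD is a reasonable summary of spread, which is consistent with restricting the calculation to approximately normal distributions.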