Over the last 40 years, researchers in special education have made substantial contributions in developing and testing a variety of instructional approaches and organisational structures to help children with additional needs to better achieve at school (Hanley-Maxwell & Bottge, 2006). Notwithstanding these contributions, changes in legislation, and increased awareness of inclusion of students with special needs (Dempsey, 2014), substantial numbers of these students encounter poor postschool outcomes. The research consistently shows that children who are unable to show proficiency in basic academic and social skills are at considerable risk of ongoing limitations in their future opportunities (National Early Childhood Technical Assistance Center, 2011; Schaeffer, Petras, Ialongo, Poduska, & Kellam, 2003). However, as Kauffman and Lloyd (2011) correctly note, statistical and mathematical realities mean that there will always be a group of school students who perform substantially less well than their peers.
Challenges in Assessment of Special Education Outcomes
Regardless of these statistical certainties, interest in the efficacy of expensive public education programs, including special education programs, has increased in recent times. Identification and use of evidence-based practice (Slavin, 2002) and the development of implementation fidelity (Fixsen, Blasé, Metz, & Vandyk, 2013) have emerged as potential strategies to maximise student outcomes. Allied with this, measurement of national student academic outcomes has continued apace in most developed countries, including Australia. In this country, the National Assessment Program – Literacy and Numeracy (NAPLAN; Australian Curriculum, Assessment and Reporting Authority, 2015) allows broad longitudinal conclusions to be reached at school, state, and national levels with regard to students’ academic skills. However, there are several reasons why NAPLAN does not permit conclusions to be reached about the academic skills of students with additional needs. NAPLAN results cannot be disaggregated into groups of students with and without additional needs, and a substantial proportion of students with additional needs do not take NAPLAN tests (Dempsey & Davies, 2013).
These logistical problems with national assessment are compounded by the difficulties associated with evaluating the effectiveness of special education via large-scale experimental studies. Randomised controlled trials (RCTs), regarded as the gold standard in efficacy research, are generally impossible to run in special education contexts because of the diversity of characteristics of students with additional needs and, more importantly, because withholding access to special education support from a control group of students would be unethical and likely illegal. Often the best that can be achieved in special education experimental research is to draw conclusions about the effectiveness of an approach for a group of participants in a particular situation (Carter & Wheldall, 2008).
These limitations have not prevented some researchers from attempting to determine the effectiveness of special education. However, virtually all existing studies have substantial methodological flaws that include the lack of adequately matched treatment and control groups, groups matched on a limited range of covariates, and a reliance on cross-sectional designs. A further limitation in this area is that replication research is relatively rare in special education (Travers, Cook, Therrien, & Coyne, 2016). Although written three decades ago, Tindal's (1985) remark still holds: ‘The only conclusion that can be made at this time is that no conclusion is yet available about special education efficacy. . . . Without sound and valid methodology, the question of effectiveness is simply not worth asking’ (p. 109).
Propensity Score Analysis
The difficulties of conducting RCTs are not limited to education and special education. Other human sciences also experience such problems, and so in the last 15 years many researchers have turned to propensity score analysis (PSA) to reduce imbalance in important covariates between groups (selection bias) and to allow contrasts between naturally occurring experimental and control groups. The control group is a subset of the untreated group whose members display very similar likelihoods of experiencing the intervention because of their observed characteristics (Austin, 2011). First proposed by Rosenbaum and Rubin (1984), PSA is a procedure intended to provide an unbiased estimate of treatment outcomes by reducing the confounding effects of covariates and consequently increasing confidence that differences in dependent variables across groups are due to the treatment. Fundamental to PSA is the calculation of the propensity score for all participants. The propensity score is ‘. . . the conditional probability of receiving the treatment given the observed covariates’ (Rosenbaum, 2002, p. 296). A wide range of covariates with known relationships with the treatment should be used in this calculation. When participants are grouped into treatment and non-treatment groups (e.g., children receiving and not receiving special education support), a logistic regression analysis (with treatment as the dependent variable and covariates as independent variables) allows the predicted probability of treatment to be saved from the analysis. This probability serves as the propensity score for each participant.
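The calculation described above can be illustrated with a minimal sketch. It is not the procedure used in any of the studies cited here: the data are synthetic, the two covariates (`male`, `ses`) and the helper names `fit_logistic` and `propensity_scores` are hypothetical, and the model is fitted by plain gradient descent rather than by a statistical package.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, t, lr=0.5, epochs=500):
    """Fit a logistic regression of treatment on covariates by batch
    gradient descent. Returns [intercept, coef_1, ..., coef_k]."""
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, ti in zip(X, t):
            err = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) - ti
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def propensity_scores(X, w):
    """Predicted probability of treatment for each participant."""
    return [sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi))) for xi in X]


# Synthetic, hypothetical data: a binary covariate and a standardised covariate.
random.seed(0)
X = [[float(random.random() < 0.5), random.gauss(0.0, 1.0)] for _ in range(300)]
# Assumed treatment mechanism, used only to generate illustrative data.
t = [int(random.random() < sigmoid(-1.5 + 1.0 * male - 0.8 * ses)) for male, ses in X]

w = fit_logistic(X, t)
scores = propensity_scores(X, w)  # one propensity score per participant
```

Each entry of `scores` is the estimated probability that the corresponding participant receives the treatment given the two covariates; in practice a much wider set of covariates would be used.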
The next step in PSA involves matching participants who did and did not receive treatment on their propensity scores. A number of different matching methods are available to be used with the goals of matching participants with adequately similar propensity scores and either eliminating or substantially reducing significant imbalance in covariates across matched groups. The final step in PSA uses standard bivariate and multivariate analyses to assess the magnitude of differences in effect across treatment and control groups (Caliendo & Kopeinig, 2008).
There has been conjecture that PSA offers no substantial advantages over traditional multivariate regression methods (Stürmer, Schneeweiss, Avorn, & Glynn, 2003). However, reviews demonstrate that, when the conditions warrant, PSA should be the preferred method (Glynn, Schneeweiss, & Stürmer, 2006). Winklemayer and Kurth (2004) noted that ‘. . . if the outcome is rare relative to the number of confounders and the number of study subjects in the smaller exposure group is sufficiently large to warrant multivariable PS estimation, then this statistical technique has a . . . role to potentially reduce bias’ (p. 1673).
PSA Studies in Special Education
Several PSA studies related to special education have been published in the last decade and each is now briefly reviewed. In the first of these, Morgan, Frisco, Farkas, and Hibel (2010) used PSA to develop two matched groups of students from the Early Childhood Longitudinal Study who were receiving (n = 363) and not receiving (n = 5,995) special education services in schools in the United States (US). Propensity scores were derived from 35 covariates and students’ school placements ranged from regular classrooms with assistance, to brief class withdrawal from the regular class, and to special school placement. The results of this study were not consistent across study outcomes. Special education services had either a negative or a statistically nonsignificant effect on children's learning and behaviour. However, special education services did provide a small, positive effect on children's learning-related behaviours (i.e., remaining attentive, persistence at tasks, being organised).
The second study also made use of the same early childhood US longitudinal database with a younger cohort (N = 8,000), the Early Childhood Longitudinal Study – Birth Cohort (Sullivan & Field, 2013). Over 30 covariates were used with propensity score weighting methods to generate two matched groups of young children who either did or did not receive special education services. The results demonstrated that receiving special education support had significant moderate negative effects on children's reading and mathematics skills.
The final PSA study in special education is that reported by Dempsey, Valentine, and Colyvas (2016), which used the Longitudinal Study of Australian Children database. Eight different PSA matching methods were used with young school children receiving (n = 291) and not receiving (n = 1,926) special education services. Again, students’ school placements included regular classrooms with assistance, brief class withdrawal from the regular class, and special school placement. Across all eight matching methods, the group of children receiving special education assistance performed significantly less well than their matched peers not receiving such support in literacy, numeracy, and their behavioural and social skills. The effect size of this difference ranged from large to small across outcome measures.
Taken together, these three studies suggest that special education services may not be bringing expected benefits to children over and above the benefits they might experience in regular classrooms without special education support. However, the pool of PSA studies in special education is very small and the studies cover educational jurisdictions with quite different legislative and school delivery systems, which makes generalisation of these results imprudent at this time. A further limitation of the existing work in this area is that the studies report outcomes for all children receiving special education support. Although conclusions about the efficacy of special education for the total group of children receiving those services may be helpful, they do not permit conclusions to be reached about the effectiveness of special education for particular groups of students with additional needs or for students receiving special education support across different settings.
The present study sought to make a contribution to our limited knowledge base on PSA and special education by replicating the study by Dempsey et al. (2016) with a second cohort of school students from the same database. The second goal of the study was to determine if the average treatment effects (ATEs) of special education support for the specific subgroup of students with learning disability/learning problems (students with literacy and/or numeracy problems but without a diagnosis of developmental disability) were consistent with the ATE of all students with special needs.
Method
Participants
The Longitudinal Study of Australian Children (LSAC) began recruitment in 2004 of over 10,000 children and their families and teachers in a stratified random sample from the Medicare (national healthcare system) database. The first wave of data collection involved approximately equal numbers of children in two cohorts of 0–1 (birth cohort) and 4–5 years of age (kindergarten cohort). LSAC has collected data from participants every 2 years and later data collection waves are planned. The purpose of LSAC is to permit examination of the interaction between a variety of social and environmental variables and childhood development (Australian Institute of Family Studies, 2015a). Information on children's physical and mental health, their education, and social, cognitive, and emotional development is being collected from parents, carers, and teachers, and from the children themselves. Specific detail on overall response rates and response rates from subpopulations are available in several LSAC technical papers (Australian Institute of Family Studies, 2015b).
In this paper we report on data collected during the period mid-2004 (Wave 1) to mid-2012 (Wave 5). In particular, this paper relates to the birth cohort of study children (SC) who were 8 or 9 years of age in 2012. The SC included in this research were those reported as receiving or not receiving special education support in 2010 and for whom data were available for all included covariates and for all 2012 measures of children's learning and behaviour (n ranged from 1,835 to 1,857 depending on the outcome measure).
Study Outcome Measures
LSAC data are collected by parent interview, parent questionnaire, SC interview, and teacher questionnaires (Australian Institute of Family Studies, 2015c). The four outcome measures used in the present research were two measures of child learning (literacy and numeracy), and two measures of the child's social/emotional development (behaviour problems and prosocial skills). These four measures were completed by the teacher of the SC.
The literacy and numeracy measures were an adapted version of the Academic Rating Scale (ARS) that was developed for the US Early Childhood Longitudinal Study, Kindergarten Cohort (National Center for Education Statistics, 2008). There were 10 Wave 5 literacy items of increasing complexity that included ‘contributes relevant information to classroom discussion’ and ‘able to write sentences with more than one clause’. The eight numeracy items ranged from ‘can continue a pattern with three items’ to ‘uses a variety of strategies to solve maths problems’. Teachers rated each SC for each skill on a 5-point scale from not yet displayed to proficient. Rasch-modelled literacy and numeracy scores, which are standardised measures taken at Wave 5 (2012), were used in this paper. Higher scores indicated higher academic skills.
The language and literacy section of the ARS has a moderate correlation (.34) with the Peabody Picture Vocabulary Test (Dunn & Dunn, 2007) and the correlation between the numeracy and the literacy sections was high (.82) in LSAC Wave 3. Internal reliability (Cronbach's α) of both components of the ARS ranged from .95 to .97 in Wave 3 (Australian Institute of Family Studies, 2005).
The study measures of social and emotional behaviour were derived from the Strengths and Difficulties Questionnaire (SDQ; Goodman, 1997). The SDQ is a widely used 25-item scale with good psychometric properties (Becker, Woerner, Hasselhorn, Banaschewski, & Rothenberger, 2004; Hawes & Dadds, 2004). The instrument subscales measure the level of conduct behaviour problems, difficulties with peer relationships, hyperactivity, and emotional difficulties, as perceived by the teacher. These are typically combined into a total SDQ score that is a measure of the extent of behavioural difficulties, with higher scores indicating a higher level of behaviour problems. A further subscale measures a range of prosocial or appropriate behaviour with higher scores showing a higher level of prosocial skills. The prosocial skills scale and the total SDQ score measured at Wave 5 (2012) were the behaviour outcome measures used in this study.
Receiving Special Education Support
The Wave 4 (2010) teacher questionnaire included an item, ‘Does this child receive any specialised services provided within the school because of a diagnosed disability or additional need?’ A SC was regarded as receiving special education support in 2010 if their teacher responded ‘yes’ to this question. Given that some of the children in this cohort were in their second year of school, this special education support may have been provided to some children for over 12 months.
A subsequent question asked, ‘What is the main reason that this child requires additional assistance or specialised services to enable them to succeed in the regular school program?’ The 11 response options for this item were intellectual and physical disability; hearing, sight, or speech/language impairment; learning disability/learning problems in reading and maths; emotional/behavioural problems; poor understanding of Australian English/ESL; autism spectrum disorder; and giftedness. In the Australian context, children with poor understanding of Australian English/ESL and gifted students are not regarded as having special needs (Foreman, 2014). Consequently, these students (n = 28) were excluded from the overall treatment group (ALL group n = 257) but were included in the contrast group of children.
As the second goal of the present study was to compare results for different groups of SC with special needs, a separate treatment group was identified comprising only SC with learning disability/learning problems in reading or maths (LD group n = 148). The second set of PSA analyses compared the LD group with a matched group not receiving additional support. The 109 SC from the ALL group who received additional support for intellectual and physical disability, hearing, sight, or speech/language impairment, emotional/behavioural problems, or autism spectrum disorder were excluded from this analysis. The rationale for this approach relied on several facts about students with learning disability/learning problems. In comparison to other special education needs groups, there is a relatively high incidence of students with learning disability/learning problems (Banks & McCoy, 2011; Dempsey & Davies, 2013), and their support needs are quantitatively and qualitatively different to other students with additional needs, such as students with intellectual, physical and sensory disability, students with autism spectrum disorder, and students with emotional and behavioural problems (Lane, Carter, Pierson, & Glaeser, 2006).
The nature of data collected by the teacher and parent questionnaires in Waves 4 and 5 did not permit further differentiation of services. For example, the LSAC database does not allow meaningful conclusions to be made about the location of delivery of special education services (i.e., regular classroom, segregated classroom within a regular school, or a special school), or the duration or intensity of that support (e.g., teacher aide assistance, short-term withdrawal from class).
Predictors of Special Education Services
Twenty-two covariates were considered for inclusion in the present study to model a child's likelihood (propensity) to receive special education support in 2010. A combination of theoretical studies (Kavale, 1988; van Kraayenoord & Elkins, 2004) and empirical research (Delgado & Scott, 2006; Donovan & Cross, 2002; Louden et al., 2000) was used to select variables associated with the use of special education services. Child and family demographic variables were SC gender and age, birth weight, whether birth was premature or was a multiple birth, whether the SC had repeated a school year, a physical health index, whether the child had a medical condition or disability for at least 6 months in 2006, the remoteness of the family home, and socioeconomic status. Parental variables were whether the primary parent was living with a partner, the extent of the parent's school involvement and frequency of homework checking, how far they thought their child would go with their education, and the parent's level of alcohol consumption. Also included in this category were the parent's age, their English-language proficiency, whether they were of Aboriginal or Torres Strait Islander origin, and two measures of their parenting skills: angry and consistent parenting scales. Finally, the 2010 teacher variables of teacher qualifications and years of teaching experience were added as covariates. See Table 1 for a full list of these covariates. With the exception of teacher experience, teacher qualifications, whether the SC had repeated a year of school, and the extent of parent–school involvement, all covariate data were collected from the primary parent (typically the mother) of the SC.
Note. SC = study children; ATSI = Aboriginal and Torres Strait Islander.
*Indicates differences are statistically significant.
PSA Procedure
The PSA procedures used in this study are those recommended by Guo and Fraser (2010) and are consistent with those reported in Dempsey et al. (2016). In brief, PSA was used to examine the effect on each of four outcome measures of receiving additional services for the two groups of SC (ALL and LD) receiving special education (treated) in comparison to matching groups of SC who did not receive those services (not treated).
Raw LSAC data were screened for missing data and working datasets created using SAS/STAT Version 9.3. Bivariate relationships between receiving special services and each of the potential covariates were tested using SPSS Version 21 (see Table 1). This approach identified which covariates were statistically associated with receiving services (are imbalanced and are therefore potential sources of selection bias).
Before beginning PSA analyses, t tests and OLS regression were conducted for each of the outcome measures to provide comparisons with the different methods of PSA that followed. All PSA methods calculated the ATE, which is a measure of the differences in outcomes for those SC who received the special services compared to a ‘corresponding’ set of SC who did not receive the services (on an ‘intention to treat’ basis; Guo & Fraser, 2010, p. 47). The method of calculating the propensity score and its appropriate matching process determine this corresponding set of students.
PSA methods assume that the distributions of the propensity scores for treated and untreated participants overlap and therefore share sufficient common scores, or a common support region (overlap assumption), from which to draw matching SC. Each PSA method used all or most of the treated SC and a selection of the untreated SC according to the rules and assumptions of the individual methods and the options to trim a percentage of the SC with the ‘weakest’ matching (Guo & Fraser, 2010). Unless stated otherwise, PSA procedures were conducted using Stata (Stata Corporation, 2011).
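One simple way to picture a common support region is the minimum/maximum comparison sketched below. This is an illustration only, under the assumption that the region is taken as the interval where both score ranges overlap; the function names and the example scores are hypothetical, and the trimming rules of the individual Stata procedures may differ.

```python
def common_support(ps_treated, ps_untreated):
    """Return (low, high) bounds of the interval covered by both groups'
    propensity scores (minimum/maximum comparison)."""
    low = max(min(ps_treated), min(ps_untreated))
    high = min(max(ps_treated), max(ps_untreated))
    return low, high


def trim_to_support(scores, low, high):
    """Keep only the scores that fall inside the common support region."""
    return [s for s in scores if low <= s <= high]


# Hypothetical propensity scores for a handful of participants.
treated = [0.15, 0.30, 0.55, 0.80]
untreated = [0.02, 0.05, 0.10, 0.20, 0.35, 0.60]

low, high = common_support(treated, untreated)
treated_kept = trim_to_support(treated, low, high)
untreated_kept = trim_to_support(untreated, low, high)
```

In this toy example the treated participant with a score of 0.80 has no untreated counterpart with a comparably high score and falls outside the region, as do the untreated participants with the lowest scores.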
The first step in the PSA analysis, estimation of the conditional probability of receiving special services, was conducted by logistic regression in order to specify the functional form of the covariate for the propensity score model. The propensity score was then calculated using the logit of the probability. Matching (resampling) was conducted using four greedy matching procedures: nearest neighbour with callipers of 0.25 standard deviations of the propensity score and 0.1, and Mahalanobis distances with and without propensity score (PSA methods 1–4). No higher order or interaction terms were considered for these first four PSA methods. Postmatching analysis utilised the t test on these matched SC to calculate the ATE of the special services intervention.
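A greedy nearest neighbour match within a calliper can be sketched as follows. This is a simplified illustration rather than the Stata routine used in the study: it assumes one-to-one matching without replacement, processes treated participants in the order given, and sets the calliper to 0.25 standard deviations of the pooled logit of the propensity score. The function name and example scores are hypothetical.

```python
import math
import statistics


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def greedy_nn_match(ps_treated, ps_untreated, caliper_sd=0.25):
    """Greedy 1:1 nearest neighbour matching without replacement.

    Each treated participant, in turn, takes the closest remaining
    untreated participant on the logit of the propensity score, provided
    the distance is within the calliper. Returns a list of
    (treated_index, untreated_index) pairs."""
    lt = [logit(p) for p in ps_treated]
    lu = [logit(p) for p in ps_untreated]
    caliper = caliper_sd * statistics.stdev(lt + lu)  # assumed pooled SD
    available = list(range(len(lu)))
    pairs = []
    for i, a in enumerate(lt):
        if not available:
            break
        j = min(available, key=lambda k: abs(a - lu[k]))
        if abs(a - lu[j]) <= caliper:
            pairs.append((i, j))
            available.remove(j)  # matched controls cannot be reused
    return pairs


# Hypothetical propensity scores.
treated = [0.30, 0.50, 0.90]
untreated = [0.10, 0.28, 0.33, 0.52, 0.60]
pairs = greedy_nn_match(treated, untreated)
```

Here the first two treated participants are paired with the untreated participants scoring 0.28 and 0.52, while the treated participant scoring 0.90 has no untreated neighbour inside the calliper and is left unmatched. A greedy procedure never revisits an earlier pairing, which is the feature that optimal matching is designed to relax.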
A fifth PSA method used Generalised Boosted Modelling (GBM) in Stata to create the propensity score followed by various optimal matching techniques and postmatching analysis via the Hodges–Lehmann aligned rank test. The key advantage of this GBM regression tree method is that the functional form of the covariates or interactions does not need to be specified, but is tested within the modelling process up to order 4 interaction (Guo & Fraser, 2010, p. 143). Several boosted regression models were created using different proportions of training data, and Test R squared was used to determine the most appropriate model. GBM does not estimate regression coefficients, but calculates the relative predictive influence of each covariate. Highly influential covariates would indicate imbalance between the treated and untreated SC in this multivariate regression. The optimal matching procedures were conducted in R (R Core Team, 2013).
Optimal matching reduces the chance of poor matching where the propensity score difference between matched subjects is large, increases the chance of desirable matching where the difference is minimised (Rosenbaum, 1989), and so may be more robust against violations of overlap (Guo & Fraser, 2010, p. 213). The Stata imbalance command was used to evaluate whether the optimal matching balanced an observed covariate between those receiving (treated) or not receiving special services (untreated) and to calculate the ATE and Cohen's d effect size. The Hodges–Lehmann aligned rank test used the hodgesl Stata command (Guo & Fraser, 2010, p. 18) to gauge statistical significance. Calculation of confidence intervals is not included in this procedure.
All of the above five methods are three-step propensity score analyses (i.e., calculation of propensity score, followed by appropriate matching techniques, and postmatching analyses). Common support regions (overlap) may or may not cover the whole range of study participants, but the key objective is to make the two groups of participants (those receiving special services and those who are not) as much alike as possible in terms of their estimated propensity score.
Depending on the extent to which covariate bias was reduced or eliminated with these five PSA methods, consideration was then given to using three additional methods. In the first of these methods, the propensity scores are used as sampling weights to improve the representativeness of treated and non-treated SC (McCaffrey, Ridgeway, & Morral, 2004). The seventh and eighth potential methods of PSA used kernel-based matching estimators to conduct a latent matching, using nonparametric local linear (Heckman, Ichimura, & Todd, 1998, p. 131) and Epanechnikov kernel regression (Guo & Fraser, 2010, p. 255).
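The idea of using propensity scores as sampling weights can be sketched with a generic inverse-probability-weighted estimate of the ATE. This is an assumption-laden illustration, not the specific weighting procedure of McCaffrey et al. (2004): treated participants are weighted by 1/p and untreated participants by 1/(1 − p), the weights are normalised within each group, and the function name and data are hypothetical.

```python
def ipw_ate(outcomes, treated, scores):
    """Normalised inverse-probability-weighted ATE estimate.

    outcomes: outcome value per participant
    treated:  1 if the participant received the treatment, else 0
    scores:   estimated propensity score per participant (0 < p < 1)"""
    rows = list(zip(outcomes, treated, scores))
    # Treated participants are weighted by 1/p, untreated by 1/(1 - p).
    w1 = [(1.0 / p, y) for y, t, p in rows if t]
    w0 = [(1.0 / (1.0 - p), y) for y, t, p in rows if not t]
    mean1 = sum(w * y for w, y in w1) / sum(w for w, _ in w1)
    mean0 = sum(w * y for w, y in w0) / sum(w for w, _ in w0)
    return mean1 - mean0


# Hypothetical outcomes, treatment indicators, and propensity scores.
y = [2.0, 3.0, 4.0, 5.0]
t = [1, 1, 0, 0]
p = [0.5, 0.25, 0.5, 0.2]
ate = ipw_ate(y, t, p)
```

Participants who were unlikely to end up in the group they are actually in receive larger weights, so each weighted group stands in for the whole sample.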
Results
Following data screening, the parent variable of proficiency in spoken English was dropped from further analyses because of a large proportion of missing cases. Table 1 shows the relationships between the remaining 21 covariates and SC receiving and not receiving additional support for the ALL group. For this special education needs group there were six covariates with significant associations with receiving support. These covariates were unbalanced and would likely lead to selection bias. For the ALL group, SC who received special education services at age 6/7 were more likely to be male, to have repeated a year, to come from a lower socioeconomic background, and to have a lower physical health index, less parent school involvement, and lower parent expectations about how far they would progress in their education. Although not shown in Table 1 for reasons of conciseness, for the LD group, SC receiving additional support were more likely to be male, have repeated a school year, have a lower physical health index, and have parents with lower expectations about their child's education. The presence of these covariates with statistically significant associations with treatment showed that there was substantial imbalance of covariates in the dataset and that without adequate matching procedures any attempts to determine the effectiveness of special education services could be biased.
There were also significant differences between the LD group and the group of SC receiving support who did not have LD across all the outcome measures, except numeracy, at the start of the intervention in 2010. The LD group (M = 2.63, SD = 0.45) had lower literacy skills than the non-LD group, M = 2.90, SD = 0.71, t(249) = –3.61, p < .001, d = 0.45, although there was no significant difference in their maths skills. The non-LD group had higher levels of behaviour problems (M = 12.11, SD = 6.98) than the LD group, M = 8.48, SD = 5.25, t(249) = 4.71, p < .001, d = 0.59. Finally, SC with LD had higher scores on the measure of prosocial behaviour (M = 7.40, SD = 2.05) than the non-LD group of SC, M = 6.10, SD = 2.71, t(249) = 4.31, p < .001, d = 0.54.
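As a rough check on the effect sizes reported above, Cohen's d can be recomputed from the published means and standard deviations. The sketch below assumes that the standardiser is the square root of the average of the two group variances (group sizes are not used); this is one common convention and may not be the exact formula the authors applied. The function name is hypothetical.

```python
import math


def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Absolute standardised mean difference, using the square root of the
    average of the two variances as the standardiser (an assumption)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return abs(mean_a - mean_b) / pooled_sd


# Means and SDs as reported in the text (LD group first, non-LD second).
d_literacy = cohens_d(2.63, 0.45, 2.90, 0.71)
d_behaviour = cohens_d(8.48, 5.25, 12.11, 6.98)
d_prosocial = cohens_d(7.40, 2.05, 6.10, 2.71)
```

Under that assumption the three values round to 0.45, 0.59, and 0.54, in line with the reported effect sizes.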
The initial step in the first four PSA analyses reported here was the calculation of the propensity score by logistic regression. Figure 1 shows the box plots of the propensity score distributions demonstrating considerable overlap in propensity scores for SC receiving (treated) and not receiving (non-treated) special education support, for the ALL (n = 254) and the LD (n = 147) groups, and for whom all literacy outcome data and covariate data were available. As there were only small differences in propensity score distribution for the four outcomes considered, just the literacy distribution is presented here.
The next step in PSA analyses was the completion of the four greedy matching and boosted regression methods detailed earlier in the paper. As at least two of the nearest neighbour greedy matching methods consistently removed all bias from the dataset for all four outcome variables and for the ALL and LD groups, additional PSA analyses (i.e., the methods using propensity scores as weights and kernel regression) were not conducted. However, the PSA that utilised the boosted regression to calculate the propensity score, followed by the optimal matching procedures, was conducted. For each of the outcomes of literacy, numeracy, behaviour problems, and prosocial skills measured at age 8, Tables 2 to 5 report the number of ALL and LD children for whom data were available in the groups receiving or not receiving special education services. For the five different PSA techniques used, the tables also report estimates of the effect of the special services intervention (ATE), effect size, and confidence intervals (with the exception of boosted regression methods).
Note. SC = study children; LD = learning disability/learning problems; PSA = propensity score analysis; ATE = average treatment effect; CI = 95% confidence interval.
Literacy
For the ALL group, there were 1,857 observations, with 255 of these SC receiving special education support. The t test, with six unbalanced covariates, likely overestimated the difference between the two groups’ literacy skills. Both nearest neighbour PSA methods eliminated covariate bias, and the remaining greedy matching methods (models three and four) left one and two covariates imbalanced, respectively. The boosted regression and optimal matching method showed that physical health, birth weight, and how far parents thought their child would progress with their education were the most influential covariates. However, the variable method using Hansen's equation (5VM3) removed bias from all influential covariates. Across all PSA methods, the students receiving treatment scored around 0.6 points lower on literacy skills than their matched peers not receiving assistance (moderate effect size).
Results for the LD group (n = 147) were similar, with the t test again likely biased. All covariate imbalance was eliminated with both the nearest neighbour and the Mahalanobis methods and, for the 5VM3 optimal matching procedure, one covariate remained imbalanced (teacher experience). The LD group scored about 0.7 points lower on literacy skills than their peers, a large effect size.
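The matching and balance checks described above can be illustrated with a minimal sketch. It is not the procedure or software used in this study: the function names, the caliper value, and the simulated data are all assumptions made for illustration, and the simulated propensity score stands in for one that would normally be estimated (e.g., by logistic or boosted regression). The sketch pairs each treated child with the nearest unmatched comparison child on the propensity score, then compares the standardised mean difference of a covariate before and after matching.

```python
import numpy as np


def greedy_nearest_neighbour(ps, treated, caliper=0.1):
    """1:1 greedy matching without replacement on a propensity score.

    Returns a list of (treated_index, control_index) pairs. Treated cases
    with no unmatched control within the caliper are left unmatched.
    """
    ps = np.asarray(ps, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    available = [int(i) for i in np.flatnonzero(~treated)]
    pairs = []
    for t in np.flatnonzero(treated):
        if not available:
            break
        dists = np.abs(ps[available] - ps[t])
        j = int(np.argmin(dists))
        if dists[j] <= caliper:
            pairs.append((int(t), available.pop(j)))
    return pairs


def standardised_mean_difference(x, treated):
    """Difference in group means divided by the pooled standard deviation."""
    x = np.asarray(x, dtype=float)
    treated = np.asarray(treated, dtype=bool)
    x_t, x_c = x[treated], x[~treated]
    pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2.0)
    return (x_t.mean() - x_c.mean()) / pooled_sd


# Hypothetical data: one covariate that drives selection into treatment.
rng = np.random.default_rng(0)
n = 2000
covariate = rng.normal(size=n)
propensity = 1.0 / (1.0 + np.exp(-(covariate - 2.0)))  # stand-in for an estimated score
treated = rng.random(n) < propensity

pairs = greedy_nearest_neighbour(propensity, treated, caliper=0.05)
t_idx = np.array([t for t, _ in pairs])
c_idx = np.array([c for _, c in pairs])

smd_before = standardised_mean_difference(covariate, treated)
matched_x = np.concatenate([covariate[t_idx], covariate[c_idx]])
matched_flag = np.concatenate([np.ones(t_idx.size, bool), np.zeros(c_idx.size, bool)])
smd_after = standardised_mean_difference(matched_x, matched_flag)

print(f"matched pairs: {len(pairs)}")
print(f"SMD before matching: {smd_before:.3f}, after matching: {smd_after:.3f}")
```

A covariate would be treated as balanced when its standardised mean difference after matching falls below whatever threshold the analyst has set in advance; the threshold itself is a design choice and is not specified here.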
Numeracy
With regard to ALL maths skills (n = 253), the t test estimate of the ATE was based on six imbalanced covariates and is therefore likely to be overestimated. Two of the greedy matching methods removed covariate bias, and overall children receiving treatment scored 0.6 points lower on numeracy skills than their matched peers, a result reflected by the boosted method. This was a moderate effect size.
Similarly for the LD group, all covariate bias was eliminated using the greedy methods and the boosted methods, and the maths ATE was about 0.7 points less for the LD group than their matched group. Again, this was a moderate treatment effect.
Behaviour
There were 1,861 ALL SC included in the analysis, of whom 257 received support. Again, the t test estimate of the ATE was based on six imbalanced covariates and is therefore likely to be overestimated. The two nearest neighbour methods eliminated all covariate bias, and the remaining greedy methods retained one unbalanced covariate (how far parents thought their child would academically progress). Overall, children in the ALL group had behaviour problem scores about 3 points higher than the matched group, and this was a moderate effect size.
For the LD group, every greedy matching PSA method removed all or most covariate bias. On average, the LD group's behaviour score was about 2 points higher (worse behaviour) than that of SC not receiving special education services, a small effect size.
Prosociality
There were 257 ALL SC receiving assistance and 1,604 children not receiving support. The t test, with six imbalanced covariates, showed the ALL group with significantly lower prosocial scores but with a small effect size. Both nearest neighbour greedy PSA methods removed covariate bias and showed SC receiving special education assistance scored, on average, 0.4 points less than matched SC not receiving special education assistance.
For the LD group (n = 148), all five matching methods eliminated covariate bias. However, the ATE estimates for the SC with LD were not significantly different from their matched peers.
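As a worked illustration of how a matched difference, its confidence interval, and an effect size of the kind reported in Tables 2 to 5 might be computed, the following sketch takes outcome scores for matched pairs and returns the mean difference, a 95% interval, and Cohen's d. It rests on several assumptions that are not drawn from this study: the function names are hypothetical, the interval uses a normal approximation, and the small/moderate/large labels follow Cohen's conventional cut-offs of 0.2, 0.5, and 0.8, which may differ from the thresholds applied in the tables.

```python
import numpy as np


def matched_effect(y_treated, y_control):
    """Mean paired difference, normal-approximation 95% CI, and Cohen's d.

    y_treated[i] and y_control[i] are the outcomes of the i-th matched pair.
    """
    y_t = np.asarray(y_treated, dtype=float)
    y_c = np.asarray(y_control, dtype=float)
    diffs = y_t - y_c
    estimate = diffs.mean()
    se = diffs.std(ddof=1) / np.sqrt(diffs.size)
    ci = (estimate - 1.96 * se, estimate + 1.96 * se)
    pooled_sd = np.sqrt((y_t.var(ddof=1) + y_c.var(ddof=1)) / 2.0)
    return estimate, ci, estimate / pooled_sd


def cohen_label(d):
    """Label an effect size using Cohen's conventional cut-offs (an assumption)."""
    size = abs(d)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "moderate"
    return "large"


# Hypothetical outcome scores for five matched pairs.
estimate, ci, d = matched_effect([3.1, 2.8, 3.5, 2.9, 3.0], [3.6, 3.9, 3.4, 3.8, 3.7])
print(f"difference = {estimate:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")
print(f"d = {d:.2f} ({cohen_label(d)})")
```

Under this approach, an interval that includes zero corresponds to a difference that is not statistically significant at the 5% level, as was the case for the prosocial scores of the LD group.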
Discussion
The research reported here had two goals. The first objective was to replicate the PSA analysis completed by Dempsey et al. (Reference Dempsey, Valentine and Colyvas2016) with a different cohort of children. However, as covariate bias was eliminated by using the nearest neighbour and boosted regression PSA methods, the additional methods of using propensity scores as weights and kernel regression were not required in the current study. The second goal was to check if the ATEs for four outcome variables for the group of SC with LD were broadly consistent with the ATEs of the total group of SC receiving special education support.
With regard to the first goal, the results of the present study were consistent with those reported earlier by Dempsey and colleagues (Reference Dempsey, Valentine and Colyvas2016). The total group of children receiving additional support (ALL group) performed less well in literacy, numeracy, behaviour, and prosocial skills in comparison to a matched group of SC not receiving support. Logistic regression and t test estimates of the ATE used unbalanced covariates and are therefore likely to have inflated the differences between the groups. All propensity score analyses gave statistically significant differences between treated and untreated groups. At least two of the greedy matching methods eliminated covariate bias, and the effect size of the unbiased ATEs ranged from small to large depending on the outcome measure under consideration.
There were several findings in relation to the second goal. The LD group of children receiving additional support performed significantly less well than their matched peers in literacy, numeracy, and behaviour outcomes across the biased t test and regression analyses, as well as the propensity score analyses. Covariate bias was eliminated in at least four of the five propensity score analyses. For prosocial skills, there was no significant difference between the two groups. All these results mean that, for both the ALL and the LD groups, the provision of special education support appeared to provide no benefits in terms of improvements in their academic skills and behaviour in comparison to matched peers who did not receive additional support.
In conjunction with the three other studies using PSA methods to examine the effectiveness of special education that were reviewed in the introduction, this study has assisted in building a consistent evidence base that special education may not be providing the outcomes expected of it in Australia and the US. The word ‘may’ is used judiciously here because there are some limitations in the research design used by Dempsey et al. (Reference Dempsey, Valentine and Colyvas2016) and in the present study that need to be acknowledged. The mid-year timing of LSAC data collection, and the fact that some of the SC were in their second year of school in 2010, mean that children in the treatment groups had been receiving special education services for varying lengths of time. Furthermore, the LSAC database does not provide detailed information on the intensity or type of special education support provided. For example, it is not possible to draw conclusions about the extent to which the duration, location, and intensity of special education services may be associated with the research findings. A final limitation is that low cell counts for some groups of students with additional needs (e.g., students with hearing impairment, visual impairment, or physical disability) did not permit differential analysis for these groups of SC. Given the specialised equipment and technologies used with these children in special education settings, it may be that additional supports do indeed provide demonstrable benefits for these children over and above what they may receive in the regular classroom.
Regardless of these limitations, the present findings must be of concern for special education professionals and for educational administrators. Beyond evidence that special education teaching strategies and technologies are effective in highly controlled environments, in cross-sectional studies, or in longitudinal studies with biased comparison groups (Kavale & Dobbins, Reference Kavale and Dobbins1993), the discipline of special education lacks confirmation that it is effective for the majority of students with additional needs (Carter & Wheldall, Reference Carter and Wheldall2008).
In a helpful discussion of the stages of programs of educational research, Sam Odom and colleagues (Reference Odom, Brantlinger, Gersten, Horner, Thompson and Harris2005) note that such research logically progresses through four steps: first, preliminary ideas, hypotheses, and pilot studies; second, controlled laboratory and classroom-based experiments; third, randomised trials; and finally, informed classroom practice. The difficulty of conducting RCTs in special education no doubt explains why special education research has largely bypassed the third step in its goal to improve outcomes for students with additional needs. Nevertheless, without evidence from controlled trials (or from quasi-experimental methods, such as PSA, that control for bias), claims about the effectiveness of special education for the majority of students with additional needs are unjustified.
At the moment, special education in developed countries is maintained by philosophical arguments and legislative requirements (Foreman, Reference Foreman, Foreman and Arthur-Kelly2014) rather than by the kind of evidence base that sustains professions such as medicine, nursing, and the sciences. Special education is not alone in lacking solid research evidence that it is effective, over and above what may be provided in regular settings. For example, a variety of social programs in the criminal justice system lack a sound research base (Australian Institute of Health and Welfare, 2013). However, without evidence that the special education system is effective, the profession leaves itself open to accusations that special education acts as little more than a form of respite for regular education.
The statistical realities identified by Kauffman and Lloyd (Reference Kauffman, Lloyd, Kauffman and Hallahan2011) mean that there will always be a group of students in the education system who perform substantially less well than their peers. However, the research reviewed and reported in this paper suggests that the current special education services provided to these Australian students deliver poorer outcomes than regular classroom teaching. There is a range of potential explanations for this situation. It may be that the relatively poorer outcomes in special education settings are related to the skill base of special educators and the ineffectiveness of preservice and inservice special education training. The inconsistent fidelity of special education support within and across schools may also contribute to the apparent ineffectiveness of special education. It could also be the case that the quality of teaching in special education settings differs little from that in regular education settings. A final possible explanation may be that the administrative and support structures in special education offer no advantages over those that exist in regular schools.
Acknowledgements
This report makes use of data from Growing Up in Australia: the Longitudinal Study of Australian Children (LSAC). LSAC is conducted in partnership with the Department of Social Services (DSS), the Australian Institute of Family Studies (AIFS), and the Australian Bureau of Statistics (ABS), with advice provided by a consortium of leading researchers. Findings and views expressed in this publication are those of the individual authors and may not reflect those of the AIFS, DSS, or the ABS.