Listening comprehension refers to the ability to extract meaning from spoken language (Snowling & Hulme, 2005). It implies the ability to understand and construct a mental representation of a text's message without the demands of decoding. Reading comprehension theories do not establish a clear distinction between the comprehension of oral and written language (Cain & Oakhill, 2008), considering that their product is similar, i.e., a coherent mental representation of the text's message (Kintsch, 1998), and that the nature of the comprehension process itself is generally the same (Hess, 2007).
The Simple View of Reading model highlights listening comprehension, in addition to decoding, as a main contributor to explaining variation in reading comprehension (Hoover & Gough, 1990). Consequently, listening comprehension skills are crucial for reading comprehension (Torgesen, 2002). Listening comprehension has been reported as a significant predictor of reading comprehension (Berninger & Abbott, 2010; Vellutino, Tunmer, Jaccard, & Chen, 2007); however, greater attention has been given to the construction of reading comprehension tests than to listening comprehension assessments (see the “Mental Measurements Yearbook” series, from 1985 [the ninth yearbook] to 2014 [the nineteenth yearbook]).
Listening comprehension tests focused on text comprehension are more frequently included in informal reading batteries or inventories (e.g., Johns, 2008; Leslie & Caldwell, 2006; Shanker & Cockrum, 2009). Usually, to assess text comprehension, the tester reads a text aloud and asks literal and inferential questions for the students to answer, or asks the students to retell the text. Their listening comprehension level is the “highest grade level of material that can be comprehended well when it is read aloud to the student” (Harris & Hodges, 1995, p. 140), and it is generally rated according to the percentage of correct answers and a qualitative analysis of the responses to the comprehension questions and/or the story retell. Assessing listening comprehension with standardised instruments is important for a better understanding of reading comprehension difficulties, as it helps to verify whether those difficulties are related to deficits in decoding, language comprehension or both (Nation, 2005).
The assessment of performance with different tests across grades allows the comparison of a student’s performance to that of a normative group. However, it does not enable psychologists to monitor intra-individual differences, because each test’s performance estimates are on a different metric. Accordingly, student results are not directly comparable. One way of dealing with this limitation is to construct a test form for each grade and place all forms on a single metric. This procedure is based on Item Response Theory and is known as vertical scaling (de Ayala, 2009). Within this approach, the use of a non-equivalent groups with anchor test design is the most appropriate (de Ayala, 2009; Kolen & Brennan, 2010). The anchor (common) item parameter estimates are used in the estimation of the item parameters in each grade, placing each form of the same test on a common scale.
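As an illustration only, the logic of placing two adjacent-grade forms on a common scale via shared anchor items can be sketched as follows. The difficulty values below are hypothetical, and the simple mean/mean shift shown is a simplification of the concurrent calibration a program such as Winsteps would carry out.

```python
# Mean/mean linking sketch: express Form B (higher grade) difficulties
# on the scale of Form A (lower grade) using the anchor items that the
# two forms share. All numbers here are hypothetical logit values.

def link_constant(anchor_b_form_a, anchor_b_form_b):
    """Additive shift that maps Form B difficulties onto Form A's metric."""
    mean_a = sum(anchor_b_form_a) / len(anchor_b_form_a)
    mean_b = sum(anchor_b_form_b) / len(anchor_b_form_b)
    return mean_a - mean_b

# Anchor item difficulties as estimated separately in each grade (logits).
b_grade1 = [-0.68, -0.20, 0.35, 0.80, 1.17]
b_grade2 = [-0.40, 0.10, 0.62, 1.05, 1.56]

shift = link_constant(b_grade1, b_grade2)
# Any grade-2 item difficulty can now be expressed on the grade-1 scale:
b_linked = [b + shift for b in b_grade2]
```

Because anchor items are calibrated in both groups, the difference between their mean difficulties estimates the offset between the two grade metrics.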
The purpose of this article is to describe the construction of two tests for listening comprehension assessment in the first, second, third and fourth grades of primary school: the Test of Listening Comprehension of Narrative Texts (TLC-n) and the Test of Listening Comprehension of Expository Texts (TLC-e). Their construction takes into consideration research data showing differences in listening comprehension performance according to text genre. Students’ performance is usually better with narrative texts than with expository ones (Diakidoy, Stylianou, Karefillidou, & Papageorgiou, 2004; Lehto & Anttila, 2003), which has been related to students’ familiarity with narrative texts (Graesser, Golding, & Long, 1991) and to the greater difficulty of expository texts, as they usually present new information, their structure is less predictable and they place a greater load on working memory (Graesser, Singer, & Trabasso, 1994).
Research has also shown the existence of different comprehension levels. Therefore, test items were constructed to assess four comprehension levels described in several taxonomies (Barrett, 1976; Català, Català, Molina, & Monclús, 2001; Herber, 1978). The levels are synthesised as follows: (a) literal comprehension (LC), the comprehension of information explicitly stated in the text; (b) inferential comprehension (IC), the comprehension of implicit information; (c) reorganisation (R), the ability to synthesise or combine information from the text; and (d) critical comprehension (CC), the ability to convey opinions and comprehensive judgments about the text. Text comprehension implies different levels of comprehension, depending on the demands of the task proposed to the student. Test items should, therefore, gauge those levels to ensure the representativeness of the construct. Several studies have provided theoretical and empirical evidence for listening comprehension as a unidimensional construct (Berninger & Abbott, 2010; Lee, 2008; Lehto & Anttila, 2003).
This research is organised in two studies. Study 1 aims to: (a) examine the psychometric properties of the items on the TLC-n and the TLC-e in each grade; (b) select unique and anchor items to integrate four forms of the TLC-n and four forms of the TLC-e (the selection of anchor items is an essential step to perform vertical scaling of the test forms in further studies); and (c) assess the unidimensionality, local independence and reliability of the final test forms.
The purpose of study 2 is to overcome the limitations of study 1 regarding the form of the TLC-n for fourth graders (TLC-n-4). New items were developed and tested in a new sample. The study goals are to: (a) examine the psychometric properties of the items of the TLC-n-4; (b) select unique items to integrate into the TLC-n-4; and (c) assess the unidimensionality, local independence and reliability of the TLC-n-4 final form.
STUDY 1
Method
Participants
The study of the TLC-n included 1042 students: 245 from the first grade, 256 from the second grade, 273 from the third grade and 268 from the fourth grade. The sample of each grade exhibited a similar sex distribution: 47.8% (n = 117) of the first graders, 48% (n = 123) of the second graders, 53.8% (n = 147) of the third graders and 51.1% (n = 137) of the fourth graders were female. In the study of the TLC-e, 848 students participated: 201 from the first grade, 205 from the second grade, 195 from the third grade and 247 from the fourth grade. This sample also exhibited a similar sex distribution: 44.3% (n = 89) of the first graders, 51.2% (n = 105) of the second graders, 44.6% (n = 87) of the third graders and 47.4% (n = 117) of the fourth graders were female. All the participants attended schools from the same regions of northern Portugal.
Materials
Test of Listening Comprehension of Narrative Texts – TLC-n and Test of Listening Comprehension of Expository Texts – TLC-e
These tests measure the ability of students to comprehend texts that are read aloud to them. The TLC-n includes four narrative texts: An angel in the kitchen, Holidays in the village, A cat with special powers and The pirate’s son. The texts are presented in short passages ranging from 40 to 195 words. The TLC-e includes four expository texts: Butterflies and Lizards, So sunny!, Little mice with wings and Oak-apples and Acorns. These texts are also presented in short passages, ranging from 53 to 157 words. The average number of words per sentence is 15.8 in the TLC-n texts and 13.48 in the TLC-e texts. All the texts are original and unpublished, authored by experienced Portuguese writers of literary and expository texts for children. Items are multiple choice with three options, and each evaluates one of the four comprehension levels. The TLC-n includes 97 items (LC = 30; IC = 51; CC = 8; R = 8) and the TLC-e consists of 102 items (LC = 49; IC = 42; CC = 8; R = 3). Each text comprises 16 to 35 questions. The texts and the items are presented orally to students using an audio file previously recorded by a professional.
Procedures
Legal authorisations for the administration of the tests in schools were solicited from the Portuguese Ministry of Education and the respective school boards. Informed consent was obtained from students and their parents or legal guardians. The tests were administered collectively by trained psychologists during class time. The administration was identical for all participants, i.e., the sequence of the texts in each test was the same for all students.
Data analyses
The TLC-n and the TLC-e were studied using Rasch model analyses for dichotomous data and the software Winsteps, version 3.72.0 (Linacre, 2011).
Item analysis
Item difficulty measures and fit statistics for the Rasch model were estimated. Person (θ) and item (bᵢ) parameters were computed to analyse the difference between person ability and item difficulty and to detect items that were too easy or extremely difficult. Infit and outfit mean square (MNSQ) indices were calculated for persons and for each item at each grade. These fit statistics range between 0 and +∞, with an expected value of 1, which indicates a perfect fit of the items (Bond & Fox, 2007). Values less than 1.5 are considered productive for measurement (Linacre, 2002). Point-measure correlations were computed for each item at each grade. This analysis provides information about the association between the score on one item and the overall test score. Values less than .20 represent low correlations (Linacre, 2011). The mean person ability of the participants on the TLC-n and the TLC-e at each grade was also estimated. It is expected that the highest average measure is observed in the group of persons who choose the correct option for each item (Linacre, 2011).
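To make these statistics concrete, the following minimal sketch shows how the dichotomous Rasch model probability and the infit/outfit mean squares can be computed for a single item from known parameter estimates. It is an illustration only, not the estimation procedure implemented in Winsteps.

```python
import math

def rasch_p(theta, b):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def fit_mnsq(responses, thetas, b):
    """Infit and outfit mean-square statistics for one item.

    responses: 0/1 answers of each person to the item
    thetas:    ability estimates of the same persons (logits)
    b:         the item's difficulty estimate (logits)
    """
    z2, w = [], []
    for x, theta in zip(responses, thetas):
        p = rasch_p(theta, b)
        var = p * (1 - p)              # variance of the response
        z2.append((x - p) ** 2 / var)  # squared standardised residual
        w.append(var)
    outfit = sum(z2) / len(responses)  # unweighted mean square (outlier-sensitive)
    infit = sum(z * v for z, v in zip(z2, w)) / sum(w)  # information-weighted
    return infit, outfit
```

Outfit weights all residuals equally, so it is sensitive to lucky guesses by low-ability persons; infit weights by information, so it reflects misfit near the item's difficulty. Both have an expected value of 1 when the data fit the model.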
For the selection of the anchor items, z-tests were computed. Anchor items are required to maintain a stable level of difficulty across adjacent grades (Dorans, Moses, & Eignor, 2011). Items with a z value higher than |1.96| were rejected as anchor items. The distribution of items across the difficulty range of each test and their representativeness of each test’s content, i.e., the four comprehension levels (Huynh & Meyer, 2010; Kolen & Brennan, 2010), were two additional criteria used in the selection of the anchor items. Unique items were selected based on three criteria: (a) the difficulty level, spread across the difficulty continuum; (b) the comprehension level assessed by the items; and (c) the reduction of redundant items, i.e., items with similar difficulty levels that assessed the same comprehension level.
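This invariance check can be sketched as follows, assuming a standard-error-based z statistic; the difficulty estimates and standard errors below are hypothetical and serve only to illustrate the |1.96| decision rule.

```python
import math

def invariance_z(b1, se1, b2, se2):
    """z statistic for the difference between an item's difficulty
    estimates in two adjacent grades (b in logits, se = standard errors)."""
    return (b1 - b2) / math.sqrt(se1 ** 2 + se2 ** 2)

# Hypothetical anchor candidate: difficulty 0.50 (SE 0.12) in one grade,
# 0.35 (SE 0.14) in the next. |z| <= 1.96 means the estimates are stable.
z = invariance_z(0.50, 0.12, 0.35, 0.14)
keep_as_anchor = abs(z) <= 1.96
```

Items whose difficulty estimates differ by more than chance across grades would distort the common scale, which is why they are rejected as anchors.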
Unidimensionality, local independence and reliability
The unidimensionality of the final test forms was analysed using principal component analysis of linearised Rasch residuals (PCAR). Residuals with eigenvalues higher than 2.0 on a secondary dimension provide evidence of a violation of the unidimensionality assumption. The local independence assumption holds that a response to an item depends only on person ability. This assumption was tested through correlations of the items’ linearised Rasch residuals. Correlations higher than .70 indicate a violation of the local independence assumption (Linacre, 2011). The reliability of each final test form was studied by computing Person Separation Reliability (PSR), Item Separation Reliability (ISR) and Kuder-Richardson formula 20 (KR20) coefficients. Higher values represent a lower level of measurement error (Bond & Fox, 2007).
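Of these coefficients, KR20 can be computed directly from a persons × items matrix of dichotomous responses. The sketch below uses a small hypothetical response matrix; it illustrates the formula only, not the data of the present study.

```python
def kr20(matrix):
    """Kuder-Richardson formula 20 for a persons x items 0/1 response matrix.

    KR20 = (k / (k - 1)) * (1 - sum(p_j * q_j) / var(total scores))
    where p_j is the proportion answering item j correctly and q_j = 1 - p_j.
    """
    n = len(matrix)        # number of persons
    k = len(matrix[0])     # number of items
    totals = [sum(row) for row in matrix]
    mean_t = sum(totals) / n
    var_t = sum((t - mean_t) ** 2 for t in totals) / n  # population variance
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in matrix) / n
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_t)

# Hypothetical 4 persons x 2 items response matrix.
reliability = kr20([[1, 1], [0, 0], [1, 0], [0, 1]])
```

Like other internal-consistency indices, KR20 approaches 1 when item responses covary strongly across persons and drops toward 0 when they are inconsistent.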
Results
Item analysis
The TLC-n and the TLC-e were calibrated separately for each grade. Table 1 presents item difficulty, person ability and fit statistics for the items and the persons.
In the TLC-n, 10 items in the first grade, six items in the second grade, 26 items in the third grade and 30 items in the fourth grade were too easy for the set of examinees. The TLC-e also included items that were determined to be excessively easy for the second, third and fourth graders (nine items, 19 items and 20 items, respectively).
In all grades, fit statistics for the items and infit statistics for the persons on the TLC-n did not exceed the reference value of 1.5. This value was exceeded in the TLC-n outfit statistics for the persons in all grades except the first: four second graders (1.6%), two third graders (0.7%) and 10 fourth graders (3.7%). Fit statistics for the items of the TLC-e in the first, second and third grades did not exceed the reference value of 1.5. In the fourth grade, items presented adequate infit values, but two items assumed outfit values greater than 1.5. Infit statistics of person ability on the TLC-e in the four grades were adequate. Outfit statistics that exceeded 1.5 revealed the presence of outlying students: one (0.5%) in the first grade, one (0.5%) in the second grade, one (0.5%) in the third grade and 11 (4.5%) in the fourth grade.
On the TLC-n, point-measure correlations less than .20 were found in the four grades (24 items, 12 items, 14 items and 10 items, respectively, in the first, second, third and fourth grades). Low correlations for the TLC-e were also found in the four grades (38 items, 25 items, 17 items and 15 items, respectively, in the first, second, third and fourth grades).
The estimation of the average measures of persons choosing each option on the TLC-n indicated that the highest measure was not observed in the group who chose the correct option for five items in the first grade, one item in the second grade, three items in the third grade and one item in the fourth grade. On the TLC-e, 11 items were in this condition in the first grade, eight items in the second grade, four items in the third grade and four items in the fourth grade.
Items with an inadequate level of difficulty for the grade groups, fit statistics higher than 1.5, point-measure correlations less than .20 and/or problems in option selection were excluded.
The invariance analysis of the item difficulty across grades was performed by computing z-tests for the remaining common items between adjacent grades, after the exclusion of the items with inadequate psychometric properties. In the TLC-n the reference value of |1.96| was exceeded in 18 of the 68 common items between the first and the second grade, eight of the 51 common items between the second and the third grade, and 12 of the 52 common items between the third and the fourth grade. In the TLC-e, the critical value of |1.96| was exceeded in eight of the 48 common items between the first and the second grade, five of the 58 common items between the second and the third grade and six of the 64 common items between the third and the fourth grade.
The following step was the selection of unique and anchor items to be integrated on the following test forms: TLC-n-1, TLC-n-2, TLC-n-3, TLC-n-4 and TLC-e-1, TLC-e-2, TLC-e-3, TLC-e-4 (the number identifies each grade). Anchor items were the first to be selected, based on the invariance analysis of the item difficulty across grades, the item distribution regarding the difficulty range of each form and each item’s representativeness of each comprehension level. For each form, 10 anchor items between adjacent grades were chosen. For the TLC-n forms, the difficulty of the anchor items ranged: (a) between the TLC-n-1 and the TLC-n-2, from -0.68 to 1.17 in the first grade and -0.40 to 1.56 in the second grade; (b) between the TLC-n-2 and the TLC-n-3, from -0.20 to 1.63 in the second grade and -0.28 to 1.56 in the third grade; (c) between the TLC-n-3 and the TLC-n-4, from 0.16 to 1.59 in the third grade and 0.15 to 1.52 in the fourth grade. For the TLC-e forms, the difficulty of the anchor items ranged: (a) between the TLC-e-1 and the TLC-e-2, from -0.71 to 0.62 in the first grade and -0.92 to 0.63 in the second grade; (b) between the TLC-e-2 and the TLC-e-3, from -0.57 to 1.36 in the second grade and -0.65 to 1.43 in the third grade; (c) between the TLC-e-3 and the TLC-e-4, from -0.33 to 1.61 in the third grade and -0.11 to 1.93 in the fourth grade. On each form of the TLC-n and the TLC-e, the anchor items represent the four levels of comprehension: LC, IC, R and CC. The selection of unique items to integrate into the different test forms was performed based on the comprehension level assessed, the difficulty of the items and the redundancy of item characteristics. Items selected for each test form assessed the four comprehension levels on a range of difficulty along the continuum of ability for each grade sample; when two or more items had the same level of difficulty, only one was selected to avoid redundancy of the measure.
Thirty items were selected for each form of the TLC-n and the TLC-e, including 10 anchor items, which represent 33% of the total number of items. In the final step of item selection, the items on the TLC-n-4 proved to be easy for the fourth graders. The need for items with a higher level of difficulty, capable of discriminating the performance of students with a higher level of ability, was evident. To solve this issue, a new study was conducted to construct a final form of the TLC-n for fourth graders.
Unidimensionality, local independence and reliability
Results of the PCAR indicated that the eigenvalues for the contrasts did not exceed the value of 2.0, confirming the unidimensionality of the three final TLC-n forms (TLC-n-1, TLC-n-2 and TLC-n-3) and every form of the TLC-e. The correlations of the items’ linearised Rasch residuals remained under .27 for all forms, indicating local independence of the items. The reliability results indicated very high ISR coefficients and moderate PSR and KR20 coefficients on the TLC-n and the TLC-e final forms (see Table 2).
Note: ISR – Item Separation Reliability; PSR – Person Separation Reliability; KR20 – Kuder-Richardson formula 20.
STUDY 2
Method
Participants
In this study, 260 fourth graders participated, of which 125 (48.1%) were female.
Materials, procedure and data analysis
This study aimed to overcome the limitations of study 1 with respect to the TLC-n form for fourth graders, for which 30 items with adequate psychometric characteristics had been selected for incorporation into the TLC-n-4. However, the difficulty level of the items in that form was not sufficient to discriminate students’ ability. In this second study, six new multiple choice items were generated to assess IC. For the construction of the new items, no changes were made to the texts, as changes would interfere with performance on the items pre-selected for the TLC-n-4. The new items were constructed with more complex content, as assessed by linguists and reading comprehension experts, demanding more attention and prior knowledge to select the correct option among the plausible options provided. These items, along with the 30 items previously selected, were tested and calibrated in the new sample.
The procedures and data analyses were similar to those followed in study 1.
Results
Item analysis
Items on the TLC-n-4 presented difficulty values ranging between -2.02 and 3.01, and person ability values ranged from -2.86 to 3.55. No item was too easy or too difficult for this sample, and no item presented fit statistics exceeding the reference value of 1.5. For the persons, fit statistics detected 26 students (10%) whose outfit values exceeded 1.5. Two items with point-measure correlations less than .20 were excluded. The highest measure of person ability was observed in the group who chose the correct answer for each item.
The analyses of this form indicated that two of the six new items were redundant with items selected in study 1, in terms of difficulty and assessed comprehension level; these two items were therefore deleted. The four remaining items presented high difficulty levels, varying between 1.65 and 3.01. They replaced four previously selected unique items that measured the same level of comprehension, had lower difficulty levels (between -2.02 and -0.41) and were redundant with other items on the test.
Unidimensionality, local independence and reliability
PCAR analyses provided evidence for the TLC-n-4 unidimensionality, with no eigenvalues exceeding the reference value of 2.0. Local independence of the items was confirmed by the correlations of the item linearised Rasch residuals, which were less than .23. The reliability results of this version indicated a very high ISR coefficient (.98) and moderate PSR and KR20 coefficients (.70 and .72, respectively).
General Discussion
The items on the TLC-n and the TLC-e were studied using Rasch model analyses. In the item selection, items that were too easy or too difficult for the different grade groups, whose fit statistics were higher than 1.5, that showed point-measure correlations less than .20 and/or that presented problems in option selection were excluded. The anchor items were the first to be selected. These items met the recommended criteria of invariance across adjacent grades, distribution across the difficulty range of each test, and coverage of the four levels of comprehension (Dorans et al., 2011; Huynh & Meyer, 2010; Kolen & Brennan, 2010). The choice of unique items was based on the difficulty level of each item, the comprehension level assessed and the avoidance of redundancy.
For each final test form, 30 items were selected, including 10 anchor items between adjacent forms. The final forms of the TLC-n and the TLC-e for each grade demonstrated unidimensionality, local independence and adequate reliability coefficients. Based on these results, the TLC-n and the TLC-e forms have adequate psychometric characteristics that allow for the discrimination of student ability.
These instruments have important implications for practice and for research. Listening comprehension is an essential part of Hoover and Gough’s (1990) simple view of reading, in which reading comprehension is characterised as the product of decoding and listening comprehension skills. The use of a listening comprehension test in association with tests of reading comprehension and decoding enables psychologists to compare students’ performance and contributes to clarifying the causes of comprehension difficulties.
Two reading profiles of students have been described in the literature. The first includes students who exhibit appropriate listening comprehension skills but have difficulties in comprehending what they read, probably related to problems in decoding. The second profile includes students who are unable to comprehend what they read or hear. The latter are referred to as poor comprehenders (Nation, 2005). Their poor comprehension is not a result of poor decoding, which is adequate for their age and grade, or of poor nonverbal cognitive abilities (Oakhill, 1994). Rather, poor comprehenders frequently display broad language difficulties related to vocabulary and to the processing of grammatical information in spoken language, as well as poor performance on general measures of language comprehension (Hulme & Snowling, 2011). A listening comprehension test thus serves as a diagnostic measure from which it is possible to verify whether reading comprehension deficits are due to problems in decoding abilities or to general language-processing problems (Nation, 2005).
The selection of anchor items for each form of the TLC-n and the TLC-e will enable, in future research, the calibration of the forms on the same metric according to a common-item non-equivalent groups design (de Ayala, 2009). With the vertically scaled forms of each test, it will be possible to compare results from the initial grades onward. This is an essential contribution for teachers and psychologists in the academic context and for the research field. In the academic context, these forms will allow for the follow-up of students from the first through fourth grade, enabling the analysis of intra-individual variability over time with narrative and expository material. Additionally, researchers will be able to study differential trends in the development of listening comprehension and the effects of text genre on performance.
Each form of the TLC-n and the TLC-e could be useful in the assessment of children with reading and listening comprehension problems. For Portuguese education professionals, these tests fill a gap in listening comprehension assessment since, to date, no instrument has been constructed for this purpose that can be used across all four grades of primary school. However, further studies are required to gather evidence regarding construct- and criterion-related validity (concurrent and discriminant), as well as test-retest reliability, to support the interpretation of the TLC-n and the TLC-e test forms’ scores and the relevance of their use in educational or clinical contexts (AERA, 1999).
This research was supported by Grant FCOMP-01–0124-FEDER-010733 from FCT (Fundação para a Ciência e Tecnologia) and the European Regional Development Fund (FEDER) through the European program COMPETE (Operational Program for Competitiveness Factors) under the National Strategic Reference Framework (QREN).