1. Introduction
Webconferencing has emerged as an important tool in computer-assisted language learning (CALL) programs (Hubbard, 2017) and especially in telecollaboration (O’Dowd, 2018). By webconferencing, we refer to tools that allow visual and audio communication over the internet, usually alongside other forms of communication and collaboration, such as text-based chatrooms and/or screen sharing. Common platforms include client software such as Skype, or online platforms such as Adobe Connect and, more recently, Zoom.
Webconferencing has been studied in terms of the modalities it integrates, beginning with seminal work by Develotte and colleagues on the use of images and individual internet users’ webcams (Develotte, Guichon & Vincent, 2010), and on conversational phenomena such as multimodal conversational openings, gestures and proxemics (Develotte, Kern & Lamy, 2011). The presence of multiple modalities, in webconferencing as well as in other computer-mediated communication (CMC) environments, has been linked to teachers’ and learners’ communicative capacity to draw on them, and this has been conceptualised in various ways, including (multi)literacies (Fuchs, Hauck & Müller-Hartmann, 2012) or multimodal competence (Hauck, 2010).
The present article relates recent research in this area to a new methodological framework for studying language teacher education, and specifically the development of techno-pedagogical competence in online tutoring (Guichon & Cohen, 2016). As such, it is concerned with several important but understudied aspects of CALL identified in a recent review by Gillespie (2020), such as multimedia, online learning, and content and language integrated learning. We demonstrate this methodological approach in a case study of a telecollaboration between learners and pre-service teachers of Mandarin Chinese as a foreign language, with the latter in the role of online language tutors (Cappellini & Hsu, 2020). Our aim is to support an ecological approach by shedding new light on the perception and the use of multimodality by future teachers with the help of eye-tracking data (Stickler, Smith & Shi, 2016). In section 2, we review the relevant recent literature linking this approach to webconferencing. In section 3, we provide information on the context, the participants and the methods used in our case study. In section 4, we present and discuss our results before concluding.
2. Literature review
2.1 Multimodality
Following a long tradition in CALL (see, for instance, Lamy & Hampel, 2007), we define multimodality as the simultaneous presence of multiple modes of communication. Modes are defined as semiotic resources or semiotic regimes that interlocutors can use to co-construct meaning (Bezemer & Kress, 2016). In webconferencing, such modes include, but are not limited to, written language, gestures, facial expressions, proxemics and oral language in its verbal aspects as well as prosody, pitch and delivery (Rivens Mompean & Cappellini, 2015). These modes can be present at any time during the use of webconferencing, but interlocutors do not necessarily perceive them as relevant to expressing their intentions. We will draw on the concept of affordance (Blin, 2016; Gibson, 1979), which refers to an actor’s perception, in the course of an action, of the enabling and/or constraining effects that elements of the environment have on that action (see below).
Studies on the multimodality of webconferencing interactions in CALL have taken one of two broad approaches: one analytic, the other holistic. In analytic approaches, researchers focus on one mode and study it in isolation from the others. For instance, Yamada and Akahori (2010) manipulated the presence of the video feed of one of the interlocutors to gauge its impact on the other interlocutors’ correction of grammatical accuracy while speaking. Although this approach has informed some recent studies (Kozar, 2016, for instance), research has more commonly adopted holistic approaches in which multimodality is conceived of as a whole and is studied within paradigms such as multimodal conversation analysis (CA) (Cappellini & Azaoui, 2017; Sert & Balaman, 2018), interactional sociolinguistics (Satar, 2016), social semiotics, or combinations thereof (Helm & Dooly, 2017; Satar & Wigham, 2017). These studies have enhanced our understanding of how multimodality is used as a whole during interaction, often by focusing on particular conversational dynamics such as instruction-giving sequences (Satar & Wigham, 2017, 2020), policing (Sert & Balaman, 2018), or side sequences of negative feedback (Cappellini & Azaoui, 2017). This article proposes a methodological approach that starts with multimodal CA and articulates it with an ecological approach based on the concept of affordance and integrating relevance theory. In broadening multimodal CA in this way, we aim to gain insights into the cognitive dimensions of interaction. We take techno-pedagogic competence as a test case for this methodological approach.
2.2 Techno-pedagogic competence
Previous studies on multimodality have often incorporated reflections on the competencies learners and teachers must develop if they are to take part in online interactions effectively. Several models of pedagogical competence to teach languages with ICTs have been proposed (Dooly, 2010; Kessler, 2016, among others). One of the most influential frameworks is Hampel and Stickler’s pyramid (2005, 2015). Although its originators studied webconferencing (Hampel & Stickler, 2012), the pyramid framework was not specifically conceived for this type of CMC. On the other hand, Guichon’s (2012) framework of techno-semio-pedagogical competence or techno-pedagogic competence (Guichon & Cohen, 2016) was designed from the outset to describe the integration of ICTs into language teaching, and has subsequently been extensively adapted to teaching through webconferencing (Guichon & Tellier, 2017). Guichon defines techno-semio-pedagogical competence as “knowledge and skills concerning:
- the communication tools available (forums, wikis, videoconferences, etc.) that are best suited to the objectives of a given teaching sequence;
- the appropriate choice of modes (written, oral, video, or a combination) for a given activity and for the development of linguistic competences;
- the pedagogical management of learning activities where CMC tools are central or incidental (planning, regulating task implementation, evaluating learning)” (Guichon, 2012: 187; our translation).
Importantly, however, there is a gap between the definitions of techno-pedagogical competence and the methodological tools to study it. In fact, most authors agree that such competence includes not only the ability to effectively use relevant modes and strategies for communication (and perhaps teaching; Dooly, 2016) but also knowledge and awareness of semiotic modes (Guichon, 2012; Hauck, 2010). Such knowledge/awareness has often been studied through retrospective introspection, especially in the form of learning logs (e.g. Fuchs et al., 2012), and less often through stimulated recall (Cohen, 2017), but never within (inter)action itself. Recent advances in eye-tracking data collection and related methodologies have made it possible to fill this gap.
2.3 Ecological approaches and affordances
Following Cappellini (2021), we developed an ecological approach based on the work of Bronfenbrenner (1979) and van Lier (2004). Bronfenbrenner defines ecological approaches in opposition to experimental methodologies. He requires that participants be able to manipulate the environment (ecological validity) and that their perspectives be included in research (phenomenological validity). Following van Lier, we conceive of the environment as a multimodal reservoir of semiotic resources that are drawn upon to make meaning (Bezemer & Kress, 2016). In this study, we focus on the relationship between the participants and their environment on the screen in terms of the perception of the elements of the environment and their use in interaction. Participants’ perception of the elements in the environment depends on the actions they carry out, as well as on their interpretations. We conceive of these as affordances in the sense of Blin’s (2016) post-cognitive approach. In other words, affordances are not given before the action or interaction but emerge during the interaction with/through an environment, including digital environments. Thus we draw on ecological approaches to focus on the dialogic relationship between agents and their environment based on the concept of affordance.
2.4 Relevance theory
Relevance theory was first proposed by Sperber and Wilson (1986) as a framework for studying the cognitive dimension of human interaction. The framework allows the investigation of the interplay between communicative behaviour and contextual factors to explain how interlocutors’ attention is managed. In this framework, context can be roughly defined as what is mutually manifest to interlocutors. Elements of the environment can be rendered mutually manifest through ostensions – that is, communicative behaviour. Drawing on the interlocutor’s ostensions and on representations of their intentions, one can process information in order to formulate an interpretation of what the interlocutor is doing and/or is trying to communicate. In this study, we conceive of ostensions as drawing on communicative affordances that emerge during interaction and allow interlocutors to construct interpretations about each other’s actions and intentions. These ostensions therefore draw on the multimodality of the webconferencing environment. The perception of such an environment is at the core of this study.
2.5 An ecological approach to multimodality integrating eye tracking
In a previous study (Cappellini & Hsu, 2018), we argued that the emergence of affordances could be studied in part using eye-tracking data. Eye-tracking techniques have been used to collect data in CALL and telecollaboration for about a decade (O’Rourke, 2012), but only recently have technological advances allowed researchers to deploy them to study webconferencing environments (Stickler et al., 2016), answering calls from several researchers to do so (Guichon & Wigham, 2016; Sert & Balaman, 2018). Specifically, eye-tracking technology can collect data about a subject’s gaze fixations (in our case, on a screen), allowing us to investigate where the subject is looking at each moment of an interaction. The eye–mind hypothesis (Conklin, Pellicer-Sánchez & Carrol, 2018) holds that gaze fixations can provide insights into a subject’s cognitive processes. Accordingly, we hypothesise that if tutors establish a relationship with the environment on the screen in cycles of action-perception-interpretation (van Lier, 2004), then studying how they scrutinise the multidimensional semiotic space of the screen can provide information about their knowledge and awareness of the elements of the digital interfaces they use, which is a key component of their techno-pedagogic competence. Within this theoretical framework, we aim to answer the following main research question: What are the contributions and limits of eye tracking for studying the perception and use of multimodality during webconferencing? A secondary research question is: What multimodal conversational dynamics are apparent in conversational side sequences for scaffolding (Cappellini, 2016)?
3. Context and methods
3.1 Context and participants
Our data were collected during a telecollaboration between French-speaking learners of Mandarin Chinese and pre-service teachers of Mandarin Chinese as a foreign language. The teachers were first-year MA students at the Hong Kong Polytechnic University. They came from Mainland China and also spoke English (possibly among other languages). The learners were second-year bachelor’s students in Chinese Language, Literature and Civilisation at Aix-Marseille University, with proficiency ranging between A2 and B1 (Council of Europe, 2001). The telecollaboration took place in three iterations, during the spring semesters of 2017, 2018 and 2019, and involved 18 teacher–learner groups. Each year, the future teachers received instructions about the French students’ Chinese language curriculum and developed conversational activities to be carried out during their webconferencing sessions. The interested reader will find more information about the pedagogical set-up in Cappellini and Hsu (2020).
3.2 Data collection and the corpus of the study
The current study used two randomly chosen webconferencing sessions as a test case for our methodological approach. The first session, drawn from the first iteration of the telecollaboration, involved one pre-service teacher and two learners of Chinese; data were collected from the teacher’s side of the exchange only. The second session, part of the second iteration, involved one pre-service teacher and a single learner; in this case, data were collected from both sides. All the learners in the two sessions were at A2 level in Mandarin Chinese. Written consent for the study was obtained before each participant joined the recorded webconferencing sessions.
We used a Tobii X300 eye tracker with Tobii Pro Studio software to collect the data. To ensure the webconferencing setting in our study was similar to typical online language-tutoring sessions, we asked the participants to sit comfortably at a distance of about 65 cm from the recording screen and face it directly. We then conducted a nine-point calibration of the eye tracker for gaze direction. Unlike in laboratory-based experimental eye-tracking studies, participants were not restricted in terms of head movement. There were three parts to each dataset:
1. the single-channel audio stream comprising audio from all the participants;
2. a dynamic screen capture of the participant’s eye movements during the session, for the tutor only in the first case and for both parties in the second case; and
3. an eye-tracking recording set at 120 Hz, thus providing 120 captures of each participant’s gaze position every second.
The first session recording lasted 54 minutes (1,028 turns, 4,241 words) and the second, 34 minutes (567 turns, 3,868 words).
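To give a concrete sense of what such a recording contains, the following is a minimal, purely illustrative Python sketch of dispersion-based (I-DT) fixation detection on 120 Hz gaze samples. Tobii Pro Studio applies its own fixation filter, so the input format and the thresholds used here are assumptions for illustration, not a description of our actual processing pipeline.

```python
# Illustrative I-DT fixation detection on 120 Hz gaze samples.
# Hypothetical input: a list of (timestamp_ms, x_px, y_px) tuples.
SAMPLE_RATE = 120          # Hz, as in our recordings
MIN_DURATION_MS = 100      # shortest event counted as a fixation (assumed)
MAX_DISPERSION_PX = 50     # spatial spread tolerated within one fixation (assumed)

def dispersion(window):
    """Sum of horizontal and vertical spread of a sample window."""
    xs = [s[1] for s in window]
    ys = [s[2] for s in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def detect_fixations(samples):
    """Return fixations as (start_ms, end_ms, mean_x, mean_y) tuples."""
    fixations = []
    win = int(MIN_DURATION_MS / 1000 * SAMPLE_RATE)  # 12 samples at 120 Hz
    i = 0
    while i + win <= len(samples):
        if dispersion(samples[i:i + win]) <= MAX_DISPERSION_PX:
            # Extend the window while the samples stay spatially compact.
            j = i + win
            while j < len(samples) and dispersion(samples[i:j + 1]) <= MAX_DISPERSION_PX:
                j += 1
            xs = [s[1] for s in samples[i:j]]
            ys = [s[2] for s in samples[i:j]]
            fixations.append((samples[i][0], samples[j - 1][0],
                              sum(xs) / len(xs), sum(ys) / len(ys)))
            i = j  # skip past the detected fixation
        else:
            i += 1
    return fixations
```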
Data were then exported in formats compatible with the EUDICO Linguistic Annotator (ELAN; Sloetjes & Wittenburg, 2008). Eye-tracking data were exported in two ways. First, gaze fixation and gaze path were presented as dots and lines on the dynamic screen capture. Figure 1 shows an example screen capture from the first session. Second, we defined three areas of interest: the written chat, the participants’ own camera feed, and the face(s) in the camera feed of their interlocutors. Fixations on these areas of interest were then exported into ELAN for annotation, with a separate annotation tier assigned to each area of interest.
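The logic of this second export can be sketched as follows, again as a hedged illustration: the area-of-interest (AOI) coordinates below are invented for a generic 1920 × 1080 layout, and in practice the AOIs were defined in the eye-tracking software rather than in code. ELAN can then import the resulting tab-delimited file as annotation tiers.

```python
import csv

# Hypothetical AOI bounding boxes in pixels: (x_min, y_min, x_max, y_max).
AOIS = {
    "chat": (0, 700, 600, 1080),
    "own_camera": (1500, 800, 1920, 1080),
    "interlocutor_face": (600, 100, 1400, 700),
}

def aoi_of(x, y):
    """Return the name of the AOI containing point (x, y), or None."""
    for name, (x0, y0, x1, y1) in AOIS.items():
        if x0 <= x <= x1 and y0 <= y <= y1:
            return name
    return None

def export_aoi_annotations(fixations, path):
    """Write one row per fixation inside an AOI: tier, start_ms, end_ms, label.

    Takes (start_ms, end_ms, mean_x, mean_y) tuples, e.g. from detect_fixations().
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        for start, end, x, y in fixations:
            tier = aoi_of(x, y)
            if tier is not None:
                writer.writerow([tier, int(start), int(end), "fixation"])
```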
The audio recordings were transcribed by the authors collaboratively in ELAN, with a tier for each participant. For the verbal material, we used the transcription convention adopted in Cappellini (2021), an adaptation of the ICOR convention, a standard for interaction research in France.¹
3.3 Data analysis
The audio stream and the video recording(s) with gaze dots were integrated into the ELAN interface. In addition to transcribing the verbal data, we used an adaptation of Wigham’s (2017) scheme, already presented in Cappellini and Hsu (2018), to manually annotate multimodal elements, including gestures, proxemics, head movements, participation in the chat window, and actions.
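For readers setting up a similar workflow, the tier structure can be pre-created programmatically. The sketch below assumes the third-party pympi-ling library for writing ELAN’s .eaf files; the tier names and the example label are illustrative choices, not our exact annotation scheme.

```python
# Sketch: pre-creating ELAN tiers for a multimodal annotation scheme,
# assuming the pympi-ling library (pip install pympi-ling).
from pympi.Elan import Eaf

# Illustrative tier names loosely modelled on our annotation categories.
TIERS = ["gesture", "proxemics", "head_movement", "chat", "action"]

eaf = Eaf()  # a new, empty ELAN document
for tier in TIERS:
    eaf.add_tier(tier)

# Example annotation: a (hypothetical) deictic gesture from 12.3 s to 14.1 s.
eaf.add_annotation("gesture", 12300, 14100, "deictic")
eaf.to_file("session_annotations.eaf")
```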
Our main focus was on two types of conversational side sequences that have previously been found relevant in webconferencing settings for language learning (Cappellini, 2016). The first are sequences of potential acquisition (hereafter SPA; de Pietro, Matthey & Py, 1989), which correspond to conversational side sequences (Jefferson, 1972) in which the learner faces a gap in their competence, usually a missing lexical item, and solicits help from their interlocutor. The second are sequences of normative evaluation (hereafter SEN; Py, 2000): another type of side sequence in which the language expert considers that there has been an error in the interlocutor’s expression and signals this.²
Analysis was conducted by replicating the procedure used in previous research (Cappellini & Azaoui, 2017). First, the two authors independently identified the conversational phenomena of interest, then discussed any discrepancies until agreement was reached. Next, we analysed each instance from a multimodal interaction perspective, informed by “embodied” CA (Goodwin, 2000; Mondada, 2016), including a focus on gaze trajectories to understand the management of interaction. Finally, we compared our analyses of the different instances to highlight common patterns. Figure 2 shows an example screenshot of annotations in ELAN taken from our second example.
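We resolved identification discrepancies through discussion rather than reporting an agreement coefficient, which is reasonable for a corpus of this size. For the larger datasets we envisage below, initial inter-coder agreement could instead be quantified, for instance with Cohen’s kappa. The sketch below assumes, purely for illustration, that each candidate sequence has been labelled “SPA”, “SEN”, or “none” by both coders.

```python
from collections import Counter

def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two coders labelling the same candidate sequences."""
    assert len(coder_a) == len(coder_b) and coder_a
    n = len(coder_a)
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    labels = set(coder_a) | set(coder_b)
    expected = sum(freq_a[l] * freq_b[l] for l in labels) / n ** 2
    return (observed - expected) / (1 - expected)  # undefined if expected == 1

# Toy usage: two coders, three candidate sequences -> kappa = 0.5.
print(cohens_kappa(["SPA", "SEN", "none"], ["SPA", "none", "none"]))
```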
In the analysis, we investigated how the interlocutors directed their focal attention to the elements of the screen during the unfolding of conversations that manifested either type of side sequence, while using sets of affordances to co-construct meaning in terms of ostensions and inferences. Our aim was primarily methodological, in that we wanted to assess the contributions and the limits of the approach we propose. Inevitably, however, we gained some insights into the cognitive processes at work while interlocutors, particularly the pre-service teachers, deployed their techno-pedagogic competence, and these should be explored using larger datasets in the future.
4. Results and discussion
In this section, we first provide an overview of all side sequences of interest in the two webconferencing sessions, and then we focus on one example for each type to offer a more comprehensive analysis.
4.1 Overview
As Table 1 shows, we identified 37 side sequences in the corpus: 20 SPAs and 17 SENs.
The three interlocutors in the first session generated more side sequences than the pair in the second session, with SPAs especially frequent, accounting for 62% of all Session 1 side sequences and 80% of SPAs overall. This cannot be explained entirely by the fact that this session lasted longer and produced more verbal content than the other; rather, it was at least partly related to other differences. The tutor in Session 1 had a more conversational approach, asking the learners questions that introduced crossed expertise, where the learners were topic experts and the tutor the language expert, an approach previously found to generate more SPAs (Cappellini, 2016). The tutor in Session 2, on the other hand, adopted a more teacher-like posture, producing mainly the initiation-response-feedback (IRF) exchanges that are typical of classroom interaction (Sinclair & Coulthard, 1975), which are likely to lead to SENs in the case of form or content problems.
4.2 Sequences of potential acquisition
In this section, we present our analysis of a representative excerpt of a Session 1 SPA, focusing on multimodal conversation strategies used both by the tutor, in terms of techno-pedagogic competence, and by the learners, in terms of multimodal competence. Table 2 gives a multimodal transcription with our description.³ The excerpt includes both Mandarin Chinese and French, which we translate in parentheses. This side sequence took place after the tutor asked the learners about their location. Learner 1 is on the left.
Although longer than other SPAs in the dataset, this example is representative of the multimodal strategies at work in several respects. First, the video feeds emerge as an affordance used by the tutor to understand the learners’ orientation and engagement in the interaction. When an interlocutor takes the floor, the tutor looks at the speaker, including when she herself is speaking (though, as we shall see, this dynamic was not observed in Session 2). During transition-relevance places (TRPs), the tutor looks at the learners to see if they intend to take the floor. For instance, before turn 10, the learners’ ostension of looking downwards leads the tutor to the inference that they are not willing to take the floor at that point. Moreover, in case of overlap, the tutor systematically leaves the floor to the learners. The video’s status as an affordance whereby the tutor can scrutinise learners’ orientation and gaze is especially evident in turns 1–3. In other words, gaze and the interlocutor’s gaze perception function as interactional gestures – that is, gestures to manage interaction (Bavelas, Chovil, Coates & Roe, 1995).
Second, it is worth noting that the gestures produced in both sessions are mainly deictics and emblems. The latter are defined as culturally specific gestures whose meaning can be understood without any additional element such as speech (Kendon, 1982). The tutor does not direct the focus of her attention toward such gestures, which remain in the periphery of focal attention. The only exception occurred in Session 1, when both learners patted Learner 1’s shoulder while soliciting the word “back” in a side sequence about backache. This absence of focal attention on gestures may be linked to the affordances of the webconferencing environment and to more general patterns of focal attention on gestures. Gullberg and Holmqvist (2006) showed that overt focal attention to gestures is present when the gesture is in the extreme peripheral area, very far from the speaker’s face. In webconferencing settings, such gestures would be outside the frame of the camera, either invisible, as here, or reduced to more central areas (Holt & Tellier, 2017). In either case, the need for or possibility of focal attention is excluded.
A third characteristic of this type of sequence, evident in both sessions, is the fact that affordances for successful communication are not restricted to software features, but include other technical artefacts in the interlocutors’ physical environments, particularly those of the learners. The main such affordance is a cell phone, which is used to access the internet for more information (e.g. a translation, as in the example above). Fourth, the aural mode conveyed by the audio modality emerges as an affordance to highlight parts of the utterance. As seen in turn 10 in Table 2, this occurred through the use of intra-turn pauses that detach a lexical item from the whole utterance. This indicates a multimodal strategy at work, especially in Session 2, where it is also combined with the use of the chat window. Lastly, there is a difference between Session 1, where the learners usually do not repeat the lexical item, and Session 2, where the learner systematically repeats the lexical item before integrating it into his utterance.
4.3 Sequences of normative evaluation
Our second example comes from Session 2 (Table 3), during which we recorded gaze data for both tutor and learner. The side sequence in question emerges in the middle of a larger sequence about Chinese food and the presence of Chinese restaurants in France. It is an instance of an IRF exchange, in which the question asked by the tutor in the initiation phase is aimed less at obtaining new factual information than at gaining an understanding of whether the learner is capable of producing the expected answer. We do not reproduce the first part of the exchange, in which the tutor asked if there are lots of Chinese restaurants in France and the learner answered yes. Rather, the excerpt starts after the tutor asked the learner to repeat himself.⁶
This example of SEN is highly representative of the instances we found in our corpus; everything we note occurred in several other sequences, apart from the episode of joint attention. Moreover, some of the multimodal strategies at play in it are the same as those used in SPAs. The most obvious of these common patterns is the use of video as an affordance for the tutor to understand what a learner is doing, and possibly to leave the floor to the interlocutor(s) for maximum autonomy in expression. This is particularly visible at point 5 of turn 1 (Table 3), where the learner’s ostension of leaning sideways leads to the tutor’s inference that he is looking into the chat window for the word he needs for his utterance, which results in the episode of joint attention. In other words, her awareness of the learner’s screen and her perception of the learner’s ostensions lead the tutor to an inference concerning his experience, which we can interpret from the recording of the tutor’s eye movements. Moreover, since we have the learner’s eye-tracking data as well, we can confirm the tutor’s inference.
On the learner’s side, video is also used to await feedback after turn completion. As shown in the second example, feedback can be positive or negative – in the latter case, usually without overt verbal indication. Indeed, in our data, verbal feedback is mostly positive, possibly followed by repair, as in the second example. The other common pattern is the use of the audio channel as an affordance to deploy the multimodality of speech, for instance, with the use of short intra-turn pauses by the tutor to draw learners’ attention to specific lexical items.
In our data, SENs are usually shorter than SPAs. The tutors do not interrupt learners’ turns and leave them the floor, even when there are long intra-turn pauses. The tutors signal their understanding through ostensions combining behaviours in the audio and video channels, such as paraverbal sounds and head movements, or gestures and facial expressions, as above.
5. Discussion and conclusion
Our first and main research question concerned the potentialities and limits of eye-tracking technology for the study of interlocutors’ perception and use of multimodality in webconferencing. On the whole, our analysis confirms the utility of adopting a holistic approach to the study of multimodality during webconferencing-based language learning – more particularly, an ecological approach informed by multimodal CA for micro-analysis of communicative behaviours on the one hand, and by relevance theory to interpret the cognitive dimension on the other. Indeed, the methodological combination of CA tools with eye-tracking data enabled us to gain insight into not only the interlocutors’ orientations (Mondada, 2016) but also the cognitive dimensions of intentions and inferences rendered through sequential multimodal ostensions. In other words, eye-tracking data enriched our ecological analytical approach focusing on the co-construction of meaning and social actions in webconferencing through cycles of action-perception-interpretation (van Lier, 2004). More precisely, the eye-tracking technique enriched such observations by providing a window on the cognitive management of graphic and visual affordances during interaction. Two different dynamics emerging from our analysis show this contribution. The first is the link between interlocutors’ orientations and the results of gaze analysis. Eye tracking provides evidence that posture and proxemics in relation to the screen are perceived and used as interactional gestures, and allow interlocutors to appreciate one another’s moment-to-moment engagement in interaction. This finding calls into question pedagogical recommendations that tutors look directly into the webcam to give learners the impression of being looked at in the eye (Develotte et al., 2010). Instead, we show that when interlocutors detach their gaze from the screen, they cease to be able to efficiently manage the interaction based on the ostensions available through the video affordance. The second dynamic illustrative of the contribution of eye-tracking data is that such data can be instrumental in identifying and analysing instances of joint attention (when data are obtained from both sides of an interaction). Our second example indicates that the tutor was able to imagine what was on the learner’s screen and adopt the learner’s perspective. We suggest that this will be especially useful in phases of a tutoring session where tutor mediation is key to directing learners’ attention to particular parts of the screen, such as giving instructions, accompanying reading comprehension, analysing visual elements, and resolving technical issues.
As for its limits, eye tracking provides only partial information about interactions in webconferencing, which is not fully interpretable without other sources of data, especially audio and video recordings and dynamic screen captures. It is therefore unlikely to be useful as a stand-alone analytical tool. A further limitation was imposed by our specific choice of eye-tracking tool, which sometimes produced fixation estimates accurate only to within 0.5 cm. More precise data could be gathered, but only using tools that are much more expensive and/or that restrict the interlocutors’ head movements, thus reducing ecological validity. The third and final limitation concerns the difficulty of interpreting the meanings of gaze behaviours. Even if eye-tracking data can provide fully accurate information about fixations on elements of the screen, this does not constitute a direct window on interlocutors’ intentions. This limit may be partly overcome via stimulated recall, possibly using screen recordings with eye-tracking data superimposed on them. This method, like any other kind of retrospective explanation, is of course subject to cognitive bias (Mercier & Sperber, 2017).
Our study also yielded an answer to our secondary research question: some multimodal conversational patterns were shared between our two groups, while others were specific to one or the other. In particular, our analysis shows that video became an affordance for the tutor, on the one hand, to interpret learners’ orientations and gazes as signs of engagement and, on the other, for the learner(s) to solicit and interpret the tutor’s (conversational) feedback. However, given that this was a case study based on a very limited dataset, our findings regarding this question are at best preliminary. Future work on much larger datasets with similar types of participants will help us understand both the possible variations and the common features of multimodal interaction. Ideally, any such extension should also include longitudinal data, which would shed new light on how communicative behaviour in general, and gaze in particular, evolves over time.
Supplementary material
To view supplementary material referred to in this article, please visit https://doi.org/10.1017/S0958344022000076
Acknowledgments
This study was possible thanks to the support of several research engineers and technicians at different stages of data collection and processing. We would like to express our gratitude to Fabrice Cauchard of the H2C2 platform, Alain Ghio and Antonio Serrato from the Centre d’Etudes sur la Parole platform at the Laboratoire Parole & Langage, and Christelle Zielinski from the Centre de Ressources Expérimentation of the Institute of Language, Communication and the Brain. We would also like to thank Xia Wang, Li Tang, Anqi Xu, Eugene YC Wong and KT Tong at the CBS Speech and Language Sciences (SLS) Lab at the Hong Kong Polytechnic University for their technical support. Our thanks also go to the three anonymous reviewers of earlier versions of this paper for their insightful comments, which helped us improve this article. This paper is part of a research project funded by the French ANR (https://anr.fr/Projet-ANR-18-CE28-0011).
Ethical statement and competing interests
This study was approved by the research review board of the Hong Kong Polytechnic University prior to the beginning of data collection (HSEARS20170926001) and conducted in accordance with the board’s ethical guidelines. The guidelines of the Laboratoire Parole & Langage were followed in the collection of data in France, and the French participants signed written consent forms to be recorded. All participants received an explanation of the study and its procedures and gave their informed consent prior to its commencement. The authors declare no competing interests.
Author ORCIDs
Marco Cappellini, https://orcid.org/0000-0002-2086-061X
Yu-Yin Hsu, https://orcid.org/0000-0003-4087-4995
About the authors
Marco Cappellini is an associate professor in didactics of foreign languages at Aix-Marseille University. His research interests include tandem learning, interactionist approaches to foreign language (FL) learning, FL teaching and learning through CMC, teacher education and the integration of ICT in FL education, and learner autonomy.
Yu-Yin Hsu is an assistant professor in the Department of Chinese and Bilingual Studies at the Hong Kong Polytechnic University. Her research interests include technology application in FL teaching and learning, FL teacher training, psycholinguistic language processing, and theoretical linguistics.