Exploring the application of ChatGPT in ESL/EFL education and related research issues: a systematic review of empirical studies

Abstract

ChatGPT, a sophisticated artificial intelligence (AI) chatbot capable of providing personalised responses to users’ inquiries, recently has had a substantial impact on education. Many studies have explored the use of ChatGPT in English as a second language (ESL) and English as a foreign language (EFL) education since its release on 30 November 2022. However, there has been a lack of systematic reviews summarising both the current knowledge and the gaps in this research area. This systematic review analyses 70 empirical studies related to the use of ChatGPT in ESL/EFL education within a 1.5-year period following its release. Using the Technology-based Learning Model, we provide a comprehensive overview of the domains in which ChatGPT has been applied, the methodological approaches, and associated research issues. The included studies collectively provide solid evidence regarding the affordances (e.g., increased learning opportunities, personalised learning, and teacher support) and potential drawbacks (e.g., incorrect information, privacy leakage, and academic dishonesty) of ChatGPT use in ESL/EFL education. However, our findings indicate that the majority of studies have focused on students’ use of this AI tool in writing, while few studies have quantitatively examined its effects on students’ performance and motivation. In addition, the impact of ChatGPT on other language skills, such as reading, speaking, and listening, remains under-researched. Therefore, we recommend that longer-term studies with rigorous research designs (e.g., quasi-experimental designs) and objective data sources (e.g., standardised tests) be conducted to provide more robust evidence regarding the influence of ChatGPT on students’ English language acquisition.

Introduction

On 30 November 2022, OpenAI launched ChatGPT, an artificial intelligence (AI) chatbot. Using natural language processing technologies, ChatGPT interacts with users in real time and provides personalised responses to their queries (OpenAI, 2022). The initial release of ChatGPT was based on a model in the GPT-3.5 series, a refinement of the third iteration of the Generative Pre-trained Transformer (GPT-3) developed by OpenAI. GPT-3 is a significant improvement over its predecessor (GPT-2), featuring an expanded training dataset, enhanced fine-tuning and other capabilities, and the ability to generate even more human-like text (Brown et al., 2020). However, the limitations of the initial release include an inability to process images and the potential to yield inaccurate or false information (Bozkurt et al., 2023; Lo et al., 2024; Tlili et al., 2023). To address these shortcomings, OpenAI introduced an updated version of ChatGPT on 14 March 2023. This release was based on GPT-4, which can process both text and images. OpenAI asserts that GPT-4 has improved the accuracy and overall performance of the tool compared with the initial version (OpenAI, 2023). As of 13 May 2024, the latest version, GPT-4o, includes expanded capabilities for processing text, vision, and even voice conversations (OpenAI, 2024).

Increasingly, ChatGPT has attracted attention in the field of English language education, specifically in the areas of English as a second language (ESL) and English as a foreign language (EFL). The growing popularity of this research topic is reflected in the volume of research published. Within just nine months after its launch, Meniado (2023) found 15 articles related to the use of ChatGPT in ESL/EFL education. His analysis focused on the impact of this tool on English language teaching and learning. In terms of teaching, ChatGPT can support teachers in various aspects, such as lesson planning (Mohamed, 2024), preparation of teaching materials (Jeon et al., 2023), and grading of students’ writing (Mizumoto & Eguchi, 2023). In terms of learning, Meniado (2023) found that ChatGPT facilitated students’ engagement in meaning-focused input, meaning-focused output, language-focused learning, and fluency development—the four crucial components of meaningful and productive English language acquisition (Nation, 2007). Taking fluency development as an example, ChatGPT generated dialogues that helped students to practise spoken English and enhance their language proficiency (Young & Shishido, 2023). However, Meniado (2023) also identified concerns regarding the use of ChatGPT in ESL/EFL education, including occasional inaccurate responses and risks to academic integrity. These concerns echo findings from other systematic reviews of ChatGPT research in the education sector (e.g., Imran & Almusharraf, 2023; Lo et al., 2024; Vargas-Murillo et al., 2023).

Although several systematic reviews have explored the application of ChatGPT, remaining knowledge gaps warrant further investigation, particularly in the context of ESL/EFL education (Meniado, 2023). First, these reviews have focused predominantly on health professions (e.g., Garg et al., 2023; Gödde et al., 2023; Sallam, 2023). At the time of writing, only one systematic review written by Meniado (2023) focuses on ESL/EFL education. However, very few relevant articles (n = 15) had been published and could be included in his research synthesis. Meniado (2023) thus acknowledged that the evidence base of his review might not have been robust enough to establish a thorough overview of the application of ChatGPT and its impact in the area of English language teaching and learning. Second, previous reviews generally have focused on analysing the strengths, weaknesses, opportunities, and threats associated with ChatGPT (e.g., Gödde et al., 2023; Lo et al., 2024; Zhang & Tur, 2023). While such analyses are beneficial, a comprehensive review of the key research issues related to the application of this AI tool in ESL/EFL education is lacking.

To inform future studies, it is important to produce a review that includes more studies and thus provides researchers a global perspective on ChatGPT research in ESL/EFL education (Meniado, 2023). The overarching objective of the present systematic review is to summarise both the current knowledge and the gaps in this research area from multiple angles (Hwang & Chang, 2023; Liu & Hwang, 2023). With this objective, we focus on empirical studies related to the application of ChatGPT in ESL/EFL education within a 1.5-year period after its initial release. To enable a multi-dimensional analysis of these studies, the Technology-based Learning Model (Hsu et al., 2012; Hwang & Chang, 2023; Liu & Hwang, 2023) was used as the theoretical framework for research synthesis. Accordingly, the following research questions (RQ1 to RQ3) were posed to guide our review.

  • RQ1: Within a 1.5-year period after its initial release, in which domains of ESL/EFL education was ChatGPT applied?

  • RQ2: Within a 1.5-year period after its initial release, which methodological approaches were employed in the studies of ChatGPT in ESL/EFL education?

  • RQ3: Within a 1.5-year period after its initial release, what research issues related to ChatGPT in ESL/EFL education were identified?

Theoretical framework

In this review, we used the Technology-based Learning Model proposed by Hsu et al. (2012) for our research synthesis. The researchers emphasised that when exploring future development trends in technology-enhanced learning, it is important to review the literature in the categories of (1) application domains, (2) research methods, and (3) research issues. This model has been adopted in various reviews across research areas (e.g., Hwang & Chang, 2023; Liu & Hwang, 2023). In particular, Hwang and Chang (2023) applied this model to explore trends in research on chatbots in education. From the studies included in their review, they found that languages were the learning domains in which chatbots were most frequently applied, followed by engineering and computers. Regarding research methods, the majority of studies included in their review employed a quantitative approach, followed by those using mixed-methods and qualitative approaches. Most importantly, Hwang and Chang (2023) identified research issues worthy of further investigation, such as exploring the use of effective learning designs or strategies with chatbots.

As shown in Fig. 1, the three constructs of the Technology-based Learning Model were adopted to review and analyse the literature on ChatGPT-supported ESL/EFL education. Using the empirical studies (e.g., Bin-Hady et al., 2023; Mizumoto & Eguchi, 2023; Mohamed, 2024; Yan, 2024; Young & Shishido, 2023) included in Meniado’s (2023) review, a preliminary analysis was conducted to establish a foundation for applying the Technology-based Learning Model in our research synthesis. This groundwork enabled our further efforts to retrieve a more comprehensive set of instances regarding ChatGPT application domains, research methods, and research issues across studies.

Fig. 1 Modified Technology-based Learning Model used to review ChatGPT-supported ESL/EFL education

Application domains

The review of application domains involved an analysis of the study locations, educational contexts (e.g., primary, secondary, and higher education), and learning domains. Taking learning domains as an example, although several studies (e.g., Bin-Hady et al., 2023; Mohamed, 2024) did not focus on specific learning domains, others predominantly addressed the four core English language skills, namely reading, writing, speaking, and listening. For example, Mizumoto and Eguchi (2023) explored the potential of using ChatGPT to evaluate writing, while Yan (2024) investigated students’ feedback-seeking abilities in writing classes. Therefore, both studies fell within the writing domain. Such analysis enhances our understanding of which domains are under-researched, thereby informing the directions of future studies.

Research methods

The review of research methods involved three levels of analysis encompassing study types, research approaches, and data sources. In our preliminary analysis, we identified four study types, namely ChatGPT evaluation, AI detection, human observation, and human intervention (Table 1). In ChatGPT evaluation studies, researchers interacted with ChatGPT and evaluated its performance (e.g., Mizumoto & Eguchi, 2023). In the AI detection studies, researchers tested the use of AI detectors to detect ChatGPT-generated text (e.g., Ibrahim, 2023). Studies involving human participants were classified as either human observation or human intervention studies (Thiese, 2014). A study was deemed observational (e.g., Mohamed, 2024) if data were collected solely to explore the participants’ perspectives, without any attempt to interfere with or alter the measured attributes (i.e., an intervention). Conversely, a study was classified as interventional (e.g., Yan, 2024) if some forms of intervention were conducted under an experimental condition. Human intervention studies were further classified as having a pre-experimental, quasi-experimental, or true experimental design, as defined by Creswell (2012). Second, the research approaches were broadly categorised as qualitative, quantitative, or mixed methods (Creswell, 2009). Third, we summarised the data sources (e.g., surveys and interviews) employed in the empirical studies.

Table 1 Study types of ChatGPT research

Research issues

To identify areas lacking in research, the research issues pertaining to the included studies must be understood. In addition to identifying research gaps, our review of research issues enabled us to group similar studies and then compare and contrast their research findings. Thus, we could consolidate existing knowledge regarding the impact of ChatGPT on ESL/EFL education. For example, Liu and Hwang (2023) identified several major research issues in their review of research on touchscreen mobile devices. These research issues included the impacts of these devices on children’s development, as well as teachers’ and parents’ perceptions of their use. Consequently, their research synthesis allowed the researchers to summarise the key findings of the literature according to different research issues and propose further research topics that warranted follow-up investigation.

Methods

This section first outlines our search strategies, followed by the inclusion and exclusion criteria and the study selection process. We then explain the process of data extraction and analysis.

Search strategies

We selected relevant articles according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement (Moher et al., 2009). The final search was conducted on 1 June 2024. Seven electronic databases were searched: (1) Academic Search Ultimate; (2) ACM Digital Library; (3) Education Source Ultimate; (4) ERIC; (5) IEEE Xplore; (6) Scopus; and (7) Web of Science. The search string with Boolean operators was as follows: (ChatGPT OR GPT-4 OR GPT-4o) AND (ESL OR EFL OR (English AND (L2 OR “second language” OR “foreign language”))). This string was applied to each database to search for relevant articles containing any of the search terms in the title, abstract, or keywords. The publication period was specified as December 2022 to May 2024.
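The logic of the Boolean search string can be illustrated with a short Python predicate. This is only a sketch: the function name `matches_search_string` is hypothetical, the matching is a simple case-insensitive whole-term check, and each database applies its own query syntax and tokenisation, so the real searches may behave differently.

```python
import re


def matches_search_string(text: str) -> bool:
    """Approximate the review's Boolean search string on one text field."""
    lowered = text.lower()

    def has(term: str) -> bool:
        # Whole-term match, so "esl" is not found inside "Tesla".
        pattern = r"(?<![a-z0-9])" + re.escape(term) + r"(?![a-z0-9])"
        return re.search(pattern, lowered) is not None

    # (ChatGPT OR GPT-4 OR GPT-4o)
    tool = has("chatgpt") or has("gpt-4") or has("gpt-4o")

    # (ESL OR EFL OR (English AND (L2 OR "second language" OR "foreign language")))
    context = has("esl") or has("efl") or (
        has("english")
        and (has("l2") or has("second language") or has("foreign language"))
    )
    return tool and context
```

For instance, a title such as "ChatGPT in EFL writing classes" satisfies both halves of the string, whereas "ChatGPT in nursing education" satisfies only the first and would not be retrieved.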

Inclusion and exclusion criteria

The search outcomes were filtered using the following inclusion and exclusion criteria:

  1. Topic and subject area: Studies had to focus on the use of ChatGPT and/or its subsequent releases (i.e., GPT-4 or GPT-4o) in ESL/EFL education. Articles that did not focus on the use of ChatGPT, GPT-4, and/or GPT-4o, or were from other subject disciplines were excluded.

  2. Study type: Included studies were required to report empirical research in which data from ChatGPT, AI detectors, and/or human participants were collected and analysed. Articles that did not discuss empirical studies (e.g., reviews, position papers, and editorials without empirical data) were excluded.

  3. Source: Only journal articles, conference papers, and book chapters were reviewed. Other sources of publications were excluded.

  4. Period: This review covered studies published between 1 December 2022 and 31 May 2024 (i.e., within a 1.5-year period after the release of ChatGPT). Articles outside this time frame were not included.

  5. Language: Only English-language articles were included in the review.

Table 2 summarises the inclusion and exclusion criteria applied when selecting the studies.

Table 2 Inclusion and exclusion criteria for study selection
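The five criteria can be read as a single eligibility check applied to each record. The following Python sketch shows that reading; the `Record` fields, the constant names, and the `is_eligible` function are hypothetical and are not taken from the review, where screening was carried out by human reviewers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Record:
    """Hypothetical screening record; field names are illustrative only."""
    focuses_on_chatgpt_in_esl_efl: bool  # criterion 1: topic and subject area
    is_empirical: bool                   # criterion 2: study type
    source_type: str                     # criterion 3: source
    published: date                      # criterion 4: period
    language: str                        # criterion 5: language


ALLOWED_SOURCES = {"journal article", "conference paper", "book chapter"}
PERIOD_START = date(2022, 12, 1)
PERIOD_END = date(2024, 5, 31)


def is_eligible(record: Record) -> bool:
    """Return True only if the record meets all five inclusion criteria."""
    return (
        record.focuses_on_chatgpt_in_esl_efl
        and record.is_empirical
        and record.source_type in ALLOWED_SOURCES
        and PERIOD_START <= record.published <= PERIOD_END
        and record.language == "English"
    )
```

Because the criteria are joined with `and`, failing any one of them (for example, a publication date of 1 June 2024) is enough to exclude a record.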

Study selection

A total of 202 records were retrieved through a database search on 1 June 2024. Duplicate articles were removed, yielding 121 unique records for screening. However, many of the records retrieved were not empirical studies (n = 27) or not related to ESL/EFL education (n = 14) and the use of ChatGPT (n = 6). Thus, they were outside the scope of this review. After reviewing the titles and abstracts, we assessed 74 full-text articles for eligibility. Of these, four articles were excluded because no empirical data were reported. Finally, 70 articles were included in this review. Figure 2 provides an overview of the article selection process.
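The counts in the selection process can be checked with simple arithmetic. The Python sketch below uses only the figures reported above; the variable names are illustrative, and the number of duplicates is derived here rather than stated in the text.

```python
records_retrieved = 202
unique_records = 121

# Derived, not reported directly: duplicates removed before screening.
duplicates_removed = records_retrieved - unique_records

# Exclusions at the title-and-abstract screening stage.
screening_exclusions = {
    "not empirical": 27,
    "not related to ESL/EFL education": 14,
    "not related to the use of ChatGPT": 6,
}

full_text_assessed = unique_records - sum(screening_exclusions.values())

# Full-text articles excluded because no empirical data were reported.
full_text_excluded = 4
included = full_text_assessed - full_text_excluded

print(duplicates_removed, full_text_assessed, included)
```

The totals agree with the reported flow: 121 unique records minus 47 screening exclusions leaves 74 full-text articles, and removing four of these leaves the 70 included studies.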

Fig. 2 PRISMA flow diagram of article selection

Data extraction and analysis

We used the theoretical framework described in Sect. "Theoretical framework" to guide our data extraction process and analysis. We extracted the author(s) and year of publication from each article. To address RQ1, which is related to the application domains of ChatGPT, we obtained information on (1a) the geographical locations of the studies, (1b) the educational contexts in which they were conducted, and (1c) the learning domains targeted. To address RQ2, which is related to the research methods, we classified the studies by (2a) type, including ChatGPT evaluation, AI detection, human observation, and human intervention (see Table 1). We also categorised (2b) the research approaches as qualitative, quantitative, or mixed methods and identified (2c) the data sources involved, such as surveys and interviews. The data pertinent to RQ1 and RQ2 were summarised using descriptive statistics to provide an overview of empirical research. To address RQ3, which is related to research issues, we conducted a content analysis to code the studies. Relevant themes were identified and categorised inductively (Braun & Clarke, 2006), without a predetermined coding scheme. The themes that emerged were subject to refinement throughout data analysis. The findings of the included studies within each thematic category were compared and contrasted. This analysis facilitated a deeper understanding of the ways in which ChatGPT has influenced ESL/EFL education, according to the literature.

To ensure the reliability of coding, all the included studies were independently coded by the first and third authors. This dual-coding approach allowed us to calculate inter-rater reliability using the percent-agreement method, a technique recommended by Stemler (2004). Accordingly, we achieved an inter-rater reliability exceeding 90%, indicating a high level of agreement between the two coders. When discrepancies arose, the first and third authors re-examined the articles in question to discuss and resolve their differences. This approach to coding and resolving discrepancies ensured the integrity and accuracy of our data extraction and analysis.
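The percent-agreement method is the share of coding decisions on which the two coders assigned the same code. A minimal Python sketch follows; the function name and the example codes are hypothetical and do not reproduce the review's actual coding data.

```python
def percent_agreement(codes_a, codes_b):
    """Percentage of items to which two coders assigned the same code."""
    if len(codes_a) != len(codes_b):
        raise ValueError("Both coders must code the same number of items.")
    if not codes_a:
        raise ValueError("At least one coded item is required.")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return 100.0 * matches / len(codes_a)


# Hypothetical example: ten studies coded by study type by two coders.
coder_1 = ["evaluation", "observation", "intervention", "observation", "detection",
           "intervention", "evaluation", "observation", "intervention", "observation"]
coder_2 = ["evaluation", "observation", "intervention", "intervention", "detection",
           "intervention", "evaluation", "observation", "intervention", "observation"]

agreement = percent_agreement(coder_1, coder_2)
```

In this made-up example the coders disagree on one study out of ten. Percent agreement does not correct for agreement expected by chance, which is one reason why some reviews report a chance-corrected statistic alongside it.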

Findings and discussion

The findings pertaining to each research question are presented and discussed in the following subsections.

RQ1

Within a 1.5-year period after its initial release, in which domains of ESL/EFL education was ChatGPT applied?

The findings of RQ1 are organised and discussed across three aspects: (1a) study locations, (1b) educational contexts, and (1c) learning domains. Table 3 provides an overview of our major findings and their implications for further research.

Table 3 Major findings associated with RQ1 (application domains) and implications for further research

1a. Study locations

Figure 3 shows that nearly half of the included studies (n = 34; 48.6%) were conducted in East Asia, including China (n = 12), Indonesia (n = 4), South Korea (n = 4), and several other regions. Around one-quarter of the included studies (n = 18; 25.7%) were conducted in Middle Eastern regions, such as Iran (n = 6) and Saudi Arabia (n = 6). Ten studies (14.3%) were conducted in European regions and Russia. Three studies (i.e., Mizumoto et al., 2024; Yancey et al., 2023; Yuan et al., 2024) involved ESL learners and/or their work in various locations. Notably, three studies were conducted in locations where English is a common language, but their participants’ first language was not English. The studies by Escalante et al. (2023) and Lee (2024) involved students who learnt English as a new language at the University of Hawaii in the United States. Another study by Liu et al. (2024b) was conducted at a university in New Zealand but involved Chinese international students for whom English was a foreign language. A potential research direction could involve exploring whether ChatGPT can assist international ESL/EFL students in adapting to learning environments in English-speaking countries. Researchers could investigate whether ChatGPT can provide personalised support for overcoming language barriers, becoming familiar with new cultural settings, and improving academic performance. Such a study would offer insights into the applications of ChatGPT for supporting diverse student populations in higher education.

Fig. 3 Locations where the included studies were conducted

1b. Educational contexts

Figure 4 shows that the majority of the included studies (n = 47; 67.1%) were conducted in higher education settings. Only two studies were conducted in K–12 educational settings: the studies by Allehyani and Algamdi (2023) and Kim and Park (2023) involved early childhood teachers and primary school students, respectively. This distribution of studies shows that there are unexplored areas in the context of K–12 ESL/EFL education. Accordingly, there is a need for research focusing on the implementation and impact of ChatGPT in early childhood, primary, and secondary education settings.

Fig. 4 Educational contexts of the included studies

1c. Learning domains

Figure 5 shows that the majority of the included studies focused on three core English language skills, namely writing (n = 29; 41.4%), speaking (n = 5; 7.1%), and reading (n = 2; 2.9%); no studies specifically addressed listening. In addition, we identified other learning domains, namely vocabulary (Malec, 2024; Mugableh, 2024), grammar (Kucuk, 2024), cultural appreciation (Zheng & Stewart, 2024), literature appreciation (Alhammad, 2024), and thinking skills and creativity (Kartal, 2024). The scarcity of studies on reading and the absence of studies on listening indicate a clear need for further research in these learning domains. Investigating how ChatGPT can support the teaching and learning of all four core skills would lead to a more comprehensive understanding of its affordances and limitations in ESL/EFL education.

Fig. 5 Learning domains of the included studies

RQ2

Within a 1.5-year period after its initial release, which methodological approaches were employed in the studies of ChatGPT in ESL/EFL education?

The findings of RQ2 are organised and discussed across two aspects: (2a) study types and (2b) research approaches and data sources. Table 4 provides an overview of our major findings and their implications for further research.

Table 4 Major findings associated with RQ2 (research methods) and implications for further research

2a. Study types

Figure 6 shows that the included studies were classified into four study types established in Table 1: ChatGPT evaluation (n = 15; 21.4%), AI detection (n = 2; 2.9%), human observation (n = 26; 37.1%), and human intervention (n = 27; 38.6%). Among the 15 studies classified as ChatGPT evaluation, the most common focus was the potential of this AI tool to support teaching and learning in writing (n = 8), followed by speaking (Wang et al., 2023; Young & Shishido, 2023), reading (Shin & Lee, 2023), and vocabulary (Malec, 2024), among others. The two AI detection studies (Alexander et al., 2023; Ibrahim, 2023) both focused on the writing domain. The studies that focused primarily on human participants were classified into approximately equal numbers of observation studies (n = 26) and intervention studies (n = 27). The observation studies largely focused on investigating teachers’ and students’ perspectives on using ChatGPT in ESL/EFL education. The numbers of participants in these studies ranged from four (Marzuki et al., 2023) to 867 (Liu et al., 2024a), M = 150.19, SD = 215.44. In the intervention studies, researchers generally experimented with the use of ChatGPT in ESL/EFL classrooms. Pre-experimental designs were the most common among these studies (n = 16), followed by true experimental designs (n = 7) and quasi-experimental designs (n = 5). Notably, these numbers do not add up to 27 (i.e., the total number of intervention studies) because Escalante et al. (2023) reported two sub-studies in their article. The numbers of participants in the intervention studies ranged from three (Yan, 2024) to 213 (Han et al., 2023), M = 50.44, SD = 43.07. The durations of these studies varied and could be categorised as a few learning tasks or sessions (n = 7), an interval of one month or four weeks (n = 4), five to 10 weeks (n = 11), or longer than 10 weeks or one semester (n = 6). In general, these studies had short durations. Longer-term studies are required to provide further insights into the effects of consistent interaction with ChatGPT on students’ language acquisition and its sustained impact on learning behaviour.

Fig. 6 Classification of the included studies by type

2b. Research approaches and data sources

The quantitative approach was the most common among the included studies (n = 26; 37.1%), followed by the qualitative (n = 23; 32.9%) and mixed methods approaches (n = 21; 30.0%). Figure 7 shows the data sources used in the included studies. We first explicated three data sources that emerged specifically in the context of ChatGPT research, namely ChatGPT output (n = 17), AI detector output (n = 2), and user screen recordings of interactions with ChatGPT (n = 2). The use of these data sources was closely related to the study type. Specifically, studies focusing on ChatGPT evaluation collected and analysed ChatGPT output, while those examining AI detection collected and analysed both ChatGPT and AI detector output. The use of user screen recordings is particularly noteworthy. For example, Üstünbaş (2024) used such data and a stimulated-recall interview approach, which enabled users to comment on their experiences as they interacted with ChatGPT. This approach could provide valuable insights into how different users employ ChatGPT as a virtual partner in English language learning.
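The percentages reported for the three research approaches are descriptive statistics that can be recomputed from the counts. The Python sketch below does so; the variable names are illustrative, and only the three counts come from the review.

```python
approach_counts = {"quantitative": 26, "qualitative": 23, "mixed methods": 21}

total = sum(approach_counts.values())  # all 70 included studies

# Share of each approach as a percentage, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in approach_counts.items()}

# The most frequent approach, and whether it exceeds half of all studies.
most_common = max(approach_counts, key=approach_counts.get)
has_majority = approach_counts[most_common] > total / 2
```

The three approaches are spread fairly evenly: the quantitative approach is the most frequent, but no single approach accounts for more than half of the included studies.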

Fig. 7 Data sources used in the included studies

As shown in Fig. 7, the two most common data sources were participants’ self-reported data, namely surveys (n = 34) and interviews (n = 26). Other types of self-reported data included user journals (n = 6), students’ verbal feedback (n = 1), and online discussions (n = 1). Comparatively, few studies collected and analysed data from objective measures, such as tests (n = 10), participants’ work (n = 4), and observations (n = 2). Future research should incorporate more objective data sources to provide a balanced and comprehensive understanding of the impact of ChatGPT on students’ English language acquisition. For example, standardised tests, student work, and direct observations could be included to increase the robustness of research evidence and complement self-reported data.

RQ3

Within a 1.5-year period after its initial release, what research issues related to ChatGPT in ESL/EFL education were identified?

The research issues identified in the included studies were classified into four major themes. Two themes were specifically related to core English language skills: (3a) writing (n = 30) and (3b) speaking (n = 5). The other two themes were the general perspectives of (3c) teachers (n = 14) and (3d) students (n = 11) regarding the role of ChatGPT in ESL/EFL education. In addition, we identified several (3e) other research issues that had not been explored extensively. The findings of RQ3 are thus organised and discussed across these five areas. Table 5 provides an overview of our major findings and their implications for further research. Table 6 summarises the themes and subthemes of the research issues that emerged from the included studies.

Table 5 Major findings associated with RQ3 (research issues) and implications for further research
Table 6 Themes and subthemes of the research issues in the included studies

3a. Research issue 1: Writing (n = 30)

Over 40% (n = 30) of the included studies addressed research issues related to writing. Nineteen studies focused on how ChatGPT influenced the teaching and learning of writing, seven examined the use of ChatGPT in assessments of student writing, two evaluated ChatGPT’s capabilities in writing, and two investigated methods used to detect ChatGPT-generated writing. Regarding the teaching and learning of writing, multiple studies provided evidence that ChatGPT could assist students with generating ideas and materials for consultation (Al-Obaydi et al., 2023; Mahapatra, 2024; Nugroho et al., 2024; Üstünbaş, 2024), organisation and structure (Lee, 2024; Mahapatra, 2024; Nugroho et al., 2024; Tsai et al., 2024), spelling and grammar (Lee, 2024; Mahapatra, 2024; Nugroho et al., 2024; Tseng & Lin, 2024), and vocabulary and word choice (Lee, 2024; Nugroho et al., 2024; Tsai et al., 2024; Üstünbaş, 2024). As students noted, “it guides us in obtaining the required information, arranging our ideas, and writing correctly” and “explains the grammar issues when asked” (Mahapatra, 2024, p. 9). Similar to the present review, Meniado (2023) found that ChatGPT’s ability to help students notice and correct errors, along with its support in organising ideas and adhering to genre-specific structures, contributed to improved writing performance. Both sets of findings highlight the potential of ChatGPT to scaffold the writing process.

However, the findings associated with ChatGPT-assisted writing were not overwhelmingly positive. Echoing the findings of Meniado (2023), students in several studies expressed dissatisfaction with ChatGPT for various reasons, including inaccuracy (Hieu & Thao, 2024; Nugroho et al., 2024; Yuan et al., 2024), technical problems (Hieu & Thao, 2024), and an inability to provide desirable responses (Han, 2023; Yan, 2024). One student described it as “a powerful yet demanding tool with diversified and unpredictable outcomes” (Yan, 2024, p. 11). Ahmed (2023) reported that a majority of students in an EFL writing class were dissatisfied with ChatGPT. The students lamented that the opportunities for interaction were more frequent and satisfying in a teacher-mediated writing class. Although ChatGPT can supplement the teaching and learning of writing, it cannot fulfil the role of a teacher (Ahmed, 2023; Escalante et al., 2023; Üstünbaş, 2024).

Table 7 shows that six included studies (Boudouaia et al., 2024; Escalante et al., 2023; Ghafouri et al., 2024; Mahapatra, 2024; Silitonga et al., 2023; Song & Song, 2023) quantitatively compared ChatGPT-assisted writing (experimental condition) with traditional classroom instruction (control condition). These studies focused on students’ writing performance and/or motivation. Except for the study by Escalante et al. (2023), the study results generally indicated that students in the experimental groups significantly outperformed those in the control groups (Table 7). The study by Song and Song (2023) further provided a breakdown of students’ writing performance, showing significant improvements in their experimental group in terms of content, organisation, and language use. Similarly, Boudouaia et al. (2024) reported significant improvements in students’ task achievement, coherence and cohesion, grammatical range and accuracy, and lexical range and accuracy. Regarding students’ writing motivation, Silitonga et al. (2023) and Song and Song (2023) reported that students in the experimental groups had higher levels of motivation than those in the control groups (Table 7). However, too few studies were available to conduct a meta-analysis of the overall effect of ChatGPT on students’ writing performance and motivation. Furthermore, all comparison studies were conducted in higher education settings. Therefore, these results might not be generalisable to other contexts, such as primary and secondary education. Clearly, more comparison studies are needed in both higher and K–12 education settings.

Table 7 Quantitative results of comparison studies focusing on students’ writing

In addition to the influence of ChatGPT on the teaching and learning of writing, research has provided insights into the role of this AI tool in supporting ESL/EFL teachers’ assessments of student writing (n = 7). Some of its functionalities (i.e., automated scorer and feedback provider) were also identified in the review by Meniado (2023). Mizumoto and Eguchi (2023) and Mizumoto et al. (2024) demonstrated that the accuracy and reliability of ChatGPT-based automated essay scoring could complement human evaluations. ChatGPT was also able to distinguish between native and non-native English sentences and provide suggestions for correction (Cho, 2023). Yancey et al. (2023) provided further evidence suggesting that GPT-4 could achieve a nearly optimal writing evaluation performance. However, Algaraady and Mahyoob (2023) cautioned that while ChatGPT excelled at identifying surface-level errors, it struggled to detect deeper structural and pragmatic issues. The researchers therefore emphasised the irreplaceability of human teachers. Similarly, Obata et al. (2023) noted the challenges associated with relying solely on AI models for writing assessment. Regarding feedback on writing, Guo and Wang (2024) observed that ChatGPT generated more feedback than human teachers. Furthermore, this feedback was distributed evenly across the content, organisation, and language aspects of writing and could potentially lessen teachers’ burden of writing feedback. However, Guo and Wang (2024) also revealed teachers’ concerns about the length, readability, and relevance of ChatGPT-generated feedback. These studies collectively highlight the importance of integrating both AI and human expertise into ESL/EFL education to effectively evaluate and provide feedback on writing.

Regarding ChatGPT’s capabilities in writing, Wang (2023) and Zindela (2023) conducted a syntactic complexity analysis of ChatGPT-revised and ChatGPT-generated essays, respectively. Wang (2023) tasked ChatGPT with revising students’ essays and found that the ChatGPT-revised essays better matched the characteristics of high-level argumentative writing compared to students’ original essays. Similarly, Zindela (2023) found that ChatGPT-generated essays used more sophisticated and varied vocabulary compared to students’ argumentative writing. These results indicate the potential use of ChatGPT to support the teaching and learning of writing by improving the quality of students’ essays and providing high-quality sample essays.

Regarding the detection of ChatGPT-generated writing (n = 2), Ibrahim (2023) examined the effectiveness of two AI-detection platforms in identifying machine-generated text within a dataset of 240 human-written and ChatGPT-generated essays. His findings revealed that while both detectors could identify AI-generated content, they performed inconsistently across the dataset. The study highlighted the need for a more reliable detection mechanism to address AI-assisted plagiarism in the context of ESL/EFL education. In addition to AI detector capabilities, Alexander et al. (2023) investigated the challenges faced by ESL teachers when identifying ChatGPT-generated texts. They found that the teachers often lacked awareness of the characteristics and metrics used by ChatGPT and did not focus enough on fact-checking the content. Their findings highlighted teachers’ needs for enhanced digital and AI literacy, professional development (PD), advanced detection tools, and updated assessment policies to maintain academic integrity, as also emphasised by other researchers (e.g., Bozkurt et al., 2023; Meniado, 2023; Tlili et al., 2023).

3b. Research issue 2: Speaking (n = 5)

Although few studies addressed speaking, these studies covered three areas, namely the use of ChatGPT in generating dialogue materials (n = 2), its role as a learning partner (n = 2), and its use as an assessment tool (n = 1). First, Young and Shishido (2023) examined the effectiveness of ChatGPT in terms of generating dialogue materials suitable for EFL students. They concluded that the materials were suitable for students in primary education settings. Kim and Park (2023) compared students’ perceptions of role-playing scripts derived from textbooks with those of scripts generated by ChatGPT. Their students consistently rated the ChatGPT-generated scripts as more interesting than the book-derived scripts. While both studies highlighted the potential use of ChatGPT-generated materials for practising speaking in primary education settings, further research is needed to explore its efficacy in other educational contexts, such as secondary and higher education.

Second, two studies explored the potential use of ChatGPT as a partner in learning to speak English. Muniandy and Selvanathan (2024) instructed their students to use ChatGPT to simulate various roles (e.g., TED speakers) when developing persuasive speeches, generating outlines and presentation slides, and comparing these with their own work. Although ChatGPT did not support voice conversations at the time of that study, the students used an extension called ChatGPT Voice Master from the Chrome Web Store to address this limitation. Muniandy and Selvanathan (2024) found that ChatGPT boosted students’ confidence and speaking skills. However, inaccurate information, difficulty in using the correct prompts, and technical issues were the major challenges encountered when using ChatGPT. In another study, Lee et al. (2023) integrated ChatGPT into Augmented Reality glasses to enhance students’ speaking skills and provide a contextual language learning experience. This integrated approach led to improvements in students’ perceptions of task competence and aesthetic appeal compared with traditional English language learning.

Third, Wang et al. (2023) developed new ways to use ChatGPT to evaluate how well ESL learners placed pauses in their speech. Their results indicated that ChatGPT partially understood punctuation breaks but tended to overlook slight pauses between semantic groups. Nevertheless, Wang et al. (2023) recognised the potential of this AI tool in speech assessment and encouraged further exploration of strategies for optimising prompt design to enhance its performance.

3c. Research issue 3: Teachers’ general perspectives (n = 14)

Solid evidence indicates that ESL/EFL teachers’ general perspectives on ChatGPT comprised a mixture of affordances and concerns (Allehyani & Algamdi, 2023; Derakhshan & Ghiasvand, 2024; Mabuan, 2024; Mohamed, 2024; Ulum, 2024). We discuss three major affordances (i.e., increased learning opportunities, personalised learning, and teacher support) identified in the included studies. First, teachers recognised that ChatGPT could increase ESL/EFL students’ opportunities to practise their language in real time (Allehyani & Algamdi, 2023; Mohamed, 2024; Ulla et al., 2023). One teacher stated that “ChatGPT may allow students to have the opportunity to actively formulate questions, request further explanations, and produce replies, which may foster an active engagement in practicing their language skills, resulting in enhanced language proficiencies” (Ulla et al., 2023, p. 175). Second, ChatGPT could offer personalised learning experiences by tailoring content to a student’s proficiency level (Alenizi et al., 2023; Mohamed, 2024; Yeh, 2024). In the study by Alenizi et al. (2023), for example, teachers confirmed that “ChatGPT can provide personalized learning experiences for special education students” (p. 17) and that it “adapts to the student’s learning style, pace, and need” (p. 18). Third, ChatGPT could assist teachers in generating and refining lesson plans, exercises, and activities based on specific learning objectives in ESL/EFL classrooms (Alenizi et al., 2023; Farzaneh, 2024b; Ulla et al., 2023; Yeh, 2024). These findings aligned with those of Meniado (2023), who also identified ChatGPT’s potential to serve as both a lesson planner and an instructional material developer. In the study by Yeh (2024), for example, teachers used ChatGPT not only to create vocabulary exercises but also to refine educational song lyrics and align them with the lesson objectives, thus making instructional materials more accessible to their students.

Despite these affordances, the included studies collectively pinpointed five major concerns regarding ChatGPT, namely the occasional provision of incorrect information (Gao et al., 2024; Mohamed, 2024), privacy leakage when using it (Gao et al., 2024; Mohamed, 2024), academic dishonesty (Cong-Lem et al., 2024; Hieu & Thao, 2024), students’ over-reliance on this AI tool (Cong-Lem et al., 2024; Dehghan, 2024a), and hindered real-life communication (Alenizi et al., 2023; Mohamed, 2024). The first three concerns have been well documented in the review by Meniado (2023) as well as previous reviews of ChatGPT research (see Gödde et al., 2023; Imran & Almusharraf, 2023; Lo et al., 2024; Mohamed, 2024; Zhang & Tur, 2023). Regarding students’ over-reliance, across the included studies (Cong-Lem et al., 2024; Dehghan, 2024a; Gao et al., 2024; Mohamed, 2024; Ulla et al., 2023), teachers expressed concerns that students might become overly dependent on AI assistance instead of developing their language skills. This over-reliance could also impair students’ development of critical thinking skills (Cong-Lem et al., 2024; Mohamed, 2024; Ulum, 2024) and creativity (Cong-Lem et al., 2024; Dehghan, 2024a; Derakhshan & Ghiasvand, 2024). Regarding the hindrance of real-life communication, ChatGPT could not provide the same level of nonverbal cues as a human (Alenizi et al., 2023). As one teacher noted, “ChatGPT may become a crutch, hindering effective communication in real-life situations, and may not capture the nuances of human interaction” (Mohamed, 2024, p. 3206). However, follow-up studies are required because the latest release of GPT-4o supports voice conversations (OpenAI, 2024), potentially enhancing the effectiveness of real-life communication.

Finally, teachers expressed that they would need training to effectively integrate ChatGPT into their teaching practices (Alenizi et al., 2023; Allehyani & Algamdi, 2023; Mabuan, 2024). In efforts to inform teachers’ PD, Alrishan (2023) and Dehghani and Mashhadi (2024) used the Technology Acceptance Model to investigate pre-service and in-service teachers’ intention to use ChatGPT in ESL/EFL education. Both studies showed that perceived usefulness and ease of use were crucial in shaping teachers’ intention to use ChatGPT. Therefore, PD programmes should ensure that teachers find ChatGPT both useful and easy to use. Through such training, teachers could learn to effectively integrate ChatGPT into their teaching practices, become empowered to evaluate and modify its outputs, and learn strategies to mitigate the potential negative impacts of ChatGPT on ESL/EFL education.

3d. Research issue 4: Students’ general perspectives (n = 11)

Research on students’ general perspectives examined whether they found ChatGPT useful, their satisfaction with it, and their motivation to learn English. Across studies (Bin-Hady et al., 2023; Klimova et al., 2024; Liu & Ma, 2024; Liu et al., 2024a; Shaikh et al., 2023; Vo & Nguyen, 2024), students generally expressed that ChatGPT was a useful tool for English language learning. Their perceived usefulness influenced their intention to use ChatGPT for English language learning (Liu et al., 2024a; Xu & Thien, 2024). ChatGPT scaffolded their learning process by acting as a partner in practising language, providing feedback on language use, and recommending additional activities for practising (Bin-Hady et al., 2023; Liu et al., 2024a). In the study by Klimova et al. (2024), the students shared several useful applications of ChatGPT, including explaining, writing, and copy-editing. Consequently, the student participants in several studies (Klimova et al., 2024; Markus et al., 2023; Shaikh et al., 2023) were generally satisfied with the use of ChatGPT. Furthermore, using ChatGPT could increase students’ motivation to learn English (Markus et al., 2023; Muthmainnah et al., 2024). In the students’ words, “The material provided is easier to understand when interacting with ChatGPT” and “I feel that my motivation to learn English has increased with ChatGPT” (Muthmainnah et al., 2024, p. 34).

Despite its positive influence on their learning experiences, some ESL/EFL students expressed concerns about ChatGPT. Most of these concerns mirrored those of teachers, including issues with information accuracy, academic dishonesty, and over-reliance on ChatGPT (Klimova et al., 2024; Marjanovikj-Apostolovski, 2024; Xiao & Zhi, 2023). In addition, some students reported that it was difficult to obtain a desirable output from ChatGPT (Liu et al., 2024a; Xiao & Zhi, 2023). Notably, this challenge might have stemmed from both ChatGPT’s limited understanding of the students’ input and the students’ lack of knowledge regarding the use of appropriate prompts. As one EFL student of computer science explained, “ChatGPT still has many limitations, especially in terms of understanding the users’ input. People need to give appropriate prompts so that they can find it enjoyable and effective to use ChatGPT” (Liu et al., 2024a, p. 16). Therefore, training should be provided to improve students’ ability to formulate effective prompts for eliciting responses from ChatGPT and their understanding of its input-processing limitations.

3e. Other research issues (n = 10)

As shown in Table 6, we also identified several under-explored research issues that may hold implications for a broader understanding and application of ChatGPT in ESL/EFL education. The following highlights the major findings from individual studies in three learning domains, namely reading, vocabulary, and grammar.

  • Reading: Rees and Lew (2024) investigated the effectiveness of ChatGPT-generated definitions in helping students resolve uncertainties about vocabulary during a reading task. They found no significant difference in performance between students who used the ChatGPT-generated definitions and those who used definitions from the Macmillan English Dictionary. Shin and Lee (2023) compared ChatGPT-generated reading comprehension tests with those created by human experts. In terms of naturalness, they found that the flow and expressions in the AI-generated materials were comparable to those in human-created materials. However, the expert-created reading passages and test items appeared to be superior in terms of attractiveness and completeness.

  • Vocabulary: Mugableh (2024) explored the potential use of ChatGPT to create vocabulary exercises. His findings indicated that students using ChatGPT-generated exercises significantly outperformed those using traditional exercises. However, Malec (2024) discovered that ChatGPT’s performance in generating distractors for multiple-choice vocabulary questions was unsatisfactory.

  • Grammar: Kucuk (2024) examined the effectiveness of integrating ChatGPT in the teaching and learning of grammar. The results of his grammar test indicated that students with ChatGPT support scored significantly higher than those without.

Conclusion and limitations

This systematic review analysed 70 empirical studies related to the use of ChatGPT in ESL/EFL education within a 1.5-year period after its initial release. Compared to the previous review by Meniado (2023), there has been a substantial increase in the volume of relevant research in recent months. This growing trend in research is likely to continue. However, researchers should first identify gaps in the fast-growing literature to avoid overlooking previous efforts in this research area.

Using the Technology-based Learning Model, we provided an overview of the application domains, methodological approaches, and research issues that have emerged from research on ChatGPT in ESL/EFL education. We found that the majority of existing studies addressed the writing domain. However, the effect of ChatGPT use in writing courses remains under-evaluated. Very few comparison studies (e.g., ChatGPT-supported vs. traditional approaches) have been conducted, which has hindered the use of a meta-analytical approach to summarising the influence of this AI tool on students’ writing performance and motivation. The efficacy of ChatGPT in supporting the teaching and learning of other language skills (i.e., reading, speaking, and listening) is also under-researched. Therefore, we recommend that further studies with more rigorous research designs, such as quasi-experimental and true experimental designs, be conducted to explore these areas. It will also be necessary to include more objective data sources (e.g., standardised tests and student work) to offer more robust research evidence. In light of the rapid advancements in AI technology, the capabilities of ChatGPT are likely to have improved further since the time of writing. Future research should continue to evaluate the evolving capabilities and potential affordances and concerns associated with the use of ChatGPT in ESL/EFL education.

Finally, several limitations of this review must be acknowledged. First, our research synthesis was constrained by the information reported by the study authors. The absence of an entity (e.g., data source) or a theme (e.g., participants’ perceptions) did not necessarily imply the absence of a specific category. Instead, it indicated only that the authors did not explicitly report such information in their articles. Second, although we summarised findings regarding major research issues, these findings were primarily based on studies conducted in higher education settings. Therefore, some findings of this review might not be generalisable to other educational contexts, such as primary and secondary education. Further studies are required to investigate the impact of ChatGPT on ESL/EFL education in K–12 settings. Third, most of the empirical studies included in this review were of short duration, which may have led to a novelty effect. Longer-term studies are necessary to determine whether the impact of ChatGPT on students’ English language acquisition is sustainable.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Acknowledgements

Nil.

Funding

This work was supported by The Education University of Hong Kong (Project No. #04A45, #CB382, and #CB383) and by the Department of Mathematics and Information Technology (Departmental Research Grant; MIT/DRG02/24-25), The Education University of Hong Kong.

Author information

Contributions

Conceptualisation and design of the work: CKL; the acquisition, analysis, interpretation of data, and creation of resources used in the work: CKL, PLHY, SX; drafting the work and substantively revising it: CKL, PLHY, SX, DTKN, MSYJ. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Chung Kwan Lo.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Lo, C.K., Yu, P.L.H., Xu, S. et al. Exploring the application of ChatGPT in ESL/EFL education and related research issues: a systematic review of empirical studies. Smart Learn. Environ. 11, 50 (2024). https://doi.org/10.1186/s40561-024-00342-5

  • DOI: https://doi.org/10.1186/s40561-024-00342-5
