An augmented reality learning system for Japanese compound verbs: study of learning performance and cognitive load

To address the two difficulties of acquiring Japanese compound verbs, namely the lack of clarity of verb combinations and the opacity of compound verb meanings, we designed and developed an augmented reality (AR) learning system based on image schemas and AR animations. We compared the effects of the AR learning system and paper-based image schema materials on learning performance and cognitive load. This study also examined the correlation between learning performance and cognitive load. Learners using either method significantly improved their performance on the post-test. The AR learning system was more effective for the retention of knowledge in particular. However, there was no significant difference in perceived cognitive load between the two learning methods. We also found that learning performance under each method was related to different types of perceived cognitive load.


Introduction
In recent years, the need for non-Japanese people to learn the Japanese language has spread worldwide, for purposes such as tourism, studying abroad, and seeking employment in Japan. According to the Survey on Japanese-Language Education Abroad (The Japan Foundation, 2019), the number of learners of Japanese as a foreign language (JFL) increased 30.2-fold from 1979 to 2018. Mandarin and Korean are two languages in which two verbs can be combined to form a compound verb; there are also compound verbs in Japanese (Isobe, Okabe, & Kido, 2018). Compound verbs represent the concatenation of single verbs; the first single verb in a compound verb always appears in the adnominal form. Moreover, the meanings are established by the interaction between the meanings of the single verbs. For example, the compound verb yobi-kakeru is formed from the single verbs yobu and kakeru. The single verb yobu means "call" and kakeru means "hang"; however, the meanings of the compound verb yobi-kakeru are "call out" and "urge". In this study, we refer to the first single verb as V1 and the second single verb as V2. Compound verbs frequently appear in daily life, but are difficult for JFL learners to learn (Kanzaki & Isahara, 2019; Matsuda, 2000, 2002). Matsuda (2002) and Sano (2004) found two difficulties relating to compound verb acquisition: (1) the lack of clarity of verb combinations, and (2) the opacity of compound verb meanings. The clarity of the combination plays a vital role in judging whether a compound verb exists, that is, which two single verbs can compose a compound verb and in which order they combine (i.e., V1-V2 or V2-V1). The meanings are opaque because the meaning of a compound verb is not simply the combined meaning of V1 and V2; numerous compound verbs are polysemous and have their own meanings. Therefore, the meaning of a compound verb cannot be easily inferred (Matsuda, 2002).
Because of these two difficulties, learners struggle to comprehend and retain compound verbs (Chen, 2007; Sano, 2004).
Cognitive linguistics (CL) is essential for exploring the modeling and connections of linguistic structures with conceptual knowledge and communicative function (Gibbs, 2006). Extensive CL research has been performed to understand the characteristics of words (e.g., Bolinger, 1977). Matsuda (2001) proposed image schemas of Japanese compound verbs by applying the core theory and image schema of cognitive linguistics to Japanese language education. The image schemas of compound verbs are frequently utilized in compound verb instruction; one image schema is employed to comprehend the meanings of a compound verb. However, one particular concern is that the simplification of the image schemas may make it difficult for learners to understand them, as image schemas are composed of abstract arrows and graphics (Tagawa & Yuizono, 2016). Sato (2016) noted that, instead of directly offering image schemas to learners, presenting visual glosses such as dynamic images, 3D animations, and visual explanations in multimedia environments will lead to improved learning effectiveness. For example, 3D animations can be used instead of image schemas to learn target vocabulary (Sato, 2016).
The advancement of new technologies has brought new opportunities for language acquisition. Augmented reality (AR) is defined as a system that combines the real and the virtual, is interactive in real time, and is registered in three dimensions (3D) (Azuma, 1997). It has a vital role in the way that language is acquired (Santos et al., 2013). AR is a graphics technology that provides interactive visual glosses of concepts to users (Billinghurst & Duenser, 2012). AR attaches information to the learning material to direct the learner's attention toward relevant visual information. It also lets learners physically interact with the learning materials and observe digital feedback on their actions, which may support them in understanding abstract concepts (Fan, Antle, & Warren, 2020). The advantages of annotations of the real world, visualization of context, and vision-haptic visualization make AR promising for effectively learning and teaching abstract concepts and complex problems, and it shows strong potential as a new educational tool related to objects in the real world (Santos et al., 2013; Walczak, Wojciechowski, & Cellary, 2006). For example, Ibrahim et al. (2018) found that learning performance with AR, in which words were displayed on real-world objects, was significantly higher than with flashcards in a recall test. Thus, AR technology has significant potential to play a crucial role in facilitating language learning. In this study, we employed an AR compound verb learning system based on image schemas and AR animations to support learners in acquiring compound verbs.
Cognitive load is assumed to be related to the learners' working memory capacity (Paas & Sweller, 2014). Since the working memory capacity of learners is limited, instructional design and learning materials might affect how learners interact with the learning environment and experience cognitive load (Lin, Atkinson, Christopherson, Joseph, & Harrison, 2013; Paas & Sweller, 2014). There is therefore a need to explore the relationship between the learning method, including its learning materials, and the cognitive load perceived by the learners. This paper is organized as follows. The literature review covers image schemas, AR for language acquisition, and cognitive load. The research questions section describes the purpose of this study and states the research questions. The section on the development of the AR compound verb learning system describes the AR learning system and its functions. The methods section describes the methodology of this study. The results section reports learning performance, cognitive load, and the correlation between changes in learning performance and cognitive load. The discussion section is organized around the three research questions. Finally, we conclude the paper and describe future work.

Image schemas
Image schemas are a crucial concept of CL. They are a structural representation of various experiences, based on the orientation, movement, and interaction of our bodies. Schemas are a pattern of our physical experiences, and are important mediators between language and concepts. They are useful for understanding both literal and figurative meanings (Sato, 2016). For example, the image schema of "in" is the mode of the container, which is used to indicate that something contains something.
Image schemas are frequently utilized in language instruction (e.g., Tyler, 2004). Benjamin (2012) and Mitsugi (2017) investigated the effectiveness of image schema based instruction (ISBI) in language teaching. For example, in one study, teachers let learners of English as a second language use their imagination to draw image schemas of phrasal verbs. This reduced learners' confusion about the use of phrasal verbs (Benjamin, 2012). ISBI is currently used mainly for English acquisition, but Matsuda (2001) suggested that image schemas could be applied to Japanese language education. As shown on the right side of Fig. 1, the image schema consists only of simple graphics and arrows, because it was originally devised by linguists for semantic analysis. Therefore, rather than directly presenting the image schemas to learners (Tagawa & Yuizono, 2016), visual glosses in a multimedia environment may be easier to understand and more effective for second language learning than paper-based materials (Sato, 2016).

AR for language acquisition
With the popularity of mobile devices and the advancement of technology, various new learning methods have become possible through the combination of new technologies. AR is a graphics technology that learners can experience quickly and efficiently, without wearing special equipment such as a head-mounted display. The three features of an AR system are the combination of real and virtual, real-time interaction, and registration in 3D (Azuma, 1997), which allow for a simultaneous combination of real-world and virtual objects (Sin & Zaman, 2010; Ibanez et al., 2016). AR can effectively support the learning of abstract and complex content, meaning that it has the potential to greatly improve learning performance (Chiu, DeJaegher, & Chao, 2015). There are three main advantages of AR learning experiences: annotations of the real world, visualization of context, and vision-haptic visualization (Santos et al., 2013). AR can provide better learning performance than traditional learning methods by using virtual objects integrated into the real environment (Santos et al., 2016). AR improves elaboration by employing more meaningful cues for users to see in the real world. Furthermore, AR can also integrate perceived visual images and touch to present visual information.
AR systems are generally categorized as location-based or image-based (Cheng & Tsai, 2013). Location-based AR systems utilize data related to the location of mobile devices. Image-based AR systems, however, focus on image recognition technology, which is used to determine the correct position of virtual content relative to physical objects in the real environment. The application developed in this research employs an image-based system, with animations displayed on verb cards that carry the printed text of the verb. Thus, the application can enhance the learner's familiarity with the verb through interaction with the verb cards during the learning process.
Language acquisition research using AR has also been performed (e.g., Boonbrahm, Kaewrat, & Boonbrahm, 2015; Hsu, 2017; Ibrahim et al., 2018; Mahadzir & Phung, 2013; Santos et al., 2016). As an example of an application for learning English as a foreign language, Boonbrahm et al. (2015) created a marker-based AR application to learn the spellings and meanings of animal-related words; the learners can arrange letters to form a word that matches the name of an animal, and the 3D animal will be displayed. Mahadzir and Phung (2013) designed an AR pop-up book to assist primary school students in Malaysia in learning English by scanning the material through webcams. A recent study by Ibrahim et al. (2018) described an AR application to support people learning foreign languages. Their approach used Microsoft HoloLens, and the learners must move around to scan the learning objects placed in the room to learn the meaning and pronunciation of objects in a foreign language. AR facilitates the presentation of visual information, which provides the crucial conditions for language acquisition and allows learners to physically interact with the materials (Fan et al., 2020). AR as multimedia can potentially improve the learning performance of language learners (e.g., Hsu, 2017; Ibrahim et al., 2018; Mahadzir & Phung, 2013; Santos et al., 2016). According to the multimedia effect and the spatial and temporal contiguity principles of multimedia learning theory (Mayer, 2009), words that are presented using both text and photos are more likely to be learned than words that are presented using text alone. Because AR has three advantages (providing annotations of the real world, visualizing context, and providing vision-haptic visualization), learning objects can be annotated by attaching virtual content to them.
Therefore, a study has posited that learning with AR by annotating objects is better than learning about the same object by using other materials, such as paper-based textbooks (Santos et al., 2013). Moreover, displaying information simultaneously beside the object in the real world allows learners to integrate the object and the surrounding spatial information, thus providing learners with cues to remember as well as allowing memories to be retained for longer periods. For example, handheld AR systems for vocabulary learning in the Filipino and German languages were developed by Santos et al. (2016), and the study results indicate that the usage of handheld AR improved word retention.

Cognitive load
Cognitive load theory (CLT) involves working memory capacity and long-term memory. It indicates that cognitive load is related to the learner's working memory capacity, and that instructional design should take human cognitive structures into account because the capacity of working memory is limited (Paas & Sweller, 2014). CLT distinguishes three categories of cognitive load, namely, extraneous cognitive load, intrinsic cognitive load, and germane cognitive load (Sweller, 2010). The intrinsic cognitive load is related to the complexity of information, the extraneous cognitive load is caused by the instructional design, and the germane cognitive load is related to the acquisition of knowledge (Sweller, 2010). Instructional design and learning materials can affect the way that learners interact with the learning environment and experience cognitive load (Lin et al., 2013; Paas & Sweller, 2014). CLT, which is one of the most important theories that instructional designers must consider, identifies the cognitive processes that users engage in when using technology (İbili, 2019). For example, Lai, Chen, and Lee (2019) developed a multimedia learning system designed with AR that significantly reduced the extraneous cognitive load. However, the systematic review by Akçayır and Akçayır (2017) reported two contradictory results: some studies asserted that AR systems decreased cognitive load, whereas others asserted that they increased it. Based on multimedia learning theory and CLT, an AR system can reduce the cognitive load on limited working memory by providing memory cues. On the other hand, receiving significant amounts of information during a short period will increase the cognitive load on learners (Cheng & Tsai, 2013). Therefore, examining the cognitive load of AR systems is a crucial task.
The spatial contiguity principle proposed by Mayer (2002) indicates that, in a multimedia learning environment, words presented using both text and photos are learned more effectively and with less cognitive load than words presented using only text. Because AR is a technology wherein the real world and virtual objects are presented together, AR instructions can serve as a symbolic representation between abstract and concrete concepts; thus, AR instructions can facilitate learning by extending the learning content from the abstract to the concrete, thereby reducing the cognitive load of the learner (İbili, 2019). Santos et al. (2013) noted that AR provides interactions with information that can help learners perceive and memorize that information. In addition, as a feature of an AR system, real-world annotations can reduce the cognitive load for learners, thus leading to improved learning. On the other hand, some ineffective instructional designs can cause the split-attention effect, such as text not being on the same page as the target image, thereby increasing the demand on the learner's working memory, and potentially increasing the cognitive load and negatively impacting learning (Sweller, 2010). Hence, it is crucial to probe the cognitive load of AR learning environments when investigating learning performance.

Research questions
In this study, we designed an AR learning system based on the image schema of cognitive linguistics for acquiring Japanese compound verbs, in order to support compound verb learning. Moreover, we conducted an experiment to investigate the effects of paper-based image schema material and AR learning systems on learners' learning performance and cognitive load. Therefore, we set the following three research questions:
Research Question 1: Do the learners of the two learning methods (AR learning system and paper-based image schema material) differ in their learning performance?
Research Question 2: Do the learners of the two learning methods differ in their perceived cognitive load?
Research Question 3: Are the cognitive loads perceived by the learners related to their learning performance in each learning method?

Development of the AR compound verb learning system
In this study, based on our previous research (Geng & Yamada, 2019), we developed an AR compound verb learning system to support learners' compound verb learning. In this system, the meanings of verbs, including both single verbs and compound verbs, were represented by 3D animations created using Maya, according to the image schemas of the verbs. Maya is a 3D computer graphics application used to create interactive 3D animations and visual effects. Figure 1 depicts how the image schemas were converted into the character actions of the animations. For example, the graphics of the image schemas were converted into characters and objects; arrows were converted into actions and motions. In this way, the meanings of verbs were shown to learners by the 3D animations. The animations were also evaluated by five native Japanese speakers (including two Japanese-language teachers) to confirm the validity of the semantic explanation of each animation. Specifically, we presented each animation to each evaluator and asked them to judge whether the meaning conveyed by the animation matched its Japanese meaning, using a 5-point Likert scale (i.e., 1 = not at all, 5 = very much). We also interviewed them about their comments on each animation. The animations with an average score below 4 were revised according to the evaluators' comments. The animated verbs comprised 19 verbs (including single and compound verbs) that were extracted from the vocabulary lists of the N2 and N3 levels of the Japanese-Language Proficiency Test. Moreover, the image schemas of the verbs were created by Matsuda (2004) and Matsuda and Shiraishi (2011).
The system was designed based on marker-based AR, and was composed of a set of verb cards and an application for smartphones. One set of cards comprised 11 cards that were printed with the characters of the single verbs (e.g., Fig. 2). In this system, the learners can scan the verb card, and then the animation will be displayed on the card through the screen of the smartphone in the application. The application was developed using Unity 3D and Vuforia. Additionally, the combination function was developed based on the V1 + V2 strategy (Matsuda, 2004) in order to support learners to learn compound verbs. As shown in Fig. 3, learners can first learn single verbs via scanning the verb cards, and can then correctly combine the verb cards of V1 and V2 to learn compound verbs composed of V1 and V2. In detail, the application will recognize the number of cards in the camera interface, and judge the combination and order of these verb cards if the number of cards is two. The animations of the compound verbs will be displayed if the two verb cards can be successfully combined into a compound verb.
In addition, the following functions of the system were designed to address the two difficulties of acquiring compound verbs. (1) Lack of clarity of verb combinations: the judgment function was designed to indicate whether a verb combination is right or wrong. When the learner combines two verb cards, if the order of the verbs is not correct, or the two verbs do not form a compound verb, the application will deliver the message "The combination of the compound verb is incorrect!" on the interface. (2) The opacity of compound verb meanings: by combining the verb cards, the meanings of the single verbs and compound verbs are compared, making it possible for learners to distinguish the difference between their meanings. For example, when learners use a single card of V1, the screen shows the animation of V1, but by combining the verb cards V1 and V2, the application will show an animation of the compound verb V1-V2 (see Fig. 3). In this way, by scanning individual cards and card combinations, it is possible to distinguish between the meaning of the single verb and the compound verb it can form.
Fig. 2 The verb card of single verb yobu "call"
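The card-counting and judgment behavior described above can be sketched in a few lines. This is only an illustration in Python, not the authors' Unity 3D and Vuforia implementation: the lookup table, the function name judge_cards, and the assumption that the left-hand card is read as V1 are all hypothetical. Only the pair yobu + kakeru and the error message come from the text.

```python
from typing import List, Optional

# Hypothetical lookup table: (V1, V2) -> compound verb.
# Only yobu + kakeru is taken from the text; a real system would list
# every pair it supports.
COMPOUND_VERBS = {
    ("yobu", "kakeru"): "yobi-kakeru",
}

ERROR_MESSAGE = "The combination of the compound verb is incorrect!"


def judge_cards(cards_left_to_right: List[str]) -> Optional[str]:
    """Decide what to show for the verb cards currently in view.

    Returns the verb whose animation should play, the error message for
    an invalid pair, or None when there is nothing to judge.
    Assumption: the left-hand card is treated as V1.
    """
    if len(cards_left_to_right) == 1:
        # A single card plays the animation of that single verb.
        return cards_left_to_right[0]
    if len(cards_left_to_right) == 2:
        v1, v2 = cards_left_to_right
        # Order matters: (V2, V1) is not accepted in place of (V1, V2).
        return COMPOUND_VERBS.get((v1, v2), ERROR_MESSAGE)
    # Zero cards or more than two cards: no judgment is made.
    return None
```

With this table, judge_cards(["yobu", "kakeru"]) selects the compound verb animation, while the reversed order returns the error message, mirroring the two failure cases named in the text (wrong order, or a pair that forms no compound verb).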

Participants
Twenty-one students from a Japanese language school in Japan were recruited for this study; they were aged 18-25 years. The nationalities of the participants were Chinese, Korean, Vietnamese, and Thai, all of whom were non-native Japanese speakers. There were 12 male participants and nine female participants. In addition, the Japanese-language level of all participants was above N3, which means that they could understand the explanations of the verbs provided in the experiment. The N3 level is a junior-intermediate level of the Japanese language. Although learners at this level are able to understand Japanese used in everyday situations to a certain degree, they do not have much knowledge concerning compound verbs. The twenty-one participants were randomly assigned to the experimental group (n = 10) and the control group (n = 11). The experimental group conducted the learning activity with the AR system and materials, whereas the control group conducted the learning activity using only the materials. The materials of the experimental group contained only explanations and example sentences of the verbs, whereas the materials of the control group (the image schema group) consisted of the image schemas of the verbs, their explanations, and example sentences. Except for the image schemas of the verbs, the learning content, explanations, and example sentences of the two groups of materials were the same.

Experimental procedure
In this study, we conducted a 90-min experiment aimed at enhancing learners' acquisition of compound verbs. The experimental procedure is shown in Fig. 4. First, all participants took a pre-test before the learning activity, to evaluate their knowledge of compound verbs and single verbs. As shown in Fig. 5, the experimental group used the AR learning system to learn single verbs and compound verbs by scanning and combining single verb cards during the learning activity. In addition, the experimental group also used materials containing explanations and example sentences to understand the meanings of the verbs. The control group used only the material to learn Japanese verbs. All participants carried out the learning activity for 40 min. After the learning activity, all participants took a post-test and filled out the cognitive load questionnaire and the post questionnaire. Four weeks later, all of the participants took a delayed test.

Measuring tools
The measuring tools used in this experiment included the pre-test, post-test, delayed test, cognitive load questionnaire, and post questionnaire. To assess learners' knowledge of compound verbs and single verbs, we designed the pre-test, post-test, and delayed test. Each test consisted of two parts, comprising 15 true or false questions for examining the combinations of compound verbs, and 26 multiple-choice questions for testing the meanings of the verbs. The three tests are composed of the same questions and options from the question pool for single and compound verbs. However, the order of the questions and options varied across the three tests (see Additional file 1). Each test had a total score of 41 points, with one point per question. In addition, an expert teacher with more than ten years of experience in Japanese education evaluated the validity of the test questions. The cognitive load questionnaire was created by Leppink, Paas, Van der Vleuten, Van Gog, and Van Merriënboer (2013). It measures the learners' extraneous cognitive load, intrinsic cognitive load, and germane cognitive load. The intrinsic cognitive load is related to the complexity of information, the extraneous cognitive load is related to the design of material, and the germane cognitive load is related to the acquisition of knowledge (Sweller, 2010). The questionnaire used an 11-point Likert scoring scheme. The intrinsic cognitive load comprised three items (e.g., "The compound verbs covered in the activity were very complex."). The extraneous cognitive load comprised three items (e.g., "The instructions and explanations during the activity were very unclear"), and the germane cognitive load comprised four items (e.g., "The activity really enhanced my understanding of the compound verbs."). The Cronbach's alpha coefficient values reported in the original survey were 0.81, 0.75, and 0.82 for intrinsic cognitive load, extraneous cognitive load, and germane cognitive load, respectively, which shows that the instrument used by Leppink et al. to assess students' cognitive load had high internal consistency reliability (Leppink et al., 2013). The post questionnaire contained questions about learners' impressions after using the materials and the system. It asked participants to complete two open-ended questions: "write your impressions of using the system" and "write what could be improved (or difficult to use) about the system." We recorded and collated the responses to the questionnaire from all participants.
Fig. 4 The procedure of the experiment
Geng and Yamada Smart Learning Environments (2020) 7:27

Results
Learning performance
Table 1 presents the median, mean, and standard deviation (SD) values of the total scores in the pre-test, post-test, and delayed test for participants in the experimental and control groups. Table 1 shows that the total scores of the post-test and delayed test were higher than those of the pre-test in each group. Since the sample size was small and the data were not normally distributed, non-parametric analysis was used in this study. First, we conducted the Friedman test to compare the total scores of the three tests in the experimental group and the control group. There were significant differences between the total scores of the three tests in both groups (experimental group: χ² = 14.37, p < 0.01; control group: χ² = 8.750, p < 0.05). Wilcoxon signed-rank tests were conducted to evaluate whether the total score differed between each pair of tests. We also calculated the effect size r. The results of the Wilcoxon signed-rank tests are shown in Table 2. In both groups, significant differences were observed in the total scores between the pre-test and post-test (experimental group: |Z| = 2.677, p < 0.01, r = 0.85; control group: |Z| = 2.668, p < 0.01, r = 0.80). As shown in Tables 1 and 2, the experimental and control groups' mean values and medians of the total scores improved from the pre-test to the post-test, at a significance level of 0.01. Both results show a large effect size. This shows that there was a statistically significant difference in learners' learning performance after completing the learning activity.
Fig. 5 The pictures of experimental and control groups during the learning activity. Legend: On the left is the experimental group: the participant was using the AR system to distinguish the meaning between single verbs and the compound verb; on the right is the control group: the participant was learning with the paper-based image schema material
Furthermore, there was a statistically significant difference between the pre-test and the delayed test scores in the experimental group (|Z| = 1.963, p < 0.05, r = 0.62). There was also a statistically significant difference between the post-test and delayed test scores in each group (experimental group: |Z| = 2.051, p < 0.05, r = 0.85; control group: |Z| = 2.439, p < 0.01, r = 0.74). Table 3 shows the summary statistics for the median, mean, and SD values of the scores of parts one and two in the pre-test, post-test, and delayed test, for the experimental and control groups. The questions of part one were designed to examine combinations of the compound verbs, whereas part two was designed to examine the learners' knowledge of the meanings of single verbs and compound verbs. To compare the part one and part two scores across the three tests in each group, we conducted Friedman tests. Concerning the scores of part one, there were significant differences among the three tests in both groups (experimental group: χ² = 15.84, p < 0.01; control group: χ² = 16.70, p < 0.01). Significant differences were also found in the results of part two (experimental group: χ² = 12.47, p < 0.01; control group: χ² = 15.85, p < 0.01). The results of the Wilcoxon signed-rank tests are shown in Table 4.
There was a significant difference in scores for part one between the pre-test and post-test in the experimental group (|Z| = 2.816, p < 0.01, r = 0.89). Conversely, no statistically significant difference in the part one scores was found between the pre-test and post-test in the control group, at a significance level of 0.05 (|Z| = 1.196, p < 0.1, r = 0.36). Furthermore, it was found that the part one scores of the delayed test were
In order to verify the differences in learning performance between the experimental and control groups, we analyzed the test scores using Mann-Whitney U tests. The results of the Mann-Whitney U tests are shown in Table 5. As shown in Table 5, none of the differences in scores between the two groups were statistically significant.
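The text does not state how the effect size r was calculated. A common convention for the Wilcoxon signed-rank test is r = |Z| / √N, and assuming here that N is the number of participants in the group (10 experimental, 11 control), this convention reproduces the values reported for the pre-test versus post-test comparisons. The sketch below is a check under that assumption, not the authors' own procedure; the function name is illustrative.

```python
import math


def effect_size_r(z: float, n: int) -> float:
    """Effect size r = |Z| / sqrt(N) for a Wilcoxon signed-rank test.

    Assumption: N is the number of participants in the group.
    """
    return abs(z) / math.sqrt(n)


# |Z| values and group sizes as reported in the text.
comparisons = {
    "total score, pre vs post, experimental": (2.677, 10),
    "total score, pre vs post, control": (2.668, 11),
    "part one, pre vs post, experimental": (2.816, 10),
    "part one, pre vs post, control": (1.196, 11),
}

for label, (z, n) in comparisons.items():
    print(f"{label}: r = {effect_size_r(z, n):.2f}")
```

Under the usual rule of thumb (about 0.1 small, 0.3 medium, 0.5 large), the first three comparisons count as large effects and the last as medium.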

Cognitive load
In this study, the cognitive load questionnaire was divided into three categories: extraneous cognitive load, intrinsic cognitive load, and germane cognitive load. To verify the reliability of the survey, we calculated Cronbach's alpha; the coefficient values were 0.666, 0.674, and 0.871 for the extraneous, intrinsic, and germane cognitive loads, respectively. In addition, Mann-Whitney U tests were used to probe the influences of the AR learning system and the paper-based image schema materials on the learners' cognitive load. As shown in Table 6, the mean intrinsic cognitive load and germane cognitive load values of the experimental group were higher than those of the control group. The mean of the extraneous cognitive load was lower for the experimental group than for the control group. However, no statistically significant difference was found for the three categories of cognitive load between the experimental and control groups.
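As an illustration of how the three subscale scores and their reliability could be computed, the following sketch averages the items of each subscale and applies the standard Cronbach's alpha formula, α = k/(k−1) · (1 − Σ item variances / variance of the summed score). The item order (three intrinsic items, then three extraneous, then four germane) and the sample responses are assumptions made for the example; they are not the study's data.

```python
from statistics import mean, variance
from typing import Dict, List

# Assumed item layout: 0-based positions within one participant's ten answers.
SUBSCALE_ITEMS: Dict[str, List[int]] = {
    "intrinsic": [0, 1, 2],
    "extraneous": [3, 4, 5],
    "germane": [6, 7, 8, 9],
}


def subscale_means(answers: List[int]) -> Dict[str, float]:
    """Average one participant's ratings within each subscale."""
    return {
        name: mean(answers[i] for i in positions)
        for name, positions in SUBSCALE_ITEMS.items()
    }


def cronbach_alpha(rows: List[List[int]]) -> float:
    """Cronbach's alpha for one subscale.

    rows holds one list of item ratings per participant. Sample variances
    are used for both the items and the summed score.
    """
    k = len(rows[0])
    item_variances = [variance([row[i] for row in rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings of four participants on the three intrinsic-load items.
example_rows = [[2, 3, 3], [5, 4, 6], [7, 8, 7], [4, 4, 5]]
print(f"alpha = {cronbach_alpha(example_rows):.3f}")
```

Alpha rises when the items of a subscale vary together across participants, which is why it is read as a measure of internal consistency.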

Changes in learning performance and cognitive load
In order to investigate whether the perceived cognitive load was related to learning performance, correlation analyses using Spearman's rank correlation coefficient were conducted to assess the correlations between changes in learning performance and the cognitive load of the experimental and control groups. According to the results in Table 7, in the experimental group, there was a moderate negative correlation between learners' score changes from the pre-test to the delayed test (delayed test score - pre-test score) and intrinsic cognitive load (ρ = 0.563, p < 0.1). On the other hand, as shown in Table 8, a strong negative correlation was found between score changes from the post-test to the delayed test (delayed test score - post-test score) and the extraneous cognitive load. There was also a strong positive correlation between score changes from the post-test to the delayed test and germane cognitive load.
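For readers unfamiliar with the statistic, Spearman's rank correlation can be sketched as the Pearson correlation of the ranks, with tied values given their average rank. The example below pairs score changes (delayed test score minus pre-test score) with a cognitive load rating; the numbers are invented for illustration and are not the study's data, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    covariance = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    spread_x = sqrt(sum((a - mx) ** 2 for a in rx))
    spread_y = sqrt(sum((b - my) ** 2 for b in ry))
    return covariance / (spread_x * spread_y)


# Hypothetical data: score change per learner and a cognitive load rating.
score_change = [5, 3, 8, 1, 6]
load_rating = [4.0, 6.3, 2.7, 7.0, 5.0]
print(f"rho = {spearman_rho(score_change, load_rating):.3f}")
```

A negative rho means that learners who reported a higher load tended to show smaller score changes, and a positive rho means the opposite.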

Discussion
Research question 1: do the learners of the two learning methods (AR learning system and paper-based image schema material) differ in their learning performance?
The results of this experiment indicate that both the learners using the AR learning system and the learners using the paper-based materials significantly enhanced their learning performance. In particular, the AR learning system was more effective regarding the retention of knowledge. These findings are consistent with previous studies (e.g., Santos et al., 2016). Santos et al. (2016) designed two handheld AR systems for Filipino and German vocabulary learning, and their evaluation experiments suggested that AR improves the retention of words. One possible explanation is that AR can effectively promote learning and long-term retention by increasing the size of the chunks in working memory (Squires, 2017). AR technology can attach extra information to the learners' view of the real world, which frees part of the working memory to recall knowledge and to support new experiences and tasks in complex environments (Proctor and Zandt, 2018). The AR learning system of this study seemed to improve perception through verb cards, which are real-world annotations used to reduce the load on the learners' limited working memory. This meant that a larger proportion of the short-term memory could be used for their knowledge of the verbs (Santos et al., 2013).
To address the two difficulties of compound verb acquisition, we designed the tests in two parts. Regarding the results of part one, compared with the image schema materials, the AR learning system greatly aided learners in improving their learning of the combination of compound verbs. This could be explained by the design of the judgment function, which presents the correctness of each verb combination. On the other hand, the learners using the image schema materials retained the meanings of the verbs better than the AR system learners. This could be explained by the following feedback from the post-questionnaire: "I want the explanations and animations of the verbs to appear on the same screen." According to the spatial contiguity principle of multimedia learning theory, integrating the explanation and AR animation on the same screen might lead to better learning performance (Mayer, 2002). In the paper-based material, the explanation of the meaning and the image schemas were on the same page; thus, the paper-based material might have been more effective for learning the meanings of the verbs in this experiment. Based on the above, we found that the AR learning system was more effective for learning verb combinations, while the paper-based material was more effective for learning compound verb meanings. This provides ISBI with a new proposal: utilizing the features of AR to design functions that overcome the difficulties of language acquisition. Moreover, it should be noted that an AR system does not necessarily yield better learning effectiveness than paper-based materials; multimedia design principles need to be considered when using AR to design instructional materials.
Another result regarding learning performance was that there was no significant difference in the test scores between the two learning methods. This is consistent with the results of previous research (Sato, 2016), which suggested that there was no significant difference between the learning effects of animated glosses and pictorial glosses in learning English polysemous words. The result can likely be attributed to the fact that the participants' Japanese levels were too high: the mean pre-test scores of the two groups exceeded 70% of the perfect score (see Table 1). This is an important issue for future research. Further research needs to adjust for the Japanese language proficiency of the participants and the difficulty of the tests. In addition, due to the small effect sizes of these results, we need to validate the experimental results again in future work.
In the post-questionnaire, participants provided the following comments regarding the AR learning system: "easy to understand," "animations of the verbs were easy to remember," and "useful for learning." These comments suggest that the AR system was perceived as more effective than paper-based instruction. Although preliminary, our experiments suggest that AR may lead to increased perceived usefulness. With AR learning systems, scanning or combining cards to learn the meanings of single and compound verbs is an easy task and is perceived to be part of a useful learning environment. Therefore, further analysis of the reasons for the increased perceived usefulness is a crucial task. Future research should use learning analytics to confirm the relationships among these real-world behaviors (e.g., combining and scanning verb cards), perceived usefulness, and learning performance.
Research question 2: do the learners of the two learning methods differ in their perceived cognitive load?
The intrinsic and germane cognitive loads of the learners who used the AR system were higher than those of the paper-based materials learners, and their extraneous cognitive load was lower; however, there was no significant difference in any of the three categories of cognitive load between the two learning methods in this study. This finding was unexpected, and suggests that the perceived cognitive loads when using the animations of the AR system were equivalent to those when using the paper-based image schemas. This result may be due to the split-attention effect (Sweller, 2010), which is caused by the spatial separation of the AR system and the verb explanation materials. Although AR systems can reduce cognitive load by providing memory cues (Santos et al., 2013), non-intuitive AR system design and decentralized learning information increase learners' cognitive load (İbili, 2019; Van Merrienboer, Kester, & Paas, 2006). Therefore, the results of this study differ from those for an AR system developed according to the multimedia contiguity principle (Lai et al., 2019), which was found to reduce the perceived cognitive load of learners. We propose that instructional materials using AR should be designed to avoid the split-attention effect and to follow the multimedia contiguity principle in order to reduce extraneous cognitive load. For example, explanations of the verbs and AR animations could be integrated on the same screen of the application and displayed at the same time.
Research question 3: are the cognitive loads perceived by the learners related to the learning performance in each learning method?
Concerning the correlation between learning performance and perceived cognitive load, the results in Table 7 show that the lower the intrinsic cognitive load of the learners using the AR learning system, the better their retention of knowledge. Intrinsic cognitive load is caused by the number of interacting information elements contained in the learning task or learning material (Sweller, 2010). Thus, learners with low intrinsic load require less working memory to store information in long-term memory than learners with high intrinsic load (Leppink, Paas, Van Gog, van Der Vleuten, & Van Merrienboer, 2014).
In contrast, the correlations observed among the learners using the image schema materials indicate that the higher the extraneous cognitive load perceived by the learners, the lower the germane cognitive load and the poorer the retention of the verbs. This finding is consistent with CLT, which proposes that as the extraneous cognitive load increases, the germane cognitive load decreases (Schnotz & Kürschner, 2007). Sweller (2010) asserted that the extraneous cognitive load and intrinsic cognitive load are additive, and that the total load cannot exceed the available resources. If many resources are allocated to the extraneous cognitive load, few resources remain available in working memory. Therefore, it becomes difficult to transfer the learning contents from short-term memory to long-term memory (Galy & Mélan, 2013). We suppose that the paper-based image schema materials provided learners with only simple graphics and explanations, so that too much information had to be perceived and working memory became insufficient; as a result, the verbs were harder to retain than with the AR system. According to Tables 2 and 6, the extraneous cognitive load was higher for the paper-based material, and the germane cognitive load was lower. There were also larger changes between the post-test and the delayed test for the paper-based material.

Conclusions and future work
In this study, we designed and developed an AR learning system based on image schema for acquiring Japanese compound verbs. The results of our study indicate that both learning methods enhanced the learning performance of the learners. The AR learning system greatly helped learners improve their learning of the combination of compound verbs, while the learners who used the image schema materials better retained the meanings of the verbs. This result might be related to the multimedia contiguity principle, given the spatial separation of the AR system and the verb explanation materials. The results of this study showed no significant difference in cognitive load between the two learning methods. They also supported the propositions of Van Merrienboer et al. (2006) and İbili (2019) that non-intuitive AR system design and decentralized learning information increase the cognitive load. Furthermore, we found that the perceived cognitive load was related to learning performance, and was likely to be affected by learner motivation.
These results need to be explored further, because our experiment involved only a small sample and the effect sizes for the differences in learning performance between the two learning methods were too small. The experiment should be adjusted to suit the Japanese language proficiency of the participants and repeated with a larger sample. It should be noted that this study only compared the AR learning system with paper-based materials, and not with other visual gloss resources such as animated images. Future research needs to take this into account and conduct multi-group experiments. Furthermore, Schnotz and Kürschner (2007) pointed out a limitation of psychometric measures of cognitive load, namely whether learners are really able to distinguish clearly between different types of cognitive load through introspection. Cognitive load should therefore also be measured in other ways, such as behavioral assessment and learning analytics. Another limitation of this study is the separation of the AR system and the explanation materials. Therefore, further research is required to establish the learning performance and cognitive load when explanations and AR animations are integrated on the same screen. A further issue that was not addressed in this study is the relationship between learning performance and the learning process in the AR system. In future studies, we will design a learning path visualization function to clarify these relationships.
Additional file 1. Example questions on the tests.
Abbreviations
AR: Augmented reality; JFL: Learners of Japanese as a foreign language; CL: Cognitive load; ISBI: Image schema based instruction; CLT: Cognitive load theory; 3D: Three dimensional; SD: Standard deviation