Automated labeling of PDF mathematical exercises with word Ngrams VSM classification
Smart Learning Environments volume 10, Article number: 51 (2023)
Abstract
In recent years, smart learning environments have become central to modern education, supporting students and instructors through tools based on prediction and recommendation models. These methods often use learning material metadata, such as the knowledge contained in an exercise, which is usually labeled by domain experts and is costly and difficult to scale. Automated labeling eases the workload on experts, as seen in previous studies using automatic classification algorithms for research papers and Japanese mathematical exercises. However, these studies did not address fine-grained labeling. In addition, as the use of materials in such systems becomes more widespread, paper materials are transformed into PDF formats, which can lead to incomplete text extraction. Nevertheless, previous research has placed little emphasis on labeling the resulting incomplete mathematical sentences. This study aims to achieve precise automated classification even from incomplete text inputs. To tackle these challenges, we propose a mathematical exercise labeling algorithm that can handle detailed labels, even for incomplete sentences, using word n-grams, and compare it to state-of-the-art word embedding methods. The results of the experiment show that monogram features with Random Forest models achieved the best performance, with macro F-measures of 92.50% and 61.28% for the 24-class and 297-class labeling tasks, respectively. The contribution of this research is showing that the proposed method, based on traditional simple n-grams, can find context-independent similarities in incomplete sentences and outperforms state-of-the-art word embedding methods in specific tasks like classifying short and incomplete texts.
Introduction
Labeling learning materials is a key problem in scaling smart learning environments (Contractor et al., 2015). The availability of knowledge metadata for learning materials is critical, as important decisions, such as what to recommend for study next, are usually made based on the metadata and the learners’ previous experience (Vovides et al., 2007). Each exercise in a textbook for each subject usually has a set of course units that clarify the category of each exercise and are very useful in educational situations and in framing educational problems. Recently, there has also been a growing trend in the adoption of nationwide curricula or study guidelines, such as the digital curriculum of the Australian Curriculum, Assessment and Reporting Authority (ACARA) in Australia (Ditchburn, 2012), the Common Core Standards (Porter et al., 2011; Ritter, 2009) in America, the Mathematics Curriculum Standards for Compulsory Education (MOE, 2012) in China, and the Courses of Study (MEXT, 2018) in Japan. These guidelines provide regulations for education and instruction, as well as standard units for each subject (MEXT, 2018). Educators select learning materials based on these guidelines to meet the requirements of the compulsory curriculum. Therefore, learning materials that do not contain knowledge metadata are difficult to incorporate into the course of study, and the automated assignment of labels to learning materials could help overcome this problem.
In this study, the task of labeling learning materials has two main objectives: yielding high accuracy for detailed classification and labeling incomplete texts. First, as with other labeling tasks, the performance of the classification task is very important, as the aim of labeling materials is to reduce the burden on domain experts, who usually tackle the knowledge classification task manually. Detailed labeling of learning materials is very useful in the educational field, but assigning classifications to problems manually is a hard task that requires the cooperation of experts, and the burden could be alleviated through automation. Schubotz et al. (2020) examined the task of automatically assigning coarse labels according to a mathematical subject classification scheme for retrieving research papers and literature on mathematics in English. It was found that the support provided by the proposed automatic classification algorithm reduced the manual classification burden on domain experts. Another study proposed the WEKE model, which combines word embedding and knowledge components, to achieve accurate unit classification of Japanese mathematical exercises (Tian et al., 2022). With the shift to ICT education, researchers label exercises to utilize them for learning pattern analysis (Wang et al., 2022). While more detailed classifications may be necessary depending on the intended use, such detailed labeling was not conducted in those studies.
Second, as extracting complete text is sometimes difficult due to the format of learning materials, another approach for labeling incomplete text is required. With the increased digitization of learning materials and their use in smart learning environments, teachers and publishers are migrating existing non-digital materials to these systems. As these learning materials were usually not created with digitization in mind, publishers often provide publication-quality PDFs directly to teachers or educational institutes. Problems arise when uploading and analyzing such materials in learning environments, as it is difficult to extract all of the information, such as text, formulas, graphs, and images, from publication-quality PDFs, resulting in incomplete information extraction (Abekawa & Aizawa, 2016). While researchers have tried labeling with sentences, images, formulas, or a combination of them (Bhartiya et al., 2016; Shen et al., 2021; Tian et al., 2022; Wang et al., 2022), there has been less focus on classification with incomplete information from mathematical sentences. In this study, we propose a mathematical exercise labeling algorithm that can deal with detailed labels, even for incomplete sentences, by focusing on the exact match of a set of mathematical exercises and predicting a unit using an existing machine learning method or calculating the similarity of any given exercise to a set of weighted word n-grams. Therefore, we aim to answer the following research question:
RQ:
What are the best features and models that can assign detailed and precise labels from incomplete mathematical exercise text?
We propose an algorithm to automatically provide classification results for preprocessed exercise sentences that have been extracted from publication-quality PDFs that include incomplete text. In the experiments of this study, two different levels of labels are assigned to each exercise for validation. We then predict the labels to evaluate the performance of the proposed algorithm and compare it to state-of-the-art word embedding models.
Literature review
Labeling learning materials
National labeling standards for mathematical exercises
Learning materials are often labeled to make it easy to identify what kind of knowledge an exercise contains. Government standards often provide norms for mathematical exercise classification. For example, in Japan, the government provides common standards for subjects and directions for each unit of study that aim to develop the qualities and abilities to think mathematically through mathematical activities in the Guidelines for the Course of Study for Senior High Schools (MEXT, 2009, 2018), and teachers prepare exercises by following these directions. In the US, the Common Core State Standards (CCSS) classification refers to the learning standards for K-12 education that were developed in collaboration with teachers, school administrators, and professionals to provide a clear and consistent framework for preparing children for college and career success (Ritter, 2009). It includes 11 units that students study over the course of nine years, plus appendices that cover counting and radix, operations and algebraic thinking, decimal numbers and operations, fraction operations, measurement and data, ratios and proportion relationships, number systems, expressions and equations, functions, geometry, statistics, and probability, as well as content taught in higher grades (Shintani, 2014). In the Mathematics Curriculum Standards for Compulsory Education in China (MOE, 2012), learning items are distributed into one of up to four main parts and assigned categories from 10 keywords, including: number sense, symbolic awareness, space concept, geometry intuitive, data analysis concept, computation ability, reasoning ability, model idea, application awareness, and innovative awareness (Guo et al., 2018). In the Australian Curriculum, Assessment and Reporting Authority, treated as an Australian digital curriculum (Ditchburn, 2012), units are called “content strands” and consist of number and algebra, measurement and geometry, and statistics and probability.
Each of these strands has 6, 5, and 2 units, respectively, and the structure can be described as hierarchical (ACARA). There is also a specialized system called Zentralblatt MATH (zbMATH), a mathematics-related bibliographic database and literature search engine. The Mathematics Subject Classification (MSC), which zbMATH helps maintain, is used to classify items in the mathematical sciences literature. Every 10 years, two editorial groups solicit input from the mathematical community. The new MSC (MSC2020) includes 63 two-digit classifications, 529 three-digit classifications, and 6006 five-digit classifications (Dunne & Hulek, 2020; Kühnemund, 2016).
As the topic standards mentioned above can serve as important rules when classifying many mathematical materials, some researchers have tackled automatically labeling math exercises based on these standards. One study attempted to classify according to the CCSS (Ritter, 2009), using 385 different labels to classify 12 years of mathematics materials from kindergarten through high school (Shen et al., 2021). Another study proposed the MathBERT model (Shen et al., 2021), created by preparing a large mathematical corpus ranging from the pre-kindergarten to the graduate level and training a base BERT model (Devlin et al., 2019). However, these studies did not tackle the problem of incomplete text classification. In this study, we use information from MEXT to label exercise data at both a coarse and a detailed level while focusing on incomplete exercise text labeling.
Labeling for analysis of how students learn
There is a trend toward analyzing learning behavior in new ways using labels assigned to teaching materials. Regarding the use of features in the analysis of learning effectiveness, one study reported a system that automatically assigned labels to learning materials and showed that the assigned labels can assist in the discovery of students’ learning patterns (Wang et al., 2022). While the analysis using labels is novel in that research, the labeling was conducted for only one class at a university and was not generalized using a common standard.
Giving labels to exercises for knowledge tracing is also a hot research topic. One study, using multiple real data sets consisting of tens of thousands of users and items, showed that regression classification models could accurately and rapidly estimate student knowledge, even when student data is sparsely observed. In addition, the study showed that the model can handle multiple knowledge elements and side information such as the number of trials of items and skill levels (Vie et al., 2019). Without labels assigned to each exercise, the study could not have accurately predicted student performance.
It is also useful to categorize exercises for recommending specific exercises to enhance students’ understanding. One study discusses the application of a topic-based tree structure to personalized adaptive educational systems for its transparency to users (Sosnovsky & Brusilovsky, 2015). Another study focuses on visualizing the relationship between any combination of two topics to inform each student individually of their achievements, aiming to be consistent among the assessments in different courses, to provide meaningful feedback to individuals, and to grasp the students’ long-term progress (Khosravi & Cooper, 2018). There has also been research into extracting labels from learning materials to form knowledge structure representations that learners can use to increase their awareness of the study process (Flanagan et al., 2019). These research examples show that it is easier to obtain or utilize detailed information about the characteristics of materials if they are labeled in advance. In addition, there is one system, called BookRoll, in which any learner can freely post PDF materials without selecting any topics (Flanagan & Ogata, 2018); in this context, an automatic labeling system helps assign topics to the materials.
In this study, we tackle the task of text classification to automate the knowledge labeling process for incomplete text by proposing a more detailed and highly accurate method based on n-grams. The proposed method could improve the use of materials with knowledge labeling and assist in the analysis of how students study using these materials.
Labeling to reduce the burden on domain experts
Automatic labeling and classification of learning materials is a prominent area of classification research in education. Schubotz et al. (2020) proposed an automatic classification method in a mathematical subject classification scheme for organizing mathematical literature, achieving a classification agreement rate of 81%, very close to the accuracy of two large peer-review services. It also enabled an 86% reduction in labor when compared to the manual classification task. The result shows the advantage of automatic labeling, although the research has a different context from the present paper. Tian et al. (2022) proposed a unit classification method that combines natural language processing techniques with a method for extracting keywords from mathematical exercises, resulting in a 25% labor reduction compared to manual classification. While the paper provides a mostly accurate classification of units, it only provides classifications as detailed as the Courses of Study, even though more detailed labeling may be necessary depending on the intended use.
Automated detailed labeling must be accurate in order to reduce the burden on domain experts and assist in assigning labels to exercises. In this study, we developed a more detailed automated classification that has high accuracy even when labeling exercises that contain incomplete text.
Hierarchical and automatic labeling of teaching materials
Hierarchical text classification (HTC) is a method that can classify objects into multi-level detailed classifications. It aims to assign one or more optimal categories to text documents from a hierarchical category space (Graovac, 2017), and literature in this area has applied the method to many different domains (Silla & Freitas, 2011). Another study proposes a method for categorizing and labeling educational materials with various academic learning objectives (Bhartiya et al., 2016). This method selected words in the materials as labels and achieved extensive labeling across various grades and subjects.
When labeling exercises, the granularity required depends on how the labels will be used, so assigning different labels to each exercise broadens the scope of use. In the experiments, we assigned two labels to each exercise, a 1st level unit and a 2nd level unit, and measured the classification accuracy of each label. Previous studies related to labeling materials for use in Japanese schools do not consider hierarchical labels. Tian et al. (2022) use 24 labels for the Japanese high school curriculum, and Wang et al. (2022) use 47 for a course at a university in Japan. Our study uses the most detailed labeling scheme of all previous studies on Japanese mathematical exercise classification, with a total of 297 items at the 2nd unit level.
Text vectorization method for classification tasks
N-gram
We often use text mining, machine learning, and natural language processing to classify many kinds of text data, such as electronic documents, online news, blogs, emails, and digital libraries, to obtain meaningful knowledge, and many classification methods have been proposed (Khan et al., 2010). Previously, Suen (1979) showed that n-gram classification is effective for classifying incomplete sentences from OCR. Text classification must work reliably for all input, and therefore must tolerate various types of text error, such as misspellings and grammatical errors in email and character recognition errors in OCR-processed documents; Cavnar and Trenkle (1994) argued that n-grams are an effective way to meet this requirement. Graovac (2014) proposed an n-gram method for topic-based text classification using the characters in a text so that the method is independent of language and topic.
The task of classification using n-grams has been investigated in various studies, including a study on the results of using an n-gram-based algorithm for Bangla text classification (Mansur, 2006) and a study that attempted to statistically estimate the expressive quality of an article by using word n-grams and part-of-speech n-grams in the article (Kobayashi et al., 2012). Despite the loss of semantic information, bag-of-n-grams-based methods have been shown to perform well in sentiment analysis (Li et al., 2016). Many studies have also found n-grams to be an effective tool for classification tasks in a variety of fields, such as music analysis (Zheng et al., 2017).
However, there are still few studies that use n-grams to classify Japanese mathematical exercise materials. Our study uses n-grams and applies them as a novel method of Japanese mathematical text classification.
Word embedding
Recently, word embedding methods have become a popular approach to text vectorization, and one of the most representative and popular word embedding methods is Word2Vec (Mikolov et al., 2013). This method trains a model on context-independent distributed representations for words. Models that consider sentence context using RNNs or LSTMs improve the understanding of sentences, such as ELMo (Peters et al., 2018), which uses LSTMs for contextualized word embeddings. Moreover, OpenAI’s GPT model (Radford et al., 2019) offers enhanced flexibility for fine-tuned tasks, allowing the model to consider words at a distance and to compute in parallel rather than as a Markov process. BERT (Devlin et al., 2019) is also a popular natural language model, created by Google, which uses an attention mechanism instead of an RNN and applies a masked language model for learning.
Prior studies have demonstrated the efficacy of word embedding for label classification tasks. For instance, Dharma et al. (2022) utilized the Fasttext method to classify a dataset of 19,977 news articles and 20 news topics with 97.2% accuracy, outperforming other word embedding techniques. However, in the case of short sentence exercises, sentence vectorization methods using word embedding have been found to be less effective. Tian et al. (2022) applied word embedding to the classification of short Japanese exercise texts, achieving an accuracy of 72.87%. The combination of this method with keyword extraction, called the WEKE model, further enhanced the accuracy to 79.57%. These findings suggest that word embedding may not be as effective for short exercise texts. It is worth noting that for this experiment, incomplete sentences were employed as inputs.
The objective of this study is to introduce an automated classification algorithm capable of effectively categorizing short Japanese sentences found in mathematical exercises. To accomplish this, we concentrate on achieving the best agreement between sets of mathematical exercises through the calculation of similarity using weighted word n-gram representations. The algorithm is then assessed by comparing it to similar experiments conducted using prediction models, and its accuracy is calculated.
Morphological analysis and relation to reading comprehension
In studies of morphological analysis in mathematics, it is popular to investigate the relationship between learners’ reading comprehension and their mathematical skills. It has been suggested that general vocabulary may serve as a proxy for mathematics-specific vocabulary in studies that do not include measures of mathematics-specific vocabulary (Chow & Ekholm, 2019). Much of the research investigating the relationship between language proficiency and math outcomes focuses specifically on vocabulary, for reasons such as memorizing large numbers as words (Spelke & Tsivkin, 2001) and the need to understand oral instruction (Chow & Ekholm, 2019).
While the present study does not specifically address learners’ reading comprehension skills, we use morphology to analyze Japanese sentences and to create vector representations.
Classification of incomplete exercise texts
According to previous research, the exercise text dataset for classification tasks, called “TREC” in the paper, contains the fewest sentences and even the smallest vocabulary of all 7 dataset types, which include a movie review dataset, a sentiment classification dataset, and a subjectivity dataset (Liu & Guo, 2019). This fact indicates that an exercise text consists of relatively few characters. Previous studies have also shown that it is difficult to achieve adequate performance on the classification of short texts by word embedding, as discussed in Sect. 2.2.2, and therefore another approach is required for this task.
Unlike subjects presented in natural language, such as languages, history, and social science, mathematical learning materials involve the presentation of notations, formulas, and figures. With the common PDF format, the processing of non-language information in mathematical learning materials is costly and complex. Although prior studies have shown that formulas are detectable if the layout and format are defined (Date & Isozaki, 2015; Fateman et al., 1996), they are difficult to detect when they are not. Such issues arise during the uploading and analysis of these materials in educational settings due to the challenge of fully extracting content like text, formulas, graphs, and images from published PDFs, leading to incomplete information retrieval (Abekawa & Aizawa, 2016). Hence, other methods should be investigated for labeling from incomplete text.
In this study, we aim to automatically label the mathematical learning materials by analyzing textual information which is readily extractable from PDF files.
Mathematical education in Japan
Japanese students’ performance in mathematics is among the highest in the world, which is said to be due to the influence of students’ confidence in mathematics, student Socio-Economic Status (SES), and school emphasis on academic success (Wang et al., 2023).
Japan’s Courses of Study are curriculum standards established by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) to ensure that standards are maintained in all schools throughout Japan. They are revised approximately every 10 years. In recent years, the decline in Japan’s performance in the PISA 2003 international achievement test triggered a shift in educational policy toward improving academic achievement (Onishi, 2011). MEXT’s revision of the standard in 2009 strengthened English foreign language learning and introduced task-based learning (MEXT, 2009). The latest revision, issued in 2018, set three items as learning objectives: “knowledge and skills,” “ability to think, judge, and express,” and “ability to learn and humanity” (MEXT, 2018). Students’ textbooks, exercises, and in-class learning are based on the Courses of Study. In mathematics, the curriculum guidelines divide mathematical knowledge and skills into categories, each of which has its own meaning. Table 1 shows the organization of mathematics units and their objectives as defined by the Courses of Study revised in 2009, and the exercises in the materials in this study are prepared based on it. Because these standards are used throughout Japan, the categorization of exercises can affect mathematical education across the country.
Technology is helping researchers better understand how students learn mathematics in order to improve studies on mathematical education (Fishback & Schlicker, 1996; Hussein, 2023). In the context of mathematics in Japan, units related to statistics have been introduced at every grade level, as indicated by the enhancement of statistical education and learning activities using computers and other tools. A recent study proposed the use of programming environments to support the learning of statistics according to the learner’s grade (Kayama et al., 2022). To support learners’ use of such environments, it is important for learners to be able to figure out which exercises of similar statistical units belong to which grade level without requiring teacher intervention. Another study proposed a method to explain the unit structure of textbooks in order to relate knowledge in learning (Taniguchi & Itoh, 2023). However, without knowledge labeling of textbooks and exercises, it is difficult to make use of such unit structures in educational settings.
In this study, we focus on the labeling of mathematics units and verify the assignment of units to textbooks and exercises, which have been the subject of much research. In addition, we focus on the Japanese context of mathematical education and use the most common standards in Japan.
Method
Our goal is to find an algorithm that can assign appropriate labels to educational materials using characters extracted from math teaching material PDFs. In particular, we use the characters extracted from the math teaching material PDFs as input, vectorize them using natural language processing, train the vectors as features, and output the labels \({l}_{pred}\).
We defined the method of predicting labels with the following two functions:

\(v = {f}_{vec}\left(t\right)\)

\({l}_{pred} = {f}_{pred}\left(v\right)\)
where \(t\) is a set of characters from a mathematical PDF material and \(v\) is a vector obtained from \(t\) by vectorization. In the following sections, we define the functions \({f}_{ve{c}_{1}}, {f}_{ve{c}_{2}}\) and \({f}_{pre{d}_{1}}, {f}_{pre{d}_{2}}\) respectively as the methods of vectorization from characters and the methods of prediction from the vector. In other words, we define \({f}_{ve{c}_{1}}\) or \({f}_{ve{c}_{2}}\) as the feature-selecting method and \({f}_{pre{d}_{1}}\) or \({f}_{pre{d}_{2}}\) as the model-selecting method. Note that there is a set of labels \(L\) from which \({l}_{pred}\) is selected. Figure 1 shows the experimental overview from inputting an exercise PDF to outputting a prediction.
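The two-stage structure of the method, vectorize then predict, can be sketched as follows. The bag-of-words feature function and overlap-based predictor below are illustrative stand-ins chosen for brevity, not the paper's actual components (which are defined in the following sections).

```python
# Sketch of the two-stage labeling pipeline: v = f_vec(t), l_pred = f_pred(v).
# f_vec and f_pred here are simple hypothetical stand-ins for illustration.

def f_vec(tokens, vocabulary):
    """Map a token list t to a binary vector v over a fixed vocabulary."""
    return [1 if word in tokens else 0 for word in vocabulary]

def f_pred(v, labeled_vectors):
    """Pick the label whose reference vector overlaps most with v."""
    def overlap(u):
        return sum(a * b for a, b in zip(v, u))
    return max(labeled_vectors, key=lambda label: overlap(labeled_vectors[label]))

# Hypothetical mini-vocabulary and labeled reference exercises:
vocabulary = ["triangle", "area", "probability", "dice"]
labeled = {
    "geometry":    f_vec(["triangle", "area"], vocabulary),
    "probability": f_vec(["probability", "dice"], vocabulary),
}
l_pred = f_pred(f_vec(["area", "of", "triangle"], vocabulary), labeled)
print(l_pred)  # geometry
```

In the experiments, the choice of \({f}_{vec}\) corresponds to feature selection and the choice of \({f}_{pred}\) to model selection.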
Data preparation
The input data in this experiment are Japanese math exercises contained in PDF files. To use the character information of the exercises, we first extract text and create a text set from the exercise PDF files. We defined a dataset \(Q = \{{q}_{1} , {q}_{2}, ..., {q}_{i}, ..., {q}_{n}\}\) with each \({q}_{i}\in Q\) as an exercise’s text data. Each \({q}_{i}\) has its label \({l}_{i} = \{{l}_{i1}, {l}_{i2}\}\) in advance, where \({l}_{i1}, {l}_{i2}\) represent the 1st level label and the 2nd level label, respectively. The relation between a unit label and a subunit label can be formulated as follows:
We divide the obtained characters into meaningful chunks before converting them into vectors. This preprocessing provides us with word sets \({T}_{i} = \{{t}_{{i}_{1}}, {t}_{{i}_{2}}, ..., {t}_{{i}_{j}}\}\) for each \({q}_{i}\). \(n({T}_{i})\) equals \(j\), where \(n(X)\) represents the number of elements in the set \(X\).
The exercise texts used are electronic PDF versions of each of the following exercise books:

“Supplementary and Revised Edition Charting Mathematics from the Basics I + A”

“Supplementary and Revised Edition Charting Mathematics from the Basics II + B”

“Supplementary and Revised Edition Charting Mathematics from the Basics III”

“Succeeding Mathematics I + A for Textbook Sidelines”

“Succeeding Mathematics II + B for Textbook Sidelines”

“Succeeding Mathematics III for Textbook Sidelines”
These exercise books are designed for high school students and align with the textbooks approved by the Japanese government (MEXT, 2021). They are produced by the same company responsible for the widely used textbooks in Japan.
We prepared text files by reading the text data using the Python library Pdf2text (Palmer, 2021). Note that complete text is more difficult to obtain from PDF files than from HTML-formatted files (Ramakrishnan et al., 2012; Smith, 2007).
Japanese high school mathematics teachers created one 1st level unit label and one 2nd level unit label for each exercise by referring to sections in their textbooks and mapping them to each other. There was a total of 2775 exercises, consisting of 24 1st level units and 297 2nd level units. The same 2nd level unit is never assigned across multiple 1st level units. Each 1st level unit consists of between 25 and 200 exercises, with a minimum of five exercises assigned per 2nd level unit. Table 2 shows the content of each 1st level unit, the organization part the unit belongs to, the number of 2nd level units it contains (\(n\left({{L}_{l}}_{2}\right)\)), the number of exercises it contains (\(n\left({Q}_{l}\right)\)), and the mean and standard deviation of the morphemes contained in each exercise (\(\overline{n({T}_{l})}, {s}_{{T}_{l}}\)). All of these 1st level units are common math standards in Japan. They are categorized into five meaningful organization parts. The column “part” of Table 2 represents the one of the five organization parts (refer to Table 1) that is assigned to the unit. Figure 2 shows an example of the hierarchical structure of 1st level units and 2nd level units. Figure 3 shows an example of an exercise and the 1st level unit and 2nd level unit that have been assigned to it.
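Because the same 2nd level unit is never assigned across multiple 1st level units, the 1st level label can be recovered from the 2nd level label alone. A minimal sketch of this hierarchical lookup follows; the unit names are hypothetical examples, not taken from Table 2.

```python
def build_unit_map(labeled_exercises):
    """Map each 2nd level unit to its unique 1st level unit.

    Relies on the dataset property that the same 2nd level unit is
    never assigned across multiple 1st level units.
    """
    unit_map = {}
    for l1, l2 in labeled_exercises:
        if l2 in unit_map and unit_map[l2] != l1:
            raise ValueError(f"2nd level unit {l2!r} appears under two 1st level units")
        unit_map[l2] = l1
    return unit_map

# Hypothetical (1st level, 2nd level) label pairs for illustration only:
exercises = [
    ("Numbers and Expressions", "Expanding polynomials"),
    ("Numbers and Expressions", "Factorization"),
    ("Probability", "Conditional probability"),
]
unit_map = build_unit_map(exercises)
print(unit_map["Factorization"])  # Numbers and Expressions
```

This property means that a correct 297-class (2nd level) prediction also determines the 24-class (1st level) label.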
We used pdf2txt to extract the characters from the mathematical exercise PDFs. Figure 4 shows an example of what was extracted from a mathematical exercise PDF. In the figure, (a) represents the raw PDF data of the exercise, (b) represents the Japanese text extracted from (a), and (c) is an English translation of (b). As shown in (a) of the figure, the information about the diagram in the PDF cannot be extracted, and the letters highlighted in blue in the PDF also do not appear in the extracted text. These words consist of mathematical formulas such as “GH = 2OG”, figures such as the “3” of “3点” (3 points), and symbols such as “ABC”. It is difficult to obtain meaningful sentences from the extracted texts because few of the text and symbols related to math equations could be extracted. We can see from (b) or (c) that we could not get the full sentence from the PDF text, and what we did get was somewhat meaningless and difficult to comprehend.
As Japanese text does not contain word boundaries, preprocessing to extract morphemes is required, and we used a package called Nagisa (Ikeda, 2021) for the morphological analysis of the text data. Nagisa is a package for the morphological analysis of Japanese sentences. One feature of Nagisa is that it can assign a part of speech to each segmented morpheme and can exclude words with a specific part of speech. Some parts of the text cannot be precisely divided into morphemes, and therefore some of the resulting divisions are incorrect.
Vectorization methods
VSM created from n-grams
We assumed that the words, or sequences of words, in sentences that share the same label will be similar to one another, so we developed the n-gram word extraction method and compared its performance to methods using state-of-the-art word embeddings. As we compare both word embeddings and n-grams in the same framework, we have to convert the n-grams into a vector that represents the n-gram features.
We first define a vector \({V}_{{G}_{i, k}}\) that is created from the specific exercise tokens \({T}_{i}\), the set of all exercise text tokens \(T\), and the number of consecutive tokens \(k\) of the \(k\)-gram, as follows:
Figure 5 shows an overview of the method for converting the n-grams of a sentence into a vector.
The method of creating n-grams is as follows:
We created word \(k\)-grams \({g}_{i,k,l} \left(1\le l\le n\left({T}_{i}\right)-k+1\right)\) from the tokenized exercise sentences \(t\in {T}_{i}\). This means that the \(k\) consecutive tokens from \({t}_{{i}_{l}}\) to \({t}_{{i}_{l+k-1}}\) were taken and stored in a single tuple:
Then we constructed \({G}_{i,k}\) by aggregating \({g}_{i,k,l}\) over all \(l\).
For vectorization using word n-grams, we prepared a list \({G}_{k}\) that includes every \({g}_{i,k,l}\) in all \({G}_{i,k}\). Then we made a list, called the \(k\)-gram list, that indicates whether each of its component n-grams is included in the query's n-grams. We defined the \(m\)-th element of the \(k\)-gram list:
For each \(i\), the exercise \({q}_{i}\) has one vector whose length equals \(n\left({G}_{k}\right)\). The \(m\)-th value of the \(k\)-gram vector, \({V}_{{G}_{i,k},m}\in {V}_{{G}_{i, k}}\), is determined by the following formula:
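The construction of \({G}_{k}\) and the binary vectors \({V}_{{G}_{i,k}}\) can be sketched in a few lines of Python (a minimal illustration; the function names and the use of plain lists are our own choices, not the paper's implementation):

```python
from itertools import chain

def kgrams(tokens, k):
    """Extract the word k-grams g_{i,k,l}: k consecutive tokens per tuple."""
    return [tuple(tokens[l:l + k]) for l in range(len(tokens) - k + 1)]

def build_kgram_list(all_token_lists, k):
    """G_k: ordered list of all distinct k-grams over every exercise."""
    seen = []
    for g in chain.from_iterable(kgrams(t, k) for t in all_token_lists):
        if g not in seen:
            seen.append(g)
    return seen

def vectorize(tokens, kgram_list, k):
    """Binary VSM vector: the m-th element is 1 iff the m-th k-gram occurs."""
    grams = set(kgrams(tokens, k))
    return [1 if g in grams else 0 for g in kgram_list]
```

For example, two tokenized exercises `[["点", "A", "B"], ["点", "A", "を"]]` with \(k=2\) yield a three-element bigram list and one binary vector of length three per exercise.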
When Nagisa morphologically analyzes numbers, it recognizes each digit as a one-digit noun. In mathematical texts, different numerals would thus be treated as different morphemes, so we created an algorithm that merges consecutive digits into a single number, as shown in Fig. 6, and treated all numbers as the same token. This process makes it easier to match exercises that are identical except for their numbers or formulas.
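A minimal sketch of this digit-merging step (the `<NUM>` placeholder token is our assumption; the paper only states that all numbers are treated as the same thing):

```python
def merge_digits(morphemes):
    """Collapse runs of one-digit noun morphemes into a single token and
    replace every number with one shared placeholder, so that exercises
    differing only in their numbers yield identical n-grams."""
    merged, i = [], 0
    while i < len(morphemes):
        if morphemes[i].isdigit():
            # consume the whole run of consecutive digit morphemes
            while i < len(morphemes) and morphemes[i].isdigit():
                i += 1
            merged.append("<NUM>")  # assumed placeholder for any number
        else:
            merged.append(morphemes[i])
            i += 1
    return merged
```

Applied before n-gram extraction, this makes "3点" and "5点" (3 points, 5 points) produce the same token sequence.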
We collected the n-gram data of the exercise texts. For n-grams, it is necessary to determine the value of \(n\) that gives good classification accuracy. Some studies explore appropriate values of \(n\) for each task; for example, large n-grams have been shown to be advantageous for generating interpretable features in malware analysis (Raff et al., 2018). However, in almost all previous studies the \(n\) values are very small, and \(n > 6\) is extremely rare, since larger values of \(n\) are not tested due to the computational burden and the risk of overfitting. In this study, we therefore conducted n-gram extraction for \(1 \le n \le 6\). Table 3 shows the number of n-grams for \(1 \le n \le 6\). Figure 7 shows the overall flow of creating n-grams with Nagisa. In the figure, (a) represents the extracted full text data. Item (b) represents a list of morphemes with the part of speech of each morpheme: n, p, v, and s stand for noun, particle, verb, and suffix, respectively. Item (c) represents the list of morphemes after the numbers have been processed by the method illustrated in Fig. 6. Item (d) represents the bigrams finally obtained from (a).
Word embedding vectorization
We defined a vector \({V}_{{E}_{i}}\) that is created from the specific exercise tokens \({T}_{i}\) \(\left(1\le i\le n\left(Q\right)\right)\) and a word embedding model, i.e.
For vectorization with word embedding, we used a model called fastText (Joulin et al., 2017). Pre-trained models for 157 languages are available at https://fasttext.cc/docs/en/crawl-vectors.html. In this experiment, we used the Japanese model, which combines three methods to represent input sentence data in 300 dimensions: character 5-grams, weighting by position, and Word2Vec (Church, 2017).
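As a rough illustration of how a fixed-length sentence vector can be derived from per-word embeddings, the snippet below simply averages word vectors over a sentence. The toy 3-dimensional embedding table and the plain averaging are our simplifications for illustration; the actual pipeline uses the 300-dimensional pre-trained Japanese fastText model and its own combination of methods:

```python
def sentence_vector(tokens, embeddings, dim=3):
    """Average the embedding vectors of the known tokens in a sentence."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim  # no known token: fall back to the zero vector
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# toy embedding table with made-up values, purely for illustration
toy = {"関数": [1.0, 0.0, 0.0], "確率": [0.0, 1.0, 0.0], "の": [0.0, 0.0, 1.0]}
```

Every exercise, however long, is thus mapped to one vector of fixed dimensionality, which is what allows embedding features and n-gram features to be compared under the same classifiers.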
Label prediction by vectorized sentence
Prediction by calculating cosine similarity
For any exercise texts \(T\), we use the score \(s\left({T}_{a}, {T}_{b}\right)\) to measure the similarity between texts \({T}_{a}\) and \({T}_{b}\): the higher \(s({T}_{a}, {T}_{b})\) is, the more similar \({T}_{a}\) and \({T}_{b}\) are. The problem of predicting labels by finding similar exercises can be formulated as follows: given a query exercise text vector \({v}_{query}\), labeled exercise text vectors \({v}_{labeled}\) with labels \({l}_{labeled}\), and a weighting function \({f}_{w}\), our goal is to integrate these heterogeneous materials to measure the similarity scores of exercise pairs and predict the 1st level unit label or the 2nd level unit label for any \({v}_{query}\) by selecting the candidate label \({l}_{pred}\), i.e.
where \({V}_{labeled}\) and \({L}_{labeled}\) are the set of vectors of the labeled data and their labels, respectively, \({f}_{w}\) is the weighting function, and \(L\) is the domain of labels in the data. The selected label \({l}_{pred}\) is the predicted label of the query exercise.
In this algorithm, as shown in Fig. 1, the data set is divided into labeled data and queries, and the similarity between the set of word n-grams in the labeled data and the set of word n-grams in the query is calculated. Here, the similarity of the vectors is the value \(s\left({v}_{{l}_{X}}, {v}_{query}\right)\) obtained using cosine similarity, where \({v}_{{l}_{X}}\) and \({v}_{query}\) represent the vector of word n-grams of the labeled data with the label \({l}_{X} \left(1\le X\le n\left(L\right)\right)\) and the vector of word n-grams of the query, respectively.
We then compute \({s}_{{l}_{X}, query}\) by aggregating \(s\left({v}_{{l}_{X}}, {v}_{query}\right)\) over all vectors \({v}_{{l}_{X}}\) with label \({l}_{X}\), substituting them into the chosen weighting function \({f}_{w}\). Previous studies have improved accuracy by weighting for realistic non-homogeneous data sets. One study achieved high accuracy using cosine similarity with added weighting to effectively train CNNs in realistic learning situations involving class imbalance, small data size, and label noise (Kobayashi, 2021). Weighting explanatory variables with generated n-grams is also reported to be an effective means of improving text classification accuracy (Graovac et al., 2015). The calculation formula of \({s}_{{l}_{X}, query}\) is as follows:
where \({V}_{labeled_{X}}\) represents the labeled vectors assigned label \(X\). Finally, we compute \({s}_{{l}_{X}, query}\) for all \({l}_{X}\) and determine \({l}_{pred, query}\) as follows:
This formula means that the predicted label is the label of the exercises with the highest similarity. Various choices of the function \({f}_{w}\) can be used to determine a more suitable weighting for classification. In this experiment, we defined the functions as follows:
where \({H}_{X, k}\) represents the \(k\)-th highest value in \({s}_{set_{{l}_{X}}, query}\). The prediction vector for the query, \({v}_{query}\), is defined as the array of values \([{s}_{{l}_{1}, query}, {s}_{{l}_{2}, query}, \dots , {s}_{{l}_{n\left(L\right)}, query}]\) obtained by the function \({f}_{w}\).
We created these functions based on the idea that sentences with the same label are similar: the more similar two sentences are, the more likely they are to have the same label. There are two assumptions, as follows:

Assumption 1: Any pair of exercises that have the same label are similar to each other. Therefore, we created \({f}_{mean}\) to find the most appropriate label by considering the similarities of all labeled exercises.

Assumption 2: Specific exercises with the same label have high similarity with each other. Therefore, we created \({f}_{to{p}_{m}}\) and \({f}_{ran{k}_{m}}\) to find the most appropriate label by considering the similarities of the \(m\) most similar labeled exercises.
Figure 8 shows an overview of how the weight of a specific label is assigned to a query.
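The similarity-based prediction described above can be sketched as follows (a simplified illustration with plain Python lists; only \(f_{mean}\) and \(f_{to{p}_{m}}\) are shown, and the rank-based \(f_{ran{k}_{m}}\) is omitted):

```python
from math import sqrt

def cosine(u, v):
    """s(v_a, v_b): cosine similarity of two exercise vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def f_mean(sims):
    """Assumption 1: score a label by the mean similarity of all its exercises."""
    return sum(sims) / len(sims)

def f_top(sims, m=3):
    """Assumption 2: score a label by its m most similar exercises only."""
    top = sorted(sims, reverse=True)[:m]
    return sum(top) / len(top)

def predict(query_vec, labeled, f_w=f_mean):
    """Pick the label l_X whose aggregated score s_{l_X, query} is highest.
    `labeled` maps each label to the list of vectors carrying that label."""
    scores = {lab: f_w([cosine(v, query_vec) for v in vecs])
              for lab, vecs in labeled.items()}
    return max(scores, key=scores.get)
```

Swapping `f_w` is all that distinguishes the \(f_{mean}\), \(f_{to{p}_{m}}\), and \(f_{ran{k}_{m}}\) variants compared in the experiments.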
Prediction by machine learning
The problem of finding similar exercises can be formulated as follows: given a set of test exercise text vectors \({v}_{test}\) and a set of training text vectors \({v}_{train}\) with their true label set, our goal is to integrate these heterogeneous materials to predict the 1st level unit or the 2nd level unit for any query exercise text vector \({v}_{query}\) by selecting the candidate label \({l}_{pred}\), i.e.
where model is a classifier that assigns these vectors to the specific number of categories and \(L\) is the domain of labels in the data. The selected label for the test data \({v}_{test}\) is the predicted label of the exercises, denoted \({l}_{pred}\).

XGBoost (Chen et al., 2015; Chen & Guestrin, 2016): This model, which combines boosting with decision trees, has demonstrated promising results on diverse natural language processing tasks, making it an appropriate choice in this paper's context.

Random Forest (Breiman, 2001): This model employs numerous decision trees trained on randomly selected training data. It performs effectively even with a considerable number of explanatory variables, enabling it to handle a 300-dimensional vector.

Logistic Regression (Cox, 1958), Perceptron (Rosenblatt, 1958): Both models perform statistical regression with variables that follow a Bernoulli distribution. However, the former employs coordinate descent or quasi-Newton methods to determine its parameters in the optimization problem, whereas the latter uses stochastic gradient descent.
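A minimal sketch of the machine learning route using scikit-learn's Random Forest on binary monogram-style vectors (the tiny synthetic data and the label names here are placeholders for the paper's VSM vectors and unit labels):

```python
from sklearn.ensemble import RandomForestClassifier

# synthetic binary monogram-style vectors (rows = exercises) and unit labels;
# in the paper these come from the VSM built over Japanese exercise texts
X_train = [[1, 0, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1], [0, 1, 0, 1]]
y_train = ["vector_unit", "vector_unit", "probability", "probability"]

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# l_pred for an exercise vector
l_pred = clf.predict([[1, 0, 1, 0]])[0]
```

The fitted model's `feature_importances_` attribute is the kind of information that supports the monogram feature analysis reported in the results.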
Evaluation
We conducted experiments using five-fold cross-validation for training and prediction; five-fold cross-validation reduces overfitting to the training and label data. Accuracy \({A}_{L}\), macro F-measure \({F}_{L}\), and weighted F-measure \({F}_{wL}\) were used to evaluate the experimental algorithm. Let \(T{P}_{l}, F{P}_{l}, T{N}_{l}\), and \(F{N}_{l}\) denote the numbers of true positives, false positives, true negatives, and false negatives for a label \(l\); then the accuracy \({A}_{l}\), precision \({P}_{l}\), recall \({R}_{l}\), and F-measure \({F}_{l}\) can be expressed as follows.
We used \({A}_{L}\), \({F}_{L}\), and \({F}_{wL}\) to evaluate the performance of the prediction.
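For concreteness, the macro and weighted F-measures can be computed as in the sketch below (a straightforward re-implementation from the standard definitions, not the authors' code):

```python
from collections import Counter

def per_label_f(y_true, y_pred, label):
    """F-measure F_l for one label, from its TP/FP/FN counts."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

def macro_f(y_true, y_pred):
    """F_L: unweighted mean of per-class F, treating every unit equally."""
    labels = sorted(set(y_true))
    return sum(per_label_f(y_true, y_pred, l) for l in labels) / len(labels)

def weighted_f(y_true, y_pred):
    """F_wL: per-class F weighted by each class's share of the data."""
    counts, n = Counter(y_true), len(y_true)
    return sum(per_label_f(y_true, y_pred, l) * c / n for l, c in counts.items())
```

On a skewed label distribution, `macro_f` treats every unit equally while `weighted_f` follows the data distribution, which is the distinction drawn later when the two indices are interpreted.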
Result
Classification results with selecting features and methods
We consider n-grams with \(1 \le n \le 6\) and the word embedding vectors. We also prepare cosine similarity models with the weighting functions of formulas (15), (16), (17), and (18) \((2 \le m \le 10)\), and the machine learning methods XGBoost, Random Forest, Perceptron, and Logistic Regression. Tables 4, 5, 6, 7, 8 and 9 show the three kinds of prediction results, accuracy \({A}_{L}\), macro F-measure \({F}_{L}\), and weighted F-measure \({F}_{wL}\), for each combination of feature and model. In the tables, the best performance for each feature is bolded, and the best overall performance is underlined. We also plot the recall of every feature and model selection in Figs. 9 and 10. The tables show that, for both 1st level unit and 2nd level unit prediction, the algorithm yielded the best \({A}_{L}\) and \({F}_{L}\) overall when using monogram features with the Random Forest model, and the best \({F}_{wL}\) when bigram features were used with the Random Forest model, compared to using word embedding, n-grams with other values of \(n\), or the other models based on cosine similarity or machine learning.
Unlike word embedding, n-grams can be analyzed literally, without considering context. This makes them a suitable feature for this experiment, in which we use text data poorly extracted from PDF files. In addition, since the experiment using cosine similarity considers textual similarity, a text is likely to be classified into the unit that contains many texts highly similar to it. The higher the textual similarity of the texts, the higher the similarity at a larger \(n\) is likely to be; however, if \(n\) is too large, there are fewer matching n-gram words and hence less textual similarity. Considering these conditions, monograms turn out to be the most suitable n-gram size for this experiment.
Figures 11 and 12 compare \({A}_{L}\), \({F}_{L}\), and \({F}_{wL}\) between selected weighted similarity models and MLP models. The models in the figures were selected as follows: \({f}_{to{p}_{3}}\) is the best of all \({f}_{to{p}_{m}}\) models, \({f}_{ran{k}_{3}}\) is the best of all \({f}_{ran{k}_{m}}\) models for 2nd level unit prediction, and \({f}_{ran{k}_{9}}\) is the best of all \({f}_{ran{k}_{m}}\) models for 1st level unit prediction.
As shown in Figs. 11 and 12, in both experiments the results of the weighted similarity models resemble those of the MLP models in shape, while the results of MLP and of the other machine learning methods are less similar: the former has no peak at \(n = 1\), whereas the latter does. This suggests that the weighted similarity models behave like MLP, aggregating values when calculating the prediction. It also shows that \(n=2\) was the optimal n-gram size for prediction by searching for similar sentences with cosine similarity. The smaller the value of \(n\), the greater the number of matching components, while the larger the value of \(n\), the higher the similarity of sentences with strong agreement, suggesting that \(n=2\) is a moderate value that covers both aspects.
Random forest monogram feature analysis
To examine the predictions in detail, we performed feature analysis on the Random Forest model trained using monograms, as it had the highest accuracy of all the models evaluated. Table 10 (a) contains the most influential monograms and their degrees of influence. The words ‘I’, ‘III’, ‘II’, ‘A’, and ‘B’ appear to be highly influential. This is because, as shown in the figure, the classifications of the units fall into one of these five patterns; when these classifications are listed in the PDFs, these words can be used to classify the unit more easily.
However, not all PDFs contain a classification indicating these five categories, and the word “解説” (solution) does not describe the math exercises or the solutions themselves. Therefore, by omitting these as stop words (shaded in gray in Table 10 (a)), the prediction can be repeated to obtain a more general classification result. This prediction yielded an \({A}_{L}\) of 82.88%, an \({F}_{L}\) of 82.82%, and an \({F}_{wL}\) of 83.08%. Table 10 (b) shows the most influential words and their degrees of influence in this prediction. The top five words were “ベクトル” (vector), “数” (number), “関数” (function), “確率” (probability), and “複素” (complex). All of these words form part of the names of more than one specific unit, so they were likely helpful in classifying the text into broad categories. Note that asserting the organization part at a specific place in a PDF would be helpful in classifying exercises, but it is less generalizable, as it relies on a consistent format that might not be realistic.
Discussion
Feature selection of extracted incomplete text from PDFs
Labeling incomplete text has been tackled in previous research using n-grams, which were shown to be an effective way to address this problem (Cavnar & Trenkle, 1994; Graovac, 2014; Suen, 1979). In the present research, we investigated using n-grams on text extracted from PDFs of mathematical exercises, for which complete texts were difficult to obtain, and categorized the exercises into units at different levels. The extracted text could not pick up information such as mathematical equations, symbols, or numbers. When predicting the topic of these incomplete texts, we found that vector classification based on features such as n-grams, which uses only information on whether the text is composed of similar elements and involves no contextual analysis, was more effective than models that involve contextual analysis. In particular, monograms, which resemble more traditional methods such as bag of words, provided the best classification performance, contradicting results from previous research for this specific task. We therefore assume that the effectiveness of n-grams in the classification of incomplete texts may depend on the target of the task, which in this case was Japanese mathematical exercises. As the previous research that successfully used n-grams to classify incomplete text (Cavnar & Trenkle, 1994; Graovac, 2014; Suen, 1979) targeted neither Japanese nor mathematical exercises, this may have implications for future research into the classification of incomplete Japanese or mathematical texts.
Model selection for more precise prediction
We aimed to label Japanese mathematical text more precisely. A previous study on Japanese mathematical exercise text classification achieved 79.57% accuracy with a WEKE model (Tian et al., 2022). In this experiment, the proposed algorithms predicted units at different levels by two methods: searching for similar sentences using cosine similarity, and classification using machine learning. The best prediction accuracy, 92.50%, was achieved using Random Forest. This shows that our method using monograms and Random Forest performs well for Japanese mathematical text classification.
The results indicate that monograms yielded the best classification when used with Random Forest. One reason monograms perform well is that, as Fig. 3 shows, parts of the text are incomplete when extracted from PDF files, so some chunks consist of meaningless sequences of multiple words. In addition, a previous study documented good results when using the bag-of-words method with Random Forest (Montoliu et al., 2015), which may also explain why this model yielded the best performance.
Since Random Forest uses decision trees, it readily builds accurate decision rules for binary vectors. We therefore believe that Random Forest was able to accurately classify binary vectors with numerous dimensions. In addition, the fact that characters such as ‘I’, ‘II’, ‘III’, ‘A’, and ‘B’ exist as typical classification indices in Japanese mathematics and had a significant influence on classification also suggests why Random Forest with monograms produced the best prediction accuracy. This indicates that the organization characters can be useful when classifying the exercises: if the PDF sentence contains such characters, as shown in Fig. 13, it can be classified automatically more easily.
Selection of evaluation method in the educational context
Three indices, \({A}_{L}\), \({F}_{L}\), and \({F}_{wL}\), were used as evaluation indices in the experiment. \({A}_{L}\) alone is not sufficiently reliable, since the present data set contains far more true-negative than true-positive cases and therefore tends to rate highly even a model that predicts false for all data (Manning et al., 2008). In such cases the F-measure is often used; it is straightforward for two-class classification, but for multi-class classification there are two ways to obtain it: \({F}_{L}\) and \({F}_{wL}\).
\({F}_{L}\) returns the unweighted average of the \({F}_{l}\) obtained for each class \(l\), even if the number of data points in each class is not uniform. Therefore, it treats all units equally regardless of how many data points each unit has; in other words, it is an effective indicator for labeling biased data sets. For example, \({F}_{L}\) is useful when a teacher in a school setting selects three 1st level units to create a test (ignoring the rest) and automatically assigns 2nd level units to the exercises within those units. \({F}_{wL}\) is the F-measure calculated from the precision \({P}_{l}\) and recall \({R}_{l}\) of each class \(l\), weighted by the number of data points in each class. Therefore, it accurately reflects the distribution of this data set. The index \({F}_{wL}\) is useful for labeling a uniform data set, i.e., the mathematical material studied across all three years of Japanese high school at once.
From the experimental results, we can say that the combination of monograms and Random Forest, which has the largest \({F}_{L}\), is effective when limiting the units, and the combination of bigrams and Random Forest is effective for a uniform data set covering three years of high school. However, the accuracy is not much different between the features \(n=1\) and \(n=2\) in Random Forest prediction.
Practical educational implications in this research
Automatic labeling can help reduce teacher workload (Tian et al., 2022) and develop mathematics workmanship (Fishback & Schlicker, 1996) by enabling systems that require labels (Vie et al., 2019; Wang et al., 2022). For the further development of mathematics education in Japan, a programming environment supporting units on mathematics using data (Kayama et al., 2022) and a clarification of unit structure for knowledge association in learning (Taniguchi & Itoh, 2023) have been proposed.
The labeling method in this study allows labels to be assigned to unlabeled mathematical instructional materials by learning from the text of labeled materials, even when the materials are not formatted suitably for text extraction, such as materials handmade by mathematics teachers. This can facilitate unit-based learning grounded in the national standard curriculum guidelines. For example, when students study on their own, the system can suggest exercises different from the ones they have solved, with the explanation that they belong to the same unit, which can promote student understanding. A system using units is therefore more easily utilized in school contexts. In other words, the contribution of this study is that the automatic assignment of unit information to systems in the educational field will expand the range of support without burdening teachers.
In addition, although we chose mathematical exercises as the subject matter, we believe that this method could be applied to other subjects as well, given the uniform treatment of equations, terminology, and other information as textual information. To do so, we need clearly shared criteria and examples of exercises to which they are pre-assigned (i.e., we can use the method proposed in Sect. 3 if the data set is in a usable format).
Limitations and future research
In this study, we proposed an algorithm for classifying the incomplete texts of mathematical exercises into units at different levels. However, if more detailed text were available, a context-aware classification algorithm would be expected to produce better accuracy.
In this experiment, we limited ourselves to assigning one topic of the same level to each mathematical exercise, but there are also mathematical exercises that span multiple topics. To properly assign topics to such exercises, a system that can assign multiple units using the algorithm verified in this study is needed; multi-label classification is widely used in machine learning (Sorower, 2010; Tsoumakas & Katakis, 2007). Once such a system is completed, it would be possible to recommend similar exercises using mathematical topics and analyze student learning based on topics.
This experiment showed that even when mathematical expressions, numbers, and symbols cannot be read, it is possible to classify with high accuracy using only the textual information obtained. Since the resulting system can assign common labels to different teaching materials, it would be possible to develop a textbook recommendation system that assigns textbook subsections to exercises so that students can review them in the textbook when they make a mistake on an exercise. The information collected through such systems could then create a learning support environment that takes into account the difficulty and understanding of the 1st level and 2nd level units themselves.
In addition, since the experiment was conducted independently of student learning, there are no results on the contribution to learner and teacher activities. It will be necessary to verify the educational effects in future experiments by using the automatic labeling of units to recommend teaching materials or to analyze learner behavior, for example by predicting the overall difficulty of exercises and learners' completion rates, or by considering students' reading comprehension. Moreover, as one study developed a recommendation system that uses student actions as parameters (Takami et al., 2022), there is also room for combining the following two approaches: a topic-based, model-driven approach (i.e., labeling the exercises) and a student-behavior, data-driven approach (i.e., feeding students' achievement on the exercises into the system).
Conclusion
This paper proposes an algorithm that uses several techniques to correctly assign topics to the incomplete mathematical text obtained from PDFs. The extracted text showed that information on numbers, mathematical expressions, and symbols was omitted when converting from PDF to text. We compared the prediction accuracy of two methods at the stage of predicting topics from the obtained vectors: one using cosine similarity and the other using machine learning. Testing all features and models, we found that the best prediction accuracy was achieved by using monograms as features and applying Random Forest (92.5% and 68.5% for 1st level units and 2nd level units, respectively). We conclude that the reasons for the higher accuracy were the ability of n-grams to find context-independent similarities even in incomplete sentences, by matching the words that remain, and the existence of organization parts (‘I’, ‘II’, ‘III’, ‘A’, ‘B’) representing common national classifications for Japanese mathematical exercises. Given that PDFs are not necessarily assigned such national symbols, we conducted a similar experiment omitting them as stop words and found that the accuracy dropped slightly, but important mathematical knowledge elements appeared among the key features, which are important for the classification of mathematical exercises.
The contribution of this research is the discovery that monograms, a simple approach based on traditional methods such as n-grams and bag of words, outperformed state-of-the-art methods in classifying incomplete texts, particularly in the context of Japanese mathematics exercises. These findings challenge previous research results and suggest that the choice of text analysis technique may depend on the specific task or target domain.
Availability of data and materials
The data of this study is not open to the public due to participant privacy.
Abbreviations
 ACARA:

Australian curriculum, assessment and reporting authority
 BERT:

Bidirectional encoder representations from transformers
 CCSS:

Common core state standard
 CNN:

Convolutional neural network
 ELMo:

Embeddings from language models
 GPT:

Generative pre-trained transformer
 HTC:

Hierarchical text classification
 HTML:

Hypertext markup language
 ICT:

Information and communication technology
 K12:

Kindergarten through 12th grade
 LR:

Logistic regression
 LSTM:

Long short-term memory
 MEXT:

Ministry of Education, Culture, Sports, Science and Technology
 MLP:

Multi-layered perceptron
 MOE:

Ministry of Education
 MSC:

Mathematics subject classification
 OCR:

Optical character recognition
 PDF:

Portable document format
 PISA:

Programme for international student assessment
 RF:

Random forest
 RNN:

Recurrent neural network
 SES:

Socioeconomic status
 US:

United States
 VSM:

Vector space model
 WEKE:

Word embedding and knowledge extracting
 XGB:

EXtreme gradient boosting
 zbMATH:

Zentralblatt MATH
References
Abekawa, T., & Aizawa, A. (2016). SideNoter: Scholarly paper browsing system based on PDF restructuring and text annotation. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations, 136–140.
Australian Curriculum, Assessment and Reporting Authority (ACARA). F-10 curriculum mathematics structure. Retrieved 01 September, 2023 from https://www.australiancurriculum.edu.au/f-10-curriculum/mathematics/structure/.
Bhartiya, D., Contractor, D., Biswas, S., Senjupta, B., & Mohania, M. (2016). Document segmentation for labeling with academic learning objectives. In Paper presented at the International Conference on Educational Data Mining, 282–287.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Cavnar, W. B., & Trenkle, J. M. (1994). N-gram-based text categorization. In Proceedings of SDAIR-94, 3rd annual symposium on document analysis and information retrieval, 161–175.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794.
Chen, T., He, T., Benesty, M., Khotilovich, V., Tang, Y., Cho, H., & Chen, K. (2015). Xgboost: Extreme gradient boosting. R Package Version, 1(4), 1–4.
Chow, J. C., & Ekholm, E. (2019). Language domains differentially predict mathematics performance in young children. Early Childhood Research Quarterly, 46, 179–186.
Church, K. W. (2017). Word2Vec. Natural Language Engineering, 23(1), 155–162.
Contractor, D., Popat, K., Ikbal, S., Negi, S., Sengupta, B., & Mohania, M. K. (2015). Labeling educational content with academic learning standards. In Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 136–144.
Cox, D. R. (1958). The regression analysis of binary sequences. Journal of the Royal Statistical Society: Series B (methodological), 20(2), 215–232.
Date, I., & Isozaki, H. (2015). Detection of mathematical formula regions in images of scientific papers by using deep learning and OCR. IEICE Technical Report, 2015(4), 1–6. in Japanese.
Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1, 4171–4186.
Dharma, E. M., Gaol, F. L., Warnars, H. L. H. S., & Soewito, B. (2022). The accuracy comparison among word2vec, glove, and fasttext towards convolution neural network (CNN) text classification. Journal of Theoretical and Applied Information Technology, 100(2), 349–359.
Ditchburn, G. (2012). A national Australian curriculum: In whose interests? Asia Pacific Journal of Education, 32(3), 259–269.
Dunne, E., & Hulek, K. (2020). Mathematics subject classification 2020. EMS Newsletter, 115, 5–6.
Fateman, R. J., Tokuyasu, T., Berman, B. P., & Mitchell, N. (1996). Optical character recognition and parsing of typeset mathematics. Journal of Visual Communication and Image Representation, 7(1), 2–15.
Fishback, P., & Schlicker, S. (1996). The impact of technology on mathematics education. Grand Valley Review, 14(1), 27.
Flanagan, B., Majumdar, R., Akçapınar, G., Wang, J., & Ogata, H. (2019). Knowledge map creation for modeling learning behaviors in digital learning environments. In Companion Proceedings of the 9th International Conference on Learning Analytics and Knowledge, 428–436.
Flanagan, B., & Ogata, H. (2018). Learning analytics platform in higher education in Japan. Knowledge Management & E-Learning: An International Journal, 10(4), 469–484.
Graovac, J. (2014). Text categorization using n-gram based language independent technique. Intelligent Data Analysis, 18(4), 677–695.
Graovac, J., Kovačević, J., & Pavlović-Lažetić, G. (2015). Language independent n-gram-based text categorization with weighting factors: A case study. Journal of Information and Data Management, 6(1), 4–17.
Graovac, J., Kovačević, J., & Pavlović-Lažetić, G. (2017). Hierarchical vs. flat n-gram-based text categorization: Can we do better? Computer Science and Information Systems, 14(1), 103–121.
Guo, Y., Silver, E. A., & Yang, Z. (2018). The latest characteristics of mathematics education reform of compulsory education stage in China. American Journal of Educational Research, 6(9), 1312–1317.
Hussein, H. B. (2023). Global trends in mathematics education research. International Journal of Research in Educational Sciences, 6(2), 309–319.
Ikeda, T. (2021). nagisa (0.2.7). https://github.com/taishi-i/nagisa.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2017). Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2, 427–431.
Kayama, M., Nagai, T., & Asuke, T. (2022). A proposal for a visual programming environment for “use of data” related units in primary and secondary education. Journal of Japanese Society for Information and Systems in Education, 39(2), 224–234. in Japanese.
Khan, A., Baharudin, B., Lee, L. H., & Khan, K. (2010). A review of machine learning algorithms for text-documents classification. Journal of Advances in Information Technology, 1(1), 4–20.
Khosravi, H., & Cooper, K. (2018). Topic dependency models: Graphbased visual analytics for communicating assessment data. Journal of Learning Analytics, 5(3), 136–153.
Kobayashi, T. (2021). t-vMF similarity for regularizing intra-class feature distribution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 6616–6625.
Kobayashi, Y., Tanaka, S., & Tomiura, Y. (2012). Pattern recognition of English scientific papers using n-grams. Information Fundamentals and Access Technologies, 12(1), 1–6.
Kühnemund, A. (2016). The role of applications within the reviewing service zbMATH. PAMM, 16(1), 961–962.
Li, B., Liu, T., Du, X., Zhang, D., & Zhao, Z. (2016). Learning document embeddings by predicting n-grams for sentiment classification of long movie reviews. In The Eleventh International Conference on Learning Representations.
Liu, G., & Guo, J. (2019). Bidirectional LSTM with attention mechanism and convolutional layer for text classification. Neurocomputing, 337, 325–338.
Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge University Press.
Mansur, M. (2006). Analysis of n-gram based text categorization for Bangla in a newspaper corpus (Doctoral dissertation, BRAC University).
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of Advances in Neural Information Processing Systems 26.
Ministry of Education, Culture, Sports, Science and Technology (MEXT). (2009). 高等学校学習指導要領(平成21年3月告示). [High school curriculum guidelines (announced in March 2009)]. https://erid.nier.go.jp/files/COFS/h20h/index.htm. in Japanese.
Ministry of Education, Culture, Sports, Science and Technology (MEXT). (2018). 数学編・理数編 高等学校学習指導要領(平成30年告示). [Mathematics and science volumes, high school curriculum guidelines (announced in 2018)]. https://www.mext.go.jp/content/20230217mxt_kyoiku02100002620_05.pdf. in Japanese.
Ministry of Education, Culture, Sports, Science and Technology (MEXT). (2021). 高等学校用教科書目録(令和4年度使用) [Textbook catalog for high schools (for use in fiscal 2022)]. https://www.mext.go.jp/content/20210604mxt_kyokasyo02000014470_4.pdf. in Japanese.
Ministry of Education of the People’s Republic of China (MOE). (2012). Mathematics curriculum standards for compulsory education (2011th ed.). Beijing Normal University Press.
Montoliu, R., Martín-Félez, R., Torres-Sospedra, J., & Martínez-Usó, A. (2015). Team activity recognition in association football using a bag-of-words-based method. Human Movement Science, 41, 165–178.
Ohnishi, T. (2011). Task-based learning in high school mathematics. Japan Society for Science Education Research Report, 26(8), 45–48. in Japanese.
Palmer, J. A. (2021). pdftotext (2.2.2). https://github.com/jalan/pdftotext.
Peters, M., Neumann, M., Zettlemoyer, L., & Yih, W. (2018). Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, 1499–1509.
Porter, A., McMaken, J., Hwang, J., & Yang, R. (2011). Common core standards: The new US intended curriculum. Educational Researcher, 40(3), 103–116.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Raff, E., Richard, Z., Cox, R., Sylvester, J., Yacci, P., Ward, R., Tracy, A., Mclean, M., & Nicholas, C. (2018). An investigation of byte n-gram features for malware classification. Journal of Computer Virology and Hacking Techniques, 14, 1–20.
Ramakrishnan, C., Patnia, A., Hovy, E., & Burns, G. A. (2012). Layout-aware text extraction from full-text PDF of scientific articles. Source Code for Biology and Medicine, 7(1), 1–10.
Ritter, B. J. (2009). Update on the common core state standards initiative. National Governors Association.
Rosenblatt, F. (1958). The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6), 386.
Schubotz, M., Scharpf, P., Teschke, O., Kühnemund, A., Breitinger, C., & Gipp, B. (2020). AutoMSC: Automatic assignment of mathematics subject classification labels. In International Conference on Intelligent Computer Mathematics, 237–250.
Shen, J. T., Yamashita, M., Prihar, E., Heffernan, N., Wu, X., McGrew, S., & Lee, D. (2021). Classifying math knowledge components via task-adaptive pre-trained BERT. In International Conference on Artificial Intelligence in Education, 408–419.
Shintani, R. (2014). The development process and contents of the common core state standards: Based on a comparative study with the Japanese Course of Study for lower secondary school. Journal of Japan Association of American Educational Studies, 25, 15–27. in Japanese.
Silla, C. N., Jr., & Freitas, A. A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1–2), 31–72.
Smith, R. (2007). An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition, 2, 629–633.
Sorower, M. S. (2010). A literature survey on algorithms for multilabel learning. Oregon State University, Corvallis, 18(1), 25.
Sosnovsky, S., & Brusilovsky, P. (2015). Evaluation of topic-based adaptation and student modeling in QuizGuide. User Modeling and User-Adapted Interaction, 25, 371–424.
Spelke, E. S., & Tsivkin, S. (2001). Language and number: A bilingual training study. Cognition, 78(1), 45–88.
Suen, C. Y. (1979). N-gram statistics for natural language understanding and text processing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2, 164–172.
Takami, K., Dai, Y., Flanagan, B., & Ogata, H. (2022). Educational explainable recommender usage and its effectiveness in high school summer vacation assignment. In 12th International Learning Analytics and Knowledge Conference, 458–464.
Taniguchi, Y., & Itoh, T. (2023). Unit association method with symbolization in high school mathematics textbook. Journal of Information Processing, 64(1), 256–269. in Japanese.
Tian, Z., Flanagan, B., Dai, Y., & Ogata, H. (2022). Automated matching of exercises with knowledge components. In 30th International Conference on Computers in Education Conference Proceedings, 24–32.
Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3), 1–13.
Vie, J. J., & Kashima, H. (2019). Knowledge tracing machines: Factorization machines for knowledge tracing. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1), 750–757.
Vovides, Y., Sanchez-Alonso, S., Mitropoulou, V., & Nickmans, G. (2007). The use of e-learning course management systems to support learning strategies and to improve self-regulated learning. Educational Research Review, 2(1), 64–74.
Wang, F., King, R. B., & Leung, S. O. (2023). Why do east Asian students do so well in mathematics? A machine learning study. International Journal of Science and Mathematics Education, 21(3), 691–711.
Wang, J., Minematsu, T., Okubo, F., & Shimada, A. (2022). Topic-wise representation of learning activities for new learning pattern analysis. In 30th International Conference on Computers in Education Conference Proceedings, 1, 268–278.
zbMATH OPEN, The first resource for mathematics. Mathematics subject classification—MSC2020. Retrieved 03 September, 2023 from https://zbmath.org/classification/.
Zheng, E., Moh, M., & Moh, T. S. (2017). Music genre classification: A n-gram based musicological approach. In 2017 IEEE 7th International Advance Computing Conference, 671–677.
Funding
This work was partly supported by JSPS Grant-in-Aid for Scientific Research (B) JP23H01001, JP22H03902, JP20H01722, JSPS Grant-in-Aid for Scientific Research (Exploratory) JP21K19824, and NEDO JPNP20006.
Author information
Contributions
RN, BF, and HO contributed to the research conceptualization and methodology. TY wrote the manuscript. RN, YD, KT, BF, and HO provided comments to improve the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Yamauchi, T., Flanagan, B., Nakamoto, R. et al. Automated labeling of PDF mathematical exercises with word N-grams VSM classification. Smart Learn. Environ. 10, 51 (2023). https://doi.org/10.1186/s40561-023-00271-9
DOI: https://doi.org/10.1186/s40561-023-00271-9