Non-intrusive assessment of learners’ prior knowledge in dialogue-based intelligent tutoring systems
© Rus and Ştefănescu. 2016
Received: 3 November 2015
Accepted: 22 January 2016
Published: 3 February 2016
Goal and Scope
This article describes a study whose goal was to assess students’ prior knowledge level with respect to a target domain based solely on characteristics of the natural language interaction between students and conversational Intelligent Tutoring Systems (ITSs). We report results on data collected from two conversational ITSs: a micro-adaptive-only ITS and a fully-adaptive (micro- and macro-adaptive) ITS. These two ITSs are in fact different versions of the state-of-the-art conversational ITS DeepTutor (http://www.deeptutor.org).
Approach and Results
Our models rely on both dialogue and session interaction features including time on task, student generated content features (e.g., vocabulary size or domain specific concept use), and pedagogy-related features (e.g., level of scaffolding measured as number of hints). Linear regression models were explored based on these features in order to predict students’ knowledge level, as measured with a multiple-choice pre-test, and yielded in the best cases an r=0.949 and adjusted r-square =0.833. We discuss implications of our findings for the development of future ITSs.
Assessment is a key element in education in general and in Intelligent Tutoring Systems (ITSs; (Rus et al. 2013)) in particular because fully adaptive tutoring presupposes accurate assessment (Chi et al. 2001; Woolf 2008). Indeed, a necessary step towards instruction adaptation is assessing students’ knowledge state such that appropriate instructional tasks (macro-adaptation) are selected and appropriate scaffolding is offered while students are working on a task (micro-adaptation or within-task adaptation).
We focus in this article on assessing students’ prior knowledge in dialogue-based ITSs based on characteristics of the tutorial dialogue interaction between students and such systems. Assessing students’ other states, e.g. affective state, that are important for learning and therefore important to further adapt instruction to each individual learner is beyond the scope of this work.
When students start interacting with an ITS, their prior knowledge with respect to the target domain is typically assessed using a multiple choice pre-test although other forms of assessment such as open answer problem solving are sometimes used. The pre-test serves two purposes: enabling macro-adaptation in ITSs, i.e. the selection of appropriate instructional tasks for a student based on the student’s knowledge state before the tutoring session starts, and, when paired with a post-test, establishing a baseline from which the student’s progress is gauged by computing learning gains (post- minus pre-test score). This widely used pre-test/post-test experimental framework is often necessary in order to infer whether the treatment was effective relative to the control.
While the role of a pre-test is important for assessing students’ prior knowledge, there are several challenges with having a pre-test. First, a pre-test (as well as the paired post-test) takes up a non-trivial amount of time. This is particularly true for experiments consisting of only one session in which case the pre-test and post-test may take up to half the time of the full experiment. For instance, a 2-h experiment could be broken down into three parts: 30 min for pre-test, 1 h of actual interaction with an ITS, and 30 min for post-test. Altogether, in this particular case the pre-test and post-test take 1 h which is half the time of the whole experiment.
More worrying is the fact that in such experiments the pre-test may have a tiring effect on students. By the time students reach the post-test many of them will be so tired that they will underperform even if they learned something during the actual training, thus jeopardizing the whole experiment. For instance, in one of our experiments about 30 % of the subjects simply randomly picked one of the choices for the multiple-choice questions in the post-test without even reading the question. We observed this behavior by analyzing the time students took to pick their choice after they were shown a question on screen. About a third of the students took on average less than 5 s per question, which is not even enough to read the text of the question. By comparison, the same students took on average 36 s to respond to similar questions in the pre-test. By eliminating the pre-test in the above illustrative experiment, we can reduce the overall experimental time to 1 h and 30 min, thus reducing tiring effects. By eliminating both the pre-test and post-test, we can further reduce the total experiment time.
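The response-time check described above can be sketched in a few lines. This is only an illustration, not the analysis script used in the study: the function name `flag_rapid_guessers`, the student IDs, and the timing data are hypothetical, and the 5 s cut-off is simply the figure quoted in the text.

```python
from statistics import mean

def flag_rapid_guessers(response_times, threshold_s=5.0):
    """Return the IDs of students whose mean time per question is below the threshold."""
    return sorted(
        student
        for student, times in response_times.items()
        if times and mean(times) < threshold_s
    )

# Hypothetical per-question post-test response times, in seconds.
post_test_times = {
    "s01": [3.1, 4.0, 2.5, 4.4],      # mean 3.5 s: too fast to have read the question
    "s02": [30.2, 41.0, 28.7, 35.5],  # plausible reading and answering times
    "s03": [6.0, 3.0, 4.0, 5.0],      # mean 4.5 s
}
print(flag_rapid_guessers(post_test_times))  # ['s01', 's03']
```

A per-student mean is the simplest possible criterion; a stricter analysis might also look at the share of individual questions answered below the threshold.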
Additionally, many times there is a disconnect between the pre- and post-test questions and the actual learning tasks and process. To overcome this challenge, Shute and Ventura (2013) argue for a shift towards emphasizing performance-based assessment, which is about evaluating students’ skills and knowledge while applying them in authentic contexts. For instance, reading instructions in a role-playing game allows assessing students’ reading comprehension skills (Shute and Ventura 2013). Using explicit tests in such contexts would interfere with the main task and is therefore not recommended. They advocate for the use of stealth assessment while students engage in a particular activity. Like in stealth assessment, we advocate here for non-intrusive assessment during problem solving in dialogue-based ITSs. To this end, the goal of our work presented here was to investigate to what degree we can automatically infer students’ knowledge level directly from their performance while engaging in problem solving with the help of an ITS.
Eliminating the need for learners to go through a standard pre-test and a post-test saves time for more training, eliminates tiring effects and testing anxieties, and ultimately provides a more accurate picture of students’ capabilities as the assessment is conducted in context, i.e. while they solve problems in our case. In particular, we investigate how well we can predict students’ prior knowledge, as measured by a standard multiple-choice pre-test, based on characteristics of the tutorial dialogue interaction with the hope that if the predictions are close enough we can do without the pre-test in the future. We are also interested in finding out the minimum tutorial dialogue interaction that would yield an accurate estimate of students’ prior knowledge.
We would like to emphasize that we are not arguing for a complete elimination of explicit assessments such as multiple-choice tests which have their own advantages for learning such as testing effects (the memory retrieval processes activated during testing benefit long-term memory of the target material; (Roediger and Karpicke 2006)). Rather, we propose to investigate to what extent we can measure students’ knowledge level from interaction characteristics such that, when needed, we can employ this kind of non-intrusive assessment.
We conducted our research on data collected from an experiment with high-school students using the state-of-the-art conversational computer tutor DeepTutor (Rus et al. 2013). As mentioned, our goal was to find interaction features that are good predictors of students’ pre-test scores and to create prediction models that would be as useful as the multiple choice pre-tests in measuring students’ prior knowledge. The best model we found can predict students’ prior knowledge, as measured by a summative pre-test, with r=0.949 and adjusted r-square =0.833. We also determined the minimum dialogue length which is necessary to be able to make the best predictions.
The remainder of the article is organized as follows: Section “Related work” briefly discusses previous relevant work while Section “DeepTutor: a state-of-the-art dialogue-based intelligent tutoring system” presents a brief overview of the computer tutor that provided the context for our experimental analysis. The following section describes the approach. The data is presented in the next section which is followed by the “Experiments and results” section offering details about the various prediction models and the results we obtained from these models. The article ends with a section on conclusions and further work.
The most directly relevant previous work to ours is by Lintean et al. (2012) who studied the problem of inferring students’ prior knowledge based on prior knowledge activation (PKA) paragraphs elicited from students. PKAs were generated by students as part of a meta-cognitive training program. Lintean and colleagues employed a myriad of methods to predict students’ prior knowledge including comparing the student PKA paragraphs to expert-generated paragraphs or to a taxonomy of concepts related to the target domain, which in their case was biology. Students’ prior knowledge level or mental model were modeled as a set of three categories: low mental model, medium mental model, and high mental model. There are significant differences between our work and theirs. First, we deal with dialogues as opposed to explicitly elicited prior knowledge paragraphs. Second, we do not have access to a taxonomy of concepts against which we can compare students’ contributions. Third, we model students’ prior knowledge using scores obtained on a multiple-choice pre-test.
Predicting students’ learning and satisfaction is another area of research directly relevant to ours. Among these, we mention the work of Forbes-Riley and Litman (2006) who used three types of features to predict learning and user satisfaction: system specific, tutoring specific, and user-affect-related. They employed the whole training session as the unit of analysis, which is different from our own analysis because we use the instructional task, i.e. a Physics problem in our case, as the unit of analysis. Our unit of analysis better serves our purpose of finding out the minimum number of leading instructional tasks needed to accurately assess students’ knowledge level. Furthermore, their work was in the context of a spoken dialogue system while in our case we focus on a chat-based/typed-text-based conversational ITS. Another difference between our work and theirs is that they focus on user satisfaction and learning while we focus on identifying students’ knowledge level.
Williams and D’Mello (2010) worked on predicting the quality of student answers (as error-ridden, vague, partially-correct or correct) to human tutor questions, based on dictionary-based dialogue features previously shown to be good detectors of cognitive processes (cf. Williams and D’Mello 2010). To extract these features, they used LIWC (Linguistic Inquiry and Word Count; (Pennebaker et al. 2001)), a text analysis software program that calculates the degree to which people use various categories of words across a wide array of text genres. They reported that pronouns (e.g. I, they, those) and discrepant terms (e.g. should, could, would) are good predictors of the conceptual quality of student responses.
Yoo and Kim (2012) worked on predicting the project performance of students and student groups based on stepwise regression analysis on dialogue features in Online Q&A discussions. To extract dialogue features they made use of LIWC and speech acts, which are semantic categories such as Greetings or Questions that indicate speakers’ intentions (Moldovan et al. 2011). Yoo and Kim found that the degree of information provided by students and how early they start to discuss before the deadline are two important factors explaining project grades. Similar research was conducted by Romero and colleagues (Romero et al. 2013) who also included (social) network related features. Their statistical analysis showed that the best predictors related to students’ dialogue are the number of contributions (messages), number of words, and the average score of the messages.
In our work presented here, we use some of the features described by the above researchers, such as session length or dialogue turn length, and other novel features such as information content.
DeepTutor: a state-of-the-art dialogue-based intelligent tutoring system
The work described in this article has been conducted in the context of the state-of-the-art intelligent tutoring system DeepTutor (http://www.deeptutor.org). To better understand this context, we offer in this section an overview of intelligent tutoring systems in general and of DeepTutor in particular.
Intelligent tutoring systems
One-on-one human tutoring is one of the most effective solutions to instruction and learning that has attracted the attention of many for decades. Encouraged by the effectiveness of one-on-one human tutoring (Bloom 1984), computer tutors such as DeepTutor that mimic human tutors have been successfully built with the hope that a computer tutor could be available to every child with access to a computer (Rus et al. 2013).
How effective are state-of-the-art ITSs at inducing learning gains in students?
An extensive review of tutoring research by VanLehn (2011) showed that computer tutors are as effective as human tutors. VanLehn reviewed studies published between 1975 and 2010 that compared the effectiveness of human tutoring, computer-based tutoring, and no tutoring. The conclusion was that the effectiveness of human tutoring is not as high as it was originally believed (effect size d = 2.0) but much lower (d = 0.79). The effectiveness of computer tutors (d = 0.78) was found to be as high as the effectiveness of human tutors. So, there is something about the one-on-one connection that is critical, whether the student communicates with humans or computers. Graesser et al. (1995) argued that the remedial part of tutorial interaction in which tutor and tutee collaboratively improve an initial answer to a problem is the primary advantage of tutoring over classroom instruction. Chi et al. (2004) advanced a related hypothesis: tutoring enhances students’ capacity to reflect iteratively and actively on domain knowledge. Furthermore, one-on-one instruction has the advantage of engaging most students’ attention and interest as opposed to other forms of instruction such as lecturing/monologue in which the student may or may not choose to pay attention (VanLehn et al. 2007).
Dialogue-based intelligent tutoring systems
Intelligent Tutoring Systems (ITSs) with conversational dialogue form a special category of ITSs. The development of conversational ITSs such as DeepTutor is driven by explanation-based constructivist theories of learning and the collaborative constructive activities that occur during human tutoring (Rus et al. 2013). Conversational ITSs have several advantages over other types of ITSs. They encourage deep learning as students are required to explain their reasoning and reflect on their basic approach to solving a problem. Such conceptual reasoning is more challenging and beneficial than mechanical application of mathematical formulas (Hestenes et al. 1992). Furthermore, conversational ITSs have the potential of giving students the opportunity to learn the language of scientists, an important goal in science literacy. A student associated with a more shallow understanding of a science topic uses more informal language as opposed to more scientific accounts (Mohan et al. 2009).
DeepTutor is a state-of-the-art conversational ITS that is intended to increase the effectiveness of conversational ITSs by promoting deep learning of complex science topics through a combination of advanced domain modeling methods, deep language and discourse processing algorithms, and advanced tutorial strategies. DeepTutor is the first ITS based on the framework of Learning Progressions (LPs; (Corcoran et al. 2009)). LPs, which were developed by the science education research community, can be viewed as incrementally more sophisticated ways to think about an idea that emerge naturally while students move toward expert-level understanding of the idea. DeepTutor is an effective ITS: a recent experiment showed that DeepTutor is as effective as human tutors (Rus et al. 2014) yielding effect sizes comparable to the effectiveness of human tutors as reported by VanLehn (2011).
DeepTutor currently targets the domain of conceptual Newtonian Physics but it is designed with scalability in mind (cross-topic, cross-domain). DeepTutor has been developed as a web service and a first prototype is fully accessible through a browser from any Internet-connected device, including regular desktop computers and mobile devices such as tablets, thus moving us closer to the vision of providing cost-effective and tailored instruction to every learner, child or adult, anywhere, anytime.
The spin-off project of AuthorTutor (http://www.authortutor.org) aims at efficiently porting DeepTutor-like ITSs to new domains by investigating well-defined principles and processes as well as developing software tools that would enable experts to efficiently author conversational computer tutors across STEM disciplines. Another authoring tool, called SEMILAR (derived from SEMantic simILARity toolkit; (Rus et al. 2013)), is being developed as well to assist with authoring algorithms for deep natural language processing of student input in conversational ITSs. More information about the SEMILAR toolkit is available at http://www.semanticsimilarity.org.
It is beyond the scope of this article to describe all the novel aspects of DeepTutor or related projects. Instead, we present next the general instructional framework in DeepTutor with an emphasis on macro- and micro-adaptation which is important to know in order to better understand the data analyses presented in this article.
We would just like to mention that DeepTutor proposed major improvements in core ITS tasks: modeling the task domain, tracking students’ knowledge states, selecting appropriate learning trajectories, and the feedback mechanisms. Advances in these core tutoring tasks will move state-of-the-art ITSs closer to implementing fully adaptive tutoring, which means tailoring instruction to each individual student.
The DeepTutor instructional framework
All other things being equal, low knowledge students will most likely struggle to provide solid self-explanations and are therefore most likely to omit important steps in the solution and articulate misconceptions, which would lead to more scaffolding dialogue moves in terms of hints and correcting misconceptions, respectively, on the part of the computer tutor. High knowledge students would need less scaffolding and therefore the corresponding dialogues should be shorter. That is, each dialogue between the system and a student has a unique signature or dialogue interaction fingerprint which we exploit in our work here in order to infer students’ prior knowledge.
Macro- and micro-adaptivity in DeepTutor: the 3-loop instructional framework
The behavior of DeepTutor can be described using three major loops: the task loop, the solution-step loop, and the hint loop. This framework was inspired from VanLehn’s two-loop characterization of tutoring systems (VanLehn et al. 2007). According to VanLehn, ITSs can be described in broad terms as running two loops: the outer loop, which selects the next task to work on, and the inner loop, which manages the student-system interaction while the student works on a particular task. The outer loop corresponds to our task loop while the inner loop corresponds to both the solution-step and hint loops.
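To make the nesting of the three loops concrete, the following toy simulation may help. It is a sketch under strong assumptions and does not reflect DeepTutor's actual dialogue manager: the function `run_session`, the task and step data structures, and the idea of modeling a student as "number of hints needed per step" are all hypothetical simplifications.

```python
def run_session(tasks, hints_needed):
    """Simulate the three loops and count the hints shown per task.

    hints_needed maps a step id to how many hints a (hypothetical) student
    needs before answering that step correctly; steps not listed need none.
    """
    hints_shown = {}
    for task in tasks:                                   # task loop (outer loop)
        shown = 0
        for step in task["steps"]:                       # solution-step loop
            needed = hints_needed.get(step["id"], 0)
            for position, _hint in enumerate(step["hints"], start=1):  # hint loop
                if position > needed:
                    break  # the student has answered correctly; go to the next step
                shown += 1
        hints_shown[task["id"]] = shown
    return hints_shown

# Hypothetical tasks, each with solution steps and an ordered list of hints per step.
tasks = [
    {"id": "task1", "steps": [
        {"id": "t1s1", "hints": ["hint A", "hint B"]},
        {"id": "t1s2", "hints": ["hint C", "hint D", "hint E"]},
    ]},
    {"id": "task2", "steps": [
        {"id": "t2s1", "hints": ["hint F"]},
    ]},
]

low_knowledge = {"t1s1": 2, "t1s2": 3, "t2s1": 1}   # needs every available hint
high_knowledge = {"t1s2": 1}                         # needs a single hint overall

print(run_session(tasks, low_knowledge))   # {'task1': 5, 'task2': 1}
print(run_session(tasks, high_knowledge))  # {'task1': 1, 'task2': 0}
```

In this simplified picture, macro-adaptation would change which `tasks` list a student receives, while micro-adaptation corresponds to the behavior of the two inner loops.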
We believe that our framework better explains and guides the development of fully adaptive ITSs. Indeed, having only two loops, the outer loop and the inner loop, is too coarse and obscures important instructional layers that need to be addressed explicitly by adopting appropriate instructional strategies as illustrated above for the solution-step and hint level loops. In fact, Rus et al. (2013) suggested there should be even more loops (than the three in our framework) accounted for in a fully independent, comprehensive, longitudinal education technology that monitors and tutors students over a long period of time spanning many topics and grade levels. According to Rus and colleagues, there should be a loop for each of the following instructional levels: curriculum/standards level, the course level, the lesson level, the activity level, the solution level, and the hint level. Each such loop will have to be guided by different instructional strategies that are appropriate for the corresponding instructional level. For instance, strategies for sequencing instructional tasks across many instructional sessions in a course, which should be informed by principles of interleaving and spacing (Pavlik and Anderson 2008) that have been shown to promote long-term learning, should be implemented as part of the course level loop in Rus and colleagues’ taxonomy of instructional levels. For simplicity and to fairly describe the current state of the DeepTutor system, we limit our discussion to the three-loop framework mentioned above which addresses the activity (or task) instructional level, solution-step instructional level, and hint instructional level. These three loops are essential in order to understand micro- and macro-adaptivity in DeepTutor, which in turn are important to understand the context of our work presented here.
Our approach to predict students’ knowledge level in the context of dialogue-based ITSs relies on the fact that each tutorial dialogue between the system and a student has its own characteristics which are strongly influenced by students’ background and the nature of instructional tasks. Indeed, students’ knowledge level is reflected in the tutorial dialogue between the system and the student, e.g. as the learner becomes more competent the level of help from the ITS should drop. The level of help can be quantified as the number of hints, for instance. Furthermore, the dialogue characteristics are also influenced by the nature of the training tasks. If similar tasks (addressing same concepts in similar or related contexts) are used throughout a whole tutorial session, one might expect that by the time a student reaches the last problems in the session he would master them, thus, requiring less help from the tutor by the end of the session. On the other hand, if the problems are increasingly challenging or simply unrelated to each other then the students would be continuously challenged throughout the whole session; in such a scenario the number of hints a student receives should not drop throughout a session.
We are exploring the relationship between students’ prior knowledge and dialogue features in two different setups with two different task selection strategies which allows us to explore the impact of different task selection policies on the dialogue characteristics and therefore on our models for predicting students’ prior knowledge. Indeed, we work with data collected from training sessions with two versions of DeepTutor: micro-adaptive-only and fully-adaptive (macro- and micro-adaptive). In the micro-adaptive-only condition, students are working on tasks that were so selected to address typical challenges for all students, i.e. following a one-size-fits-all approach. In this micro-adaptive-only condition, students received scaffolding while working on a task (within-task adaptivity) based on their individual performance on that particular task. For instance, if a student articulated a misconception during the solving of a problem, the system would correct it.
In the macro-adaptive condition, students were assigned to four groups corresponding to four knowledge levels (low knowledge, medium-low knowledge, medium-high knowledge, and high-knowledge) and appropriate instructional tasks were assigned to each group using an Item Response Theory style analysis (Rus et al. 2014). That is, high-knowledge students received more challenging problems appropriate for their level of expertise while low knowledge students received less challenging problems. The consequence of this more-adaptive task selection policy is reflected in the dialogue characteristics as, for instance, the percentage of hints (explained later) is expected to be similar for both high-knowledge and low-knowledge students as the tasks are similarly challenging relative to the knowledge level of the students. Within a task, the fully-adaptive ITS offered identical micro-adaptivity to the micro-adaptive-only ITS. It should be noted that in the micro-adaptive-only case, the problems were selected (two each) from the set of problems used for the four knowledge groups in the fully-adaptive condition.
The features of the prediction model
The proposed approach relies on a set of features that was inspired from the previous work mentioned earlier as well as other work such as automated essay scoring (Shermis and Burstein 2003) in which the goal is similar to some extent to ours: infer students’ knowledge level or skills based on their language in a written essay. Furthermore, our set of features is grounded in the learning literature as explained next.
The set of dialogue interaction features we employed can be classified into three major categories: time-on-task, generation, and pedagogy. Time-on-task, which reflects how much time students spend on a learning task, correlates positively with learning (Taraban and Rynearson 1998). Time-on-task is measured in several different ways in our case such as total time (in minutes) or normalized total time (we used the longest dialogue as the normalization factor). We computed several additional time-related features such as average time per turn and winsorized versions of the basic time-related features.
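A minimal sketch of how such time-on-task features could be computed is given below. It rests on assumptions that are not stated in the text: total time is taken to be the sum of the student's turn durations, and winsorization clamps values to the 5th and 95th empirical percentiles. The function names and the sample numbers are hypothetical.

```python
def time_features(turn_minutes, longest_dialogue_minutes):
    """Basic time-on-task features for one dialogue (durations in minutes)."""
    total = sum(turn_minutes)
    return {
        "total_time": total,
        # Normalize by the longest dialogue in the corpus, as described in the text.
        "normalized_total_time": total / longest_dialogue_minutes,
        "avg_time_per_turn": total / len(turn_minutes),
    }

def winsorize(values, lower_pct=0.05, upper_pct=0.95):
    """Clamp values to approximate empirical percentiles (assumed cut-offs)."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_pct * (n - 1))]
    high = ordered[int(upper_pct * (n - 1))]
    return [min(max(v, low), high) for v in values]

# Hypothetical data: one student's turn durations and total times for five students.
features = time_features([0.5, 1.0, 1.5, 1.0], longest_dialogue_minutes=8.0)
print(features)  # {'total_time': 4.0, 'normalized_total_time': 0.5, 'avg_time_per_turn': 1.0}
print(winsorize([10, 12, 11, 13, 60]))  # [10, 12, 11, 13, 13]
```

Winsorizing limits the influence of outliers such as a student who leaves the session idle for a long stretch, which would otherwise inflate total time.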
Generation features are about the amount of text produced by students. Greater word production has been shown to be related to deeper levels of comprehension (Chi et al. 2001; VanLehn et al. 2007). We mined from our dialogues many generation-related features such as dialogue length, average turn length, vocabulary size, content word vocabulary size (content words: nouns, verbs, adjectives, and adverbs), and target domain vocabulary size, i.e. a measure of how many words from our target domain, which is Physics, students used.
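The sketch below illustrates how several generation features could be derived from a student's turns. It is deliberately simplified and is not the extraction pipeline used in the study: content words are approximated by excluding a tiny function-word list rather than by part-of-speech tagging, the physics glossary is a made-up sample, and the name `physics_vocSize` is hypothetical.

```python
import re

# Tiny illustrative word lists (assumptions); a real system would rely on a
# part-of-speech tagger and a curated domain glossary.
FUNCTION_WORDS = {"the", "a", "an", "is", "on", "of", "and", "it", "to", "in"}
PHYSICS_TERMS = {"force", "mass", "acceleration", "velocity", "gravity", "friction"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (punctuation is dropped)."""
    return re.findall(r"[a-z']+", text.lower())

def generation_features(student_turns):
    turns = [tokenize(turn) for turn in student_turns]
    words = [word for turn in turns for word in turn]
    vocabulary = set(words)
    return {
        "dialogue_size": len(words),
        "avg_dialogue_size_per_turn": len(words) / len(turns),
        "vocSize": len(vocabulary),
        "content_vocSize": len(vocabulary - FUNCTION_WORDS),
        "non_content_vocSize": len(vocabulary & FUNCTION_WORDS),
        "physics_vocSize": len(vocabulary & PHYSICS_TERMS),
        "dialogue_length_div_voc": len(words) / len(vocabulary),
    }

turns = ["The force of gravity is constant", "The mass is the same"]
gen = generation_features(turns)
print(gen["dialogue_size"], gen["vocSize"], gen["content_vocSize"], gen["physics_vocSize"])
# 11 8 5 3
```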
Lastly, we extracted pedagogy-related features such as how much scaffolding a student received (e.g. number of hints) during the training. Scaffolding is well documented to lead to more learning than lecturing or other, less interactive types of instruction such as reading a textbook (VanLehn et al. 2007). Feedback is an important part of scaffolding and therefore we also extracted features about the type (positive, neutral, negative) and frequency of feedback (Shute 2008).
We extracted raw features as well as normalized versions of the features. In some cases, the normalized versions seem to be both more predictive and more interpretable. For instance, the number of hints could vary a lot from simpler/short problems, where the solution is relatively short and requires less scaffolding in general, to more complex problems with longer solutions which require more scaffolding as there are more steps in the solution. A normalized feature such as percentage of hints would allow us to better compare the level of scaffolding in terms of hints across problems of varying complexity or solution length. In our case, we normalized the number of hints by using the maximum number of hints a student may get for a particular problem, which happens when the student responds entirely incorrectly to every single hint from the computer tutor. We can infer the largest number of helpful moves, i.e. hints, from our dialogue management component a priori.
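The hint normalization just described amounts to a simple ratio. The snippet below is an illustrative sketch; the function name `hint_percentage` and the example counts are hypothetical.

```python
def hint_percentage(hints_shown, max_hints):
    """Hints shown as a percentage of the maximum possible for the problem."""
    if max_hints <= 0:
        raise ValueError("max_hints must be positive")
    return 100.0 * hints_shown / max_hints

# Hypothetical example: a short problem (at most 6 hints) and a long one (at most 12).
print(hint_percentage(3, 6))    # 50.0
print(hint_percentage(6, 12))   # 50.0
```

The raw counts (3 versus 6) suggest that the second dialogue involved twice as much scaffolding, whereas the normalized values show that the student needed the same share of the available help on both problems.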
Statistics of the dialogue corpus
total_time: the time length of the dialogue in minutes
avg_time_per_turn: the average length of a student turn in minutes
dialogue_size: total length of the student dialogue (#words, excl. punctuation)
avg_dialogue_size_per_turn: average length of a student turn (#words, excl. punctuation)
dialogue_length_div_voc: dialogue_size divided by student’s vocabulary size
#chunks: total number of syntactic constituents or chunks
#sentences: total number of sentences
content_vocSize: the vocabulary size of content words
non_content_vocSize: the vocabulary size of non-content words
vocSize: total vocabulary size
%physicsTerms: percentage of physics related terms out of all the words used
%longWords: percentage of long words out of those used
%punctuation: percentage of punctuation out of all tokens used
%articles: percentage of articles such as an or the out of all the words used
%pronouns: # of non-self-reference pronouns (you, they) out of all words
%self-references: # of self-reference pronouns (me or we) out of all words
totalIC: total Information Content of the dialogue
positiveness: text positiveness computed based on SentiWordNet
negativeness: text negativeness
#turns: total number of student’s turns
#normalized total number of student turns
#c_turns: number of student turns classified as contributions (no questions)
%pos_fb: percentage of turns for which student received positive feedback
%neg_fb: percentage of turns for which student received negative feedback
pos_div_pos+neg: positive feedback divided by (positive+negative) feedback
#shownHints: total number of shown hints
#shownPrompts: total number of shown prompts, a type of hints
#shownPumps: total number of shown pumps, a type of hints
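As an illustration of how the feedback-related features in this list could be computed, consider the sketch below. It assumes that each student turn has already been labeled with the type of feedback the tutor gave (positive, neutral, or negative); the function name and the sample labels are hypothetical.

```python
from collections import Counter

def feedback_features(feedback_per_turn):
    """Compute feedback ratios from one feedback label per student turn."""
    counts = Counter(feedback_per_turn)
    n_turns = len(feedback_per_turn)
    pos, neg = counts["positive"], counts["negative"]
    return {
        "%pos_fb": 100.0 * pos / n_turns,
        "%neg_fb": 100.0 * neg / n_turns,
        # Undefined when the student received only neutral feedback.
        "pos_div_pos+neg": pos / (pos + neg) if pos + neg else None,
    }

fb = feedback_features(["positive", "neutral", "negative", "positive"])
print(fb["%pos_fb"], fb["%neg_fb"])  # 50.0 25.0
```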
Statistics of the dialogue corpus
# of complete dialogues
# of dialogue turns
Experiments and results
Our goal was to understand how various characteristics associated with dialogue units corresponding to instructional tasks in a session relate to students’ prior knowledge as measured by the pre-test, which is deemed as an accurate estimate of students’ prior knowledge level. Our first step towards this goal was to do a feature analysis which is described next.
Correlations values with pre-test (top) and pre-test-FM (bottom) for the most interesting features on each of the 8 problems in the micro-adaptive-only condition
From Table 3 one can see that, with some exceptions for problem 5, the time length (ft1), the total number of sentences (fg7), the number of turns (fs1), and the number of hints (fs11) and prompts shown (fs12) have negative correlations with the pre-test scores, while the average word-length of a turn (fg2) and the percentage of turns receiving positive feedback (fs7) have positive correlations. These findings confirm similar findings from previous studies (VanLehn et al. 2007; Stefanescu et al. 2014). Interestingly enough, the number of sentences students produce seems to be less and less correlated with the pre-test scores as the students advance through the training session.
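For readers who want to reproduce this kind of feature analysis on their own data, the per-feature correlation with the pre-test can be computed as a Pearson coefficient. The sketch below is generic and uses made-up numbers, not values from the study.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Hypothetical data for four students on one problem.
pre_test_scores = [10, 14, 18, 22]
hints_shown = [9, 7, 4, 2]

r = pearson_r(pre_test_scores, hints_shown)
print(round(r, 3))  # strongly negative: more prior knowledge, fewer hints
```

Running the same computation separately for each problem and each feature yields a table of correlations of the kind summarized above.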
Correlations values with pre-test (top) and pre-test-FM (bottom) for the most interesting features on each of the 8 problems in the fully-adaptive condition – the high-knowledge group of students
Predicting students’ knowledge level
To predict students’ knowledge levels, we generated regression models from subsets of consecutive problems in a training session in order to understand after how many problems the prediction of students’ knowledge level is best. The models were generated not only based on all the available features, but also on subsets of features corresponding to the three major categories of features: Time-on-Task, Generative, and Pedagogy/Scaffolding. All the models were generated using the Backward method in SPSS, so as to be able to find the r value corresponding to the highest adjusted r square value and the lowest degrees of freedom (fewest predictors). It is important to note that in the fully-adaptive condition the models were generated separately for the four groups of students corresponding to the four knowledge levels.
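The selection procedure can be approximated outside SPSS as follows. This is a simplified stand-in, not the SPSS Backward method itself (which uses a significance-based removal criterion): here a predictor is dropped whenever doing so does not lower the adjusted r square, computed as 1 - (1 - R^2)(n - 1)/(n - p - 1) for n students and p predictors. All function names, feature names, and data are hypothetical, and the same routine would be applied to features aggregated over the first k problems for each k.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def adjusted_r2(rows, y, cols):
    """Fit ordinary least squares with an intercept; return the adjusted R^2."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    predicted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ss_res / ss_tot
    n, p = len(y), len(cols)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def backward_select(rows, y, cols):
    """Drop predictors one at a time while adjusted R^2 does not decrease."""
    current = list(cols)
    best = adjusted_r2(rows, y, current)
    while len(current) > 1:
        score, drop = max(
            (adjusted_r2(rows, y, [c for c in current if c != d]), d) for d in current
        )
        if score < best:
            break  # every removal hurts the fit, so keep the current model
        best, current = score, [c for c in current if c != drop]
    return current, best

# Hypothetical per-student features and pre-test scores.
rows = [
    {"pct_hints": 90, "noise": 3}, {"pct_hints": 80, "noise": 1},
    {"pct_hints": 70, "noise": 4}, {"pct_hints": 60, "noise": 1},
    {"pct_hints": 50, "noise": 5}, {"pct_hints": 40, "noise": 9},
    {"pct_hints": 30, "noise": 2}, {"pct_hints": 20, "noise": 6},
]
pre_test = [8, 10, 13, 15, 18, 20, 22, 25]

selected, score = backward_select(rows, pre_test, ["pct_hints", "noise"])
print(selected, round(score, 3))
```

Preferring the smaller model on ties mirrors the stated aim of pairing the highest adjusted r square with the fewest predictors.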
r (top) and adjusted r square (bottom) values for cumulative sub-dialogues in the micro-adaptive-only condition and the pre-test
r (top) and adjusted r square (bottom) values for cumulative sub-dialogues in the micro-adaptive-only condition and the pre-test-FM
r (top) and adjusted r square (bottom) values for cumulative sub-dialogues in the fully-adaptive condition and the pre-test
r (top) and adjusted r square (bottom) values for cumulative sub-dialogues in the fully-adaptive condition and the pre-test-FM
Conclusions and future work
We explored in this article models to predict students’ prior knowledge based on features characterizing the dialogue-based interaction between a computer-based tutor and a learner. This work was part of our greater goal to move towards non-intrusive assessment methods that would allow learners to focus on the major task, e.g. solving problems or playing a game, and improve their learning experience by eliminating test anxiety and fatigue effects.
Our results are quite promising with respect to moving towards a world in which learners focus on instruction with no explicit testing. Indeed, our linear regression models based on a number of interaction features yielded in the best cases an r = 0.949 and adjusted r-square = 0.833. This best result was obtained when developing prediction models using the data from the fully-adaptive ITS. This is expected because in the fully-adaptive case the models were more specialized, i.e. we derived prediction models for each of the four student knowledge levels: low knowledge, medium-low knowledge, medium-high knowledge, and high knowledge. It should be noted that the best results for the prediction model derived from the micro-adaptive-only ITS data were very good too: r = 0.878 and adjusted r-square = 0.693. Furthermore, scaffolding features seemed to be the most predictive as a group, as might be anticipated in a tutorial context, followed by content-generation features.
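For readers comparing the r and adjusted r-square figures, the standard definition of adjusted r-square is given below. The symbols n (number of students in the fitted sample) and p (number of predictors retained) are generic; their values for the best models are not restated in this section, so the formula is offered only as a reminder of why the adjusted value is lower than the square of r.

```latex
\bar{R}^{2} \;=\; 1 - \left(1 - R^{2}\right)\frac{n - 1}{n - p - 1},
\qquad R^{2} = r^{2}
```

For the best fully-adaptive model, r = 0.949 gives R^2 ≈ 0.901; the reported adjusted value of 0.833 reflects the penalty for the number of predictors relative to the sample size.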
Our findings have two important implications for the future development of ITSs that would integrate non-intrusive assessment methods such as the ones proposed in this article. First, the best models derived from the micro-adaptive-only sessions provide a better estimate of the accuracy ITS developers should expect when predicting learners’ prior knowledge level in future ITSs, and these models should be integrated first in such ITSs, even though they are less accurate (although still quite accurate) than the more specialized models derived from the fully-adaptive ITS data. The reason is that, in order to use the fully-adaptive models, the ITS needs a guess or some a priori measurement of the learner’s knowledge so that it can decide which fully-adaptive model to use for a more precise measurement based on the learner’s performance on the tasks in the tutorial session. However, giving learners a pre-test in order to infer their knowledge level first would defeat the purpose of our approach: inferring learners’ prior knowledge level from characteristics of the tutor-learner interaction only, without an explicit pre-test. When a learner’s knowledge level is known a priori, e.g. from a recent classroom test, and is available as input to the ITS, the ITS could simply trigger the more specialized and more accurate prediction model corresponding to that knowledge level, without the need to use the micro-adaptive-only prediction model.
Second, the fully-adaptive models’ high accuracy can be interpreted as validating the set of selected instructional tasks, i.e. Physics problems in our case, in the tutorial session. Task selection is a critical step in a computer tutor because it has major implications for the effectiveness of the system. If the tasks are too easy, the learner becomes bored and disengages; if the tasks are too difficult, the learner becomes frustrated and, again, disengages, to the point that in some cases she might even quit using the tutoring system. The tasks should therefore be at the right level of difficulty, neither too easy nor too difficult, in order to stimulate the learner and keep her engaged in the learning process throughout the whole tutorial session. That is, the role of the intelligent tutoring system is to keep the learner in the zone of proximal development (Vygotsky 1978) through a set of tasks appropriate to the learner’s current knowledge state. In this sense, having components that could monitor the quality of the selected tasks would be very beneficial. It should be noted that, because task selection is an upstream step in the tutorial process, any bad decision regarding task selection propagates to later, downstream tutoring stages. To illustrate our point, imagine an ITS with a perfect micro-adaptive module which would provide ideal scaffolding to each learner working on a particular Physics problem. Even if the scaffolding within a task were optimal, learners would not learn much if the Physics problem were far below their knowledge level. Moreover, as mentioned earlier, the learner would feel bored and in the worst case might decide to quit using the tutoring system.
Our recommendation is that future developers of ITSs should implement both types of models: the micro-adaptive-only models are needed to get a sense of learners’ knowledge level without an explicit pre-test while the fully-adaptive models are needed to monitor and validate learners’ knowledge level and the quality of the instructional tasks throughout the entire tutorial session.
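A minimal sketch of how an ITS might hold both kinds of models is given below. It assumes a hypothetical `select_model` helper and treats each fitted regression model as an opaque function from interaction features to a predicted score; the constant-valued stand-in models are placeholders for illustration and carry no coefficients from the study.

```python
from typing import Callable, Dict, Optional, Sequence

# A "model" is any function mapping interaction features to a predicted score.
Model = Callable[[Sequence[float]], float]

# The four knowledge levels used for the specialized (fully-adaptive) models.
KNOWLEDGE_LEVELS = ("low", "medium-low", "medium-high", "high")

def select_model(general: Model,
                 specialized: Dict[str, Model],
                 known_level: Optional[str] = None) -> Model:
    """Pick the level-specific model when the level is known a priori,
    otherwise fall back to the general (micro-adaptive-only) model."""
    if known_level is None:
        return general
    if known_level not in specialized:
        raise ValueError(f"no specialized model for level {known_level!r}")
    return specialized[known_level]

# Placeholder models (hypothetical): each just returns a constant.
def general_model(features: Sequence[float]) -> float:
    return 0.5

def make_constant_model(value: float) -> Model:
    def model(features: Sequence[float]) -> float:
        return value
    return model

specialized_models: Dict[str, Model] = {
    level: make_constant_model(0.2 * (i + 1))
    for i, level in enumerate(KNOWLEDGE_LEVELS)
}

# No prior information: use the general model.
model_a = select_model(general_model, specialized_models)
# Level known, e.g. from a recent classroom test: use the specialized model.
model_b = select_model(general_model, specialized_models, known_level="high")
```

Keeping the fallback explicit makes the dependency stated above visible in code: the specialized models are reachable only once some estimate of the learner's level exists.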
We plan to further explore the topic of assessing students’ prior knowledge from dialogues by investigating affect-related features as well as by using other prediction mechanisms such as classifiers to predict categorical knowledge levels. Furthermore, we plan to study how similar models can predict post-test scores. We are aware that students’ knowledge levels evolve during training, assuming they learn, and therefore there are limitations to our methodology. We do plan to explore in the future ways to infer students’ knowledge levels throughout a session, e.g. by having a human expert read the transcripts of a tutoring session.
This research was supported by the Institute of Education Sciences (IES) under award R305A100875 to Dr. Vasile Rus. All opinions and findings presented here are solely the authors’.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- BS Bloom, The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educ. Res. 13, 4–16 (1984).
- MTH Chi, SA Siler, H Jeong, Can tutors monitor students’ understanding accurately? Cogn. Instr. 22(3), 363–387 (2004).
- MTH Chi, SA Siler, H Jeong, T Yamauchi, RG Hausmann, Learning from human tutoring. Cogn. Sci. 25(4), 471–533 (2001).
- T Corcoran, FA Mosher, A Rogat, Learning progressions in science: An evidence-based approach to reform (CPRE Research Report #RR-63). Consortium for Policy Research in Education. University of Pennsylvania (2009). http://eric.ed.gov/?id=ED506730.
- K Forbes-Riley, DJ Litman, in Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics. Modelling user satisfaction and student learning in a spoken dialogue tutoring system with generic, tutoring, and user affect parameters (Association for Computational Linguistics, New York, New York, 2006), pp. 264–271.
- AC Graesser, NK Person, JP Magliano, Collaborative dialogue patterns in naturalistic one-to-one tutoring. Appl. Cogn. Psychol. 9, 495–522 (1995).
- D Hestenes, M Wells, G Swackhamer, Force concept inventory. Phys. Teach. 30(3), 141–158 (1992).
- M Lintean, V Rus, R Azevedo, Automatic detection of student mental models during prior knowledge activation in metatutor. Int. J. Artif. Intell. Educ. 21(3), 169–190 (2012).
- L Mohan, J Chen, CW Anderson, Developing a multi-year learning progression for carbon cycling in socio-ecological systems. J. Res. Sci. Teach. 46(6), 675–698 (2009).
- C Moldovan, V Rus, AC Graesser, in The Proceedings of 22nd Midwest Artificial Intelligence and Cognitive Science Conference. Automated speech act classification for online chat, (2011), pp. 23–29.
- P Pavlik, JR Anderson, Using a model to compute the optimal schedule of practice. J. Exp. Psychol. Appl. 14(2), 101 (2008).
- JW Pennebaker, ME Francis, RJ Booth, Linguistic Inquiry and Word Count: LIWC 2001 (Lawrence Erlbaum Associates, Mahwah, 2001).
- HL Roediger, JD Karpicke, The power of testing memory: Basic research and implications for educational practice. Perspect. Psychol. Sci. 1(3), 181–210 (2006).
- C Romero, M-I López, J-M Luna, S Ventura, Predicting students’ final performance from participation in on-line discussion forums. Comput. Educ. 68, 458–472 (2013).
- V Rus, W Baggett, E Gire, D Franceschetti, M Conley, A Graesser, in Design Recommendations for Intelligent Tutoring Systems: Learner Modeling, 1, ed. by R Sottilare, AC Graesser, X Hu, and H Holden. Toward learner models based on Learning Progressions in DeepTutor (Army Research Laboratory, Orlando, FL, 2013), pp. 183–192.
- V Rus, S D’Mello, X Hu, A Graesser, Recent advances in conversational intelligent tutoring systems. AI Mag. 34(3), 42–54 (2013).
- V Rus, M Lintean, R Banjade, NB Niraula, D Stefanescu, in ACL (Conference System Demonstrations). Semilar: The semantic similarity toolkit (Citeseer, 2013), pp. 163–168.
- V Rus, AC Graesser, W Baggett, D Franceschetti, D Stefanescu, N Niraula, in S Trausan-Matu, K Boyer, M Crosby, K Panou (eds.), Macro-adaptation in conversational intelligent tutoring matters. Automated response to questions with production rules (Springer International Publishing, Switzerland, 2014).
- MD Shermis, J Burstein, Automated Essay Scoring: A Cross-disciplinary Perspective (Lawrence Erlbaum Associates, Inc., Hillsdale, NJ, 2003).
- VJ Shute, Focus on formative feedback. Rev. Educ. Res. 78(1), 153–189 (2008).
- VJ Shute, M Ventura, Stealth Assessment: Measuring and Supporting Learning in Video Games (MIT Press, Cambridge, 2013).
- D Stefanescu, V Rus, AC Graesser, in Educational Data Mining 2014. Towards assessing students’ prior knowledge from tutorial dialogues (International Educational Data Mining Society, 2014).
- R Taraban, K Rynearson, Computer-based comprehension research in a content area. J. Dev. Educ. 21(3), 10 (1998).
- K VanLehn, The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educ. Psychol. 46(4), 197–221 (2011).
- K VanLehn, AC Graesser, GT Jackson, PW Jordan, A Olney, CP Rosé, When are tutorial dialogues more effective than reading? Cogn. Sci. 31(1), 3–62 (2007).
- LS Vygotsky, Mind in Society: The Development of Higher Psychological Processes (Harvard University Press, Cambridge, 1978).
- C Williams, S D’Mello, in Intelligent Tutoring Systems. Predicting student knowledge level from domain-independent function and content words (Springer-Verlag, Berlin, Heidelberg, 2010), pp. 62–71.
- BP Woolf, Building Intelligent Interactive Tutors: Student-centered Strategies for Revolutionizing E-learning (Morgan Kaufmann, Elsevier, Burlington, 2008).
- J Yoo, J Kim, in Intelligent Tutoring Systems. Predicting learner’s project performance with dialogue features in online q&a discussions (Springer, 2012), pp. 570–575.