Evaluating the quality of the ontology-based auto-generated questions
Smart Learning Environments volume 4, Article number: 7 (2017)
Abstract
An ontology is a knowledge representation structure which has been used in Virtual Learning Environments (VLEs) to describe educational courses by capturing the concepts and the relationships between them. Several ontology-based question generators have used ontologies to auto-generate questions that aim to assess students at different levels in Bloom’s taxonomy. However, the evaluation of the questions was confined to measuring the qualitative satisfaction of domain experts and students. None of the question generators tested the questions on students and analysed the quality of the auto-generated questions by examining each question’s difficulty and its ability to discriminate between high ability and low ability students. Because of this lack of quantitative analysis, there is no evidence on the quality of the questions, or on how that quality is affected by the ontology-based generation strategies and the level of the question in Bloom’s taxonomy (determined by the question’s stem templates). This paper presents an experiment carried out to address these drawbacks by achieving two objectives. First, it assesses the auto-generated questions’ difficulty, discrimination, and reliability using two statistical methods: Classical Test Theory (CTT) and Item Response Theory (IRT). Second, it studies the effect of the ontology-based generation strategies and the level of the questions in Bloom’s taxonomy on the quality of the questions. This will provide guidance for developers and researchers working in the field of ontology-based question generators, and help in building a prediction model using machine learning techniques.
Introduction
An ontology is a formal and explicit specification of a shared conceptualisation (Uschold and Gruninger 1996; Studer et al. 1998; Borst 1997). It is a knowledge representation structure which models a specific domain of interest by providing a formal machine-readable representation of entities in the domain. Entities include classes, individuals, and properties. Classes represent sets of individuals, individuals represent actual objects in the domain, and properties represent relationships in the domain between individuals.
Ontologies have been used in Virtual Learning Environments (VLEs) to capture the concepts in an educational course (Gruber 1993). Sakthi (Murugan et al. 2013) developed an ontology which captures concepts in the computer networks domain such as the network topology, the communication medium, and the Open Systems Interconnection (OSI) model. Lee et al. (2005), Kouneli et al. (2012), and Ganapathi et al. (2017) developed ontologies which capture the educational concepts in introductory Java language courses (Arnold et al. 1996). The ontologies aimed to teach students the fundamental concepts of programming in Java.
On the other hand, some ontologies were not developed to capture particular domains. Instead, they aimed to provide the world’s largest and most complete knowledge base covering different domains. Among these ontologies is the OpenCyc ontology (OpenCyc). OpenCyc covers several domains such as mathematics, physics, medicine, computer networks and many others, and it consists of hundreds of thousands of concepts and properties.
Ontologies have been used by several ontology-based question generators to auto-generate true and false, multiple choice, and short answer assessment questions. The question generators used several ontology-based generation strategies which exploit the ontology classes, individuals, and properties. The ontology-based generation strategies can be categorised into the following three main strategies (Papasalouros et al. 2017; 2011; Cubric and Tosic 2017; Grubisic 2012; Grubisic et al. 2013; Al-Yahya 2014):

1. The class-based strategy, which uses the relationship between the ontology classes and individuals.
2. The terminology-based strategy, which uses the relationship between the class and subclass in ontologies.
3. The property-based strategy, which uses the object, datatype, and annotation properties in the ontologies.
Papasalouros et al. (2017; 2011) defined the class-based, terminology-based, and property-based generation strategies, which traverse the domain ontology and auto-generate the multiple choice question’s correct answer (key) and incorrect answers (distractors). The three main strategies contain several sub-strategies which specify the classes, individuals or properties from which the question’s key and distractors are generated. For example, Table 1 illustrates a multiple choice question generated using Papasalouros’s terminology-based strategy. The question was generated from Sakthi’s computer network ontology. The question’s key is a subclass of the concept OSI model and the question’s distractors are sibling classes of the OSI model class. Table 1 also shows that the question had the "Choose the correct sentence" text, which is called the question’s stem and is used in all the questions generated using Papasalouros’s question generator.
Cubric and Tosic (2017) built a question generator which used the ontology-based generation strategies defined by Papasalouros. However, they extended the property-based strategies to include more sub-strategies, which used the annotation properties in the ontology. Moreover, instead of using Papasalouros’s stem template, which is not related to an educational theory, Cubric and Tosic defined a set of stem templates which aimed to assess student cognition at different levels in Bloom’s taxonomy, which is widely used in educational research (Bloom 1956; Krathwohl 2002; Anderson and Sosniak 1994). Bloom’s taxonomy categorises assessment questions into the following six major levels, which are arranged in a hierarchical order according to the complexity of the cognitive process involved (Bloom 1956; Krathwohl 2002; Assessment 2002): 1) Knowledge: at this level the students need only to recall certain concepts in the domain. For example, students need to list, define, and describe specific concepts in the domain without understanding how they are related to other concepts. 2) Comprehension: at this level the students need to start thinking about the meaning of the concepts in terms of their relationship with other concepts in the domain. 3) Application: at this level the students need to demonstrate their ability to use the concepts they have learned in real situations. For example, the students need to provide and show examples that prove their understanding of the domain concepts. 4) Analysis: at this level the students need to understand the domain terminology structure. For example, the students need to have a good overview of the concepts in the domain by analysing how they are classified and related to each other. 5) Synthesis: at this level the students should be able to relate concepts from different domains to create and develop new ideas. 6) Evaluation: at this level the students need to make judgments, assess and compare ideas, and evaluate the data.
Each level in Bloom’s taxonomy is subsumed by the higher levels; for example, a student functioning at the application level has mastered the educational concepts in the knowledge and comprehension levels (Bloom 1956). Bloom associated the hierarchical order of the levels with the question’s difficulty (Bloom 1956); for example, knowledge level questions are easier than questions which assess other levels in Bloom’s taxonomy, and synthesis and evaluation questions are more difficult than comprehension level questions (Bloom 1956).
Cubric and Tosic (2017) generated questions which assess students at the knowledge, comprehension, application and analysis levels only. Grubisic (2012) and Grubisic et al. (2013) followed a similar approach to Cubric and Tosic by defining a set of question stem templates which assess students’ cognition at the knowledge, comprehension, application and analysis levels. However, unlike the previous work, Grubisic generated different types of questions (true and false, multiple choice, and short answer). Moreover, she ignored the class-based strategies, and only used the terminology-based and property-based strategies to traverse the ontology and generate assessment questions.
Grubisic (2012) and Grubisic et al. (2013) used ontology-based generation strategies similar to those of Papasalouros. However, fewer restrictions were applied when selecting the distractors in the generated questions. For example, if a question is generated to assess students on the educational concept EC, Papasalouros required the distractor to be one of the siblings of class EC, while Grubisic allowed any class to be selected randomly from the ontology as long as it has no relationship with EC.
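The contrast between the two distractor-selection rules can be sketched on a toy class hierarchy. The following Python fragment is only an illustration under stated assumptions: the hierarchy, the class names, and the helper functions are hypothetical and are not taken from either question generator, and "no relationship with EC" is approximated here as "neither a superclass nor a subclass of EC".

```python
import random

# Hypothetical toy hierarchy (child class -> parent class); not from the paper.
PARENT = {
    "TransportLayer": "OSILayer",
    "NetworkLayer": "OSILayer",
    "SessionLayer": "OSILayer",
    "OSILayer": "Thing",
    "Topology": "Thing",
    "StarTopology": "Topology",
}

def ancestors(cls):
    """Return all superclasses of cls, nearest first."""
    out = []
    while cls in PARENT:
        cls = PARENT[cls]
        out.append(cls)
    return out

def sibling_distractors(ec):
    """Sibling rule: candidate distractors share ec's direct parent."""
    parent = PARENT.get(ec)
    return sorted(c for c, p in PARENT.items() if p == parent and c != ec)

def unrelated_distractors(ec):
    """Relaxed rule (assumed reading): any class that is neither a
    superclass nor a subclass of ec."""
    related = set(ancestors(ec)) | {ec}
    all_classes = set(PARENT) | set(PARENT.values())
    return sorted(c for c in all_classes
                  if c not in related and ec not in ancestors(c))

def pick(candidates, k, seed=0):
    """Randomly pick up to k distractors (seeded for repeatability)."""
    return random.Random(seed).sample(candidates, min(k, len(candidates)))
```

With this toy data the sibling rule offers only the two other OSI layers for `TransportLayer`, whereas the relaxed rule also admits the topology classes, which gives a larger but potentially less plausible pool of distractors.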
Al-Yahya (2014; 2011) also built a question generator for auto-generating true and false, multiple choice, and short answer questions using class-based and property-based strategies. She defined question stem templates which aimed only to assess students’ cognition at the knowledge level in Bloom’s taxonomy (Al-Yahya 2014; 2011). Al-Yahya followed Grubisic in allowing distractors to be randomly selected from the domain ontology.
The ontology-based question generators discussed above evaluated the auto-generated questions. However, the evaluation of the questions was confined to measuring the qualitative satisfaction of domain experts and students, who agreed that the auto-generated questions could be used as assessment questions in learning environments. None of the ontology-based question generators tested the questions on students to analyse the quality of the auto-generated questions by examining each question’s difficulty and its ability to discriminate between high ability and low ability students. In addition, the question generators auto-generated different types of questions using different ontology-based generation strategies. However, none of the ontology-based question generators studied the effect of the ontology-based generation strategies and the level of the question in Bloom’s taxonomy on the quality of the questions generated. Therefore, this paper makes the following contributions to knowledge:

1. developing an ontology-based question generator which integrates the stem templates and generation strategies introduced by Papasalouros et al. (2017; 2011), Cubric and Tosic (2017), Grubisic (2012), Grubisic et al. (2013), and Al-Yahya (2014; 2011). The generator can be used to generate questions from any domain ontology. In addition, it helps in evaluating the quality of questions quantitatively. This helps researchers auto-generate questions with specific characteristics (e.g., high discrimination);
2. quantitatively analysing the quality of ontology-based auto-generated questions for the first time;
3. quantitatively analysing the quality of assessment tests formed from the ontology-based auto-generated questions; and
4. studying the effect of different ontology-based generation strategies and the level of the question in Bloom’s taxonomy on the quality of the questions generated.
This paper is structured as follows: Section “Related work” reviews the analyses used by existing question generators and presents the limitations of these analyses and the importance of our study. Section “Evaluation methods” explains the evaluation methods used in this paper to evaluate the quality of auto-generated questions. Section “Experimental study” presents the experimental study. Section “Results and discussion” illustrates the experiment results. Finally, Section “Conclusion and future work” concludes the paper and suggests future work.
Related work
Different qualitative and quantitative analyses were carried out to evaluate questions auto-generated from domain ontologies (Alsubait et al. 2014; Al-Yahya 2014; Vinu and Kumar 2017; Seyler et al. 2016; Susanti et al. 2017). Papasalouros et al. (2017; 2011) auto-generated multiple choice questions (MCQs) from the Eupalineio Tunnel ontology, which is a domain ontology about ancient Greek history. The questions were evaluated by two domain experts who found that all the questions were satisfactory for assessment regardless of some errors in the questions’ syntax (75% of the MCQs were assessed as syntactically correct) (Papasalouros et al. 2017). Cubric and Tosic (2017) developed an online environment where users could upload their domain ontologies, auto-generate MCQs, and evaluate the questions created by them or other users in the environment. The users evaluate the auto-generated questions by determining the question quality (the question is easy to understand and the grammar is correct), and the question usability (the question could be used in an assessment test). Cubric and Tosic did not publish any evaluation results.
Grubisic (2012) and Grubisic et al. (2013) evaluated the questions auto-generated from the ‘computer as system’ domain ontology using two groups of students. The first group consisted of 14 students who had good prior knowledge in the ‘computer as system’ domain. However, the students had no experience working with VLEs. The second group consisted of 16 students who had learned about the ‘computer as system’ domain three years before the experimental study was carried out and had a good knowledge of different VLEs. 21% of the students in the first group found the questions comprehensible while 29% had a neutral opinion, and 50% found the questions incomprehensible (Grubisic et al. 2013). On the other hand, 38% of the students in the second group found the questions comprehensible, 38% had a neutral opinion, and 24% found the questions incomprehensible (Grubisic et al. 2013). Grubisic concluded that the students in the second group, who were more mature (they took the ‘computer as system’ course three years before the experiment was carried out) and who had more experience working with different VLEs, were more satisfied in terms of understanding the ontology-based generated questions.
Al-Yahya (2014; 2011) auto-generated true and false, multiple choice, and short answer questions from several domain ontologies such as the travel ontology, which captures information about travel destinations and hotels (Protege ontology library - protege wiki 2017). She evaluated the auto-generated questions by assessing whether the questions were syntactically correct and whether they were suitable to be used in an assessment test. Al-Yahya’s evaluation results revealed that 90% of the questions generated were syntactically correct and could be used as assessment questions (Al-Yahya 2011). Al-Yahya carried out a further evaluation, using three domain experts, to assess whether the auto-generated MCQs were syntactically correct and could be used as assessment questions. The experts had experience in formulating MCQs and were asked to assess the MCQs generated from two domain ontologies: an ontology which captures the Arabic vocabulary (Al-Yahya et al. 2010) and a history ontology in Arabic which captures the historical concepts taught to students in the 8th grade (Al-Yahya 2014; 2011). The experts agreed that 82% of the MCQs generated from the Arabic vocabulary ontology were syntactically correct and could be used as assessment questions, while 60% of the MCQs generated from the history ontology were syntactically correct and could be used as assessment questions (Al-Yahya 2014). Al-Yahya stated that the difference in the evaluation results was due to the content of the domain ontologies, as the MCQs which were classified as unacceptable in the history ontology dealt with common sense or general knowledge. This was not the case in the Arabic vocabulary ontology (Al-Yahya 2014).
In summary, the ontology-based question generators mentioned above have the following limitations. Firstly, the evaluation of the auto-generated questions was confined to measuring the qualitative satisfaction of domain experts and students, who agreed that the auto-generated questions could be used as assessment questions in learning environments. However, none of the ontology-based question generators tested the questions on students to analyse the quality of the auto-generated questions by examining each question’s difficulty and its ability to discriminate between high ability and low ability students. Secondly, none of the ontology-based question generators studied the effect of the ontology-based generation strategies and the level of the question in Bloom’s taxonomy on the quality of the questions generated. Therefore, Section “Evaluation methods” presents the evaluation methods used in this paper to evaluate the questions quantitatively.
Evaluation methods
This section presents two statistical methods which have been used to evaluate the quality of ontology-based generated questions.
Classical Test Theory
Classical Test Theory (CTT) is used to evaluate the quality of questions and assessment tests in learning environments using the statistical measures described in the following sections (Alagumalai and Curtis 2005; Ding and Beichner 2009; Doran 1980; Cohen et al. 2013; Erguven 2014).
Question difficulty index
The question’s difficulty index (P) measures how easy the question is, and it is defined as the proportion of students choosing the correct answer (Ding and Beichner 2009; Doran 1980; Cohen et al. 2013; Schmidt and Embretson 2003):
\[ P = \frac{N_{1}}{N} \qquad (1) \]
Where \(N_{1}\) is the number of correct answers and \(N\) is the total number of students taking the test. P values range from 0 to 1. Table 2 shows that questions with high difficulty indices are easy while questions with low difficulty indices are difficult.
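As a small illustration of the difficulty index, the proportion of correct answers can be computed directly from the scored responses to one question. This is a minimal sketch; the function name and the example data are hypothetical and not part of the study.

```python
def difficulty_index(responses):
    """Difficulty index P: proportion of students who answered correctly.

    responses is a list of 0/1 scores for a single question,
    one entry per student (1 = correct, 0 = incorrect).
    """
    if not responses:
        raise ValueError("at least one response is required")
    return sum(responses) / len(responses)
```

For example, three correct answers out of four students give `difficulty_index([1, 1, 1, 0])`, which evaluates to 0.75 and would therefore describe a relatively easy question.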
Question discrimination index
The question’s discrimination index (D) measures how well the question discriminates between high ability students (students with high scores) and low ability students (students with low scores) (Ding and Beichner 2009; Doran 1980; Cohen et al. 2013). The discrimination index is defined as the difference between the proportion of the top quartile students who answered the question correctly and the proportion of the bottom quartile students who answered the question correctly (Ding and Beichner 2009; Doran 1980):
\[ D = \frac{N_{H} - N_{L}}{N/4} \qquad (2) \]
Where \(N_{H}\) and \(N_{L}\) are the numbers of correct answers in the top quartile and bottom quartile, and \(N\) is the total number of students taking the test. Table 2 shows that questions with discrimination indices < 0.3 have low discrimination, while questions with discrimination indices ≥ 0.6 have high discrimination.
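The quartile-based definition can be sketched as follows. This is an illustrative Python fragment under stated assumptions: the quartile size is taken as the integer part of N/4 (which equals N/4 only when N is divisible by four), students with tied total scores are kept in input order, and the function name and data are hypothetical.

```python
def discrimination_index(total_scores, item_scores):
    """Discrimination index from the top and bottom quartiles by total score.

    total_scores[s] is student s's total test score;
    item_scores[s] is 1 if student s answered this question correctly, else 0.
    """
    n = len(total_scores)
    q = n // 4  # assumed quartile size; equals N/4 when N is a multiple of 4
    if q == 0:
        raise ValueError("at least four students are required")
    # Rank students by total score (stable sort: ties keep input order).
    order = sorted(range(n), key=lambda s: total_scores[s])
    n_low = sum(item_scores[s] for s in order[:q])    # correct in bottom quartile
    n_high = sum(item_scores[s] for s in order[-q:])  # correct in top quartile
    return (n_high - n_low) / q
```

A question answered correctly by everyone in the top quartile and by nobody in the bottom quartile yields 1.0, while a question that every student answers correctly yields 0.0.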
Question reliability
The question’s reliability is measured using the point biserial correlation coefficient, which is the correlation between students’ scores in the question and students’ total scores (Ding and Beichner 2009; Schmidt and Embretson 2003; Brown 1996):
\[ R_{pbi} = \frac{\bar{x}_{1} - \bar{x}_{0}}{\sigma_{x}} \sqrt{P_{i}(1 - P_{i})} \qquad (3) \]
Where \(R_{pbi}\) is the point biserial correlation coefficient for question i, \(\bar{x}_{1}\) is the average total score of students who correctly answered question i, \(\bar{x}_{0}\) is the average total score of students who did not answer question i correctly, \(\sigma_{x}\) is the standard deviation of students’ total scores, and \(P_{i}\) is the difficulty index for question i. \(R_{pbi}\) ranges from −1 to 1, and a high \(R_{pbi}\) value means that students who selected the correct answer are students with high total scores and students who selected the incorrect answer are students with low total scores. Higher \(R_{pbi}\) values are better (Ding and Beichner 2009). The reliability is also used to measure the question’s discrimination. Table 2 shows that questions with \(R_{pbi}\) < 0.3 have low reliability (discrimination) while questions with \(R_{pbi}\) ≥ 0.6 have strong reliability (discrimination).
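The point biserial coefficient can be illustrated with a short computation. The sketch below assumes the common form in which the difference between the two group means is scaled by the population standard deviation of the total scores and by the square root of P(1 − P); under that assumption it coincides with the Pearson correlation between the 0/1 item score and the total score. The function name and data are hypothetical.

```python
import math

def point_biserial(total_scores, item_scores):
    """Point biserial correlation between a 0/1 item score and the total score."""
    n = len(total_scores)
    right = [t for t, x in zip(total_scores, item_scores) if x == 1]
    wrong = [t for t, x in zip(total_scores, item_scores) if x == 0]
    if not right or not wrong:
        raise ValueError("the question needs both correct and incorrect answers")
    mean = sum(total_scores) / n
    # Population standard deviation of the total scores (divides by N).
    sigma = math.sqrt(sum((t - mean) ** 2 for t in total_scores) / n)
    if sigma == 0:
        raise ValueError("total scores must not all be equal")
    p = len(right) / n  # difficulty index of the question
    mean_right = sum(right) / len(right)
    mean_wrong = sum(wrong) / len(wrong)
    return (mean_right - mean_wrong) / sigma * math.sqrt(p * (1 - p))
```

For four students with total scores 1, 2, 3, 4 where only the two highest scorers answer the question correctly, the coefficient is about 0.894, which would fall in the strong reliability band described above.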
Test discrimination power
The test discrimination power is measured using Ferguson’s delta (δ) (Ferguson 1949), which investigates how broadly the test scores are distributed over the possible range of scores (Zhang and Lidbury 2013). Ferguson’s delta (δ) is measured using the following formula:
\[ \delta = \frac{N^{2} - \sum_{i} f_{i}^{2}}{N^{2} - N^{2}/(k+1)} \qquad (4) \]
Where \(N\) is the total number of students who attempted the test, \(f_{i}\) is the number of students whose total score is i, and \(k\) is the number of questions in the test. δ ranges from 0 to 1, where 0 indicates that the test has minimal discrimination, which occurs when all students have the same score. On the other hand, a δ of 1 means that all possible scores occur in the test with the same frequency (Hankins 2007). A Ferguson’s delta (δ) value greater than 0.9 is considered good discrimination, as it represents a normal distribution of scores (Kline 1986; 2013a; 2013b).
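Ferguson’s delta can be sketched from the frequency of each total score. The fragment below assumes the usual form of the statistic for a test of k dichotomously scored questions, where k + 1 distinct total scores are possible; the function name and data are hypothetical.

```python
from collections import Counter

def fergusons_delta(total_scores, k):
    """Ferguson's delta for a test with k questions scored 0/1.

    total_scores holds one total score per student; k + 1 distinct
    totals (0..k) are possible.
    """
    n = len(total_scores)
    if n == 0 or k < 1:
        raise ValueError("need at least one student and one question")
    freq = Counter(total_scores)               # f_i for each observed score i
    sum_sq = sum(f * f for f in freq.values())
    return (n * n - sum_sq) / (n * n - n * n / (k + 1))
```

The two boundary cases described in the text follow directly: identical scores for all students give 0.0, and a uniform spread over all possible scores gives 1.0.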
Test reliability
The test reliability is measured using Cronbach’s α (Cronbach and Shavelson 2004), which measures the internal consistency of the test by finding the correlation between each question’s score in the test and the whole test score. In other words, Cronbach’s α examines whether a test is constructed from questions that address the same material, and it is measured using the following formula:
\[ \alpha = \frac{K}{K-1} \left( 1 - \frac{\sum_{i=1}^{K} P_{i}(1 - P_{i})}{\sigma_{x}^{2}} \right) \qquad (5) \]
Where \(K\) is the number of questions in a test, \(P_{i}\) is the difficulty index of the i-th question in the test, and \(\sigma_{x}^{2}\) is the variance of the total test scores.
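For dichotomously scored questions, Cronbach’s α expressed through the difficulty indices can be computed as in the following sketch. It assumes the population variance of the total scores (dividing by N) and 0/1 scoring; the function name and the small score matrices are hypothetical.

```python
def cronbach_alpha_binary(score_matrix):
    """Cronbach's alpha for 0/1 scored questions.

    score_matrix[s][i] is 1 if student s answered question i correctly, else 0.
    """
    n = len(score_matrix)
    k = len(score_matrix[0])
    if k < 2:
        raise ValueError("at least two questions are required")
    totals = [sum(row) for row in score_matrix]
    mean = sum(totals) / n
    variance = sum((t - mean) ** 2 for t in totals) / n  # population variance
    if variance == 0:
        raise ValueError("total scores must not all be equal")
    # Difficulty index of each question.
    p = [sum(row[i] for row in score_matrix) / n for i in range(k)]
    item_variance_sum = sum(pi * (1 - pi) for pi in p)
    return k / (k - 1) * (1 - item_variance_sum / variance)
```

With four students whose response patterns are (1,1,1), (1,1,0), (1,0,0) and (0,0,0), the function returns 0.75.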
The CTT statistical measures have a range of desired values that questions and tests in learning environments are recommended to achieve (see Table 3).
Even though CTT is widely used in evaluating questions and tests in learning environments (Schmidt and Embretson 2003), it is limited in several ways: 1) A question’s difficulty, discrimination, and reliability values vary across different samples of students (Haladyna 1994). For example, questions appear easy when the sample of students used in the analysis has high ability, and difficult when the sample of students has low ability (De Ayala 2009). 2) Student and test characteristics cannot be separated, and they are interpreted in the context of each other (Hambleton 1991). A question’s difficulty, discrimination, and reliability values depend on the sample of students, and the ability of students depends on the assessment test. For example, if a test is easy, students appear to have high ability, and vice versa. 3) CTT is test oriented rather than question oriented, as it cannot predict how a particular student may do on a particular assessment question (Hambleton 1991).
These limitations have been addressed by the IRT, which is explained in the following section.
Item Response Theory
Item Response Theory (IRT) is a family of probabilistic models that relates students’ ability (θ) to the probability of answering a test question within a particular category (Lord 1980). Similar to CTT, IRT models are used to assess the question’s difficulty and discrimination. However, IRT addresses the CTT drawbacks by achieving the following (Baker 2001; Reckase 2009): 1) The question’s difficulty and discrimination values measured using IRT are sample independent, i.e., the question’s difficulty and discrimination values do not change across different samples of students such as high ability and low ability students. 2) Student and test characteristics in IRT can be separated; the question’s difficulty and discrimination are independent of the sample of students used in the analysis. Moreover, students’ ability is independent of the assessment questions.
Models
IRT includes the following set of probabilistic models, which differ in the number of parameters used to describe the characteristics of the assessment questions:
1) One parameter logistic model (1PL):
This is the simplest model in IRT as it has one parameter for describing the characteristics of a student (ability), and one parameter for describing the characteristics of an assessment question (difficulty). This model assumes that all questions in the test are equally discriminating. The 1PL model is presented in the following equation:
\[ P(X_{ij} = 1 \mid \theta_{j}, b_{i}) = \frac{e^{(\theta_{j} - b_{i})}}{1 + e^{(\theta_{j} - b_{i})}} \qquad (6) \]
Where \(X_{ij}\) represents the response of student j to question i; \(X_{ij} = 1\) means that question i is answered correctly and \(X_{ij} = 0\) means that question i is answered incorrectly. \(\theta_{j}\) represents the ability of student j, and \(b_{i}\) is the difficulty parameter of question i.
2) Two parameter logistic model (2PL):
This model is slightly more complex, as it considers both the question’s difficulty and discrimination. The model is presented in the following equation:
\[ P(X_{ij} = 1 \mid \theta_{j}, a_{i}, b_{i}) = \frac{e^{a_{i}(\theta_{j} - b_{i})}}{1 + e^{a_{i}(\theta_{j} - b_{i})}} \qquad (7) \]
Where \(a_{i}\) is the question’s discrimination parameter. The higher the value of \(a_{i}\), the more sharply the question discriminates between high ability and low ability students.
3) Three parameter logistic model (3PL):
This model is more complex than the previous models. It considers the possibility that a student’s correct answers could be obtained by guessing. The model is presented in the following equation:
\[ P(X_{ij} = 1 \mid \theta_{j}, a_{i}, b_{i}, G_{i}) = G_{i} + (1 - G_{i}) \frac{e^{a_{i}(\theta_{j} - b_{i})}}{1 + e^{a_{i}(\theta_{j} - b_{i})}} \qquad (8) \]
Where \(G_{i}\) is the guessing parameter, which accounts for the possibility that all students, even the ones with very low ability, have a nonzero probability of answering a question correctly by guessing.
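The nesting of the three logistic models can be made concrete with a single function. The sketch below assumes the standard logistic forms of the 1PL, 2PL and 3PL models; the function name, argument names, and example values are hypothetical.

```python
import math

def irt_probability(theta, b, a=1.0, g=0.0):
    """Probability of a correct answer under the 3PL model.

    theta: student ability; b: question difficulty;
    a: question discrimination; g: guessing parameter.
    Leaving a=1 and g=0 gives the 1PL model; leaving g=0 gives the 2PL model.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return g + (1.0 - g) * logistic
```

When a student’s ability equals the question’s difficulty, the 1PL and 2PL models give a probability of 0.5, whereas a guessing parameter of 0.25 (for example, a four-option multiple choice question) raises it to 0.625. Setting the parameters this way shows why the simpler models are special cases of the more complex ones, which is what makes them nested for the purposes of model comparison.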
Assumptions
In order to use the IRT models to analyse the data of an assessment test, the following two assumptions underlying the models must be satisfied (De Ayala 2009; Reckase 2009; Hambleton and Swaminathan 1985; Comer and Kendall 2013; Toland 2014):
1) Unidimensionality: This assumption means that the assessment test measures only one ability parameter (θ), while multidimensionality means the test measures more than one ability parameter. Unidimensionality can be examined using Principal Component Analysis (PCA) (Chou and Wang 2010). PCA outputs the number of components underlying the assessment test. If one component is found, the unidimensional IRT (UIRT) models can be used to analyse the assessment test data; otherwise the multidimensional IRT (MIRT) models can be applied to the assessment test data.
2) Local independence: This assumption states that the only influence on an individual question response is that of the ability parameter being measured (De Ayala 2009). This indicates that there is no influence on the individual question response from other questions or other ability variables. The term local is used to indicate that responses are assumed to be independent at the level of individual students having the same ability (θ). Local independence is examined using the Local Dependence chi-square (LD \(\chi^{2}\)) test, which is applied to each pair of questions in the assessment test (Chen and Thissen 1997). The LD \(\chi^{2}\) is computed by comparing the observed and expected frequencies of students’ responses for each pair of questions. In addition, it is applied under the null hypothesis that there is local independence between each pair of questions.
Model selection methods
Selecting the IRT model which is the closest fit to the assessment test data is essential to obtain question difficulty and discrimination values which are invariant across different samples of students (Hambleton and Swaminathan 1985; Gler et al. 2014). In this paper the following methods have been used to select the IRT model with the closest fit to the assessment test data:
1) The likelihood ratio: The Likelihood Ratio (LR) statistical test (De Ayala 2009; Comer and Kendall 2013; Toland 2014) can be used to select the best IRT model from the three nested models (1PL, 2PL, and 3PL). Moreover, it can be used to select the best model from UIRT and MIRT models, which have different dimensions and the same number of parameters. LR is a chi-square based statistical test and it is measured as the difference between the deviances of the two IRT models being compared. The deviance statistic is defined as:
\[ \text{Deviance} = -2 \log(\text{Maximum Likelihood(model)}) \qquad (9) \]
The maximum likelihood (ML) is obtained for the IRT models using Bock and Aitkin’s Expectation-Maximization algorithm (BA-EM) (Bock and Aitkin 1982). The LR statistical test is applied under the null hypothesis that there is no difference between the two compared models (model 1 and model 2). If the difference between the models’ deviances, which has a chi-square distribution, is statistically significant, then model 2 has a better fit to the assessment test data compared to model 1; otherwise, model 1 has a better fit to the assessment test data.
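The decision rule of the LR test can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the log-likelihood values are hypothetical, the more complex model is assumed to be model 2, and the chi-square critical value is supplied by the caller for the appropriate degrees of freedom (the difference in the number of estimated parameters).

```python
def likelihood_ratio(loglik_model1, loglik_model2):
    """LR statistic: deviance(model 1) - deviance(model 2),
    where deviance = -2 * maximised log-likelihood."""
    deviance1 = -2.0 * loglik_model1
    deviance2 = -2.0 * loglik_model2
    return deviance1 - deviance2

def model2_fits_better(loglik_model1, loglik_model2, critical_value):
    """Reject the null hypothesis of no difference when the LR statistic
    exceeds the chi-square critical value chosen by the caller."""
    return likelihood_ratio(loglik_model1, loglik_model2) > critical_value

# Chi-square critical value at the 0.05 level for 14 degrees of freedom,
# e.g. 14 extra discrimination parameters (an assumed parameter count).
CRITICAL_05_DF14 = 23.685
```

With hypothetical log-likelihoods of −1000 for model 1 and −980 for model 2, the LR statistic is 40, which exceeds 23.685, so model 2 would be preferred; with −995 for model 2 the statistic is only 10 and model 1 would be retained.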
2) Information theoretic methods:
The LR test tends to select models with more parameters (e.g., the 2PL model), which are more complex and may fit the assessment test data better than models with fewer parameters (e.g., the 1PL model) (De Ayala 2009; Kang and Cohen 2007). Therefore, Akaike’s Information Criterion (AIC) (Akaike 1974) and the Bayesian Information Criterion (BIC) (Schwarz 1978) are used as model selection methods which penalise the IRT models according to their complexity. They provide a trade-off between the complexity of the model and the goodness of fit between the model and the assessment test data. Akaike’s Information Criterion is measured using the following equation:
\[ AIC = -2 \log(\text{Maximum Likelihood(model)}) + 2 N_{parm} \qquad (10) \]
Where \(-2 \log(\text{Maximum Likelihood(model)})\) is the deviance and \(N_{parm}\) is the number of parameters being estimated. The model with the smallest AIC is the closest fit to the assessment test data (De Ayala 2009; Toland 2014).
The Bayesian Information Criterion is measured using the following equation:
\[ BIC = -2 \log(\text{Maximum Likelihood(model)}) + \log(N) \, N_{parm} \qquad (11) \]
Where \(N_{parm}\) is the number of parameters being estimated and \(N\) is the sample size, which is the total number of students who attempted the assessment test. The model with the smallest BIC is the closest fit to the assessment test data (De Ayala 2009; Toland 2014). Equation 10 shows that AIC penalises the model based on the number of parameters estimated and does not take into account the sample size. This results in AIC favouring more complex models when the sample size increases (Kang and Cohen 2007; DeMars 2012). On the other hand, BIC tends to select models that are simpler than those selected by AIC when the sample size is large (Kang and Cohen 2007). Equation 11 shows that BIC takes into account the sample size, and the penalty for model complexity increases for large samples (DeMars 2012).
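The different penalties applied by the two criteria can be seen in a short computation. The sketch below assumes the standard definitions of AIC and BIC with the natural logarithm; the log-likelihood values and parameter counts are hypothetical and are not results from the experiment.

```python
import math

def aic(loglik, n_parm):
    """Akaike's Information Criterion: deviance + 2 * number of parameters."""
    return -2.0 * loglik + 2.0 * n_parm

def bic(loglik, n_parm, n_students):
    """Bayesian Information Criterion: deviance + ln(N) * number of parameters."""
    return -2.0 * loglik + math.log(n_students) * n_parm

# Hypothetical comparison for a 14-question test taken by 126 students:
# a 1PL model with 14 parameters versus a 2PL model with 28 parameters.
simple = {"loglik": -1000.0, "n_parm": 14}
complex_ = {"loglik": -980.0, "n_parm": 28}
aic_prefers_complex = aic(**complex_) < aic(**simple)
bic_prefers_complex = (bic(complex_["loglik"], complex_["n_parm"], 126)
                       < bic(simple["loglik"], simple["n_parm"], 126))
```

In this made-up example the deviance improves by 40 when 14 parameters are added. AIC charges 28 for those parameters and so prefers the more complex model, whereas BIC charges about 67.7 (14 × ln 126) and prefers the simpler one, which illustrates the tendency described above.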
Experimental study
This section presents the research questions which will be answered using the evaluation methods discussed in the previous section. In addition, it presents the experimental setup and participants.
Experiment questions
This experiment aims to answer two main questions:

1. Do the questions and tests generated from ontologies have satisfactory difficulty, discrimination and reliability values?
2. Do the ontology-based generation strategies and the level of the questions in Bloom’s taxonomy affect the questions’ difficulty and discrimination?
Experimental setup
A question generator prototype was developed in Java and used to generate true and false, multiple choice, and short answer questions using the ontologybased generation strategies defined by Papasalouros et al. (2017); Papasalouros et al. (2011), Cubric and Tosic (2017), Grubisic (2012); Grubisic et al. (2013), and AlYahya (2014); AlYahya (2011). Figure 1 shows an example of a classbased strategy integrated in the question generator. The question generator also integrated 20 question stem templates defined by Grubisic (2012); Grubisic et al. (2013), Cubric and Tosic (2017) to autogenerate questions aim to assess student’s cognition at the knowledge, comprehension, application and analysis levels in Bloom’s taxonomy. Table 4 shows part of the stem templates integrated in the question generator.
Grubisic (2012); Grubisic et al. (2013) knowledge level stem templates focused on assessing if students could recall concepts in the domain ontology and understand the subclasses or superclasses properties between concepts. The comprehension level stem templates focused on the meaning of the concepts in terms of their relationship with other concepts in the domain. Application level stem templates assumed that students are more familiar with the domain ontology being tested, as students are asked about the relationship between individuals and concepts in the domain ontology. Analysis level stem templates focused on assessing the concept’s annotation properties and the concept’s datatype and object properties with other concepts in the domain ontology. Cubric and Tosic followed a different approach in forming the stem templates. They used words that define each level in Bloom’s taxonomy such as demonstrate, define, relate, and analyse (Assessment 2002; Felder and Brent 1997). No generation strategies or stem templates were defined by Papasalouros et al. (2017); Papasalouros et al. (2011), Cubric and Tosic (2017), Grubisic (2012); Grubisic et al. (2013), and AlYahya (2014); AlYahya (2011) to autogenerate questions which assess students at the synthesis and evaluation levels in Bloom’s taxonomy.
The Computer Networks (Murugan et al. 2013) and the OpenCyc (Matuszek et al. 2006) ontologies were used to autogenerate questions which covered the ’transport layer’ topic. 44 questions were chosen and syntactically checked by a domain expert who is a lecturer in the School of Computer Science and teaches the Computer Networks course. After that, the questions were imported into Moodle VLE to form three different tests. Tables 5, 6, and 7 illustrate the distribution of the questions generated using the ontologybased generation strategies. Each test contained true and false, multiple choice and short answer questions, and consisted of questions which aim to assess students’ cognition at different levels in Bloom’s taxonomy. Table 7 shows that the number of short answer questions used in the experiment was small compared to the true and false and multiple choice questions. This is due to the fact that Grubisic (2012); Grubisic et al. (2013) and AlYahya (2014); AlYahya (2011) defined only two generation strategies and stem templates for generating short answer questions.
The quality of the generated questions was evaluated using the CTT and IRT, which are explained in detail in Sec. 2.
Participants
In 2013/2014, third year undergraduate students registered in the Data networking course (TUO a) and the Computer Networks course (TUO b) at the University of Manchester volunteered to take part in the experiment. In total, 126 students attempted testone, 88 students attempted testtwo, and 89 students attempted testthree. Students accessed the three tests using Moodle VLE. Their responses were recorded and used to analyse the quality of the questions and tests.
Results and discussion
This section illustrates the experiment results obtained using the CTT and IRT. Before applying the IRT models to the assessment test data, the unidimensionality and local independence assumptions were investigated. Table 8 illustrates the results obtained by applying the PCA to testone, which consists of 14 questions and was answered by 126 students. Initially, 14 components were identified; i.e., the number of components equals the number of questions in testone. Table 8 shows that testone data results in six components with eigenvalues greater than one. The first component had an eigenvalue of 2.225, which is higher than those of the next five components (1.635, 1.248, 1.213, 1.078, and 1.004). 15.894% of the test variance was explained by the first component and a cumulative variance of 60.02% was explained by the first six components (see Table 8). The results obtained using the PCA suggest that testone is not unidimensional and does not measure a single ability parameter. The same analysis was applied to testtwo and testthree, and the results obtained also suggest that both tests are not unidimensional.
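The variance figures quoted for testone can be reproduced from the reported eigenvalues. The sketch below assumes the PCA was run on standardised items, so that the total variance equals the number of questions (14); the variable and function names are illustrative and not taken from the original analysis. The small difference from the reported 15.894% comes from the eigenvalue being rounded to three decimals.

```python
# Eigenvalues of the first six principal components reported for testone (Table 8).
eigenvalues = [2.225, 1.635, 1.248, 1.213, 1.078, 1.004]
n_items = 14  # assumption: standardised items, so total variance = number of questions


def explained_variance(values, total):
    """Return the percentage of the total variance explained by each component."""
    return [100.0 * v / total for v in values]


percent = explained_variance(eigenvalues, n_items)
retained = [v for v in eigenvalues if v > 1.0]  # components with eigenvalue > 1
cumulative = sum(percent)

print(f"components with eigenvalue > 1: {len(retained)}")
print(f"variance explained by first component: {percent[0]:.2f}%")
print(f"cumulative variance of six components: {cumulative:.2f}%")
```

A test that measured a single ability would be expected to show one dominant component rather than six components of comparable size.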
The local independence assumption was also investigated on testone, testtwo and testthree data using the LD χ² test. The results revealed that the questions are independent of each other.
After the assumptions were investigated, several IRT models were applied to the three tests, and the model selection methods explained in the section "Model selection methods" were used to select the model with the best fit. The PCA analysis revealed that testone is not unidimensional, and six components had eigenvalues greater than one. Therefore, the model’s data fit was examined using the UIRT model and the MIRT models, starting from two dimensions and going up to six dimensions. The following abbreviations are used throughout the analysis:
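UIRT (M) denotes a unidimensional IRT model and DMIRT (M) denotes a multidimensional IRT model, where M is the type of IRT model, which could be the one parameter logistic model (1PL), the two parameter logistic model (2PL), or the three parameter logistic model (3PL). D is only used with MIRT, as it represents the number of dimensions in IRT (e.g., 2MIRT (2PL) is a two-dimensional 2PL model).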
The analysis started with the 2PL model. Table 9 illustrates the likelihood ratio, Akaike’s information criterion (AIC), and the Bayesian information criterion (BIC) goodness of fit statistics after applying UIRT (2PL) and MIRT (2PL) models to testone.
Table 10 shows the chisquare test between several models. The results revealed that AIC, BIC and chisquare tests gave consistent results identifying the 2MIRT (2PL) model as the best fit for testone data, as 2MIRT (2PL) had the smallest AIC and BIC values, and the chisquare test revealed a statistically significant difference between the 2MIRT (2PL) and the UIRT (2PL) models.
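The comparison logic behind the AIC, BIC and chisquare statistics can be sketched as follows. The log-likelihoods and parameter counts below are hypothetical placeholders chosen only for illustration; they are not the values in Tables 9 and 10, and the function names are assumptions rather than part of any IRT package.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike's information criterion: smaller values indicate a better trade-off."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: penalises parameters more as n_obs grows."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


def likelihood_ratio(loglik_restricted, loglik_general):
    """Likelihood ratio statistic for two nested models."""
    return 2 * (loglik_general - loglik_restricted)


# Hypothetical fits: (log-likelihood, number of estimated parameters).
models = {"UIRT (2PL)": (-1050.0, 28), "2MIRT (2PL)": (-1010.0, 41)}
n_students = 126  # number of students who attempted testone

scores = {
    name: (aic(ll, k), bic(ll, k, n_students)) for name, (ll, k) in models.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])

lr = likelihood_ratio(models["UIRT (2PL)"][0], models["2MIRT (2PL)"][0])
df = models["2MIRT (2PL)"][1] - models["UIRT (2PL)"][1]
print(best_by_aic, best_by_bic, lr, df)
```

The likelihood ratio statistic is compared against a chisquare distribution with `df` degrees of freedom; the P value itself is not computed in this sketch.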
Further investigations were carried out to examine the effect of changing the type of IRT model (e.g., 2PL and 3PL) in 2MIRT on the goodness of fit statistics. Table 11 shows the goodness of fit statistics for the 2MIRT (2PL) and the 2MIRT (3PL) models. The results revealed that 2MIRT (2PL) fits testone data better than 2MIRT (3PL), as it has lower AIC and BIC values, and the chisquare test revealed no statistically significant difference (Pvalue >0.05) between the 2MIRT (2PL) and the 2MIRT (3PL) models (see Table 12). In summary, the 2MIRT (2PL) model was the closest fit to testone. The same analysis was applied to testtwo and testthree data, and the results revealed that the UIRT (2PL) model has the closest fit to testtwo and testthree. The 2PL model selected for the three tests assumes that questions have no guessing parameter.
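To make the difference between the 2PL and 3PL models concrete, the following minimal sketch evaluates the unidimensional item response function. It assumes the standard logistic form without a scaling constant; the multidimensional case is not covered, and the parameter values are illustrative only.

```python
import math


def irt_probability(theta, a, b, c=0.0):
    """Probability of a correct answer under the 3PL model.

    theta: student ability, a: discrimination, b: difficulty,
    c: guessing parameter. With c = 0 this reduces to the 2PL model.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# A student whose ability equals the question difficulty.
p_2pl = irt_probability(theta=0.0, a=1.2, b=0.0)          # no guessing
p_3pl = irt_probability(theta=0.0, a=1.2, b=0.0, c=0.25)  # guessing of 0.25
print(p_2pl, p_3pl)
```

Under the 2PL model the probability is exactly 0.5 when ability equals difficulty, whereas a non-zero guessing parameter raises the lower asymptote of the curve.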
Do the questions and tests generated from ontologies have satisfactory difficulty, discrimination and reliability values?
The CTT difficulty indices of the questions administered to third year undergraduate students registered in the Data networking course and the Computer Networks course at the University of Manchester can be summarised as follows. The indices varied from very easy to very difficult in testone (see Table 13), and from very easy to moderately difficult in testtwo (see Table 14) and testthree (see Table 15). 16% of the questions in the three tests (7 out of 44) were very easy or very difficult, which results in questions with low discrimination.
The CTT analysis also revealed that the three tests administered to third year undergraduate students had medium difficulty, with average difficulty index values of 0.525, 0.540, and 0.564. These values fall within the CTT desired range (see Tables 16, 17, and 18) (Doran 1980; Ding et al. 2006). In addition, the tests’ average difficulty index values were very close to 0.5, which is the value that test authors are advised to achieve when constructing questions and at which a test has the maximum discrimination (Doran 1980; Mitkov et al. 2017; Mitkov et al. 2006). The maximum discrimination is obtained only when all the students with high ability (students with high scores) answer the questions correctly and all the students with low ability do not answer the questions correctly.
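The link between a difficulty index of 0.5 and maximum discrimination can be illustrated with a small sketch. It assumes the common definitions of the CTT difficulty index (proportion of correct answers) and of the discrimination index (difference between the proportions correct in the upper and lower scoring groups); the exact grouping used in the paper is defined in its methods section, so the `fraction` parameter and the data are assumptions.

```python
def difficulty_index(item_scores):
    """Proportion of students who answered the question correctly (1 = correct)."""
    return sum(item_scores) / len(item_scores)


def discrimination_index(item_scores, total_scores, fraction=0.5):
    """Proportion correct in the upper group minus proportion correct in the lower group.

    Students are ordered by total test score; `fraction` sets the size of each group.
    """
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    n_group = max(1, int(len(order) * fraction))
    lower = order[:n_group]
    upper = order[-n_group:]
    p_upper = sum(item_scores[i] for i in upper) / n_group
    p_lower = sum(item_scores[i] for i in lower) / n_group
    return p_upper - p_lower


# Hypothetical data: the four highest scoring students answer correctly,
# the four lowest scoring students do not.
item = [1, 1, 1, 1, 0, 0, 0, 0]
totals = [9, 8, 7, 6, 4, 3, 2, 1]

p = difficulty_index(item)
d = discrimination_index(item, totals)
print(p, d)
```

In this constructed case the difficulty index is 0.5 and the discrimination index reaches its maximum of 1.0, which matches the condition described above.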
The IRT was also used to assess the questions’ difficulty due to its invariance assumption. Tables 19, 20, and 21 illustrate the IRT analysis results obtained for testone, testtwo, and testthree respectively. The results revealed a strong relationship between the difficulty indices obtained using the CTT and IRT (Pearson R= 0.602, Pvalue <0.05). In addition, the IRT analysis revealed that 22.7% (10 questions out of 44) of the questions were either very easy or very difficult.
The discrimination was also measured for the individual questions and the entire assessment tests. The question discrimination indices obtained using the CTT when applied to the three tests administered to third year undergraduate students (see Tables 13, 14, and 15) had positive values. This indicates that the autogenerated questions may not need to be reviewed or eliminated from the assessment tests (Doran 1980; Mitkov et al. 2006; Mitkov and Ha 2017). In addition, the three tests had satisfactory average discrimination values above 0.30 (see Tables 16, 17, and 18), which indicates that the questions could efficiently discriminate between high ability and low ability students (Doran 1980; Zhang and Lidbury 2013; Thorndike and Hagen E 2017; Corkins 2009). Similar results were obtained using the IRT, as can be seen in Tables 19, 20, and 21. The results revealed that the questions in the three tests had positive discrimination values and that the autogenerated questions may not need to be reviewed or eliminated from the assessment tests (Baker 2001; Hambleton and Swaminathan 1985).
The CTT was also used to obtain the tests’ discrimination power using Ferguson’s delta. The results revealed that the three tests had satisfactory discrimination power with Ferguson’s delta values above 0.90 which is the discrimination power for normally distributed test scores.
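For readers unfamiliar with Ferguson’s delta, the sketch below shows one standard way to compute it from the total test scores. The formula is the commonly used one for tests with K dichotomous questions; the function name and the toy data are illustrative assumptions.

```python
from collections import Counter


def fergusons_delta(total_scores, n_items):
    """Ferguson's delta for a test with n_items questions scored 0 or 1.

    delta = (K + 1) * (N^2 - sum(f_i^2)) / (K * N^2), where N is the number of
    students, K the number of questions, and f_i the number of students who
    obtained each distinct total score.
    """
    n = len(total_scores)
    freq = Counter(total_scores)
    sum_sq = sum(f * f for f in freq.values())
    return (n_items + 1) * (n * n - sum_sq) / (n_items * n * n)


# Scores spread evenly over every possible total (0..4 on a 4-question test).
even_spread = fergusons_delta([0, 1, 2, 3, 4], n_items=4)
# Every student obtains the same total score.
no_spread = fergusons_delta([2, 2, 2, 2, 2], n_items=4)
print(even_spread, no_spread)
```

Delta is 1.0 when the scores are spread uniformly over all possible totals and 0.0 when every student obtains the same score, so values above 0.90 indicate that the test separates students across a wide range of scores.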
The questions’ reliability was measured using the point biserial correlation coefficients (R_pb), which are shown in Tables 13, 14, and 15. The results revealed that the questions’ reliability values in the three tests administered to third year undergraduate students were positive and that the questions could effectively discriminate between low ability and high ability students, as the average point biserial coefficients in each test were satisfactory, with values above 0.2.
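A minimal sketch of the point biserial coefficient is given below. It assumes the uncorrected form, in which the total score includes the question being analysed, and uses the population standard deviation; whether the original analysis applied a correction is not stated here, and the data are hypothetical.

```python
import math


def point_biserial(item_scores, total_scores):
    """Point biserial correlation between a 0/1 question and the total test score.

    Requires at least one correct and one incorrect answer to the question.
    """
    n = len(item_scores)
    correct = [t for s, t in zip(item_scores, total_scores) if s == 1]
    incorrect = [t for s, t in zip(item_scores, total_scores) if s == 0]
    p = len(correct) / n          # proportion answering correctly
    q = 1.0 - p                   # proportion answering incorrectly
    mean_all = sum(total_scores) / n
    sd = math.sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)
    mean_correct = sum(correct) / len(correct)
    mean_incorrect = sum(incorrect) / len(incorrect)
    return (mean_correct - mean_incorrect) / sd * math.sqrt(p * q)


# Hypothetical question answered correctly by the two highest scoring students.
r_pb = point_biserial([1, 1, 0, 0], [4, 3, 2, 1])
print(round(r_pb, 4))
```

A positive coefficient means that students who answered the question correctly tended to have higher total scores than those who did not.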
The tests’ reliability values were obtained using Cronbach’s α, which revealed that testone and testtwo had poor reliability, with values of 0.54 and 0.56 respectively, while testthree had a higher reliability value (0.604), which is considered acceptable. The tests’ low reliability values obtained using Cronbach’s α are due to the fact that the individual questions in each test had satisfactory reliability values (R_pb) which were not high enough to improve the tests’ overall reliability (Jones 2009). Higher R_pb values are desired, and lower R_pb values indicate that a question is not testing the same educational material or may not be testing the same educational material at the same level (Ding and Beichner 2009). In this experiment the questions are generated from the same domain ontologies (OpenCyc and Computer Networks). As a result, the context of the educational material being tested is known. However, the autogenerated questions were designed to assess different educational concepts at different levels of Bloom’s taxonomy, which may result in satisfactory reliability values at the question level (average R_pb) but low reliability values at the test level (Cronbach’s α).
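Cronbach’s α relates the variance of the individual questions to the variance of the total score. The following sketch uses the standard formula with sample variances; the response matrix is hypothetical and the function names are illustrative.

```python
def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of per-student lists of question scores.

    alpha = K / (K - 1) * (1 - sum of question variances / variance of total scores)
    """
    k = len(responses[0])
    item_vars = [sample_variance([row[j] for row in responses]) for j in range(k)]
    total_var = sample_variance([sum(row) for row in responses])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


# Hypothetical responses of four students to three questions (1 = correct).
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))
```

When questions measure different concepts at different levels, their scores covary less, the total score variance shrinks relative to the sum of the question variances, and α falls even if each question correlates acceptably with the total.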
Do the ontologybased generation strategies and the level of the questions in Bloom’s taxonomy affect the questions’ difficulty and discrimination?
This section studies the effect of the ontologybased generation strategies and the level of questions in Bloom’s taxonomy on the questions’ difficulty and discrimination obtained using the CTT (dependent on the sample of students) and the IRT (independent from the sample of students).
The study was carried out on the CTT difficulty and discrimination indices obtained for all 44 questions (the total number of assessment questions in testone, testtwo, and testthree), and on the IRT difficulty and discrimination indices which did not vary across different samples of students. The invariance of IRT measurements was tested for all 44 questions by dividing the students in each test (testone, testtwo, and testthree) into two groups: low ability students (students with test scores less than 50%) and high ability students (students with test scores greater than or equal to 50%), following the approach of Hambleton and Swaminathan (1985). Students in each test could also be divided according to their gender or year of study (De Ayala 2009; Crocker and Algina 1986). However, this was not applicable in the experiment carried out in this paper due to the large difference in group sizes when the students in each test were divided according to their gender or year of study. The IRT model which had the best fit to the whole sample of students in each test was applied to the low ability and high ability samples separately to obtain the questions’ difficulty and discrimination indices. The standard deviation was measured for each question’s difficulty and discrimination across the three groups of students: the whole sample of students, students with low ability, and students with high ability. Questions with large standard deviation values compared to other questions in the assessment test were considered outliers, as they experienced high variance across the three groups of students. In total, 10 questions out of 44 violated the IRT invariance assumption and were not used in the subsequent evaluations.
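The invariance check described above can be sketched as follows. The paper does not state a numeric cut-off for a "large" standard deviation, so the `threshold` argument, the use of the sample standard deviation, the question identifiers, and the parameter estimates are all assumptions made for illustration.

```python
import statistics


def flag_unstable_items(estimates, threshold):
    """Flag questions whose parameter estimates vary strongly across student groups.

    estimates maps a question id to its estimates for the three groups:
    (whole sample, low ability students, high ability students).
    Returns the per-question standard deviations and the flagged question ids.
    """
    spread = {q: statistics.stdev(values) for q, values in estimates.items()}
    flagged = sorted(q for q, s in spread.items() if s > threshold)
    return spread, flagged


# Hypothetical IRT difficulty estimates for three questions.
difficulty_estimates = {
    "Q1": (0.10, 0.05, 0.15),
    "Q2": (-0.50, 1.20, -1.90),
    "Q3": (0.80, 0.70, 0.95),
}
spread, flagged = flag_unstable_items(difficulty_estimates, threshold=1.0)
print(flagged)
```

Questions flagged in this way would be excluded from the later comparisons, mirroring the removal of the 10 questions that violated the invariance assumption.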
Does the ontologybased question generation strategy affect the question difficulty and discrimination?
The results revealed that generating questions using different generation strategies (class, terminology, and property) appears to affect the question difficulty and discrimination obtained using the CTT and IRT. A statistically significant difference in the CTT difficulty indices (U = 69, Pvalue <0.05) and IRT difficulty indices (U = 26, Pvalue <0.05) was found between questions generated using the terminologybased strategies and questions generated using the propertybased strategies. This suggests that students found questions which assess their knowledge about an educational concept and how it is related to other concepts using the superclass and subclass properties easier than questions which assess their knowledge about the concept’s object, datatype, and annotation properties. Questions generated using terminologybased strategies had higher CTT difficulty indices (Spearman’s R = 0.476, Pvalue <0.01) and lower IRT difficulty indices (Spearman’s R = 0.583, Pvalue <0.01). A higher difficulty index in CTT means the question is easier, while in IRT it means the question is more difficult.
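The U statistics reported above come from the Mann-Whitney test, which compares two independent groups of questions without assuming normality. The sketch below computes the statistic by direct pair counting and returns the smaller of the two U values, which is an assumption about the reporting convention; the difficulty indices are hypothetical and no P value is computed.

```python
def mann_whitney_u(sample_a, sample_b):
    """Mann-Whitney U statistic (the smaller of the two possible U values).

    Counts, over all pairs, how often a value from sample_a exceeds a value
    from sample_b; ties contribute 0.5.
    """
    u_a = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u_a += 1.0
            elif a == b:
                u_a += 0.5
    u_b = len(sample_a) * len(sample_b) - u_a
    return min(u_a, u_b)


# Hypothetical CTT difficulty indices for two generation strategies.
terminology_based = [0.82, 0.75, 0.68, 0.90]
property_based = [0.41, 0.55, 0.62, 0.70]

u = mann_whitney_u(terminology_based, property_based)
print(u)
```

A small U relative to the number of pairs (here 16) indicates that one group’s values are consistently higher than the other’s.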
No statistically significant difference was found in the CTT and IRT difficulty indices between questions generated using classbased strategies and terminologybased strategies, or between questions generated using classbased strategies and propertybased strategies. This suggests that the students found questions autogenerated using the individual and class relationship in the ontology as difficult as questions generated using the terminologybased strategies and the propertybased strategies.
The questions’ discrimination indices were also investigated, and the results revealed a statistically significant difference in CTT discrimination indices (U = 74, Pvalue <0.05), CTT R_pb (U = 59, Pvalue <0.05), and IRT discrimination indices (U = 43, Pvalue <0.05) between questions generated using the terminologybased strategies and questions generated using the propertybased strategies. Questions generated using the terminologybased strategies had better discrimination values than questions generated using the propertybased strategies: they had higher CTT discrimination indices (Spearman’s R = 0.454, Pvalue <0.01), higher CTT R_pb (Spearman’s R = 0.521, Pvalue <0.01), and higher IRT discrimination indices (Spearman’s R = 0.456, Pvalue <0.01).
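Spearman’s rank correlation, used above to describe the direction of each difference, can be sketched as the Pearson correlation of average ranks. How the generation strategies were coded numerically is not stated in this section, so the 0/1 coding and the discrimination values below are assumptions.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Assumed coding: 0 = propertybased strategy, 1 = terminologybased strategy.
strategy = [0, 0, 0, 1, 1, 1]
discrimination = [0.21, 0.30, 0.25, 0.45, 0.28, 0.50]
rho = spearman_rho(strategy, discrimination)
print(round(rho, 3))
```

With this coding, a positive coefficient indicates that the questions coded 1 tend to have the higher discrimination values.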
The results also revealed a statistically significant difference in CTT discrimination indices (U = 2, Pvalue <0.05) and CTT R_pb (U = 2, Pvalue <0.05) between questions autogenerated using classbased strategies and terminologybased strategies. Questions generated using terminologybased strategies had higher CTT discrimination indices (Spearman’s R = 0.63, Pvalue <0.05) and higher CTT R_pb (Spearman’s R = 0.617, Pvalue <0.05) compared to questions generated using classbased strategies. However, this result depends on the sample group of students, as no statistically significant difference was found in the IRT discrimination indices (sample independent) between questions generated using the classbased strategies and questions generated using the terminologybased strategies. In addition, no statistically significant difference was found in the CTT discrimination indices, the CTT R_pb, or the IRT discrimination indices between the questions generated using classbased and propertybased strategies. This suggests that the classbased and propertybased strategies produce questions that have similar discrimination indices.
Do Bloom’s taxonomy stem templates affect the question difficulty and discrimination?
Grubisic (2012); Grubisic et al. (2013), Cubric and Tosic (2017) defined several question stem templates to autogenerate questions aimed to assess students’ cognition at different levels in Bloom’s taxonomy. However, they never investigated whether the question stem templates order the autogenerated questions according to their easiness in Bloom’s taxonomy or whether the question stem templates affect the questions’ discrimination. Therefore, this section investigates the effect of the level of question in Bloom’s taxonomy on the question difficulty and discrimination.
The results revealed that the question stem templates defined by Grubisic (2012); Grubisic et al. (2013), Cubric and Tosic (2017) appear to order questions according to their easiness in Bloom’s taxonomy. A statistically significant difference in the CTT difficulty indices (U = 21, Pvalue <0.05) and IRT difficulty indices (U = 13, Pvalue <0.05) was found between questions in the knowledge and comprehension levels. Questions generated to assess the students at the knowledge level are easier than questions generated to assess the students at the comprehension level, as they have higher CTT difficulty indices (Spearman’s R = 0.614, Pvalue <0.01) and lower IRT difficulty indices (Spearman’s R = 0.616, Pvalue <0.01). The results are expected, as the knowledge level stem template shown in Table 4 focused on assessing whether students could recall concepts and are aware of the subclass and superclass relationships between concepts. In contrast, the comprehension level stem templates focused on students’ understanding of the similarity of the relationship between concepts (see question 2 in Table 4) and whether students know all the concept’s subclasses and superclasses.
The results also revealed that questions at the knowledge level were easier than questions at the application and analysis levels. This is due to the fact that the application level stem templates defined by Grubisic (2012); Grubisic et al. (2013), Cubric and Tosic (2017) focused on the relationship between the individual and superclass (see question 3), as students need to provide an example of the concept they have learned. Similarly, in the analysis level stem templates students are assessed on the annotation and object properties in classes and individuals (see question 4). For students, these stem templates are harder than knowledge level stem templates, which focus on recalling concepts in the domain ontology.
However, no statistically significant difference in the CTT difficulty indices and IRT difficulty indices was found between the other levels in Bloom’s taxonomy. This suggests that comprehension, application, and analysis level questions appeared equally difficult to students.
Questions’ discrimination was also investigated, and the results revealed that the knowledge level questions, which are the easiest questions, tend to have lower discrimination compared to comprehension, application and analysis level questions. On the other hand, no statistically significant difference in the CTT discrimination indices, CTT R_pb, and IRT discrimination indices was found between comprehension, application, and analysis level questions, which suggests that comprehension, application and analysis question stem templates autogenerate questions which have similar discrimination.
Conclusion and future work
This paper presented the experiment carried out to analyse the quality of the questions which were generated using the question generators of Papasalouros et al. (2017); Papasalouros et al. (2011), Cubric and Tosic (2017), Grubisic (2012); Grubisic et al. (2013), and AlYahya (2014); AlYahya (2011). It has three main contributions to the field of ontologybased question generators: 1) developing an ontologybased question generator which integrates the preexisting stem templates and generation strategies to generate questions that assess students at different levels in Bloom’s taxonomy; 2) providing a quantitative analysis of the autogenerated questions using the CTT and IRT statistical methods; 3) studying the effect of the ontologybased generation strategies and the level of the questions in Bloom’s taxonomy on the questions’ quality measurements.
The results obtained using the CTT revealed that the three assessment tests formed from the autogenerated questions had medium difficulty values, which are very close to the value (0.5) that test authors are advised to achieve when constructing tests (Doran 1980; Mitkov et al. 2017; Mitkov et al. 2006). In addition, the results revealed that the questions and tests had satisfactory positive discrimination values, which indicate that the questions and tests could effectively discriminate between high ability and low ability students, and that the questions may not need to be reviewed or eliminated from the assessment tests (Doran 1980; Mitkov et al. 2006; Mitkov and Ha 2017). In addition to the CTT, the Item Response Theory (IRT) was used to assess the quality of the autogenerated questions because of its invariance assumption. The IRT analysis revealed similar results to the CTT, as the questions’ discrimination indices had positive values, which indicates that the autogenerated questions may not need to be reviewed or eliminated from the assessment tests (Baker 2001; Hambleton and Swaminathan 1985).
As mentioned earlier, this paper also investigated the effect of the ontologybased generation strategies and the level of the questions in Bloom’s taxonomy on the questions quality measurements. The results revealed that the generation strategies and the level of the questions in Bloom’s taxonomy affect the question’s difficulty and discrimination. This provides guidance for developers and researchers working in the field of ontologybased question generators.
The analysis results obtained were based on 44 questions generated from the ’transport layer’ topic and used in three different tests, which consist of 14, 16 and 14 questions respectively. The experiment could be enhanced in future work by: 1) increasing the number of questions in each test; 2) increasing the number of students participating in the experiment; 3) generating questions from different topics in the computer networks domain or from different domains (e.g., medicine).
The experiment results obtained using the CTT and IRT could be used in future work to build a prediction model using machine learning techniques (e.g., multiple linear regression (James et al. 2014)) to predict the question’s difficulty (very difficult, moderately difficult, moderately easy, and very easy) and discrimination (low, medium, and high) in the computer networks domain using the following two features: the ontologybased generation strategy and the level of the question in Bloom’s taxonomy. This will help researchers and developers save time and effort in terms of testing the autogenerated questions on real students (Arnold et al. 1996). In addition, the ontologybased question generator developed for the purpose of analysing the autogenerated questions quantitatively could be enhanced in future work to autogenerate personalised formative feedback which takes into account the question characteristics (e.g., the level of question in Bloom’s taxonomy) (Mason and Bruning 2001).
References
H Akaike, A new look at the statistical model identification. Automatic Control IEEE Trans.19(6), 716–723 (1974).
M AlYahya, in Advanced Learning Technologies (ICALT) 2011 11th IEEE International, Conference on. Ontoque: a question generation engine for educational assesment based on domain ontologies (IEEE, 2011), pp. 393–395.
M AlYahya, Ontologybased multiple choice question generation. Sci World J (2014).
T Alsubait, B Parsia, U Sattler, in OWLED. Generating multiple choice questions from ontologies: Lessons learnt, (2014), pp. 73–84.
M AlYahya, H AlKhalifa, A Bahanshal, I AlOdah, N AlHelwah, An ontological model for representing semantic lexicons: an application on time nouns in the holy quran. Arab J Sci Eng. 35(2), 21–35 (2010).
S Alagumalai, DD Curtis, Classical test theory (Springer, 2005).
LW Anderson, LA Sosniak, Bloom’s taxonomy: A fortyyear retrospective. Ninetythird yearbook of the National Society for the Study of Education (1994).
K Arnold, J Gosling, D Holmes, The Java programming language, Vol 2 (Addisonwesley Reading, 1996).
FB Baker, The basics of item response theory (ERIC, 2001).
BS Bloom, Committee of College and University Examiners, Taxonomy of educational objectives, Vol. 1 (David McKay, New York, 1956).
R Bock, M Aitkin, Marginal maximum likelihood estimation of item parameters. Psychometrika. 47(3), 369–369 (1982).
WN Borst, Construction of engineering ontologies for knowledge sharing and reuse (Universiteit Twente, 1997).
JD Brown, Testing in language programs (Prentice Hall Regents, New Jersey, 1996).
CA Assessment, Assessment of higher order skills (2002). http://www.caacentre.ac.uk/resources/faqs/higher.shtml.
YT Chou, WC Wang, Checking dimensionality in item response models with principal component analysis on standardized residuals. Educ. Psychol. Meas. 70(5), 717–731 (2010).
WH Chen, D Thissen, Local dependence indexes for item pairs using item response theory. J. Educ. Behav. Stat. 22(3), 265–289 (1997).
L Cohen, L Manion, K Morrison, Research methods in education (Routledge, 2013).
JS Comer, PC Kendall, The Oxford Handbook of Research Strategies for Clinical Psychology (Oxford University Press, 2013).
J Corkins, The Psychometric Refinement of the Materials Concept Inventory (MCI) (ProQuest, 2009).
LJ Cronbach, RJ Shavelson, My current thoughts on coefficient alpha and successor procedures. Educ Psychol Meas.64(3), 391–418 (2004).
L Crocker, J Algina, Introduction to classical and modern test theory (ERIC, 1986).
M Cubric, M Tosic, Towards automatic generation of eassessment using semantic web technologies. Intl J eAssessment (2017).
CP Dancey, J Reidy, Statistics Without Maths for Psychology: Using Spss for Windows (PrenticeHall Inc., 2004).
RJ De Ayala, Theory and practice of item response theory (Guilford Publications, 2009).
CE DeMars, Confirming testlet effects. Appl. Psychol. Meas. 36(2), 104–121 (2012).
L Ding, R Beichner, Approaches to data analysis of multiplechoice questions (2009).
L Ding, R Chabay, B Sherwood, R Beichner, Evaluating an electricity and magnetism assessment tool: Brief electricity and magnetism assessment. Phys Rev Special TopicsPhysics Educ Res. 2(1) (2006).
RL Doran, Basic measurement and evaluation of science instruction (National Science Teachers Association, Washington, DC, 1980).
R Ebel, Essentials of Educational Measurement (PrenticeHall, 1979). http://books.google.co.uk/books?id=eEv0NeqTUXYC.
M Erguven, Two approaches to psychometric process: Classical test theory and item response theory. J Educ. 2(2), 23–30 (2014).
G Ferguson, On the theory of test development. Psychometrika. 14, 61–68 (1949).
RM Felder, R Brent, Objectively speaking. Chem. Eng. Educ. 31, 178–179 (1997).
G Ganapathi, R Lourdusamy, V Rajaram, in World Congress on Engineering. Towards ontology development for teaching programming language, (2017).
TR Gruber, A translation approach to portable ontology specifications. Knowl Acquisition. 5(2), 199–220 (1993).
A Grubisic, Adaptive students knowledge acquisition model in elearning systems. Thesis (2012).
A Grubisic, S Stankov, B Žitko, in ICIIS 2013: International Conference on Information and Intelligent Systems. Stereotype student model for an adaptive elearning system, (2013).
N Güler, GK Uyanık, GT Teker, Comparison of classical test theory and item response theory in terms of item parameters. Eur. J. Res. Educ. 2(1), 1–6 (2014).
M Hankins, Questionnaire discrimination:(re)introducing coefficient d. BMC Med Res Methodol. 7(1) (2007).
TM Haladyna, Developing and validating multiplechoice test items (Erlbaum, Hillsdale, NJ; Hove, UK, 1994).
RK Hambleton, Fundamentals of item response theory, Vol 2 (Sage publications, 1991).
RK Hambleton, H Swaminathan, Item response theory: Principles and applications, Vol 7 (Springer, 1985).
G James, D Witten, T Hastie, An introduction to statistical learning: With applications in r, (2014).
A Jones, Using the right tool for the job: An analysis of item selection statistics for criterionreferenced tests (ProQuest, 2009).
T Kang, AS Cohen, IRT model selection methods for dichotomous items. Appl. Psychol. Meas. 31(4), 331–358 (2007).
A Kouneli, G Solomou, C Pierrakeas, A Kameas, Modeling the knowledge domain of the java programming language as an ontology (Springer, 2012).
DR Krathwohl, A revision of bloom’s taxonomy: An overview. Theory Into Pract. 41(4), 212–218 (2002).
P Kline, A handbook of test construction: Introduction to psychometric design (Methuen, 1986).
P Kline, Handbook of psychological testing (Routledge, 2013a).
P Kline, Personality: The psychometric view (Routledge, 2013b).
MC Lee, DY Ye, TI Wang, Fifth IEEE International Conference on. Java learning object ontology (IEEE, 2005).
FM Lord, Applications of item response theory to practical testing problems (Routledge, 1980).
C Matuszek, J Cabral, MJ Witbrock, J DeOliveira, in Proceedings of the 2006 AAAI Spring Symposium on Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering. An introduction to the syntax and content of cyc, (2006), pp. 44–49.
BJ Mason, R Bruning, Providing feedback in computerbased instruction: What the research tells us, (2001). http://dwb.unl.edu/Edit/MB/MasonBruning.html.
R Mitkov, LA Ha, A Varga, L Rello, in Proceedings of the Workshop on Geometrical Models of Natural Language Semantics. Semantic similarity of distractors in multiplechoice tests: extrinsic evaluation (Association for Computational Linguistics, 2017), pp. 49–56.
R Mitkov, L An Ha, N Karamanis, A computeraided environment for generating multiplechoice test items. Nat. Lang. Eng. 12(02), 177–194 (2006).
R Mitkov, LA Ha, in Proceedings of the HLTNAACL 03 workshop on Building educational applications using natural language processingVolume 2. Computeraided generation of multiplechoice tests (Association for Computational Linguistics, 2017), pp. 17–22.
S Murugan, RP Bala, G Aghila, An ontology for exploring knowledge in computer networks. Int. J. Comput. Sci. Appl. (IJCSA). 3(4), 13–21 (2013).
OpenCyc, Opencyc for the semantic web. http://sw.opencyc.org/.
A Papasalouros, K Kanaris, K Kotis, in eLearning. Automatic generation of multiple choice questions from domain ontologies, (2017), pp. 427–434.
A Papasalouros, K Kotis, K Kanaris, Automatic generation of tests from domain and multimedia ontologies. Interact Learn Environ.19(1), 5–23 (2011).
Protege ontology library, Protege wiki (2017). http://protegewiki.stanford.edu/wiki/Protege_Ontology_Library.
MD Reckase, Multidimensional item response theory (Springer, 2009).
KM Schmidt, SE Embretson, Item response theory and measuring abilities. Handb Psychol (2003).
G Schwarz, Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978).
D Seyler, M Yahya, K Berberich, Knowledge questions from knowledge graphs (2016). arXiv preprint arXiv: 1610.09935.
MD Toland, Practical guide to conducting an item response theory analysis. J Early Adolesc. 34, 120–151 (2014).
TUO, Manchester, Data networking. http://www.eee.manchester.ac.uk.
TUO, Manchester, Computer networks. http://studentnet.cs.manchester.ac.uk/ugt/COMP28411/syllabus.
RL Thorndike, E Hagen, Measurement and evaluation in psychology and education (2017).
R Studer, VR Benjamins, D Fensel, Knowledge engineering: principles and methods. Data Knowl. Eng. 25(1), 161–197 (1998).
Y Susanti, T Tokunaga, H Nishikawa, H Obari, Evaluation of automatically generated English vocabulary questions. Res Pract Technol Enhanced Learn. 12(1), 11 (2017).
M Uschold, M Gruninger, Ontologies: Principles, methods and applications. Knowl. Eng. Rev. 11(02), 93–136 (1996).
EV Vinu, PS Kumar, Automated generation of assessment tests from domain ontologies. Semantic Web. 8(6), 1023–1047 (2017).
F Zhang, BA Lidbury, Evaluating a genetics concept inventory. Bioinformatics: Concepts Methodol Tools Appl, 29–41 (2013).
Acknowledgments
Not applicable.
Funding
Not applicable.
Availability of data and materials
Students’ responses to the three tests evaluated in this paper are available at the link below. https://drive.google.com/open?id=0B25z6hoT8MGnNWZxUkdLbmdoMlE
Author information
Affiliations
Contributions
MD is the main author of this manuscript; she collected the data, performed the analysis on all samples, interpreted the data, and wrote the manuscript. MMG supervised the development of the work, helped review the article, and gave final approval of the version to be submitted. NF helped with the experimental design and with setting up the environment for collecting the participants’ data, and also helped review the article. All authors read and approved the final manuscript.
Corresponding authors
Correspondence to Mona Nabil Demaidi or Mohamed Medhat Gaber or Nick Filer.
Ethics declarations
Ethics approval and consent to participate
The research study carried out in this paper was approved by the Committee on the Ethics of Research on Human Beings at the University of Manchester in 2013.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Keywords
 Ontology
 Ontology-based question generator
 Ontology-based generation strategy
 Bloom’s taxonomy
 Classical Test Theory
 Question difficulty index
 Question discrimination index
 Reliability
 Discrimination power
 Item Response Theory