Measuring students’ conceptual understanding of real functions: A Rasch model analysis

ABSTRACT


INTRODUCTION
Mathematics teaching requires frequent assessment of students' conceptual understanding in order to respond to revealed misconceptions in a timely manner and to give students the opportunity to understand new concepts that build on those previously adopted. This is particularly important when students encounter the concept of a real function, which cannot be treated as an isolated concept (it is connected to sub-concepts such as domain, extrema, and the flow of the function), since it connects algebra and geometry and can be represented in various ways (Dreyfus & Eisenberg, 1982). Understanding one sub-concept of a function helps students to understand a new concept. Students' ability to represent function concepts in multiple contexts is of great importance for the cognitive development of their conceptual knowledge of functions (Bell, 2001). Questions that require transitioning from one representation to another are therefore useful for assessing students' understanding of a concept. A solid understanding of a concept such as function requires recognizing and moving smoothly between its different representations; according to Hitt (1998), this is an effective way to evaluate students' conceptual understanding.
Creating an item bank whose many items function together requires several steps. In a recent study, Hrnjičić et al. (2022) detailed the development of a real-function item bank known as conceptual understanding of real functions (CURF). CURF covers different kinds of functions, the zero of a function, the sign of a function, even and odd functions, the limit and asymptotes of a function, maxima/minima, and the increase and decrease of a function. The item bank measures conceptual understanding with items that require the ability to shift between representations of the same concept, including verbal, algebraic, and graphical representations.
After creating an item bank, we must regularly assess whether its items still effectively measure the given construct (Glamočić et al., 2021). Since the content and cognitive validity of the items were established in a previous study by Hrnjičić et al. (2022), our goal in this study is to use the Rasch model to determine construct validity, evaluate other psychometric properties of the item bank, and ensure its ability to measure first-year university students' comprehension of real functions. To examine a larger sample of students more efficiently, we created a computer-assisted assessment (CAA) as a web application containing all items from the CURF item bank and administered it to a sample of respondents similar to that of Hrnjičić et al. (2022). We chose the Rasch model because it helps us ensure the quality of our item bank (Boone et al., 2013; Wolfe, 2000). It also enables us to evaluate whether the items in the bank work together to measure a single construct and to inspect the three basic assumptions of the Rasch model: item fit, unidimensionality, and local independence. Additionally, we can check whether any items are biased with respect to students' gender or the type of high school completed. Together, this study and the paper by Hrnjičić et al. (2022) provide a systematic overview of developing an item bank within the Rasch framework.

THEORETICAL BACKGROUND
Real functions of one real variable are an indispensable part of the mathematics curriculum in primary school, high school, and university. Students encounter difficulties in understanding the concepts covered by the CURF item bank (Hrnjičić et al., 2022). To be able to check for possible misconceptions at any moment, a quality assessment offering reliable and trustworthy evaluations is necessary. Objectivity, reliability, and validity of measurement are crucial in determining the quality of assessments (Glamočić et al., 2021). Objectivity refers to the degree to which measurement results depend on the person who administers the test and scores the students' answers (Marcus & Bühner, 2009). To fulfil the requirement of objectivity, we developed an online CAA. A CAA is well suited to computer marking because the correct answers are defined in advance, so there is no subjectivity in marking. Bull and McKenna (2004) defined a CAA as "the use of computers to assess students"; its purpose is to deliver, mark, and analyze assignments or examinations (Bull & McKenna, 2004). Previous studies (Ashton et al., 2003; Nugent, 2003; Sealey et al., 2003) have demonstrated that CAA is an effective method for evaluating students and delivering timely feedback on individual and class progress. Test validation involves gathering evidence to determine whether the test items accurately represent the concept domain and whether the test measures the proposed properties (Day & Bonn, 2011). Reliability refers to the consistency of a test within itself and over time, and it is determined through statistical calculations that examine both individual items and the test as a whole (Day & Bonn, 2011).
The CURF item bank consists of 32 items and was developed by Hrnjičić et al. (2022). The content validity of the item bank was verified through an expert survey, while the cognitive validity, wording, and clarity of the items were examined using student surveys. Descriptive statistics within the framework of classical test theory (CTT) were used to evaluate both the field-tested items and the test as a whole. The item and test analyses showed that all 32 items possess good psychometric characteristics and can be integrated into a scale that reliably measures students' CURF in university introductory mathematics courses (Hrnjičić et al., 2022). To validate this further, we conducted a second pilot test of the 32 items using the Rasch model.

A Rasch Model Approach
Tests can be designed using either CTT or probabilistic test theory (Mešić et al., 2019). CTT assumes that a person's test score comprises their "true" score and a measurement error. In contrast, probabilistic test theory makes statements about the outcome as probabilities for observable variables (Schmid, 1992). In CTT, we measure the knowledge dimension by adding up the scores on all test items. In item response theory (IRT), the unobservable trait is known as the latent trait (θ), and we cannot measure it directly (Bond & Fox, 2015). The Rasch model is considered the simplest probabilistic test model (Bond & Fox, 2015). Rasch (1960) devised a probabilistic model for specific intelligence and achievement tests. According to this model, a person with higher ability than another should have a greater probability of solving any item of the same type; likewise, if one item is more difficult than another, any person has a greater probability of solving the easier item. The Rasch model represents each person through their ability ($\theta$) and each item through its difficulty ($b_i$), so that both can be placed as numbers along a single line (Bond & Fox, 2015). The likelihood of someone succeeding on a test item is established by their ability (reflected in the number of correct answers) compared with the difficulty of the item (reflected in the number of people who have solved it). Different probabilistic test models exist, and they vary in the shape of the item characteristic function, which models the relationship between a person's responses to test items and their ability level (Bond & Fox, 2015; Hambleton et al., 1991). According to Hambleton et al. (1991), the Rasch model's item characteristic function can be expressed as in Eq. (1):

$$P_i(\theta) = \frac{e^{\theta - b_i}}{1 + e^{\theta - b_i}} \quad (1)$$

where $P_i(\theta)$ represents the probability that a randomly selected examinee with ability $\theta$ will answer item $i$ correctly, and $b_i$ is the difficulty of item $i$.
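To make Eq. (1) concrete, the following minimal Python sketch (ours, not part of the Winsteps analysis reported below) evaluates the Rasch success probability and the logit transform on which the estimates are expressed:

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability that a person of ability theta answers an item of
    difficulty b correctly, per the dichotomous Rasch model (Eq. 1)."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

def logit(p: float) -> float:
    """Log-odds transform: the scale on which Rasch estimates are reported."""
    return math.log(p / (1.0 - p))

# When ability equals difficulty, the success probability is exactly 0.5:
print(rasch_probability(0.0, 0.0))  # 0.5
# A person one logit above the item's difficulty succeeds with probability ~0.73:
print(rasch_probability(1.0, 0.0))  # ~0.731
print(logit(0.731))                 # ~1.0, recovering the ability-difficulty gap
```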
The software yields log-transformed estimates of person ability and item difficulty, presented on a logit scale (a logarithmic transformation of the odds of success). Every estimate comes with an associated margin of error (Bond & Fox, 2015).
The Rasch model is a construct validation instrument (Fisher Jr, 1994). Cronbach and Meehl (1955) introduced the term "construct validity", which, according to Overton (1999), represents the extent to which a test (or score) can be used to estimate a theoretical construct or trait. The Rasch model implies that observed behaviors are indicators of an underlying construct (Bond & Fox, 2015). The model also allows reliability to be estimated for both persons and items. The item reliability index assesses the reproducibility of item placement along the measured variable if the same items were given to another group of individuals with similar characteristics. The item separation index indicates the spread, or separation, of items on the measured variable. Likewise, the person reliability index measures the reproducibility of person ordering on a parallel set of items measuring the same construct, and the person separation index estimates how well individuals can be distinguished on the measured variable. In this article, the Rasch model's three core assumptions (item fit, unidimensionality, and local independence) are used in analyzing the construct validity and reliability of the CURF item bank. Unidimensionality means that only one latent trait underlies the item responses, while local independence requires that test items not be related beyond the latent trait; local dependency may be evident when a participant's responses to an item depend on their answers to other test items. Item fit statistics are divided into outfit statistics, which are sensitive to unexpected responses far from the item's or person's measure, and infit statistics, which are sensitive to unexpected responses close to the measure (Bond & Fox, 2015).

METHODOLOGY

Research Design
Our research design follows the steps outlined by Liu (2010): testing a representative sample of the target population, conducting Rasch modelling, reviewing item fit statistics and the Wright map, and establishing validity and reliability claims for the measurement instrument. For this research, we created an online CAA containing all 32 items, presented in the same order as in the original CURF bank. We opted for an online CAA accessible via a link because of its advantages: rapid scoring and feedback, originality, flexibility, interactivity, innovation, cost reduction and operational efficiency in terms of paper use and scoring services, item banking, and consistency with item response theory (Brown, 2013).

Participants
At the beginning of the study, 207 first-year college students enrolled in introductory mathematics courses participated in the research. Convenience sampling was used to obtain the student sample, based on the availability of access to state universities. We obtained informed consent from all participants, who were tested at the beginning of the first semester, before they attended any classes on the concepts covered in CURF. The inclusion criteria were that students had studied these concepts in the final year of high school and had followed a similar high school mathematics curriculum. At the beginning of the analysis, we excluded 40 students who answered "no" to the question "Have you studied real functions in high school?", as well as seven students who had not studied these concepts under the same curriculum as the other students. Thus, our final sample consisted of 160 freshmen: 92 students from the Faculty of Traffic and Communication of the University of Sarajevo, 39 from the Faculty of Science of the University of Sarajevo, and 29 from the Faculty of Mechanical Engineering of the University of Zenica. Of these, 64 students were female and 96 were male; 70 had graduated from grammar school and 90 from technical high school.

Instrument
CURF is a multiple-choice test with a dichotomous scoring system: one point for a correct answer and zero points for a wrong one. Each question has four possible answers, one of which is correct; the other three are distractors. The CURF CAA test can be taken online without a time limit. The test consists of 32 questions spread across four cards. Card A contains questions that require a transition from graphical to verbal representation, card B questions that demand a transition from graphical to algebraic representation, card C questions that require a transition from algebraic to graphical representation, and card D questions that require a transition from algebraic to verbal representation. Each concept is covered by four questions that measure the ability to move between representations of that concept. Items A1, B1, C1, and D1 require the ability to recognize the kind of a function (e.g., linear, quadratic, etc.). Items A2, B2, C2, and D2 refer to the concept of the zero of a function. Items A3, B3, C3, and D3 relate to even functions, while A4, B4, C4, and D4 relate to odd functions. Items A5, B5, C5, and D5 examine the ability to determine on which intervals a function is positive and on which it is negative, while items A6, B6, C6, and D6 examine understanding of the limit and asymptotes of a function. Items A7, B7, C7, and D7 require the ability to recognize on which intervals a function increases or decreases, and items A8, B8, C8, and D8 refer to the concept of the maximum/minimum of a function. At the end of the test, students are shown a report on how many items they answered correctly, as well as the percentage of correct answers. Question cards in English are available at: https://sites.google.com/view/supplementalmaterialaphd1/.
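The card-by-concept design can be summarized compactly. The sketch below is an illustrative encoding of the item grid; the dictionary and function names are ours, not taken from the CAA implementation:

```python
# Hypothetical encoding of the CURF item grid: each card fixes a
# representation shift, and each row (1-8) fixes a function concept.
CARDS = {
    "A": ("graphical", "verbal"),
    "B": ("graphical", "algebraic"),
    "C": ("algebraic", "graphical"),
    "D": ("algebraic", "verbal"),
}
CONCEPTS = {
    1: "kind of function",
    2: "zero of a function",
    3: "even function",
    4: "odd function",
    5: "sign of a function",
    6: "limit and asymptotes",
    7: "increase/decrease",
    8: "maximum/minimum",
}

def describe(item: str) -> str:
    """Map an item code such as 'B6' to its representation shift and concept."""
    card, row = item[0], int(item[1:])
    src, dst = CARDS[card]
    return f"{item}: {src} -> {dst} representation of '{CONCEPTS[row]}'"

print(describe("B6"))  # B6: graphical -> algebraic representation of 'limit and asymptotes'
```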

Procedure
Before testing the students, we organized meetings with faculty administration and the professors who deliver lectures in introductory mathematics courses. On this occasion, we asked for their cooperation and introduced them to the details of conducting the test. Each faculty had a specific date and time for the test. At the beginning of the test, the subject professor introduced the students to the test procedure and shared the link for accessing the test. To further motivate the students, the professors decided to count the test as a student activity graded during the semester.

Data Analysis
We used WINSTEPS version 3.72.3 for our Rasch analysis, following Szabó's (2008) suggestion to check for person misfit by examining negative point-biserial score correlations and person infit/outfit. Positive point-biserial correlations indicate that items function as predicted, while negative or near-zero values suggest inconsistencies with the construct. We also used mean-square residual (MNSQ) and standardized mean-square residual (ZSTD) statistics to assess item fit. After addressing these factors, we checked for multidimensionality using principal component analysis (PCA) of the residuals. We calculated reliability and separation indices for items and persons and verified local item independence by examining standardized residual correlations. We also used the Wright map to explore the relationship between item difficulty and person ability, and investigated differential item functioning (DIF) with respect to gender and high school type. Overall, these quality measures aimed to ensure the fairness, accuracy, and consistency of our analysis.
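For readers who want to reproduce these checks outside Winsteps, the following sketch illustrates the point-biserial and MNSQ computations on a scored 0/1 response matrix. It assumes person measures theta and item difficulties b have already been estimated, and it follows the standard textbook formulas rather than Winsteps' exact implementation:

```python
import numpy as np

def item_fit_statistics(X, theta, b):
    """Illustrative infit/outfit MNSQ for dichotomous Rasch items.

    X     : (persons x items) matrix of 0/1 scores
    theta : estimated person abilities in logits
    b     : estimated item difficulties in logits
    """
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # expected scores
    W = P * (1.0 - P)                                         # model variances
    Z2 = (X - P) ** 2 / W                                     # squared standardized residuals
    outfit = Z2.mean(axis=0)                                  # unweighted mean square
    infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)        # information-weighted mean square
    return infit, outfit

def point_biserial(X):
    """Correlation of each item's scores with the rest-of-test total score."""
    rest = X.sum(axis=1, keepdims=True) - X   # total score excluding the item itself
    return np.array([np.corrcoef(X[:, i], rest[:, i])[0, 1]
                     for i in range(X.shape[1])])
```

The same formulas apply person-wise by transposing the roles of rows and columns.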

RESULTS
We conducted a Winsteps analysis on the responses of 160 students to 32 items, first checking for person misfit. Six individuals with negative point-biserial correlations were identified and removed from the dataset. We re-ran Winsteps on a sample of 154 students and 32 items and removed one more person due to a negative correlation. After that, we ran Winsteps on a sample of 153 students and 32 items, and the correlations for all items and all persons were positive. According to Linacre (2022), an MNSQ fit statistic is considered productive for measurement if its value lies in the interval 0.50-1.50. All individuals had infit MNSQ values within this acceptable range (the observed values ranged from 0.58 to 1.38). However, two individuals had outfit MNSQ values greater than 1.50, with minimal deviation from the ideal range at 1.66 and 1.62, and their standardized infit or outfit statistics (ZSTD) were outside the acceptable range of -2.00 to 2.00 (Bond & Fox, 2015), with values greater than 2.00. We deleted these two persons from the database. We ran Winsteps again for 151 students and 32 items (Table 1) and concluded that all persons had infit MNSQ below 1.50; only one person had an outfit index slightly above 1.50, namely an outfit MNSQ of 1.52. Since this person had good infit indices and positive correlations, we kept them in the database.
Our analysis found that none of the items had negative point-measure correlations; the point-measure correlation statistics range from 0.24 to 0.55. It is necessary to examine item fit statistics because misfitting items could indicate additional dimensions (Liu, 2010). According to Linacre (2019), a good fit between model and data is indicated when infit and outfit MNSQ values fall within the range of 0.50 to 1.50. In this study, all 32 items had infit statistics within this range, from 0.82 to 1.18. Figure 1 visually displays the infit MNSQ index against item difficulty measures, with the size of each item's bubble reflecting the standard error of its difficulty estimate. Outfit statistics were also within the ideal range for 31 items; the exception was item C6, whose outfit MNSQ value of 1.61 indicated a slight deviation.
The estimates of item difficulty range from -2.10 to 2.03 logits. Person abilities range from -1.74 to 3.87, while three persons who answered everything correctly have a value of 5.12. The mean item measure is 0.00 logits, with a standard deviation (SD) of 0.99. The estimated mean person ability of 0.54, compared with the mean item measure of 0.00, shows that the test was relatively easy for this sample. The person reliability is 0.82, and the item reliability is 0.96. According to Oon et al. (2016), a reliability index close to 1 indicates internal consistency; the item reliability of 0.96 therefore suggests that the construct is estimated consistently.
To check the representative aspect of validity, the item strata separation is examined: for the test to be accepted as representative, there should be at least two levels (Smith Jr, 2001). For persons, the separation index is 2.15 (satisfactory), and for items it is 4.83, showing a wide range of item difficulty. By the Rasch model algorithm, the unstandardized fit estimates (also known as mean squares) are designed to have an average of one (Bond & Fox, 2015).
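These separation indices follow from the reliabilities via the standard relation G = sqrt(R / (1 - R)) (Wright & Masters' formulation). The quick check below (our sketch) approximately reproduces the reported values; small differences are attributable to rounding of the reported reliabilities:

```python
import math

def separation(reliability: float) -> float:
    """Separation index G = sqrt(R / (1 - R)) computed from a Rasch reliability R."""
    return math.sqrt(reliability / (1.0 - reliability))

print(round(separation(0.82), 2))  # ~2.13, close to the reported person separation of 2.15
print(round(separation(0.96), 2))  # ~4.90, close to the reported item separation of 4.83
```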
In our study, the unstandardized item fit statistics for the CURF item bank have means very close to the anticipated value of one (infit mean square = 1.00, outfit mean square = 0.97), with little deviation from the ideal for both (infit SD = 0.11, outfit SD = 0.21). The average fit values and SDs for persons are likewise acceptable (infit = 1.00, SD = 0.18; outfit = 0.97, SD = 0.28).
After examining negative point-biserial and item fit statistics, Linacre (1998) suggests checking for multidimensionality, which we did using PCA of residuals in Winsteps. The raw variance explained by the measures is 26.9%, with 10.6% explained by persons and 16.3% by items. If the data fit the model perfectly, 27.3% would have been explained; the difference between the observed variance and that expected by the model was thus minimal, at only 0.4%. The variance of the residuals represents the unexplained variance in a dataset, while the variance of the persons and items constitutes the explained variance (Bond & Fox, 2015). In our data, the SD of person ability is 1.23, while the SD of item difficulty is 0.99.
According to the nomogram (Linacre, 2008), the expected "variance explained by Rasch measures" is consistent with our data and results. Unidimensionality can be judged using the eigenvalue reported by Winsteps: the critical factor is the amount of variance (the eigenvalue) in the first contrast (Bond & Fox, 2015). If the first contrast has an eigenvalue of less than three units, we can conclude that the test is unidimensional (Linacre, 2022, p. 647).
In our data, the eigenvalue of the first contrast is below three, specifically 2.8 (in item units).
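The first-contrast check can be illustrated outside Winsteps by a PCA of standardized residuals. The sketch below is ours, with synthetic model-generated data standing in for the actual responses; it extracts the leading eigenvalue of the inter-item residual correlation matrix, together with the largest off-diagonal residual correlation used in the local-independence check reported next:

```python
import numpy as np

def first_contrast_eigenvalue(X, theta, b):
    """Leading eigenvalue (in item units) of a PCA of standardized Rasch
    residuals, plus the inter-item residual correlation matrix."""
    P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))  # expected scores
    Z = (X - P) / np.sqrt(P * (1.0 - P))                      # standardized residuals
    R = np.corrcoef(Z, rowvar=False)                          # item-by-item correlations
    return float(np.linalg.eigvalsh(R)[-1]), R

# Synthetic stand-in data: 151 persons x 32 items generated from the model
# itself, so the first contrast should stay well below the threshold of 3.
rng = np.random.default_rng(7)
theta = rng.normal(0.54, 1.23, 151)   # person distribution as reported
b = rng.normal(0.0, 0.99, 32)         # item distribution as reported
P_true = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
X = (rng.random((151, 32)) < P_true).astype(float)

eig1, R = first_contrast_eigenvalue(X, theta, b)
off_diag = R[~np.eye(32, dtype=bool)]
print(eig1, off_diag.max())  # flag concern if eig1 >= 3 or any correlation > 0.7
```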
We investigated local dependency by examining the largest standardized residual correlations. As per Linacre (2017), positive between-item correlations greater than 0.7 may be a cause for concern. In our case, all standardized residual correlations are less than 0.70 (the highest, 0.33, is between items A3 and A4). To learn more about the item-person relationships on the underlying variable, we examined the Wright map (Figure 2), which plots the person distribution against the item distribution.
On the map, items appear on the right side of the variable, while the left side is occupied by Xs representing students; each X indicates two students. The bottom portion of the map comprises the easiest items and the less able students, while the top portion contains the most challenging items and the most able students (Linacre, 2022). According to Bond and Fox (2015), the Rasch model assigns a 50.0% chance of success to a person on an item positioned at the same point on the item-person logit scale. The test information function is a technique used in Rasch analysis to evaluate how a test is functioning (Boone & Staver, 2020) and provides additional insight into person-test targeting (Mešić et al., 2019). Test information expresses the amount of information each item provides about any person parameter (Bond & Fox, 2015). Well-targeted persons yield more information (and less error) than poorly targeted persons (Bond & Fox, 2015) (Figure 3).
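For the dichotomous Rasch model, the test information at ability theta is the sum over items of P_i(theta)(1 - P_i(theta)), with standard error 1/sqrt(I(theta)). The sketch below is ours, using illustrative difficulties spanning the reported range rather than the actual CURF estimates; note that 1/sqrt(6.60) is approximately 0.39, consistent with the peak information and standard error reported in the discussion:

```python
import numpy as np

def test_information(theta: float, b: np.ndarray) -> float:
    """Test information I(theta) = sum_i P_i(1 - P_i) for the dichotomous
    Rasch model; the standard error of measurement is 1 / sqrt(I(theta))."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    return float(np.sum(p * (1.0 - p)))

# Illustrative item difficulties spread uniformly over the reported range:
b = np.linspace(-2.10, 2.03, 32)
info = test_information(0.0, b)
print(info, 1.0 / info ** 0.5)  # information (~6) and standard error near 0 logits
```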
To ensure fairness, we conducted a DIF analysis to check for any bias in our items. Specifically, we tested whether our items functioned similarly for boys and girls using the Mantel-Haenszel procedure, a widely used method for analyzing dichotomous items (Figure 4).
Our results showed that, on average, boys had a measure of 0.52 logits while girls had 0.56 logits, indicating that girls were slightly more successful. Extreme scores from both persons and items were excluded from this analysis, as they do not exhibit differential ability across items. We then inspected whether any item had a DIF contrast of |DIF| ≥ 0.64 logits or was statistically significant (p ≤ 0.05) (Linacre, 2017). Items with |DIF| ≥ 0.64 that are also statistically significant are A8 and D1; both are harder for boys and easier for girls, in line with the conclusion that girls found the test slightly easier than boys did. In Figure 5, we report a uniform DIF analysis based on the type of high school completed. The majority of the items did not demonstrate DIF. However, A4 and B8 show statistically significant DIF (p ≤ 0.05): A4 is more difficult for grammar school graduates, while B8 is more difficult for technical school graduates.
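As a hedged illustration of the Mantel-Haenszel procedure, the sketch below computes the common log-odds ratio for a single item, stratifying examinees by total score. The function and variable names are ours; the 0.64-logit threshold used above corresponds to the ETS "large DIF" category of 1.5 delta units, one delta unit being 1/2.35 logits:

```python
import numpy as np

def mantel_haenszel_log_odds(x, group, total):
    """Mantel-Haenszel common log-odds ratio for one dichotomous item.

    x     : 0/1 scores on the studied item
    group : 0 for the reference group, 1 for the focal group
    total : total test scores, used to form matched ability strata
    Positive values mean the item favors the reference group. This sketch
    assumes each group has both successes and failures overall.
    """
    num = den = 0.0
    for t in np.unique(total):
        s = total == t
        a = np.sum(x[s][group[s] == 0] == 1)  # reference, correct
        b = np.sum(x[s][group[s] == 0] == 0)  # reference, incorrect
        c = np.sum(x[s][group[s] == 1] == 1)  # focal, correct
        d = np.sum(x[s][group[s] == 1] == 0)  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return float(np.log(num / den))

# ETS-style flag: |log-odds| >= 1.5 / 2.35 (~0.64 logits) marks large DIF.
```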

DISCUSSION
In the investigation by Hrnjičić et al. (2022), the CURF item bank underwent pilot testing and was analyzed using descriptive statistics within the framework of CTT. The present study marks the first time the dichotomous Rasch model has been used to calibrate the CURF item bank. The Rasch measurement model is considered superior for construct validation to measures based on CTT (Crocker & Algina, 1986). All items showed positive point-measure correlations, confirming that they operate in the planned direction, along the latent variable. Upon analyzing the overall fit of the data to the proposed model, we found that all 32 conceptual items had very good item fit statistics, so there was no need to remove any from the analysis. The only item slightly out of fit, based on the outfit MNSQ statistic, was item C6; since this is a minimal misfit that would not distort or degrade the measure or the construct, it was not necessary to eliminate the item from the test. To obtain information about how much was measured and how well, we created the pathway map (Figure 1), which supports the following conclusion: most of the 32 items lie on the pathway, so the quality-control aspect of item performance may be considered satisfactory (Bond & Fox, 2015). The item reliability index of 0.96 indicates that the order of item estimates can be expected to replicate in other suitable samples using CURF, and person reliability is at a satisfactory level. Five levels of item performance are identified by the item strata of 4.76, while the person strata of 2.15 indicates that we can distinguish two groups of respondents. All of these values meet the requirements for accurate measurement. The unstandardized item and person infit statistics have a mean value of one, while the unstandardized item and person outfit statistics have a mean very close to the expected one (0.97).
Our unidimensionality examination found that the measures accounted for 26.9% of the raw variance. As per Sumintono and Widhiarso's (2015) guidelines, any value over 20.0% is acceptable; they also recommend that the unexplained variance in each contrast be at most 15.0% for an ideal result. In each of the five contrasts, the unexplained variance was less than 10.0%, which can be considered ideal (Sumintono & Widhiarso, 2015). As the first eigenvalue of the raw variance unexplained by the Rasch model was smaller than three, the instrument is unidimensional.
When analyzing the positions of items along the latent variable continuum of the Wright map (Figure 2), it becomes apparent that most individuals are positioned opposite the items, implying that the items are well suited to these individuals. According to Linacre (2022), new items are needed if there are gaps of more than 0.50 logits in the distribution of item difficulties; the biggest gap here, between A5 and A1, has a value of 0.35. However, there is a group of 26 persons located above the hardest item (C6), in the region above 2.03 logits, which indicates that our test cannot give accurate information about students with extraordinary ability. Our outcomes show that the test is more effective for students with lower abilities than for those who excel on the Wright map. The questions located in the lower section of the Wright map focus primarily on comprehension of the sign of a function (A5: determining where the function is positive/negative based on a given image), identification of functions (A1, C1, & D1: exponential and quadratic functions), and recognition of local minimum and maximum points (A8). The questions located at the top of the map assess understanding of odd functions (A4 & B4), function limits (B6 & C6: connecting function values with appropriate graphs), and function flow (B7: describing the algebraic growth/decrease of a function based on a given graph).
Based on the test information function, the most precise measurements are obtained at approximately zero logits, and the least precise at the low and high ends of the latent trait continuum. The test information function is symmetrical about an ability level of 0.00. The maximum test information is approximately 6.60, corresponding to a standard error of estimate of 0.39; the smallest standard error of measurement is located at the peak of the test information function, so persons whose ability corresponds to that peak are measured most precisely.
We found that our dataset met the assumption of local item independence: no pair of items had a standardized residual correlation greater than 0.70. The highest correlation, 0.33, was found between items A3 and A4, which relate to similar tasks; in item A3, the task was to recognize the graph of an even function among four offered graphs, and in item A4, to recognize the graph of an odd function among the offered graphs. The Rasch calibration was accomplished on 151 students. When comparing male and female students, the analysis showed no statistically significant DIF for most items, indicating test invariance; only in two of the 32 items was the DIF contrast larger than 0.64 logits. Likewise, in comparing DIF based on completed high school, only two items had statistically significant DIF.

Table 1. Estimations of parameters for CURF item bank