Common Framework for Mathematics – Discussions of Possibilities to Develop a Set of General Standards for Assessing Proficiency in Mathematics

The article discusses the challenges of, and solutions for, developing a common, general standards-referenced student assessment framework for mathematics. Two main challenges are faced. The first and main challenge is the lack of commonly accepted standards as the basis for criterion- and standards-referenced assessment. The second challenge is that, even when criteria and standards for the assessment exist, the descriptions of the standards are in many cases so vaguely worded that it is not possible to create unambiguous test items on the basis of a specific level of the standards. An initial common framework for mathematics standards – a common framework for mathematics – is introduced on the basis of the Common European Framework of Reference for Languages (CEFR).


INTRODUCTION
It is not rare to see the following type of description of a standard for mathematics: "the student understands information given by numbers, symbols, diagrams and charts used for different purposes and in different ways in graphical, numerical and written material". This description comes from the National Qualifications Framework (NQF) used mainly within the Commonwealth of Nations. It is regarded by the British government as Entry level 3 in the competence-based standards for adult numeracy in England (Parsons, 2012; CLS, 2014). The same kind of wording may be seen in many national curricula. The challenge in assessing the proficiency level of students or test takers by using this kind of standard is obvious: the objective or standard is so broadly worded that, practically speaking, every possible mathematical test item can be added to the test and we could still say that we measure this standard. It would be quite an easy task to develop a test measuring whether the objective is reached in which all pupils at the end of their first school year could earn full marks. However, it would also be possible to develop an alternative test in which very few 9th graders would find correct answers. This kind of description characterizes a typical one-way standard. It may tell the instructor or the decision maker what we are expecting as an output, but it cannot be used in national or international testing settings as a basis for item writing without a heavy process of operationalization and a national consensus on what is meant by "student understands", "information given by numbers, symbols, diagrams and charts", "different purposes", "different ways", and "graphical, numerical and written material". This kind of standard cannot be used in any reasonable way in a pure criterion- or standards-referenced assessment (CRA/SRA; the naming is discussed in Section 1.2); it leads easily to a norm-referenced assessment (NRA).
The ultimate challenge in SRA in regard to mathematics is the (lack of) existence of commonly accepted and comparable two-way standards. Actually, there are no internationally recognized universal standards for any school subject other than languages.1 This does not mean that it would be useless to try to create such standards for mathematics too. When there are no standards at all, or the one-way "standards" are so vaguely worded that it is difficult (or impossible) to create a set of true standards-referenced tests on the basis of them, what can one do? Where to start the process of creating the standards? What to take into account? What would be the basis of such a standard - theories of human psychological development or cognitive psychology, practical everyday life needs, or a competence-based, action-driven theory as used, for instance, in the Common European Framework of Reference for Languages (CEFR)?
This article discusses the challenges in creating a standards-referenced assessment system for assessing proficiency in mathematics in such a way that the standards can be used strictly in item writing, that is, as two-way standards. Sections 1.1 onwards discuss SRA from the historical and conceptual viewpoints. Section 2 briefly handles some known challenges in SRA, and Section 3 discusses an initial suggestion for general standards for mathematics proficiency: a common framework for mathematics (CFM). The preliminary ideas of the general framework in reference to mathematics proficiency (GFRM) suggested by Metsämuuronen (2016a) are reshaped, broadened, and deepened.

SRA in a Nutshell - A Brief Historical Note and Definitions
Though the idea of standards- or criterion-referenced assessment, testing, or measurement has been known for a millennium (Reynolds & Fletcher-Janzen, 2002, 260), the concept of a "criterion-referenced measure" was first proposed by Glaser and Klaus (1962). Usually, though, Glaser's later writing (1963) is referred to; the criterion-referenced measure is related to a student's acquisition of knowledge along a continuum ranging from no proficiency to perfect performance. Glaser indicated that specific behaviors might be identified as standards for each level of knowledge: the "criterion-referenced measure indicates the contents of the behavioral repertory, and the correspondence between what an individual does and the underlying continuum of achievement" (Glaser 1963, 520). Another leading advocate of criterion-referenced measures was Popham (Popham & Husek, 1969; Popham, 1978a; 1978b). In the early 1980s, Hambleton (1980) noted, based on Gray (1978), that between 1963 and 1978 there were already 57 different definitions of criterion-referenced tests - and he added one more (Hambleton 1981). One can only imagine the myriad of literature and definitions on the topic since those days. In the ERIC database, one finds over 4,000 publications with the keyword "criterion-referenced" - the descriptor "criterion-referenced tests" alone can be found almost 3,000 times at the time of writing (Summer 2016). One of the many definitions comes from Brown (1988, 4). He, nicely hitting the essence, notes the basic idea of comparing the students not with each other but with the criterion: "An evaluative description of the qualities which are to be assessed … without reference to the performance of others". Mehrens and Lehmann (1991, 18) highlight the same characteristic by comparing norm-referenced measurement (NRM) and criterion-referenced measurement (CRM): "… the most logical distinction between NRM and CRM has to do with whether the score is compared with other individual's scores (norm referencing) or to some specified standard or set of standards (criterion referencing)."

From the measurement viewpoint, Mehrens and Lehmann (1991, 16) see, behind the division between NRA and SRA, a radical difference in thinking about the scores: whether the scores are taken to carry relative (NRA) or absolute (SRA) meaning. Modifying the example given by Mehrens and Lehmann (1991), let us think about a student taking two tests in the final examination, say, in mathematics and in the native language. Assume that the student scores 70 and 80, respectively. If the scores are taken as absolute and the scores in the different tests are thought to correspond with each other, it seems obvious that the student is better in the native language than in mathematics. If the scores are taken as relative and we compare the student with the other students, it may appear that, in the test for the native language, the test-taker is ranked close to the median of the test-takers, while in mathematics she/he might be the best test-taker in the whole population. The rank depends strictly on the difficulty level of the tests and the proficiency level in the sample/population. NRA is very effective when one needs to prepare discriminative tests and to study whether there are differences between certain groups, such as sexes or geographical areas. SRA is useful when one wishes to study changes in proficiency. Both approaches have their own strengths and weaknesses - this article focuses on standards-referenced assessment and measurement (SRM).
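To make the distinction concrete, here is a minimal sketch in Python with hypothetical cohorts, scores, and cut-offs (none of them from the article), contrasting a norm-referenced reading of a score (percentile rank in the group) with a standards-referenced reading (comparison with fixed cut-offs).

```python
# A minimal sketch (hypothetical cohorts, scores, and cut-offs) contrasting a
# norm-referenced and a standards-referenced reading of the same raw scores.

def percentile_rank(score, all_scores):
    """Norm-referenced: share of test-takers scoring at or below this score."""
    return 100.0 * sum(s <= score for s in all_scores) / len(all_scores)

def standards_level(score, cut_offs):
    """Standards-referenced: compare the score with fixed, pre-set standards."""
    level = "below basic"
    for label, cut in cut_offs:          # cut_offs listed in ascending order
        if score >= cut:
            level = label
    return level

math_scores   = [35, 40, 45, 50, 55, 60, 65, 70]   # hypothetical cohort, hard test
native_scores = [60, 70, 75, 80, 85, 90, 92, 95]   # hypothetical cohort, easy test
cuts = [("basic", 50), ("good", 65), ("excellent", 80)]  # hypothetical standards

student_math, student_native = 70, 80
print(percentile_rank(student_math, math_scores), standards_level(student_math, cuts))
print(percentile_rank(student_native, native_scores), standards_level(student_native, cuts))
# Relatively (NRA), the 70 in mathematics is the top of this cohort while the 80
# in the native language sits near the median; absolutely (SRA), the native-
# language score nevertheless reaches the higher fixed level ("excellent").
```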

Criterion- or Standards-referenced Assessment?
There seems to be a slight confusion between the terms 'criteria' and 'standards' - in many cases, in speech, they are used interchangeably. However, technically and historically, there is a major difference between these concepts. The confusion in the loose wording of the concepts is that standards such as "elementary - basic - independent - fluent proficiency" are called criteria, which, apparently, is a misconception. Popham (2014), in his recent retrospective analysis 45 years after his first writing on criterion-referenced testing (Popham & Husek, 1969), claims that four implementation mistakes have distorted the use of criterion-referenced measurement over the years. One of those mistakes was the misconception that a criterion is a level of performance instead of a domain of knowledge or skill. Also, Mehrens and Lehmann (1991, 17) were more or less disappointed with some early developers (such as Popham, 1978b; Hambleton & Eignor, 1978; or Nitko, 1980) because they were prone to use the term "criterion"; according to Mehrens and Lehmann, a more appropriate and accurate term would have been "domain-referenced".2 Note that Linn and Gronlund (2000, 43) also used the term "domain-referenced".
On the basis of the Oxford English Dictionary (1983), Sadler (1987, 194) reminds us that a criterion refers to a characteristic or dimension of performance, such as "proficiency in mathematical operations" or "proficiency in Arithmetic". A criterion does not tell the quality of the matter, while a standard does. Sadler (1987) reminds us that, when something is meant to be of high quality, it is not normally said to be 'of a high criterion', although it may be said to be 'of a high standard'. A criterion may have several types of standards, such as "fail/pass", "fail - satisfactory - good - excellent", or "elementary - basic - independent - fluent". Thus, by using this logic, CEFR divides language proficiency into four criteria (proficiency in Reading, Writing, Speaking, and Listening) with six standards (A1, A2, B1, B2, C1, C2). However, this logic seems to have changed since about the time of Glaser (1963), after which a 'criterion' has been used to mean the particular score that is taken to designate competence or mastery (for a tracing of the shift in meaning, see Glass, 1978).
Though it is too late to change the long-standing and stone-carved concepts, in most cases the concept of standards-referenced assessment would be more appropriate than criterion-referenced assessment (Sadler, 1987).3 In what follows, SRA is used instead of CRA whenever it is appropriate. In the historical discussion, CRA is used because the early writers tended to use the concept "criterion" rather than "standards" or "domain".

Challenges in Using Cut-off Scores in SRA
In many cases, test developers are prone to use cut-offs of the score to indicate the standard level. Mehrens and Lehmann (1991, 17) remind us that the early developers of criterion-referenced assessment did not mean that the standard is a cutting score (Glaser & Nitko, 1971, 653; Nitko, 1980, 50; of the early development, see also Dziuban & Vickery, 1973). Nitko specifically stated that one ought not to "confuse the meaning of criterion-referencing with the idea of having a passing score or cut off score" (Nitko, 1980, 50, emphasis original). However, modern test theory allows rather elegant solutions for using cut-offs in standard setting. These require Item Response Theory (IRT) or Rasch modeling as well as item-wise content analysis. In these procedures, after several testing settings with calibrated items and equated test scores, it is possible to estimate what kind of score would be gained by those students who were able to solve a certain type of item - maybe only the simplest or the most demanding items. This knowledge can be used when setting the standard for individual test-takers. One may find this kind of logic, for instance, in the PISA community (OECD, 2014).
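As a rough sketch of that logic (not the actual PISA or any operational procedure), the following Python snippet assumes a hypothetical, already-calibrated Rasch item bank, finds the ability at which the easiest type of item is solved reliably, and reports the expected raw score at that ability as a tentative cut score.

```python
# A rough sketch of IRT/Rasch-based cut-score logic. The item difficulties below
# are assumed (hypothetical) calibration values, not real ones.

import math

def p_correct(theta, b):
    """Rasch model: probability that a person with ability theta solves an
    item with difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_score(theta, difficulties):
    """Expected raw score on a calibrated test for ability theta."""
    return sum(p_correct(theta, b) for b in difficulties)

# Hypothetical calibrated item bank: easy items near b = -1, demanding near b = +2.
item_bank = [-1.5, -1.0, -0.8, -0.5, 0.0, 0.3, 0.8, 1.2, 1.5, 2.0]

# Lowest ability (on a coarse grid) at which the easiest-type item (b = -1.0)
# is solved with at least 70% probability.
theta_low = next(t / 10 for t in range(-30, 31) if p_correct(t / 10, -1.0) >= 0.7)
print("ability matching the easiest items:", theta_low)
print("expected raw score at that ability:", round(expected_score(theta_low, item_bank), 2))
# That expected score could then serve as a tentative cut score for the lowest standard.
```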
However, in too many cases, test developers seem to set strict scores indicating passing and failing the standards in less elegant manners (see a critical discussion in Sadler, 1987). For example, 30% or less correct in a test may mean "fail", while higher values indicate "pass". The challenges of using cut-off scores to indicate the standard are obvious. In order to give comparable cut-offs, one needs strictly parallel or equated tests between the years, very finely nuanced standards for the set of criteria, and very high construct validity in the test development. Even if all these are taken into account, on what basis would the test developer define that 30% of the score means "fail" or 80% means "excellent"? The boundaries are more or less dependent on the difficulty levels of the tests. Sadler (2009; 2012) notes that the assumptions underlying standard setting based on cut-off scores do not hold up when it comes to setting and holding standards: "The fundamental problem with it follows from a basic property of measurement in education: the aggregates are not composed of standardized points or units, neither does a given score increment necessarily represent the same achievement increment at all parts of the scale. In addition, aggregates are usually made up of scores derived from all summative tests and tasks in the course, leaving the equivalence of score units derived from different instruments completely unexamined. Basically, there are as many underlying scales and units as there are assessment instruments." (Sadler, 2012, 208.) Using cut-off scores without IRT or Rasch modeling and test score equating (see Béguin, 2000), test score scaling (AERA, APA, NCME, 2014), or test score linking (Linn, 1993; Mislevy, 1992), connected with a heavy content-wise analysis of the test items, may lead to an illusion that the test user is using SRA or CRA. Nevertheless, loose wording in the standards and using cut-off scores in standard setting easily leads to a naïve criterion-referenced testing of the type "all curriculum-based assessment is criterion-based assessment". This comes very close to the situation warned of by Angoff (1974): "one only has to scratch the surface of any criterion-referenced assessment system in order to find a norm-referenced set of assumptions lying underneath".4

4 Wiliam (1993, 341), as an example, seems to defend the loose wording in standards: "[N]o criterion, no matter how precisely phrased, admits of an unambiguous interpretation. … [W]e have to use norms, however implicitly, in determining the appropriate interpretations… [T]he criterion is interpreted with respect to the target population". The fact alone that, in language testing, the international board for CEFR has been able to create standards independently of the target population makes Wiliam's argument seem an excuse for ambiguous wording in standards.
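The dependence of a fixed percentage cut-off on test difficulty can be illustrated with a small simulation; the ability distribution, item difficulties, and the 30% rule below are all assumptions made for the example, not values taken from any real assessment.

```python
# A minimal simulation (all numbers hypothetical) of why a fixed percentage
# cut-off such as "less than 30% correct = fail" depends on test difficulty.

import math, random

random.seed(1)

def rasch_p(theta, b):
    """Rasch model probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def share_failing(abilities, difficulties, cut=0.30):
    """Share of the cohort falling below a fixed proportion-correct cut-off."""
    failing = 0
    for theta in abilities:
        score = sum(random.random() < rasch_p(theta, b) for b in difficulties)
        if score / len(difficulties) < cut:
            failing += 1
    return failing / len(abilities)

cohort = [random.gauss(0.0, 1.0) for _ in range(2000)]   # one and the same cohort
easy_test = [-1.0] * 20                                   # 20 easy items
hard_test = [+1.0] * 20                                   # 20 demanding items

print("failing on the easy test:", share_failing(cohort, easy_test))
print("failing on the hard test:", share_failing(cohort, hard_test))
# The identical cohort and the identical 30% rule produce very different "fail"
# rates, so the cut-off reflects the test, not a fixed standard of proficiency.
```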

Item Writing, Reliability, and Standard Setting in SRA
When the standards are well-nuanced, two-way standards, they allow item writing without a heavy process of operationalization. For example, in the extended CEFR standards by the Finnish National Board of Education (FNBE, 2004; see Section 2.1), the (abridged) description for the standard A1.3 in Reading is as follows (FNBE, 2004, 283): [At this level, the student…]
• Can read familiar and some unfamiliar words.
• Understands very short messages dealing with everyday life and routine events or giving simple instructions.
• Can locate specific information required in a short text (postcards, weather forecasts).
From the item-writing viewpoint, the two-way standards themselves give valuable hints about what kind of items would be relevant at this level. In this case, items related to a simple postcard with familiar words, a weather forecast, the timetable of buses/trains/libraries, or short personal, unofficial emails would be proper for measuring whether the test taker has reached level A1.3.
Generally, standards-referenced measurement includes items that are directly relevant to the learning outcomes to be measured, without regard to whether the items can be used to discriminate among students. No attempt is made to eliminate easy items or alter their difficulty. If the learning tasks are easy, then the test items will be easy. The goal in SRM is to obtain a description of the specific knowledge and skills each student can demonstrate. (Linn & Gronlund, 2000, 43.) Hence, in SRM, the reliability issues are secondary5 but the validity issues are crucial; the test items need to be directly involved in the criteria (e.g., items for measuring the proficiency in Arithmetic or mathematical operations) and standards (e.g., measuring the satisfactory, good, or excellent level). From basic classical test theory, one remembers that reliability gets higher the more there are differences between the test-takers. In the extreme situation in SRA, if (almost) all the test-takers get the same score - as when measuring a very low proficiency level (e.g., A1 in CEFR) in a more advanced group (e.g., C-level native speakers) - the reliability would be zero or even negative for technical reasons. In SRA, this is not necessarily a problem: if all the items were deliberately written to match the low-level standard, they should all be easy, because the aim of the test is not to compare the students with each other but with the pre-set standard.

5 Note that in norm-referenced testing the reliability issues, that is, the discriminative characteristics of the tests, are usually more highlighted than in standards-referenced testing. Popham and Husek (1969, 3) noted that reliability indices based on variability in the dataset, such as alpha reliability, "are not only irrelevant to criterion-referenced uses, but are actually injurious to their proper development and use". On the other hand, whenever the score or sub-scores are used as a basis for standard setting, the reliability issues are relevant. Kane (1986, 221), for example, suggests that "the test-based procedure is found to improve the accuracy of universe score estimates only if the test reliability is above 0.50".
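The technical reason why an alpha-type reliability collapses when score variance vanishes, noted in the paragraph above, can be shown with a minimal sketch: a Cronbach-type alpha computed on a fabricated dataset in which an advanced group answers an easy, low-standard test almost perfectly. All data here are invented for illustration.

```python
# A minimal sketch of why alpha-type reliability collapses when a test aimed
# at a low standard is given to an advanced group (fabricated data).

def cronbach_alpha(item_matrix):
    """item_matrix: rows = test-takers, columns = 0/1 item scores."""
    k = len(item_matrix[0])
    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    item_vars = [var([row[j] for row in item_matrix]) for j in range(k)]
    total_var = var([sum(row) for row in item_matrix])
    if total_var == 0:
        return float("nan")          # no variance between test-takers at all
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# An advanced group taking an easy (A1-type) test: almost everyone solves almost
# every item; the few misses are scattered and unrelated to each other.
responses = [[1] * 10 for _ in range(30)]
responses[3][2] = 0
responses[17][8] = 0
responses[25][5] = 0

print(cronbach_alpha(responses))
# Alpha is essentially zero (here slightly negative) although every item was
# deliberately written to match the low standard; in SRA this is not a flaw.
```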
When the criteria and standards are in use, and the test is developed to measure those standards, one of the many standard-setting methods is then used to determine the level of the test-taker. There are a great number of such methods - Kaftandjieva (2004, 11) calculated that there are more than 50, and many of them have several modifications. One of the recent ones is Metsämuuronen's Three-phased Theory-based and Test-centered method for the Wide range of proficiency levels (3TTW; Metsämuuronen, 2013; cf. Metsämuuronen's 2TTW in 2009), developed specifically for use in national-level student assessment where, in many cases, it is important to get an image of the national distribution of proficiencies with one shot. It has been used in Finland in assessing the proficiency in Finnish of Swedish-speaking students (Toropainen, 2010), Finnish as a Second Language (Kuukka & Metsämuuronen, 2016), Mathematics in Vocational Education (Metsämuuronen, 2016b; Metsämuuronen & Salonen, 2016), and Sustainable Development in Vocational Education and Training (Räkköläinen & Metsämuuronen, 2016). It has also been used in Nepal in assessing the national proficiency level in Nepali (Acharya, Metsämuuronen, & Adhikari, 2013; Metsämuuronen, Acharya, & Aryal, 2013; Acharya & Metsämuuronen, 2014), English (Metsämuuronen, 2014), as well as Mathematics and Science (ERO, 2015). In 3TTW, the items are originally either written to measure a certain level of the standards, or ready-made items are classified on the basis of the (assumed typical) proficiency level needed to solve the problem. The proficiency levels of the test-takers are assessed on the basis of their success in shorter subtests comprising only those items related to a certain standard level. The method assumes that a test-taker whose true proficiency level is low cannot pass the subtests of the higher levels requiring higher proficiency. The profile of the test-taker in the subtests, the (IRT-modelled and equated) score, and the knowledge of the distribution of all test-takers are used in the final standard setting. (Metsämuuronen 2013.)
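The subtest logic can be sketched roughly as follows; this is only an illustration of the profile idea, not the actual 3TTW procedure (which also uses the IRT-equated score and the distribution of all test-takers), and the 50% pass criterion is an assumption made for the example. The level labels follow the CFM/FNBE notation used in this article.

```python
# A rough sketch of the subtest-profile idea behind level classification.
# NOT the full 3TTW procedure; the pass criterion is a hypothetical threshold.

LEVELS = ["A1.1", "A1.2", "A1.3", "A2.1", "A2.2"]

def subtest_profile(responses_by_level, pass_share=0.5):
    """responses_by_level: dict level -> list of 0/1 item scores on that level's
    subtest. Returns the highest level whose subtest (and all below it) was passed."""
    reached = None
    for level in LEVELS:
        items = responses_by_level.get(level, [])
        passed = items and sum(items) / len(items) >= pass_share
        if passed:
            reached = level
        else:
            break   # a low-proficiency test-taker is assumed not to pass
                    # the subtests of the higher levels
    return reached or "below " + LEVELS[0]

example = {
    "A1.1": [1, 1, 1, 1],
    "A1.2": [1, 1, 0, 1],
    "A1.3": [1, 0, 1, 0],   # exactly at the 50% threshold
    "A2.1": [0, 0, 1, 0],
    "A2.2": [0, 0, 0, 0],
}
print(subtest_profile(example))   # -> "A1.3"
```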

CHALLENGES IN STANDARDS-REFERENCED ASSESSMENT
While NRA is better suited for comparing the distributions and the individual test-takers in regard to the population, SRA is better suited for measuring learning progress, as discussed above. The real challenge in standards-referenced testing is the set of standards itself. Some known challenges, both general and specific, are discussed briefly in what follows.

General Challenges in SRA
Some general challenges of criterion-referenced testing can be condensed as follows (for a more exhaustive treatment, see Criterion-referenced Tests, 2014). First, the tests in SRA are only as accurate or fair as the learning standards upon which they are based. If the standards are vaguely worded, or if they are either too difficult or too easy for the test-takers being assessed, the associated test results will reflect the flawed standards. Green (2002) notes that difficulties can arise when the level descriptions do not give clear definitions of progress or do not relate to realistic progression. Cox (1995) pointed out that, in England, the level descriptions in the national curriculum did not have the carefully defined progression that was necessary to allow reliable interpretations.
Second, the process of determining proficiency levels and passing scores on standards-referenced tests can be highly subjective or misleading - and the potential consequences can be significant, particularly if the tests are used to make high-stakes decisions about students, teachers, and schools. Because reported "proficiency" rises and falls in direct relation to the standards or cut-off scores used to make a proficiency determination, it is possible to manipulate the perception and interpretation of test results by elevating or lowering either the standards or the passing scores. Even the reputations of national education systems can be negatively affected if a large percentage of students scores high in the national tests but fails to achieve "proficiency" on international assessments. Even without manipulating the interpretation, transforming the total score into proficiency levels may lead to odd and implausible results, as shown by Metsämuuronen (2013) and Metsämuuronen, Acharya, and Aryal (2013): in an early exercise of transforming the score into several proficiency levels, the obviously normally distributed national proficiency distribution turned into an odd Bactrian-camel type of distribution (Takala & Kaftandjieva, 2009).
Third, the subjective nature of proficiency levels allows the tests to be exploited for political purposes to make it appear that schools are doing either better or worse than they actually are. For example, some years ago it was reported that a state in the USA was accused of lowering the proficiency standards of its tests to increase the number of students scoring higher and thereby avoid the consequences - negative press, public criticism, large numbers of students being held back or denied diplomas - that may result from large numbers of students failing to achieve expected or required proficiency levels. In order to prevent the manipulation of results in national or international (sample-based) assessments, one cannot do much without a heavy army of inspectors at all levels of the processes. If the country, state, province, or city is willing to manipulate the results, it is possible, though obviously unethical. In the most innocent case, only the best areas, schools, or students are selected to participate in the tests.

Specific Challenge in SRA - What Kind of Standards Are We Willing to Create?
The standards can be classified into two main categories: general and local ones (Figure 1). The general standards reach beyond local curricula or national needs. Some examples of these mentioned within this article are CEFR (Council of Europe, 2011), used mainly in Europe when comparing language proficiencies in different countries, the National Qualifications Framework (NQF) (QCA, 2006), used in the Commonwealth countries, and the PISA standards (OECD, 2014). This category may also include the Common Core State Standards for Mathematics (CCSSM) (K-8 Publishers, 2013), used in many - if not all - states in the USA. The NQF is a typical set of one-way standards; item writing on the basis of its descriptions is very demanding. The PISA standards are based on cut-off scores; their use in any other assessments is difficult. In what follows, the general descriptions of the CEFR standards are used as an inspiration for developing the common framework in reference to mathematics, and the aim is to develop a set of common standards for mathematics while keeping in mind the two-way characteristics of the outcome.
The local standards are meaningful within national or school-wise assessments. These can be further divided into those attached to pre-set standards and those attached to norms and scores. Two examples of the pre-set standards are the systems in New Zealand6 and Finland7 with the fixed level of "good performance". Two examples of the local standards related to norms and scores are the often-used A-F standard, which uses test scores and cut-offs in the standard setting, and the Finnish Matriculation Examination.

6 See http://nzcurriculum.tki.org.nz/National-Standards/Mathematics-standards/The-standards
7 See http://www.opetushallitus.fi/download/47672_core_curricula_basic_education_3.pdf (pp. 160-168)

Generally, in national- and international-level student assessment, SRA makes sense in certain circumstances: 1) when the standards are available, 2) when the system shows the practical differences between the levels of the standards, 3) when the standards are expressed so precisely that they can be used in the item-writing process without a heavy program of operationalization, and 4) when the structure of the levels is so detailed and nuanced that the students can credibly be classified into the levels both at the beginning of the learning process and at the mastery level.
With the last two points, the challenge can also be the opposite, as raised by Popham (2014): too detailed a set of criteria and standards may be very heavy to use and may even destroy the whole SRA system by being impractical in use. In many cases, however, the opposite is the challenge: the standards are expressed so vaguely that standards-referenced testing is hardly possible in its strict form.8 Vague wording in standards - often seen in curricula (of the type "pupils are expected to write in a way which is interesting, conveying meaning clearly in the chosen form for an intended reader") - leads to a situation where the standards can be interpreted differently by different scorers and across different grades, depending on the grades and programs of study (Green 2002, 7). If we fail in SRA, notes Pollit (1994, 69), "we are in danger of implementing a system of tests that behave like thermometers, all pretending to measure on the Celsius scale, but which actually each have their own freezing point and each their own idea of what constitutes a nice summer's day."

POSSIBILITIES IN CREATING COMMON STANDARDS IN MATHEMATICS 9
The first challenge in creating or developing a set of criteria and standards in reference to mathematics is to decide whether one is willing to develop a set of general standards or local standards - the general standards are handled in Section 3.1. The second challenge is related to the dimensions or criteria of the new standards - some options are discussed in Section 3.2. The third challenge is related to the hierarchical levels in the standards - this is handled in Section 3.3. An initial suggestion for a common framework in reference to mathematics is finally handled in Section 3.4. These sections substantially elaborate, broaden, and deepen the tentative discussion of Metsämuuronen's (2016a) General Framework in Reference to Mathematics (GFRM).

General Standards
The ultimate challenge for standards-referenced assessment is the existence of the standards, as discussed above. There are no internationally recognized universal standards for any school subject other than languages. Where to start the process of creating the standards? What to take into account? What would be the basis of such a standard? This article does not answer all of these questions thoroughly - it just gives initial suggestions for further development.
An example of general standards is the aforementioned CEFR classification, as discussed above. The original CEFR classification comprises six stages of language proficiency (A1, A2, B1, B2, C1, and C2). In Finland, it was noticed at quite an early phase that six basic levels were not a fruitful basis for student assessment in schools: the CEFR levels were enhanced to include more levels (Table 1).
From the mathematics viewpoint, the logic of the standards is not necessarily as transparent as in languages. However, mathematics is an ultimate language with its own syntax and logic; learning mathematics shares somewhat the same logic as learning languages. Also, as in language learning, the new material in mathematics is more or less cumulative. In what follows, in Sections 3.3 and 3.4, a parallel system, like the FNBE extension of CEFR, is discussed from the mathematics viewpoint.

8 Green (2002, 7) uses the term "true" criterion referencing and refers to Popham (1980). Green supposes that Popham would not accept criteria which could allow a range of interpretations, as is needed in loosely worded and categorized criteria.
9 […] All shared somewhat the same opinion that creating a common set of standards for mathematics (or any other school subject) would be an interesting though demanding task.

Dimensions in the Common Standards
The first thing is to decide what the dimensions or criteria of the standards in mathematics would be. In the CEFR standards, the criteria for languages are set for the domains of Reading, Writing, Listening, and Speaking. In mathematics, there are several possible ways to go. One direction would be a domain-wise division. This path would lead to criteria or domains such as "proficiency in Algebra", "proficiency in Arithmetic", "proficiency in Geometry", "proficiency in Functions", "proficiency in Sets", "proficiency in Statistics", and so on. The content-area- or domain-wise division is supported by the fact that there are content-wise specialties to learn; problems in one content area do not necessarily correlate with those in another content area (Räsänen, 2015).
Another direction would be a competence-wise division. This divides mathematics learning into five to eight dimensions of competencies, skills, or abilities (Hannula, 2015). In Europe, some popular classifications are the Niss-Jensen-Højgaard model (Niss & Jensen, 2002; Niss & Højgaard, 2011), its further adaptation by Lithner and colleagues (2010), and a further reduced model by Säfström (2013). The original Niss-Jensen-Højgaard model comprises eight competencies.10 Lithner and colleagues (2010) reduced the competencies to six.11 Säfström (2013) reduced them further to five.12 In both the Niss-Jensen-Højgaard model and Lithner and colleagues' model, a three-level grading (or set of "standards") is in use: "Interpret - Do and use - Evaluate/Judge". The possible challenge with the competence-based divisions is that they were not developed for standards-referenced testing but rather for pedagogical purposes. Though the categorization seems relevant, using the original classification does not seem to take into account differences in achievement levels. The apparent challenge of this path can be illustrated with the following blunt example. Think about a first-grade pupil with the simple arithmetic problem 1 + 2 = ?. When the pupil solves the problem (that is, "interprets" the task properly, "does and uses" proper mathematical tools, and "evaluates/judges" whether the result is correct in an appropriate way), the beginner mathematician seems to get the highest-level grading in all five to eight dimensions even though (s)he, apparently, is quite far from real mastery in mathematics.
The third direction, elaborated further in what follows, is to divide achievement into three dimensions: 1) proficiency in mathematical concepts, 2) proficiency in mathematical operations, and 3) proficiency in mathematical abstractions and thinking.13 The first two dimensions have to do with mechanical calculation and the third one with transforming problems into a mathematical form. The rationale of the first two dimensions is somewhat obvious. In order to master even the simplest and most mechanical mathematical task, a certain level of proficiency in mathematical concepts is needed: the concepts of numbers and their representations (like 'five' = 5 = *****)14 as well as the consecutive nature of the numbers. For geometry, certain basic shapes such as the triangle, square, and circle should be recognized and their names should be remembered even though their mathematical properties are not familiar. The rationale of proficiency in mathematical abstractions and thinking is that the essence of mathematical proficiency is to transform everyday-life mathematical challenges into a mathematical form and to solve the problems by using mathematical operations. Without mathematical abstraction and thinking skills, the proficiency in operations and concepts is more or less useless. As an example, one may know how to form a derivative mechanically but not know when to use the skill.
These dimensions can easily be connected to the modern theories of long-term memory discussed within cognitive psychology (see a condensed discussion of the concepts from the learning point of view in Metsämuuronen & Mattsson, 2013). The basic theories of the human mind claim that human long-term memory includes two parts: declarative and procedural (or non-declarative) memory (e.g., Squire, 2009). The declarative memory concerns things that can be brought to mind and declared. Procedural memory stores motoric and cognitive skills and habits, and its contents cannot be put into words (Poldrack & Packard, 2002; Squire, 2009; Ullman, 2004). Declarative memory can be further divided into semantic and episodic (or narrative) memory (e.g., Tulving, 1983; Bruner, 1986; 1990; 1991). Now, the proficiency in mathematical concepts relates more to the declarative semantic memory, while the proficiency in mathematical operations relates to the procedural memory. The proficiency in mathematical abstraction and thinking may relate to both flavors of the long-term memory - or it may be connected with the episodic part of the declarative memory.
The mathematical operations are more or less hierarchically organized in the normal educational process (see Section 3.3). For instance, in order to be able to manage powers, the procedures of multiplication are needed. Further, to learn multiplication, the procedure of addition is needed. Obviously, for the operations of addition, the basic concepts of numbers are needed. Also, it is wise to start learning mathematics with concrete things, such as summing and subtracting Natural numbers, before introducing decimals and Rational numbers. Not only should there be a hierarchical system in the standards (see Section 3.3), but there is a hierarchical system also in the criteria for the concepts, operations, and mathematical thinking, as hinted above. The proficiency in concepts may be independent of the proficiency in operations, but the proficiency in operations depends on the proficiency in concepts. That is, a young child may have fluency in naming and recognizing numbers and basic shapes but cannot use those in any mathematical operations. In order to use any of the mathematical operations, the concepts are needed. Primarily, the proficiency in operations can be taken as the engine of the two: one may ask what concepts are needed in order to master the mathematical operations. Also, the operations and concepts may, in some cases, be independent of mathematical thinking - a student may be able to solve a mechanical problem (such as the task 1 + 2 = ? above) by (mechanically) using a proper mathematical operation and knowing the proper concepts without being able to do (much) mathematical abstraction in the case. However, it is difficult to imagine how a student would be able to perform mathematical thinking and abstraction without some elementary mathematical concepts and operations in mind. Hence, it seems that, of the dimensions, mathematical thinking is the highest in the hierarchy, the proficiency in concepts is the base of the hierarchy, and the mathematical operations mediate between the concepts and mathematical thinking (Figure 2).
Naturally, the connection between the concepts is not this simple - especially, the role of mathematical abstraction and thinking may be more crucial in understanding the mathematical operations than expressed here. Also, as always when learning new things, the latent general intellectual capacity is a crucial factor - maybe a large part of the "mathematical abstraction and thinking" falls into this category? However, at this phase of developing the standards, this division and model may be appropriate enough.
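One way to operationalize this hierarchy is to encode the assumed dependencies of Figure 2 and flag proficiency profiles that would contradict them. The sketch below is only an illustration; the dimension names follow the text above, but the numeric proficiency levels and the example profiles are invented.

```python
# A small sketch that encodes the assumed hierarchy of Figure 2
# (concepts -> operations -> abstraction/thinking) and flags proficiency
# profiles that would contradict it. Numeric levels are hypothetical.

DEPENDS_ON = {
    "operations": "concepts",                  # operations presuppose concepts
    "abstraction_and_thinking": "operations",  # thinking presupposes operations
}

def inconsistent_with_hierarchy(profile):
    """profile: dict dimension -> proficiency level (e.g., 0-5).
    Returns the pairs where a higher dimension exceeds the one it depends on."""
    problems = []
    for higher, lower in DEPENDS_ON.items():
        if profile[higher] > profile[lower]:
            problems.append((higher, lower))
    return problems

# A young child: fluent with names of numbers and shapes, no operations yet.
child = {"concepts": 3, "operations": 0, "abstraction_and_thinking": 0}
# An implausible profile: strong mathematical thinking without the concepts.
odd   = {"concepts": 0, "operations": 1, "abstraction_and_thinking": 4}

print(inconsistent_with_hierarchy(child))   # -> []
print(inconsistent_with_hierarchy(odd))     # -> two contradicted dependencies
```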

Hierarchical Structure in the General Standards
It may be relevant to create a system with a hierarchical structure of proficiency in mathematics as a school subject in relation to everyday-life use. An initial suggestion for the classification - let us call the system the Common Framework for Mathematics (CFM) - is based on the CEFR classification as modified by FNBE (2004) (Table 2). Though the basis of CFM comes from the CEFR levels and the logic seems to follow the basic logic of the CEFR classification, the names of the levels in Table 2 are mainly different from those in CEFR - only the names of the elementary basics (A1.1 to A1.3) are the same. In CFM, the idea is, differing from the CEFR system, that the A level is more or less the basic level with relevance to everyday life. The B level is an advanced level with less relevance to everyday life but high relevance to further studies in several professional areas such as statistics, engineering, or economics. The C level is left for the professional-level mathematics needed either in the practical fields (such as for Statisticians, Advanced Researchers, Economists, or Engineers) or in the theoretically oriented fields (such as for professors or researchers of pure mathematics, physics, astronomy, or chemistry).

Initial Suggestion for a Common Framework of Standards in Reference to Mathematics
Some of the relevant concepts of the CFM are collected in Table 3 and initially divided into the relevant levels. The descriptions in the system are based on the Finnish curriculum (FNBE, 2004; 2014 for basic education; FNBE, 2003; 2015 for upper secondary general education) and the descriptions of "good performance" at different levels of mathematics proficiency development (FNBE, 2004). Naturally, the classification is open for revisions - this is just an initial suggestion for the standards. Table 3 (see more details in Table 4) is an attempt to show what kind of set of criteria a general standard could be. The system is not necessarily very practical when it comes to assessing professional mathematicians or university-level mathematics students' proficiency levels. However, it may fulfil the needs of compulsory education up to grade 12 quite reasonably. Specifically, it may give valuable insight into the level of children who are entering the educational system and their mathematical proficiency. Note that the CEFR idea of also telling what the test taker cannot do at a specific level is applied to the CFM system.

An excerpt of the level descriptors of the system follows; at level A1.1, the student
• is familiar with the numbers, but their use in mathematical operations is very limited
• recognizes the basic two-dimensional shapes (circle, square, triangle) and their three-dimensional counterparts (ball, box, and pyramid) and can tie their names with pictures
• can express some limited mathematical expressions, such as the order of the numbers
• knows the importance of numbers in stating amount and order; knows how to write numbers, but the proficiency in using formulated mathematical expressions is very limited

A1.2 Developing elementary proficiency
• can use natural numbers in the range 1-100
• can operate with the basic operations of adding and subtracting
• knows the basic forms of plane and three-dimensional figures, including the quadrangle, triangle, circle, sphere, and cube
• understands the concept of zero
• cannot perform multiplication and division
• can evaluate the sensibility of the solution only to a very limited extent

A1.3 Functional elementary proficiency
• can use natural numbers with fluency in the range -∞ to +∞
• knows about and understands the decimal system as a place system, and knows how to use it
• understands addition, subtraction, multiplication, and division and knows how to apply them in everyday situations
• understands the concept of a rational number
• can evaluate the sensibility of the solution
• cannot raise a number to a natural-number power or divide a number into its prime factors
• cannot use proportion, percentage computation, and other calculations in solving problems that come up in day-to-day life

A2.1 Developing basic proficiency
• uses proportion, percentage computation, and other calculations in solving problems that come up in day-to-day life
• can formulate a simple equation concerning a problem connected to day-to-day life and solve it either algebraically or by deduction
• can calculate circumference, area, and volume
• understands the meaning of probability and randomness in day-to-day situations
• knows how to determine the coordinates of a point in a coordinate system
• does not master powers or squares
• cannot look for the null point of a linear function

A2.2 Functional basic proficiency
• masters the basics of powers and their connection to multiplication
• masters the square and its connection to practical situations
• finds similar, congruent, and symmetrical figures and is able to apply this skill in investigating the properties between two angles in simple situations
• reads various tables and diagrams, and can determine frequencies, average, median, and mode from the given material
• knows how to look for the null point of a linear function

16 At level B, the order of the contents is quite free. Here, the order of the contents at level B follows the Finnish curriculum for upper secondary general education. The logic in numbering the domains is not obvious, though it makes sense. It may be that, in countries other than Finland, the order of the courses differs from this. Hence, the domain of Advanced Statistics and Probability (from B2.2), for example, can be placed in any of the B levels because it is more or less an omnipotent entity which does not require previous studies of Derivatives, for example. However, some of the domains are not fully independent of each other. For example, it does not make much sense to require proficiency in advanced derivatives before the basic derivatives are mastered.
17 (FNEB, 2003). The wordings are transformed to be somewhat more competence-oriented. For example, in quite many places the original wording of the learning objectives of the type "will learn to" is changed to "can" or "is able to".
18 Proficiency of Operations is based on the Finnish curricula. Hence, the names of the domains also come from the Finnish curriculum. However, it may be easy - or at least possible with little work - to apply the system to different contexts having slightly different content areas in the curricula.
19 Some practical examples of mathematical thinking come from Hannula (2015). Note that especially this column is not fully operational at level B. Examples of mathematical thinking can be based on the competencies given by Säfström.

Mathematical thinking
• has a basic understanding of the concepts of adding, taking away, dividing something equally, and multiplication by using adding as a rationale
• has a basic understanding of unseen numbers (for example, what number is missing in the consecutive order)
• has a basic understanding of how to place things in order; to find opposites for things; to classify things according to different attributes; to state the location of an object, for example by using the words above, below, on the right, on the left, behind, and between
• cannot demonstrate an understanding of concepts associated with mathematics by using them to solve problems, or by presenting and explaining them to the teacher and other pupils
• is not able to reach justified conclusions and to explain what they have done, and does not know how to present their solutions by means of pictures and concrete models and tools, orally and in writing; for example, cannot judge which of two pictures/computations corresponds with an oral mathematical task, or cannot recognize whether a logical inference is correct or not (of the type: "I have a number which is bigger than 2. Is it smaller than 5?")
• cannot connect the numbers with geometry (triangle - three angles; square - four angles)
• does not know how to compare the size of sets, using the words more, fewer, as many, a lot, and a few, or to write and use the comparative symbols <, =, and >
• knows how to measure with simple measuring devices, and knows the main quantitative expressions, such as length, mass, volume, and time
• is able to note the necessary information in simple, day-to-day problems, and to use their mathematical knowledge and skills to solve these problems

Data processing, statistics, and probability
• recognizes different types of charts and illustrations for representing results but cannot read or interpret them
• knows that information can be collected by interviewing but cannot gather data and organize, classify, and present them as statistics; cannot read simple tables and diagrams
• cannot clarify the number of different events and alternatives, or judge which is an impossible or a certain event
• can demonstrate an understanding of concepts associated with mathematics by using them to solve simple problems, and by presenting and explaining them to the teacher and other pupils; for example, can judge which of the possible strategies in adding and subtracting is most effective (6+7 can be solved as 6+4+3 or 6+6+1 or 5+5+1+2) or can formulate a mathematical abstraction and operands from a simple oral task
• is able to reach justified conclusions for simple mathematical problems and to explain what they have done, and knows how to present solutions by means of pictures and concrete models and tools, orally and in writing; for example, can judge which of two pictures/computations corresponds with a simple oral mathematical task, or can recognize whether a logical inference is correct or not (of the type: "I have a number which is bigger than 2. Is it smaller than 5?")
• knows how to compare the size of sets, using the words more, fewer, as many, a lot, and a few, and to write and use the comparative symbols <, =, and >; for example, can do basic logical inference with hidden numbers (of the type X < 5 and X > 3, so X = 4)
• can judge which of two pictures/computations corresponds with a simple oral task
• can recognize whether a logical inference is correct or not (of the type: "I have a number which is bigger than 2. Is it smaller than 5?")
• can connect the numbers with geometry (triangle - three angles; square - four angles)
• cannot use the concepts associated with mathematics in problem-solving; cannot present them in diverse ways - with instruments, pictures, symbols, words, numbers, or diagrams
• cannot communicate their conscious observations and thoughts by acting, speaking, writing, or using symbols
• cannot depict real-world situations and phenomena mathematically by comparing, classifying, organizing, constructing, and modelling
• cannot group or classify on the basis of a given or chosen criterion, look for a shared attribute, distinguish between a qualitative and a quantitative property, or describe groups of things and objects, positing true and untrue propositions about them
• cannot present mathematical problems in a new form; cannot interpret a simple text, illustration, or event and make a plan for solving the problem
• cannot use rules
• can apply addition, subtraction, multiplication, and division in everyday situations
• can use the decimal system in terms of decimal fractions; can use negative numbers and fractions
• can estimate in advance the magnitude of the result and, after the problem is solved, check the stages of the calculation and evaluate the sensibility of the solution
• can formulate and continue number sequences and present correlations
• can do the basic calculations with points, line segments, horizontal lines, rays, lines, and angles with simple plane figures
• understands the meaning of the order of operations in calculations but may make mistakes in the actual calculations
• cannot estimate a possible result and prepare a plan for solving a problem
• cannot raise a number to a natural-number power or divide a number into its prime factors
• cannot solve problems in which a square root is needed
• cannot use proportion, percentage computation, and other calculations in solving problems that come up in day-to-day life

Algebra
• can perform simple calculations with a first-degree equation as a modification of the hidden-number task (2 + a = 3, what is a? a = 1)
• cannot solve a simple first-degree equation
• cannot reduce a simple algebraic expression
• cannot perform calculations with powers
• cannot formulate a simple equation concerning a problem connected to day-to-day life and solve it either algebraically or by deduction

Geometry
• can form figures following the instructions given, can notice the properties of simple geometric figures, and is familiar with the structure formed by the concepts of plane figures
• recognizes similarity; can reflect a figure across a line, and dilate and reduce figures by a given ratio; can recognize figures that are symmetrical in relation to a line
• can evaluate the size of the object being measured and the sensibleness of the measurement's result, and knows how to state that result in appropriate units of measurement
• can calculate the area and perimeter of parallelograms and triangles
• can use simple reflections and dilations
• can notice the properties of basic geometric figures
• knows the concepts of circumference, area, and volume but cannot calculate them
• cannot recognize other than the basic geometric forms or know their properties
• cannot use compass and ruler to make simple geometric constructions
• cannot find similar, congruent, and symmetrical figures and cannot apply this skill in investigating the properties between two angles in simple situations
• cannot apply the relationships between two angles in simple situations
• cannot use the Pythagorean theorem and trigonometry to solve the parts of a right triangle
• cannot perform measurements and related calculations or convert the most common units of measurement

Data processing, statistics, and probability
• can gather data and organize, classify, and present them as statistics
• is able to read simple tables and diagrams
• is able to clarify the number of different events and alternatives, and to judge which is an impossible or a certain event
• understands the meaning of probability and randomness in day-to-day situations but cannot determine the number of possible events or organize a simple empirical investigation of probability
• cannot read various tables and diagrams, or determine frequencies, average, median, and mode from the given material

Functions
• can illustrate two variables in the coordinate system
• can continue a simple number sequence according to the rule given but cannot describe verbally the general rule for a given number sequence
• cannot determine the coordinates of a point in a coordinate system
• cannot prepare a …

This kind of set of standards could be used as a basis for item writing; items at different levels (within a standard) can be identified. Note that, at each level, the items on a topic can be easy, medium, or demanding. Let us take an example from the domain of Derivation. At the lowest (measurable21) level of proficiency in concepts, the students can recognize and recall the name of the concept (such as the Derivation) but cannot necessarily remember or understand its connection to the mathematical operations. At a more advanced level, they can meaningfully connect the concept with the operation. At the highest level, they have a confident theoretical understanding of the concept, its use, as well as its limitations.

21 Naturally, the lowest absolute level would be no proficiency at all.
At the lowest (measurable) level of proficiency in operations, the students can recognize and select a proper operation (like the Derivative) without (necessarily) knowing or remembering how to use it properly. This happens easily if one has not used the advanced methods for years, for example. At a more advanced level, the students select the proper operation and can apply it in the most elementary and simple situations but do not remember its nuances by heart. At the highest level, the operation can be used in a nuanced way with complex problems.
At the lowest (measurable) level of proficiency in mathematical abstraction and thinking, the students can solve mechanical problems just by using the basic operations, rules, and formulae for a specific problem. For example, the student may solve the derivative of a simple function just by recalling the basic rule without much mathematical thinking. At a more advanced level, the students can abstract a simple concrete (oral, written, or practical) problem into a mathematical form, select proper tools, and solve the problem. At the highest level, the students can abstract a complex (oral, written, or practical) problem into a mathematical form, select proper tools, and solve the problem; solving the problem may require combining several concepts and operations from different areas of mathematics.
The hierarchical nature of the standards means that even the easiest item of level B1.1 cannot be mastered by students at the very elementary level. This leads to the possibility that a more or less standard (norm-referenced type of) test can be administered within each level if needed; the score and cut-offs can be used in a standard setting process. The advantage of the classification is that the proficiency level can be defined by knowing which kinds of items the test-taker manages.
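In practice, such two-way standards could drive item writing and subtest assembly by tagging each item with its level, dimension, and within-level difficulty; the short sketch below illustrates the idea with invented items and tags (none of them from the actual CFM tables).

```python
# A minimal sketch of item metadata for standards-based item writing and
# subtest assembly. Item codes, levels, and tags are invented for illustration.

from dataclasses import dataclass

@dataclass
class Item:
    code: str
    level: str        # e.g., "A1.3", "B1.1"
    dimension: str    # "concepts", "operations", "thinking"
    difficulty: str   # "easy", "medium", "demanding" within the level

BANK = [
    Item("it01", "A1.2", "operations", "easy"),
    Item("it02", "A1.3", "operations", "medium"),
    Item("it03", "A1.3", "thinking",   "demanding"),
    Item("it04", "A2.1", "concepts",   "easy"),
    Item("it05", "B1.1", "operations", "easy"),
]

def subtest_for(level, bank=BANK):
    """Pick all items written for one standard level, regardless of how well
    they would discriminate in the tested group."""
    return [item for item in bank if item.level == level]

print([item.code for item in subtest_for("A1.3")])   # -> ['it02', 'it03']
```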

DISCUSSION
It is more or less a pity that the global or regional societies of mathematics teachers and evaluators have not developed a common ground for assessing mathematical proficiency. The lack of common ground has led to widely deviating standards in different corners of the globe, though the essence of mathematics is the same for all. It seems that the only global sets of standards are those used in the PISA and TIMSS settings. But, first, their internal logic is based on the scores on the PISA/TIMSS scale - and hence it is not easy to use the standards in any settings other than PISA/TIMSS - and, second, the internal logic of the standards is not fully transparent. This article gives an open invitation to start developing (or to continue developing) a common framework for all.
It is not obvious that "the structure is the standards", as claimed by Daro, McCallum, and Zimba (2012; also in K-8 Publishers, 2013). It is quite easy to create such a structure of curricula that it cannot be used as two-way standards for assessing mathematics performance. Yes, we might know what topics should be taught to the students, but it is very easy to create very many different kinds of tests of different difficulty levels on the basis of loosely worded "standards". It is quite certain that each country with an educational system has some kind of "structure" for teaching mathematics. That does not mean that these would be "standards" - or at least usable standards for international settings.

[Table excerpt, C level: masters the theoretical mathematics; masters the philosophy of mathematics.]
Preparing this kind of common standards is a huge task, and it requires experienced mathematics teachers and curriculum developers, trade unions, psychologists, and politicians to participate in the process. Without a large consensus, it is difficult to convince the audience of the rationale behind the levels. The challenge in creating the set of common standards is that, if and when there are no explicit and strict criteria for the different school subjects, or they are worded vaguely in the curricula (as, for example, in Mathematics), it easily takes a lot of time to convince all the relevant players in the field of the standards. In Finland, developing criteria for just one level, "good", took several years and a lot of discussion between the different stakeholders. The Common European Framework of Reference for Languages (CEFR) took 10 years to build up. Even if the curriculum includes explicit sentences and standards for the educational goals, Green (2002) reminds us, those involved in assessment, test development, teaching, and curriculum development need to understand levels of performance and the nature of progression in the curriculum and to develop an understanding of standards of performance within a community of practice. Such a body of knowledge would help to increase the credibility of valid, reliable assessment of what students know, understand, and can do in a context of transparency, clarity, and shared understanding.
As the set of standards of CFM is now more or less a rough sketch, I hope it can encourage the reader to bring forward ideas on how to develop it further.