The Role of Monitoring Raters in Ensuring Accurate and Meaningful Test Scores. Case Study: RFL Examinations
Providing accurate and meaningful test scores is an extremely important issue especially in the case of high-stakes examinations like the one considered here: the RFL examination, level B2, which practically conditions the admission of students to an academic program in Romanian. The paper aims to describe and explain how data regarding the achievement of raters are collected and analysed in order to ensure rating accuracy and rater reliability. Monitoring, co-ordination, standardization measures all aim at dealing with problems of leniency, inconsistency or severity of raters. The paper details the procedures used for calculating rating accuracy, intra-rater reliability, inter-rater agreement in the case of marking both receptive and productive components of the RFL examination.
The examination taken into consideration here is the test of Romanian as a foreign language (RFL),
level B2. This can be regarded as a high-stakes test as it represents the test one has to pass in order to
obtain a linguistic competence certificate in Romanian which usually conditions the admission in any
academic program taught in Romanian in any university in our country. Currently, the two categories
of population taking this test are: the students enrolled in Babeș-Bolyai University, in the preparatory
year (Faculty of Letters, Department of Romanian culture, language and civilization), and the persons
who simply need the certificate in Romanian in order to be able to start studies in Romania (60-130
candidates per year).
Being a high-stakes test, it is only natural that the organization providing it should regard as
important all aspects concerning its quality. Therefore, the Department of Romanian culture, language
and civilization submitted the test to be audited by ALTE (Association of Language Testers in Europe)
and in 2015 obtained the ALTE Q-mark (a quality indicator showing that the exams provided by the
organization “have passed a rigorous audit and meet all 17 of ALTE’s quality standards” and allowing
test users “to be confident that an exam is backed up by appropriate processes, criteria and standards”
(www.alte.org). The validity argument presented evidence of validity for all aspects of the assessment
process: test development, item writing, test administration, marking and grading, reporting of results,
etc. (Hughes, 1989).
Validity in testing and assessment is defined as discovering whether a test “measures accurately
what it is intended to measure” (Messik, 1989, p. 22). Messick saw validity not as a property of a test
or assessment, but as the extent to which one is allowed to make inferences to a construct from a test
score and the degree to which any decision one might make on the basis of the score is justifiable
(AERA, APA & NCME, 1985, p. 13). This definition of validity has become the accepted paradigm in
psychological, educational and language testing: “Validity is the most important consideration in test
evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific
inferences made from test scores. Test validation is the process of accumulating evidence to support
such inferences. A variety of inferences may be made from scores produced by a given test, and there
are many ways of accumulating evidence to support any particular inference. Validity, however, is a
unitary concept. Although evidence may be accumulated in many ways, validity always refers to the
degree to which that evidence supports the inferences that are made from the score. The inferences
regarding specific uses of a test are validated, not the test itself.” (Fulcher & Davidson, 2007, p. 9).
In what concerns language testing, as we have mentioned above, part of a validity claim is that the
test administration and all the processes used by the testing agency are done according to standardized
procedures and one of the most important aspects of monitoring the quality of those is identifying the
key stages, describing what needs to be done and to what standards and comparing what is actually
done to these standards (Green, 1998, p. 127-128).
The present study focuses on one source of evidence for validity: the grading process. Accuracy of
rating and reliability of test scores are analysed by monitoring raters’ activity.
The participants in the study were 10 professional raters involved in the examination process during
the academic year 2015 (the B2 exam, the Department of Romanian language, culture and civilization,
Faculty of Letters, UBB). The staff involved in the assessment process is carefully selected. The
requirements for markers include: BA or MA in Romanian language, an academic degree in teaching
and assessing Romanian as a foreign language (teacher training module/MA in teaching RFL), at least
2 years of experience in teaching RFL and in preparing students for RFL examinations, familiarization
with the procedures to follow, with the mark scheme and with the answer key (with special focus on
partial credit items, for listening comprehension and for elements of communication construction),
attendance of group training sessions (standardization and training for assessment, in the case of
Each rater assessed a number between 9 and 12 students/papers. Thus, 106 papers assessed by 10
raters were taken into consideration for the study.
In the study both qualitative and quantitative methods have been used. In the case of the RFL
examinations (which have 5 components: Listening, Reading, Elements of Communication
Construction - ECC, Writing and Speaking), the first three components are formed from questions
which allow objective marking. These components are double-marked in order to ensure that the
mistakes are avoided/repaired. After the first marker corrects the three components and inserts the
results in the corresponding boxes, on the second page of the exam paper, the second marker performs
the second correction. If he/she arrives at a different score for any of the three components, he/she
checks his/her own correction as well as the first marker’s in order to discover possible errors of
calculation or mistakes in the application of the mark scheme. If such an error is discovered, then the
marker makes the correction and registers the change. During the marking period, markers can ask for
feedback (as the rating task is performed in the same room, under the supervision of the chief
examiner) but they are also randomly checked by the chief examiner as they carry out their task.
The mark scheme is consistent from one session to the other and the answer key is unambiguous.
For the item types used (multiple choice, true/false, matching, gap-filling) there is either only one
possible correct answer or a very limited and clearly defined number of acceptable answers (partial
credit items). As mentioned above, the assessors are aware of the procedures to follow, they are
familiar with the mark scheme and they get familiarized with the answer key before each session of
examination. As the questions in these three components allow objective marking, in these cases, raters
were only checked for accuracy (how well they apply the marking scheme). Data regarding the number
and the types of errors each rater makes was gathered and processed, as shown in Table
ECC = error when marking the elements of communication construction component L = errors when marking the listening component After this, an error rate [8:79] for each assessor was calculated (Table
As mentioned before, between 9 and 12 papers were considered in the case of each assessor. If the
error rate was too high (higher than 0.02), the assessor will be carefully observed during the following
sessions of examination.
Also, at this level, types of errors that occur within raters were identified in order for the board of
examination to prepare the following training sessions for raters focusing on the most significant
aspects that need to be improved (e.g. partial credit items).
In what concerns the Writing component, each script is analytically marked by two examiners who
use a detailed grid and assessment forms on which they write their comments justifying the score given
for each criterion. If they give widely varying marks (a difference of more than 2 points in the final
marks), the script is marked a third time by the chief examiner. The number of points he/she grants is
taken into consideration as well when calculating the average which represents the final score. If no
such situation appears, the overall score for the writing component is represented by the average of the
two marks given by the two assessors. One method of monitoring raters adopted in the case of RFL
examination is the use of pre-assessed scripts (Council of Europe, 2001, p. 43). The chief examiner
performs this task himself. He makes copies of the written productions of some of the candidates and
rates them, placing the papers back in the pile of unmarked papers. After the rating is done by the rater
who marks all of the test components, his/her marking is compared to the pre-assessed productions. In
the case of each rater, a set of 4 pre-assessed scripts was used and in this way raters were checked for
leniency and severity.
Rating of the Speaking component is carried out simultaneously by two assessorswho have
previously undergone at least one training process. Their only task during the development of the oral
examination is to assess the oral production of the candidates using a detailed grid and assessment
forms for their own comments regarding the performances. If raters give widely varying marks (a
difference of more than 2 points in the final marks), then a discussion takes place during which the
examiner can express her opinion on the production of the candidate. If the two raters do not agree, the
chief examiner will later grade the oral production as well, as each spoken performance is recorded and
can be reassessed anytime.
Rating accuracy was also checked in the case of Writing and Speaking components through
qualitative analysis. Rater’s comments justifying the points awarded for each criterion in the case of
each candidate written on the blank grids (assessment forms) during the session of examination were
analysed and compared to the descriptors in the Department assessment grids. If the assessors’
comments were not consistent with the descriptors in the Department grids (e.g. the assessor’s
comments reflect a performance that would be ranked on level 5 according to the descriptors for level 5
in the Department grid, but he/she ranked it on level 4), raters will be closely monitored and, if needed,
sent to new training sessions.
Inter- and intra-rater reliability was examined only in the case of the marking of scripts and of
spoken performances. For checking
similarly – they do not need to agree completely, but, as they use the same criteria, their ratings should
not be wildly different), a correlation coefficient between the two raters was calculated using the Excel
Pearson function (in the case of the points given by the two assessors for every criterion for every
candidate they both assessed). (see Table
If the coefficient is higher than 0.8, the assessors do not need special monitoring (Council of
Europe, 2001, p. 79). In case it is lower than 0.8, different measures should be taken, according to the
value of the coefficient.
was checked through both qualitative and quantitative analyses. Raters’ comments from the blank grids
(assessment forms) on one criterion and for one score (e.g. 5 points) were analyzed and checked for
consistency (if the rater uses the same descriptors to judge all the performances he rated with 5 points,
for example), then the same was done with the other criteria. If the comments are consistent, the
assessor does not need special monitoring. If, however, the comments differ or contradict each other, a
range of measures are applicable (feedback sessions, retraining, exclusion from the team, etc.). Also,
the performance of raters was monitored by calculating the standard deviation (Excel, STDEV.S
function). This, again, was calculated for each criterion (See an example below, in Table
Then, severity and leniency in the case of each criterion were taken into consideration by checking
if there was one criterion where the assessors had the tendency to give more or fewer points (by
comparing the number of candidates who got one score for each criterion – Table
appeared to be too lenient, too severe or those who show inconsistency in marking or in applying the
criteria will be sent to another training session before rating again.
Results, discussion and recommendations
As described above, several aspects were taken into account when illustrating the performance of
raters: the types and frequency of errors they made and the extent to which they agreed with each other
and with themselves. The results demonstrate validity, accuracy and reliability, but they also draw
attention on several issues which should be the core of the following training sessions.
4.1.Listening, Reading, Elements of Communication Construction (ECC)
For the first three components of the test, only accuracy of marking could be calculated as they
include just items that can be objectively marked.
No error rate was problematic (= higher than 0.02) and there was one case when the error rate was 0
(= one rater who made no mistakes). However, 9 raters did make several mistakes each, so we consider
having each paper double-marked a good way of ensuring accuracy in the final score.
4.1.2.Types of errors
The error rates are not concerning but this does not mean that the errors the markers made cannot
reveal some relevant issues in the grading process. Very few errors were made when calculating the
final scores for each component or when checking the answers for the true/false, multiple choice or
matching items, so, in these cases, it was most probably a focusing problem that caused the very few
mistakes. Most of the errors were made when assigning points for the partial credit items (Listening
and ECC components). This means that either the marking schemes are not clear enough (= they don’t
cover all the possible situations) and should be adjusted or that markers were not well enough trained
and the following standardization workshops should focus on the process of marking partial credit
4.2.Speaking and Writing
In the case of Speaking and Writing, accuracy, inter- and intra-rater agreement, leniency and
severity were monitored.
Generally, raters used the descriptors from the assessment grids (or references to them) in their
comments justifying the points awarded for each criterion. When raters’ comments for one criterion
mention some aspects that one cannot find in the descriptions, they usually don’t seem to have any
effect on the score for that criterion, as they are in all cases complemented by other comments in direct
relation to the descriptors from the assessment grids. However, there seem to be some features from the
descriptors in the Department grids which are preferred by the raters in the case of each criterion. For
example, when describing
complex or simple grammatical structures and vocabulary range is often minimized, although this is
also an important aspect of the criterion. When commenting upon
seem to focus on the extent to which students covered/accomplished the tasks, downplaying other
aspects mentioned in the descriptors: how the communicative functions were expressed, problems
concerning style and register, etc. Regarding
to the frequency and complexity of the connecting words students use and to the length and frequency
of pauses they make. In the case of writing, the criterion is called
comments on layout, connecting words and special formulaic language students use. Very few
comments regarding cohesive devices are to be found. When judging
to the grids in what concerns both writing and speaking. They refer to the types and frequency of
errors, to the way errors affect the message and almost all raters write down examples of errors
extracted from the students’ performances.
Overall, we can conclude that all raters seem to use the assessment grids correctly. Nevertheless,
there are aspects which are downplayed in the case of each criterion. Therefore, we consider that this
represents an issue that should be discussed during the following standardization workshops which, in
our opinion, should be mandatory before each session of examination.
The correlation coefficient was calculated for each pair of raters and for each criterion. Only in the
case of rating
1 out of 5 pairs in speaking (0.69) and for 2 out of 5 pairs in writing (0.60 and 0.79)). This could
indicate the fact that the descriptors for these two criteria (
writing) are not concrete enough or one or both raters from each of the pairs in question need more
training. Considering the small number of cases where the coefficient is lower than allowed and the
fact that in all these cases the coefficient is not so far from the limit (0.8), we can conclude that raters
agree with each other to a high extent.
Each rater was monitored for internal consistency by analyzing and comparing his/her comments
when assigning a number of points for each criterion. In most of the cases the comments were
consistent: the raters used more or less the same words and took into account the same aspects when
assigning one score to various performances. There were some isolated cases when the rater had the
same comments for a script he/she rated with 4 points and for another one he rated with 5 points, for
example, but as these cases were exceptional, they were not considered relevant for the overall
performance of the rater.
4.2.4.Leniency and severity
In order to monitor the raters for leniency and severity in assigning scores for the writing
component, the method of pre-rated scripts was used. Each rater was given 4 pre-rated scripts in the
pile of scripts he was supposed to grade. The correlation coefficient (between the number of points
assigned by the chief examiner during the pre-rating task and that assigned by the first rater during the
rating task) was in all cases higher than 0.78 for the following criteria: complexity and
scores for all four papers) and two raters proved to be too lenient with respect to the organization
criterion (they systematically assigned higher scores than the chief examiner). These results were
further confirmed by the other method of monitoring raters for leniency and severity: checking if there
is one criterion where the assessors have the tendency to give more or fewer points (comparing the
number of candidates who got one score for each criterion). Evidence was found that the same three
raters had problems when applying the grid for the two criteria (
them was too severe (only one script out of 12 received the maximum score for grammatical
and 3 out of 12 were rated with 4 points) and two of them were too lenient (they both assigned 5 and 4
The results of the study demonstrate that the assessors rate accurately, they are consistent with
themselves and they agree with other raters to a great extent, they are not too lenient and not too severe.
However, the analysis revealed some aspects that could be revised in the grading phase:
the marking scheme should be adjusted with respect to partial credit items, as some errors occur
when rating this kind of items;even if raters applied the grid correctly, it seemed that some of the aspects mentioned in the gridwere downplayed by the assessors – the following training sessions should focus on raisingraters’ awareness regarding these aspects (e.g. the
Limitations and further research
As a high-stakes examination, it is important that RFL, level B2 exam continues to demonstrate
validity and reliability in all aspects including or especially in what concerns test scores. We believe
our small study contributes in some way to the validity argument supporting the use of RFL B2
examinations as means of assessing the communicative competences of those who want to register to
any academic program taught in Romanian in any university in our country. Also, we believe that the
methods and procedures presented here could raise awareness (in what concerns other examination
agencies) regarding the importance of monitoring raters in ensuring accurate, valid and reliable test
However, the study should be regarded as a point of departure and the findings are intended to be
representative only for one session of examination. Evidently, it should benefit from further analysis –
more raters and more sessions of examinations should be observed, comparisons between raters’
performances should be made and the evolution/involution of each rater should be analyzed. Also,
verbal protocol analysis (Bachman, 1990) could be of great use in the case of rating scripts (raters
could be asked to record their thoughts while rating the scripts – the analysis of transcripts could lead
to interesting and relevant results).
- ALTE Minimum standards for establishing quality profiles in ALTE examinations (2007). Available online at: http://www.alte.org/attachments/files/minimum_standards.pdf American Educational Research Association (AERA), American Psychological Association (APA) and National Council on Measurement in Education (NCME) (1985). Standards for Educational and Psychological Testing. Washington, DC: AERA.
- Bachman, L. (1990). Fundamental Considerations in Language Testing. Oxford: Oxford University Press. Council of Europe (2001). Common European Framework of Reference for Languages: Learning, Teaching, Assessment. Cambridge: Cambridge University Press. Available online at: http://www.coe.int/t/dg4/linguistic/Source/Framework_EN.pdf Fulcher, G., Davidson, F. (2007). Language Testing and Assessment. An advanced resource book. In Candlin, C., Carter, R. (eds.), Routledge Applied Linguistics, Routledge.
- Green, A. (1998). Verbal protocol analysis in language testing research. Cambridge: UCLES/Cambridge University Press.
- http://www.alte.org/setting_standards http://www.alte.org/setting_standards/the_alte_q_mark Hughes, A. (1989). Testing for Language Teachers, 1st ed. Cambridge: Cambridge University Press.
- Manual for Language test development and examining for use with the CEFR – produced by ALTE on behalf of Available online at: the Language Policy Unit, Council of Europe (2011).
- http://www.coe.int/t/dg4/linguistic/manuel1_en.asp Messik, S. (1989). ‘Validity.’ In Linn, R. L. (ed.), Educational Measurement. New York: Macmillan/ American Council on Education, 13–103.
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this paper as:
Click here to view the available options for cite this article.
VolumeEpSBS / Volume 15 - WLC 2016