Analysing Placement Tests
How accurate is the placement test at your school? How do you know? Can the data gathered through the placement test be used in other ways? If you’re unsure about the answers to these questions, perhaps it’s time to analyse your placement test! I recently undertook some research of my own on our archaic placement test. Here’s what I did, and why I’d recommend you do the same.
The placement test at my school has been used since time immemorial. The school directors have hinted for a while that perhaps a change is needed, but there is some reluctance among teaching staff and management. I set about establishing whether this change was necessary by assessing the test’s validity. This gave me a great opportunity to implement some of the theory I’d encountered during a language testing module on the Trinity DipTESOL.
Our current placement test
We use a 90-question grammar test of increasing difficulty. Questions are in multiple-choice cloze format (or simply multiple choice), each with four possible answers. The score thresholds for each level of ability are as follows:
Questions 1 – 30 = Elementary level (a score within this range places a student in an Elementary level class)
Questions 31 – 55 = Pre-intermediate level
Questions 56 – 73 = Intermediate level
Questions 74 – 86 = Upper-intermediate level
Questions 87 – 90 = Advanced level
Students are given approximately 30 minutes to complete the test. Classes are organised based solely on overall test scores.
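For illustration, the banding above can be sketched as a simple placement function. The score boundaries come from the test itself; the function and level names are my own:

```python
# Sketch of level placement from a raw score, using the score bands above.
def place_student(score: int) -> str:
    """Map an overall test score (0-90) to a class level."""
    bands = [
        (30, "Elementary"),
        (55, "Pre-intermediate"),
        (73, "Intermediate"),
        (86, "Upper-intermediate"),
        (90, "Advanced"),
    ]
    for upper, level in bands:
        if score <= upper:
            return level
    raise ValueError("score out of range (expected 0-90)")

print(place_student(40))  # Pre-intermediate
print(place_student(88))  # Advanced
```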
Who takes the test?
Our school runs short residential immersion courses (normally 1–2 weeks) for Young Learner groups visiting the UK. Course content normally involves practising functional language to use around town, exploring British culture, and learning about local tourist attractions in preparation for group visits. Groups are accompanied by a Leader (normally their English teacher from their home country). Schools often return on an annual basis with a different group of learners. Some groups stay for a longer duration of 4–8 weeks, but this is rare. During summer, the same test is used for the YL summer school, which includes both groups and individual students.
Current test pros and cons
Here is a brief summary of attitudes towards our current test:

| Pros | Cons |
| --- | --- |
| It’s quick to administer and mark | It’s grammar-focused, yet courses rarely involve explicit grammar teaching |
| Group leaders are familiar with the test and the test procedure | It covers very few skills and lacks an accompanying speaking test |
| It appears to be relatively accurate | The potential for the test to inform practice is not being realised |
| If it ain’t broke, don’t fix it | The test hasn’t been changed for decades |
Why analyse the test?
In practice, our placement test seems to be effective. Our Director of Studies is familiar with almost all the groups tested (as they are annual visitors) and can normally predict learner levels prior to arrival based on past experience. The test serves merely as a rough guide, confirming the DOS’s assumptions. After the first few lessons with a new group, teachers feed back to the DOS on whether any level changes should occur (which would suggest the placement test was inaccurate). Level changes very rarely happen, and when they do it is normally due to limitations in skills such as speaking, which are not assessed in the test.
The fact that the DOS can predict group levels with such accuracy makes it all the more interesting to assess the effectiveness of the test. It clearly provides guidance for level placement and does so reliably, given that very few students are deemed to have been placed in the wrong level after the test. However, does the test actually discriminate well between stronger and weaker students, or could the DOS do an equally good job without it, based merely on his own experience?
How was the test analysed?
An important point to note about the test is that it should increase in difficulty. Hence, you could assume that a student scoring, say, 40 marks (Pre-intermediate level) would be likely to have most of their correct answers within questions 1 – 55, which target Elementary and Pre-intermediate students. There’s always the chance of outliers, especially from students guessing answers, but if you analysed a large number of test papers together, it would be fair to assume that a test like this (if accurately graded for difficulty) would reveal such patterns.
To assess the relative difficulty of our test I took a sample of test papers from 243 students. The nationality of respondents depended on the tests available during a 3-week period: 55% of students were Italian, 14% Austrian, 13% Spanish, 10% Russian and 8% Thai. Overall test scores showed that of all respondents, 17 were predicted to be Elementary level, 85 Pre-intermediate, 93 Intermediate and 47 Upper-intermediate.
One way to test the difficulty of multiple-choice test items is to look at each item’s Facility Index (FI). The FI is calculated by dividing the number of correct answers for a test item by the number of students taking the test. So, if 100 students take our placement test and 84 students get question 10 correct, the Facility Index for question 10 is 0.84. If a test is correctly graded for difficulty, the FI of items should decrease as the test progresses. If items towards the end of a test consistently score a high FI, one could assume that the test is too easy for the learners. This would be a surprise given that around 40% of the tests analysed achieved a score at Pre-intermediate level or below.
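As a minimal sketch (using invented answer data, not our actual papers), the FI calculation looks like this:

```python
# Facility Index: share of test-takers answering an item correctly.
# responses[s][i] is True if student s answered item i correctly.
# The data below is invented for illustration.
def facility_index(responses, item):
    correct = sum(1 for student in responses if student[item])
    return correct / len(responses)

responses = [
    [True, True, False],
    [True, False, False],
    [True, True, True],
    [True, False, False],
]
print(facility_index(responses, 0))  # 1.0 - everyone got item 0 right
print(facility_index(responses, 2))  # 0.25 - only one student did
```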
I calculated the Facility Index for each test item based on the 243 test papers. I then produced an amended order of difficulty for the items based on these results. Finally, I calculated the difference between the actual order and the amended order. Results for the first ten items of the test showed the following:
| Item number | Amended item number based on Facility Index | Accuracy of order based on Facility Index |
| --- | --- | --- |
Results clearly reveal discrepancies in the test order. In the first 10 questions alone, 3 items appeared 25 places earlier than they should have if the test were truly graded for difficulty. You could argue that perhaps these items were difficult for one particular nationality, although 2 of the 3 items showed similar discrepancies when results were analysed by nationality.
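The amended ordering itself is straightforward to compute: sort items by FI, easiest first, and measure how far each item sits from its original position. A sketch with made-up FI values:

```python
# Re-ordering items by Facility Index and measuring displacement from
# the original test order. FI values here are invented for illustration.
fis = {1: 0.95, 2: 0.60, 3: 0.90, 4: 0.85, 5: 0.40}

# Easiest first: the order a difficulty-graded test "should" follow.
amended = sorted(fis, key=fis.get, reverse=True)

for item in sorted(fis):
    position = amended.index(item) + 1
    print(f"item {item}: amended position {position}, "
          f"displacement {position - item:+d}")
```

A large positive or negative displacement flags an item that sits far from where its difficulty suggests it belongs.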
This doesn’t say too much, apart from showing that the order of the test was completely inaccurate. To establish whether items truly discriminated between stronger and weaker students, further analysis was undertaken. To establish the Discrimination Index (DI) of an item, the overall scores for each student are first ranked in order. The top 27% and the bottom 27% of results are grouped, and the number of correct answers for the item within the top and bottom groups is calculated. A quick equation is then applied:
(Correct answers from the top group – correct answers from the bottom group) / the number of participants in each group
DI scores should be positive, and as close to 1 as possible. For example, imagine there are 50 students in both the top and bottom groups. If 40 students in the top group get question 70 correct, and 3 in the bottom group get it correct, the calculation is (40 – 3) / 50 = 0.74. This is fairly close to 1, so you can reasonably assume that question 70 discriminates between strong and weak students.
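The same calculation in code, using invented scores and answers rather than our real data:

```python
# Discrimination Index (DI): how well an item separates the strongest
# overall scorers from the weakest. All data below is invented.
def discrimination_index(scores, item_correct, fraction=0.27):
    """scores: overall test score per student.
    item_correct: True/False per student for one item."""
    n = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda s: scores[s], reverse=True)
    top, bottom = ranked[:n], ranked[-n:]
    return (sum(item_correct[s] for s in top)
            - sum(item_correct[s] for s in bottom)) / n

scores = [88, 80, 75, 60, 55, 50, 42, 30, 25, 12]
item_correct = [True, True, True, False, True,
                False, False, False, False, False]
print(discrimination_index(scores, item_correct))  # 1.0
```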
It would be fair to assume that questions 74 – 86 would all have a fairly high Discrimination Index, as they should be pitched at Upper-intermediate learners. The table below shows that this is not the case.
| Item number | Discrimination Index |
| --- | --- |
The pattern was similar when analysing specific nationalities. The suggestion is that the test we use, supposedly graded for difficulty, is extremely inaccurate. There are possible explanations for such discrepancies. It’s hard to account for students guessing, and a lot of the distractor items for each question could be unsuitable. However, the test appears to be fairly ineffective in discriminating between learners’ grammar awareness.
The findings from this analysis gave us a few options:
- Change the test
The one we use seems useless. Perhaps a change in test format might lead to greater validity? At the very least, a digital version of the test for students to complete prior to arrival (albeit with unreliable test conditions) could free up classroom time.
- Scrap the test completely
The DOS manages to place students accurately based on his own experience of the groups that arrive. The results of the test are fairly arbitrary.
- Keep the test in its current format
Whether the test is valid or not is beside the point. It might seem strange (unprofessional even) for a language school to not administer a placement test. The fact that this test is so familiar to existing groups and the testing routine is so familiar to staff is one of the reasons why the test has remained in place for so long.
- Don’t treat it as a placement test, but a needs analysis
The test only assesses grammar comprehension and awareness. As mentioned, grammar is rarely taught explicitly at the school. However, there are groups of learners who arrive for extended residential stays at the centre (from 4–8 weeks), and individual students may stay for up to 6 weeks during summer school. The rationale behind longer courses is that students will have lessons that cover all four skills (reading, listening, speaking, writing), plus systems (grammar, vocabulary, phonology) and language functions. However, teachers have the freedom to approach these courses how they wish:
- There is no set syllabus for these courses
- There is no set course book for these courses, which might at least provide rough guidance on the grammar and vocabulary points to cover
- The school has a strong focus on helping teachers to develop. Teachers are thus encouraged to experiment in their own practice, and are given complete freedom to do so. This is seen by many staff as one of the school’s strengths.
- Project work is encouraged for extended residential groups.
Test analysis required a large amount of data input – 243 test papers, each with 90 questions. However, this took hardly any time once the spreadsheet was produced. Inputting the data from a group of 20 students takes about 30 minutes. Once inputted (and with the question sheet to hand), it’s easy to assess which questions a group did well on, and which they struggled with. Having the data on screen is far more efficient than skimming all 20 test papers for patterns. Without a syllabus in place, the test results prove invaluable for informing planning and ensuring that relevant grammar points are reviewed and taught. Whilst test results as a whole prove unreliable in discriminating between learners, the data may prove more useful on a group-by-group basis, especially if a group comes from the same school, is in the same year, and follows the same syllabus in their own country.
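For example, once a group’s answers are in a spreadsheet, flagging the items that fewer than half the group answered correctly takes only a few lines (the data and the 50% threshold here are invented for illustration):

```python
# One row per student, one True/False per item; invented data.
group = [
    [True, False, True, False],
    [True, False, True, True],
    [True, False, False, False],
]

def weak_items(rows, threshold=0.5):
    """Item numbers answered correctly by less than `threshold` of the group."""
    n = len(rows)
    return [i + 1 for i in range(len(rows[0]))
            if sum(r[i] for r in rows) / n < threshold]

print(weak_items(group))  # [2, 4] - items worth reviewing with this group
```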
In conclusion, whilst the placement test at my school lacks validity, it might still provide some useful data to inform planning. However, it’s down to both management and teachers to realise that this data could be of some use. If the same test continues to be administered, teachers of extended-stay groups should be encouraged to actively assess test results and build their grammar input sessions around them. Of course, it would help if we could rely on the data first…!
My research didn’t lead to an overhaul of placement test procedures at my school, but it did reveal a few surprises. What about your placement procedures – are you willing to put them to the test?!
Thanks to input from TLI Edinburgh on assessing multiple-choice tests.
Pete blogs at eltplanning.com