Fairness is an essential element in testing. This is as true for an original assessment as for one that has been adapted for use in a different linguistic and cultural context. In both cases, the successful development, administration and measurement of a testing instrument depends on principles such as the equitable treatment of test takers and valid test score interpretation, as well as an absence of bias.
Test items are considered biased when they favor the performance of one subgroup over another, irrespective of the assessment’s subject. Subgroups may be defined by age, gender, ethnicity, religion, social class, education, familiarity with technology, country, first language or test language, among other characteristics.
Such test items will perform differently for subgroup members and non-members who are otherwise identical in ability and achievement.
Differential item functioning describes the degree to which a test item measures the abilities of separate but similarly matched subgroups differently.
Various statistical methods may be used to detect differential item functioning, such as logistic regression, standardization, the Mantel-Haenszel approach and item response theory. These procedures compare test takers from different subgroups who have approximately the same abilities, typically by matching them on total test score or on an estimated ability level.
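To make the matching idea concrete, here is a minimal sketch of the Mantel-Haenszel approach in Python. It assumes each response is recorded as a (group, matching score, correct) tuple, with test takers grouped into strata by matching score. The function names, the "ref" and "focal" labels and the toy data are illustrative assumptions, not part of any particular testing package. The sketch computes the Mantel-Haenszel common odds ratio and converts it to the delta scale often reported as MH D-DIF (−2.35 times the natural log of the odds ratio).

```python
import math
from collections import defaultdict

def mantel_haenszel_odds_ratio(responses):
    """Common odds ratio for one item across matched ability strata.

    responses: iterable of (group, matching_score, correct) tuples, where
    group is "ref" or "focal" and correct is 1 (right) or 0 (wrong).
    """
    # One 2x2 table per matching score:
    # [ref correct, ref incorrect, focal correct, focal incorrect]
    tables = defaultdict(lambda: [0, 0, 0, 0])
    for group, score, correct in responses:
        idx = (0 if correct else 1) + (0 if group == "ref" else 2)
        tables[score][idx] += 1

    num = den = 0.0
    for a, b, c, d in tables.values():
        n = a + b + c + d
        num += a * d / n  # ref correct x focal incorrect
        den += b * c / n  # ref incorrect x focal correct
    if den == 0:
        raise ValueError("odds ratio undefined for these data")
    return num / den

def mh_d_dif(alpha):
    """Convert the odds ratio to the delta scale (negative favors the reference group)."""
    return -2.35 * math.log(alpha)

def make_stratum(score, ref_right, ref_wrong, focal_right, focal_wrong):
    """Helper that builds toy response tuples for one matching score (illustrative only)."""
    return ([("ref", score, 1)] * ref_right + [("ref", score, 0)] * ref_wrong
            + [("focal", score, 1)] * focal_right + [("focal", score, 0)] * focal_wrong)

# Toy data: two ability strata in which the reference group does better on the item.
toy = make_stratum(1, 6, 4, 4, 6) + make_stratum(2, 8, 2, 6, 4)
alpha = mantel_haenszel_odds_ratio(toy)
print(round(alpha, 3), round(mh_d_dif(alpha), 3))
```

An odds ratio near 1 (a delta near 0) suggests the item behaves similarly for matched test takers in both groups. Values further from 1 flag the item for the kind of follow-up review described in this article. A real analysis would also include a significance test and enough test takers in each stratum.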
Item response theory is currently one of the most widely used methods for measuring differential item functioning in test adaptations. However, it requires a relatively large sample size.
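As a rough illustration of how item response theory can express differential item functioning, the sketch below compares an item's characteristic curve for two groups under a two-parameter logistic model and approximates the unsigned area between the curves. The parameter values are hypothetical. In practice they would be estimated separately for each group and placed on a common scale, which is where the large sample size comes in. The function names are assumptions made for this example.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of a correct response at ability theta (2PL model).

    a is the item's discrimination and b its difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def unsigned_area(ref, focal, lo=-6.0, hi=6.0, steps=1200):
    """Approximate the area between two item characteristic curves.

    ref and focal are (a, b) parameter pairs; the integral over the ability
    range [lo, hi] is computed with the trapezoid rule.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        theta = lo + i * h
        gap = abs(icc_2pl(theta, *ref) - icc_2pl(theta, *focal))
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * gap
    return total * h

# Hypothetical parameters: same discrimination, but the item is harder for the focal group.
reference_params = (1.0, 0.0)
focal_params = (1.0, 0.5)
print(round(unsigned_area(reference_params, focal_params), 3))
```

When the two curves coincide, the area is zero. A larger area means that test takers of the same ability have different chances of answering the item correctly depending on their group.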
Test items that are statistically flagged for differential item functioning are not necessarily biased. However, these items need to be investigated to determine the underlying cause. The investigation should include both quantitative and qualitative analyses.
If the differential item functioning is a result of previously unattributed group differences, for example differences in real abilities, then the test item is often retained in the testing instrument.
By contrast, if the differential item functioning is a result of language choices that give one subgroup an advantage over another, or if the item is found to measure something other than what was intended, then the test item is considered biased and is removed from future versions of the testing instrument.
Responsive Translation specializes in the translation, adaptation, validation and review of high-stakes testing instruments.
If you’d like to find out more about our services and how we can help your organization, please get in touch at 646-847-3309.