Mislabeled Instances in Aya Datasets
Automated identification of label issues in the Aya instruction and red-teaming datasets using language detection, perplexity analysis, and LLM-as-a-Judge techniques.
We propose an effective pipeline for flagging potentially mislabeled samples in Aya, a massively multilingual dataset.
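As an illustration of the language-detection stage only, the sketch below flags samples whose detected language disagrees with the declared label. It is a minimal sketch under stated assumptions, not the pipeline's actual implementation: the `Sample` record, the `flag_language_mismatches` helper, and the stopword-based `toy_detect` function are all hypothetical, and a real system would presumably plug a trained language-identification model into the `detect` parameter.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Sample:
    text: str
    language: str  # declared language label (ISO 639-1 code is an assumption)


# Toy stopword lists: a stand-in for a real language-identification model.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in"},
    "es": {"el", "la", "y", "es", "de", "en", "que"},
    "fr": {"le", "la", "et", "est", "de", "les", "un"},
}


def toy_detect(text: str) -> Optional[str]:
    """Return the language whose stopwords occur most often, or None if none occur."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def flag_language_mismatches(
    samples: List[Sample],
    detect: Callable[[str], Optional[str]] = toy_detect,
) -> List[Tuple[Sample, str]]:
    """Return (sample, detected_language) pairs where detection contradicts the label."""
    flagged = []
    for sample in samples:
        detected = detect(sample.text)
        # Skip samples the detector cannot classify rather than flagging them.
        if detected is not None and detected != sample.language:
            flagged.append((sample, detected))
    return flagged
```

Flagged samples are candidates for review rather than confirmed errors; in a pipeline like the one described, they could then be passed to the perplexity and LLM-as-a-Judge stages.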