EuSQuAD: Automatically Translated and Aligned SQuAD2.0 for Basque

  1. García-Pablos, Aitor
  2. Perez, Naiara
  3. Cuadros, Montse
  4. Bengoetxea, Jaione
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2024

Issue: 73

Pages: 125-137

Type: Article


Abstract

The wide availability of English question answering datasets has greatly facilitated progress in Natural Language Processing (NLP). However, the scarcity of such resources for minority languages, such as Basque, poses a substantial challenge for these communities. In this context, translating and aligning datasets plays a crucial role in narrowing this technological gap. This work presents EuSQuAD, the first initiative dedicated to automatically translating and aligning SQuAD2.0 into Basque. We demonstrate the value of EuSQuAD through an extensive qualitative analysis and QA experiments, for which a new human-annotated dataset has also been created.
