IMPROVING UZBEK MACHINE TRANSLATION THROUGH PARALLEL CORPORA: CHALLENGES AND SOLUTIONS
https://doi.org/10.5281/zenodo.15590463
Keywords
Corpus, corpus linguistics, parallel corpus, translation corpus, comparable corpus, segmentation, machine translation
Abstract
This paper examines the role of parallel corpora in modern translation studies, focusing on how they improve machine translation systems for the Uzbek language. Parallel corpora consist of texts in two or more languages aligned at the sentence or paragraph level, and they are essential for training neural machine translation systems. The paper outlines the main challenges in building high-quality parallel corpora for underrepresented languages such as Uzbek: scarce available resources, contextual mismatches between source and target texts, segmentation and alignment errors, and copyright restrictions. It then discusses several solutions to these problems, including building open-access databases, leveraging existing machine translation systems, using modern alignment tools, and crowdsourcing. It also considers the future potential of parallel corpora for raising translation quality, supporting linguistic research, and promoting the global recognition of the Uzbek language. The paper concludes that parallel corpora are both a scientific resource and a technological tool that bridges the gap between human translators and machine translation systems.
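To make the notions of sentence-level alignment and alignment errors concrete, the following is a minimal sketch in Python. It is illustrative only and is not taken from the paper: the `SentencePair` structure, the `length_ratio` and `filter_pairs` helpers, the threshold of 2.0, and the sample sentences are all assumptions. It represents a parallel corpus as a list of aligned Uzbek-English sentence pairs and flags pairs whose character lengths differ so much that a misalignment is likely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    """One aligned unit of a parallel corpus (hypothetical structure)."""
    source: str  # Uzbek sentence
    target: str  # English sentence


def length_ratio(pair: SentencePair) -> float:
    """Ratio of the longer side to the shorter side, measured in characters."""
    a, b = len(pair.source), len(pair.target)
    if min(a, b) == 0:
        # An empty side means the sentence has no counterpart at all.
        return float("inf")
    return max(a, b) / min(a, b)


def filter_pairs(pairs, max_ratio=2.0):
    """Split pairs into (kept, suspect) using a simple length-ratio heuristic.

    max_ratio is an assumed threshold; a real project would tune it
    on manually checked data.
    """
    kept, suspect = [], []
    for pair in pairs:
        (kept if length_ratio(pair) <= max_ratio else suspect).append(pair)
    return kept, suspect


# Invented sample data: the second and third pairs are deliberately misaligned.
corpus = [
    SentencePair("Men kitob o'qiyapman.", "I am reading a book."),
    SentencePair("Rahmat.", "Thank you very much for all of your help today."),
    SentencePair("", "This sentence has no counterpart."),
]

kept, suspect = filter_pairs(corpus)
```

A length-ratio check of this kind catches only the crudest alignment errors. Tools based on multilingual sentence embeddings compare meaning rather than length, which is closer to what the modern alignment tools mentioned above are generally understood to do.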