Two approaches to compilation of bilingual multi-word terminology lists from lexical resources

Branislava Šandrih; Cvetana Krstev; Ranka Stanković

doi:10.1017/S1351324919000615

Published online by Cambridge University Press: 28 January 2020

Branislava Šandrih*: Faculty of Philology, University of Belgrade, Belgrade, Serbia
Cvetana Krstev: Faculty of Philology, University of Belgrade, Belgrade, Serbia
Ranka Stanković: Faculty of Mining and Geology, University of Belgrade, Belgrade, Serbia
*Corresponding author. Email: branislava.sandrih@fil.bg.ac.rs

Abstract

In this paper, we present two approaches, and the system that implements them, for bilingual terminology extraction. Both rely on an aligned bilingual domain corpus, a terminology extractor for the target language, and a tool for chunk alignment. The two approaches differ in how terminology for the source language is obtained: the first relies on an existing domain terminology lexicon, while the second uses a term extraction tool. For both approaches, four experiments were performed with two parameters being varied. In the experiments presented in this paper, the source language was English, the target language was Serbian, and the selected domain was Library and Information Science, for which both an aligned corpus and a bilingual terminological dictionary exist. For term extraction, we used the FlexiTerm tool for the source language and a shallow parser for the target language, while for word alignment we used GIZA++. The evaluation results show that the F1 score varies from 29.43% to 51.15% for the first approach and from 61.03% to 71.03% for the second. On the basis of the evaluation results, we developed a binary classifier that decides whether a candidate pair, composed of aligned source and target terms, is valid. We trained and evaluated different classifiers on a list of manually labeled candidate pairs produced by our extraction system. The best results in a fivefold cross-validation setting were achieved with the Radial Basis Function Support Vector Machine classifier, giving an F1 score of 82.09% and an accuracy of 78.49%.
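As an illustration of how F1 scores such as those reported above can be computed for extracted term pairs, the following minimal sketch compares a list of candidate pairs against a gold-standard list. It is not the authors' evaluation procedure: the exact-match criterion after lowercasing and whitespace normalization, the function names, and the English-Serbian example pairs are all assumptions made for this example.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]  # (source term, target term)


def normalize(pair: Pair) -> Pair:
    """Lowercase both terms and collapse whitespace (an assumed matching rule)."""
    src, tgt = pair
    return (" ".join(src.lower().split()), " ".join(tgt.lower().split()))


def precision_recall_f1(candidates: Iterable[Pair],
                        gold: Iterable[Pair]) -> Tuple[float, float, float]:
    """Score candidate term pairs against a gold list using exact matching."""
    cand: Set[Pair] = {normalize(p) for p in candidates}
    ref: Set[Pair] = {normalize(p) for p in gold}
    true_positives = len(cand & ref)
    precision = true_positives / len(cand) if cand else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical pairs, invented for illustration only (not taken from the paper).
gold_pairs = [
    ("digital library", "digitalna biblioteka"),
    ("information retrieval", "pretraživanje informacija"),
    ("library catalogue", "bibliotečki katalog"),
    ("subject heading", "predmetna odrednica"),
]
candidate_pairs = [
    ("Digital  Library", "digitalna biblioteka"),        # matches after normalization
    ("information retrieval", "pretraživanje informacija"),
    ("library catalogue", "katalog"),                    # wrong target term
]

precision, recall, f1 = precision_recall_f1(candidate_pairs, gold_pairs)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

With two of three candidates correct and two of four gold pairs found, precision is 2/3, recall is 1/2, and F1 is 4/7. A real evaluation for a morphologically rich language such as Serbian would likely need lemmatization or a looser matching rule than the exact match assumed here.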

Keywords

Language resources; Machine translation; Terminology extraction; Text classification

Information

Type: Article
Information: Natural Language Engineering, Volume 26, Issue 4, July 2020, pp. 455-479

DOI: https://doi.org/10.1017/S1351324919000615
Copyright: © Cambridge University Press 2020

Footnotes

†

This research was supported by the Serbian Ministry of Education and Science under grants #III 47003 and 178006.

