Approaches to assessing the semantic similarity of texts in a multilingual space

Khakimova Aida; Charnine Michael; Klokov Aleksey; Sokolov Evgenii

Section: 3. SOCIOECONOMIC TECHNOLOGIES

Proceedings: CPT2020 THE 8TH INTERNATIONAL SCIENTIFIC CONFERENCE ON COMPUTING IN PHYSICS AND TECHNOLOGY PROCEEDINGS Volume 1

UDC 81 Linguistics. Language studies. Languages
BISAC LAN016000 Linguistics / Semantics

Khakimova Aida ¹

Charnine Michael ²

Klokov Aleksey ³

Sokolov Evgenii ⁴

Author and publication information

Authors:

1. Kama Institute; ANO “Research Center of Physical and Engineering Informatics” (Associate Professor; Leading Research Assistant), Naberezhnye Chelny, Kazan, Russian Federation

2. FRC CSC of the Russian Academy of Sciences, Russian Federation

3. Moscow Institute of Physics and Technology

4. FRC CSC of the Russian Academy of Sciences, Russian Federation

Type:

Conference article

DOI:

https://doi.org/10.30987/conferencearticle_5fce2773b1aff6.26436513

Published:

07.12.2020

Language:

English

Keywords:

cross-lingual semantic similarity, semantic textual similarity measure, semantic implicit links, collection of documents, measure of similarity of texts, method of relevant phrases, vector representations for words

Abstract and keywords

Abstract (English):
This paper develops a methodology for evaluating the semantic similarity of arbitrary texts in different languages. The study is based on the hypothesis that the proximity of vector representations of terms in a semantic space can be interpreted as semantic similarity in a cross-lingual environment. Each text is associated with a vector in a single multilingual semantic vector space, and the semantic similarity of two texts is determined by the proximity of the corresponding vectors. We propose a quantitative indicator, the Index of Semantic Textual Similarity (ISTS), which measures the degree of semantic similarity of multilingual texts on the basis of identified cross-lingual implicit semantic links. The parameters are set according to their correlation with the presence of a formal reference between documents. The measure of semantic similarity reflects the presence of two common terms, phrases, or word combinations. The optimal parameters of the algorithm for identifying implicit links are selected on a thematic collection by maximizing the correlation between explicit and implicit links. The developed algorithm can facilitate the search for semantically close documents in the analysis of multilingual patent documentation.
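
The two ideas in the abstract (similarity as proximity of text vectors in a shared multilingual space, and parameter selection by maximizing the correlation between implicit and explicit links) can be illustrated with a minimal sketch. This is not the authors' ISTS implementation. The toy embedding values, the averaging of word vectors, the use of cosine similarity and Pearson correlation, and all function names (`text_vector`, `text_similarity`, `best_threshold`) are assumptions made for illustration only.

```python
import math

# Hypothetical toy vectors in a shared multilingual space (assumed values,
# not from the paper); a real system would load pretrained aligned embeddings.
EMBEDDINGS = {
    "patent": [0.90, 0.10, 0.00],
    "патент": [0.88, 0.12, 0.02],
    "search": [0.10, 0.90, 0.00],
    "поиск": [0.12, 0.85, 0.05],
    "music": [0.00, 0.10, 0.95],
}


def text_vector(tokens):
    """Represent a text as the mean of its known token vectors (an assumption)."""
    vecs = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vecs:
        raise ValueError("no token of the text has an embedding")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def text_similarity(tokens_a, tokens_b):
    """Similarity of two texts as the proximity of their text vectors."""
    return cosine(text_vector(tokens_a), text_vector(tokens_b))


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def best_threshold(similarities, explicit_links, candidates):
    """Pick the similarity threshold whose implicit links (similarity >= threshold)
    correlate best with explicit links (1 = formal reference exists, 0 = none)."""
    def score(threshold):
        implicit = [1 if s >= threshold else 0 for s in similarities]
        return pearson(implicit, explicit_links)
    return max(candidates, key=score)


if __name__ == "__main__":
    # An English and a Russian text on the same topic versus an unrelated text.
    print(text_similarity(["patent", "search"], ["патент", "поиск"]))
    print(text_similarity(["patent", "search"], ["music"]))
    # Document pairs with known explicit links, used to tune the threshold.
    print(best_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0], [0.1, 0.5, 0.85]))
```

In this sketch the single tuned parameter is a similarity threshold; the paper's algorithm may tune different or additional parameters.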
