APPROACHES TO ASSESSING THE SEMANTIC SIMILARITY OF TEXTS IN A MULTILINGUAL SPACE
Abstract and keywords
Abstract (English):
This paper develops a methodology for evaluating the semantic similarity of arbitrary texts in different languages. The study is based on the hypothesis that the proximity of vector representations of terms in a semantic space can be interpreted as semantic similarity in a cross-lingual environment. Each text is associated with a vector in a single multilingual semantic vector space, and the semantic similarity of two texts is determined by the proximity of the corresponding vectors. We propose a quantitative indicator, the Index of Semantic Textual Similarity (ISTS), which measures the degree of semantic similarity of multilingual texts on the basis of identified implicit cross-lingual semantic links. The parameters are set according to the correlation with the presence of a formal reference between documents. The measure of semantic similarity expresses the existence of two common terms, phrases or word combinations. The optimal parameters of the algorithm for identifying implicit links are selected on a thematic collection by maximizing the correlation between explicit and implicit links. The developed algorithm can facilitate the search for similar documents in the analysis of multilingual patent documentation.
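
The idea of mapping each text to a vector in a shared multilingual space and comparing vectors can be illustrated with a minimal sketch. It is not the ISTS computation itself: the abstract does not specify how text vectors are built or which proximity measure is used, so averaging word vectors and cosine similarity are assumptions made here, and the embedding values and the names `EMBEDDINGS`, `text_vector`, `cosine` and `similarity` are hypothetical.

```python
import math

# Hypothetical toy embeddings in a shared multilingual space.
# The values are made up for illustration only.
EMBEDDINGS = {
    "patent": [0.90, 0.10, 0.00],
    "patente": [0.88, 0.12, 0.02],   # Spanish
    "search": [0.10, 0.90, 0.10],
    "búsqueda": [0.12, 0.85, 0.15],  # Spanish
    "weather": [0.00, 0.10, 0.95],
}


def text_vector(tokens):
    """Average the vectors of the known tokens (an assumed text representation).

    Returns None when no token has an embedding.
    """
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity(tokens_a, tokens_b):
    """Proximity of two texts, taken as the cosine of their text vectors."""
    vec_a = text_vector(tokens_a)
    vec_b = text_vector(tokens_b)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


if __name__ == "__main__":
    english = ["patent", "search"]
    spanish = ["patente", "búsqueda"]
    unrelated = ["weather"]
    print(similarity(english, spanish))
    print(similarity(english, unrelated))
```

With these toy values, the English and Spanish texts on the same topic score higher than the unrelated pair, which is the behaviour the hypothesis relies on. In practice the embeddings would come from a trained multilingual model rather than a hand-written table.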

Keywords:
cross-lingual semantic similarity, semantic textual similarity measure, semantic implicit links, collection of documents, measure of similarity of texts, method of relevant phrases, vector representations for words
