A new weighting scheme for document ranking based on the modified word-embedding method
Abstract
Finding documents that are relevant to a search query or similar to a given document is a central task of information retrieval. The vector space model underlies fundamental techniques such as the bag-of-words model and the TF-IDF model, which are the main strategies for measuring document similarity. Another way to produce a document vector is to combine word vectors. Thanks to recent advances in distributed representations of meaning, word vectors can be learned from large volumes of unlabeled text, primarily through artificial neural network (ANN)-based methods. These methods build a semantic space from the text, and word-embedding vectors represent words in that space. The present study examines various approaches for transforming word-embedding vectors into document vectors and proposes a new one. Ad-hoc retrieval is one of the information retrieval tasks that employs these techniques. In this research, mean average precision (MAP) and normalized discounted cumulative gain (NDCG) are used to evaluate the proposed algorithm and to compare it with the other approaches. The findings show that the proposed TAW-TFIDF method outperforms the alternative weighting methods.
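To make the general idea concrete, the following minimal sketch shows one common way to turn word vectors into document vectors, a TF-IDF weighted average, and then rank documents by cosine similarity and score the ranking with average precision and NDCG. It is not the TAW-TFIDF scheme, whose definition is not given in this abstract. The toy embeddings, the smoothed IDF formula, the linear NDCG gain, and all function names are assumptions made only for illustration.

```python
import math
from collections import Counter

# Hypothetical 2-dimensional embeddings, invented for this illustration.
EMB = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0], "road": [0.1, 0.9]}
DOCS = [["cat", "dog"], ["car", "road"], ["cat", "car"]]

def idf_table(docs):
    """Smoothed inverse document frequency (an assumed, common variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def doc_vector(tokens, emb, idf):
    """TF-IDF weighted average of the embeddings of in-vocabulary tokens."""
    tf = Counter(t for t in tokens if t in emb)
    dim = len(next(iter(emb.values())))
    total, weight_sum = [0.0] * dim, 0.0
    for term, count in tf.items():
        w = count * idf.get(term, 1.0)
        weight_sum += w
        for i, x in enumerate(emb[term]):
            total[i] += w * x
    return [x / weight_sum for x in total] if weight_sum else total

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs, emb):
    """Return document indices ordered by decreasing cosine similarity."""
    idf = idf_table(docs)
    q = doc_vector(query, emb, idf)
    scores = [(cosine(q, doc_vector(d, emb, idf)), i) for i, d in enumerate(docs)]
    return [i for _, i in sorted(scores, key=lambda s: (-s[0], s[1]))]

def average_precision(ranking, relevant):
    """Average precision for one query; MAP is its mean over all queries."""
    hits, total = 0, 0.0
    for k, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0

def ndcg(ranking, gains, k=None):
    """NDCG with linear gain (assumed) and a log2 position discount."""
    k = k or len(ranking)
    dcg = sum(gains.get(d, 0) / math.log2(i + 1)
              for i, d in enumerate(ranking[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg else 0.0

ranking = rank(["cat"], DOCS, EMB)
print(ranking, average_precision(ranking, {0, 2}), ndcg(ranking, {0: 2, 2: 1}))
```

Swapping the weight `w` in `doc_vector` for a different term weight is the point at which alternative weighting schemes would differ in a setup like this; the rest of the pipeline stays the same.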
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.