Optimizing Product Matching in E-Commerce with DOC2VEC: Leveraging Hierarchical Clustering Parameters Based on Product Titles
Abstract
Information technology plays a pivotal role in making online retail more efficient and effective, particularly in product matching. This research addresses the main challenges of product matching in e-commerce: product titles are diverse and ambiguous, and new products enter the market at a rapid pace. As a solution, we implement a neural network-based approach. The main contribution of this research is the implementation and validation of the Doc2Vec method for matching e-commerce products. In addition, the study identifies the optimal parameter combination for Hierarchical Clustering, tested and validated on 4,000 product titles. The data for training and evaluation come from an online retail platform and comprise 34,000 product names from various sectors. The research compares two Doc2Vec architectures for extracting features from product titles and combines each with Hierarchical Clustering to group similar products. The results indicate that the Doc2Vec model with the DBOW (Distributed Bag of Words) architecture yields a higher average NMI (Normalized Mutual Information) score than the DM (Distributed Memory) architecture.
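Since the comparison above rests on the NMI score, a minimal sketch of how such a score can be computed from ground-truth product groups and predicted cluster labels may help. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which normalization was used, so the arithmetic mean of the two entropies is assumed here, and the function names (entropy, mutual_information, nmi) and the toy labels are hypothetical.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(labels_true, labels_pred):
    """Mutual information (in nats) between two label assignments."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    mi = 0.0
    for (t, p), c in joint.items():
        # p(t, p) * log( p(t, p) / (p(t) * p(p)) ), written with raw counts
        mi += (c / n) * log(c * n / (count_true[t] * count_pred[p]))
    return mi


def nmi(labels_true, labels_pred):
    """NMI normalized by the arithmetic mean of the two entropies.

    Assumption: arithmetic-mean normalization; other variants divide by
    the minimum, maximum, or geometric mean of the entropies instead.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    denom = (entropy(labels_true) + entropy(labels_pred)) / 2
    if denom == 0:
        # Both assignments put every item in one group: identical partitions.
        return 1.0
    return mutual_information(labels_true, labels_pred) / denom


# Toy example: six product titles belonging to three true product groups.
true_groups = [0, 0, 1, 1, 2, 2]
same_partition = [5, 5, 3, 3, 9, 9]    # same grouping, different cluster ids
coarse_partition = [0, 0, 0, 0, 1, 1]  # two true groups merged into one

score_same = nmi(true_groups, same_partition)
score_coarse = nmi(true_groups, coarse_partition)
```

NMI depends only on how items are grouped, not on the cluster ids themselves, so same_partition scores (up to floating-point rounding) 1, while coarse_partition scores lower because it merges two true groups. In a setup like the one described, the predicted labels would come from Hierarchical Clustering applied to the Doc2Vec vectors of the product titles.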
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.