Narcotic-related tweet classification in Asia using sentence vector of word embedding with feature extension

Narongsak Chayangkoon
Anongnart Srivihok

Abstract

Asia currently faces a narcotic drug addiction problem. On social networking services such as Twitter, some drug-addicted users converse about behaviours related to narcotic drugs. This research proposes a new Narcotic-related Tweet Classification Model (NTCM) built on data preprocessing. Two new data preprocessing methods, Sentence Vector of Word Embedding (SVWE) and Sentence Vector of Word Embedding with Feature Extension (SWEF), are introduced to prepare data for the NTCM. The proposed preprocessing method reduces the dataset to produce an SVWE, with word embeddings generated by deep neural networks using the skip-gram model. The authors then extended the SVWE with additional features to produce a new dataset called SWEF; both datasets were used as input to the NTCM. The authors collected tweets from Twitter in Asia using keywords related to narcotic drugs and investigated text classification models based on a Support Vector Machine, Logistic Regression, a Decision Tree, and a Convolutional Neural Network. Logistic Regression with the SWEF provided the best approach for the NTCM compared with state-of-the-art methods. The proposed NTCM showed correctness and fitness, as measured by accuracy (0.8964), F-Measure (0.895), AUC (0.949), Kappa (0.7131), and MCC (0.714), and had a low running time (1.04 seconds).
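
The abstract does not say how a sentence vector is built from word embeddings or which features are appended. The sketch below is one plausible reading, not the authors' method: it assumes the sentence vector is the element-wise mean of a tweet's word vectors, and that feature extension appends a drug-keyword count and a token count. The toy embedding table, the keyword set, and the function names are all hypothetical.

```python
from typing import Dict, List, Sequence, Set

# Hypothetical 3-dimensional embeddings; in practice these would come
# from a trained skip-gram model rather than a hand-written table.
TOY_EMBEDDINGS: Dict[str, List[float]] = {
    "buy": [0.2, 0.1, 0.0],
    "meth": [0.9, 0.8, 0.1],
    "tonight": [0.1, 0.3, 0.5],
}

# Hypothetical keyword list standing in for a narcotic-slang lexicon.
DRUG_KEYWORDS: Set[str] = {"meth", "ice"}


def sentence_vector(tokens: Sequence[str],
                    embeddings: Dict[str, List[float]],
                    dim: int) -> List[float]:
    """Average the embeddings of in-vocabulary tokens (assumed pooling).

    Returns a zero vector when no token is in the vocabulary.
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def extend_features(vector: Sequence[float],
                    tokens: Sequence[str],
                    keywords: Set[str]) -> List[float]:
    """Append two assumed extra features: keyword hits and token count."""
    keyword_hits = sum(1 for t in tokens if t in keywords)
    return list(vector) + [float(keyword_hits), float(len(tokens))]


tokens = "buy meth tonight".split()
svwe = sentence_vector(tokens, TOY_EMBEDDINGS, dim=3)
swef = extend_features(svwe, tokens, DRUG_KEYWORDS)
print(len(svwe), len(swef))
```

Under these assumptions, every tweet maps to a fixed-length numeric row regardless of its length, which is the kind of input a classifier such as Logistic Regression expects.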

Article Details

How to Cite
Chayangkoon, N., & Srivihok, A. (2021). Narcotic-related tweet classification in Asia using sentence vector of word embedding with feature extension. Engineering and Applied Science Research, 48(5), 547–559. Retrieved from https://ph01.tci-thaijo.org/index.php/easr/article/view/243616
Section
ORIGINAL RESEARCH
