The Development of Geo-Names Extraction from Twitter Texts Data by Conditional Random Fields
Abstract
Navigation systems, online maps, mobile applications, and other platforms are becoming increasingly important as the number of users and providers grows. Place names, or geonames (geographic names), are essential sources of information that users tend to use as search keywords, and these data are stored in different categories. This research aims to create a model capable of automatically extracting and categorizing geonames from Twitter, one of the popular social media platforms in Thailand. Twitter is a fast and continuously updated information source, which offers the opportunity to discover new geographic locations and helps in gathering geospatial information without a field survey. Standard named-entity recognition tools cannot be used directly because their entity classes are not organized by geographic name category. The model applies the conditional random field algorithm to linguistic features such as place prepositions (near, far, next, next to, etc.) and prefixes (for instance, school, market, temple, village, etc.). In this study, the corpus was created from 28,082 Twitter messages, divided into a training set of 22,445 messages (80 percent) and a test set of 5,617 messages (20 percent). The experiments were designed in two main groups according to the word tokenization algorithm used. The best model achieved an overall accuracy (F1) of 0.946, which is sufficient for relevant applications such as those running in a web browser.
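The feature design mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cue lists, the feature names, the English example tokens, and the functions `token_features` and `sentence_features` are all assumptions made for illustration. The sketch only shows how a place preposition or a place prefix on a neighboring token can be turned into per-token features of the kind a conditional random field consumes.

```python
from typing import Dict, List

# Hypothetical cue lists for illustration only; the study's actual
# (Thai) lists of prepositions and prefixes are not given in the abstract.
PLACE_PREPOSITIONS = {"near", "far", "next"}
PLACE_PREFIXES = {"school", "market", "temple", "village"}


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build a feature dict for tokens[i] from the token and its neighbors."""
    word = tokens[i].lower()
    # Sentinel strings mark the sentence boundaries.
    prev_word = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_word = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "word": word,
        "is_place_prefix": word in PLACE_PREFIXES,
        "prev_word": prev_word,
        "prev_is_place_preposition": prev_word in PLACE_PREPOSITIONS,
        "prev_is_place_prefix": prev_word in PLACE_PREFIXES,
        "next_word": next_word,
    }


def sentence_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dicts for every token of one tokenized message."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Assumed, already tokenized example message.
example = ["meet", "near", "market", "Chatuchak", "tonight"]
features = sentence_features(example)
for token, feats in zip(example, features):
    print(token, feats)
```

In this example the token "market" carries the cue that the previous word is a place preposition, and "Chatuchak" carries the cue that the previous word is a place prefix. A sequence of such dictionaries per message, paired with a label per token, is the usual input shape for CRF toolkits.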
Article Details
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.