The Development of Geo-Names Extraction from Twitter Texts Data by Conditional Random Fields

Tuvachit Chalamkate
Chanin Thinnachote
Attapol Thamrongrattanarit

Abstract

Navigation systems, online maps, mobile applications, and other platforms are becoming increasingly important as the number of users and providers grows. Place names, or geonames (geographic names), are essential sources of information that users tend to enter as search keywords, and these data are stored in different categories. This research aims to create a model that extracts geonames and categorizes them automatically from Twitter, one of the popular social media platforms in Thailand. Twitter is a fast, continuously updated information source that makes it possible to discover new geographic locations and helps in gathering geospatial information without a field survey. Standard named-entity recognition tools cannot be used directly because their entity classes are not organized by geographic name category. The model applies the conditional random field algorithm to linguistic features such as place prepositions (near, far, next, next to, etc.) and place-name prefixes (for instance, school, market, temple, village, etc.). In this study, the corpus was created from 28,082 Twitter messages, split into a training set of 22,445 messages (80 percent) and a test set of 5,617 messages (20 percent). The experiments were divided into two main groups according to the word tokenization algorithm used. The best model achieved an overall F1 score of 0.946, which is sufficient for relevant applications, including those running in a web browser.
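The feature design described in the abstract can be illustrated with a minimal sketch. It assumes English stand-ins for the Thai cue words (the actual cue lists are not given here), and the names `PLACE_PREPOSITIONS`, `PLACE_PREFIXES`, `token_features`, and `sentence_features` are hypothetical. It only builds per-token feature dictionaries of the kind a CRF toolkit typically consumes; it does not train a model and should not be read as the authors' implementation.

```python
# Hypothetical cue lists based on the English glosses in the abstract.
# The study works on Thai text; its real lists are not reproduced here.
PLACE_PREPOSITIONS = {"near", "far", "next"}
PLACE_PREFIXES = {"school", "market", "temple", "village"}


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i].lower()
    feats = {
        "bias": 1.0,
        "word": word,
        "is_place_preposition": word in PLACE_PREPOSITIONS,
        "is_place_prefix": word in PLACE_PREFIXES,
    }
    if i > 0:
        # Context from the previous token: a place preposition or prefix
        # just before a word is a cue that the word may be part of a geoname.
        prev = tokens[i - 1].lower()
        feats["prev_word"] = prev
        feats["prev_is_place_preposition"] = prev in PLACE_PREPOSITIONS
        feats["prev_is_place_prefix"] = prev in PLACE_PREFIXES
    else:
        feats["BOS"] = True  # beginning of sequence
    if i < len(tokens) - 1:
        feats["next_word"] = tokens[i + 1].lower()
    else:
        feats["EOS"] = True  # end of sequence
    return feats


def sentence_features(tokens):
    """Return one feature dictionary per token in an already tokenized message."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Illustrative, made-up message; tokens are assumed to come from a word tokenizer.
example_tokens = ["meet", "near", "school", "Suankularb"]
example_features = sentence_features(example_tokens)
```

Each message becomes a list of feature dictionaries aligned with a list of per-token labels, which is the usual input shape for sequence labelers. Because the feature values depend on how the message is tokenized, the choice of word tokenizer directly affects what the model sees, which is consistent with the experiments being grouped by tokenization algorithm.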

Article Details

Section
Research Article
