Parallel Similarity Join with Data Partitioning for Prex Filtering

Jaruloj Chongstitvatana; Methus Bhirakit

doi:10.37936/ecti-cit.201591.54410

PDF

DOI: https://doi.org/10.37936/ecti-cit.201591.54410

Keywords:

Similarity Join Prefix Filtering Parallel Join Data Partitioning

Jaruloj Chongstitvatana

Department of Mathematics and Computer Science, Faculty of Science, Chulalongkorn University, Thailand

Methus Bhirakit

Sirimedia Co., Ltd., Thailand

Abstract

Similarity join is necessary for many applications, such as text search and data preparation. Measuring the similarity between two strings is expensive because inexact match is allowed and strings in databases are long. To reduce the cost of similarity join, the filtering-and-verify approach reduces the number of string pairs which require the computation of the similarity function. Prefix filtering is a filterand- verify method that filters out dissimilar strings by examining only their prefixes. The effectiveness of prefix filtering depends on the length of the examined prefix. An adaptive method is proposed to find a suitable prefix length for filtering. Based on this concept, we propose to divide a dataset into partitions, and a suitable prefix length is determined for each partition. It also allows similarity join to run in parallel for each data partition. From our experiment, the proposed method achieves higher performance because the number of candidates is reduced and the program can execute in parallel. Moreover, the performance of the proposed method depends on the number of data partitions. If the data partition is too small, the chosen prefix length for each partition may not be optimal.

How to Cite

[1]

J. Chongstitvatana and M. Bhirakit, “Parallel Similarity Join with Data Partitioning for Prex Filtering”, ECTI-CIT Transactions, vol. 9, no. 1, pp. 22–28, Apr. 2016.

Issue

Vol. 9 No. 1 (2015): ECTI Transaction on CIT (May 2015)

Section

Artificial Intelligence and Machine Learning (AI)

Article Sidebar

Main Article Content

Abstract

Article Details