Cluster Analysis to Find Sets of High-frequency Queries for Filtering in Similarity Join
Abstract
Similarity search and similarity join are important operations in text databases. In some situations, a set of similar queries, called high-frequency queries, is repeated over a period of time. A high-frequency-queries-based filter is used to facilitate this type of query. However, the performance of this method depends mostly on which high-frequency queries are chosen. This paper proposes two methods, DBRAN and DBSM, which are based on DBSCAN and an agglomerative hierarchical clustering algorithm, to find high-frequency queries for the filter. For evaluation, both DBRAN and DBSM are applied to various sets of queries to find high-frequency queries for three datasets. DBSM is found to perform better than DBRAN when the variation among high-frequency queries is high. However, when the variation among high-frequency queries is low, the two methods perform about the same.
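To make the underlying idea concrete, the sketch below clusters a small query log with plain DBSCAN and picks one representative per cluster. It is not DBRAN or DBSM, whose details the abstract does not give. The edit-distance measure, the parameters `eps` and `min_pts`, the medoid-based choice of representative, and the sample queries are all assumptions made for illustration only.

```python
from collections import deque


def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def dbscan(queries, eps, min_pts):
    """Return one label per query: a cluster id >= 0, or -1 for noise."""
    n = len(queries)
    # Each query counts as its own neighbor, as in the usual DBSCAN definition.
    neighbors = [
        [j for j in range(n) if edit_distance(queries[i], queries[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = deque(neighbors[i])
        while frontier:
            j = frontier.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                frontier.extend(neighbors[j])  # j is a core point, keep expanding
    return labels


def medoids(queries, labels):
    """Pick, per cluster, the member with the smallest total distance to the rest."""
    clusters = {}
    for query, label in zip(queries, labels):
        if label >= 0:
            clusters.setdefault(label, []).append(query)
    return {
        label: min(members, key=lambda q: sum(edit_distance(q, m) for m in members))
        for label, members in clusters.items()
    }


# Hypothetical query log: two groups of near-duplicate queries and one outlier.
log = ["database", "databases", "databese",
       "clustering", "clusterings", "clustering", "zebra"]
labels = dbscan(log, eps=2, min_pts=3)
representatives = medoids(log, labels)
```

In this sketch, the dense groups stand in for sets of similar, repeated queries, and each medoid is a candidate high-frequency query for a filter. Queries labeled -1 are treated as too rare to justify one.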