An efficient similarity join algorithm with cosine similarity predicate

Lee, Dongjoo; Park, Jaehui; Shim, Junho; Lee, Sang-goo

doi:10.1007/978-3-642-15251-1_33

Citations

Web of Science: 0

Scopus: 27

Abstract

Given a large collection of objects, finding all pairs of similar objects, known as similarity join, is widely used to solve problems in many application domains. The computation time of a similarity join is a critical issue, since the join requires computing similarity values for all possible pairs of objects. Several existing algorithms adopt prefix filtering to avoid unnecessary similarity computations; however, these algorithms filter out object pairs inefficiently, particularly when an aggregate weighted similarity function, such as cosine similarity, is used to quantify the similarity between objects. This inefficiency is mostly caused by the large prefixes the algorithms select. In this paper, we propose an alternative method that selects small prefixes by exploiting the relationship between the arithmetic mean and the geometric mean of the elements' weights. A new algorithm, MMJoin, which implements the proposed method, dramatically reduces the average prefix size without much overhead and, as a result, saves much computation time. We demonstrate through empirical evaluation on large-scale real-world datasets that our algorithm outperforms a state-of-the-art one. © 2010 Springer-Verlag.
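
To make the terms in the abstract concrete, the following minimal Python sketch contrasts an all-pairs cosine similarity join with a generic prefix-filtered join. It is not MMJoin: the abstract does not describe how the arithmetic-geometric mean relationship is used to choose prefixes, so that step is not reproduced here. The prefix rule below relies on the Cauchy-Schwarz bound instead, and all function names, the alphabetical feature order, and the sample data are assumptions made for illustration. Vectors are assumed to be non-empty, and the threshold t is assumed to satisfy 0 < t <= 1.

```python
import math
from collections import defaultdict
from itertools import combinations

EPS = 1e-9  # slack so rounding error can only lengthen a prefix, never shorten it


def normalize(vec):
    """Scale a non-empty sparse vector (dict: feature -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {f: w / norm for f, w in vec.items()}


def cosine(x, y):
    """Dot product of two unit-length sparse vectors, i.e. their cosine similarity."""
    return sum(w * y[f] for f, w in x.items() if f in y)


def naive_join(vectors, t):
    """Baseline: compute the similarity of every pair (i, j) with i < j."""
    units = [normalize(v) for v in vectors]
    return {(i, j) for i, j in combinations(range(len(units)), 2)
            if cosine(units[i], units[j]) >= t}


def prefix(unit_vec, t):
    """Shortest leading run of features, in a fixed global order, such that
    the remaining suffix has norm below t. By Cauchy-Schwarz, a pair whose
    common features all lie in the suffix cannot reach similarity t."""
    feats = sorted(unit_vec)  # assumed global order: alphabetical
    suffix_sq = sum(w * w for w in unit_vec.values())  # squared norm of feats[k:]
    k = 0
    while k < len(feats) and suffix_sq >= t * t - EPS:
        suffix_sq -= unit_vec[feats[k]] ** 2
        k += 1
    return feats[:k]


def prefix_join(vectors, t):
    """Generic prefix-filtered join: only pairs whose prefixes share a
    feature become candidates, and only candidates are verified."""
    units = [normalize(v) for v in vectors]
    index = defaultdict(list)  # feature -> ids of earlier vectors with it in their prefix
    result = set()
    for j, y in enumerate(units):
        candidates = set()
        for f in prefix(y, t):
            candidates.update(index[f])
            index[f].append(j)
        for i in candidates:
            if cosine(units[i], y) >= t:
                result.add((i, j))
    return result


# Hypothetical sample data: four small sparse vectors.
docs = [
    {"a": 1, "b": 1, "c": 1},
    {"a": 1, "b": 1, "d": 1},
    {"c": 1, "e": 1},
    {"x": 1, "y": 1},
]
print(naive_join(docs, 0.6))
print(prefix_join(docs, 0.6))
```

On this sample both joins return the same pairs, but the filtered version verifies only the pairs that share a prefix feature. Shorter prefixes yield fewer candidates, which is the quantity the paper reports MMJoin as reducing.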

Keywords

Alternative methods; Arithmetic mean; Average size; Computation time; Cosine similarity; Critical issues; Empirical evaluations; Geometric mean; Real-world datasets; Similarity computation; Similarity functions; Similarity join; Expert systems; Problem solving; Algorithms

Title: An efficient similarity join algorithm with cosine similarity predicate

Authors: Lee, Dongjoo; Park, Jaehui; Shim, Junho; Lee, Sang-goo

DOI: 10.1007/978-3-642-15251-1_33

Publication date: 2010-08

Type: Conference Paper

Journal: Lecture Notes in Computer Science

Volume: 6262 LNCS

Issue: PART 2

Pages: 422-436

ScholarWorks@Sookmyung Women's University