Blocking and filtering techniques for entity resolution: A survey

G Papadakis, D Skoutas, E Thanos… - ACM Computing Surveys …, 2020 - dl.acm.org
Entity Resolution (ER), a core task of Data Integration, detects different entity profiles that
correspond to the same real-world object. Due to its inherently quadratic complexity, a series …

String similarity search and join: a survey

M Yu, G Li, D Deng, J Feng - Frontiers of Computer Science, 2016 - Springer
String similarity search and join are two important operations in data cleaning and
integration, which extend traditional exact search and exact join operations in databases by …

String similarity joins: An experimental evaluation

Y Jiang, G Li, J Feng, WS Li - Proceedings of the VLDB Endowment, 2014 - dl.acm.org
String similarity join is an important operation in data integration and cleansing that finds
similar string pairs from two collections of strings. More than ten algorithms have been …

MassJoin: A MapReduce-based method for scalable string similarity joins

D Deng, G Li, S Hao, J Wang… - 2014 IEEE 30th …, 2014 - ieeexplore.ieee.org
String similarity join is an essential operation in data integration. The era of big data calls for
scalable algorithms to support large-scale string similarity joins. In this paper, we study …

EmbedJoin: Efficient edit similarity joins via embeddings

H Zhang, Q Zhang - Proceedings of the 23rd ACM SIGKDD international …, 2017 - dl.acm.org
We study the problem of edit similarity joins, where given a set of strings and a threshold
value K, we want to output all pairs of strings whose edit distances are at most K. Edit …

A pivotal prefix based filtering algorithm for string similarity search

D Deng, G Li, J Feng - Proceedings of the 2014 ACM SIGMOD …, 2014 - dl.acm.org
We study the string similarity search problem with edit-distance constraints, which, given a
set of data strings and a query string, finds the similar strings to the query. Existing …

Efficient processing of graph similarity queries with edit distance constraints

X Zhao, C Xiao, X Lin, W Wang, Y Ishikawa - The VLDB Journal, 2013 - Springer
Graphs are widely used to model complicated data semantics in many applications in
bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to …

Efficient similarity join and search on multi-attribute data

G Li, J He, D Deng, J Li - Proceedings of the 2015 ACM SIGMOD …, 2015 - dl.acm.org
In this paper we study similarity join and search on multi-attribute data. Traditional methods
on single-attribute data have pruning power only on single attributes and cannot efficiently …

A survey of blocking and filtering techniques for entity resolution

G Papadakis, D Skoutas, E Thanos… - arXiv preprint arXiv …, 2019 - arxiv.org
Efficiency techniques have been an integral part of Entity Resolution since its infancy. In this
survey, we organized the bulk of works in the field into Blocking, Filtering and hybrid …

Similarity query support in big data management systems

T Kim, W Li, A Behm, I Cetindil, R Vernica, V Borkar… - Information Systems, 2020 - Elsevier
Similarity query processing is becoming increasingly important in many applications such as
data cleaning, record linkage, Web search, and document analytics. In this paper we study …