Duplicate detection, also known as record linkage, encompasses automated methods for identifying records that represent the same real-world object. This is necessary, for example, when merging multiple data sources (deduplication) or when cleaning data.
Duplicates can arise, for example, from input and transmission errors, from different spellings and abbreviations, or from differing data schemas. For instance, an address database may combine addresses from different sources, so that the same person's address appears multiple times with variations. Duplicate detection is then used to find these duplicates and to identify the actual addressees as distinct objects.
There are two types of duplicates: identical duplicates, in which all values match, and non-identical duplicates, in which one or more values differ. In the first case, detection and cleanup are trivial: the surplus duplicates can simply be deleted without loss of information. The second case is more difficult and complex, because the duplicates cannot be identified by an exact comparison as in the first case; heuristics must be applied instead. Moreover, the surplus records cannot simply be deleted; they must first be consolidated and their values combined.
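For the first case, a minimal Python sketch with made-up example records, showing that identical duplicates can be removed by an exact comparison alone:

```python
records = [
    ("Jane Doe", "12 Main Street", "Springfield"),
    ("Jane Doe", "12 Main Street", "Springfield"),  # identical duplicate
    ("J. Doe", "12 Main St", "Springfield"),        # non-identical duplicate
]

# Identical duplicates are caught by an as-is comparison: hashing each
# whole tuple collapses exact copies without any loss of information.
unique = list(dict.fromkeys(records))
print(unique)
# The non-identical variant ("J. Doe", ...) survives this step and
# requires heuristic comparison instead.
```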
The process of detecting and consolidating duplicates can be carried out in the following four steps (a rough sketch of the overall pipeline follows the list):
- Pre-processing of data
- Data partitioning
- Detection of duplicates
- Consolidation into a single record.
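As an illustration of how these steps fit together, here is a hedged Python sketch; the four callables `normalize`, `partition_key`, `is_duplicate`, and `merge` are hypothetical placeholders for the concrete, application-specific methods described below:

```python
def deduplicate(records, normalize, partition_key, is_duplicate, merge):
    """Illustrative four-step pipeline; the callable parameters are
    placeholders, not a fixed interface."""
    # 1. Pre-processing: normalize values (case, abbreviations, ...).
    records = [normalize(r) for r in records]

    # 2. Partitioning: group records so that only plausible candidate
    #    pairs are compared later.
    partitions = {}
    for r in records:
        partitions.setdefault(partition_key(r), []).append(r)

    # 3. Detection and 4. consolidation within each partition:
    #    duplicates are merged into a single record rather than deleted.
    result = []
    for group in partitions.values():
        merged = []
        for r in group:
            for i, m in enumerate(merged):
                if is_duplicate(r, m):
                    merged[i] = merge(r, m)
                    break
            else:
                merged.append(r)
        result.extend(merged)
    return result
```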
Various similarity measures are used to detect duplicates, such as the Levenshtein distance or the typewriter (keyboard) distance. The candidate pairs are usually categorized into three classes: duplicates, non-duplicates, and potential duplicates, i.e., pairs whose classification is not unambiguous and which therefore have to be classified manually.
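For illustration, a minimal Python sketch of the Levenshtein distance, together with two hypothetical thresholds that map a pair of strings to the three classes:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (two-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def classify(a: str, b: str, dup_max: int = 1, maybe_max: int = 3) -> str:
    """The two thresholds are illustrative; real values are domain-specific."""
    d = levenshtein(a, b)
    if d <= dup_max:
        return "duplicate"
    if d <= maybe_max:
        return "potential duplicate"  # has to be classified manually
    return "non-duplicate"

print(classify("Main Street", "Main Stret"))  # duplicate (distance 1)
print(classify("Main Street", "Maine Strt"))  # potential duplicate (distance 3)
```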
A distinction is made between two general approaches to duplicate detection:
- Rule-based approach: Here, tuples above a certain similarity are classified as duplicates. To this end, rules are defined over the pairwise similarities that indicate whether a pair is a duplicate or not (see the sketch after this list). The rules are mostly based on domain knowledge.
- Machine learning: This usually requires previously classified tuples as training data. This data is then used to learn classification rules automatically and to test their accuracy. In contrast to the rule-based approach, no domain knowledge is required here, except for classifying the training data.
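A hedged sketch of the rule-based approach in Python; the field names, similarity function, and thresholds are made-up examples of a domain rule, not a standard:

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]; stdlib difflib is used
    here only to keep the sketch self-contained."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_duplicate(rec1: dict, rec2: dict) -> bool:
    """Hypothetical domain rule for address records: the names must be
    very similar and the streets at least roughly similar."""
    return (similarity(rec1["name"], rec2["name"]) > 0.9
            and similarity(rec1["street"], rec2["street"]) > 0.8)

print(is_duplicate(
    {"name": "Jane Doe", "street": "12 Main Street"},
    {"name": "Jane Doe", "street": "12 Main Stret"},
))  # True: equal names, near-identical streets
```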
Since comparing every record with every other record is usually too expensive, methods such as the sorted-neighborhood method are used, in which only potentially similar records are checked for being duplicates.
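A minimal sketch of the sorted-neighborhood method, assuming an illustrative sorting key, window size, and similarity threshold:

```python
import difflib

def sorted_neighborhood(records, key, window=3):
    """Sort by a domain-specific key, slide a fixed-size window over
    the sorted list, and compare only records inside the window."""
    ordered = sorted(records, key=key)
    candidates = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            # Hypothetical decision rule: high string similarity.
            if difflib.SequenceMatcher(None, rec, other).ratio() > 0.8:
                candidates.append((rec, other))
    return candidates

names = ["Jane Doe", "John Smith", "Jane Do", "Jon Smith"]
# Sorting brings likely duplicates next to each other, so a small
# window finds them without comparing all pairs.
print(sorted_neighborhood(names, key=str.lower, window=2))
# [('Jane Do', 'Jane Doe'), ('John Smith', 'Jon Smith')]
```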
In addition, there are phonetic algorithms that assign a phonetic code to words according to their pronunciation; this code can then be used to implement a similarity search.
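As an illustration, a rough Python sketch of Soundex, one classic phonetic algorithm for English; the example names are made up:

```python
def soundex(word: str) -> str:
    """Classic English Soundex: keep the first letter, encode the rest
    by sound class, skip vowels and repeated codes, pad to 4 chars."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    if not word:
        return ""
    word = word.lower()
    result = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        if ch not in "hw":  # 'h' and 'w' do not separate equal codes
            prev = code
    return (result + "000")[:4]

# Differently spelled names with the same pronunciation share a code,
# which turns phonetic similarity search into a simple lookup by code:
print(soundex("Meyer"), soundex("Maier"))  # M600 M600
print(soundex("Smith"), soundex("Smyth"))  # S530 S530
```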