Data fusion is the process of merging and completing incomplete data sets. It is an important part of information integration. Data in a recipient data set is supplemented with the help of a donor data set. The donor data set contains a set of common variables together with additional variables that are specific to it, while the recipient data set contains the same common variables together with its own specific variables. The common variables are therefore present in both data sets. Based on the donor data set, a model is built that calculates the values of the donor-specific variables from the common variables. This model is then applied to the recipient data set to create a new, merged data set. The statistical methods used are summarized under the term statistical matching and are partly related to methods for the imputation of missing values.
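The donor/recipient procedure can be illustrated with a minimal sketch. It assumes a single common variable and uses ordinary least squares as the model; the field names (`age`, `spend`, `income`) and the data are hypothetical, and real statistical matching typically uses several common variables and more elaborate models.

```python
# Hypothetical data: "age" is the common variable, "spend" exists only
# in the donor data set, "income" only in the recipient data set.
donor = [
    {"age": 20, "spend": 100.0},
    {"age": 30, "spend": 150.0},
    {"age": 40, "spend": 200.0},
]
recipient = [
    {"age": 25, "income": 2000},
    {"age": 35, "income": 3000},
]


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Step 1: build the model on the donor data set (common -> donor-specific).
intercept, slope = fit_line(
    [d["age"] for d in donor],
    [d["spend"] for d in donor],
)

# Step 2: apply the model to the recipient data set to obtain the merged data.
fused = [{**r, "spend": intercept + slope * r["age"]} for r in recipient]
```

Each entry of `fused` keeps the recipient's own variables and gains a predicted value for the donor-specific variable.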
Whereas duplicate detection deals with records that are largely complete and differ only in small details, data fusion has to combine several partially incomplete data sets. Before data from two sources can be fused, the sources may first need to be brought to a common schema (schema integration). Attributes that do not exist in a source are filled with NULL (for “no value”). As a rule, a common identifying attribute is also required; it may have been determined beforehand, for example by duplicate detection. A simple method of data fusion is to merge a record into another one if it has more missing attributes and agrees with the other record in all attributes that are present (MINIMUM UNION). The record with more missing attributes is then subsumed by the more complete record.
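As a rough sketch of this idea, the following code pads records from two sources to a common schema (using `None` for NULL) and then drops every record that is subsumed by a more complete one. The function names and the sample data are illustrative assumptions, not a reference implementation of MINIMUM UNION.

```python
def outer_union(*sources):
    """Bring all records to a common schema, filling missing attributes with None."""
    schema = sorted({key for source in sources for rec in source for key in rec})
    return [{k: rec.get(k) for k in schema} for source in sources for rec in source]


def subsumes(a, b):
    """True if a subsumes b: b has more NULLs and agrees with a wherever b has a value."""
    nulls_a = sum(v is None for v in a.values())
    nulls_b = sum(v is None for v in b.values())
    return nulls_b > nulls_a and all(
        a[k] == v for k, v in b.items() if v is not None
    )


def minimum_union(*sources):
    """Outer union followed by removal of exact duplicates and subsumed records."""
    unique = []
    for rec in outer_union(*sources):
        if rec not in unique:
            unique.append(rec)
    return [r for r in unique if not any(subsumes(o, r) for o in unique)]


# Hypothetical sources with partly different schemas.
source_a = [
    {"id": 1, "name": "Ada", "city": "London"},
    {"id": 2, "name": "Bob", "city": None},
]
source_b = [
    {"id": 1, "name": "Ada", "phone": None},
    {"id": 2, "name": "Bob", "phone": "555"},
]

merged = minimum_union(source_a, source_b)
```

In this example the less complete record for each `id` is subsumed, so `merged` contains one record per `id`. Records that are equally complete but hold different attributes are not subsumed and both remain.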
If related records do not merely lack individual attribute values but actually contradict each other, one speaks of data conflicts. Data conflicts can be caused, for example, by typos, different spellings and encodings, errors in calculations or automatic text recognition, and outdated data. To resolve data conflicts by aggregation, preferences or other conflict resolution functions (for example, the average of differing numbers) must be specified. The records are first grouped into sets of duplicates (see duplicate detection) and then aggregated within each group.
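Grouping followed by aggregation might look like the sketch below. It assumes that duplicates have already been assigned a shared identifier (`id`), and the choice of resolution functions (longest string for names, average for prices) is purely illustrative.

```python
from collections import defaultdict
from statistics import mean


def first_value(values):
    """Fallback resolution function: keep the first non-NULL value."""
    return values[0]


def fuse(records, key, resolvers):
    """Group records by `key`, then resolve each attribute within a group."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    result = []
    for ident, group in groups.items():
        out = {key: ident}
        attrs = sorted({a for rec in group for a in rec if a != key})
        for attr in attrs:
            # NULLs (None) carry no information and are ignored.
            values = [rec[attr] for rec in group if rec.get(attr) is not None]
            resolve = resolvers.get(attr, first_value)
            out[attr] = resolve(values) if values else None
        result.append(out)
    return result


# Hypothetical duplicates with conflicting and missing values.
records = [
    {"id": 1, "name": "J. Smith", "price": 10.0},
    {"id": 1, "name": "John Smith", "price": 12.0},
    {"id": 2, "name": "Ann Lee", "price": None},
    {"id": 2, "name": None, "price": 8.0},
]

resolvers = {
    "name": lambda values: max(values, key=len),  # prefer the longest spelling
    "price": mean,                                # average of differing numbers
}

resolved = fuse(records, "id", resolvers)
```

Here the two records with `id` 1 conflict in both attributes and are reduced to a single record, while the records with `id` 2 simply complement each other.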