Information integration is the merging of information from different data sets (data sources), which usually have different data structures, into a common uniform data structure. In this process, heterogeneous sources are brought together as completely and efficiently as possible into a structured unit that can be used more effectively than would be possible with direct access to the individual sources. Information integration is particularly necessary where several systems that have grown independently are to be connected, for example when merging companies, workflows and applications or when searching for information on the Internet. The integration of more complex systems only came into the focus of computer science research in the 1990s and is thus still under development.
The rapid development in the technology of databases since the 1960s led to the need to share and combine existing data. This combination can take place at a variety of levels in the database structure. A popular solution is based on the principle of the data warehouse, which extracts the data from heterogeneous sources, transforms it and loads it into a unified system.
Since 2009, the trend in information integration has been towards standardized query interfaces that query the data in real time. The data is retrieved directly from the heterogeneous sources, which keeps it up to date but comes at the cost of longer access times. Since 2010, some research in this field has dealt with the problem of semantic integration. This is less concerned with the architecture of the different databases than with the resolution of semantic conflicts between heterogeneous data sources.
The models for data processing that have existed since 2011 lead to data isolation in the form of data islands of scattered data. These islands are an unwanted artefact of the data modelling methodology, which produces disparate data sets. To counteract this problem, methods have been developed to avoid data isolation artefacts and to integrate the isolated data into the overall data structure.
Information integration becomes significant in several different situations, both commercial and scientific. Examples of the practical application of information integration can be found in the integration of product information from manufacturer specifications and the retrieval of this information by product search engines or in the evaluation of various geological data sets to determine cross-border surface properties.
Where the data of different sources is redundant, it is in some cases possible to determine automatically which records belong together and to use them to complete each other (data fusion). For example, the entries of a telephone list and an employee directory can be combined if the personal names match. Since more information about each individual object is then available, this is also referred to as densification.
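The telephone list example can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `fuse_by_name`, the field names and the sample records are hypothetical, and matching on a lower-cased name is a deliberately naive stand-in for real duplicate detection.

```python
def fuse_by_name(*sources):
    """Merge records from several sources that share the same 'name' value."""
    fused = {}
    for source in sources:
        for record in source:
            # Naive matching rule (an assumption): names are equal after
            # trimming whitespace and ignoring case.
            key = record["name"].strip().lower()
            merged = fused.setdefault(key, {})
            for field, value in record.items():
                # Keep the first non-empty value seen for each field.
                if value is not None and field not in merged:
                    merged[field] = value
    return list(fused.values())


# Hypothetical sample data for the two sources.
phone_list = [
    {"name": "Ada Lovelace", "phone": "555-0100"},
    {"name": "Alan Turing", "phone": "555-0101"},
]
employee_directory = [
    {"name": "ada lovelace", "department": "Research"},
    {"name": "Grace Hopper", "department": "Engineering"},
]

fused = fuse_by_name(phone_list, employee_directory)
for row in fused:
    print(row)
```

The record for the person found in both sources ends up with a phone number and a department, which is the densification effect described above. Records found in only one source are kept unchanged.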
The goal of integration is to provide a consistent global view of all data sources. Redundant data sources can be used for verification. Merging intensionally redundant sources (sources that describe the same kinds of objects) leads to higher coverage, while completing data sets from extensionally redundant sources (sources that describe the same individual objects) leads to higher density.
Materialized Versus Virtual Integration
Two types of integration can be distinguished:
Materialized or physical integration: Data from different data sources – usually with different data structures – is transformed into the target structure and copied to a central database, where it is then available for evaluation. This principle can be found, for example, in data warehouses or the project for data exchange of the Open Archives Initiative.
Virtual or logical integration: The data remains in the different sources, and integration only takes place when a query is issued (federated information system).
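The practical difference between the two types can be illustrated with a small Python sketch. All class names (`Source`, `MaterializedView`, `VirtualView`) and the price data are hypothetical and chosen only for illustration; real systems add transformation, caching and query planning.

```python
class Source:
    """A hypothetical data source holding key/value rows."""

    def __init__(self, rows):
        self.rows = dict(rows)

    def read(self):
        return dict(self.rows)


class MaterializedView:
    """Copies all source data into a central store ahead of time."""

    def __init__(self, sources):
        self.sources = sources
        self.store = {}
        self.refresh()

    def refresh(self):
        # Re-copy everything from the sources into the central store.
        self.store = {}
        for source in self.sources:
            self.store.update(source.read())

    def query(self, key):
        return self.store.get(key)  # answered from the local copy


class VirtualView:
    """Keeps no copy; every query goes to the sources."""

    def __init__(self, sources):
        self.sources = sources

    def query(self, key):
        for source in self.sources:
            rows = source.read()  # source access at query time
            if key in rows:
                return rows[key]
        return None


shop_a = Source({"widget": 10})
shop_b = Source({"gadget": 25})
materialized = MaterializedView([shop_a, shop_b])
virtual = VirtualView([shop_a, shop_b])

shop_a.rows["widget"] = 12  # a source changes after the copy was made

stale = materialized.query("widget")      # still the copied value
fresh = virtual.query("widget")           # current value from the source
materialized.refresh()
refreshed = materialized.query("widget")  # current again after a reload
print(stale, fresh, refreshed)
```

The materialized view answers from its own copy and is only as current as its last refresh, whereas the virtual view is always current but has to contact the sources for every query.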
Materialized Integration Architectures
Data Warehouses (DWH) are the most important representatives of materialized database systems. The data required for a company’s information needs is persisted directly in a central data warehouse to provide a global, unified view of the relevant data. To bring the source data into the base database of the data warehouse, an integration layer must be implemented (ETL process: extract, transform, load).
Operational Data Stores (ODS): While data warehouse systems are primarily adapted to the requirements of corporate management, so that the available information serves strategic decision-making, operational data stores make the integrated data available to operational business processes. This implies that the centrally stored data is used operationally, i.e. after the integration has been completed (import, cleansing, storage), this data is subject to change. The focus of ODS systems is therefore not on historical but primarily on current data. This leads to another essential difference from the DWH: synchronization with the source data must take place either at query time or at least at frequent, regular intervals. ODS are mostly used by companies in business areas in which the timeliness of the data plays an essential role, such as customer and supplier communication and warehouse management. With the trend towards real-time data warehouses and more powerful database management systems, the operational data store is likely to merge into the data warehouse.
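A minimal sketch of such an ETL step, using Python's built-in `sqlite3` module as a stand-in for the warehouse database. The two sources, their field layouts and the `revenue` table are hypothetical assumptions made for this example only.

```python
import sqlite3

# Two hypothetical sources with different structures.
crm_rows = [{"customer": "Ada Lovelace", "revenue_eur": "1200.50"}]
shop_rows = [("TURING, ALAN", 99900)]  # (name as "LAST, FIRST", revenue in cents)


def extract():
    """Read the raw rows from each source."""
    return crm_rows, shop_rows


def transform(crm, shop):
    """Map both source formats to (customer, amount_eur, source)."""
    unified = []
    for row in crm:
        unified.append((row["customer"], float(row["revenue_eur"]), "crm"))
    for name, cents in shop:
        last, first = [part.strip().title() for part in name.split(",")]
        unified.append((f"{first} {last}", cents / 100, "shop"))
    return unified


def load(conn, rows):
    """Write the unified rows into the central table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS revenue "
        "(customer TEXT, amount_eur REAL, source TEXT)"
    )
    conn.executemany("INSERT INTO revenue VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database for the example
load(conn, transform(*extract()))

result = conn.execute(
    "SELECT customer, amount_eur, source FROM revenue ORDER BY customer"
).fetchall()
print(result)
```

After loading, queries run against the single `revenue` table only and no longer need to know the formats of the original sources.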
Virtual Integration Architectures
Unlike in materialized systems, data in virtual database systems is not stored in the integrated system itself but remains physically in the data sources and is only loaded into the integration system when a query is processed (virtual data store).
Federated Database Systems (FDBS): The core of an FDBS is a global conceptual schema. On the one hand, this schema provides the interface to the local, distributed databases and their local schemas; on the other hand, it provides requesting applications with an integrated global view of the federated source data by means of appropriate services. FDBS are usually created by combining several database systems (multi-database systems) with the aim of central (federated) coordination of common tasks.
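The role of the global schema can be sketched as follows. This is a simplified Python illustration, not a description of any particular FDBS product: the class `FederatedSchema`, the attribute mappings and the two member databases are hypothetical, and the local databases are represented by plain lists.

```python
class FederatedSchema:
    """Global schema that maps queries onto member databases at query time."""

    def __init__(self):
        self._members = []

    def register(self, fetch_rows, mapping):
        """Add a member database.

        fetch_rows: callable returning the member's local rows (dicts).
        mapping: global attribute name -> local attribute name.
        """
        self._members.append((fetch_rows, mapping))

    def query(self, predicate=lambda row: True):
        """Yield rows in the global schema that satisfy the predicate."""
        for fetch_rows, mapping in self._members:
            for local_row in fetch_rows():  # data stays in the source
                global_row = {
                    global_attr: local_row.get(local_attr)
                    for global_attr, local_attr in mapping.items()
                }
                if predicate(global_row):
                    yield global_row


# Two hypothetical local databases with different local schemas.
hr_db = [{"emp_name": "Ada Lovelace", "dept": "Research"}]
sales_db = [{"fullName": "Alan Turing", "unit": "Sales"}]

federation = FederatedSchema()
federation.register(lambda: hr_db, {"name": "emp_name", "department": "dept"})
federation.register(lambda: sales_db, {"name": "fullName", "department": "unit"})

everyone = list(federation.query())
sales_only = list(federation.query(lambda row: row["department"] == "Sales"))
print(everyone)
print(sales_only)
```

Applications only see the global attributes `name` and `department`; the per-member mappings hide the local schemas, and no data is copied out of the member databases ahead of time.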