In today's world, data can be stored almost without limit, and knowledge discovery in databases is a proven way to extract knowledge from it. The aim of this article is to introduce business-oriented readers to Knowledge Discovery in Databases without requiring a deep background in computer science. Commercial devices and scientific instruments, such as scanners, cash registers and telescopes, generate ever-increasing amounts of data. These data contain potential knowledge that far exceeds what humans can analyze manually.
This is where knowledge discovery in databases (KDD) comes in. Its motivation is to filter the potential knowledge out of this large amount of data. KDD works at the interface of statistics and database systems and draws heavily on methods from machine learning. Knowledge discovery in databases is the process of semi-automatically extracting knowledge from databases, knowledge that is valid, previously unknown, and potentially useful for a given purpose.
This article first explains the fundamentals of statistics and database systems needed for knowledge discovery in databases. It then discusses the specifics of the topic: the aims of knowledge discovery in databases (KDD), the steps of the KDD process, and the methods used in KDD. Finally, a conclusion is drawn and several reasons are given why this area will continue to gain importance in the future.
Fundamentals of Database Systems
A database system is an electronic data management system. Its main task is to store large amounts of data efficiently, consistently and permanently. Permanent storage is called persistent storage; correct storage is called consistent storage.
Database systems consist of a database and a database management system.
The database (DB) is a collection of all existing data with their descriptions or attributes.
Database Management System
The database management system (DBMS) is the software that manages the database. It controls access to the data and allows existing data to be supplemented or changed. Database systems have three levels of abstraction: the external level, the conceptual level and the internal level. This is also known as the ANSI/SPARC three-level model.
The external level allows different users or user groups to have different views of the data stock, since not every user is permitted, or wants, to see all the data.
The conceptual level specifies the overall logical view of all data: it describes the objects and the relationships between them. Its goal is to ensure completely redundancy-free storage of all data in the database. The data is stored in third normal form according to the relational database schema.
The internal level deals with the physical implementation of the conceptual schema. Information about the physical storage structures and access mechanisms belongs to this layer, which is why the internal level is also called the physical level. In addition, base tables are often provided in non-normalized form to speed up access to the data. Performance-intensive aggregation tables are usually calculated overnight and stored in additional tables, so that users can retrieve the results very quickly during the day.
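The idea of a precomputed aggregation table can be sketched with a small example using Python's built-in sqlite3 module; the sales schema here is purely hypothetical:

```python
import sqlite3

# Sketch of a nightly pre-aggregation job (hypothetical schema):
# detailed sales rows are summarized into an aggregation table so
# that daytime queries can read the finished result directly.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sales (product TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("apple", 3.0), ("apple", 2.0), ("pear", 5.0)])

# The "overnight" step: materialize totals per product.
cur.execute("""CREATE TABLE sales_agg AS
               SELECT product, SUM(amount) AS total
               FROM sales GROUP BY product""")

# A daytime query hits the small aggregation table, not the base table.
totals = dict(cur.execute("SELECT product, total FROM sales_agg"))
print(totals)
con.close()
```

In a real warehouse the aggregation step would be scheduled (e.g. as a nightly batch job) and the base table would hold millions of rows; the principle is the same.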
As a relational database language, SQL has become the industry standard. SQL consists of three parts:
Data Definition Language (DDL): the part of SQL used to create databases, tables, views, etc. (CREATE), change their structure (ALTER) or delete databases and database objects (DROP).
Data Manipulation Language (DML): the part of SQL used to insert records (INSERT), change them (UPDATE), delete them (DELETE) or query them (SELECT).
Data Control Language (DCL): the part of SQL used to grant (GRANT) or revoke (REVOKE) access rights.
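The DDL and DML parts can be tried out with a minimal sketch using Python's built-in sqlite3 module and an in-memory database. Note that SQLite itself has no DCL; GRANT and REVOKE are features of server DBMSs such as PostgreSQL or MySQL:

```python
import sqlite3

# In-memory SQLite database for illustration only.
con = sqlite3.connect(":memory:")
cur = con.cursor()

# DDL: create a table (hypothetical customer schema).
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# DML: insert, update and query records.
cur.execute("INSERT INTO customer (name, city) VALUES ('Smith', 'Berlin')")
cur.execute("INSERT INTO customer (name, city) VALUES ('Jones', 'Hamburg')")
cur.execute("UPDATE customer SET city = 'Munich' WHERE name = 'Smith'")
rows = cur.execute("SELECT name, city FROM customer ORDER BY name").fetchall()
print(rows)  # [('Jones', 'Hamburg'), ('Smith', 'Munich')]

# DDL: remove the table again.
cur.execute("DROP TABLE customer")
con.close()
```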
Fundamentals of Statistics and Big Data
Statistics is divided into three basic branches: descriptive statistics, inductive statistics and exploratory statistics.
In descriptive statistics, the available data are described and graphically condensed into tables, graphs or figures, using parameters such as mean and dispersion as well as diagrams and curves. This also makes it possible to filter out inconsistent data. Descriptive statistics asks: how can the distribution of a feature be described?
Inductive statistics tries to infer, from a small sample, characteristics of the population about which one wants to make a statement. This is analyzed and modeled using stochastic methods. Inductive statistics asks: how can the properties of a sample be generalized to the whole population?
Exploratory statistics is an intermediate form of descriptive and inductive statistics. It uses descriptive methods and inductive test methods to detect possible relationships or differences between data; the results found are called hypotheses. Exploratory statistics asks: what is remarkable or unusual about the distribution of a feature?
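The three branches can be illustrated on one synthetic data set using Python's standard library; the sales figures, the normal-approximation confidence interval and the two-standard-deviation outlier rule are all simplifying assumptions chosen for the sketch:

```python
import random
import statistics

# Hypothetical population of daily sales figures (assumed data).
random.seed(0)
population = [random.gauss(100, 15) for _ in range(10_000)]
sample = random.sample(population, 50)

# Descriptive statistics: summarize the sample itself.
mean = statistics.mean(sample)
stdev = statistics.stdev(sample)
print(f"sample mean={mean:.1f}, stdev={stdev:.1f}")

# Inductive statistics: infer a property of the population from the
# sample, here a rough 95% confidence interval for the population mean.
margin = 1.96 * stdev / (len(sample) ** 0.5)
print(f"population mean is likely in [{mean - margin:.1f}, {mean + margin:.1f}]")

# Exploratory statistics: look for the unusual, e.g. values more
# than two standard deviations from the sample mean.
outliers = [x for x in sample if abs(x - mean) > 2 * stdev]
print(f"{len(outliers)} potential outliers found")
```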
A data warehouse (DW) is a persistent, integrated collection of data from multiple sources for the purpose of analysis and decision support. For a more comprehensive introduction, see our articles on the fundamentals of Big Data, Data Warehouse, Data Lake and Data Mining elsewhere on this website, using the search function.
Architecture of a Data Warehouse
The data warehouse collects data from various operational database systems, such as purchasing, order processing and marketing. The complex process of integrating these operational systems creates metadata that describes the data warehouse. Data marts are extracted from the data warehouse; they represent certain partial views or extracts as copies of the actual data. The data warehouse serves ad hoc queries, online analytical processing (OLAP) and data mining, thus laying the foundation for knowledge discovery.
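The flow from operational sources into the warehouse and on into a data mart can be sketched in a few lines of Python; the source names, field names and common schema below are invented for illustration:

```python
# Minimal extract-transform-load sketch (hypothetical data): rows from
# two operational sources are transformed into a common schema and
# loaded into one warehouse collection; a data mart is a partial copy.
purchasing = [{"item": "steel", "cost": 120.0}]
orders = [{"product": "bike", "revenue": 500.0},
          {"product": "helmet", "revenue": 40.0}]

warehouse = []
# Transform each source into a common (source, name, value) schema.
for row in purchasing:
    warehouse.append({"source": "purchasing", "name": row["item"], "value": row["cost"]})
for row in orders:
    warehouse.append({"source": "orders", "name": row["product"], "value": row["revenue"]})

# A data mart for the sales department: a filtered copy of the warehouse.
sales_mart = [r for r in warehouse if r["source"] == "orders"]
print(len(warehouse), len(sales_mart))  # 3 2
```

Real ETL tools add scheduling, cleansing and metadata management on top of this basic pattern.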
Fundamentals of Knowledge Discovery in Databases
The aim of Knowledge Discovery in Databases is to autonomously discover decision-relevant but previously unknown relationships in large amounts of data and present them to the analyst or user in a clear format; these discovered relationships provide a gain in knowledge. A key statement is that KDD is a non-trivial process whose purpose is to extract patterns from large data sets. These patterns should hold for a large part of the database (valid), describe previously unknown relationships (novel), be useful for the given purpose (potentially useful) and be easily understandable (ultimately understandable).
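A tiny, assumed market-basket example shows what such a pattern can look like in practice: item pairs that occur together in many transactions. This is a deliberately simplified co-occurrence count, not a full association-rule algorithm:

```python
from collections import Counter
from itertools import combinations

# Toy market-basket data (assumed): each set is one transaction.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"beer", "chips"},
]

# Count how often each item pair occurs together across transactions.
pair_counts = Counter()
for basket in transactions:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep pairs supported by at least half of all transactions: these are
# candidate "valid, novel, potentially useful" patterns in the KDD sense.
min_support = len(transactions) / 2
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)
```

On this data the pairs ("bread", "butter") and ("bread", "milk") each appear in two of the four transactions and survive the support threshold.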
Conclusion of Part I
In this first part, we introduced readers to Knowledge Discovery in Databases (KDD). In the second part of this article, we will discuss the process and the methods of Knowledge Discovery in Databases.