Data mining is the systematic application of statistical methods to large data sets with the aim of identifying new cross-connections and trends. Because of their size, such data sets are processed using computer-aided methods. In practice, the term data mining has come to be applied to the entire process of “knowledge discovery in databases” (KDD), which also includes steps such as preprocessing and evaluation, while data mining in the narrower sense refers only to the actual analysis step of that process.
The term data mining is somewhat misleading, because it is about extracting knowledge from already existing data and not about obtaining the data itself. Nevertheless, the concise designation has prevailed. The mere collection, storage and processing of large amounts of data is also sometimes referred to by the buzzword data mining. In a scientific context, it primarily refers to the extraction of knowledge that is “valid (in the statistical sense), hitherto unknown and potentially useful” “for the determination of certain regularities, laws and hidden relationships”. It is defined as “a step in the KDD process that consists of applying data analysis and discovery algorithms that provide a special collection of patterns (or models) of the data under acceptable efficiency limitations.” Inferring (hypothetical) models from data is called statistical inference.
Many of the methods used in data mining originate from statistics, especially multivariate statistics, and are often merely adapted in their complexity for use in data mining, frequently by approximation at the expense of accuracy. The loss of accuracy is often accompanied by a loss of statistical validity, so that from a purely statistical point of view the procedures can sometimes even be “wrong”. For data mining applications, however, experimentally verified benefit and acceptable runtime are often more important than statistically proven correctness.
Machine learning is also closely related, but in data mining the focus is on finding new patterns, while machine learning primarily aims to have the computer automatically recognize known patterns in new data. However, a simple separation is not always possible: if, for example, association rules are extracted from the data, this is a process that corresponds to typical data mining tasks, yet the extracted rules also meet the goals of machine learning. Conversely, the machine learning subfield of unsupervised learning is very closely related to data mining. Machine learning methods are often used in data mining and vice versa.
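To make the association-rule example concrete, the following is a minimal sketch of how rules between single items could be extracted using the common support and confidence measures. The transactions, the function names and the thresholds are hypothetical and chosen only for illustration; practical systems use more efficient algorithms.

```python
from itertools import combinations

# Hypothetical toy transactions, invented for this illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return rules (lhs, rhs, support, confidence) between single items."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # the pair occurs too rarely to be of interest
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs}, transactions)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules

for lhs, rhs, s, c in pair_rules(transactions):
    print(f"{lhs} -> {rhs}  (support {s:.2f}, confidence {c:.2f})")
```

The rules found this way are new patterns in the data (the data mining view), and they could equally be applied to predict items in new transactions (the machine learning view).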
Research in the field of database systems, especially index structures, plays a major role in data mining when it comes to reducing complexity. Typical tasks such as searching for nearest neighbors can be significantly accelerated with the help of a suitable database index and the runtime of a data mining algorithm can be improved as a result.
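As an illustration of how an index can speed up nearest-neighbor search, the sketch below builds a small k-d tree and compares it with a linear scan. The k-d tree is only one possible index structure and is an assumption of this example, not a structure named in the text; all function names are hypothetical.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a k-d tree as nested tuples (point, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the coordinate axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (
        points[mid],
        build_kdtree(points[:mid], depth + 1),
        build_kdtree(points[mid + 1:], depth + 1),
    )

def nearest(node, query, depth=0, best=None):
    """Return the (distance, point) pair in the tree closest to query."""
    if node is None:
        return best
    point, left, right = node
    dist = math.dist(point, query)
    if best is None or dist < best[0]:
        best = (dist, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # The far subtree can only hold a closer point if the splitting
    # plane is nearer to the query than the best match found so far.
    if abs(diff) < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best

def nearest_brute_force(points, query):
    """Reference implementation without an index: scan every point."""
    return min((math.dist(p, query), p) for p in points)

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(nearest(tree, (9, 2)))
```

The linear scan visits every point for each query, whereas the tree search can skip whole subtrees, which is where the runtime improvement comes from on larger data sets.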
Information retrieval (IR) is another field that benefits from the insights of data mining. To put it simply, this is about the computer-aided search for complex content, but also about the presentation for the user. Data mining methods such as cluster analysis are used here to improve the search results and their presentation to the user, for example by grouping similar search results. Text mining and web mining are two specializations of data mining that are closely related to information retrieval.
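Grouping similar search results can be sketched as follows. This is a deliberately simple greedy grouping based on the Jaccard similarity of the words in each title, not a full cluster analysis method; the result titles, the threshold and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_results(titles, threshold=0.3):
    """Greedily put each title into the first group whose first member is similar enough."""
    groups = []  # each group is a list of (title, token_set) pairs
    for title in titles:
        tokens = set(title.lower().split())
        for group in groups:
            if jaccard(tokens, group[0][1]) >= threshold:
                group.append((title, tokens))
                break
        else:
            # no existing group was similar enough: start a new one
            groups.append([(title, tokens)])
    return [[title for title, _ in group] for group in groups]

# Hypothetical search results for the ambiguous query "jaguar".
results = [
    "jaguar car review",
    "jaguar big cat habitat",
    "new jaguar car price",
    "big cat habitat loss",
]

for group in group_results(results):
    print(group)
```

Presenting the groups instead of a flat list lets the user pick the intended meaning of an ambiguous query at a glance.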
Data collection, i.e. the collection of information in a systematic manner, is an important prerequisite for obtaining valid results with the help of data mining. If the data was collected in a statistically improper manner, there may be a systematic error in the data, which is then found in the data mining step. The result may not be a consequence of the observed objects, but caused by the way in which the data was collected.
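The effect of improper data collection can be illustrated with a small, entirely hypothetical example: a telephone survey that only reaches people with a landline. The population and its values are invented for this sketch.

```python
from statistics import mean

# Hypothetical population: (has_landline, daily_screen_hours).
population = [
    (True, 2), (True, 3), (True, 2),
    (False, 6), (False, 7), (False, 5),
]

# Value for the whole population.
true_mean = mean(hours for _, hours in population)

# Improper collection: the survey can only reach landline owners.
surveyed = [hours for has_landline, hours in population if has_landline]
biased_mean = mean(surveyed)

print(f"population mean: {true_mean:.2f}, surveyed mean: {biased_mean:.2f}")
```

Any pattern mined from the surveyed values alone would describe the collection method (landline owners) rather than the population as a whole.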
Process of Data Mining
Data mining is the actual analysis step of the Knowledge Discovery in Databases process. The steps of this iterative process are, in rough outline:
- Focus: data collection and selection, but also the determination of existing knowledge
- Preprocessing: data cleansing, in which sources are integrated and inconsistencies are eliminated, for example by removing or completing incomplete records
- Transformation into the appropriate format for the analysis step, for example by selecting attributes or discretizing the values
- Data mining, the actual analysis step
- Evaluation of the discovered patterns by an expert and verification that the goals have been achieved
In further iterations, knowledge that has already been found can now be used (“integrated into the process”) to obtain additional or more accurate results in a new run.
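The steps listed above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions: the records, attribute names and thresholds are hypothetical, the analysis step is a plain frequency count, and the expert evaluation is replaced by a simple minimum-count filter.

```python
from collections import Counter

# Hypothetical raw records; None marks a missing value.
raw = [
    {"age": 23, "income": 1800, "buys": "yes"},
    {"age": 47, "income": 5200, "buys": "no"},
    {"age": 31, "income": None, "buys": "yes"},
    {"age": 52, "income": 6100, "buys": "no"},
    {"age": 28, "income": 2100, "buys": "yes"},
]

def focus(records, attributes):
    """Selection: keep only the attributes relevant to the question."""
    return [{k: r[k] for k in attributes} for r in records]

def preprocess(records):
    """Cleansing: here, simply remove incomplete records."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, income_threshold=3000):
    """Transformation: discretize income into 'low' and 'high'."""
    return [
        {"income": "high" if r["income"] >= income_threshold else "low",
         "buys": r["buys"]}
        for r in records
    ]

def mine(records):
    """Analysis step: count how often each (income level, buys) pair occurs."""
    return Counter((r["income"], r["buys"]) for r in records)

def evaluate(patterns, min_count=2):
    """Stand-in for expert evaluation: keep patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

selected = focus(raw, ["income", "buys"])
cleaned = preprocess(selected)
patterns = evaluate(mine(transform(cleaned)))
print(patterns)
```

In a further iteration, the retained patterns could inform a different attribute selection or discretization threshold before the pipeline is run again.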