As explained in earlier articles, data mining is the systematic application of statistical methods to large data sets (especially "big data") with the aim of identifying new cross-connections and trends. Due to their size, such databases are processed using computer-aided methods. In practice, the term data mining is often applied to the entire process of so-called "knowledge discovery in databases" (KDD), which also includes steps such as pre-processing and evaluation, while data mining in the narrower sense refers only to the actual analysis step of that process.
Data Mining Tasks
Typical tasks of data mining are:
- Outlier detection: identification of unusual records (outliers, errors, changes)
- Cluster analysis: grouping of objects based on similarities
- Classification: elements not yet assigned to classes are assigned to the existing classes
- Association analysis: identification of relationships and dependencies in the data, in the form of rules such as "A and B usually imply C"
- Regression analysis: identification of relationships between (several) dependent and independent variables
- Summarization: reduction of the data set to a more compact description without significant loss of information
These tasks can be roughly divided into observation problems (outlier detection, cluster analysis) and forecasting problems (classification, regression analysis).
Outlier Detection
This task looks for data objects that are inconsistent with the rest of the data, for example because they have unusual attribute values or deviate from a general trend. The Local Outlier Factor method, for instance, searches for objects whose local density differs significantly from that of their neighbors; this is referred to as "density-based outlier detection".
Identified outliers are often verified manually afterwards and hidden from the data set, since they can worsen the results of other procedures. In some use cases, such as fraud detection, however, it is precisely the outliers that are the objects of interest.
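The density idea can be illustrated with a simplified score in the spirit of the Local Outlier Factor (not the full LOF algorithm): compare each point's average distance to its k nearest neighbors with the same quantity for those neighbors. The data below is hypothetical.

```python
from math import dist

def knn_avg_dist(points, i, k):
    """Average distance from points[i] to its k nearest neighbors."""
    ds = sorted(dist(points[i], p) for j, p in enumerate(points) if j != i)
    return sum(ds[:k]) / k

def density_outlier_scores(points, k=3):
    """Simplified density-based outlier score: values well above 1 mean
    the point lies in a much sparser region than its neighbors do."""
    avg = [knn_avg_dist(points, i, k) for i in range(len(points))]
    scores = []
    for i, p in enumerate(points):
        # indices of the k nearest neighbors of point i
        order = sorted((j for j in range(len(points)) if j != i),
                       key=lambda j: dist(p, points[j]))[:k]
        neighbor_avg = sum(avg[j] for j in order) / k
        scores.append(avg[i] / neighbor_avg)
    return scores

# A tight cluster plus one isolated point (hypothetical data)
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = density_outlier_scores(pts, k=3)
print(max(range(len(pts)), key=lambda i: scores[i]))  # index of the strongest outlier
```

Points inside the cluster score close to 1, while the isolated point's score is an order of magnitude larger, which is exactly the kind of object one would flag for manual review.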
Cluster Analysis
Cluster analysis is about identifying groups of objects that are in some way more similar to each other than to other groups. Often these are accumulations in the data space, which is where the term cluster comes from. In a density-based cluster analysis such as DBSCAN or OPTICS, clusters can take on arbitrary shapes; other methods, such as the EM algorithm or the k-means algorithm, favor spherical clusters.
Objects that have not been assigned to any cluster can be interpreted as outliers in the sense of the outlier detection described above.
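A minimal sketch of the DBSCAN idea makes this connection concrete: points in dense regions are grouped into clusters of arbitrary shape, and points reachable from no dense region are labeled -1, i.e. treated as noise/outliers. The data is hypothetical.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point; -1 marks noise."""
    labels = [None] * len(points)          # None = not yet visited
    def neighbors(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point
            labels[i] = -1                 # tentatively noise
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # noise reachable from a core point
                labels[j] = cluster        # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            js = neighbors(j)
            if len(js) >= min_pts:         # j is itself a core point
                queue.extend(js)
        cluster += 1
    return labels

# Two dense groups plus one isolated point (hypothetical data)
pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=2))
```

The two dense groups receive labels 0 and 1, while the isolated point keeps the label -1, illustrating how cluster analysis and outlier detection meet.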
Classification
Similar to cluster analysis, classification is about assigning objects to groups (here called classes). In contrast to cluster analysis, however, the classes are usually predefined (e.g. bicycle, car), and machine learning methods are used to assign previously unassigned objects to these classes.
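One of the simplest such methods is the nearest-neighbor classifier: an unassigned object receives the class of the most similar already-labeled object. The training data below (weight in kg and wheel count) is hypothetical.

```python
from math import dist

def nearest_neighbor_classify(train, labels, x):
    """Assign x the class of its nearest training object (1-NN)."""
    i = min(range(len(train)), key=lambda j: dist(train[j], x))
    return labels[i]

# Hypothetical labeled objects: (weight in kg, number of wheels)
train  = [(12, 2), (15, 2), (1200, 4), (1500, 4)]
labels = ["bicycle", "bicycle", "car", "car"]
print(nearest_neighbor_classify(train, labels, (14, 2)))   # → bicycle
```

Unlike cluster analysis, the classes ("bicycle", "car") exist before the algorithm runs; the method only decides which predefined class a new object belongs to.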
Association Analysis
In association analysis, frequently occurring correlations in the data set are searched for and usually formulated as implication rules.
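A minimal sketch of this idea, restricted to rules between item pairs: count how often items co-occur in transactions, keep pairs that are frequent enough (support), and emit a rule "A implies B" when B appears in most transactions containing A (confidence). The shopping baskets are hypothetical.

```python
from itertools import combinations
from collections import Counter

def pair_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Find rules 'A implies B' among item pairs that are frequent enough."""
    n = len(transactions)
    item_count = Counter(i for t in transactions for i in set(t))
    pair_count = Counter(frozenset(p) for t in transactions
                         for p in combinations(sorted(set(t)), 2))
    rules = []
    for pair, c in pair_count.items():
        if c / n < min_support:            # pair too rare: skip
            continue
        for a in pair:                     # try both rule directions
            b, = pair - {a}
            conf = c / item_count[a]       # P(B | A)
            if conf >= min_confidence:
                rules.append((a, b, round(conf, 2)))
    return rules

# Hypothetical shopping baskets
baskets = [["bread", "butter"], ["bread", "butter", "milk"],
           ["bread", "milk"], ["butter"]]
for a, b, conf in pair_rules(baskets):
    print(f"{a} -> {b} (confidence {conf})")
```

Full association-rule miners such as Apriori extend this counting scheme to itemsets of arbitrary size, but the support/confidence logic is the same.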
Regression Analysis
Regression analysis models the statistical relationship between different attributes. This allows, among other things, missing attribute values to be predicted, but also deviations from the model to be analyzed, analogous to outlier detection. Combining insights from cluster analysis and computing a separate regression model for each cluster typically yields better forecasts. If a strong correlation is found, this knowledge can also be exploited for summarization.
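For a single independent variable this is ordinary least squares, which has a closed-form solution. The sketch below fits a line, predicts a missing value, and computes a residual of the kind used for deviation analysis; the data is hypothetical.

```python
def linear_fit(xs, ys):
    """Ordinary least squares for y = a*x + b in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

# Hypothetical data with a roughly linear trend
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]
a, b = linear_fit(xs, ys)
print(round(a, 2), round(b, 2))           # fitted slope and intercept
predicted = a * 6 + b                     # predict a missing attribute value
residual = abs(ys[0] - (a * xs[0] + b))   # large residuals hint at outliers
```

The residuals connect regression back to outlier detection: an object whose residual is far larger than the rest deviates from the general trend described in that section.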
Summarization
Since data mining is often applied to large and complex data sets, an important task is also reducing the data to an amount manageable for the user. Outlier detection identifies individual objects that may be important; cluster analysis identifies groups of objects for which it is often sufficient to examine only a sample, which significantly reduces the number of data objects to be analyzed. Regression analysis allows redundant information to be removed and thus reduces the complexity of the data. Classification, association analysis and regression analysis (in some cases also cluster analysis) additionally yield more abstract models of the data.
With the help of these approaches, both the analysis of the data and, for example, its visualization are simplified (through sampling and reduced complexity).
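One simple way these ideas combine, sketched under the assumption that cluster labels (e.g. from a DBSCAN run, with -1 for outliers) are already available: each cluster is summarized by its centroid, while outliers are kept individually because they may be exactly the interesting objects.

```python
from collections import defaultdict

def cluster_representatives(points, labels):
    """Summarize a labeled data set: one centroid per cluster,
    but keep objects labeled -1 (outliers) as-is."""
    groups = defaultdict(list)
    for p, l in zip(points, labels):
        groups[l].append(p)
    summary = []
    for l, pts in groups.items():
        if l == -1:                       # keep every outlier individually
            summary.extend(pts)
        else:                             # replace the cluster by its centroid
            centroid = tuple(sum(c) / len(pts) for c in zip(*pts))
            summary.append(centroid)
    return summary

# Hypothetical points with cluster labels; -1 marks an outlier
points = [(0, 0), (0, 2), (2, 0), (9, 9), (11, 9), (50, 50)]
labels = [0, 0, 0, 1, 1, -1]
print(cluster_representatives(points, labels))
```

Six objects shrink to three representatives without losing the outlier, which is the kind of complexity reduction that also makes visualization tractable.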