In our earlier discussions, we clarified the main data mining tasks. While most data mining methods aim to handle data that is as general as possible, there are also specializations for particular data types.
- Text mining: Text mining is the analysis of large collections of textual data. It can be used, for example, to detect plagiarism or to classify the documents in a text corpus.
- Web mining: Web mining is the analysis of distributed data as represented by websites. To detect clusters and outliers, not only the pages themselves are considered, but also the relationships (hyperlinks) between pages. The constantly changing content and the lack of guaranteed data availability pose additional challenges. This topic area is also closely related to information retrieval.
- Time series analysis: In time series analysis, temporal aspects and relationships play a major role. Existing data mining methods can be applied by means of special distance functions such as the dynamic time warping (DTW) distance, but specialized methods are also being developed. An important challenge is to identify series with a similar trajectory even when one series is slightly offset in time but still shows similar characteristics.
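To illustrate the last point, here is a minimal sketch of the classic dynamic time warping distance mentioned above, written as a plain dynamic program (the function name and the toy series are our own choices, not from the original text):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Classic O(len(a) * len(b)) dynamic program: one point of a series
    may be matched to several consecutive points of the other, so two
    series with the same shape but a slight temporal offset still get
    a small distance.
    """
    n, m = len(a), len(b)
    INF = float("inf")
    # dp[i][j] = cost of the best warping path aligning a[:i] with b[:j]
    dp = [[INF] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            dp[i][j] = cost + min(dp[i - 1][j],      # stretch a
                                  dp[i][j - 1],      # stretch b
                                  dp[i - 1][j - 1])  # one-to-one match
    return dp[n][m]

# Two series with the same shape, one shifted by one time step:
s1 = [0, 1, 2, 3, 2, 1, 0]
s2 = [0, 0, 1, 2, 3, 2, 1]
# The point-wise (Manhattan) distance is punished by the offset,
# while DTW recognizes the similar trajectory.
pointwise = sum(abs(x - y) for x, y in zip(s1, s2))
print(pointwise, dtw_distance(s1, s2))  # pointwise 6 vs. DTW 1.0
```

Production systems would typically add a warping-window constraint (e.g. Sakoe-Chiba band) to keep the alignment local and the computation fast, but the unconstrained version above shows the core idea.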

The next set of questions concerns the issues that arise in data mining.
What Are the Issues of Data Mining?
Many of the problems in data mining stem from inadequate pre-processing of the data or from systematic errors and biases in its collection. These problems are often statistical in nature and must be addressed at collection time: representative results cannot be obtained from non-representative data. The same considerations apply here as when drawing a representative sample.
The algorithms used in data mining often have several parameters that must be chosen appropriately. For any parameter setting they produce formally valid results; choosing the parameters so that the results are also useful is the task of the user.
The evaluation of data mining results presents the user with a dilemma: on the one hand, the goal is to gain new insights; on the other hand, such processes are difficult to evaluate automatically. For prediction problems such as classification, regression analysis, and association analysis, the predictions on new data can be used for evaluation. This is more difficult for descriptive tasks such as outlier detection and cluster analysis. Clusters are usually evaluated internally or externally, i.e. based on their mathematical compactness or their agreement with known classes. The results of outlier detection methods are compared with known outliers. In both cases, however, the question arises whether this evaluation really fits the goal of “new findings” or ultimately only rewards the “reproduction of old knowledge”.
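The internal/external distinction can be sketched with two tiny measures: a compactness score (internal, no labels needed) and the Rand index (external, agreement with known classes). Both functions below are simplified illustrations under our own naming, not a prescribed evaluation protocol:

```python
from itertools import combinations

def compactness(points, labels):
    """Internal evaluation: mean squared distance of each 1-D point
    to its own cluster centroid (smaller = more compact)."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    total = 0.0
    for pts in groups.values():
        centroid = sum(pts) / len(pts)
        total += sum((p - centroid) ** 2 for p in pts)
    return total / len(points)

def rand_index(labels_a, labels_b):
    """External evaluation: fraction of point pairs on which two
    labelings agree (same cluster in both, or different in both)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

points = [1.0, 1.2, 5.0, 5.2]
found = [0, 0, 1, 1]      # clustering produced by some algorithm
known = [0, 0, 0, 1]      # known classes, for external comparison
print(compactness(points, found), rand_index(found, known))
```

Note how both scores embody the dilemma from the text: a clustering that perfectly matches the known classes scores a Rand index of 1.0, yet by construction it has only reproduced old knowledge.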
As statistical methods, the algorithms analyze the data without any background knowledge of its meaning. They can therefore usually provide only simple models such as groups or mean values, and the results are often not directly comprehensible on their own. These machine-generated results must be interpreted by the user before they can truly be called knowledge.