English philosopher Francis Bacon said “Knowledge is power”. That statement is as true today as it was some 420 years ago. In the past, however, acquiring this knowledge required luck or tremendous diligence: where you once had to plow through piles of books or invest enormous research effort, the information technologies that have gained importance in the course of globalization now offer us this knowledge on a silver platter. Thanks to these technological advances, knowledge is experiencing a true heyday. A major role in this context is played by the computer: initially used to solve complex computing tasks, it is now used by billions of people worldwide for all sorts of private and professional activities.
The resulting flood of data offers enormous potential. Since the early 1990s, efforts have been made to use software to examine these data and derive new insights that were previously unrecognizable. This process is referred to as knowledge discovery and is divided into different categories based on the type (e.g. tables, text, etc.) and the source (Internet, intranet, etc.) of the data. Due to the increasing relevance of this topic, both for science and for the economy, this series of articles will examine one of these disciplines, namely text mining, and present its application possibilities in relation to so-called Web Content Mining.
The term text mining first appeared in 1999 as part of a Pacific Asia Knowledge Discovery and Data Mining workshop. Although this work is now many years old, there is still no clear definition. According to Mehler and Wolff, there are a total of six different names for this research field, which differ depending on the task at hand:
- Text Mining
- Text Data Mining
- Textual Data Mining
- Text Knowledge Engineering
- Knowledge discovery in texts
- Knowledge Discovery in Textual Databases
Four perspectives on text mining can be derived from this variety of terms:
- The information retrieval perspective, which considers text mining to be merely a further development of information retrieval (IR).
- The data mining perspective, which sees text mining as data mining on textual data.
- The methodological perspective, which defines text mining simply as a collection of methods for evaluating texts.
- The knowledge-oriented perspective, which aims to generate new insights and information from existing data.
Generally speaking, IR merely describes the retrieval of existing knowledge rather than the gaining of new insights, which is the usual aim of text mining; for this reason, the IR perspective is no longer pursued today.
The intersection of the remaining perspectives is that text mining is seen as a discipline of Knowledge Discovery. This is based on the premise of gaining new insights from data or providing information that users did not previously know was contained in the processed data.
Delimitation of Data Mining
Text mining can be classified under Knowledge Discovery (KD). Data mining and text mining are not identical steps within KD; rather, text mining can be seen as an extension of data mining, because data mining requires a certain structure in the data, while text mining also extends to weakly structured and unstructured data.
While data mining usually consists of three phases (identification, preparation & feature selection, distribution analysis), text mining extends this process with a step that filters features out of weakly structured or unstructured data.
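The feature-filtering step can be illustrated with a minimal sketch: turning free text into a structured term-frequency vector that downstream data mining techniques can work with. The function name, the sample sentence and the tiny stop-word set are all hypothetical; real systems use far richer preprocessing.

```python
from collections import Counter
import re

def extract_features(text, stop_words=frozenset({"the", "a", "is", "of", "from"})):
    """Turn unstructured text into a structured term-frequency vector."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

features = extract_features("Text mining is the mining of knowledge from text.")
print(features)  # term frequencies without stop words
```

The resulting counts are exactly the kind of "certain structure" that classic data mining methods expect as input.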
Web mining describes the process of applying data mining techniques to the Internet (World Wide Web). From this it can be deduced that web mining, like data mining and text mining, is a sub-discipline of KDD and can thus be used to uncover unknown patterns and new insights from data on the Internet.
The initial assumption that the Internet was too unstructured for data mining techniques has largely been refuted. The main problem in web mining is the so-called “labeling problem”: by nature, most data mining techniques require some kind of tagging of the data, such as whether or not a web page is a homepage. The web mining process itself consists of four sub-processes:
- Resource discovery, i.e. obtaining data that is available either online or offline – comparable to IR.
- Information selection & data pre-treatment, i.e. the pre-processing of the data found in step 1, for example by removing “stop words” or the like.
- Generalization, the step in which data mining techniques are applied in order to reveal patterns in the data found. Human intervention plays an important role here, since the Web is an interactive medium.
- The analysis or the validation and interpretation of the results.
Web mining is generally divided into three subcategories that represent the various parts of the Internet that can be examined using the various mining techniques:
- Web usage mining
- Web Structure Mining
- Web Content Mining
Web usage mining describes the application of data mining techniques to web usage data in order to identify usage patterns and to adapt web applications to users’ behavior. Web usage mining is divided into three phases:
- pre-treatment
- pattern recognition
- pattern analysis
In the pre-treatment phase, the existing data is prepared so that it can be further processed by mining techniques. First, the usage data, i.e. data about user and server sessions, is treated; it provides information about which user visited which page or part of a page, when, and how often. The content (text, images, scripts, multimedia files, etc.) is then converted into a usable format so that specific content can be assigned to the user’s usage behavior. Finally, the structure of the visited pages is pre-treated in a similar way to the content.
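A minimal sketch of this pre-treatment might group raw server-log entries into per-user sessions of timestamped page visits. The log format and the user names are invented for illustration; real log pre-processing also handles proxies, caching and session timeouts.

```python
from datetime import datetime

# Hypothetical raw server-log lines: user id, timestamp, requested page.
RAW_LOG = [
    "alice|2024-05-01T10:00:00|/home",
    "alice|2024-05-01T10:02:30|/products",
    "bob|2024-05-01T10:01:00|/home",
]

def build_sessions(lines):
    """Group raw usage data into per-user sessions of (time, page) visits."""
    sessions = {}
    for line in lines:
        user, ts, page = line.split("|")
        sessions.setdefault(user, []).append((datetime.fromisoformat(ts), page))
    return sessions

sessions = build_sessions(RAW_LOG)
print(len(sessions["alice"]))  # alice visited two pages
```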
In the pattern recognition phase, mining techniques are used to bring previously unknown knowledge to light, following the KDD approach. The techniques used there include statistical analyses, association rules, clustering, classification, sequential patterns and dependency modeling.
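As one small illustration of an association-style technique, the following sketch counts which page pairs co-occur in the same session; pairs above a support threshold are candidate rules such as "visitors of /home also visit /products". The sessions are invented, and real association-rule mining (e.g. Apriori) additionally computes confidence.

```python
from itertools import combinations
from collections import Counter

# Hypothetical per-session page visits from the pre-treatment phase.
SESSIONS = [
    {"/home", "/products", "/cart"},
    {"/home", "/products"},
    {"/home", "/blog"},
]

def frequent_pairs(sessions, min_support=2):
    """Count page pairs that co-occur in at least `min_support` sessions."""
    counts = Counter()
    for pages in sessions:
        for pair in combinations(sorted(pages), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

print(frequent_pairs(SESSIONS))  # {('/home', '/products'): 2}
```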
The pattern analysis phase examines the patterns found in phase 2. Essentially, attempts are made to filter out uninteresting patterns and to condense the relevant ones, for example by means of SQL queries or OLAP operations, into new findings.
Finally, there is not only a scientific interest in web usage mining – business has long since recognized the enormous potential of understanding customers better.
Web Structure Mining
Web structure mining describes the process of applying data mining techniques to the structure of web data.
The aim of this approach is to obtain information about the content from the structure of web pages and to identify similarities within a collection of data. In general, two types of structure can be distinguished:
- Intra-page structure, i.e. data that has a certain structure within a page, such as the arrangement of different HTML or XML tags within a web page, which is mainly used in the area of web content mining.
- Inter-page structure, i.e. data that has a certain structure between several pages, such as hyperlinks that connect pages together.
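Intra-page structure can be made concrete with a minimal sketch: recording the arrangement of tags within a single page using Python's standard-library HTML parser. The sample page is invented; real systems would also keep nesting depth and attributes.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record the arrangement of tags within a page (intra-page structure)."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

page = "<html><body><h1>Title</h1><p>Some <a href='/x'>text</a>.</p></body></html>"
collector = TagCollector()
collector.feed(page)
print(collector.tags)  # ['html', 'body', 'h1', 'p', 'a']
```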
The beginnings of web structure mining can be found in the area of social network analysis, where incoming and outgoing links are examined in order to recognize patterns within the resulting hierarchy.
A further impulse for web structure mining came from the assumption that every document, even unstructured text in its own right, has structures similar to those of other texts on a similar topic. These two ideas lead to the realization that even unstructured data, as frequently found on the Internet, can be analyzed by examining hyperlinks and the use of labels (names). Based on the two types of structure described above, it can further be said that web structure mining is not always strictly separated from web content mining.
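The inter-page idea of examining incoming and outgoing links can be sketched with a tiny link graph. Counting incoming links per page is the simplest structural importance measure and the intuition behind more elaborate link-analysis algorithms such as PageRank; the pages and links here are purely hypothetical.

```python
# Hypothetical hyperlink structure: page -> pages it links to.
LINKS = {
    "/home": ["/products", "/blog"],
    "/products": ["/home"],
    "/blog": ["/home", "/products"],
}

def in_degrees(links):
    """Count incoming links per page -- a basic structural importance measure."""
    degree = {page: 0 for page in links}
    for targets in links.values():
        for t in targets:
            degree[t] = degree.get(t, 0) + 1
    return degree

print(in_degrees(LINKS))  # '/home' and '/products' each receive two links
```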
Web Content Mining
Web content mining focuses on capturing the content of web pages and, based on this, either improving the information provided to users in the sense of IR, or modeling the data along the lines of databases so that, for example, search engines can deliver more effective results. From this, two possible perspectives on web content mining can be deduced:
- agent-based or IR-view
- database view
The first variant uses intelligent search agents that search, organize and interpret relevant information based on domain characteristics and user profiles. A second group of agents filters or categorizes information using IR techniques and, similar to web structure mining, examines link structures to create cluster hierarchies. A third subset of the IR view uses personalized web agents that learn user preferences and discover sources of information accordingly.
The second variant uses either multi-level databases, which arrange the data according to their degree of structure and generalization, or query systems, which, for example, summarize weakly structured data and can create a database of the information found from it.
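The database view can be illustrated with a minimal sketch of such a query system: weakly structured records, where fields may be missing, are normalized into a uniform database-like table. The records and field names are invented for illustration.

```python
# Hypothetical weakly structured records, as a query system might collect them.
RECORDS = [
    {"title": "Page A", "author": "Smith"},
    {"title": "Page B"},
    {"title": "Page C", "author": "Jones", "year": 2003},
]

def summarize(records, fields=("title", "author")):
    """Normalize weakly structured records into a uniform table of rows."""
    return [tuple(r.get(f, "unknown") for f in fields) for r in records]

table = summarize(RECORDS)
print(table[1])  # ('Page B', 'unknown')
```

Once in this uniform shape, the data can be queried like any relational table, which is precisely the point of the database view.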