Web mining is the application of data mining techniques to the (partially) automatic extraction of information from the Internet, especially the World Wide Web. It adopts procedures and methods from the fields of information retrieval, machine learning, statistics, pattern recognition and data mining. Three objects of investigation can be distinguished:
- The content (web content mining) – for example, using information retrieval methods.
- The link structure (web structure mining) – for example, using webometrics methods. Web structure mining works with so-called hubs and authorities: good hubs link to many valuable pages (authorities), and valuable pages are linked to by many good hubs.
- User behaviour (web usage mining) – for example, through the analysis of log files.
The term screen scraping generally encompasses all methods of reading text from computer screens. At present, however, the term is used almost exclusively in relation to web pages (hence web scraping or web harvesting). In this case, screen scraping specifically refers to the techniques used to obtain information by extracting the required data in a targeted manner. Search engines use crawlers to search the World Wide Web, analyze web pages, and collect data, such as web feeds or email addresses. Screen scraping techniques are also used in web mining.
A program for extracting data from web pages is also called a wrapper.
---
After the web page has been downloaded, the first question is whether the exact location of the data on the page is known (e.g. second table, third column). If it is, there are several ways to extract the data. One option is to treat the downloaded pages as character strings and extract the desired data with, for example, regular expressions.
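As a minimal sketch of the string-based approach, the following Python snippet treats a page as a plain string and pulls out values with a regular expression. The HTML fragment, the `price` class name and the function name `extract_prices` are assumptions made purely for illustration.

```python
import re

# Hypothetical page fragment; in practice this would be the downloaded HTML.
PAGE = """
<ul>
  <li class="item">Widget <span class="price">12.50</span></li>
  <li class="item">Gadget <span class="price">7.99</span></li>
</ul>
"""

# Match the number inside <span class="price">...</span> (assumed markup).
PRICE_RE = re.compile(r'<span class="price">\s*([0-9]+(?:\.[0-9]+)?)\s*</span>')


def extract_prices(html: str) -> list[float]:
    """Return all prices found in the page, in document order."""
    return [float(value) for value in PRICE_RE.findall(html)]


if __name__ == "__main__":
    print(extract_prices(PAGE))
```

A pattern like this is tied to the exact markup: if the attribute order or quoting changes, it silently stops matching, which is one reason parser-based extraction is often preferred.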
If the website is XHTML-compliant, an XML parser is a good choice. There are numerous supporting techniques for accessing XML (SAX, DOM, XPath, XQuery). Often, however, pages are delivered only as (possibly malformed) HTML, which does not conform to the XML standard. With a suitable parser, it may still be possible to produce an XML-compliant document. Alternatively, the HTML can be cleaned up with HTML Tidy before parsing. Some screen scrapers use a query language specifically designed for HTML.
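The "second table, third column" case can be sketched with a parser instead of string matching. The example below uses Python's standard `html.parser` module, which does not raise errors on malformed markup. The class `TableColumnParser`, the helper `extract_column` and the sample page are hypothetical. The sketch assumes that tables are not nested and that each cell has a closing tag.

```python
from html.parser import HTMLParser

# Hypothetical page; the second data row deliberately lacks its closing </tr>.
PAGE = """
<table><tr><td>a</td><td>b</td><td>c</td></tr></table>
<table>
  <tr><th>Name</th><th>Qty</th><th>Price</th></tr>
  <tr><td>Widget</td><td>2</td><td>12.50</td>
  <tr><td>Gadget</td><td>1</td><td>7.99</td></tr>
</table>
"""


class TableColumnParser(HTMLParser):
    """Collect the cells of one column from one table (0-based indices)."""

    def __init__(self, table_index: int, column_index: int):
        super().__init__()
        self.table_index = table_index
        self.column_index = column_index
        self.current_table = -1  # index of the table currently being read
        self.current_cell = -1   # index of the cell within the current row
        self.in_cell = False
        self.buffer = []
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.current_table += 1
        elif tag == "tr":
            self.current_cell = -1  # a new row restarts the cell count
        elif tag in ("td", "th"):
            self.current_cell += 1
            self.in_cell = True
            self.buffer = []

    def handle_data(self, data):
        if self.in_cell:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.in_cell:
            self.in_cell = False
            if (self.current_table == self.table_index
                    and self.current_cell == self.column_index):
                self.values.append("".join(self.buffer).strip())


def extract_column(html: str, table_index: int, column_index: int) -> list[str]:
    parser = TableColumnParser(table_index, column_index)
    parser.feed(html)
    parser.close()
    return parser.values


if __name__ == "__main__":
    print(extract_column(PAGE, 1, 2))  # second table, third column
```

Because the row counter is reset on every opening `<tr>`, the missing `</tr>` in the sample does not disturb the result.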
One criterion for the quality of an extraction mechanism is its robustness to changes in the structure of the website. This requires fault-tolerant extraction algorithms. In many cases, however, the structure of the website is unknown (e.g. when using crawlers). Data such as purchase prices or times must then be recognized and interpreted even without fixed specifications.
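When the page structure is unknown, one simple heuristic is to look for typical surface patterns in the visible text. The sketch below recognizes a few price and time formats with regular expressions. The patterns, the supported currencies and the function names are assumptions, and real pages would need considerably broader rules.

```python
import re

# Assumed price formats: a currency symbol before the amount ("€ 12,50"),
# or a currency code after it ("15.00 EUR").
PRICE_RE = re.compile(
    r"(?P<sym>[$€£])\s?(?P<a>\d+(?:[.,]\d{2})?)"
    r"|(?P<b>\d+(?:[.,]\d{2})?)\s?(?P<code>EUR|USD|GBP)\b"
)

# Assumed time format: 24-hour clock such as 09:30 or 18:00.
TIME_RE = re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b")


def find_prices(text: str) -> list[tuple[float, str]]:
    """Return (amount, currency) pairs in order of appearance."""
    results = []
    for match in PRICE_RE.finditer(text):
        amount = match.group("a") or match.group("b")
        currency = match.group("sym") or match.group("code")
        # Accept both decimal comma and decimal point.
        results.append((float(amount.replace(",", ".")), currency))
    return results


def find_times(text: str) -> list[str]:
    """Return all clock times found in the text."""
    return TIME_RE.findall(text)


if __name__ == "__main__":
    sample = "Opens at 09:30. Special offer: € 12,50 today, regular 15.00 EUR until 18:00."
    print(find_prices(sample))
    print(find_times(sample))
```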

Types of Web Mining
Web usage mining attempts to detect regularities in the use of websites or web resources. In doing so, all secondary data generated by the user’s interaction with a web resource is processed and analyzed. Web usage mining also includes, for example, the analysis of the customer journey.
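A very small instance of such an analysis can be sketched as follows: the snippet parses server log lines, counts requests per page, and reconstructs the sequence of pages requested by each client. The log lines are invented and assumed to be in Common Log Format. The function names are hypothetical, and grouping by host is only a crude stand-in for real session detection.

```python
import re
from collections import Counter, defaultdict

# Hypothetical access-log lines in Common Log Format.
LOG = """\
192.0.2.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.1 - - [10/Oct/2023:13:56:01 +0000] "GET /products.html HTTP/1.1" 200 5120
203.0.113.7 - - [10/Oct/2023:13:56:10 +0000] "GET /index.html HTTP/1.1" 200 2326
192.0.2.1 - - [10/Oct/2023:13:57:12 +0000] "GET /checkout.html HTTP/1.1" 200 1024
"""

LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+$'
)


def parse(log: str):
    """Yield (host, path) for every line that matches the assumed format."""
    for line in log.splitlines():
        match = LINE_RE.match(line)
        if match:
            yield match.group("host"), match.group("path")


def page_counts(log: str) -> Counter:
    """Count how often each path was requested."""
    return Counter(path for _, path in parse(log))


def paths_by_host(log: str) -> dict[str, list[str]]:
    """Group requested paths by client host, preserving log order."""
    journeys = defaultdict(list)
    for host, path in parse(log):
        journeys[host].append(path)
    return dict(journeys)


if __name__ == "__main__":
    print(page_counts(LOG).most_common())
    print(paths_by_host(LOG))
```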
Web structure mining attempts to identify the link structure underlying a website or domain. A model is built from the topology of the hyperlinks, optionally together with their descriptions. This can be useful for the categorization and ranking of a website and allows conclusions to be drawn about similarities between websites and their relationships to each other. For example, content-rich websites (so-called authorities) and overview-like websites (so-called hubs) could be found for a certain topic.
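One well-known way to compute hub and authority scores is the HITS algorithm. The following is a simplified sketch of its iterative update, not a reference implementation. The link graph and page names are invented for illustration, and the fixed iteration count is an assumption.

```python
import math

# Hypothetical link graph: each page maps to the pages it links to.
GRAPH = {
    "portal": ["paper_a", "paper_b", "paper_c"],
    "blog": ["paper_a", "paper_b"],
    "paper_a": [],
    "paper_b": [],
    "paper_c": ["paper_a"],
}


def hits(graph: dict[str, list[str]], iterations: int = 50):
    """Return (hub, authority) score dictionaries for the given graph."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of the hub scores of all pages linking here.
        auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub update: sum of the authority scores of all pages linked to.
        hub = {p: sum(auth[t] for t in graph.get(p, ())) for p in pages}
        # Normalize both score vectors to unit length to keep values bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


if __name__ == "__main__":
    hub, auth = hits(GRAPH)
    print("best hub:", max(hub, key=hub.get))
    print("best authority:", max(auth, key=auth.get))
```

In this toy graph the overview-like page `portal` ends up as the strongest hub, and `paper_a`, which is linked from every other linking page, as the strongest authority.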
Web content mining deals with the detection of regularities in the content of a web resource and is one area of application for text mining. Data on the web consists of unstructured data such as text documents, semi-structured data such as HTML documents, and more structured data such as tables or dynamically generated HTML pages. The content of a website comprises different types of data, such as text, images, audio, video, metadata and hyperlinks. Mining several of these data types together is referred to as “multimedia data mining” and can be understood as part of web content mining. Most of the web’s content, however, consists of unstructured text. Text mining can therefore be understood both as a form of web content mining and as the broader field of research on which it draws. The methods used are general data mining methods; statistical and computational linguistic methods first transform the texts into a form suitable for data mining.
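The transformation of text into a form suitable for data mining can be illustrated with a simple bag-of-words representation. The sketch below strips the markup from two invented HTML documents, tokenizes the text and builds term-count vectors over a shared vocabulary. The tokenizer (lowercase ASCII letters only) and all names are simplifying assumptions.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Note: this also collects the contents of <script> and <style>.
        self.parts.append(data)


def visible_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text: str) -> list[str]:
    # Simplifying assumption: a token is a run of ASCII letters.
    return re.findall(r"[a-z]+", text.lower())


def term_vectors(documents: list[str]):
    """Return (vocabulary, vectors) with one term-count vector per document."""
    counts = [Counter(tokenize(visible_text(doc))) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    vectors = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, vectors


if __name__ == "__main__":
    docs = [
        "<p>Web mining uses data mining.</p>",
        "<h1>Text mining</h1><p>Mining text data.</p>",
    ]
    vocabulary, vectors = term_vectors(docs)
    print(vocabulary)
    print(vectors)
```

Vectors of this kind can then be passed to general data mining methods such as clustering or classification.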