This article assumes that the reader has read the first part of this series on Text Mining in Web Content Mining. Building on the methodology introduced there, this second part explains two processes, as well as the technology used for them, in more detail.
Every text mining process is based on an existing document collection. The simplest case is a compilation of text documents; the number of documents it contains can range from a few thousand to several million. Document collections can be divided into dynamic and static ones. In the case of dynamic collections, the high rate of change may necessitate adapting the various components of the text mining system. Because of the sheer volume of documents, the underlying database is of great importance to users of text mining. The term information retrieval is defined differently by each author; usually it denotes searching for documents, which are typically unstructured, in a large collection of documents.
Although information retrieval became widely known through search engines, it was in use long before the age of the Internet. The earlier search methods were purely keyword-oriented, and the hits were output as a list. The big boom of information retrieval started with the rise of the Internet. The search engine websites, which were very popular at the time, used information retrieval to quickly answer users' queries. However, one should not equate the earlier search engines with today's, since previously only a fraction of today's data volume could be processed. Today's search engines use crawlers, indexing, and search methods to process huge amounts of data. Initially, the ranking of the search results was based purely on text features, so it was possible to push a page upward in the results with tricks. With the addition of link analysis, such manipulation became far more difficult. It is still not settled whether information retrieval is part of text mining.
Information Retrieval vs. Document Classification
Information retrieval provides the right documents for specific requests: it scans the document collection and filters out non-matching documents. Document classification searches in a completely different way. While searching through the document collection, decision criteria for a classification are learned so that they can be applied when classifying other documents. Because both approaches rely on similarity measures, Weiss et al. suggest developing a new method that complements information retrieval with the distinctive features of document classification.
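The idea of learning decision criteria from already-classified documents and then reusing a similarity measure on new ones can be sketched, for example, as a nearest-centroid classifier. This is only one possible illustration, not a method prescribed by the text; the labels and term vectors are invented toy data:

```python
# Hedged sketch: learn one decision criterion per class (the mean vector,
# or centroid) from labeled document vectors, then classify new documents
# by similarity to those criteria. Toy data; the article does not
# prescribe this particular classifier.
from collections import defaultdict

training = [
    ([1, 1, 0, 0], "sports"),
    ([1, 0, 1, 0], "sports"),
    ([0, 0, 1, 1], "politics"),
    ([0, 1, 1, 1], "politics"),
]

# Learning step: average the vectors of each class.
sums = defaultdict(lambda: [0.0] * 4)
counts = defaultdict(int)
for vec, label in training:
    counts[label] += 1
    sums[label] = [s + v for s, v in zip(sums[label], vec)]
centroids = {label: [s / counts[label] for s in sums[label]] for label in sums}

def classify(vec):
    """Assign the class whose learned centroid is closest to the vector."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda label: sq_dist(vec, centroids[label]))

label = classify([1, 1, 1, 0])  # closest to the "sports" centroid
```

The same distance (or similarity) computation used here is what information retrieval uses to match queries against documents, which is the overlap Weiss et al. point out.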
Vector space model
The vector space model was developed in the 1960s to solve the three main tasks of information retrieval: the representation of documents, the representation of queries, and the retrieval of documents. The representation problem is solved geometrically, and the chosen representation immediately solves the retrieval problem as well. For the most efficient processing, a query should be represented in the same way as a document. This saves compiling the query into an execution plan, as is necessary with relational databases. Furthermore, performance improves if the documents are not searched for key terms sequentially but processed in one operation; the basic prerequisite for this is that a query is represented by a single, simple object. Each document can be represented by a vector if there is one dimension in the vector space for each keyword that occurs. The vector gets a one in each dimension whose keyword is contained in the document and a zero otherwise. Queries are represented according to the same principle, regardless of how many dimensions the vector space has. Based on the term frequencies within a document and the overall term frequencies in the document collection, the individual terms can also be weighted differently.
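The construction just described can be sketched in a few lines of Python. The documents, the query, and the whitespace tokenization are invented for illustration only:

```python
# Minimal sketch of the vector space model described above.
# Documents and query are invented examples, not from the article.
docs = [
    "web mining finds patterns in web content",
    "text mining extracts information from text",
]
query = "text mining"

# One dimension per keyword that occurs anywhere in the collection.
vocabulary = sorted({term for doc in docs for term in doc.split()})

def to_vector(text):
    """Binary vector: 1 in a dimension if its keyword is contained, else 0."""
    terms = set(text.split())
    return [1 if term in terms else 0 for term in vocabulary]

doc_vectors = [to_vector(doc) for doc in docs]
query_vector = to_vector(query)  # queries use the same representation
```

Replacing the binary 0/1 entries with term-frequency counts, optionally scaled by how common each term is across the whole collection, gives the differing term weightings mentioned above.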
Due to the geometric representation of documents and queries, the similarity of a document to a query is also interpreted geometrically. Typically this is done by measuring the angle between the two lines that run from the origin of the vector space to the document point and to the query point (see Figure 4, Vector space model). The usual approach is to calculate the cosine of the angle between the two vectors.
To filter the results, only documents whose angle to the query vector lies within a specified limit are returned. There is usually a very large number of terms that can appear in the documents, but no single term appears in every document. Because of this, the document vectors are sparsely populated in most dimensions. Even thinning out the vocabulary by removing stop words (empty words such as articles, conjunctions, or auxiliary verbs) still leaves an extremely high number of dimensions.
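The cosine measure and the angle-based filter described above can be sketched as follows; the vectors and the 60-degree threshold are invented toy values:

```python
import math

# Hedged sketch of cosine similarity between query and document vectors,
# with a threshold on the angle. Vectors and threshold are toy examples.
def cosine(u, v):
    """Cosine of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, doc_vecs, max_angle_deg=60.0):
    """Return indices of documents whose angle to the query is within the limit."""
    min_cos = math.cos(math.radians(max_angle_deg))
    return [i for i, d in enumerate(doc_vecs) if cosine(query_vec, d) >= min_cos]

query_vec = [1, 1, 0, 0]
doc_vecs = [
    [1, 1, 1, 0],  # shares two terms with the query -> small angle
    [0, 0, 1, 1],  # shares no terms -> 90 degrees, filtered out
]
hits = retrieve(query_vec, doc_vecs)
```

A larger cosine means a smaller angle, so thresholding the cosine from below is the same as thresholding the angle from above.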
A further problem is that different words with the same meaning (synonyms) are mapped to different areas of the vector space, while identical words with different meanings (polysemes) are mapped to the same area.
Latent Semantic Indexing
The field of latent semantic indexing deals with approaches to solving this problem. It starts from three weaknesses of the commonly used approach.
The fact that an index usually contains fewer terms than the user expects results in incomplete indexes. This may be because the document does not contain the desired terms or because the retrieval system does not take certain filter criteria into account. The problem could be addressed in two different ways: one could include dictionaries that allow only one meaning per word, or one could give users the opportunity to define their query, and thus the meaning of its terms, precisely. Apart from restricting the search to a specific vocabulary, both methods fail because of the workload involved, whether for the user defining the terms or for the creation of the dictionaries. In addition, there is the problem that terms which regularly occur together with certain other terms are not weighted differently from terms that rarely co-occur. Frequently occurring combinations should be weighted higher than rare ones; if rare combinations are overvalued, this has an unwanted impact on the results.
Latent semantic indexing attacks all three of the problems mentioned. The method does not consider the meaning of a word; rather, it uses linear algebra and statistical methods to find term clusters that describe particular concepts. There are no fixed specifications for these clusters; they emerge from the probabilities with which terms occur. In this way, concepts can be derived that describe terms which regularly occur together. In contrast to before, where every term needed its own dimension, dimensions are now merged into a common concept. This merging is accurate enough that documents containing only some of the query terms are now also included in the results. The concept dimensions reduce not only the influence of a missing term but also the earlier problem of polysemy: groups of statistically connected terms now form a common dimension, and the meaning of a polyseme is disambiguated by the other terms summarized with it. Reducing the number of dimensions also shrinks the space required by the index.
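The dimension reduction behind latent semantic indexing is typically implemented with a truncated singular value decomposition of the term-document matrix. A minimal sketch using NumPy follows; the tiny matrix is an invented example (real collections have thousands of dimensions), and k=2 concept dimensions is an arbitrary choice for illustration:

```python
import numpy as np

# Term-document matrix: rows = terms, columns = documents (invented toy data).
# Terms 0-1 always co-occur (one concept); terms 2-3 always co-occur (another).
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# Truncated SVD: keep only the k strongest concept dimensions.
k = 2
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Documents represented in the reduced k-dimensional concept space.
doc_concepts = (np.diag(s_k) @ Vt_k).T  # one row per document

# Fold a query containing only term 0 into the concept space. It still
# lands near documents 0 and 1, because terms 0 and 1 share a concept
# dimension, even though the query mentions only one of them.
q = np.array([1, 0, 0, 0], dtype=float)
q_concepts = q @ U_k

def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

This is how a document that contains only some of the query terms can still be retrieved: the missing term contributes to the same concept dimension as the terms that are present.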
The users' requests are filtered by information retrieval using the methods described above. But despite the filtering, the user is still faced with a gigantic number of relevant documents. Additional analysis tools are needed to recognize the structures within a text and give users the opportunity to extract the information they are looking for. For this purpose, the approaches of computational linguistics, statistical language processing, and the exploitation of macrostructures in texts will be examined in more detail in the next part of this series.