The Customize Windows

Technology Journal

By Abhishek Ghosh August 4, 2019 1:53 am Updated on August 5, 2019

Uses of Text Mining in Web Content Mining : Part II

This article assumes that the reader has read the first part of Text Mining in Web Content Mining. Building on the methodology of Web Content Mining introduced there, this second part of the series explains two processes, and the technology behind them, in more detail.

Every text mining process starts from an existing document collection. In the simplest case this is a compilation of text documents, and the number of documents it contains can range from a few thousand to several million. Document collections are either static or dynamic. In a dynamic collection, the high rate of change may require the components of the text mining system to be adapted. Because of the sheer volume of documents, the underlying database is of great importance to text mining users. The term Information Retrieval is defined differently from author to author. In general, information retrieval means searching a large collection for documents, which are usually unstructured.

Information retrieval was in use long before the age of the Internet, well before search engines made it widely known. The earlier search methods were purely keyword-oriented, and the respective hits were output as a list. The big boom in information retrieval started with the beginning of the Internet. The search engine websites that were popular at that time used information retrieval to give users a quick answer to their search. However, one should not equate the earlier search engines with today’s, since they could process only a fraction of today’s data. Today’s search engines use crawlers, indexing and search methods to process huge amounts of data. Back then, the ordering of search results was decided on text features alone, which made it possible to push a page up the results with tricks. Adding link analysis made this kind of manipulation much harder. It is still not settled whether information retrieval is part of text mining.

---

 

Information Retrieval : Document Classification

 

Information retrieval provides the right documents for specific requests. It scans the document collection and filters out non-matching documents. Document classification searches the collection in a completely different way. While working through the document collection, it learns decision criteria for a classification so that they can be applied to other documents. Because both approaches rely on similarity measures, Weiss et al. suggest developing a new method that extends information retrieval to cover the differences of document classification.
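The idea of learning decision criteria from a collection and applying them to other documents can be sketched as follows. This is only an illustration: the nearest-profile rule, the function names `train_profiles` and `classify`, and the four training sentences are assumptions made for the example, not the method proposed by Weiss et al.

```python
import math
from collections import Counter, defaultdict

def term_counts(text):
    """Count how often each lowercase token occurs in the text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count mappings."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train_profiles(labelled_documents):
    """Learn one summed term-count profile per class (the 'decision criteria')."""
    profiles = defaultdict(Counter)
    for text, label in labelled_documents:
        profiles[label].update(term_counts(text))
    return dict(profiles)

def classify(text, profiles):
    """Assign the class whose learned profile is most similar to the text."""
    counts = term_counts(text)
    return max(profiles, key=lambda label: cosine(counts, profiles[label]))

# Hypothetical training data, invented for this sketch.
training = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("team wins football match", "sports"),
    ("striker scores late goal", "sports"),
]
profiles = train_profiles(training)
label = classify("interest rates and the stock market", profiles)
```

Unlike retrieval, nothing here is filtered out of the collection: the collection is used once to learn the profiles, and every new document is then assigned a class.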

Vector space model

The vector space model was developed in the 1960s to solve the three main tasks of information retrieval: the representation of documents, the representation of queries and the retrieval of documents. Both representation problems are solved by a geometric representation, and the chosen representation immediately solves the problem of finding documents as well. For a request to be processed as efficiently as possible, it should be represented in the same way as the documents. This saves compiling the query into an execution plan, as is necessary with relational databases. Performance also improves if the documents are not searched for key terms sequentially but in a single operation. The basic prerequisite for this is that a request is represented by a simple object.

Each document can be represented by a vector if the vector space has one dimension for each keyword. The vector gets a one in a dimension if the document contains that keyword and a zero if it does not. Queries can be represented according to the same principle. How many dimensions the vector space has is not relevant. Instead of ones and zeros, the individual terms can also be weighted in different ways, based on the term frequencies within the document and the overall term frequencies in the document collection.
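As a minimal sketch of this representation, the following code builds one dimension per keyword and produces both a binary vector and a frequency-weighted vector. The whitespace tokenizer, the three sample documents and the particular weighting (term frequency multiplied by the logarithm of the inverse document frequency) are assumptions chosen for illustration; the article itself does not prescribe a specific weighting formula.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocabulary(documents):
    """Assign one dimension (index) to every distinct keyword in the collection."""
    terms = sorted({term for doc in documents for term in tokenize(doc)})
    return {term: index for index, term in enumerate(terms)}

def binary_vector(text, vocabulary):
    """1 in a dimension if the keyword occurs in the text, 0 otherwise."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocabulary]

def tf_idf_vector(text, vocabulary, documents):
    """Weight each term by its frequency in the text and its rarity in the collection."""
    counts = Counter(tokenize(text))
    n_docs = len(documents)
    vector = []
    for term in vocabulary:
        doc_freq = sum(1 for doc in documents if term in tokenize(doc))
        idf = math.log(n_docs / doc_freq) if doc_freq else 0.0
        vector.append(counts[term] * idf)
    return vector

# Hypothetical mini collection, invented for this sketch.
docs = [
    "web mining finds patterns",
    "text mining finds text",
    "web crawlers index pages",
]
vocab = build_vocabulary(docs)

# Documents and queries are represented in exactly the same way.
document_vector = binary_vector(docs[1], vocab)
query_vector = binary_vector("text mining", vocab)
weighted_vector = tf_idf_vector(docs[1], vocab, docs)
```

With this weighting, a term that appears in every document receives a weight of zero, since it does nothing to distinguish one document from another.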

Because documents and queries are represented geometrically, the similarity of a document to a request is also interpreted geometrically. Typically this is done by measuring the angle between two lines: the line from the origin of the vector space to the point of the document, and the line from the origin to the point of the query (see Figure 4 Vector space model). The usual way is to calculate the cosine of the angle between the two vectors.
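The cosine of the angle between two vectors is their dot product divided by the product of their lengths. A short sketch, in which the two example vectors are made up for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)

def angle_degrees(a, b):
    """Angle between two vectors, clamped against floating-point overshoot."""
    cos = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.degrees(math.acos(cos))

# Hypothetical binary vectors over an eight-term vocabulary.
document = [0, 1, 0, 1, 0, 0, 1, 0]
query = [0, 0, 0, 1, 0, 0, 1, 0]

similarity = cosine_similarity(document, query)
angle = angle_degrees(document, query)
```

A cosine of 1 means that the two vectors point in the same direction, while a cosine of 0 means that document and query share no terms at all.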

To filter the results, only documents whose vector lies within a specified angle of the request vector are returned. There is usually a very large number of terms that can appear in documents, but they do not have to appear in every document. As a result, the vector space is only sparsely populated: most dimensions of any given document vector are zero. Even after thinning out by removing stop words (empty words such as articles, conjunctions, or auxiliary verbs), an extremely high number of dimensions is still required.
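The filtering step might look like the following sketch. Because most dimensions are zero, the vectors are stored as dictionaries that hold only the terms actually present. The stop word list, the 60 degree default limit and the sample documents are assumptions chosen for illustration.

```python
import math

# A tiny illustrative stop word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "are", "of", "in", "to"}

def term_counts(text):
    """Sparse vector: only terms that occur are stored, stop words are dropped."""
    counts = {}
    for token in text.lower().split():
        if token not in STOP_WORDS:
            counts[token] = counts.get(token, 0) + 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(weight * b.get(term, 0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, max_angle_degrees=60.0):
    """Return the documents within the angle limit, most similar first."""
    # A smaller angle means a larger cosine, so the limit becomes a minimum cosine.
    min_cosine = math.cos(math.radians(max_angle_degrees))
    q = term_counts(query)
    scored = [(cosine(q, term_counts(doc)), doc) for doc in documents]
    hits = [(score, doc) for score, doc in scored if score >= min_cosine]
    return [doc for score, doc in sorted(hits, key=lambda hit: hit[0], reverse=True)]

# Hypothetical documents, invented for this sketch.
docs = [
    "the mining of text in the web",
    "a crawler is an indexing tool",
    "text mining and data mining",
]
results = retrieve("text mining", docs)
strict_results = retrieve("text mining", docs, max_angle_degrees=32.0)
```

Tightening the angle limit trades recall for precision: with the 60 degree limit two of the three documents are returned, while the 32 degree limit keeps only the closest one.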

A further problem is that different words with the same meaning (synonyms) are mapped to different areas of the vector space. Conversely, identical words with different meanings are mapped to the same area.
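Both effects are easy to reproduce with binary vectors, for which the cosine reduces to the number of shared terms divided by the square root of the product of the two term counts. The example phrases below are invented for illustration.

```python
import math

def binary_cosine(text_a, text_b):
    """Cosine similarity of the binary term vectors of two texts."""
    a = set(text_a.lower().split())
    b = set(text_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

# Synonyms: same meaning, different dimensions, so no match is found.
synonym_score = binary_cosine("car", "automobile repair")

# Same word, different meaning: a query about money still matches a river text.
ambiguous_score = binary_cosine("bank", "river bank erosion")
```

The first score is zero although the texts are about the same thing, and the second is clearly positive although they may not be.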

Latent Semantic Indexing

Latent semantic indexing is an approach to solving this problem. It starts from three weaknesses of the conventional approach.

First, an index usually contains fewer terms than the user expects, so indexes are incomplete. This may be because the document does not contain the desired terms, or because the retrieval system does not take certain filter criteria into account. This problem could be addressed in two different ways. On the one hand, one could use dictionaries that allow only one meaning per word. On the other hand, one could give users the opportunity to define their request, and thus the meaning of its terms, unambiguously. Apart from the limitation to a specific vocabulary, both methods fail because of the workload involved, whether for users defining the terms or for creating the dictionaries. In addition, terms that regularly occur together with certain other terms are not valued any differently from terms that rarely occur with other terms. Frequently used combinations should be weighted higher than combinations that rarely occur. If the rare combinations are overvalued, this has an unwanted impact on the results.

Latent Semantic Indexing attacks all three of these problems. The method does not take the meaning of a word into account. Rather, it uses linear algebra and statistical methods to find clusters of terms that describe particular concepts. These clusters are not specified in advance; they follow from the probabilities with which terms occur. Concepts can therefore be derived that describe terms which regularly occur together. Whereas previously every term needed its own dimension, dimensions are now merged into common concepts. As a result of this merging, documents that contain only some of the query terms are now also included in the search results. The new concept dimensions reduce not only the influence of a missing term, but also the earlier problems of polysemy. This is because groups of statistically connected terms now form a common dimension, and the meaning of a polysemous word is explained by the other terms grouped with it. Reducing the number of dimensions can also reduce the space the indexes require.
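The merging of dimensions into concepts can be sketched on a toy term-document matrix. Latent semantic indexing is usually described in terms of a truncated singular value decomposition; to keep the example free of dependencies, this sketch substitutes power iteration with deflation to obtain the leading left singular vectors, and in practice a library SVD routine would be used instead. The matrix, the choice of two concepts and the plain projection without singular-value scaling are all assumptions made for illustration.

```python
import math

def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def normalize(vector):
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]

def top_concepts(term_doc, k, iterations=200):
    """Leading k left singular vectors of term_doc (rows = terms, columns = documents)."""
    n_terms = len(term_doc)
    # gram[i][j] measures how strongly terms i and j co-occur across documents.
    gram = [[sum(a * b for a, b in zip(term_doc[i], term_doc[j]))
             for j in range(n_terms)] for i in range(n_terms)]
    concepts = []
    for _ in range(k):
        vector = normalize([1.0] * n_terms)
        for _step in range(iterations):
            vector = normalize(mat_vec(gram, vector))
        eigenvalue = sum(v * w for v, w in zip(vector, mat_vec(gram, vector)))
        concepts.append(vector)
        # Deflation: remove the concept just found so the next pass finds the next one.
        gram = [[gram[i][j] - eigenvalue * vector[i] * vector[j]
                 for j in range(n_terms)] for i in range(n_terms)]
    return concepts

def project(term_vector, concepts):
    """Map a term-space vector into the reduced concept space."""
    return [sum(c * t for c, t in zip(concept, term_vector)) for concept in concepts]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical collection: rows are the terms car, automobile, engine, flower, petal.
term_doc = [
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 1],  # flower
    [0, 0, 1],  # petal
]
concepts = top_concepts(term_doc, k=2)

query = [1, 0, 0, 0, 0]          # "car"
automobile_doc = [0, 1, 1, 0, 0]  # contains "automobile engine", but not "car"
flower_doc = [0, 0, 0, 1, 1]

raw_score = cosine(query, automobile_doc)
lsi_score = cosine(project(query, concepts), project(automobile_doc, concepts))
unrelated_score = cosine(project(query, concepts), project(flower_doc, concepts))
```

In the original term space the query "car" has no overlap with the document about automobiles, so `raw_score` is zero. After projection, both fall onto the same concept dimension because "car" and "automobile" each co-occur with "engine", and the document is found even though it lacks the query term. The flower document remains unrelated.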

User requests are filtered by information retrieval using the methods mentioned above. But despite the filtering, the user is still faced with a gigantic number of relevant documents. Additional analysis tools are needed to recognize the structures within a text and to give users the opportunity to extract the information they are looking for. For this purpose, the approaches of computational linguistics, statistical language processing and the exploitation of macrostructure in texts are examined in more detail in the next part of this series.

About Abhishek Ghosh

Abhishek Ghosh is a Businessman, Surgeon, Author and Blogger. You can keep in touch with him on Twitter - @AbhishekCTRL.

About This Article

Cite this article as: Abhishek Ghosh, "Uses of Text Mining in Web Content Mining : Part II," in The Customize Windows, August 4, 2019, February 6, 2023, https://thecustomizewindows.com/2019/08/uses-of-text-mining-in-web-content-mining-part-ii/.

Source:The Customize Windows, JiMA.in
