The Customize Windows

Technology Journal


By Abhishek Ghosh August 5, 2019 10:38 am Updated on August 8, 2019

Uses of Text Mining in Web Content Mining : Part III

This article continues the second part of the series on text mining in web content mining. Information retrieval, using the methods described previously, has already filtered the documents that match the user's request. Despite this filtering, the user still faces a very large number of relevant documents, and reading and working through all of them takes too much effort. Additional analysis tools are therefore needed. They must be able to recognize the structures within a text and let the user extract the information they are looking for.

Natural language text is often considered unstructured because it lacks the structure familiar from databases. Contrary to this assumption, natural language texts have a generic structure of words, phrases and sentences, and a system that understands how these units are built can analyze that structure automatically. This allows information to be extracted more effectively than with pattern recognition and string manipulation alone. Natural language is governed by rules on how words may be ordered. Effective text mining systems therefore need to use these rules, together with the meanings of words, when processing texts. Of the broad field of computational linguistics, the areas most significant for text mining are described in more detail here: the morphology, syntax and semantics of texts.

A word consists of a word stem, affixes (prefixes and suffixes) and inflections. The core of a word is the stem, which is often a word in its own right. Affixes change the meaning of the stem. Inflections are changes to a word that mark grammatical properties such as number and tense. Morphological analysis supports text mining by reducing the complexity of the analysis and by helping to represent word meanings.

To extract linguistically relevant features (for example words, phrases and texts) from the stream of characters, documents must first be prepared for machine processing. In the tokenization step, the text is split into its individual units (tokens). The tokens are then usually enriched with grammatical information. A part-of-speech tagger (POS tagger) classifies each token according to its part of speech, and this meta-information is attached to the token as a tag.
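The tokenization and tagging steps can be sketched in a few lines of Python. This is only an illustration: the lexicon `TOY_LEXICON`, the tag names and the function names are assumptions made for this example, and a real POS tagger would be trained on an annotated corpus rather than look words up in a fixed table.

```python
import re

# Hypothetical toy lexicon, invented for this sketch only.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "user": "NOUN", "documents": "NOUN",
    "reads": "VERB",
    "relevant": "ADJ",
}

def tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def tag(tokens):
    """Attach a part-of-speech tag to each token; unknown words get 'UNK'."""
    tagged = []
    for token in tokens:
        if re.fullmatch(r"[^\w\s]", token):
            tagged.append((token, "PUNCT"))
        else:
            tagged.append((token, TOY_LEXICON.get(token.lower(), "UNK")))
    return tagged

tokens = tokenize("The user reads relevant documents.")
print(tag(tokens))
```

Each token ends up paired with its tag, which is the form that later steps such as chunking can build on.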

Finally, a chunk parser combines the tagged words into phrasal structures. It does not build complete syntactic structures but identifies partial ones (chunks). This information is attached to the text as phrasal tags so that subsequent techniques can use it.
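A minimal sketch of such a chunker, assuming the input is a list of (token, tag) pairs and that a noun phrase is an optional determiner, any number of adjectives and at least one noun. The pattern, the tag names and the function name `chunk_noun_phrases` are illustrative assumptions, not a full grammar.

```python
def chunk_noun_phrases(tagged):
    """Group DET? ADJ* NOUN+ runs into 'NP' chunks; other tokens stay alone."""
    chunks = []
    i = 0
    n = len(tagged)
    while i < n:
        j = i
        if j < n and tagged[j][1] == "DET":       # optional determiner
            j += 1
        while j < n and tagged[j][1] == "ADJ":    # any number of adjectives
            j += 1
        noun_start = j
        while j < n and tagged[j][1] == "NOUN":   # one or more nouns
            j += 1
        if j > noun_start:
            # At least one noun was found, so tokens i..j-1 form a noun phrase.
            chunks.append(("NP", [word for word, _ in tagged[i:j]]))
            i = j
        else:
            # No noun phrase starts here; keep the token under its own tag.
            chunks.append((tagged[i][1], [tagged[i][0]]))
            i += 1
    return chunks

tagged = [("The", "DET"), ("user", "NOUN"), ("reads", "VERB"),
          ("relevant", "ADJ"), ("documents", "NOUN"), (".", "PUNCT")]
print(chunk_noun_phrases(tagged))
```

The result is a flat list of chunks rather than a full parse tree, which matches the idea of identifying partial structures only.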

Reducing words to their word stem substantially lowers the complexity of text analysis. Once different forms of a word are mapped to one stem, the occurrences of that stem can be counted, and the count is already a good indicator of how important a topic is in the document. Morphology also makes it possible to bring in resources such as dictionaries and encyclopaedias and to recognize related words and phrases (for example, multi-word proper names).
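The effect of stemming on frequency counts can be illustrated with a deliberately naive suffix stripper. The suffix list `SUFFIXES` and both function names are assumptions for this sketch; practical systems use far more careful rule sets or dictionary-based lemmatization.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny suffix list for illustration only.
SUFFIXES = ("ing", "ed", "es", "s")

def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_frequencies(text):
    """Count how often each (naively computed) stem occurs in the text."""
    words = re.findall(r"[A-Za-z]+", text)
    return Counter(naive_stem(w) for w in words)

counts = stem_frequencies("She walks. He walked. They are walking to the walk.")
print(counts.most_common(3))
```

Without stemming, "walks", "walked", "walking" and "walk" would be counted as four unrelated words; with it, they contribute to a single stem.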

In linguistics, the word is the smallest unit that can stand alone in the grammatical sense. Words are made up of morphemes, the smallest units that carry meaning. Morphemes can be divided into two categories: free morphemes, which can occur alone as a word, and bound morphemes, which do not occur alone but are attached to a word as prefixes or suffixes. Morphemes are further distinguished into content and functional morphemes. Content morphemes are typically word stems that carry meaning independently of the grammar. Functional morphemes, by contrast, serve to adapt a word to the grammar. Finally, morphemes are classified as inflectional or derivational. Derivational morphemes create new words, for example by converting a verb into a noun. Inflectional morphemes do not create new words but extend the word stem to fit the grammatical requirements.

This shows how important morphology is for text mining. A good frequency analysis is only possible if the word stems can be identified and the individual words can be recognized with their grammatical function. Only the combination of content and functional morphemes makes it possible to determine the required grammatical information about a word.

Just as morphology has practical implications for the analysis of words, syntax serves the same function for phrases and sentences. This is possible because the rules of a language describe how words can be combined into phrases and sentences. Syntactic analysis can recognize noun, verb, prepositional and adjective phrases. These phrases are the building blocks from which more complicated phrases and sentences are formed. Combined with syntactic rules, the individual phrases can be arranged in a hierarchy that represents how they relate to each other.

By analyzing the relationship between verb and noun phrases, the role of each noun in a sentence can be determined. Just as a word stem restricts which affixes can be attached to it, a verb restricts the number and type of nouns that can be used in a sentence. The resulting so-called case assignments are stored in lexicons and used to search for patterns. Combined with syntactic rules, morphological information can be used to recognize structures in the form of word and phrase patterns, which provides the basis for semantic analysis.

Semantics is the part of linguistics that deals with the meaning of natural language expressions. Unlike syntax, it is concerned with the meaning of words (lexical semantics), sentences (sentence semantics) and texts (discourse semantics).

The aim in representing meaning is a space-efficient solution that allows programs to make decisions immediately. One approach that meets these requirements is the semantic network. Semantic networks use nodes and arrows to represent connections between objects, events and concepts. Because they can classify and generalize topics, such networks have proved useful for searching by topic instead of by keyword. The problem is the large amount of time needed to build general networks with a rich vocabulary, so semantic networks are practical only in limited domains.
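As a rough sketch, a semantic network can be stored as labeled arrows between nodes, and generalization then amounts to following "is_a" arrows upward. The class name, the relation labels and the example concepts below are assumptions chosen for illustration only.

```python
from collections import defaultdict

class SemanticNetwork:
    """Nodes connected by labeled, directed arrows."""

    def __init__(self):
        # Maps (source node, relation label) to the set of target nodes.
        self.edges = defaultdict(set)

    def add(self, source, relation, target):
        self.edges[(source, relation)].add(target)

    def generalizations(self, concept):
        """Return every concept reachable by following 'is_a' arrows."""
        seen = set()
        stack = [concept]
        while stack:
            node = stack.pop()
            for parent in self.edges.get((node, "is_a"), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

# Hypothetical example data.
net = SemanticNetwork()
net.add("poodle", "is_a", "dog")
net.add("dog", "is_a", "animal")
net.add("dog", "has_part", "tail")
print(net.generalizations("poodle"))
```

A topic search could then match a query for "animal" against a document that only mentions "poodle", which keyword matching alone would miss. Building such a network by hand for a large vocabulary is where the time cost mentioned above comes from.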

 

Statistical language processing

 

Text processing technologies based on morphology, syntax and semantics are powerful tools for extracting information from texts. They allow documents to be found by topic or keyword, and texts can be scanned for meaningful phrase patterns in order to extract key features and their relationships. In addition, documents can be stored in a form that allows easy navigation and further extraction of information, which goes far beyond what information retrieval techniques offer.

These techniques have their limitations. Among the problems are correctly recognizing the roles of identified noun phrases, on which correct information extraction depends, and representing abstract concepts in an orderly way. Semantic networks are well suited to representing part-whole relations (compositions and aggregations) and subset relations (inheritance). It is much more difficult to draw inferences without the complexity becoming too great. Synonyms and specialized domains, in which many different terms describe very similar concepts, also prove problematic. A general classification system would need too many concepts to classify all kinds of topics, and with that many concepts they could no longer all be represented at once. Some of these problems can be eliminated by combining the results of linguistic analysis with simple statistical measures.

The typical tasks of text mining include the automated creation of text summaries. This task can be solved by simply finding the most significant concepts. To avoid semantic networks and their limitations, word frequencies can be used to find the most essential concepts of a text. The significance of a word can be determined simply by counting the occurrences of its word stem, and the importance of a sentence can in turn be determined from the importance of the words it contains. Extracting the most important sentences yields a simple but effective summary of the text. Combining linguistics with statistical techniques also simplifies semantic networks. In a semantic network, each node usually represents a word or term, and arrows describe the relationships between nodes. These arrows can carry the degree of correlation between the nodes, which shows how often the words are used side by side. Although this method cannot represent the full meaning of a text, it does provide information about how important topics are relative to each other. Simple word-frequency counts could thus be replaced by applications that also consider how terms are related to one another. Linguistic approaches are entirely sufficient for individual texts such as single messages. Statistical methods, in turn, are particularly well suited to large text collections such as newsgroups or newspaper archives.
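The frequency-based summary described above can be sketched as follows. For brevity this example counts lowercased words instead of word stems, and the stopword list `STOPWORDS`, the sentence-splitting rule and the function name `summarize` are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical minimal stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "it"}

def content_words(text):
    """Lowercased alphabetic words with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        # A sentence is as important as the sum of its word frequencies.
        return sum(freq[w] for w in content_words(sentence))

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return [sentences[i] for i in chosen]

text = ("Text mining finds patterns in text. The weather is nice. "
        "Mining text needs tools.")
print(summarize(text, max_sentences=2))
```

Because the score is a plain sum, longer sentences tend to score higher. Dividing by sentence length would be one possible adjustment, at the cost of favoring very short sentences.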


Macrostructures

 

The techniques presented so far treat every part of a text uniformly. This becomes a problem as soon as a larger text consists of sections with different contents and emphases. Such an analysis would select only the sentences with the most frequently used terms for the summary. As a result, the summary would contain significantly more information about the longer sections. However, the importance of a section cannot be determined by its length alone. It has also been ignored so far that in some texts, such as memos or reports, the information carries a different weight for different target groups.

Macrostructures are mostly created artificially in order to structure large volumes of text, whereas microstructures belong to the language level of a text. Macrostructures include subdivisions such as chapters and headings, as well as the presentation and meaning information of text elements that is encoded in markup such as HTML or XML. References such as hyperlinks between documents on the Internet, which are used for easy navigation between relevant documents, can also be used to analyze the importance of a document. This is done by measuring how many references lead from a document to other documents (its value as a hub) and how many references from other documents point to it (its value as an authority). Google, the best-known search engine, uses link analysis of this kind under the name PageRank. In addition to weighting results by the frequency of the search terms, it refines them according to the importance of a document in the link structure.
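The hub and authority idea can be sketched as an iterative computation over a link graph. Note that this is the hubs-and-authorities scheme (commonly known as HITS), not Google's PageRank, which is a related but different link-analysis algorithm. The function name `hits`, the iteration count and the example graph are assumptions for this illustration.

```python
import math

def hits(links, iterations=50):
    """links maps each page to the list of pages it links to.

    Returns two dicts: hub scores and authority scores.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of all pages linking to the page.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                auth[target] += hub[source]
        # Hub: sum of the authority scores of all pages the page links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores do not grow without bound.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / auth_norm for p, v in auth.items()}
        hub = {p: v / hub_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical three-page link graph.
links = {"portal": ["a", "b"], "a": ["b"], "b": []}
hub, auth = hits(links)
print(max(hub, key=hub.get), max(auth, key=auth.get))
```

In this small graph the page that links out the most ends up as the strongest hub, and the page that receives the most links ends up as the strongest authority.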

 

Result presentation

 

Text mining results are usually presented in the browser. Because the result set is large, easy navigation through the documents must be possible. A simplified representation of the information makes pattern recognition faster, which is why visualization tools play an ever-increasing role. Visualization makes it easier for the user to recognize keywords and to choose between documents. In the early days of text mining, the user could not interact with the graphics offered, so it was very cumbersome to feed new findings back into the search. Interactive graphics have resolved this issue: with simple mouse clicks, the user can make a selection that refines or changes the search. Some text mining systems let users design their own query dialogs from the outset. The fourth part of this series discusses the areas of application and tasks of text mining.

About This Article

Cite this article as: Abhishek Ghosh, "Uses of Text Mining in Web Content Mining : Part III," in The Customize Windows, August 5, 2019, January 31, 2023, https://thecustomizewindows.com/2019/08/uses-of-text-mining-in-web-content-mining-part-iii/.
