The Customize Windows

Technology Journal

By Abhishek Ghosh July 18, 2017 4:30 pm Updated on July 18, 2017

What is Data Refining in Big Data?

New developers, particularly those interested in data analysis, often run into terminology that has more to do with the theoretical and practical side of engineering and analytical sciences. These developers come from a variety of domains, and such phrases often confuse them. "What is data refining in big data?" is an obvious question, but the answer is usually written for readers with a background in statistics and analytical sciences. In data refining, we refine disparate data to increase understanding of the data and remove data variability. We understand that this definition is not quite clear to many.

What is Data Refining in Big Data in Plain English?

No, refining is not a new term or buzzword, and neither is data refining. Refinement is a generic term in computer science that describes various approaches aimed at producing correct, machine-understandable programs and at simplifying programs. Data refinement converts raw data into the format specified or needed by a piece of software or an implementable program. The meaning may still not be clear, so some examples help.

In our guide published two years ago on misconceptions around auto-restarting MySQL, we shared a MySQL log on GitHub as a gist:

https://gist.github.com/AbhishekGhosh/66f3da024340c3fc3f1b

The first line is:

151011 10:38:40 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

The fifth line is:

2015-10-11 10:38:41 0 [Warning] 'ERROR_FOR_DIVISION_BY_ZERO' is deprecated and will be removed in a future release.

The 33rd line is:

2015-10-11 10:44:56 14344 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. INSERT... ON DUPLICATE KEY UPDATE  on a table with more than one UNIQUE KEY is unsafe Statement: INSERT INTO wp_stt2_meta ( `post_id`,`meta_value`,`meta_count` ) VALUES ( '17878', 'learning in artificial network ann', 1 )
ON DUPLICATE KEY UPDATE `meta_count` = `meta_count` + 1

If I want to build a chart from the [Warning] entries, that 5,500-line log is not in a suitable condition for a computer to process. The log is essentially intended to be human readable. Producing it was part of the data collection phase.
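As a rough illustration of what getting such a log into a chartable state could involve, here is a minimal Python sketch. It assumes the newer-style line layout seen in the fifth line above (date, time, thread id, level in square brackets, message). The names `LINE_RE`, `parse_mysql_log` and `warnings_per_day` are made up for this example and are not part of MySQL or of the original guide.

```python
import re
from collections import Counter

# Assumed layout, taken from the sample lines quoted above:
# "YYYY-MM-DD HH:MM:SS <thread id> [Level] message"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<thread>\d+) \[(?P<level>\w+)\] (?P<message>.*)$"
)


def parse_mysql_log(lines):
    """Yield one dict per line that matches LINE_RE; skip everything else."""
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match:
            yield match.groupdict()


def warnings_per_day(lines):
    """Count [Warning] entries per calendar date."""
    return Counter(
        rec["date"] for rec in parse_mysql_log(lines) if rec["level"] == "Warning"
    )


sample = [
    "151011 10:38:40 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql",
    "2015-10-11 10:38:41 0 [Warning] 'ERROR_FOR_DIVISION_BY_ZERO' is deprecated and will be removed in a future release.",
]
print(warnings_per_day(sample))
```

Lines in the older `151011 10:38:40` style and continuation lines of multi-line statements do not match the pattern and are simply skipped, so a real log would probably need extra handling before the counts can be trusted.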

Another good example is the Fail2Ban log. We have shown how to set up a Fail2Ban log analytics graph with badips.com. Here is a gist with a fail2ban log:

https://gist.github.com/AbhishekGhosh/48d84c020bdea9d8c8b96eec0a58a9f7

Notice these few lines:

2017-07-17 05:46:05,824 fail2ban.actions        [935]: NOTICE  [sshd] Unban 97.79.239.20
2017-07-17 05:49:49,237 fail2ban.actions        [935]: NOTICE  [sshd] Unban 61.166.73.121
2017-07-17 06:43:41,419 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131
2017-07-17 06:43:41,423 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131
2017-07-17 06:43:44,428 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131
2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131

You will see that there is usable information here, but not in a form that a system can easily compute on. As a basic approach, we have shown bash commands that run on the fail2ban log and extract valuable information using simple GNU tools. We also have a bash script containing those fail2ban commands in another guide. Running that bash script gives this output:

Bad IPs from only from /var/log/fail2ban.log alone :
---Number-----IP-------------------------------------------------------------
      1 p19229-ipngn10401marunouchi.tokyo.ocn.ne.jp (114.175.118.229)
      6 182.23.66.171 (182.23.66.171)
      6 78-58-187-40.static.zebra.lt (78.58.187.40)

The IP 78.58.187.40 needs to be banned via iptables. The script or commands convert the data into human-readable output for data analysis. 78.58.187.40 was banned and unbanned by fail2ban 6 times.
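The same kind of count can be sketched in Python. This is not the bash script from the other guide, only a rough equivalent that assumes the fail2ban line layout shown in the sample lines earlier. It does no reverse DNS lookup, so it prints bare IP addresses. `BAN_RE` and `count_bans` are names invented for this example.

```python
import re
from collections import Counter

# Assumed line ending, based on the sample lines above: "[<jail>] Ban <IPv4>"
# "Unban" lines do not match because the pattern requires "] Ban".
BAN_RE = re.compile(r"\[[^\]]+\] Ban (?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s*$")


def count_bans(lines):
    """Return a Counter mapping each IP address to its number of Ban events."""
    counts = Counter()
    for line in lines:
        match = BAN_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


sample = [
    "2017-07-17 05:46:05,824 fail2ban.actions        [935]: NOTICE  [sshd] Unban 97.79.239.20",
    "2017-07-17 06:43:41,419 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131",
]
for ip, bans in count_bans(sample).most_common():
    print(f"{bans:7d} {ip}")
```

On a real log, the lines would come from an open file handle instead of the `sample` list.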

But when we use software like Apache Hadoop with Spark, we do not need a human-readable format but a format that Apache Hadoop with Spark can process. This is a very basic, easy example of data refinement. The above log may be understandable to an analytics system in this format:

2017-07-17, 05:46:05, 824, Unban, 97.79.239.20
2017-07-17, 05:49:49, 237, Unban, 61.166.73.121
2017-07-17, 06:43:41, 419, Found, 90.189.242.131
2017-07-17, 06:43:41, 423, Found, 90.189.242.131
2017-07-17, 06:43:44, 428, Found, 90.189.242.131
2017-07-17, 06:43:44, 790, Ban, 90.189.242.131
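One way such a conversion might be done is sketched below in Python. It assumes every relevant line follows the layout of the six fail2ban lines quoted earlier, and it writes standard CSV, so there is no space after each comma. `LINE_RE`, `refine` and `to_csv` are illustrative names, not part of fail2ban or of any guide mentioned here.

```python
import csv
import io
import re

# Assumed layout, based on the fail2ban sample lines:
# "<date> <time>,<ms> fail2ban.<module> [<pid>]: <LEVEL> [<jail>] <Action> <ip>"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}),(?P<ms>\d{3}) "
    r"fail2ban\.\w+\s+\[\d+\]: \w+\s+\[[^\]]+\] "
    r"(?P<action>Found|Ban|Unban) (?P<ip>\S+)"
)


def refine(lines):
    """Yield [date, time, ms, action, ip] for every line that matches."""
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            yield [match["date"], match["time"], match["ms"],
                   match["action"], match["ip"]]


def to_csv(lines):
    """Return the refined rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(refine(lines))
    return buffer.getvalue()


sample = [
    "2017-07-17 05:46:05,824 fail2ban.actions        [935]: NOTICE  [sshd] Unban 97.79.239.20",
    "2017-07-17 06:43:41,419 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131",
]
print(to_csv(sample), end="")
```

Lines that do not match the pattern are dropped silently. Whether that is acceptable depends on what the analysis needs.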

On 17th July, the IP 90.189.242.131 was detected 3 times between 06:43:41 and 06:43:44, and fail2ban then banned it.

In a data warehouse, there is a collective process called Extract, Transform, and Load (ETL). Extraction is the process of gathering data from the data sources. The data is then transformed to fit the need and made to abide by the rules of the data architecture framework. Finally, it is loaded into the data warehouse.
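The three ETL stages can be sketched in a few lines of Python. This is only a toy under stated assumptions: an in-memory SQLite database stands in for the data warehouse, the input is a couple of rows in the comma-separated format shown above, and the `events` table with its three columns is a hypothetical schema chosen for the example.

```python
import sqlite3

# Hypothetical raw input: rows in the comma-separated format shown above.
RAW = """\
2017-07-17, 06:43:41, 419, Found, 90.189.242.131
2017-07-17, 06:43:44, 790, Ban, 90.189.242.131
"""


def extract(text):
    """Extract: gather the non-empty raw lines from the source (here, a string)."""
    return [line for line in text.splitlines() if line.strip()]


def transform(lines):
    """Transform: split fields, trim whitespace, and build one timestamp column."""
    rows = []
    for line in lines:
        date, time, ms, action, ip = [field.strip() for field in line.split(",")]
        rows.append((f"{date}T{time}.{ms}", action.lower(), ip))
    return rows


def load(rows, conn):
    """Load: insert the rows into a table with a fixed schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(ts TEXT NOT NULL, action TEXT NOT NULL, ip TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(RAW)), conn)
print(conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
).fetchall())
```

In this sketch the `NOT NULL` columns play the role of the data architecture rules: a row that breaks them is rejected at load time.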

What is Data Refining in Big Data

The example I have shown with the logs and commands leans towards retrenchment, not refining. Retrenchment is a technique from formal methods that addresses the perceived limitations of formal refinement in situations where refinement is practically unusable. The following output bears no meaning unless the script or article is read:

      1 p19229-ipngn10401marunouchi.tokyo.ocn.ne.jp (114.175.118.229)
      6 182.23.66.171 (182.23.66.171)
      6 78-58-187-40.static.zebra.lt (78.58.187.40)

But this set is closer to being meaningful on its own:

2017-07-17, 05:46:05, 824, Unban, 97.79.239.20
2017-07-17, 05:49:49, 237, Unban, 61.166.73.121
2017-07-17, 06:43:41, 419, Found, 90.189.242.131
2017-07-17, 06:43:41, 423, Found, 90.189.242.131
2017-07-17, 06:43:44, 428, Found, 90.189.242.131
2017-07-17, 06:43:44, 790, Ban, 90.189.242.131

There is free software such as OpenRefine which is useful for some purposes:

https://github.com/OpenRefine/OpenRefine

If we use logs as the data source, we have shown how to merge many log files into one big file in a simple way for testing purposes.
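For readers who prefer Python, a merge of that kind might look like the sketch below. It is not the method from the other guide, only a plain concatenation in the order given, and it assumes the logs are text files. `merge_logs` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def merge_logs(sources, destination):
    """Concatenate the source files, in order, into destination.

    Returns the number of lines written. A newline is added to a final
    line that lacks one, so lines from two files never run together.
    """
    total = 0
    with open(destination, "w", encoding="utf-8") as out:
        for src in sources:
            with open(src, encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    out.write(line if line.endswith("\n") else line + "\n")
                    total += 1
    return total


# Usage with two throwaway files in a temporary directory.
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / "a.log").write_text("line 1\nline 2", encoding="utf-8")
(tmp_dir / "b.log").write_text("line 3\n", encoding="utf-8")
merged = tmp_dir / "merged.log"
count = merge_logs(sorted(tmp_dir.glob("*.log")), merged)
print(count, "lines written to", merged)
```

The list of sources is built before the destination file is created, so the merged file is never read as one of its own inputs.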

About Abhishek Ghosh

Abhishek Ghosh is a Businessman, Surgeon, Author and Blogger. You can keep in touch with him on Twitter - @AbhishekCTRL.

About This Article

Cite this article as: Abhishek Ghosh, "What is Data Refining in Big Data?," in The Customize Windows, July 18, 2017, June 3, 2023, https://thecustomizewindows.com/2017/07/data-refining-big-data/.

Source: The Customize Windows, JiMA.in
