The Customize Windows

Technology Journal


By Abhishek Ghosh July 18, 2017 4:30 pm Updated on July 18, 2017

What is Data Refining in Big Data?

New developers, particularly those interested in data analysis, often run into terminology that has more to do with the theoretical and practical side of engineering and analytical sciences. These developers come from a variety of domains, and the phrases often confuse them. "What is data refining in big data?" is an obvious question, but the answers are commonly written for readers with a background in statistics and analytical sciences. In data refining, we refine disparate data to increase understanding of the data and remove data variability. We understand that this definition is not quite clear to many.

What is Data Refining in Big Data in Plain English?

No, refining is not a new term or buzzword, and neither is data refining. Refinement is a generic term in computer science that covers various approaches aimed at producing correct, machine-understandable programs and at simplifying existing ones. Data refinement converts raw data into the format specified by the software or program that will consume it. The meaning may still not be clear, so consider some examples.

In a guide we published 2 years ago on misconceptions around auto-restarting MySQL, we shared a MySQL log on GitHub as a gist:

https://gist.github.com/AbhishekGhosh/66f3da024340c3fc3f1b

The first line is:

151011 10:38:40 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

The fifth line is:

2015-10-11 10:38:41 0 [Warning] 'ERROR_FOR_DIVISION_BY_ZERO' is deprecated and will be removed in a future release.

The 33rd line is:

2015-10-11 10:44:56 14344 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. INSERT... ON DUPLICATE KEY UPDATE  on a table with more than one UNIQUE KEY is unsafe Statement: INSERT INTO wp_stt2_meta ( `post_id`,`meta_value`,`meta_count` ) VALUES ( '17878', 'learning in artificial network ann', 1 )
ON DUPLICATE KEY UPDATE `meta_count` = `meta_count` + 1

If I want to build a chart from the [Warning] entries, that 5500-line log is not in a proper condition for a computer to understand. The log is essentially intended to be human readable. Producing it was part of the data collection phase.
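As a rough illustration of what "proper condition for a computer" could mean here, the sketch below turns the timestamped MySQL lines into dictionaries and counts entries per level. It assumes the line layout of the fifth line quoted above; the names `LINE_RE`, `parse_mysql_line` and `count_levels` are hypothetical, and lines in the older `mysqld_safe` layout are simply skipped.

```python
import re
from collections import Counter

# Assumed layout: "YYYY-MM-DD HH:MM:SS <thread id> [Level] message"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<thread>\d+) \[(?P<level>\w+)\] (?P<message>.*)$"
)


def parse_mysql_line(line):
    """Return a dict of fields for a matching line, or None otherwise."""
    match = LINE_RE.match(line.strip())
    return match.groupdict() if match else None


def count_levels(lines):
    """Count log entries per level, ignoring lines that do not match."""
    records = (parse_mysql_line(line) for line in lines)
    return Counter(record["level"] for record in records if record)


# Two lines quoted in the article: the first and the fifth line of the log.
sample = [
    "151011 10:38:40 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql",
    "2015-10-11 10:38:41 0 [Warning] 'ERROR_FOR_DIVISION_BY_ZERO' is deprecated "
    "and will be removed in a future release.",
]

print(count_levels(sample))
```

A per-level or per-day count produced this way is the kind of table a charting tool can consume directly.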

Another good example is the Fail2Ban log. We have previously shown how to set up a Fail2Ban log analytics graph with badips.com. Here is a gist with a fail2ban log:

https://gist.github.com/AbhishekGhosh/48d84c020bdea9d8c8b96eec0a58a9f7

If you notice these few lines:

2017-07-17 05:46:05,824 fail2ban.actions        [935]: NOTICE  [sshd] Unban 97.79.239.20
2017-07-17 05:49:49,237 fail2ban.actions        [935]: NOTICE  [sshd] Unban 61.166.73.121
2017-07-17 06:43:41,419 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131
2017-07-17 06:43:41,423 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131
2017-07-17 06:43:44,428 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131
2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131

You’ll see that there is usable information here, but not in a form that a system can easily compute. As a basic approach, we have shown bash commands that run on the fail2ban log and extract valuable information using simple GNU tools. We also have a bash script that bundles those commands for fail2ban in another guide. Running that bash script gives this output:

Bad IPs from only from /var/log/fail2ban.log alone :
---Number-----IP-------------------------------------------------------------
      1 p19229-ipngn10401marunouchi.tokyo.ocn.ne.jp (114.175.118.229)
      6 182.23.66.171 (182.23.66.171)
      6 78-58-187-40.static.zebra.lt (78.58.187.40)

The IP 78.58.187.40 needs a ban via iptables. The script or commands convert the data to human-readable output for data analysis. 78.58.187.40 was banned and unbanned by fail2ban 6 times.
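The counting step can also be sketched in Python. This is not the bash script from the other guide; it is a minimal sketch that assumes the fail2ban line layout shown earlier, skips the reverse DNS lookup seen in the output above, and uses hypothetical names (`BAN_RE`, `count_bans`).

```python
import re
from collections import Counter

# Matches "NOTICE  [jail] Ban <IPv4>" but not "Unban", because only
# whitespace is allowed between the closing bracket and "Ban".
BAN_RE = re.compile(
    r"NOTICE\s+\[(?P<jail>[^\]]+)\]\s+Ban\s+(?P<ip>\d{1,3}(?:\.\d{1,3}){3})"
)


def count_bans(lines):
    """Return a Counter mapping each banned IP to its number of bans."""
    counts = Counter()
    for line in lines:
        match = BAN_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


# The six log lines quoted earlier in the article.
sample = [
    "2017-07-17 05:46:05,824 fail2ban.actions        [935]: NOTICE  [sshd] Unban 97.79.239.20",
    "2017-07-17 05:49:49,237 fail2ban.actions        [935]: NOTICE  [sshd] Unban 61.166.73.121",
    "2017-07-17 06:43:41,419 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:41,423 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:44,428 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131",
]

for ip, number in count_bans(sample).most_common():
    print(f"{number:7d} {ip}")
```

On a full log, IPs with a high count are the candidates for a permanent iptables rule.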

But when we use software like Apache Hadoop with Spark, we do not need a human-readable format but a format which Apache Hadoop with Spark can process. This is a very basic, easy example of data refinement. The above log may be understandable to an analytics system in this format:

2017-07-17, 05:46:05, 824, Unban, 97.79.239.20
2017-07-17, 05:49:49, 237, Unban, 61.166.73.121
2017-07-17, 06:43:41, 419, Found, 90.189.242.131
2017-07-17, 06:43:41, 423, Found, 90.189.242.131
2017-07-17, 06:43:44, 428, Found, 90.189.242.131
2017-07-17, 06:43:44, 790, Ban, 90.189.242.131

On 17th July, within the 05:46 to 06:43 window shown, the IP 90.189.242.131 was detected 3 times (the Found entries) and fail2ban then banned it.
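The conversion from the raw fail2ban lines to the comma-separated form above could look roughly like this. It is a sketch under the assumption that every line follows the layout quoted in this article; `LINE_RE` and `refine` are hypothetical names, and lines that do not fit are returned as `None` rather than guessed at.

```python
import re

# Assumed layout:
# "YYYY-MM-DD HH:MM:SS,mmm fail2ban.<module> [pid]: LEVEL [jail] Action IP"
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}),(?P<ms>\d{3})\s+"
    r"fail2ban\.\w+\s+\[\d+\]:\s+\w+\s+\[[^\]]+\]\s+"
    r"(?P<action>Found|Ban|Unban)\s+(?P<ip>\S+)"
)


def refine(line):
    """Convert one raw log line to 'date, time, ms, action, ip', or None."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    return ", ".join(match.group("date", "time", "ms", "action", "ip"))


sample = [
    "2017-07-17 05:46:05,824 fail2ban.actions        [935]: NOTICE  [sshd] Unban 97.79.239.20",
    "2017-07-17 06:43:41,419 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131",
]

for raw in sample:
    print(refine(raw))
```

The process id, log level and jail name are dropped here only because the target format above omits them; a real pipeline may want to keep them as extra columns.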

In a data warehouse, there is a collective process called Extract, Transform, and Load (ETL). Extraction is the process of gathering data from the data sources. The data is then transformed to fit the need and made to abide by the rules of the data architecture framework. Finally, it is loaded into the data warehouse.
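The three ETL stages can be illustrated on a very small scale with the fail2ban lines from this article. In this sketch an in-memory SQLite database stands in for the data warehouse, which is an assumption made purely for illustration; the table name `events` and the functions `extract`, `transform` and `load` are hypothetical.

```python
import re
import sqlite3

LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}),\d{3}\s+"
    r"fail2ban\.\w+\s+\[\d+\]:\s+\w+\s+\[[^\]]+\]\s+"
    r"(?P<action>Found|Ban|Unban)\s+(?P<ip>\S+)"
)


def extract(source):
    """Extract: yield raw lines from any iterable, such as an open file."""
    for line in source:
        yield line.rstrip("\n")


def transform(lines):
    """Transform: keep only matching lines as (timestamp, action, ip) rows."""
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            yield (f"{match['date']} {match['time']}", match["action"], match["ip"])


def load(rows, conn):
    """Load: insert the rows into the stand-in warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, action TEXT, ip TEXT)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


raw_log = [
    "2017-07-17 05:46:05,824 fail2ban.actions        [935]: NOTICE  [sshd] Unban 97.79.239.20",
    "2017-07-17 06:43:41,419 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:41,423 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131",
]

conn = sqlite3.connect(":memory:")
load(transform(extract(raw_log)), conn)

found = conn.execute(
    "SELECT ip, COUNT(*) FROM events WHERE action = 'Found' GROUP BY ip"
).fetchall()
print(found)
```

Once the rows are loaded, questions such as "how often was each IP detected" become a single query instead of a chain of text-processing commands.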

What is Data Refining in Big Data

The examples I have shown with the logs and commands lean towards retrenchment, not refining. Retrenchment uses formal methods to address the perceived limitations of formal refinement in situations where refinement is practically unusable. The following output bears no meaning unless the script or article is read:

      1 p19229-ipngn10401marunouchi.tokyo.ocn.ne.jp (114.175.118.229)
      6 182.23.66.171 (182.23.66.171)
      6 78-58-187-40.static.zebra.lt (78.58.187.40)

But this set is closer to being meaningful on its own:

2017-07-17, 05:46:05, 824, Unban, 97.79.239.20
2017-07-17, 05:49:49, 237, Unban, 61.166.73.121
2017-07-17, 06:43:41, 419, Found, 90.189.242.131
2017-07-17, 06:43:41, 423, Found, 90.189.242.131
2017-07-17, 06:43:44, 428, Found, 90.189.242.131
2017-07-17, 06:43:44, 790, Ban, 90.189.242.131

There is free software such as OpenRefine which is useful for some of these purposes:

https://github.com/OpenRefine/OpenRefine

If we use logs as the data source, we have previously shown a simple way to merge many log files into one big file for test purposes.
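Merging can be sketched in a few lines of Python as well. This is not the method from the earlier guide; it is a minimal sketch that concatenates files in the order given, and the file names used in the demonstration are throwaway placeholders created in a temporary directory.

```python
import tempfile
from pathlib import Path


def merge_logs(paths, destination):
    """Concatenate the files in `paths`, in the given order, into `destination`."""
    with open(destination, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8", errors="replace") as src:
                for line in src:
                    # Guard against a last line without a trailing newline,
                    # so entries from two files never run together.
                    out.write(line if line.endswith("\n") else line + "\n")


# Demonstration with two placeholder files (names and contents are made up).
workdir = Path(tempfile.mkdtemp())
(workdir / "fail2ban.log.1").write_text("older entry", encoding="utf-8")
(workdir / "fail2ban.log").write_text("newer entry\n", encoding="utf-8")

merged = workdir / "merged.log"
merge_logs([workdir / "fail2ban.log.1", workdir / "fail2ban.log"], merged)
print(merged.read_text(encoding="utf-8"))
```

Passing the rotated files oldest first keeps the merged file in chronological order, which matters if later steps assume sorted timestamps.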


About Abhishek Ghosh

Abhishek Ghosh is a Businessman, Orthopaedic Surgeon, Author and Blogger. You can keep in touch with him on Twitter - @AbhishekCTRL.


About This Article

Cite this article as: Abhishek Ghosh, "What is Data Refining in Big Data?," in The Customize Windows, July 18, 2017, March 6, 2021, https://thecustomizewindows.com/2017/07/data-refining-big-data/.

Source:The Customize Windows, JiMA.in

 
