
The Customize Windows

Technology Journal


By Abhishek Ghosh July 18, 2017 4:30 pm Updated on July 18, 2017

What is Data Refining in Big Data?


New developers, particularly those interested in data analysis, often run into terminology that belongs to the theoretical and practical sides of engineering and the analytical sciences. These developers come from a variety of domains, and the phrases often confuse them. "What is data refining in big data?" is an obvious question, but the usual answers are written for readers with a background in statistics and the analytical sciences. In data refining, we refine disparate data to increase understanding of the data and remove data variability. We understand that this meaning is still not quite clear to many.

 

What is Data Refining in Big Data in Plain English?

 

No, refining is not a new term or buzzword, and neither is data refining. Refinement is a generic term in computer science that describes various approaches to producing correct computer programs and simplifying existing programs. Data refinement converts raw data into the format specified by a piece of software or an implementable program. Possibly the meaning is still not clear.

In our guide on wrong concepts around auto-restarting MySQL, published two years ago, we shared a MySQL log on GitHub as a gist:

https://gist.github.com/AbhishekGhosh/66f3da024340c3fc3f1b

The first line is:

151011 10:38:40 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

The fifth line is:

2015-10-11 10:38:41 0 [Warning] 'ERROR_FOR_DIVISION_BY_ZERO' is deprecated and will be removed in a future release.

The 33rd line is:

2015-10-11 10:44:56 14344 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. INSERT... ON DUPLICATE KEY UPDATE  on a table with more than one UNIQUE KEY is unsafe Statement: INSERT INTO wp_stt2_meta ( `post_id`,`meta_value`,`meta_count` ) VALUES ( '17878', 'learning in artificial network ann', 1 )
ON DUPLICATE KEY UPDATE `meta_count` = `meta_count` + 1

If I want to build a chart from the [Warning] entries, that 5500-line log is not in a proper condition for a computer to understand. The log is essentially intended to be human readable. Collecting it was part of the data collection phase.
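
To illustrate what a first step towards a chart could look like, here is a minimal Python sketch that pulls only the [Warning] entries out of such a log and counts them per minute. The function names and the regular expression are assumptions made for this illustration, and the third sample line is shortened from the 33rd line quoted above; it is not the method used in the original MySQL guide.

```python
import re
from collections import Counter

# Assumed pattern for warning lines such as:
# 2015-10-11 10:38:41 0 [Warning] 'ERROR_FOR_DIVISION_BY_ZERO' is deprecated ...
WARNING_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<thread>\d+) \[Warning\] (?P<message>.*)$"
)


def extract_warnings(lines):
    """Return one dict per [Warning] line; every other line is skipped."""
    rows = []
    for line in lines:
        match = WARNING_RE.match(line.strip())
        if match:
            rows.append(match.groupdict())
    return rows


def warnings_per_minute(rows):
    """Count warnings per minute, a shape that a charting tool can plot."""
    return Counter(f"{row['date']} {row['time'][:5]}" for row in rows)


# Sample lines taken from the log excerpts above (the last one is shortened).
sample = [
    "151011 10:38:40 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql",
    "2015-10-11 10:38:41 0 [Warning] 'ERROR_FOR_DIVISION_BY_ZERO' is deprecated and will be removed in a future release.",
    "2015-10-11 10:44:56 14344 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT.",
]

rows = extract_warnings(sample)
counts = warnings_per_minute(rows)
for minute, total in sorted(counts.items()):
    print(minute, total)
```

The mysqld_safe line uses a different timestamp format and carries no [Warning] tag, so the sketch simply ignores it. A real log would likely need more patterns than this one.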

Another good example is the Fail2Ban log. We have shown how to set up a Fail2Ban log analytics graph with badips.com. Here is a gist with a fail2ban log:

https://gist.github.com/AbhishekGhosh/48d84c020bdea9d8c8b96eec0a58a9f7

Notice these few lines:

2017-07-17 05:46:05,824 fail2ban.actions        [935]: NOTICE  [sshd] Unban 97.79.239.20
2017-07-17 05:49:49,237 fail2ban.actions        [935]: NOTICE  [sshd] Unban 61.166.73.121
2017-07-17 06:43:41,419 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131
2017-07-17 06:43:41,423 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131
2017-07-17 06:43:44,428 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131
2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131

You’ll see that there is usable information here, but not in a form that any system can easily compute on. As a basic approach, we have shown bash commands that run on the fail2ban log and extract valuable information using simple GNU tools. We also have a bash script with those fail2ban commands in another guide. When we run that bash script, we get this output:

Bad IPs from only from /var/log/fail2ban.log alone :
---Number-----IP-------------------------------------------------------------
      1 p19229-ipngn10401marunouchi.tokyo.ocn.ne.jp (114.175.118.229)
      6 182.23.66.171 (182.23.66.171)
      6 78-58-187-40.static.zebra.lt (78.58.187.40)

The IP 78.58.187.40 needs a ban via iptables. The script or commands convert the data into human-readable output for data analysis. 78.58.187.40 was banned and unbanned by fail2ban 6 times.
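
The counting step behind such a summary can be sketched in Python as well. This is only a stand-in for illustration, not the bash script from the other guide: the pattern `BAN_RE` and the function `count_bans` are hypothetical names, and the sketch skips the reverse DNS lookup that the script output shows.

```python
import re
from collections import Counter

# Assumed pattern: a NOTICE entry for some jail followed by the word "Ban".
# "Unban" entries do not match because "Ban" must follow whitespace directly.
BAN_RE = re.compile(r"NOTICE\s+\[(?P<jail>[^\]]+)\]\s+Ban\s+(?P<ip>\S+)")


def count_bans(lines):
    """Count how many times each IP address was banned."""
    counts = Counter()
    for line in lines:
        match = BAN_RE.search(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


# The six fail2ban lines quoted earlier in this article.
sample = [
    "2017-07-17 05:46:05,824 fail2ban.actions        [935]: NOTICE  [sshd] Unban 97.79.239.20",
    "2017-07-17 05:49:49,237 fail2ban.actions        [935]: NOTICE  [sshd] Unban 61.166.73.121",
    "2017-07-17 06:43:41,419 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:41,423 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:44,428 fail2ban.filter         [935]: INFO    [sshd] Found 90.189.242.131",
    "2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131",
]

ban_counts = count_bans(sample)
for ip, total in ban_counts.most_common():
    # Right-align the count, similar in spirit to the script output above.
    print(f"{total:7d} {ip}")
```

On the full log, the same loop would produce one count per banned IP, which is the number column in the script output.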

But when we use software like Apache Hadoop with Spark, we do not need a human-readable format; we need a format that Apache Hadoop with Spark can process. This is a very basic, easy example of data refinement. The above log may be understandable to an analytics system in this format:

2017-07-17, 05:46:05, 824, Unban, 97.79.239.20
2017-07-17, 05:49:49, 237, Unban, 61.166.73.121
2017-07-17, 06:43:41, 419, Found, 90.189.242.131
2017-07-17, 06:43:41, 423, Found, 90.189.242.131
2017-07-17, 06:43:44, 428, Found, 90.189.242.131
2017-07-17, 06:43:44, 790, Ban, 90.189.242.131

On 17th July, between 06:43:41 and 06:43:44, fail2ban found three attempts from the IP 90.189.242.131 and then banned it. The two earlier Unban entries belong to other IPs.
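
The conversion from the raw fail2ban lines to the comma-separated form above could be sketched as follows. The regular expression `LINE_RE` and the function `refine_line` are assumptions made for this illustration; a production parser would have to cover more fail2ban message types than Found, Ban and Unban.

```python
import re

# Assumed layout of a fail2ban line:
# date time,ms  fail2ban.<module>  [pid]:  LEVEL  [jail]  Action  IP
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}),(?P<ms>\d{3})\s+"
    r"fail2ban\.\w+\s+\[\d+\]:\s+\w+\s+\[[^\]]+\]\s+"
    r"(?P<action>Found|Ban|Unban)\s+(?P<ip>\S+)"
)


def refine_line(line):
    """Turn one raw fail2ban line into 'date, time, ms, action, ip'.

    Returns None for lines that do not match the assumed layout.
    """
    match = LINE_RE.match(line)
    if match is None:
        return None
    return ", ".join(match.group("date", "time", "ms", "action", "ip"))


raw = "2017-07-17 06:43:44,790 fail2ban.actions        [935]: NOTICE  [sshd] Ban 90.189.242.131"
refined = refine_line(raw)
print(refined)
```

Dropping the module name, process ID and log level is a design choice of this sketch; whether those fields matter depends on what the analytics system is supposed to answer.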

In a data warehouse, there is a collective process called Extract, Transform, and Load (ETL). Extraction is the process of gathering data from the data sources. The data is then transformed to fit the need and made to abide by the rules of the data architecture framework. Finally, it is loaded into the data warehouse.
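
The three ETL stages can be sketched on a very small scale in Python. This is only an illustration under loud assumptions: an in-memory SQLite database stands in for the data warehouse, the table name `events` and its columns are hypothetical, and the input is the refined fail2ban data shown earlier.

```python
import sqlite3

# Refined records, as shown earlier in this article.
RAW_ROWS = [
    "2017-07-17, 05:46:05, 824, Unban, 97.79.239.20",
    "2017-07-17, 05:49:49, 237, Unban, 61.166.73.121",
    "2017-07-17, 06:43:41, 419, Found, 90.189.242.131",
    "2017-07-17, 06:43:41, 423, Found, 90.189.242.131",
    "2017-07-17, 06:43:44, 428, Found, 90.189.242.131",
    "2017-07-17, 06:43:44, 790, Ban, 90.189.242.131",
]


def extract(raw_rows):
    """Extract: gather records from the source (here, a plain list)."""
    yield from raw_rows


def transform(records):
    """Transform: split each record into fields that fit the target schema."""
    for record in records:
        date, time, ms, action, ip = [field.strip() for field in record.split(",")]
        yield (f"{date} {time}.{ms}", action.lower(), ip)


def load(rows, connection):
    """Load: write the transformed rows into the (stand-in) warehouse table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS events (happened_at TEXT, action TEXT, ip TEXT)"
    )
    connection.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_ROWS)), connection)

found = connection.execute(
    "SELECT COUNT(*) FROM events WHERE action = 'found' AND ip = ?",
    ("90.189.242.131",),
).fetchone()[0]
print(found)
```

Once the rows sit in a table with a fixed schema, questions such as "how often was this IP found?" become a single query instead of a text-processing exercise.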

What is Data Refining in Big Data

The example I have shown with the logs and commands leans towards retrenchment, not refinement. Retrenchment is a formal-methods technique that addresses the perceived limitations of formal refinement in situations where refinement is practically unusable. The following output bears no meaning unless the script or article is read:

      1 p19229-ipngn10401marunouchi.tokyo.ocn.ne.jp (114.175.118.229)
      6 182.23.66.171 (182.23.66.171)
      6 78-58-187-40.static.zebra.lt (78.58.187.40)

But this set is closer to being meaningful on its own:

2017-07-17, 05:46:05, 824, Unban, 97.79.239.20
2017-07-17, 05:49:49, 237, Unban, 61.166.73.121
2017-07-17, 06:43:41, 419, Found, 90.189.242.131
2017-07-17, 06:43:41, 423, Found, 90.189.242.131
2017-07-17, 06:43:44, 428, Found, 90.189.242.131
2017-07-17, 06:43:44, 790, Ban, 90.189.242.131

There is free software like OpenRefine which is useful for some purposes:

https://github.com/OpenRefine/OpenRefine

If we use logs as the data source, we have also shown how to merge many log files into one big file in a simple way for testing purposes.
