Apache Hadoop Framework is a free, written in Java, framework for scalable, distributed working. It is based on the well-known MapReduce algorithm of Google Inc. as well as proposals from the Google file system. Apache Hadoop Framework allows intensive computing processes with large amounts of data (Big Data in petabyte range) on clusters computer.
Apache Hadoop Framework : Basics
Apache Hadoop was originally was created by Doug Cutting and Mike Cafarella in 2005 for Yahoo. On 23 January 2008, it became the top-level project of the Apache Software Foundation. Users include Facebook, a9.com, AOL, Baidu, IBM, ImageShack.
Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). Hadoop’s HDFS is a highly available, high-performance file system for storing very large amounts of data on the file systems of multiple computers (nodes). Files are divided into data blocks with a fixed length and distributes them redundantly on the participating nodes. HDFS is pursuing a master-slave approach. A master node , called the NameNode, processes incoming data requests, organized storage of files in the slave node and stores resulting metadata . HDFS supports file systems with several 100 million files. Both file block length and degree of redundancy are configurable.
Apache Hadoop Framework : Commercial Support and commercial Forks
Hadoop implements the MapReduce algorithm with configurable classes for Map, Reduce and Combine phases. HBase is a scalable, simple database to manage very large amounts of data within a Hadoop cluster. The HBase database is based on a free implementation of Google’s BigTable . This data structure is suitable for data which is rarely changed, but very often updated. With HBase billions of rows can be distributed and efficiently can be managed.
Hadoop Hive extended to data warehouse functionalities, namely the query language HiveQL and indexes. HiveQL is on SQL -based query language and allows the developer to use a SQL-like syntax. In the summer of 2008, Facebook , the original developer of Hive, made the project open source. Hadoop database used by Facebook is one with a little more than 100 petabytes (August 2012) and is the largest in size in the world.
With Pig MapReduce programs can be written in high-level language Pig Latin for Hadoop. Chukwa enables real-time monitoring of very large distributed systems. ZooKeeper is the configuration of distributed systems.
As the use of Hadoop is particularly interesting for companies, there are a number of companies that commercial support or Forks of Hadoop:
- Cloudera provides CDH with an “enterprise ready” Open Source Distribution for Hadoop
- Microsoft integrates Hadoop currently in Windows Azure and SQL Server
- The Google App Engine support MapReduce programs.
- IBM InfoSphere product BigInsights based on Hadoop.
- EMC ² offers Greenplum HD Hadoop as part of a product package.
- Yahoo! provides Hadoop on branded Hortonworks.