The goal of this article, Overview of Cloud Based Big Data Platforms and Tools From IBM, is to classify the Big Data tools offered by IBM and to give readers an idea of which Big Data tools IBM has. As our experience is that of a user, this article may help prospective users and developers to test these tools easily or to plan a complete setup. Among the many products IBM offers under various brand names, a new user may find it difficult to identify which one is Hadoop or to understand the logic behind how IBM arranges its product ranges. We have carried out comprehensive research using IBM’s own publications as well as external sources that contain reputable and objective information.
Big Data is a new approach to data processing. The analysis of such large amounts of data has been made possible only by the technological progress of recent years. Through Big Data we can gain insights and understand connections that go far beyond the possibilities of existing technologies. According to current calculations, the volume of data available worldwide doubles every two years. This enormous data growth results from the digitization of content and from sources such as the Internet and mobile communications, industry, traffic, social media, credit and customer cards, surveillance cameras, aircraft and vehicles, intelligent home control systems, sensor systems for controlling production facilities and so on. These result in new applications and business models in social networks, in the financial industry (financial transactions, exchange data), in the energy sector (consumption data) and in healthcare (gene analysis, telemonitoring). Also, in many areas of science (e.g. geology, genetics, proteomics, climate research and nuclear physics), large data volumes are processed to create model calculations and evaluations. The examples linked from our older articles are mostly about the analysis of server logs, which are readily available to the majority of developers. Furthermore, other websites offer plenty of scripts (such as Apache Pig scripts) for further testing.
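To give a feel for the kind of server log analysis mentioned above, here is a minimal sketch in plain Python rather than Apache Pig. It assumes the log uses the Common Log Format; the sample lines, the `LOG_PATTERN` expression and the `count_status_codes` function are our own illustrative names and data, not taken from any IBM tool or from the linked articles.

```python
import re
from collections import Counter

# Assumed layout (Common Log Format):
# host ident user [time] "request" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)'
)


def count_status_codes(lines):
    """Count how often each HTTP status code appears in the given log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:  # silently skip lines that do not look like log entries
            counts[match.group("status")] += 1
    return counts


# Made-up sample entries, for illustration only.
SAMPLE_LINES = [
    '192.0.2.1 - - [10/Oct/2015:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '192.0.2.2 - - [10/Oct/2015:13:55:40 +0000] "GET /missing HTTP/1.1" 404 512',
    '192.0.2.1 - - [10/Oct/2015:13:56:01 +0000] "GET /about.html HTTP/1.1" 200 1450',
    'not a log line',
]

if __name__ == "__main__":
    for status, n in sorted(count_status_codes(SAMPLE_LINES).items()):
        print(status, n)
```

On a real cluster the same grouping and counting would be spread over many machines by a tool such as Pig; the logic per line stays about this simple.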
Relationship with Our Previously Published Articles
In our previously published articles, we discussed the advantages and disadvantages of the Big Data tools offered by IBM, real-time Big Data analytics in health care and how to build a Big Data solution on the cloud. As a quick recap, those articles described the strategies we observed IBM adopting: IBM seems to focus on solutions that enable growth, make things easier and cheaper, and let the workforce dynamically adapt to ever-changing needs. To achieve these goals, IBM possibly follows these principles:
- Wider usage of Open Source software
- Innovation through cloud computing, business analytics, acquisitions and other strategic initiatives
- Enterprise-wide automation and integration of business processes
- IBM hardware specific optimizations for better performance
Our Classification of Cloud Based Big Data Platforms and Tools From IBM
We can classify the tools offered by IBM into:
- IBM Big Data Platform
- Analytic Applications
From the point of view of usage, the IBM Big Data Platform can be further divided into the following sub-types, which are our main area of interest:
- Hadoop System
- Stream Computing
- Data Warehousing
- Information Integration and Governance
- User Interfaces
The biggest part of the Big Data Platform is the Hadoop System. IBM’s InfoSphere BigInsights product is open source Hadoop with added storage, security, performance optimization, development tooling and visualization. There are reasons for mentioning each of these points. For example, the storage is IBM’s Shared-Nothing Cluster parallel file system (GPFS-SNC), which can replace HDFS and offers full POSIX compliance. IBM Big SQL is a ready-to-use platform with almost all commonly used Open Source Big Data tools installed, from Hadoop, Spark and Pig to MySQL.
From the point of view of usage, the Analytic Applications can be further divided into:
- Functional App
- Industry App
- Predictive Analytics
- Content Analytics
Free To Test BigSQL/Demo Cloud From IBM
For practical reasons, we need to test any service before committing to pay as a regular customer. For testing the IBM Big Data tools we have discovered two free ways: Demo Cloud and Bluemix.
For advanced users who only need the Big Data analytics tools, we felt Demo Cloud to be the superior, quicker option over Bluemix for testing, as it readily offers easy SSH access as well as a web UI. Readers can check our practical articles on the free Demo Cloud (that is how we tested), such as our initial guide on Demo Cloud and our example tutorial on using Apache Pig on Demo Cloud for server log analysis.
The disadvantage of Demo Cloud is easy to state: data is flushed at regular intervals, whereas Bluemix is predictable in this respect. Bluemix does have everything, but it can take time to understand. Furthermore, Demo Cloud has some question and answer support on IBM’s site as well as on StackOverflow.
It took us a good while to realize what is included in Bluemix and why things like BigSQL are also sold as separate products by IBM. The separate products are enterprise-grade services.
Regarding the Analytic Applications, we actually have an easy example of an implementation: a WordPress plugin that can analyze emotion. That plugin was well appreciated in the WordPress community. As time goes on, more free, community-developed plugins will become common.
Obviously, there is an ongoing cost to using the tools from IBM, and it is practical to compare their cost-benefit ratio with that of self-hosted Open Source big data software. As the costs of virtual servers, dedicated servers and bandwidth steadily decrease, there are fewer parameters left to compare.
Moreover, there are definite reasons why we do not publish guides on Rackspace or HP Cloud! Many things cannot be written, but as an intelligent user or developer you will need to understand the reasons for our preference at some point. Rackspace now charges an extra $50 just for having an account. How can we talk about Rackspace Cloud Files anymore? We had to write Python scripts to upload images to HP Cloud’s object storage. Those stories are only about web hosting, not about complicated matters like running a Big Data analysis platform.
IBM is selling its hardware and networking resources, with pre-installed software, as a service. The software is mainly well-known Open Source big data software. It is not difficult for a system administrator to measure performance and run benchmarks to compare a self-hosted Hadoop installation with IBM-hosted Hadoop on the cloud.
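As a rough illustration of such a comparison, the sketch below times a repeated job and reports simple statistics. It is only a minimal harness under our own assumptions: `benchmark` and `placeholder_job` are hypothetical names, and the placeholder workload is a stand-in for whatever identical job you would submit to each cluster.

```python
import statistics
import time


def benchmark(job, runs=5):
    """Run `job` several times and summarize the wall-clock durations."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        job()
        durations.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "median_s": statistics.median(durations),
        "min_s": min(durations),
        "max_s": max(durations),
    }


def placeholder_job():
    # Hypothetical stand-in workload. In practice this would submit the
    # same Hadoop job to the cluster under test and wait for it to finish.
    sum(i * i for i in range(100_000))


if __name__ == "__main__":
    print(benchmark(placeholder_job, runs=3))
```

Running the same harness with the same input data against both setups, and comparing the medians rather than single runs, gives a fairer picture than a one-off measurement.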
A big extra gain of using Hadoop-like Big Data tools as a service from IBM, rather than self-hosting, is the availability of IBM’s experienced, skilled employees to answer questions. Other matters, such as the reduced burden of sysadmin work to maintain your own servers running Hadoop and less downtime for server maintenance, become further decisive factors in choosing the tools offered by IBM.
IBM-optimized tools also have the extra advantages of faster project deployment without any setup, reliability through distributed redundancy, hardware optimization, specialized products like IBM Watson, ready-to-use visualizations and so on.
Will you use IBM’s Big SQL, or a cheap dedicated server with huge RAM on which you install the needed Open Source software following standard guides like ours? The answer really depends on your needs. We have never seen a cheap dedicated server provider offer instant help for troubleshooting hardware issues. Depending on the need, we accept that inconvenience for the sake of cost reduction. The cheap way does not always work.
If you have enough sysadmins who are used to data analysis, self-hosting is possibly not a bad choice. But this kind of “Software as a Service” always has the advantage of being a managed, ready-to-use service with a near-zero chance of downtime and with network security managed for you. IBM’s Big Data tools are a good choice in the same way that these days we prefer DNS or email as “Software as a Service”. One big reason to prefer these services is to save ourselves from hackers.