The objective of this article on Big Data analytics solutions, on-premise versus in the cloud, is not limited to comparing the two. It also serves other goals, such as clarifying the limitations of our existing guides, which target localhost or test setups.
In the context of Big Data analytics solutions, various websites conventionally compare on-premise versus in the cloud as if only enterprises need them. In reality that is not valid. Many types of users need different Big Data analytics solutions for different purposes, for example free software development.
We regularly publish guides on how to install and configure various Big Data tools, and we also discuss cloud-based Big Data tools from vendors like IBM Bluemix. The current trend for such services is towards “in the cloud”, whereas earlier it was “on-premise”. Our “how to install Apache Hadoop” guides are for readers learning the basics; they are not exactly meant for production use.
In this article, we will discuss the choice between a “ready-to-use service” and “how to install Apache Hadoop” for production use. A production setup can serve various needs, from development to an enterprise solution. Whatever the need, three situations can arise depending on where the data and computing power sit:
- All computation and data on-premise
- All computation and data on a specific cloud
- Some computation and data on-premise and some computation and data on a specific cloud
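The three situations above can be sketched as a small classification. This is only an illustrative sketch: the names `Location` and `classify_deployment` are our own assumptions and do not come from any Big Data tool.

```python
from enum import Enum


class Location(Enum):
    """Where one piece of data or computation lives (hypothetical model)."""
    ON_PREMISE = "on-premise"
    CLOUD = "cloud"


def classify_deployment(placements):
    """Return 'on-premise', 'cloud' or 'hybrid' for a list of Location values."""
    if not placements:
        raise ValueError("at least one placement is required")
    distinct = set(placements)
    if distinct == {Location.ON_PREMISE}:
        return "on-premise"
    if distinct == {Location.CLOUD}:
        return "cloud"
    # A mix of both locations is the third situation.
    return "hybrid"


# Example: storage stays on-premise, computation runs in the cloud.
print(classify_deployment([Location.ON_PREMISE, Location.CLOUD]))  # hybrid
```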
Big Data Analytics Solutions: Why We Publish How to Install Guides
Our guides on how to install and configure various Big Data tools are comparable to our guides on installing a LAMP or LEMP server (for example, to run WordPress). No real ready-to-use platform is practical in that use case. So, in the case of web hosting, it is normal to pick a server with root access, which depending on the need can be a cloud server, a VPS, a dedicated server or a colocation server. But almost universally, everyone follows one guide (or their own guide!) to create a basic setup and then slowly optimizes many aspects of that LAMP or LEMP server. It is true that ready-to-use “WordPress hosting” exists. But it is nowhere near offering the latest tweaks or components, and the intended audience is newcomers who are willing to pay more.
Web hosting is a separate niche from Big Data tools, although the two look similar when we publish “how to” guides. For LAMP or LEMP servers we do publish guides on how to tweak zillions of components (compare: we have guides on how to enable OCSP stapling on Apache, how to enable TLS False Start on Nginx, how to enable caching on Percona MySQL and so on). We never publish guides on how to optimize a Hadoop or Spark installation. At most, our guides go as far as Docker-based deployments.
- We regularly publish guides on how to install and configure various servers for web hosting in a production environment.
- We regularly publish guides on how to install and configure various Big Data tools, but on “commodity hardware” for the development needs of various people.
- Our guides on installing free Big Data software are not what “Big Data” is.
One part of our website assists with sysadmin work. There is a difference between the work of a sysadmin and the work of a data scientist. Ultimately, the list of up-to-date blogs in the Big Data niche easily ends with those from the providers themselves. IBM’s Data Science Courses or BigDataHub are more or less brand-free.
Ready to Use Distributions and Appliances for Data Science are Common
For development and testing needs of the “on-premise” kind, there are at least two types of options:
- Guides using the original source code
- Guides using ready-to-use distributions and appliances for data science, like Cloudera or Hortonworks
The second option is not very popular for configuring a LAMP or LEMP server. Well-known scripts like Centminmod actually use solutions derived from the “original” source code. All web hosts deliver guides on how to install and configure a LAMP server because each of them has different types of clients and different hardware configurations, and the whole setup is fully open to the public internet via port 80 and port 443 at minimum. In data science, it is common for company B to use company A’s optimized package. IBM using and promoting tools from Cloudera, Hortonworks or Oracle may sound odd, but it is more practical than forking already-tweaked software. In Big Data analytics, unlike with a LAMP/LEMP server, the optimization part is very complicated and time-consuming. The only segment where a similar situation arises for a LAMP or LEMP server is the database software, where there are three options for MySQL:
- Original MySQL
- Percona Server for MySQL
- MariaDB (a MySQL fork)
Well, MySQL is where “Big Data” starts: if you look at the service catalogue of IBM Bluemix, MySQL sits towards the Cloud Big Data Services. As with Big Data, there is of course Database as a Service for websites and mobile platforms.
So, as with Big Data software, there are resemblances in how MySQL is tweaked:
- MySQL has various storage engines
- MySQL has various tools, like those offered by Percona, to optimize the settings
- MySQL delivers the highest possible headache to WordPress users
- MySQL performance is compared while running on optimized hardware
Before the advent of Docker, installing the whole system by hand was the gold standard for a LAMP or LEMP server, and it is still the standard. LAMP/LEMP is a kind of in-between segment, because the purpose the web hosting platform is built for is not obvious. Normally we configure it for WordPress, as that is the most common use of LAMP/LEMP, and we deliberately keep matters easy for someone who is learning to work over SSH.
Comparison of Cloud Based Big Data Analytics Solutions with PaaS
The architectural requirements for building a Big Data analytics solution are not merely software packages or their installation. The tweaking part involves minor or major in-house coding. That need for in-house coding has resulted in a huge number of Open Source software contributions in the field of Big Data. Platform as a Service (PaaS) is quite comparable with cloud-based Big Data analytics solutions. However, PaaS is mostly not meant for running a high-load production website; it is probably better suited to development, testing or use as a backend hidden from the public. As PaaS had, and still has, limitations for typical LAMP/LEMP use, various newer approaches like FaaS are evolving.
There is a big reason to introduce the topic of PaaS: IBM Bluemix. At present IBM Bluemix is not only a PaaS but a PaaS-looking platform delivering various cloud services. Part of IBM Bluemix uses Open Source software to deliver its own Platform as a Service (PaaS) and the separate services. With a PaaS like IBM Bluemix, the platform usually holds all the solutions for its users, which automatically invites comparison of its cloud-based Big Data analytics solutions with PaaS, although they lean towards SaaS. Big Data software solutions on IBM Bluemix are not exactly comparable with the old IBM Bluemix PaaS, which mainly provided an application hosting platform. We mention this because old guides on the same website often confuse new users.
Comparison of On-Premise versus Cloud Big Data Analytics Solutions
From the above points, it is natural for developers to favor cloud Big Data analytics solutions, as the need for on-premise resources, such as their own servers and a dedicated IT team, is far too high. The same is true for small to medium-sized companies.
The public cloud always offers benefits for Big Data deployments: self-service, agility, elasticity, and a pay-as-you-go model. The CapEx model for on-premise deployments takes weeks or months to get into production because of legal issues, procuring servers and racks, configuring storage and networking, and arranging a power backup system, to name a few. By comparison, the on-demand model of cloud Big Data analytics solutions can be very attractive, even if it has not yet been put to a real test.
A cloud data analytics solution with ready-made integrations can be set up rapidly for all data sources. However, like any cloud-based service, cloud-based Big Data analytics solutions depend on the provider for uptime and other software-related matters. IBM Bluemix in particular has a lot of documentation, a dedicated blog on data science (as we mentioned above), community support on Stack Exchange-like sites, and free Demo Cloud-like services, which basically make it fit for many users. The list of cloud-based Big Data analytics solution providers is limited to:
- Those who have enough free resources to use
- Those who have a pay-as-you-go model of service
- Those who have a lot of documentation
- Those who have a lot of examples or demos on GitHub
And there are not many such vendors on this earth.
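The four criteria above amount to a simple filter. The sketch below is only an illustration: the `Vendor` fields and the three vendors are made up, and a real evaluation would need graded scores rather than yes/no flags.

```python
from dataclasses import dataclass


@dataclass
class Vendor:
    """Hypothetical record of how a provider fares on the four criteria."""
    name: str
    free_tier: bool
    pay_as_you_go: bool
    good_docs: bool
    github_examples: bool


def shortlist(vendors):
    """Return the names of vendors that satisfy all four criteria."""
    return [
        v.name
        for v in vendors
        if v.free_tier and v.pay_as_you_go and v.good_docs and v.github_examples
    ]


# Entirely made-up vendors, for illustration only.
candidates = [
    Vendor("Vendor A", True, True, True, True),
    Vendor("Vendor B", True, False, True, True),
    Vendor("Vendor C", False, True, True, False),
]
print(shortlist(candidates))  # ['Vendor A']
```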
Apart from a purely cloud solution, and as in the three situations we listed at the beginning of this article, there is the hybrid cloud strategy for big data and analytics. Connecting on-premise environments and/or a dedicated cloud with a public cloud can balance data security and convenience. Data movement is one of the biggest challenges for a hybrid cloud strategy, and special consideration must be given to reducing latency and maintaining performance. Location, dedicated connections, streaming, traffic optimization and workload optimization are a few points to consider.
No single cloud environment optimizes all criteria. Here IBM has an optimized provisioning worksheet which balances the trade-offs between public, private, and hybrid cloud architectures.
Any cloud-based service carries minor inherent risks, even when provided by the best provider on this earth. Frankly, regular backups do the trick for a service or company that is not exactly large. The case of a larger enterprise with its own datacenter differs hugely from that of an independent developer. Its CIO needs to calculate whether using a third-party service will return a better ROI than the cost of purchasing hardware, the setup time, and recruiting skilled and semi-skilled staff. We really cannot advocate for a specialized segment, regardless of size, such as healthcare or a governmental agency, where adherence to various standards and the geolocation of data may be important in a country. Such segments demand paid consultancy to decide on and arrange their own solution.
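To get a feel for why data movement matters in a hybrid setup, a back-of-the-envelope estimate helps. The sketch below is a rough model under our own assumptions: decimal units (1 GB = 8000 megabits), a fixed round-trip cost, and no protocol overhead, congestion or compression. The function name and parameters are hypothetical.

```python
def transfer_seconds(size_gb, bandwidth_mbps, round_trip_ms=0.0, round_trips=1):
    """Rough time in seconds to move size_gb gigabytes over a link.

    bandwidth_mbps is the link speed in megabits per second.
    round_trips * round_trip_ms adds a fixed latency cost.
    """
    if bandwidth_mbps <= 0:
        raise ValueError("bandwidth must be positive")
    payload = size_gb * 8000 / bandwidth_mbps   # time spent pushing the bytes
    latency = round_trips * round_trip_ms / 1000  # time spent waiting on round trips
    return payload + latency


# Example: 500 GB from an on-premise store to a public cloud.
for mbps in (100, 1000):
    hours = transfer_seconds(500, mbps) / 3600
    print(f"{mbps} Mbps link: about {hours:.1f} hours")
```

Under this model, a tenfold faster dedicated connection cuts the bulk transfer time tenfold, while many small round trips are dominated by latency instead, which is why location and workload placement both appear among the points to consider.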
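The calculation a CIO faces can be illustrated with a simple break-even estimate. This is only a sketch with made-up figures: it assumes a one-time hardware and setup cost plus a flat monthly staff cost on-premise, against a flat monthly cloud fee, and ignores depreciation, growth and discounts. All names and numbers are hypothetical.

```python
def cumulative_on_premise(months, hardware, setup, monthly_staff):
    """Total on-premise spend after the given number of months."""
    return hardware + setup + monthly_staff * months


def cumulative_cloud(months, monthly_fee):
    """Total cloud spend after the given number of months."""
    return monthly_fee * months


def break_even_month(hardware, setup, monthly_staff, monthly_fee, horizon=120):
    """First month at which on-premise is no more expensive than cloud.

    Returns None if that never happens within the horizon.
    """
    for month in range(1, horizon + 1):
        on_prem = cumulative_on_premise(month, hardware, setup, monthly_staff)
        if on_prem <= cumulative_cloud(month, monthly_fee):
            return month
    return None


# Made-up figures: 60000 hardware, 12000 setup, 4000/month staff, 7000/month cloud.
print(break_even_month(60000, 12000, 4000, 7000))  # 24
```

With these illustrative numbers the on-premise route only catches up after two years, and if the monthly cloud fee is lower than the monthly staff cost it never does (the function returns `None`).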