No automated procedure yet exists for preparing heterogeneous data, and this is the main obstacle to the spread of Big Data. Technological revolutions take time: they are often very slow to spread, especially at the beginning, and they demand many sacrifices and much work from the researchers and companies who believe in a technology that has yet to be born. This is a universal truth that applies to all areas of human knowledge and economic activity, and there is no reason to think that Big Data is somehow exempt from this initial difficulty.
Big Data: Still Much Manual Work and Little Automation
This is one of the largest obstacles to overcome for the spread of Big Data analysis: hard manual labor is required to standardize the disorderly data that other technologies available today produce in abundance.
The Big Data revolution has its own bottleneck: the invisible labor that people must invest to collect, clean and organize the data, in particular to bring its content into a consistent format.
Many marketers are excited about the results that Big Data analysis can deliver, but little attention is paid to the preparation that the data must undergo before it can be processed.
The abundance of data available today from the Web, sensors, smartphones and corporate databases makes this stage of data organization even more complicated and delicate, to such an extent that it requires human intervention, in particular from specialists in the relevant industry.
Big Data Still Requires Significant Manual Labor for Data Preparation
There is not yet an automated procedure that can understand the different formats in which data sets are available and, above all, convert and unify the data at the semantic level, partly because of the ambiguity of human language. The pharmaceutical industry offers an example.
The various national and international organizations working in the field of pharmacovigilance, and the many pharmaceutical companies that manufacture or sell drugs, describe side effects using terms that, although they are different words, are treated as synonyms and refer to the same episode.
For example, “sleepy”, “tired” and “fatigue” can be considered synonymous with each other, but they actually carry different connotations. A human knows how to interpret the nuances of meaning associated with each of these words, but a software algorithm must be finely programmed to recognize the difference and classify the information correctly.
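To make the point concrete, here is a minimal Python sketch of the kind of hand-built rule a specialist might write for this step. It is only an illustration: the canonical labels ("somnolence", "fatigue"), the extra term "drowsy", and the function names are assumptions made for this example, not a real pharmacovigilance vocabulary. It keeps "sleepy" apart from "tired" and "fatigue", and it sets aside any term it does not know for a human to review.

```python
# Hypothetical mapping from reported terms to canonical labels.
# The labels and groupings are illustrative assumptions, not a medical standard.
CANONICAL_TERMS = {
    "sleepy": "somnolence",
    "drowsy": "somnolence",
    "tired": "fatigue",
    "fatigue": "fatigue",
}


def normalize_term(raw):
    """Return the canonical label for a reported term, or None if it is unknown."""
    key = raw.strip().lower()
    return CANONICAL_TERMS.get(key)


def normalize_reports(reports):
    """Split reported terms into canonical labels and terms needing human review."""
    normalized = []
    needs_review = []
    for raw in reports:
        label = normalize_term(raw)
        if label is None:
            # Unknown wording: leave it for a specialist instead of guessing.
            needs_review.append(raw)
        else:
            normalized.append(label)
    return normalized, needs_review


if __name__ == "__main__":
    labels, review = normalize_reports([" Sleepy", "TIRED", "fatigue", "worn out"])
    print(labels)
    print(review)
```

Even in this toy form, the mapping table has to be written and maintained by someone who understands the domain, which is exactly the manual work described above.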
These difficulties make data scientists some of the most sought-after people in the workplace, both in large multinationals and in the startups that offer themselves as partners in solving this problem introduced by Big Data.
These professionals carry out very traditional work, spending 50 to 80 percent of their time in a laboratory with tools dedicated to collecting and preparing the digital data that feeds their specialized analysis software. At the time of publishing this article, therefore, the idea of sending raw data straight to any processing algorithm is simply a myth.
In this sense, the great challenge of Big Data follows a path already traced by other technologies that have emerged in computer science. At the beginning, a new technology emerges and is available only to a small elite. With time, human labor and the right investments, the technology improves, and tools and business practices adapt to allow the democratization of a technology hitherto reserved for the few. The same will be true for Big Data.