Everything that has a beginning has an end. This article, Theoretical Foundations of Big Data Part 3, is the third and final part of our series. If you are curious about the previous two articles, you can read Part 1 and Part 2 before reading this one. We resume the topic with analytical methods. The sub-headers do not claim to be thematically complete; the points addressed should be seen as examples only, not as the sole possibilities or unique features.
Theoretical Foundations of Big Data: Analysis
In the following sections, we discuss some examples of analytical methods capable of evaluating the extremely large amounts of data generated in the Big Data area.
Data mining is the extraction of implicit, non-trivial, and useful knowledge from large, dynamic, and relatively complex data sets. From this definition it follows that data mining is a method for gaining knowledge from data characterized by very fast changes and large volumes. In addition, the data structure is usually so complex that analyzing it with conventional methods is only possible with extremely great effort by persons with deep expertise. Furthermore, the knowledge gained is not obvious, that is, it cannot be derived directly from the data.
Data mining searches for patterns in the data, that is, relationships between data sets, in order to transform them into rules and to find records representative of those rules. This process is usually automated, which presents the data analyst with several challenges: some of a technical nature, others related to the data and to the person performing the evaluations. The technical challenges are that processing times can increase greatly with overly complex queries or overly large data volumes, and that results can be falsified by erroneous or incomplete data. Most of these problems can be avoided or mitigated by preventive measures such as prior clean-up of the data stock or plausibility checks. It should also be mentioned that processing times are shortened considerably by technical developments such as faster processors and the parallelization of computation processes, and thus represent an ever smaller problem.
A further challenge is that the person evaluating the generated data must have a certain amount of expertise. On the one hand, it is imperative that this person understands the generated data; on the other hand, expert knowledge can introduce bias. Moreover, in order to implement the system, a certain amount of expertise must be incorporated into the analysis process. This can influence the results in such a way that familiar patterns are generated as output while new, hitherto unimagined patterns are ignored.
The last problem area relates to the generated data itself. On the one hand, the security of the data is indispensable if it is to be used for business decision-making. On the other hand, data mining is above all interesting, and of great value, for unusual findings; erroneous or incomplete data, as well as a data base that is too small, can have a negative effect on the representativeness of the results, rendering them useless, while data that deviates strongly from what was seen so far may wrongly be regarded as important. To mitigate this problem, it is common to prevent such deviating data from influencing the outcome of the various procedures that search for patterns in the data, for example by means of probability measures such as standard deviations. Finally, the calculated results may simply turn out to be trivial.
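The screening of strongly deviating values mentioned above can be sketched with a simple standard-deviation filter. This is a minimal illustration, not a real data mining procedure, and the threshold of two standard deviations as well as the readings are invented for the example:

```python
import statistics

def filter_outliers(values, max_sigma=2.0):
    """Drop values lying more than max_sigma standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) <= max_sigma * stdev]

readings = [10, 11, 9, 10, 12, 11, 95]  # 95 is an erroneous outlier
print(filter_outliers(readings))  # [10, 11, 9, 10, 12, 11]
```

Real tools apply more elaborate measures, but the principle is the same: values far outside the expected distribution are kept from skewing the pattern search.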
The process of data mining is usually based on an in-house data warehouse, from which the data is often exported via ODBC (Open Database Connectivity) and imported into the data mining tool's own database. ODBC is an interface of Windows systems that allows applications to access databases.
The scope of data mining is composed of various areas. On the one hand, buyer profiles are identified or market segmentation data is generated; on the other hand, shopping-basket analyses or similar analyses are often carried out. In addition, data mining is often used to generate prognosis statements.
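A shopping-basket analysis of the kind mentioned above can be sketched by counting how often item pairs occur together, a toy version of frequent-itemset mining. The baskets and the support threshold of 50% are invented for the example:

```python
from itertools import combinations
from collections import Counter

baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "cereal"},
    {"bread", "butter", "cereal"},
]

# Count how often each pair of items is bought together.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs appearing in at least half the baskets become candidate rules.
frequent = [p for p, c in pair_counts.items() if c / len(baskets) >= 0.5]
print(frequent)  # [('bread', 'butter')]
```

From such a frequent pair a rule like "customers who buy bread also buy butter" can be derived, exactly the kind of relationship between records that data mining is after.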
Predictive Analytics, accordingly, is a data analysis process that supports decisions in both the management and the workforce of a company. It is an extension of data mining that allows conclusions to be drawn about future developments. Its central element is the so-called predictors: characteristics of a person, a group of persons, or of elements, which are evaluated in order to draw conclusions about their future behavior or further development. These predictors are combined to provide a reasonable picture of future developments.
An example is the combination of age, gender, athletic activity, daily routine, and possible addictive behavior to measure a person's risk of developing health restrictions. Both structured data and unstructured data, i.e. free text and the like, can be processed and analyzed here.
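Combining such predictors into a single risk score can be sketched with a simple logistic model. The predictors, weights, and bias below are illustrative assumptions, not values derived from real data:

```python
import math

# Hypothetical predictor weights -- invented for illustration only.
WEIGHTS = {"age": 0.04, "smoker": 1.2, "weekly_sport_hours": -0.15}
BIAS = -3.0

def health_risk(age, smoker, weekly_sport_hours):
    """Combine the predictors into a risk score between 0 and 1."""
    z = (BIAS
         + WEIGHTS["age"] * age
         + WEIGHTS["smoker"] * (1 if smoker else 0)
         + WEIGHTS["weekly_sport_hours"] * weekly_sport_hours)
    return 1 / (1 + math.exp(-z))

# A sedentary 60-year-old smoker scores higher than an active 25-year-old.
print(health_risk(60, True, 0) > health_risk(25, False, 5))  # True
```

Real predictive models are trained on historical data rather than hand-weighted, but the principle of fusing several predictors into one forward-looking statement is the same.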
Thus, the process gives companies and similar organizations an opportunity to effectively use the data collected in Big Data projects, deriving insights that let them shape the future proactively instead of basing decisions on assumptions or merely reacting to current developments.
Predictive Analytics is used in various fields, including crime and fraud prevention and prediction, meteorology, the oil and gas industry, insurance, and the travel industry. In addition, the procedure is often used in the financial sector, for example to predict share prices, in the health care sector, and in retail, for example to optimize the ordering of goods based on weather data or historical sales. One example is a food retailer's grocery orders, which are typically driven by the weather forecast and the season.
OLAP (Online Analytical Processing) is a concept for the summarization and presentation of management-relevant data. In other words, OLAP is a data processing method that allows the user to extract data from the underlying database and display it from different perspectives. The data used for this purpose is often imported via ODBC from a data warehouse into the database used by the OLAP system. It is stored in a multidimensional database in which each attribute represents a new data dimension. However, the data base of the OLAP software is usually not as large as the actual data warehouse, since, for example, data on individual sales is less relevant than the total sales figures for a product or product group.
The data used here is mostly fact-based, like revenue data, but has a high number of dependencies on other data, such as the location or the product to which a sale is assigned. These dependencies are also called dimension features, and they can be arranged hierarchically. For example, a company's sales can be attributed to a subsidiary, whose figure results from the sales of the individual sales sites assigned to it; the sales of the individual sites in turn consist of the sales figures of the individual product groups.
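Aggregating facts along such a dimension hierarchy can be sketched with a simple roll-up function. The sales facts and dimension names below are invented for the example; a real OLAP system would precompute such aggregates in a cube:

```python
from collections import defaultdict

# Hypothetical sales facts with dimension features (illustrative data).
facts = [
    {"subsidiary": "North", "site": "Hamburg", "group": "Dairy",  "revenue": 120},
    {"subsidiary": "North", "site": "Hamburg", "group": "Bakery", "revenue": 80},
    {"subsidiary": "North", "site": "Bremen",  "group": "Dairy",  "revenue": 50},
    {"subsidiary": "South", "site": "Munich",  "group": "Dairy",  "revenue": 200},
]

def roll_up(facts, *dims):
    """Sum revenue along the given dimension hierarchy."""
    totals = defaultdict(int)
    for row in facts:
        totals[tuple(row[d] for d in dims)] += row["revenue"]
    return dict(totals)

# Drill down from subsidiary totals to per-site totals.
print(roll_up(facts, "subsidiary"))          # {('North',): 250, ('South',): 200}
print(roll_up(facts, "subsidiary", "site"))
```

Adding or removing dimension names in the call corresponds to drilling down into, or rolling up out of, the hierarchy described above.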
By means of OLAP software, intersections between the individual dimensions can be found, yielding conclusions based on combinations of data not considered before. This opens up the possibility of recognizing relationships between data elements that might not have been perceived without this type of evaluation. OLAP is divided into two main subtypes: MOLAP (Multidimensional OLAP) stores the data in multidimensional databases in a proprietary format, while ROLAP (Relational OLAP) is based on the concept of storing the data in conventional relational database systems.
MOLAP is known for high access speeds and can perform complex calculations, which are carried out in advance by the server. However, the amount of data that can be evaluated with the MOLAP method is limited, since all calculations are performed when the cube is created. Furthermore, because the data is stored in proprietary databases, extra investments are usually necessary. ROLAP, on the other hand, is capable of evaluating large amounts of data; its limits depend on the data volume the serving database can handle. The use of relational database systems also makes the functions of the selected database management system available, which normally means that a much larger number of functionalities can be used. However, a ROLAP system is usually slower than a MOLAP system and cannot perform complex calculations, since relational database systems and their engines are designed not for the calculation of data but only for its administration. Manufacturers of ROLAP software often try to compensate for this limitation by moving the calculations into the calling application, which further weakens the performance of the system. HOLAP (Hybrid OLAP) is a hybrid of MOLAP and ROLAP that attempts to combine the advantages of both.
Well-known providers of OLAP software include IBM, Microsoft, and SAP.
Theoretical Foundations of Big Data: Technical Implementations
This chapter deals with the basics of the various techniques that have become necessary for the implementation of Big Data. It also draws a line to the current procedures in the database field, which are the standard in practice due to existing hardware limitations and their proven suitability. Big Data is not to be understood as a replacement of the existing systems, but as an extension. The technical implementation forms the basis for the effective use of Big Data, on which both performance and the usage experience depend.
While the theory covered exemplary possibilities that contribute to the meaningful use of Big Data, techniques are needed to master the data flow it generates. The basic characteristics of these techniques are explained in this section.
The term “NoSQL databases” refers to databases that do not use the usual standard schema of rigid rows and columns and do not rely on transactions, which speeds up access to the data several times over. NoSQL databases are also referred to as unstructured databases. The name, Not only SQL, implies an extension of the previous relational SQL databases: NoSQL is not intended to replace existing database systems, which remain the best choice for fixed structures and for relating data to each other.
Transactions belong to the relational world. NoSQL databases instead synchronize their data at short intervals to achieve the required consistency; thanks to this procedure, transactions are not necessary. NoSQL databases are installed on several servers. These so-called nodes communicate with one another and exchange the information necessary for consistent data retention. Relational databases, on the other hand, write each change to a transaction log to maintain consistency. The node structure allows a high level of reliability and simplifies the scaling of system resources: the failure of individual nodes does not affect the functionality of the entire system, and adding further nodes increases the available resources. It is thus no longer necessary to use a single very powerful system; nodes can be added at will so that performance bottlenecks can be compensated for quickly. NoSQL also allows more flexible storage of the data, either in a form given by the database or left entirely to the application, so unstructured storage is possible. In addition, NoSQL enables the storage of video, audio, and image files. NoSQL is useful in areas where the data cannot be forced into the relational database structure.
NoSQL databases can be divided into four categories, so when choosing NoSQL one must also decide which method to use.

Document-oriented databases are specially designed for storing texts of any length, documents, or unstructured content. The individual documents do not require identical fields, so it is possible to define different fields and then search for the corresponding documents with a query; only the documents containing the desired field with the desired value are found. In relational databases, by contrast, new fields can only be added to the table schema as a whole.

Graph databases are specialized in mapping relations. This type defines individual nodes and relationships; the relationships link individual nodes together, and this link is created once, when the data is inserted into the database. In relational databases, tables must be linked using JOINs, which costs CPU (Central Processing Unit) time and memory because the foreign keys must be looked up for each JOIN. Graph databases are read more frequently than they are written, so the load on the storage medium is rather low; when reading, the links are followed from node to node under a uniform load.

Key-value databases link keys to values, where a value can be a string as well as a list or a set. This form of database is particularly suitable for simple systems with one-sided relationships, where its speed advantage shows particularly well. Key-value stores are often used when values need to be assigned to a certain user, as is the case with apps or online games; the user's user name is the key. Furthermore, the cost grows steadily with the size, whereas for relational databases prices rise significantly with the requirements, since linear scaling is not possible.

Column-oriented databases store the data associated with both the row and the column.
This makes it possible to obtain information from individual rows as well as from individual columns, with the advantage that computing operations related only to the data of one column run faster and with fewer input/output actions. Column-oriented databases allow information to be read from the database without additionally loading the rest of the information, i.e. the other columns. NoSQL is used in areas where relational databases reach their limits. These databases are ideal for processing large amounts of data efficiently and are also optimized for special applications; this scenario is mainly found in the Big Data area.
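The document-oriented lookup described above, where documents need not share identical fields and a query returns only those containing the desired field with the desired value, can be sketched in a few lines. The documents and field names are invented for the example:

```python
# Toy document store: each document is a dict with its own set of fields.
documents = [
    {"_id": 1, "title": "Invoice 0815", "customer": "ACME"},
    {"_id": 2, "title": "Meeting notes", "tags": ["big data", "nosql"]},
    {"_id": 3, "title": "Invoice 0816", "customer": "Globex"},
]

def find(docs, field, value):
    """Return only documents that contain the field with the desired value."""
    return [d for d in docs if d.get(field) == value]

print([d["_id"] for d in find(documents, "customer", "ACME")])  # [1]
```

Document 2 has no "customer" field at all, which in a relational table would require a schema change; here it simply never matches that query.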
In-memory databases are databases that are loaded fully into memory. This approach has become necessary because much more information should be available in a much shorter time. Furthermore, processors have achieved ever higher speeds while the transfer speeds of hard disks have not increased to the same extent. At the same time, the price, capacity, and speed of individual memory modules, as well as the amount of memory a system can manage, have improved considerably. Only these factors make in-memory technology practical.
Relational databases previously had to load the data from the hard disk into working memory, process the request there, and save the changes back to the hard disk. If the database is kept entirely in memory, loading and saving only occur when the database is started and stopped. The speeds of HDDs, SSDs, and random access memory differ greatly: random access to memory runs more than 100,000 times faster than to an HDD.
The procedure of in-memory databases is not new: relational database systems regularly load part of their database into memory to process requests more quickly, but this is not sufficient for the mass of data in the Big Data area. In addition, systems with a high memory capacity and sufficient CPU power are required to meet the requirements of in-memory technology, and this acquisition is usually cost-intensive. A disadvantage of this technique is that working memory is a volatile medium: if the system is hit by a power failure, all data held in memory is lost. Systems running in-memory technology should therefore be adequately protected against power failures, since otherwise only the state that existed when the database was started can be restored. There are also software-based safeguards that save the database changes to the hard disks at certain time intervals.
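The idea can be tried out with SQLite, whose `:memory:` mode keeps the whole database in RAM. This is only a small-scale stand-in for dedicated in-memory databases, and the table and figures are invented:

```python
import sqlite3

# A SQLite database held entirely in RAM -- no disk I/O during queries.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (product TEXT, revenue INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("Dairy", 120), ("Bakery", 80)])

# All reads and writes happen in memory; the data vanishes when the
# process ends, mirroring the volatility problem described above.
total = con.execute("SELECT SUM(revenue) FROM sales").fetchone()[0]
print(total)  # 200
```

To survive a restart, such a database would have to be written back to disk explicitly, which is exactly what the periodic save mechanisms mentioned above provide.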
Database compression deals with minimizing the data within a database and can be applied to both relational and unstructured databases. Compression allows databases to perform queries more quickly, because less data occupies the memory at the same time and the smaller volumes can be read from and written back to the hard disk faster. In a line-based process, compression can recognize certain symbol sequences as patterns and store each pattern as a shorter symbol sequence, each of which requires less space.
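Replacing repeated symbol sequences with shorter representations can be sketched with run-length encoding, a much simpler scheme than the codecs real databases use:

```python
def rle_encode(text):
    """Store each run of a repeated symbol as a (symbol, count) pair."""
    encoded, i = [], 0
    while i < len(text):
        j = i
        while j < len(text) and text[j] == text[i]:
            j += 1
        encoded.append((text[i], j - i))
        i = j
    return encoded

def rle_decode(pairs):
    """Expand the (symbol, count) pairs back into the original text."""
    return "".join(sym * count for sym, count in pairs)

sample = "aaaabbbcca"
packed = rle_encode(sample)
print(packed)                        # [('a', 4), ('b', 3), ('c', 2), ('a', 1)]
print(rle_decode(packed) == sample)  # True
```

The decode step shows the compression is lossless: queries can always reconstruct the original data from the shorter representation.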
Tiering is a division into levels. With this method, accesses to each individual record are counted, producing a ranking from data that is used extremely frequently down to data that is almost never called. Using this ranking, the software can store the records in different areas corresponding to their usage: frequently used data on faster media such as solid state disks (SSDs) and rarely used data sets on slower hard drives (HDDs).
This process has been adopted from storage systems with media of different speeds. Without additional software, a storage system sees a database simply as a file and therefore cannot perform tiering for the individual data sets. For this reason, tiering was implemented within the database, while storage manufacturers developed techniques for hardware-based tiering.
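The ranking-and-placement step can be sketched by assigning records to tiers based on their access counts. The thresholds, tier names, and counts below are illustrative assumptions:

```python
# Hypothetical per-record access counts (illustrative data).
access_counts = {"rec1": 980, "rec2": 3, "rec3": 45, "rec4": 0}

def assign_tiers(counts, hot_threshold=100, warm_threshold=10):
    """Map each record to a storage tier according to how often it is read."""
    tiers = {}
    for record, hits in counts.items():
        if hits >= hot_threshold:
            tiers[record] = "SSD"      # hot data on fast media
        elif hits >= warm_threshold:
            tiers[record] = "HDD"      # warm data on slower disks
        else:
            tiers[record] = "archive"  # cold data on the cheapest tier
    return tiers

print(assign_tiers(access_counts))
```

A real tiering engine re-evaluates these counts continuously and migrates records between media in the background.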
Hadoop is a framework for processing Big Data, programmed in Java. It was designed to run on a computer cluster, making it easily scalable and providing a high level of protection against failures.
With Hadoop, huge amounts of data are imported in most cases. These data sets are segmented into packets, from which the relevant information is then extracted with the aid of the software and its algorithms; the method Hadoop uses for this is called MapReduce. The data sets are stored on the individual servers in a special file system developed for Hadoop, the HDFS (Hadoop Distributed File System), which has special features that make it usable across a huge number of servers. Each server holds part of the data.
One of the biggest difficulties in this area is hardware failure. In a Hadoop farm with 1,000 servers, it can be assumed that at any time some server is not working, so separate error detection and automatic correction of these errors were developed. HDFS was designed for batch processing, meaning that a started process should run to the end before the next one begins; the system is designed for high throughput.
Cassandra was originally developed for Facebook to make the inbox search effective and fast for its multitude of users. It was then released as open source software and is now used by many large companies such as CERN, eBay, HP, IBM, and Netflix for various applications. It combines concepts from Amazon's Dynamo and Google's BigTable, and Hadoop technology has been integrated so that large amounts of data can be processed with the MapReduce method. Cassandra counts among the column-oriented NoSQL databases, with traits of the key-value model, and was developed in Java.
Cassandra is installed on multiple nodes, all of which are equals. To give Cassandra more resources, the installation is simply copied to another computer; everything else is automated. Furthermore, all data is stored redundantly to avoid data loss caused by a server failure. The result is a highly available, fast system that can be extended with further nodes as desired and reorganizes itself. Cassandra does not write everything to the hard disks immediately; instead, it first buffers writes in memory and flushes them to the hard drives once a certain level is reached. This mechanism allows more efficient utilization of the hard drive's write and read performance.
Cassandra belongs to the group of eventually consistent applications. This means that within certain time windows not all users have the same view of the data. This is, however, accepted by operators in exchange for the performance Cassandra offers.
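The in-memory write buffering described above can be sketched as follows. This is a toy model of the idea, not Cassandra's actual implementation; the flush threshold and the list standing in for the disk are illustrative assumptions:

```python
class BufferedWriter:
    """Buffer writes in memory and flush them to 'disk' in bulk."""
    def __init__(self, flush_threshold=3):
        self.buffer = []
        self.disk = []                 # stands in for the hard disk
        self.flush_threshold = flush_threshold

    def write(self, record):
        self.buffer.append(record)     # fast in-memory append
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.disk.extend(self.buffer)  # one large sequential write
        self.buffer.clear()

w = BufferedWriter()
for i in range(5):
    w.write(f"row-{i}")
print(len(w.disk), len(w.buffer))  # 3 2
```

Turning many small random writes into a few large sequential ones is what lets the real mechanism exploit the hard drive's throughput so well.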
Examples of application
The purpose of this chapter is to illustrate possible applications of Big Data and the associated analyses in individual scenarios. The development is only at an early stage, and scenarios still considered unimaginable could become reality in a few years. Big Data is less a single invention than the evolution of previous techniques combined with the ability to store data cost-effectively. Much data that was previously available only in analog form, and had to be entered into IT systems with difficulty, is now directly available in digital form; the Internet is mainly responsible for this.
One of the most interesting goals for the economy in the area of Big Data is behavioral forecasting, used to recognize needs that do not yet exist and to create them in the customer. The state, too, is interested in this technology and is testing such systems. Using the data, the amount of fresh food a supermarket should stock can be calculated to minimize the risk of loss through spoiled goods. Big Data is also used by telephone companies to predict possible cancellations.
In America, systems for behavioral prediction are partly fed with data from traffic and surveillance cameras as well as from social media channels such as Facebook and Twitter. The systems are thus able to generate a list of people who are considered potential threats; as a preventive measure, these people receive a call intended to deter them from committing offenses.
In climate research, immense amounts of data are collected daily by satellites and measuring stations, and all of it has to be processed and evaluated. With specialized computers it is possible to calculate concrete models of currents in the deep sea and the atmosphere, to predict the development of the climate, and to determine certain influences on it. These calculations require a great deal of data and therefore also high computing power in order to evaluate the entirety of the data and relate it. Such data is necessary to develop systems that make predictions about hurricanes and other climate phenomena.
The medical treatment of diseases causes costs in the billions, which in many countries are paid by governments, that is, by the public. In this area, Big Data offers the possibility of prevention: minimizing costs before the onset of a disease, or even preventing its outbreak.
Through the use of fitness and tracking applications and connected devices such as smartwatches, scales, and bracelets with sensors for sleep and pulse analysis, a large amount of data is produced that is not yet fully utilized. Physicians' profiles of their patients could be much more comprehensive if they contained not only the current blood pressure value but the values of several months. This can prevent misdiagnoses and thus further costs for the health system, which ultimately benefits the working population and the economy, as health insurance contributions fall and possible absences of employees can be recognized earlier or even avoided. This possibility is aimed at diseases caused by the possible misconduct of the person concerned, such as obesity, diabetes, strokes, heart attacks, or herniated discs.
Because of the sheer number of areas that Big Data influences, it remains a concept that can only be sketched in a limited way. It is a trend from which no one can escape: both companies and consumers are equally affected by the technical developments and the potential of the collected data. The collection, processing, and evaluation of all available data can bring enormous economic potential for optimization. At the same time, the privacy of each individual can be severely restricted if access to and processing of personal data is not clearly regulated. Big Data is not based on any single technical innovation; it is the networking of data that has been available for years but, because of the encapsulation of previous systems, could not be linked in this form. This networking, driven mainly by the Internet and the increasing interconnection of all areas, laid the foundation for the Big Data concept. Furthermore, the digital revolution has greatly improved the availability of data, even though it has not yet penetrated all areas.
Big Data projects are also highly specialized and differ significantly depending on the application, so companies face a high entry hurdle. This hurdle includes both the technical expertise and the expenses associated with introducing Big Data solutions. Depending on the application, a selection of different technical and methodical implementation options is available to meet the requirements placed on Big Data.
The development of Big Data is only beginning, and scenarios still considered unimaginable could become reality in a few years. The possibilities for collecting data are steadily increasing. A lot of information is already available, but the links between the individual elements are often missing to discover something new. The more questions are asked about the connections in the data, the more clearly a forecast can be made.
In addition, the potential of this data is steadily growing. Technical progress will continue in the coming years and increasingly influence all areas of life. Older generations currently produce only a fraction of the total data: they rarely use smartphones, use the Internet sparingly, and pay by EC card as the exception rather than the rule. The younger generation, on the other hand, uses smart devices such as smartphones and fitness trackers intensively, often pays by EC card, and searches for and publishes a lot of information on the Internet. If this behavior persists, future generations will use more and more technology, which greatly increases the exploitation potential of all this data.
Another important point that needs to be clarified in the near future is data protection. The data that companies can gather about individuals and use to gain additional insights has enormous potential, but clear rules and regulations are still lacking, for example on anonymisation and traceability, the duration of storage, or the ownership of the data. Furthermore, in addition to technical advancement, it must be ensured that sufficient expert knowledge is available in these new areas. To this end, new professions, such as the Data Scientist, must be developed and defined in order to support the introduction of new technologies and procedures for data analysis. This is of paramount importance, particularly in the business environment, since the introduction of new technologies always entails high investment costs and therefore also high risk. Many IT managers face the question of whether implementing Big Data is worthwhile for their company; this uncertainty can be attributed to the lack of experts and the high costs. The future will show the direction in which Big Data evolves and how each individual will be affected.