This article is part of a series on Big Data. Theoretical Foundations of Big Data is the second part of the series; readers who have not yet done so can first read our first article of this series on Theoretical Foundations of Big Data. When dealing with large amounts of data, providers and users of Big Data have to adhere to certain rules. These are primarily legal requirements, which concern in particular data processing and access to personal data.
Legal and Security Part
Data protection in connection with Big Data is a difficult issue. Due to the sheer volume of different information, it is easy for American Internet companies like Google and Facebook to link and analyze existing user data. Each user pays for the supposedly free services by handing over personal data, which enables the companies to create a detailed user profile and to use it for their own business practices (such as personalized advertising through, e.g., Google Ads). In the US, legislation on data processing is far less stringent than in some EU countries, which greatly simplifies extensive analyses based on personal data and the development of new data analysis procedures.
Where there is an underlying contractual relationship between consumers and companies, special arrangements apply. For instance, to enable credit companies and payment service providers to identify and prevent misuse and fraud attempts, the customer's transaction data must be monitored and analyzed in order to detect unusual payment transactions (fraud detection). This is compatible with the Data Protection Act, as the company fulfils its contractual obligations through this procedure. Similarly, Big Data analysis can be used to calculate the creditworthiness of a potential customer. Various aspects such as age, occupation and income can be considered. Theoretically, the complete behavior of the potential customer could be evaluated. According to the Data Protection Act, however, sensitive information such as health history, ethnicity or nationality must not be used in this case. In the insurance sector, different statistics are also used to create individual risk profiles, on the basis of which the insurance sum is calculated (for example, classification into a no-claims class or "pay as you drive" models in motor insurance).
If personal data is to be passed on to third parties, the anonymisation of the data plays a special role. All identifying information must be removed or pseudonymized so that no conclusions can be drawn about the actual natural person. Pseudonymisation and anonymisation have to be carried out as thoroughly as possible because, given the large number of different data and data sources, combining these data with effective analysis methods makes it possible to draw conclusions about personal details of an individual.
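As a rough illustration of pseudonymisation, the following minimal Python sketch replaces direct identifiers with keyed hashes before a record is passed on. The field names, the sample record and the key are assumptions made for this example, not part of any particular law or product.

```python
import hashlib
import hmac

# Hypothetical secret key; it would have to be stored separately from the data.
SECRET_KEY = b"replace-with-a-secret-key"

# Fields assumed to identify a person directly in this example.
DIRECT_IDENTIFIERS = {"name", "email"}


def pseudonymize(record, key=SECRET_KEY):
    """Return a copy of the record with direct identifiers replaced by keyed hashes."""
    result = {}
    for field, value in record.items():
        if field in DIRECT_IDENTIFIERS:
            digest = hmac.new(key, str(value).encode("utf-8"), hashlib.sha256)
            # Shorten the hex digest to keep the pseudonym readable.
            result[field] = digest.hexdigest()[:16]
        else:
            result[field] = value
    return result


# Invented sample record.
customer = {"name": "Jane Doe", "email": "jane@example.com", "age": 34, "city": "Berlin"}
shared = pseudonymize(customer)
```

The remaining fields (here age and city) are left untouched. Combined with other data sources they may still allow conclusions about an individual, which is exactly the risk described above.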
Because Big Data can evaluate data from any source, people can get the feeling of being constantly watched by companies or governments. In particular, the evaluation of information shared on social media, or the mere observation of purchasing behaviour, allows very precise conclusions about the character, habits, needs and interests of a user. Companies can use these insights to produce highly personalised advertising for a customer in order to create new buying incentives. At the retailer Target, personalized advertising became so accurate that many customers began to feel uncomfortable with it and to question how the company approached them. As a result, Target now deliberately makes its advertising less precise so that customers do not become suspicious. This drastic case shows how quickly customer loyalty can be damaged by personalized advertising. Such incidents often trigger public discussions about how companies should, and should be allowed to, use personal data for their own marketing purposes. It is necessary to weigh the benefit for the affected person against the benefit for companies and to find a balance, so that on the one hand the customer's personal information cannot be exploited and on the other hand companies can work profitably with these data. In order to prevent data misuse, it should be ensured that the affected parties have access to and control of their own data at all times.
Of course, personal data can be misused in other ways than through evaluations aimed at monitoring and manipulating people. When storing and processing data, the question must always be asked who is allowed to view this data and who can access it. Only a restricted circle of authorized system administrators should have direct access to the collected raw data. Otherwise, these sensitive data could be manipulated in a targeted manner in order to make business analyses more difficult or to distort them completely. Another possibility is targeted data theft, in which large quantities of sensitive data such as addresses, passwords or financial data are stolen. Such a theft can be carried out not only internally but also externally through targeted hacker attacks. The most massive hacker attack in recent years hit the US bank JP Morgan, where around 83 million records of the bank's customers were stolen. This example shows how great a challenge data security is for enterprises. Data theft on such a scale can severely harm the trust that customers and business partners place in the company, to the point that they end the collaboration completely. The major challenge here is to ensure that personal data is protected from abuse, both internally and externally, as much as possible.
Big Data is an unstoppable trend of ever-increasing importance in all areas. Companies must respond to this development in order to remain competitive in the future. According to a forecast by BITKOM, the turnover generated by Big Data solutions in some countries will grow by an average of approximately 53 percent per year.
Theoretical Foundations of Big Data
This part covers the theoretical basis of Big Data. First, the data sources that produce the very large amounts of data are discussed. Then some models are described that are used in the Big Data area to evaluate large amounts of data.
According to a study by the IBM Institute for Business Value, which examined the importance and implementation status of Big Data and for which more than 1,100 specialists and IT managers from 95 different countries were interviewed, most companies have so far mainly based their Big Data systems on in-house data. This includes e-mails, log and transaction data, and data from enterprise resource planning solutions. These are data captured by machines and IT systems in order to give the responsible persons a detailed insight into automated processes. Data obtained from other operational programs is also included.
However, the study also describes that in some companies even the evaluation of internal data is only possible to a limited extent due to the extremely high volume of data. Consequently, a lot of data that could still offer great added value remains unused. Companies therefore strive to exploit the full potential of this data first, before expanding the evaluated internal data with information from external data sources.
In a master thesis, the author described smartphones as a source of a variety of different data, such as mobility data or general information on user behaviour. In addition, it is shown that companies can generate additional data by offering a smartphone app, where the data collection can either be completely transparent for the user or be initiated voluntarily by the user. Furthermore, social media is considered a very large source for Big Data, where the value comes not only from the actual information that users disclose about themselves, but also from time data and geodata, which allow conclusions about the spatial and temporal circumstances of that information. Social media, however, includes not only social networks such as Facebook or Twitter, but also weblogs, microblogs, wikis, chats, RSS feeds (Really Simple Syndication) etc.
Sensor data are also a rich source of information. These are data generated, for example, by electrical components installed in a vehicle, which describe various states of the vehicle, such as tire pressure or current fuel consumption. Further company-external data sources are surveys and research carried out by third-party companies as well as scientific databases.
The reliability of the data is an important criterion when a company selects its data sources. Social media in particular is a comparatively unreliable data source, as not every population group or age group takes part in social media and thus no representative cross-section of the population can be obtained.
The models described in the following are examples of the theoretical possibilities of data analysis that are applied in the area of Big Data. They were selected for illustrative purposes only and are not intended as a ranking of models.
A data warehouse is a system whose task is to provide the applications used by management with the necessary data and thus to support strategic decisions. It should be set up and operated separately from the existing productive systems and is used purely for analysis purposes. A further goal, alongside the support of management, is the development of knowledge management. In general, a data warehouse therefore contains data that is used for problem analysis. For this reason, a data warehouse serves as the data source for various analytical models, such as data mining.
The central component of a data warehouse is a database. It houses copies of data that are imported at regular intervals in a timed and automated manner from the databases of the productive systems. The challenge is that the productive systems should not be influenced by this process. For this reason, the schedule is usually designed in such a way that this process takes place in low-load periods (for example at night or on weekends).
In most cases, end users have read-only access to the database, since the data stored in it are, as already mentioned, duplicates of the data from the productive systems and should therefore remain unchanged. Furthermore, normalization of the data in the data warehouse database is usually dispensed with in order to accelerate the analysis process.
The databases often consist of fact and dimension tables. The fact tables contain the actual measured values, while the dimension tables contain further characteristics assigned to them. In the snowflake schema, the dimension tables are often subdivided further, which reduces redundancy in the dimension data, although queries then usually need additional joins.
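The split into fact and dimension tables can be illustrated with a small in-memory SQLite database. This is only a sketch: the table names, column names and sample values are invented for the example.

```python
import sqlite3

# In-memory database with two dimension tables and one fact table (names are invented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_sales (
        store_id INTEGER REFERENCES dim_store(store_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        revenue REAL
    );
    INSERT INTO dim_store VALUES (1, 'Berlin'), (2, 'Munich');
    INSERT INTO dim_product VALUES (10, 'Books'), (11, 'Games');
    INSERT INTO fact_sales VALUES (1, 10, 20.0), (1, 11, 35.0), (2, 10, 15.0), (2, 10, 5.0);
""")

# The fact table holds the measured values; the dimension tables describe them.
rows = conn.execute("""
    SELECT s.city, p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_store AS s ON s.store_id = f.store_id
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY s.city, p.category
    ORDER BY s.city, p.category
""").fetchall()
```

Here `rows` contains one revenue total per combination of city and product category.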
Since the complete data warehouse is often too complex for individual applications, smaller extracts of it, so-called data marts, are often created specifically, whose data are then evaluated, for example, by OLAP methods (Online Analytical Processing).
In Microsoft TechNet, an online service that provides vendor information about Microsoft products, the cube is described as the best-known and most widely used form of OLAP; it is usually created from the company-internal data warehouse. It offers a quick and easy way to analyze data and can be a very comprehensive and powerful tool for controlling units and the management of a company.
A cube is a composition of dimensions and measures. The dimensions contain data about the characteristic that is to be evaluated; the measures are the key figures assigned to the dimensions. Theoretically, an unlimited number of dimensions can be mapped; possible limitations depend on the respective manufacturer (limitation of Microsoft SQL Server: 128 dimensions). The application run by the end user sends a query to the database server based on the user interaction. As soon as the database server has delivered a result, the application builds a view of the cube. This has the advantage that end users do not have to write database queries themselves and thus do not need to know the syntax of the database query language.
The possible data sources of a cube are, as already mentioned, the company-internal data warehouse or a database that has been created specifically for these reporting purposes. The server can also use precomputed data (aggregations), the cache of the end-user device with which the end user works, or a mixture of all these sources.
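The idea of a cube view can be sketched in a few lines of Python: a key figure (here called a measure) is summed for every combination of the dimensions the user has selected. The fact rows and the names `FACTS` and `cube_view` are invented for this illustration and do not correspond to the API of any OLAP product.

```python
from collections import defaultdict

# Invented fact rows: three dimensions (region, product, year) and one measure (revenue).
FACTS = [
    {"region": "North", "product": "Books", "year": 2016, "revenue": 100},
    {"region": "North", "product": "Games", "year": 2016, "revenue": 50},
    {"region": "South", "product": "Books", "year": 2016, "revenue": 70},
    {"region": "South", "product": "Books", "year": 2017, "revenue": 30},
]


def cube_view(facts, dimensions, measure):
    """Sum a measure for every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row[measure]
    return dict(totals)


# The same data viewed along one dimension and along two dimensions.
by_region = cube_view(FACTS, ["region"], "revenue")
by_region_year = cube_view(FACTS, ["region", "year"], "revenue")
```

The user only chooses the dimensions; the aggregation itself is done by `cube_view`, much as an OLAP application hides the underlying database queries.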
A server can provide multiple cubes so that, for example, each of a company's departments can use its own cube. This offers the advantage that the data provided can be tailored to the user's needs as much as possible without neglecting the needs of other users.
Each dimension of a cube can be divided hierarchically. This means that, for example, a company that has several subsidiaries, which in turn manage subordinate administrative units with a large number of sales locations, can represent this complex structure within one dimension of the cube. The end user can drill down into this structure as far as desired, so that, depending on the complexity of the underlying data, both the total sales of a subsidiary for a year and the sales figures of a single sales location for a particular article on a particular day can be displayed. For this purpose, the end user only has to drill down into the dimensions in the application. The basic approach is that the server that provides the cube.
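The hierarchical breakdown of a dimension can be sketched as follows. The hierarchy of subsidiary, administrative unit and sales location follows the example in the text; the concrete names, the figures and the helper functions `roll_up` and `drill_down` are assumptions made for this illustration.

```python
from collections import defaultdict

# Invented sales rows; the first three fields form one hierarchical dimension.
SALES = [
    ("Subsidiary A", "Unit 1", "Store X", 120),
    ("Subsidiary A", "Unit 1", "Store Y", 80),
    ("Subsidiary A", "Unit 2", "Store Z", 50),
    ("Subsidiary B", "Unit 3", "Store W", 200),
]


def roll_up(rows, depth):
    """Aggregate sales at a hierarchy depth: 1 = subsidiary, 2 = unit, 3 = store."""
    totals = defaultdict(int)
    for *path, amount in rows:
        totals[tuple(path[:depth])] += amount
    return dict(totals)


def drill_down(rows, prefix):
    """Show the totals one hierarchy level below the given path prefix."""
    prefix = tuple(prefix)
    depth = len(prefix) + 1
    return {
        key: value
        for key, value in roll_up(rows, depth).items()
        if key[:len(prefix)] == prefix
    }


per_subsidiary = roll_up(SALES, 1)
units_of_a = drill_down(SALES, ["Subsidiary A"])
```

Calling `drill_down` again with a longer prefix, for example `["Subsidiary A", "Unit 1"]`, moves one level further down to the individual sales locations.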
The example cube consists of three dimensions (Source, Route, and Time), for which two measures are specified (the number of packets and the last packet in that node). This makes the cube a very powerful analysis tool, since the user can have a general overview of many data elements as well as a precise insight into individual data elements. The cube is very suitable for the analysis of structured, numerical data; for text data or similar less structured data, it is less suitable.
MapReduce is a programming model used to analyze large amounts of data. The basic approach is to build a cluster of many computers and to process the calculations in parallel rather than serially. Most commonly, such clusters are built from large quantities of inexpensive commodity hardware.
The rough procedure can be divided into three sub-steps. First, in the so-called Map task, the data, which are sometimes terabytes in size, are subdivided into small parts (for example, individual files, if the data stock consists of many individual files). Each part is then distributed to a node in the cluster, on which the records, which often have no logical structure, are converted to tuples, i.e. key-value pairs. At this stage there is still no aggregation of the data. The data are then sorted in the so-called Shuffle task at a central point and thus prepared for the Reduce task.
In the Reduce task, the resulting tuples are then combined to prepare them for data analysis.
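The three sub-steps can be sketched with the classic word-count example. The sketch runs in a single Python process and only mimics the data flow; a real framework would distribute the map and reduce calls across the nodes of the cluster. The function names and the input chunks are chosen for this illustration.

```python
from collections import defaultdict
from itertools import chain


def map_task(chunk):
    """Map: turn one unstructured chunk of text into (key, value) tuples."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Shuffle: sort the tuples and group all values that belong to the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_task(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


# Each chunk stands for a part of the data that would be sent to a different node.
chunks = ["big data big cluster", "data node", "big node"]
mapped = chain.from_iterable(map_task(chunk) for chunk in chunks)
grouped = shuffle(mapped)
counts = dict(reduce_task(key, values) for key, values in grouped.items())
```

Note that `map_task` performs no aggregation: the word "big" in the first chunk produces two separate tuples, which are only combined in the reduce step.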
A MapReduce framework usually consists of a single master JobTracker and one slave TaskTracker per cluster node. The task of the master JobTracker is to plan and monitor the execution of the tasks on the slaves and to re-run failed calculations. The slave TaskTrackers execute the tasks assigned to them by the master on their respective nodes and thus carry out the actual calculation.
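The master's role of planning tasks and re-running failed calculations can be sketched as a simple loop. This is a toy model under stated assumptions: the function `run_job`, the round-robin assignment, the retry limit and the simulated failure are all invented for this example and do not reflect how an actual framework schedules work.

```python
def run_job(tasks, workers, execute, max_attempts=3):
    """Assign tasks to workers round-robin and re-run tasks that failed."""
    results = {}
    attempts = {task: 0 for task in tasks}
    pending = list(tasks)
    turn = 0
    while pending:
        task = pending.pop(0)
        worker = workers[turn % len(workers)]
        turn += 1
        attempts[task] += 1
        try:
            results[task] = execute(worker, task)
        except RuntimeError:
            if attempts[task] >= max_attempts:
                raise
            pending.append(task)  # schedule the failed task again
    return results, attempts


def flaky_execute(worker, task):
    """Hypothetical worker call: node-2 always fails while processing split-1."""
    if worker == "node-2" and task == "split-1":
        raise RuntimeError("node failure")
    return f"{task} done on {worker}"


results, attempts = run_job(
    ["split-0", "split-1", "split-2"],
    ["node-1", "node-2", "node-3"],
    flaky_execute,
)
```

In this run the task `split-1` fails on its first attempt and is completed on a different node at the second attempt, so the job as a whole still succeeds.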
The data are stored in distributed file systems across all cluster nodes, where the files are held redundantly so that the failure of a single cluster node is harmless and the data remain consistent. An example of this is HDFS (Hadoop Distributed File System).
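Why redundant storage tolerates the failure of a node can be shown with a simplified placement model. The round-robin placement below is an assumption made for this sketch and is not the placement policy of HDFS; only the replication factor of 3 corresponds to the HDFS default.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes in round-robin order."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def available_blocks(placement, failed_nodes):
    """Return the blocks that still have at least one replica on a live node."""
    return {
        block
        for block, replicas in placement.items()
        if any(node not in failed_nodes for node in replicas)
    }


nodes = ["node-1", "node-2", "node-3", "node-4"]
placement = place_blocks(6, nodes)
still_readable = available_blocks(placement, {"node-2"})
```

With three replicas per block, every block remains readable after one node has failed. Data is only lost if all nodes holding the replicas of a block fail at the same time.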
Using MapReduce not only significantly accelerates data analysis compared to traditional methods, but also provides fault tolerance: such a cluster is usually made up of several thousand nodes, so the failure of a single node does not lead to a significant loss of performance or data. It follows that a certain form of load balancing is used, which, however, is relatively static. Furthermore, a MapReduce framework is highly scalable (scale-out), and the use of non-specialized commodity hardware saves costs compared to the use of more expensive high-end servers.