What is Massively Parallel Processing?

Diethelm Siebuhr, Martin Becker

Massively Parallel Processing (MPP) and the Hadoop framework are two approaches that meet the requirements of big data applications. They differ in the nature of their hardware and software, but can be combined into powerful hybrid big data architectures.

The amounts of data in companies are now taking on enormous proportions and not infrequently reach the petabyte range. The main cause of this flood of data is the digitization of all commercial and technical processes, for example in the form of sensor data from industrial plants, geo and telematics data from vehicles, or RFID data from the logistics chain. The progressive networking of systems in the Internet of Things drives this development even further. On top of that comes user-generated data from social media platforms, multimedia content, photos and HD videos.

The challenge for companies is to collect this structured and unstructured data from heterogeneous sources according to its relevance, to evaluate it, and to make it available for business decisions, preferably in real time. Data from web shops, vehicles, machines or smartphones then provides information about customers' consumption, behavior and preferences, the status of machines, locations and movements, and much more. Linked with contextual information such as customer history or the weather, this data can be turned into forecasts.

Common data warehouse systems, however, are getting on in years and are simply overwhelmed by big data analyses. Massively Parallel Processing (MPP) and/or the Hadoop framework with its comprehensive ecosystem can help here. Both approaches divide large, complex data sets into smaller units and process them in parallel on several distributed computing nodes. They differ, however, in the nature of their hardware and software.

Massively Parallel Processing (MPP): Proprietary systems and native SQL

MPP systems are available only as proprietary appliances with their own hardware and software. They consist of a large number of server nodes that work in parallel and are connected to one another via powerful switches. Each node has its own processors, memory and I/O channels and uses these resources to process part of the overall data. In this so-called shared-nothing architecture, queries are distributed to the individual nodes of the database cluster using special partitioning methods and algorithms. Each node processes its portion of the query locally in its main memory; a kind of master node merges the individual results at the end. The data must be prepared first, however, because MPP systems cannot work with unstructured data.
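To make the principle concrete, here is a minimal, self-contained Java sketch of the shared-nothing pattern: each "node" aggregates only its own data partition in parallel, and a coordinating step merges the partial results at the end. All names and data are invented for illustration, and threads stand in for the physical servers of a real MPP cluster.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the shared-nothing principle: every "node"
// aggregates only its own partition; a coordinator merges the results.
public class SharedNothingDemo {
    public static void main(String[] args) {
        // Each array stands in for the local data portion of one node.
        List<int[]> partitions = Arrays.asList(
                new int[]{3, 1, 4}, new int[]{1, 5, 9}, new int[]{2, 6, 5});

        long total = partitions.parallelStream()          // "nodes" work in parallel
                .mapToLong(p -> Arrays.stream(p).sum())   // local partial result per partition
                .sum();                                   // the "master node" merges them

        System.out.println("Global sum: " + total);       // prints: Global sum: 36
    }
}
```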

MPP systems are characterized by the high I/O performance with which the individual nodes access the partitioned data in parallel, which lets them process very large amounts of data in a short time. In addition, they are far more scalable than classic database systems, as their computing and storage capacity can easily be expanded by adding new nodes or servers. Another advantage: extremely large in-memory pools become possible with MPP databases if the entire available cluster memory is used.

MPP appliances are proprietary systems because every manufacturer optimizes its offering, through algorithms and the technical tuning of components, for the special requirements of its big data application. For the user this hardly matters, since MPP systems communicate with the outside world via SQL and also support SQL-based business intelligence tools. They also relieve database administrators through internal procedures, automatic data distribution and optimized workload management.

Since the number of CPU cores per system ranges from a few hundred to several thousand, companies have to spend a lot of money on an MPP product; buyers can expect mid-six-figure amounts. Examples on the market are IBM Netezza, EMC Greenplum, HPE Vertica, Oracle Exadata, Microsoft Analytics Platform System, Teradata Aster Data nCluster and SAP HANA.

(Image: MPP meets Hadoop)

Hadoop: Open source framework for standard computers

In contrast to MPP, Hadoop follows an open concept. The framework is a project of the Apache Software Foundation; it is based on Java and on the essential core components Hadoop Distributed File System (HDFS), MapReduce and the Hadoop Common function library, which contains, for example, JAR files and scripts, source code and documentation. The HDFS storage layer stores data in blocks and distributes them redundantly across several computing nodes. This parallel layout makes HDFS ideal for reading large amounts of data quickly.
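As a small illustration of how applications talk to this storage layer, the following sketch writes a file through Hadoop's standard Java FileSystem API and reads back its block size and replication factor; the NameNode address and the file path are placeholder values.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch: writing to HDFS via the standard FileSystem API.
public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:8020"); // placeholder cluster address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/sensors/readings.csv"); // placeholder path
            // HDFS splits the file into blocks and replicates each block
            // (three copies by default) across several DataNodes.
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("machine_id,timestamp,temperature\n");
            }
            FileStatus status = fs.getFileStatus(file);
            System.out.println("block size: " + status.getBlockSize()
                    + ", replication: " + status.getReplication());
        }
    }
}
```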

Google originally developed MapReduce to index websites. The framework consists of the two functions Map and Reduce: Map distributes tasks to the various nodes in the cluster, Reduce collects the partial results and combines them into a common result list. MapReduce is a programming environment for parallelizing queries that supports not only HDFS but also other file and database systems. The disadvantage: since MapReduce and HDFS work batch-oriented, their basic form is unsuitable for transaction processing or real-time analysis. In addition, unlike MPP, Hadoop does not directly support native SQL. Developing and adapting applications under Hadoop therefore often requires more programming effort than with MPP.
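The division of labor is easiest to see in the classic word-count example, sketched here in condensed form against Hadoop's Java MapReduce API: the Mapper emits a (word, 1) pair per token, the framework sorts and groups all pairs by key, and the Reducer sums the counts per word. Class names are our own; the job driver setup is omitted.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: runs in parallel on the nodes that hold the input blocks
// and emits one (word, 1) pair per token.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (token.isEmpty()) continue;
            word.set(token);
            ctx.write(word, ONE);
        }
    }
}

// Reduce phase: receives all values for one key, already sorted and
// grouped by the framework, and combines them into the final count.
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) sum += c.get();
        ctx.write(word, new IntWritable(sum));
    }
}
```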

Thanks to its open hardware and software architecture, however, Hadoop runs on a network of standard x86 computers, which makes the corresponding offerings significantly cheaper than MPP systems. Hadoop also handles structured and unstructured data equally well.

Around these core components, an extensive ecosystem of providers, tools and further modules has developed that compensates for Hadoop's weaknesses in real-time analysis and its lack of SQL support. The most important Hadoop distributors include Cloudera, Hortonworks, MapR, IBM, the EMC subsidiary Pivotal and cloud service providers such as Amazon and Microsoft. All Hadoop distributions try to simplify data management in distributed environments with uniform administration components.

MPP and Hadoop are moving closer together

MPP and Hadoop therefore differ primarily in the nature of their hardware and software as well as in their price level, and accordingly they address different target groups. The two approaches can coexist economically and technologically, however, which is why manufacturers of Hadoop and MPP solutions are increasingly working together to integrate the two worlds. Examples are the cooperation between SAP and the Hadoop providers Cloudera, Hortonworks, HP and IBM, or that of Microsoft and Hortonworks on the Azure service HDInsight. The EMC subsidiary Pivotal offers both Greenplum MPP appliances and Hadoop frameworks in order to serve customer requirements with technologies from both worlds.

A key issue in combining the two approaches is the integration of SQL into Hadoop. Since Hadoop does not directly support native SQL, users have to pursue other integration approaches, for example Apache Hive or its much more powerful successors Cloudera Impala, Pivotal HAWQ and Apache Spark:

  • Hive: Hive offered the first approach to analyzing the data in a Hadoop cluster with SQL. It does not cover the complete standardized SQL feature set but only a subset of its functions; the query language is therefore also referred to as HiveQL (Hive Query Language). Since HiveQL enables queries and analyses on the data stored in HDFS, Hive represents a kind of data warehouse component of the Hadoop framework.
  • Spark: Spark solves one of MapReduce's most pressing problems, the high response latency of batch mode, by enabling near-real-time, SQL-like queries against Hadoop clusters. For this it relies on in-memory techniques: it processes queries and data directly in the fast main memory of the computing nodes, whereas MapReduce reads and writes its records to hard drives. Since Spark also distributes queries across multiple nodes in parallel, performance increases by a factor of up to 100 for comparable queries (a Spark sketch follows this list). The framework does not necessarily have to sit on the Hadoop storage layer HDFS; it also works efficiently on other data platforms such as HBase, AWS S3 or Apache Cassandra. Spark can also handle more complex tasks than MapReduce, for example machine learning with MLlib, a library of ready-made algorithms. The streaming service Spotify, for instance, uses this capability to predict the music tastes of its users.
  • Cloudera Impala: In the open source environment, Impala is another MPP technology with its own SQL query engine that extends Hadoop's traditional batch mode with real-time processing. It uses its own query planner and, unlike HadoopDB or Hadapt, for example, does not run PostgreSQL instances on the DataNodes. Cloudera contributes around 50 percent of the development work on the Apache project and has fully integrated Impala into its Hadoop distribution. Impala uses the metastore introduced by Hive for schema administration as well as the SQL variant HiveQL; it is installed alongside Hive and allows low-latency SQL queries on data stored in HDFS or HBase.
  • HAWQ (Hadoop with Query): Similar to Impala, HAWQ provides a native SQL engine on top of Hadoop. It is based on EMC's MPP database Greenplum and includes, among other things, intelligent machine-learning algorithms from the MADlib library; like the other engines, it divides complex queries into small tasks and distributes them to different nodes for processing. HAWQ reads and writes data natively on HDFS and offers a query optimizer and planner that evaluates table statistics for SQL queries. The administration tools of the Greenplum database are available for installing and managing the system. Business intelligence applications connect to HAWQ via Java Database Connectivity (JDBC) and Open Database Connectivity (ODBC); a minimal JDBC sketch follows this list.
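To give an impression of how such SQL-on-Hadoop queries look in practice, here is a small Spark SQL sketch in Java. The cluster address, file path and column names are assumptions for illustration; the same query could just as well run against data in S3 or Cassandra.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch: an interactive, SQL-like query on data in HDFS via Spark SQL.
public class SparkSqlExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-on-hadoop-demo")
                .getOrCreate();

        // Placeholder path; S3 or Cassandra sources work the same way.
        Dataset<Row> orders = spark.read()
                .option("header", "true")
                .csv("hdfs://namenode:8020/data/orders.csv");
        orders.createOrReplaceTempView("orders");

        // The query is distributed across the cluster, with intermediate
        // data held in the nodes' main memory rather than on disk.
        spark.sql("SELECT customer, COUNT(*) AS n FROM orders "
                + "GROUP BY customer ORDER BY n DESC LIMIT 10")
             .show();

        spark.stop();
    }
}
```

And because HAWQ is based on Greenplum and therefore speaks the PostgreSQL wire protocol, BI-style access over JDBC can look as follows; host, database, credentials and table are placeholder values.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: a BI-style query against a SQL-on-Hadoop engine over JDBC.
public class JdbcExample {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://hawq-master:5432/analytics"; // placeholder
        try (Connection con = DriverManager.getConnection(url, "analyst", "secret");
             Statement st = con.createStatement();
             ResultSet rs = st.executeQuery(
                     "SELECT region, SUM(revenue) FROM sales GROUP BY region")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getDouble(2));
            }
        }
    }
}
```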

Conclusion: The trend towards hybrid architectures

Hadoop is now part of the big data architecture in numerous companies, not least because of its high flexibility and the many additions from its ecosystem. The framework offers an open concept for processing and analyzing large amounts of data that, in contrast to MPP, runs on cheaper standard hardware. Since Hadoop is at heart a batch system and does not directly support native SQL, its original form is less suitable for real-time processing and iterative algorithms; users therefore have to turn to integration approaches such as Hive, Impala, HAWQ or Spark.

As optimized proprietary appliances, MPP databases are more expensive, but also more efficient and powerful than Hadoop. Their weakness is unstructured or variable data, which has to be prepared before the systems can work with it. Hadoop, on the other hand, handles structured and unstructured data equally well.

There is currently a trend towards hybrid big data architectures that combine the advantages of SQL in the MPP world with the scalability and cost-effectiveness of Hadoop. MPP providers such as Pivotal are increasingly adopting approaches from Hadoop, while Hadoop incumbents such as MapR, Cloudera and Hortonworks are integrating the advantages of MPP into their products. (ane)

Diethelm Siebuhr is CEO and Martin Becker is Senior Solutions Consultant at Nexinto Holding in Hamburg, a company that uses data lakes, MPP and Hadoop applications for customers and for internal tasks.
