What is SQL on Hadoop?

Hadoop and SQL are moving closer together

Innovative solutions are putting the long-established database vendors under pressure to link Hadoop with their relational database systems. Classic relational database systems often reach their limits with big data, because databases in the multi-terabyte range require fast storage, high-performance servers and therefore usually many CPU licenses. As a complement, companies are increasingly turning to the open-source framework Hadoop. Apache Hadoop essentially consists of two components:

  • the Hadoop Distributed File System (HDFS), a distributed, highly available storage layer (see the short access sketch after this list), and

  • MapReduce, a programming framework for processing the files stored in HDFS in parallel.
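
To illustrate how applications talk to the storage layer, the following sketch writes and reads a small file via the Hadoop FileSystem API in Java. It is only a minimal example; the NameNode address, user directory and file name are placeholder assumptions, not part of the original article.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    /**
     * Minimal sketch: writing and reading a file in HDFS via the Hadoop
     * FileSystem API. The cluster address and paths are placeholders.
     */
    public class HdfsRoundTrip {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode address; normally taken from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            try (FileSystem fs = FileSystem.get(conf)) {
                Path file = new Path("/user/demo/hello.txt");

                // Write a small text file; HDFS replicates its blocks across nodes.
                try (FSDataOutputStream out = fs.create(file, true)) {
                    out.write("Hello HDFS".getBytes(StandardCharsets.UTF_8));
                }

                // Read the file back.
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
                    System.out.println(in.readLine());
                }
            }
        }
    }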

Strengths and weaknesses of Hadoop

Behind MapReduce lies the programming paradigm of executing compute tasks in a highly parallel fashion on a large number of cluster nodes. MapReduce is essentially based on ideas from Google, which originally developed the algorithm to build its own search index from the content of the web. A major goal of the approach is the parallel processing of jobs on a low-cost hardware infrastructure consisting of commodity servers. Besides its scalability, the great advantage of such a system is its fault tolerance: failures of individual machines can be tolerated at any time and absorbed by the remaining computers.
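
The paradigm is easiest to see in the canonical word-count job familiar from the Hadoop documentation: the map phase emits a (word, 1) pair for every token, and the reduce phase sums the counts per word. The sketch below follows that pattern; the input and output locations are placeholders passed on the command line.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /** Word count: the canonical illustration of the map/reduce paradigm. */
    public class WordCount {

        /** Map phase: emit (word, 1) for every token of every input line. */
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer tokens = new StringTokenizer(value.toString());
                while (tokens.hasMoreTokens()) {
                    word.set(tokens.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        /** Reduce phase: sum the counts collected for each word. */
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) {
                    sum += v.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output are HDFS paths passed on the command line.
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }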

However, the Hadoop implementation starts separate operating-system processes for each processing step, so every job typically carries an overhead of several seconds. If, as at Google or Yahoo, many files have to be read and processed in batch, this overhead is negligible. Interactive analyses, as users know them from relational databases, are not possible this way. Hadoop is therefore, at first, only suitable for storing and preprocessing big data. The framework cannot easily be used for business intelligence on poly-structured data in HDFS.

MapReduce jobs are essentially Java programs. With the additional software "Hive", however, Hadoop offers a relational wrapper: Hive translates classic SQL queries into MapReduce code and executes it. Access to Hadoop via Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC), the conventional connection paths to a relational database, is therefore possible in principle. However, Hive supports only basic SQL functionality and, because it relies on the MapReduce framework, works in a batch-oriented fashion.
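
In practice such access looks much like any other JDBC connection, as the following sketch against HiveServer2 shows. Hostname, port, database, table and credentials are illustrative assumptions; the query itself is executed as a batch MapReduce job rather than interactively.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    /**
     * Minimal sketch: querying Hive over JDBC via HiveServer2.
     * Host, port, database, table and credentials are placeholders.
     */
    public class HiveJdbcQuery {
        public static void main(String[] args) throws Exception {
            // Hive JDBC driver, shipped with the Hive client libraries.
            Class.forName("org.apache.hive.jdbc.HiveDriver");

            String url = "jdbc:hive2://hiveserver:10000/default";
            try (Connection conn = DriverManager.getConnection(url, "etl_user", "");
                 Statement stmt = conn.createStatement();
                 // The SQL-like query is translated into MapReduce work,
                 // so even a simple aggregation runs as a batch job.
                 ResultSet rs = stmt.executeQuery(
                         "SELECT country, COUNT(*) FROM web_logs GROUP BY country")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                }
            }
        }
    }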

Even if the license and hardware costs for Hadoop appear low at first glance, there are other cost factors that users need to consider. Specialized know-how is required both to operate the cluster and to access the data, and it often has to be brought in externally. In addition, the basic Apache version lacks the monitoring and security functions required for business-critical operation. Commercial distributions from Cloudera, Hortonworks or IBM remedy this, but in return incur license and maintenance fees that reduce the cost advantage.

[Embedded slideshow "Big data in numbers": Karl Valentin once coined the bon mot that something difficult is easy, which also applies to the big data trend: sensible in theory, difficult to implement. One of the charts, "What problems do you see when using big data?", shows that big data concepts often stall because the right skills are lacking. Figures in percent; n = 206; multiple answers possible; source: BARC.]

Linking relational databases and Hadoop

Despite these limitations, real big data applications (several terabytes of poly-structured data) can hardly be implemented economically without Hadoop. It is therefore not surprising that practically all well-known vendors in the database market are incorporating the open-source framework into their product portfolios in one way or another. An import/export connector for moving data directly between relational systems and Hadoop can already be regarded as standard today. The basis for this is usually "Sqoop", a tool from the Hadoop ecosystem, or a corresponding vendor-specific adaptation.
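
As an illustration of what such a transfer can look like, the following sketch drives a Sqoop import from Java by shelling out to the sqoop client, which is assumed to be installed on a cluster edge node. The JDBC URL, credentials, table name, HDFS target directory and degree of parallelism are placeholder assumptions.

    import java.util.Arrays;
    import java.util.List;

    /**
     * Minimal sketch: importing a relational table into HDFS with Sqoop.
     * The JDBC URL, table, credentials and target directory are placeholders;
     * the example simply shells out to the sqoop client on the PATH.
     */
    public class SqoopImportExample {
        public static void main(String[] args) throws Exception {
            List<String> command = Arrays.asList(
                    "sqoop", "import",
                    "--connect", "jdbc:mysql://dbhost:3306/sales", // source RDBMS
                    "--username", "etl_user",
                    "-P",                                          // prompt for the password
                    "--table", "orders",                           // table to copy
                    "--target-dir", "/user/etl/orders",            // HDFS target directory
                    "--num-mappers", "4");                         // parallel map tasks

            Process sqoop = new ProcessBuilder(command)
                    .inheritIO()   // show Sqoop's progress output on the console
                    .start();
            System.exit(sqoop.waitFor());
        }
    }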