What is SQL on Hadoop?
Hadoop and SQL are moving closer together
Innovative solutions are putting the long-established database vendors under pressure to link Hadoop with their relational database systems. Classic relational database systems often reach their limits in the big data environment: large databases in the multi-terabyte range require fast storage, high-performance servers and therefore usually many CPU licenses. As a complement, companies are now increasingly using the open-source framework Hadoop. Apache Hadoop essentially consists of two components:
the Hadoop Distributed File System (HDFS), a distributed and highly available file system, and
MapReduce, a program library for the parallel processing of files stored in HDFS.
Strengths and weaknesses of Hadoop
Behind MapReduce lies the programming paradigm of executing computational tasks in a highly parallelized manner on a large number of compute nodes. MapReduce is essentially based on ideas from Google, which originally developed the algorithm to build its search index from web content. A major goal of the approach is the parallel processing of jobs on a low-cost hardware infrastructure consisting of commodity servers. Besides its scalability, the great advantage of such a system is its fault tolerance: failures of individual machines can be tolerated at any time and absorbed by the remaining computers.
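The map/shuffle/reduce phases described above can be sketched in plain Python. This is a toy, single-machine illustration of the paradigm (word count, the classic example); real Hadoop distributes each phase across many nodes:

```python
from collections import defaultdict

# Map phase: each input line is turned into (key, value) pairs.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: pairs are grouped by key
# (Hadoop performs this grouping across the network).
def shuffle_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: each key's values are aggregated independently,
# which is why reducers can run in parallel on different machines.
def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data needs big clusters", "big jobs run in parallel"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["big"])  # → 3
```

Because each reducer only sees the values for its own key, adding more machines speeds the job up without changing the program logic.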
However, the Hadoop implementation starts separate operating-system processes for each processing step, so a job typically incurs several seconds of overhead. If, as at Google or Yahoo, many files have to be read and processed in batch, this overhead is negligible. Interactive analyses, as users know them from relational databases, are not possible this way. Hadoop is therefore initially suitable only for storing and preprocessing big data; the framework cannot easily be used for business intelligence on poly-structured data in HDFS.
MapReduce jobs are essentially Java programs. With the add-on software "Hive", however, Hadoop offers a relational wrapper: Hive translates classic SQL queries into MapReduce code and executes it. Access to Hadoop via Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC), the conventional connection paths to a relational database, is therefore possible in principle. However, Hive supports only basic SQL functionality and, because it relies on the MapReduce framework, works in a batch-oriented manner.
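To illustrate the kind of translation Hive performs, the following sketch hand-converts a simple aggregation query into map and reduce steps. The employees dataset and the query are made up for illustration; Hive itself would generate equivalent MapReduce jobs in Java rather than Python:

```python
from collections import defaultdict

# HiveQL to translate: SELECT dept, COUNT(*) FROM employees GROUP BY dept
rows = [("alice", "sales"), ("bob", "sales"), ("carol", "hr")]

# Map: emit (dept, 1) for every input row.
mapped = [(dept, 1) for _name, dept in rows]

# Shuffle: group the emitted pairs by department.
grouped = defaultdict(list)
for dept, one in mapped:
    grouped[dept].append(one)

# Reduce: COUNT(*) becomes a per-group sum.
result = {dept: sum(ones) for dept, ones in grouped.items()}
print(result)  # → {'sales': 2, 'hr': 1}
```

Every such query launches a full batch job, which is why Hive answers in minutes rather than the sub-second times expected of an interactive relational database.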
Even if the license and hardware costs for Hadoop appear low at first glance, there are further cost factors users need to consider. Specialized knowledge is required both for operating the cluster and for accessing the data, and it often has to be bought in externally. In addition, the basic Apache version lacks the monitoring and security functions required for business-critical operation. Commercial distributions from Cloudera, Hortonworks or IBM can remedy this, but in return incur license and maintenance fees that reduce the cost advantage.
- Big data in numbers
Karl Valentin once coined the bon mot that difficult things are easy. The same can certainly be said of the big data trend: sensible in theory, difficult to implement. We provide a few facts.
- Chart: "What problems do you see when using big data?" Big data concepts are often not pursued because the right skills are lacking. (Figures in percent; n = 206; multiple answers possible; source: BARC)
Linking relational databases and Hadoop
Despite these limitations, genuine big data applications (several terabytes of poly-structured data) can hardly be implemented economically without Hadoop. It is therefore not surprising that practically all well-known vendors in the database market incorporate the open-source framework into their product portfolios in one way or another. An import/export connector for direct data exchange with Hadoop can already be regarded as standard today. The basis for this is usually "Sqoop", an application from the Hadoop ecosystem, or a corresponding vendor-specific adaptation.
Many database systems offer the alternative of accessing files via SQL using so-called external tables. The data, stored for example as text files, is not loaded into the relational database; instead, the files are declared as external tables. This approach is also called "schema-on-read", because the relational schema is only superimposed on the files as a virtual structure, so that different schemas are possible for one and the same text file. Oracle, Teradata Aster, EMC Greenplum and Microsoft have implemented this mechanism.
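The schema-on-read idea can be sketched in a few lines: the raw file stays untouched, and each "external table" is merely a parsing rule applied at query time. File contents and both schemas below are invented for illustration:

```python
import csv
import io

# The raw file as it might sit in HDFS: plain delimited text, no schema attached.
raw = "2014-05-01;4711;199.90\n2014-05-02;4712;49.50\n"

# Schema 1: interpret the columns as a typed orders table.
def read_as_orders(text):
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [{"date": d, "order_id": int(o), "amount": float(a)}
            for d, o, a in reader]

# Schema 2: a different virtual structure over the very same file,
# keeping all fields as opaque strings.
def read_as_log(text):
    reader = csv.reader(io.StringIO(text), delimiter=";")
    return [{"day": d, "fields": [o, a]} for d, o, a in reader]

orders = read_as_orders(raw)
log = read_as_log(raw)
print(orders[0]["amount"])  # → 199.9
print(log[0]["fields"])     # → ['4711', '199.90']
```

Because the schema lives in the query rather than in the storage layer, changing it requires no reload of the data, which is exactly the flexibility the external-table mechanism offers.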