How similar big data and Hadoop are

Hadoop opens up big data for data warehouses

The open source framework Hadoop complements data warehouse systems in the storage and distributed processing of very large amounts of data. Because the necessary know-how is still scarce on the market, extensive consulting is required. (Edition 10/2013)

Data volumes are increasing rapidly worldwide. Responsible for this are not only smartphones, social media and collaboration platforms, but also RFID (Radio Frequency Identification) and sensor applications. Further real-time information comes from sources ranging from production and storage facilities via urban street-lighting networks and wind farms to private electricity meters, automobiles and refrigerators. When it comes to data volumes, we are now talking about petabytes, exabytes and zettabytes.
In addition to the structured data that has always been managed and evaluated via a data warehouse, there is a growing amount of unstructured data in the form of text files, audio and video streams, describing, for example, purchasing behavior in web shops or user behavior in online gaming. Dealing with this data requires new technological concepts and tools. The Apache Hadoop framework is one of these technologies. It enables the storage of huge volumes of data of many types as well as their distributed, parallel processing.

Hadoop is an open source framework based on the Hadoop Distributed File System (HDFS) and the MapReduce algorithm. Technologies such as Hive, HBase, Pig and Mahout are added from the ecosystem. (Source: white duck)

Big data opens up new business models for companies
Using data streams to generate decision-relevant information has always been the goal of business intelligence (BI) providers. They were among the first to embrace big data. The new technologies expand the possibilities for including unstructured data in evaluations alongside quantitative data, and they promise new business models under the banner of big data analytics.
One example is the electricity supply industry: Today, the electricity meter is usually read manually once a year by an employee of the electricity supplier, entered via a form and imported into a data warehouse by an ETL process (extraction, transformation, loading). The result is a rigid electricity price model. As an alternative, electricity consumption could in future be read almost in real time through automated queries of sensor data. Hadoop provides a technology that can pre-filter these large amounts of data; the extracts from Hadoop are then passed to a data warehouse for further processing and are available for evaluation with classic BI tools. In this way, price models can be created that are based on consumption patterns.
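To make the pre-filtering step more concrete, the following minimal sketch shows a map-only Hadoop MapReduce job in Java that keeps only plausible meter readings before the result is exported to the data warehouse. The input format (semicolon-separated lines of meter ID, timestamp and kilowatt-hour value), the paths and the filter condition are assumptions made purely for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: pre-filter raw smart-meter readings in HDFS so that only
// relevant records are handed over to the data warehouse.
public class MeterPrefilterJob {

    public static class FilterMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed record layout: meterId;timestamp;kwh
            String[] fields = line.toString().split(";");
            if (fields.length != 3) {
                return; // drop malformed records
            }
            try {
                double kwh = Double.parseDouble(fields[2]);
                if (kwh > 0.0) { // illustrative relevance filter
                    context.write(line, NullWritable.get());
                }
            } catch (NumberFormatException e) {
                // skip unparsable readings
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "meter-prefilter");
        job.setJarByClass(MeterPrefilterJob.class);
        job.setMapperClass(FilterMapper.class);
        job.setNumReduceTasks(0); // map-only: filtering, no aggregation
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // raw sensor data in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // extract for the ETL/DW side
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The filtered extract in the output directory could then be loaded into the warehouse via the usual ETL route.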
For many business intelligence providers, expanding their product portfolio with Hadoop as a big data technology was therefore a logical step, and it has several advantages: First of all, NoSQL data (not only SQL) can be handled via Hadoop, in other words amounts of data that go beyond the purely relational approach. Furthermore, Hadoop scales horizontally and runs on inexpensive standard hardware. The areas of application of the technology range from purely technical use as a staging area, in which the relevant data is first collected one-to-one from the source systems before transformation, to resource-intensive simulations in sandboxing and clickstream analyses.

From HDFS to MapReduce to Hive and HBase
Hadoop is a free, Java-based open source framework built on the distributed file system HDFS (Hadoop Distributed File System) and the MapReduce algorithm patented by Google in 2010. It is called a framework because it is not a solution in the narrower sense but an ecosystem that in turn contains a large number of other technologies. These include the data warehouse functionality of Hive, the column-oriented database HBase, the data integration tool Pig and the machine learning library Mahout, which covers data mining requirements.
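To give a rough impression of the HDFS layer on which MapReduce, Hive and the other ecosystem components build, here is a small, hedged Java sketch using the Hadoop FileSystem API; the NameNode address and the file path are assumptions for illustration only.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal round trip against HDFS: write a file, read it back.
public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/demo/readings.txt");
        try (FSDataOutputStream out = fs.create(file, true)) {
            // The file is split into blocks and replicated across DataNodes.
            out.write("meter-42;2013-10-01T12:00;1.7\n".getBytes("UTF-8"));
        }

        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(file), "UTF-8"))) {
            System.out.println(in.readLine()); // the client sees one logical file
        }
    }
}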
Hive, originally developed by Facebook, builds on Hadoop's core functionality and enables a cost-effective, highly scalable data warehouse infrastructure. Scalability of this kind could not be achieved with classic technologies. At Facebook, for example, reporting, ad hoc analyses and machine learning applications run in the enormously growing Hive/Hadoop data warehouse.
A big advantage of Hive is that it uses the SQL-like query language HiveQL, which makes it easier for users with SQL know-how to get started. HiveQL is translated internally into MapReduce jobs; the end user does not have to deal with this. Hive is also an open technology: a number of analytical databases, such as Microsoft SQL Server Analysis Services, can extract Hive data.
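As a sketch of what this looks like in practice, the following Java fragment runs a HiveQL aggregation through Hive's JDBC driver for HiveServer2. The connection URL, the credentials and the table and column names are illustrative assumptions; Hive itself translates the statement into MapReduce jobs.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Hedged example: issuing a SQL-like HiveQL query from a Java client.
public class HiveQlExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver"); // HiveServer2 JDBC driver
        try (Connection con = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver:10000/default", "hive", "");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(
                 // Aggregate consumption per meter and day (assumed table/columns).
                 "SELECT meter_id, to_date(reading_ts) AS reading_day, SUM(kwh) AS total_kwh "
                 + "FROM meter_readings GROUP BY meter_id, to_date(reading_ts)")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + " " + rs.getString(2)
                        + " " + rs.getDouble(3));
            }
        }
    }
}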
As a NoSQL database, HBase is particularly suitable for near-real-time read and write access to big data and for special requirements in data modeling and distribution, for example for unstructured data on Web 2.0 sites. For this reason, HBase is more interesting for application developers than for BI users. HBase, for example, is the basis for Facebook's messaging system.
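A short, hedged sketch of such near-real-time access with the classic HBase Java client API from that period; the table name, the row key layout and the column names are assumed purely for illustration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative single-row write and read; both go directly to the region server
// that owns the row, which keeps point access fast even on very large tables.
public class HBaseQuickAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
        HTable table = new HTable(conf, "messages");      // assumed table with column family "m"
        try {
            byte[] rowKey = Bytes.toBytes("user42#2013-10-01T12:00");

            Put put = new Put(rowKey);
            put.add(Bytes.toBytes("m"), Bytes.toBytes("body"), Bytes.toBytes("Hello HBase"));
            table.put(put);

            Result result = table.get(new Get(rowKey));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("m"), Bytes.toBytes("body"))));
        } finally {
            table.close();
        }
    }
}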
Facebook, AOL and Yahoo use Hadoop for their own big data analyses and are also developing the framework further. Given such prominent users, it is foreseeable that Hadoop will gradually become a basic technology. A diverse ecosystem has developed around Hadoop: distributors such as Hortonworks, which offers the only Windows-based Hadoop distribution, and Cloudera have set out to package and market the large number of different Hadoop components with their own enterprise-grade tools.
Most of the major BI manufacturers do not have their own Hadoop distributions. Rather, they integrate existing offerings into their data platforms and use them to bundle big data solutions. Examples include Microsoft HDInsight (based on Hortonworks), available in the cloud on Windows Azure and on-premises for Windows Server, as well as the Oracle Big Data Appliance (based on Cloudera) and the Teradata Appliance for Hadoop (based on Hortonworks).

The technical resources for Hadoop projects are scarce
Consultants add maintenance and service capacities; together with the distributors' enterprise tools, they make the core technology usable in a professional corporate environment. However, building big data analytics with the very technology-heavy Hadoop stack requires technical components and human resources that are currently in short supply. The bottom line is comparatively high costs. The development and administration tools are also still at a relatively early technical stage, which makes automating standard processes complex.
Although Hadoop's architecture gives it enormously high scalability compared with conventional relational databases, it requires storage and computing capacity that can be used flexibly, which in classic on-premise operation can quickly lead to high costs despite the low price of the hardware. Operation within a cloud infrastructure, with its resource-based billing models and scale-out options, offers a cost-effective alternative and is generally recommended for running a Hadoop cluster.

Hadoop and data warehouse complement each other
Hadoop and NoSQL are fundamentally different things, but there is some uncertainty about the differences. NoSQL represents an alternative to the conventional database: it is a non-relational database that does not have a fixed table schema (this is also called schemaless database technology) and can be scaled horizontally by adding additional systems. There are more than 100 NoSQL databases (a detailed market overview can be found at nosql-database.org), including, for example, Google BigTable, Amazon Dynamo, MongoDB and CouchDB. Most of them are open source products. They can, but need not, be based on Hadoop technologies. Hadoop, by contrast, is an ecosystem that includes applications such as the column-oriented database HBase and the data warehouse software Hive. As comparatively less technical, more subject-oriented solutions, these in turn form a practicable bridge in BI scenarios between Hadoop on the one hand and the traditional data warehouse on the other.
Hive supports classic data warehouse operations such as data aggregation in Hadoop. Hive uses HiveQL, a SQL-like language that makes it easy for users with SQL know-how to get started.
HBase, on the other hand, is suitable as a NoSQL database for applications with special requirements in terms of data modeling, distribution and scalability.
Modern data storage platforms must be able to run in-house as well as in the cloud, or in a combination of both. Companies want this, or they are required to keep their data in their own data center. In addition, they need the ability to handle structured and unstructured data of any size and to evaluate it using the common BI tools. It is also important that Hadoop solutions can be combined with classic data warehouse solutions. The point is not to replace one solution with the other, but to use both data storage strategies in a complementary way, with the goal of efficiency- and performance-optimized scenarios. If, for example, Microsoft SQL Server handles the relational storage of structured data in the classic data warehouse, connected Hadoop components can cover the unstructured side.

The technology is still at a very early stage
Even though the key technologies are already in place, we are currently still on the technology-driven side of the development cycle. The corresponding applications are not yet available on the market, even though the marketing statements of some providers suggest otherwise. In the next step, the new approaches to NoSQL data management and the evaluation options for enormous amounts of data must therefore be gradually turned into concrete solutions.
Technical restrictions have also hindered acceptance so far: Hadoop is not yet a real-time platform. Like most Hadoop components, HDFS works in a batch-oriented manner, which limits the speed and frequency of data loading and extraction. Real-time BI requirements therefore cannot be implemented directly in Hadoop. New projects such as Cloudera Impala, which offer an SQL interface to HDFS and HBase data, are intended to make such requirements easier to meet.
Alternatives to Hadoop as a distributed computing framework include the open source platform HPCC (High-Performance Computing Cluster) and Twitter Storm. NoSQL databases can also cover many big data requirements, and NoSQL databases are available within Hadoop as well (HBase, for example). Nevertheless, Hadoop is establishing itself as the standard in the BI context because the major BI providers rely on it. Microsoft's decision to throw all of its own previous approaches overboard and jump on the bandwagon confirms this.

Great benefit must be weighed against high risk
Anyone who has large volumes of unstructured data and would like to build new business models on this basis can do so today; Hadoop offers interesting development paths for this. So far, however, users have had to set out on the rocky road to implementation largely on foot, so to speak. It is up to each company to weigh the effort and the risks against the benefits and, as an early adopter, to secure a lucrative share of the market.
It will probably take three to four years before the first Hadoop-based specialist BI applications find their way into companies. That is why, in the BI context, it is important to choose the right components from the Hadoop ecosystem, decide between an open source and an enterprise distribution, and carefully weigh the entire architecture, i.e. Hadoop and the data warehouse together. To assess the benefits, a business case should always be drawn up and the Hadoop project started with a proof of concept.
The good news at the end: even well before standard off-the-shelf products become available, both the technologies and the specialists are already there to tackle big data analytics based on Hadoop. jf

The experts

Ilias Michalarias is a business analyst at white duck.

Markus Sümmchen is managing director and partner at white duck, a consulting company for cloud-based software development.