Is Hadoop necessary for data scientists?

Apache Spark versus Hadoop

Apache Spark is the latest trend in big data, analytics and data science. Many Spark proponents already believe that the new platform overshadows everything else to such an extent that it will soon be the dominant tool for all data scientists. The belief is not unfounded: Spark's high performance on very large data sets has led to it being viewed as the successor to Hadoop. "From a technical point of view, Spark is a significantly faster and more powerful engine than Hadoop," said Reynold Xin, data engineer and co-founder of Databricks, the company leading the Apache Spark project. Forrester analyst Mike Gualtieri also sees Spark at an advantage because of its faster processing. "Hadoop was built for big data, Spark for high speeds," he enthuses.

Record-breaking performance

Spark's performance first gained public recognition when it set a new record in the Daytona GraySort benchmark last year. The test requires sorting 100 TB of data. Databricks set up 206 machines with almost 6,600 cores for the purpose. Spark needed only 23 minutes for the sort job - significantly less than the previous record of 72 minutes, held by Yahoo with Hadoop. It should also be noted that the Hadoop run used 2,100 nodes with over 50,000 cores.
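
To make the comparison concrete, the figures quoted above can be put into a short calculation. The following is only a rough sketch in Python: the class and variable names are made up for illustration, and the core counts are rounded ("almost 6,600" is taken as 6,600, "over 50,000" as 50,000), so the per-core result is an approximation rather than an official benchmark figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SortRun:
    """One 100 TB sort run, using only the figures quoted in the article."""
    name: str
    data_tb: float
    minutes: float
    nodes: int
    cores: int  # rounded; the article only gives approximate core counts

    @property
    def node_minutes(self) -> float:
        # Total machine time consumed: wall-clock minutes times node count.
        return self.minutes * self.nodes

    @property
    def core_minutes(self) -> float:
        # Total core time consumed (approximate, see note on cores above).
        return self.minutes * self.cores

    @property
    def tb_per_node_hour(self) -> float:
        # Data sorted per node per hour.
        return self.data_tb / (self.node_minutes / 60)


spark = SortRun("Spark", data_tb=100, minutes=23, nodes=206, cores=6600)
hadoop = SortRun("Hadoop", data_tb=100, minutes=72, nodes=2100, cores=50000)

# How much faster Spark finished, and how much less hardware time it used.
wall_clock_speedup = hadoop.minutes / spark.minutes
node_minutes_ratio = hadoop.node_minutes / spark.node_minutes
core_minutes_ratio = hadoop.core_minutes / spark.core_minutes

if __name__ == "__main__":
    print(f"Wall-clock speedup:     {wall_clock_speedup:.1f}x")
    print(f"Node-minutes (H/S):     {node_minutes_ratio:.1f}x")
    print(f"Core-minutes (H/S):     {core_minutes_ratio:.1f}x (approximate)")
    for run in (spark, hadoop):
        print(f"{run.name}: {run.tb_per_node_hour:.2f} TB per node-hour")
```

Under these assumptions, Spark finished roughly three times faster in wall-clock terms, while the Hadoop run consumed about 32 times as many node-minutes and about 24 times as many core-minutes. The gap in hardware efficiency is therefore considerably larger than the gap in elapsed time suggests.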

Many more features than Hadoop

But much better performance alone is not enough to predict the end of Hadoop. Flexibility and range of applications are at least as important. Spark can be used in conjunction with different data platforms. It also offers native support for in-memory processing, including optimized distribution of data between memory and hard disk. In this respect, those who expect Hadoop to end soon seem to be right.

Friend and foe at the same time

In fact, Spark can be either a dominant competitor or an excellent addition to Hadoop. Gualtieri puts his praise of Spark into perspective: "If you consider that opposites attract, then Spark and Hadoop form a perfect team. After all, both are cluster platforms that can be distributed over many nodes and have very different advantages and disadvantages."

The Hadoop specialist Cloudera emphasizes the market's combined interest in the two platforms. "Anyone who adopts Hadoop today assumes that Spark is part of it," says its chief technologist, Eli Collins. This fits his view that Spark is just one of many Hadoop tools, similar to MapReduce, Drill, Impala and a few others.

Universal data platforms

But Spark differs from many other tools in one essential respect: it does not have to run on Hadoop's HDFS file system. It is just as efficient when operated with other data platforms such as AWS S3, HBase or Apache Cassandra. Cassandra is now becoming the preferred data platform for Spark. According to a study by Typesafe, 20 percent of all Spark instances already run on Cassandra.