Big Data and the Complex Hadoop Ecosystem

Nine years after being launched as an open-source Apache Software Foundation project by Yahoo engineers, Hadoop has become synonymous with big data. It has evolved into an ecosystem of interlinked components, and several ambitious startups have positioned themselves to offer variants customised with their own contributions.

Two of the technologies linked to Hadoop from the outset are MapReduce and the Hadoop Distributed File System (HDFS). The former is a computing paradigm whereby a task is broken into smaller tasks that can be executed in parallel (the Map step). The outputs of these tasks are then recombined (the Reduce step).
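To make the Map and Reduce steps concrete, here is a minimal sketch in plain Python of the classic word-count job. It is not the Hadoop API: the function names (map_step, shuffle, reduce_step, word_count) are hypothetical and everything runs in a single process, whereas a real cluster would spread the Map and Reduce calls across many nodes.

```python
from collections import defaultdict


def map_step(document):
    """Map: turn one input record into (key, value) pairs.

    Each call depends only on its own record, so calls can run in parallel.
    """
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group values by key so that all values for a key reach one reducer."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: recombine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Map every document independently, then group and reduce by word.
    mapped = [pair for doc in documents for pair in map_step(doc)]
    grouped = shuffle(mapped)
    return dict(reduce_step(key, values) for key, values in grouped.items())


counts = word_count(["big data big clusters", "data moves to clusters"])
print(counts)
```

The intermediate grouping step is what lets each Reduce call work on one key in isolation, which is the property that makes the job parallelisable.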

The latter is a scalable storage system that provides several features to support the MapReduce process. HDFS tracks where each block of data is stored across the available nodes, which allows Map tasks to be allocated to nodes that already hold the data they need. When (not if) node failure occurs, the node mapping is reconfigured and tasks are dynamically reallocated. Data loss in this scenario is mitigated by keeping redundant copies of each block of data on different nodes.
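The idea of redundant copies surviving a node failure can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not HDFS code: the function names are hypothetical, placement is simple round-robin (real HDFS placement also takes rack layout into account), and it assumes there are always at least as many live nodes as the replication factor. The factor of 3 matches the HDFS default.

```python
def place_blocks(block_ids, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin."""
    placement = {}
    for i, block in enumerate(block_ids):
        placement[block] = [nodes[(i + r) % len(nodes)] for r in range(replication)]
    return placement


def handle_failure(placement, failed, live_nodes):
    """Remove a failed node and restore the lost copy on another live node."""
    for block, holders in placement.items():
        if failed in holders:
            holders.remove(failed)
            # Copy from a surviving replica to the first live node
            # that does not already hold this block.
            spare = next(n for n in live_nodes if n not in holders)
            holders.append(spare)
    return placement


nodes = ["n1", "n2", "n3", "n4"]
placement = place_blocks(["b0", "b1", "b2"], nodes)

# Simulate the loss of node n2 and re-replicate the affected blocks.
live = [n for n in nodes if n != "n2"]
placement = handle_failure(placement, "n2", live)
print(placement)
```

Because every block starts with three holders, losing a single node never removes the last copy, and the replication level can be restored from the survivors.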

From these founding technologies, and after nearly a decade of steady evolutionary improvement, Hadoop finds itself positioned as the default gateway to storing and processing big data. Its concepts provide a way to process huge data sets by distributing them across commodity servers, removing a dependency on specialised high-end hardware. There is still, however, a price to be paid.

The learning curve for MapReduce can be steep for newcomers as its process is not intuitive. Decomposing a computational workflow into non-interacting parallelisable map/reduce function pairs is a challenge. For many, it requires a paradigm shift in analytical thinking. In addition, the overhead that Hadoop adds to schedule and manage the computational process means that even simple jobs cannot run quickly – certainly not in real time – so speed and responsiveness are issues. Hadoop is best suited for scheduled batch jobs.

Making Hadoop More Accessible

Into the breach have stepped a number of solution vendors: Hortonworks, Cloudera and MapR are prominent names. Each offers an enterprise-level distribution of the Hadoop project, with differentiated features added in areas such as security, job management, governance and performance improvements. Amazon Web Services offers a cloud-based version called Elastic MapReduce, which can prove cost-competitive for entry-level and intermittent Hadoop usage.

2013 marked a turning point in the evolution of the big data ecosystem, with the release of Hadoop 2.

With Hadoop 2 came a crucial change: a stratification of functionality into separate system access and application framework layers. While Hadoop 1 could be simplistically viewed as a MapReduce application accessing an HDFS layer, Hadoop 2 introduced an intervening layer called Yet Another Resource Negotiator (YARN). As the name suggests, YARN provided a standardised way for applications to request the computing resources of the cluster that hosts HDFS and, in doing so, opened up the Hadoop environment to alternative application frameworks.

[Figure: Hadoop V1 to V2]

And along comes Spark

Nearly two years later, one name is prominent among those alternatives. Apache Spark is a different approach to data computing that buys speed with heavy usage of memory for dataset management and processing.

The memory requirement is a potential cost barrier to uptake, but one that buys many gains in areas where Hadoop MapReduce falls short. Another plus is Spark’s support for modern languages (Scala, Java 8 with lambda expressions, Python), which in turn confers coding concision.

Spark’s architectural concept has not been restricted from the outset to the confines of the MapReduce approach. Rather, it has at its core the Resilient Distributed Dataset (RDD), a fundamental abstraction for distributed data and computation. Spark transparently manages the splitting of an RDD into partitions as well as the processing of those partitions on multiple nodes of a cluster.
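The partitioning idea can be sketched without Spark itself. The following plain-Python example is only an analogy and makes several assumptions: the helper names are hypothetical, a thread pool stands in for the nodes of a cluster, and none of the resilience of a real RDD (rebuilding lost partitions) is modelled.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_partitions):
    """Split a list into num_partitions roughly equal slices."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def map_partitions(partitions, func):
    """Apply func to every element, one worker task per partition."""
    with ThreadPoolExecutor() as pool:
        # pool.map preserves the order of the input partitions.
        return list(pool.map(lambda part: [func(x) for x in part], partitions))


parts = partition(list(range(10)), 3)
squared = map_partitions(parts, lambda x: x * x)

# Combine the per-partition results into a single answer.
total = sum(sum(part) for part in squared)
print(parts, total)
```

The caller only describes the per-element work; how the data is sliced and where each slice is processed is decided by the helper functions, which is the kind of transparency the RDD offers at cluster scale.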

Spark: Options, Speed and Potential

The Spark user has many more operations than MapReduce in the RDD toolset. Spark will lazily evaluate these operations – that is, it will consider them as a chain of transformations – and will compute just what is required for the outcome, and only when it is required. This minimises system transactions and boosts the speed of processing – in optimal scenarios, by a factor of a hundred.
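Lazy evaluation is easier to appreciate with a small demonstration. The class below is a hypothetical stand-in written in plain Python, not Spark's RDD API: it records map and filter steps as a chain and runs them only when a result is requested, and then only for as many elements as the request needs.

```python
from itertools import islice


class LazyDataset:
    """Records transformations and defers all work until an action is called."""

    def __init__(self, source, steps=()):
        self._source = source
        self._steps = steps  # recorded transformations, not yet executed

    def map(self, func):
        return LazyDataset(self._source, self._steps + (("map", func),))

    def filter(self, predicate):
        return LazyDataset(self._source, self._steps + (("filter", predicate),))

    def _iterate(self):
        # Build a chain of lazy iterators; no element is processed here.
        stream = iter(self._source)
        for kind, func in self._steps:
            stream = map(func, stream) if kind == "map" else filter(func, stream)
        return stream

    def take(self, n):
        """Action: compute only as many elements as needed to return n results."""
        return list(islice(self._iterate(), n))


calls = []


def expensive_square(x):
    calls.append(x)  # record every element that is actually processed
    return x * x


pipeline = LazyDataset(range(1_000_000)).map(expensive_square).filter(lambda v: v % 2 == 0)
calls_before_action = list(calls)  # still empty: nothing has run yet

first_three = pipeline.take(3)
print(first_three, calls)
```

Although the source holds a million elements, only the first five are ever squared, because that is all that is needed to produce three even results.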

The generalised computational approach means that many different variants of data processing can be supported. Spark’s strength is in supporting iterative algorithms that make multiple passes over in-memory datasets. Machine learning is a prime example of this type of computation.

Because Spark’s architecture is centred around the RDD, and because this is such a fundamental keystone of most data processing approaches, libraries supporting disparate computational approaches can be tightly integrated, sharing the RDD data and functions. In contrast, although Hadoop has grown an ecosystem of powerful add-on packages with similar roles (such as Apache Mahout for machine learning), data sharing between applications requires additional steps to persist interim data.

Spark’s potential has triggered a meteoric rise in mindshare in the Big Data field. Development activity, as measured both by software contributions to its open-source codebase and by the calibre of those contributors, is burgeoning. Many of the established Hadoop-based products are updating to integrate with Spark. Mahout, for example, will no longer add new MapReduce algorithms, instead switching to a format designed to work with Spark. The Hadoop solution vendors are quickly adding Spark support to their services.

Hadoop vs Spark or Hadoop AND Spark?

In 2015, the organisation embarking or continuing on its Big Data journey has the benefit of multiple options. The Hadoop name today encompasses a core established product with a retinue of bolt-on enhancements and experienced commercial support. Spark is the sleek, trending, second-generation upstart.

The best news of all is that this is not an either/or situation. Hadoop 2’s modularisation and standardisation of access allows both products to operate on the same cluster. If the organisation has already migrated its datasets into HDFS, the data can remain where it is. Application toolsets can be evaluated based solely on the nature of the data operation.

Hadoop still has its place. MapReduce remains an effective way to implement scheduled data batch jobs, when latency is not an issue. At the petabyte level, data shuffling in memory would erode Spark’s edge over Hadoop. The extra costs of purchasing enough memory for Spark may not be justified for a straightforward MapReduce operation.

The prudent option for an organisation already invested in Hadoop would be to install Spark in its data operations tech stack.

With Spark installed in the stack to augment Hadoop, the organisation must re-evaluate existing data operations to see if the benefits of reduced latency justify the migration costs. Even reimplementing a MapReduce operation in Spark can provide a speedup. As the organisation’s data operations grow and diversify, Spark is likely to be a more effective bridge across a pipeline of heterogeneous data processing requirements.

It is worth noting that Spark can run completely independently of Hadoop technologies on clusters based on local filesystems, Apache Mesos and Amazon EC2. Apache Tachyon, a next-generation update on HDFS, is progressing towards a version 1.0 release.

The only constant, it would seem, is change. The co-operative evolution of these products in parallel demands of the customer an ongoing evaluative initiative to derive best value, but the payoff has been and will continue to be an explosive growth in the capabilities of all products on the market.