Pepperdata Project Hosts HDFS on Kubernetes

November 14, 2017 / November 17, 2017 · Mike Vizard · Tags: analytics, Apache Spark, big data, Hadoop File System, HDFS, HDFS on Kubernetes, kubernetes, open source, Pepperdata

by Mike Vizard

Pepperdata has launched a project that aims to let the Apache Spark in-memory computing framework for big data analytics applications run against HDFS hosted on Kubernetes.

Pepperdata CTO Sean Suchter says the Hadoop Distributed File System (HDFS) on Kubernetes open-source project, hosted on GitHub, seeks to take advantage of a unique opportunity to unify the underlying infrastructure that supports both big data and traditional applications. One of the primary issues IT organizations regularly encounter when deploying big data applications is that those applications require dedicated infrastructure. The HDFS on Kubernetes project eliminates that requirement by making it possible to deploy Apache Spark on top of HDFS running on Kubernetes, which provides a common layer of abstraction across multiple types and classes of IT infrastructure, he explains.

Suchter says this is significant because organizations increasingly need to apply advanced analytics and machine learning algorithms across all the data they possess. Deploying on HDFS makes it possible to manage all the silos where that data resides in a consistent way, regardless of whether they sit on-premises or in a cloud that supports, for example, the S3 interface defined by Amazon Web Services (AWS).

Applications will be able to access data via HDFS running on top of Kubernetes or another platform. Over time, however, Kubernetes presents a unique opportunity to rationalize those platforms by essentially taking over the role of the Yet Another Resource Negotiator (YARN) facility that underpins most deployments of HDFS and Apache Spark, Suchter says.
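From the point of view of someone submitting a job, swapping YARN for Kubernetes is largely a matter of pointing Spark at a different cluster manager. The sketch below is illustrative only and is not taken from the project. It assumes a Spark release with native Kubernetes support (Spark 2.3 or later, which uses the `k8s://` master prefix), and the `build_spark_submit` helper, API server address, container image, NameNode host, class name, and jar path are all hypothetical placeholders.

```python
from typing import List, Optional


def build_spark_submit(app: str, main_class: str, master: str,
                       image: Optional[str] = None,
                       executors: int = 2) -> List[str]:
    """Build a spark-submit argument list for YARN or Kubernetes.

    Hypothetical helper for illustration; it only assembles arguments
    and does not launch anything.
    """
    cmd = [
        "spark-submit",
        "--master", master,
        "--deploy-mode", "cluster",
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
    ]
    if master.startswith("k8s://"):
        # On Kubernetes, executors run in containers, so an image is needed.
        if image is None:
            raise ValueError("a container image is required on Kubernetes")
        cmd += ["--conf", f"spark.kubernetes.container.image={image}"]
    cmd.append(app)
    return cmd


# Placeholder values, assumed for the example.
APP = "hdfs://namenode.example:8020/jobs/wordcount.jar"
MAIN = "example.WordCount"

yarn_cmd = build_spark_submit(APP, MAIN, "yarn")
k8s_cmd = build_spark_submit(
    APP, MAIN,
    "k8s://https://k8s-api.example:6443",
    image="example/spark:latest",
)

print(" ".join(yarn_cmd))
print(" ".join(k8s_cmd))
```

In both cases the application reads its input through the same `hdfs://` URI; only the cluster manager and the container image setting change.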

HDFS on Kubernetes also includes a data locality function that speeds access to data across distributed instances of HDFS on Kubernetes, as well as support for Kerberos-based authentication, both to secure access to the overall environment and to protect the credentials used to access applications.
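The data locality idea can be illustrated with a toy scheduler: given the cluster nodes that hold replicas of an HDFS block, prefer an executor running on one of those nodes and fall back to any executor otherwise. This is a minimal sketch of the general technique, not the project's implementation. The `pick_executor` helper and the node and executor names are hypothetical. Because pods on Kubernetes typically have their own addresses, the sketch assumes each executor has already been mapped to the node it runs on.

```python
from typing import Dict, List, Tuple


def pick_executor(block_nodes: List[str],
                  executor_nodes: Dict[str, str]) -> Tuple[str, bool]:
    """Return (executor, is_local) for a task that reads one HDFS block.

    block_nodes: nodes holding a replica of the block.
    executor_nodes: executor name -> node the executor runs on.
    """
    if not executor_nodes:
        raise ValueError("no executors available")
    replicas = set(block_nodes)
    ordered = sorted(executor_nodes)  # deterministic choice for the example
    for executor in ordered:
        if executor_nodes[executor] in replicas:
            return executor, True  # reads the block from its own node
    return ordered[0], False  # no local replica: read over the network


executors = {"exec-1": "node-a", "exec-2": "node-b", "exec-3": "node-c"}
print(pick_executor(["node-b", "node-c"], executors))  # ('exec-2', True)
print(pick_executor(["node-z"], executors))            # ('exec-1', False)
```

A real scheduler would also weigh load and rack topology; the point here is only that a local replica lets a task avoid a network hop.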

It’s still early days as far as the HDFS on Kubernetes project is concerned, so Pepperdata is looking to work only with a few intrepid IT organizations willing to contribute code and test the overall environment, Suchter says. HDFS on Kubernetes is nowhere near being ready to support production applications. But given the level of support being provided to Kubernetes by companies such as Google and Red Hat, Suchter says it’s now only a matter of time before Kubernetes becomes a de facto standard across the enterprise.

It’s unknown how the major vendors in the big data community will respond to the rise of Kubernetes. Many of them have spent years developing proprietary platforms to host HDFS. However, it’s arguable that the underlying infrastructure used to host HDFS doesn’t offer any long-term differentiated value in a world where there is now a common layer of abstraction supported by multiple classes of applications.

Any shift to Kubernetes for big data applications would take several years to play out. But as more IT organizations make it clear they would rather focus their efforts on building and managing applications, there is more pressure to standardize the underlying IT infrastructure environment to dramatically reduce costs across the enterprise.