Querying the Data Lake

Water you doing?

The Apache open source contributions to the Hadoop ecosystem are numerous and cover a broad portion of a reference architecture. It has been some time since we considered Hadoop's foundational low-cost storage and in-place query capabilities. As we saw in the “Data Lakes” blog post, many organizations adopted this foundational offering.

Low-cost commodity hardware could be scaled to match a growing appetite for data and the correspondingly larger workloads. In-place queries dramatically reduced the manual programming traditionally required to extract, transform and load data into relational data models. Queries could be built in a simple (okay, simpler) manner compared with traditional methods, using tools and techniques that were similar, though not identical, and that often required some retraining. There were even attempts to remap existing business queries to their Hadoop equivalents. Some actually worked. But the query experience fell short of the business community's expectations: queries occasionally took hours or days to complete when their relational equivalents finished in seconds or minutes.

Although Hadoop's reduced cost of ownership was quite favorable, queries that took several hours to complete were a tremendous setback. The hope of using a centralized data store for operational reporting, and potentially even operational analytics, appeared quite doubtful. Many attempts to combine hybrid components in different arrangements to address performance concerns failed, some quite miserably. An in-memory approach was surely needed. And one was created.

Apache Spark is a massively scalable, distributed, in-memory parallel query engine that extended the Hadoop footprint by providing in-memory query capability with exceptionally fast response times. In some benchmarks, Apache Spark substantially outperformed relational counterparts in tests of similar size or volume. So we add Spark to our architectural illustration below. The Data Lake concept now has commodity storage, in-place query that still avoids expensive manual processes, and an in-memory query processing component called Spark.

Key Aspect	Response	NexJ DAI
Hadoop Ecosystem	HDFS Low-Cost Commodity Storage; Hive In-Place Query	NexJ DAI Integrates with Hadoop Data Lakes as a potential source system
Hadoop Ecosystem	Spark In-Memory Query	NexJ DAI Provisions Semantic View Data consumable through a Spark Adapter

How does your organization use a Data Lake? What 360-degree data views power your analytics? We welcome your thoughts, value your insights and act on your feedback: share below!


Author: Matthew Bogart

Vice President, Marketing

Matthew is responsible for building awareness for NexJ and demand for its solutions. He regularly engages with analysts, key industry stakeholders, and thought leaders to stay abreast of technology innovation and financial services industry trends and challenges.

Matthew will be sharing his insight and perspective on the enterprise customer management market and the issues affecting the financial services industry today in regular contributions to the NexJ blog. He encourages readers to take part in the conversation or reach out to him directly with their observations on market trends and issues.
