Data Engineering: Modern Solutions Produce Modern Problems

October 6, 2020

Charles Du

The landscape of data is in a state of constant flux. When it comes to data processing and warehousing technologies, there is always a flavor of the month, accompanied by an unadopted, neglected technology that is destined for relegation into the forgotten abyss.

A revolving door of tech offers the bleeding edge when it comes to engineering data for analytics. With the adoption of unstructured databases and fault-tolerant data methodologies such as schema on read, unpredictable data sources that were once potential showstoppers in the SQL engines of yesteryear are no longer problems! We can collect data that would’ve once been classified as junk to deliver additional insight into previously uncaptured customer behavior!
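To make schema on read a little more concrete, here is a minimal sketch in plain Python. The raw events, the field names and the `read_with_schema` helper are all hypothetical, invented for illustration. The point is only that the raw lines are stored as they arrive, and the schema is applied by the reader, which tolerates missing fields and sets aside lines it cannot parse.

```python
import json

# Hypothetical raw events, stored exactly as they arrived (nothing enforced on write).
RAW_LINES = [
    '{"user_id": "42", "event": "click", "ts": "2020-10-06T12:00:00"}',
    '{"user_id": 7, "event": "view"}',
    '{"event": "click", "extra": {"campaign": "fall"}}',
    'not json at all',
]

# The schema lives with the reader: field name -> (cast function, default value).
SCHEMA = {
    "user_id": (int, None),
    "event": (str, "unknown"),
    "ts": (str, None),
}


def read_with_schema(lines, schema):
    """Apply the schema at read time; return (parsed rows, rejected raw lines)."""
    rows, rejects = [], []
    for line in lines:
        try:
            record = json.loads(line)
            if not isinstance(record, dict):
                raise ValueError("expected a JSON object")
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            rejects.append(line)
            continue
        row = {}
        for field, (cast, default) in schema.items():
            value = record.get(field)
            try:
                row[field] = cast(value) if value is not None else default
            except (TypeError, ValueError):
                row[field] = default  # unusable value: fall back instead of failing
        rows.append(row)
    return rows, rejects


rows, rejects = read_with_schema(RAW_LINES, SCHEMA)
```

Changing the questions you ask of the data then means changing `SCHEMA`, not reloading the source files.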

The adoption of cloud infrastructure has also made arduous hardware and software setup a thing of the past. To that I say, good riddance! Have a sudden need to collect vast quantities of potentially useful, unstructured text data? A few clicks on AWS to combine S3, Lambda and a processing engine like EMR create a basic data lake for you! All without having to fight other departments to find a brand-new server.

Modern solutions…Modern problems

But with great analytics capabilities come great... problems. A pivot to cloud infrastructure often sees companies reduce their overall operations cost, but only when done correctly! On-premises hardware and software licensing often provide a fixed cost: easy to calculate, easy to budget for. Cloud costs are mostly variable, with pitfalls for the uninitiated. Take Spark compute as an example. Most can calculate EMR and EC2 fees, but few consider the EBS volumes needed to support the EC2 compute, or the I/O cost associated with S3 for storage. On smaller systems, this discrepancy is trivial. But scale it up to a pipe handling billions of records per hour, and the pennies become dollars rather suddenly.
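A back-of-the-envelope model shows how the overlooked line items creep in. This is only a sketch: every rate below is a made-up placeholder rather than a real AWS price, and the `ClusterAssumptions` fields and `monthly_cost` helper are hypothetical names. Plug in figures from your own bill before drawing conclusions.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation for an average month


@dataclass
class ClusterAssumptions:
    nodes: int
    hours: float                # hours the cluster runs in the month
    ec2_rate: float             # $ per node-hour (placeholder)
    emr_rate: float             # $ per node-hour EMR surcharge (placeholder)
    ebs_gb_per_node: float      # attached EBS storage per node, in GB
    ebs_rate_gb_month: float    # $ per GB-month (placeholder)
    s3_requests: float          # S3 requests issued in the month
    s3_rate_per_1k: float       # $ per 1,000 requests (placeholder)


def monthly_cost(a: ClusterAssumptions) -> dict:
    """Split an estimated bill into the obvious and the easily forgotten parts."""
    compute = a.nodes * a.hours * (a.ec2_rate + a.emr_rate)
    # EBS is billed per GB-month, prorated here by how long the cluster is up.
    ebs = a.nodes * a.ebs_gb_per_node * a.ebs_rate_gb_month * (a.hours / HOURS_PER_MONTH)
    s3 = a.s3_requests / 1000 * a.s3_rate_per_1k
    total = compute + ebs + s3
    return {
        "compute": compute,
        "ebs": ebs,
        "s3_requests": s3,
        "total": total,
        "overlooked_share": (ebs + s3) / total,
    }


small = monthly_cost(ClusterAssumptions(
    nodes=10, hours=730, ec2_rate=0.20, emr_rate=0.05,
    ebs_gb_per_node=100, ebs_rate_gb_month=0.10,
    s3_requests=1_000_000, s3_rate_per_1k=0.005,
))
# Same cluster, but a pipe issuing a billion times more S3 requests.
large = monthly_cost(ClusterAssumptions(
    nodes=10, hours=730, ec2_rate=0.20, emr_rate=0.05,
    ebs_gb_per_node=100, ebs_rate_gb_month=0.10,
    s3_requests=1_000_000_000_000, s3_rate_per_1k=0.005,
))
```

Comparing `small["overlooked_share"]` with `large["overlooked_share"]` shows the pattern described above: with these placeholder inputs the forgotten items are a rounding error on the small system and the dominant cost on the large one.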

Too much choice?

With so many data platforms out there, we have an abundance of choice to find that Goldilocks piece of tech for any given use case! A sweet proposition at first but, much like sugar, it's a curse in the quantities presented to us!

Take a simple database selection problem. How do you decide between structured and unstructured? If DynamoDB is the flavor of the month, do you go with that? Is your workload well suited to distributed systems? What if you want to go relational? How do you decide between MariaDB, Aurora and Redshift? Unfortunately, there is no one-size-fits-all answer. The choice varies greatly based on individual use cases, which is perhaps a topic for another time.

Compare this to the simple problem of deciding which flavor of SQL Server to buy 10 years ago, and the job of the modern data engineer seems rather complicated! Yet despite all these problems posed to us, we've yet to meet the real challenge.

What happens after taking the red pill?

Given how easy services are to set up, it can be far too easy to get drawn into a rabbit hole! Architects can often get on board the hype train far too quickly given the lack of a setup barrier. The wrong tech can easily be selected and committed to. Given how specialized some of the modern data platforms are, this mistake can be devastating; a service can be wonderfully performant for a certain use case and woefully poor in others. Pick DynamoDB to wipe and reload billions of records daily and the outcome will be far from stellar.

But should you go down the wrong tech stack, how do you fix the problem? Had this been an on-prem implementation in a large corporation, the solution would be easy. Hardware and licenses are committed to, so it’s time to use a pipe wrench as a hammer! So long as business SLAs are met, a sub-optimal tool may be the most efficient choice. But in the cloud, the lines are blurred. As mentioned, services are easy to adopt without a spin-up cost or a commitment. All that's lost is time spent on a defunct code structure for your original cloud service of choice.

On the one hand, this can lead to a too-many-cooks situation: each time you encounter a new niche workload, you bring in a new database. Before you know it, you’ve become Victor Frankenstein and created your own tech stack monster that’s far too complex to justify any marginal gains you made.

Alternatively, the freedom of choice can manifest as a roadblock when trying to back out of your initial cloud service, in the form of endless debate over when, or if, to change the adopted technology. There are no more on-premises setup barriers to make the decision for you: a truly 'modern' twist on an age-old problem!

Discrepancies add to the mix

The same can be said about another ‘modern’ problem in data engineering: discrepancies in data sources! If I had a nickel for every time source datasets lined up perfectly with each other, I'd have enough money to buy a candy bar in 1964. Fundamentally, this is as much a business problem as it is a data engineering problem. In my utopia, data collection methods are standardized, and output schemas follow standard type guidelines to be cross-compatible with all vendors.

This has been a problem since the dawn of data. But now, we have different platforms contributing to the data fragmentation. For example, different processing engines require different data structures: think distributed vs single-threaded. This makes optimizing data for a complete data lake a potentially troublesome task when multiple (especially legacy) systems are involved.

The minutiae that differ between data technologies are the final cherry on top of our modern data problem pie. A fine example is a Spark dataset built around arrays not meshing with MSSQL, which does not have an array type!
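Two common workarounds for the array mismatch can be sketched in plain Python. The records and the `order_id` and `items` field names are hypothetical, and the sketch stands in for whatever the real pipeline would do at scale; in Spark, the first option corresponds roughly to `explode`. Either the array is flattened into one row per element for a child table, or it is kept as a single JSON string column.

```python
import json

# Hypothetical records with an array-valued field, as a Spark dataset might hold them.
records = [
    {"order_id": 1, "items": ["apple", "pear"]},
    {"order_id": 2, "items": []},
    {"order_id": 3, "items": ["fig"]},
]


def to_child_rows(records, key, array_field):
    """Option 1: one row per array element, suitable for a child table keyed on `key`."""
    return [
        {key: rec[key], "position": i, "value": value}
        for rec in records
        for i, value in enumerate(rec[array_field])
    ]


def to_json_column(records, array_field):
    """Option 2: keep one row per record and store the array as a JSON string."""
    return [{**rec, array_field: json.dumps(rec[array_field])} for rec in records]


child_rows = to_child_rows(records, "order_id", "items")
json_rows = to_json_column(records, "items")
```

The trade-off is the usual one: the child table is easy to join and index but drops records whose array is empty unless the parent row is loaded separately, while the JSON column preserves every record but pushes the parsing work onto whoever queries it.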

But benefits outweigh problems

Despite all these hazards, the additional business value from modern data engineering tools far outweighs the potential hurdles during adoption. So how does Blend360 tackle some of the problems we outlined? It all boils down to our clients and the exact nature of their data processing needs! If data volumes and transformation procedures suit distributed platforms, we will happily build that! But if your processing is localized and best suited to an old-fashioned RDBMS, we will just as happily build that! Like an honest mechanic shop, we are here to solve your problem, not upsell you on bells and whistles. With a diverse group of SMEs in our cohort covering high-tech products including Adobe Clickstream and Apache Spark, we work as a team to come up with the best solution. Each SME brings their own perspective, and together these shape the right solution. We provide a team of experts, so you will never have a runaway architect steering the ship!