Successfully adopting Hadoop means getting past the hype and saying goodbye to the old idea of data modelling.
Recently, on my last trip to Toronto, I had the opportunity to meet with three major national banks to talk with them about their Big Data initiatives for 2018. Interestingly, all of them had a similar question for me:
“How do I implement my financial data model in this new Hadoop ecosystem we’ve adopted?”
I could immediately tell they were in the ‘trough of disillusionment’ in their Hadoop adoption process. Part of Gartner’s hype cycle, the ‘trough of disillusionment’ is the feeling you get when you’ve bought into a product with a lot of potential but that is failing on implementation and not providing any returns.
When it comes to Hadoop, the main selling point has been the operational savings offered versus the cost of maintaining and upgrading current platforms to scale when centralizing the data. However, the short proofs of concept tested during the early adoption cycle rarely bring out the true use cases for using an existing data model on Hadoop.
Now that these banks I was meeting with have Hadoop clusters, they’re looking for the ‘easy answer’ to move their data warehouse model on their existing platform into this new Hadoop world. Unfortunately, there’s no silver bullet in these cases and I wasn’t able to give them the easy answer they wanted to hear. The truth is that additional modelling/mapping work is required to see true value from Hadoop, and, in some cases, this makes for awkward conversation.
Out With the Old
Data modelling is still an important exercise. It forces IT professionals to create a data representation of the complex business process within an organization. More importantly, it aids in documenting and describing the data (i.e. metadata) so everyone has the same definition of the data elements.
At some point in the data modelling process, we need to build the physical version of the model. History hasn’t prepared us well for moving from the ‘relationship style’ of physical data models to the new scalable Big Data platforms. For years, the data modelling mindset has been some variant of adhering to third normal form (including dimensional modelling). This data modelling style worked well when deploying to relational database management systems that were accepted and available in every organization. The goal in these cases was often to eliminate redundancy and data repetition to ensure efficient storage. It also aided with data quality by allowing users one place to make a data change if needed.
For example, if a customer’s address changed, it only needed to be updated in one place and the rest of data that relied on this address would immediately see this change automatically. Operationally, this was an ideal way to represent the data and would support the ‘one truth’ for any data entity. (Data modelers rejoiced in achieving this goal.) For the most part, the relational databases did a good job of handling normalized data models for organizations.
In With the New
Two words changed everything in the field of data modelling: Big Data. Data sizes were increasing. New data sources were being requested. Changes to the data model were hard to deploy. Performance issues were not going away. Upgrades were becoming costlier in this relational world. There were new tools and platforms that respected companies with ‘Internet scale’ data were using to solve these problems in the new era of Big Data. This triggered an adoption cycle of technologies such as Hadoop (and its related ecosystem) within enterprise IT organizations faced with these types of issues.
Centralizing data from many sources became a straightforward process with the Hadoop Distributed File System (HDFS), since it accepted all formats and schemas. If you can copy a file, you can get that file’s data ingested into Hadoop. To be scalable, folks knew that data would be written and read differently on Hadoop compared to a relational database. But those differences didn’t really sink in until they tried it for real.
With a platform like Hadoop, you have little control on where the data is written across a cluster of data nodes. There’s also a lot of slow disk writing activities that take place. Concepts like ‘indexing’ are different and you learn firsthand about the ‘small file’ problem in this new data storage world. If you keep the traditional ‘relational mindset’ and apply your highly normalized data model, this will lead to a poor performance experience when trying to do anything with the data. This is where many organizations get stuck and find themselves in the ‘trough of disillusionment’.
Changing the Way We Think About Data Modelling
All is not lost here in this new data world. It will, however, take a new perspective on how to store, retrieve, and handle moving data in order to realize its true value. I believe part of this new perspective starts with truly understanding how the data is being consumed outside of (or downstream from) the traditional data warehouse that implements a highly normalized data model.
From the business intelligence and reporting perspective, the tools want to consume the data in a highly de-normalized or flattened form. When given a complex model with many tables, the first thing a reporting tool will do is create a flattened view. It does this by performing a series of complex joins against many tables. The more tables that need to be joined, the more you can expect a degradation in performance to retrieve the data. Often, to overcome these types of performance issues, a pre-built view or snapshot is used to create this de-normalized view of data ahead of time. The bottom line of all this? The users who consume the data will want it in flattened form and not in some complex relational form.
De-normalized Isn’t a Dirty Word
When you broaden your view on how the data flows from its sources to the end targets, what you usually see is the following sequence:
It takes a fair bit of effort to implement and maintain this ‘de-normalize to normalize to de-normalize’ process. By taking a view from target de-normalized data to the source of the data, you will likely find opportunities to ‘flatten’ the physical model. The data warehouse represents this awkward intermediate step. The idea here is not to flatten the data to one massive table. The goal is to flatten entity areas into single tables. This isn’t about finding ‘one data model of tables to rule them all’. It’s about creating tables that are ‘purpose built data products’ to support downstream targets. This will make the model simpler and easier for your end user to consume (rejoicing will ensue!).
Not All Data Modellers Agree
Now, some data modellers may reject this point of view.
“By de-normalizing the core data model we introduce redundancy and increase data size!” one may protest.
“How will we efficiently handle updates?” another may cry.
Well, here’s the solution.
Firstly, for redundancy and increased data sizes, Hadoop and other NoSQL platforms will scale to minimize this risk. These platforms are designed to handle a few larger wide tables then many smaller related tables. They can linearly scale data growth for tables across a cluster of machines. Data tables can be arranged to have thousands of columns per table. Greatly reduced storage costs – through the use of commodity hardware running open source software (no software license fees) – offset the duplicate data storage cost when compared to traditional relational database storage. This redundancy also makes it easier to retrieve the data by minimizing table joins.
Now we turn to the second objection: handling updates and consistency. This is a much more interesting problem to handle with the de-normalized model.
Focusing on the Key Output
As stated earlier, data modelling is still an important exercise to produce a de-normalized model. The exercise may feel more like ‘mapping’ as opposed to modelling. What I believe is different now is the focus on the key output.
In data modelling, the key output was the entity relationship diagram. This large diagram with really small fonts was often proudly displayed on the wall by its creator. But now, the key output should be focused on metadata and definitions/business rules of data elements. I am not just referring to the field’s physical data type of an integer or string, but what this data element represents from a more business-centric point of view (i.e. does this number represent pounds or kilograms, what are the acceptable ranges, format, etc.). A central metadata of the data elements provides the ‘common definition’ when these elements are repeated in a de-normalized model. This helps minimize inconsistent implementations of any logic or rules for your data when these definitions are at the forefront of the modelling process.
Handling Updates in a De-Normalized Model
As for handling updates in a de-normalized model, I take the position that there are no updates. There are only views on versioned data. Most of the scalable Big Data platforms write data in an immutable way. You don’t modify the data, you only insert or append data. Relational databases work in similar ways with the immutable change or write ahead log. Conceptually, this log is what you have now in the Big Data world to query. Write all your data changes as an insert with a time stamp or version. Then, query for any version of data if needed by date or version number.
The Hadoop Journey
The journey to Hadoop or an NoSQL platform is really about taking de-normalization to its full conclusion. Administrators of relational databases were already going down the path of de-normalizing the physical data to achieve better performance. The move to Hadoop becomes an opportunity to apply this to the core data model. This is a change to stop fighting the way users want to consume the data. But this change isn’t about saying goodbye to data modelling. It’s about moving to a process focused more on mapping with a well-defined metadata. It’s about acknowledging that Big Data has irrevocably changed the way data is stored. These changes mean we must change the ways we interact with data. And so, in turn, the ways we think about data modelling must also change.