With the official launch of our end-to-end data warehousing solution Astera DW Builder, we’ve introduced a much faster, more agile path to designing and deploying data warehouses. But that’s just the beginning, we’re planning some major additions and updates to the product over the next year that will add unique value to any organization looking to bypass issues that come with traditional data warehouse development.
For our third interview in this series, we decided to sit down with someone who’s built their career on being able to identify, anticipate, and successfully resolve these challenges while leading data warehousing projects for Microsoft and currently, Ernst & Young. As one of the foremost thought leaders in this space, James Serra has tackled a range of topics from BI and data architectures to big data and analytics in his popular blog and at speaking sessions around the world.
In this interview, we talked to James about some of the insights he’s gained during his time in the industry, the evolution of data architectures in the age of big data, and what the future of the data warehouse might look like.
Could you tell me a little bit about your involvement in data warehousing? When did you first start working in this space and what sort of roles have you played in the development process over the years?
I started out many, many years ago with databases back in the ’80s. The initial focus was on Microsoft SQL Server, first SQL Server 1.0 then OS 2.0 back in 1989 I think it was. So, those were more transactional databases with a lot of updates, inserts, and deletes.
Maybe 20 years ago is when I first got involved in data warehousing. I was working as DBA at a company, and they were tasked with building a data warehouse. At the time, I knew very little about the technology, but the data warehouse was built on SQL Server, so I got involved in the process. That was one of my first experiences in creating a true data warehouse i.e. pulling in data from all these different sources and creating BI reports on top of that.
Since then, I’ve had many different jobs – I was a consultant working on many different types of databases always in the Microsoft realm, starting from SQL Server and then Azure came around, then there was SQL Database, SQL Data Warehouse, and now Synapse which is what I’ve been using pretty much exclusively over the past few years.
Obviously, you have a wealth of experience in building data architectures both with Microsoft and now at EY. What would you say are the most common use cases for data warehousing? In other words, why do these enterprises want to build a data warehouse?
Above all companies want to make better business decisions. To do that they need to do have all the data possible. So instead of trying to go to each of their operational databases individually and create reports, they can get more value if they pull all that data together.
As an example, say they have customer information stored in a CRM system, then they have customer support information stored separately in another system, yet another platform for managing sales information, and some ERP system that contains customer information as well. They have this data all stored separately and of course, they want to collect it and use it to identify historical trends so they can find reasons for things like why customers are not buying in certain areas of the country. With a data warehouse, they can go in, dig deeper, and find out what’s going on.
More recently, it’s become about not just historical trends but looking into the future as well, this is where AI and machine learning come in. Today, companies don’t just want to see where they’ve been but also where they’re going, which is why predictive analytics is becoming a big thing. So, if we take that customer information from earlier and say we want to use it predict the chances of a customer leaving. A machine learning model might be able to estimate a 70% chance that the customer will leave based on all the data collected. So now, you, as the decision-maker can take proactive measures and do something to prevent that, like sending them a coupon to secure their loyalty.
Fundamentally, it’s really the gathering of all this data that allows the end-user/analyst to generate reports, then slice and dice the outputs to enable those better business decisions.
So, we’ve obviously seen a pretty significant increase in the volume and variety of data moving through enterprises over the past decade. How do you see the so-called traditional data warehouse evolving to handle these requirements?
So, first of all, I’d define big data as not just the size of the data, but the type and the speed of the data. So many customers that I’ve worked with are handling terabytes of data that they’re trying to consume. But they also have the challenge of dealing with data in all kinds of formats including Parquet, CSV, JSON as well as data they may want to consume in real-time from social media feeds like Twitter, they may also want to pull some IoT data in there.
So now, you have the challenge of all three-speed types, and sizes of data. While systems have come along in the last half dozen or so years that can handle big data like Azure Synapse in Microsoft platform, for example, there are still a number of other tools required to build a modern data warehouse. These tools handle the variety, volume, and size of data. Nowadays they can handle any combination of data, pull it in adjust it, clean it, and master it before generating reports.
What’s your take on the data lake? Does it have a place in the modern BI system? Does it complement the data warehouse, or is it a viable alternative?
They’re definitely complementary. When data lakes first came into use around ten years ago or so, they were on Hadoop and the idea behind them was that we couldn’t capture big data effectively due to its size, semi-structured or non-relational data also didn’t really fit well into a relational database. So, we decided to put it into the data lake and query it in there. That became the landing zone for all types of non-relational data, while relational data was still stored within a database.
Of course, most organizations want to combine the two types of data, so the non-relational/semi-structured data in the data lake would need to be moved to the database. Although, early on there were attempts early on to put everything into the data lake, which led to a variety of problems.
So the offshoot of that is that we realized that we’d always need a relational database. Over time this thinking evolved, and we now understand that you need to have a data lake alongside the relational database and that each has its own strengths.
We can look at a data lake as being able to handle data no matter the size or the type, but it has some limitations. The data lake is not good for extremely fast queries, it’s not great for the types of security that you may want out of your data architecture – like low-level security or column-level security, it can also be harder for your average end-user to query data sitting in a data lake because it’s schema-on-read, meaning it’s just a glorified file folder, you can put any sort of data in there and that can make it challenging to try and pull it out if you don’t have the necessary technical skills. That’s where a relational database comes into play because somebody in IT will do the work to put the data along with metadata, so it becomes much easier for an end-user to query it. So, if you go from a data lake into a relational database and from 3NF into a star schema end-users can now easily perform self-service BI because they can just drag fields from the database and create reports or queries off of that.
Now tools like Azure Synapse have come along and made it much easier to query data whether it’s in a data lake or a relational database with regular SQL, and they do both have their pros and cons. Ultimately though, the whole idea is to take data from all these sources, move it along, and do some work to prepare it which can involve extra costs but in the end, you’re going to get data that is formatted in a way that’s much easier for most end-users. Meanwhile, you can still have the data lake for data scientists and power users to query, but for most users, you’re going to want to have a relational database available.
What would you the most mission-critical aspect of data warehouse development? Is the modeling of data? Loading data to the data warehouse? Ensuring that the data warehouse is accessible to your BI platforms?
All of these aspects are extremely important, but data governance is really the main thread throughout the building of a data warehouse. That is to make sure the data is accurate, correct, and constitutes a single source of the truth for the organization. Because the worst thing that can happen is that you do all this work to build a data warehouse and generate a report, then the first time an end-user sees it they say the data is not accurate or incorrect. Right away, you’ve lost their trust in the system. So, you really have to make sure upfront that data is governed properly. That means having it properly cleaned and mastered and all those things that go along with data governance.
Almost as important, if not more, is security. Nowadays there’s so much personally identifying information that could land in a data warehouse, and if you don’t have the right type of security in there then people might start seeing data that they shouldn’t. That could be personal information, could be sales figures, or other things that can get you in a lot of trouble especially if that data is pulled outside of your company. Now you’re on the front page of Wall Street Journal with a breach. So that’s the other big thing is security. So I would say those two should be top of your mind when building a data warehouse.
In recent years, there has been a big push to start moving data warehouses to the cloud on platforms such as Azure Synapse, Google BigQuery, or Amazon Redshift. What’s your take on the advantages of cloud deployment? Would you still recommend sticking with on-prem databases in some cases?
I have a hard time ever seeing a solution anymore that should be on-premises, there are just very rare cases of that being a viable solution anymore. It used to be, at Microsoft for example, when I first joined 7 years back where I would have a lot of conversations about cloud vs. on-prem but for the last few years it’s been extremely rare that anyone wants to talk about the latter as a possible option.
The rare exceptions are if the company is in a place that doesn’t have access to the Internet like say a mine or a rig out in the sea. Or if they’re dealing with data that needs millisecond response times when it comes to queries and things like that because there could be a little latency to the cloud. But outside of that, everybody has, is, or is going to the Cloud for numerous reasons which I could spend half an hour talking about.
It’s important to note that cost is not always at the forefront of this move, it’s other things like having the latest features of a product, or probably the biggest benefit which is the ability to starting working quickly. With a cloud data warehouse, I can go into Azure, for example, and have a database ready within minutes, whereas with on-prem it could take days, if not, weeks, if not months to get a database on a server. So, I can’t remember the last time I met a customer that I recommended on-prem too. Sure, there might be a couple of use cases as I mentioned above, but these are few and far between.
So, I’ve seen you write a fair bit about Reverse ETL in recent years. If you could just sum up the concept, and talk a little bit about the benefits you feel this approach brings?
Yeah, this is a very new concept. So, some companies can have customer data in a dozen different source systems especially if it’s a large company. Say they pull all this customer data, clean it, and master it (meaning creating golden records) because there could be multiple copies of the customer with different spellings. They could also supplement customer data from one system with different types of data from other systems like taking customer data from a CRM system and adding support ticket information from customers.
So let’s suppose all of this data is now in the data warehouse and I’ve done all this work to create one unified view of the customer. Well now, the problem is that the cleaning has all been done at a data warehouse level, but the source systems where the data was taken from are still not cleaned. With reverse ETL the idea is that I can now funnel the data warehouse data back into the source system and correct those customer records. That’s one reason I see reverse ETL becoming something popular because you’re doing all this work in the data warehouse which you can then apply to the source systems to correct them.
The other big benefit of reverse ETL is that if I pull all this data into a data warehouse and I create BI reports on that and say it’s a customer view, well I may have a lot of sales associates who are used to these other operating systems i.e. a CRM system where they already look at customer data and that they already like using. Now, you’re asking them to go to a data warehouse and use a different system, a different report, to look at those same customers. So, why not reverse ETL and take that data from the data warehouse and copy it into the operational system. The analysts can then use the data in the system they’re most comfortable with. So, they’re not having to go into the data warehouse.
This is another area where I see Reverse ETL being really popular. Do the work in a data warehouse, but then land the data in an operational system where the end-users are most comfortable.
If you were looking at a data warehousing solution for a client, what would be some of the key features you’d want it to have?
I’ve used a number of data warehouse automation tools over the years with varying levels of success. Some of them have been really helpful but the ones that I find are most effective are the ones that are not too proprietary and can provide automation that makes the building of solutions quicker, that’s where they can be hugely valuable.
When you weigh up the Build vs Buy question of whether to start projects completely from scratch, I always recommend first looking for something that’s already been built and see whether that will help you. Now, that could be a common data model or a tool for data warehouse automation that will quickly build the data warehouse for you. If the solution can do that using technology that is common, meaning if that automation tool creates a data warehouse for you, could you still use the data warehouse without being forced to use the third-party tool then you’re on the right track.
For example, the data warehouse automation solution could interface with a particular ETL software to create pipelines and after it has created those pipelines, you might go and update them yourself without going through the DWA tool. Now, let’s say it’s using a Microsoft product for that task, the DWA tool may not keep up with updates to the ETL product, so say the software gets new features then the data warehousing tool may take some time to leverage those features which of course limits the functionality of your solution. There’s also the matter of skillsets, if you’re using a data warehouse automation tool at your company then you may have to hire people that already have that third-party tool experience in order to move the project along.
Overall, I think DWA tools are especially helpful for companies that are new to data warehousing and don’t have the best practices or standards in there and neither do they have a large team to go and build their own setup. They can go and get some quick results and sort of shortcut the process by using a readymade tool to speed up the development.
What do you think of the concept of the agile data warehouse? Basically this idea that data warehousing is a process, not an end-goal and that iteration is pretty much at the heart of any effective BI system?
When I was at Microsoft customers generally saw it as a technology, the people and process side of it was something they looked to deal with themselves. Of course, it’s important to have all of those elements in place. Whether we talk about creating a center of excellence for data governance, or whether we’re talking about DataOps. Because building an application is very different from building a data warehouse which is why I make the distinction between DevOps and DataOps.
In a data warehouse, you’re dealing with so many systems. You could be updating a model in a data warehouse, to an ETL pipeline, to reporting. So all those things have to be coordinated and that’s where DataOps comes into play and that’s hugely important. I see a lot of companies, and EY is one of them in particular, that have this gigantic data warehouse built and they have to follow a DataOps type of process in order to make sure that the product is used both internally and outside of the company, also that it doesn’t break whenever there’s a new feature or bug fix put into it.
So, people and processes are sometimes the hardest part while technology can be relatively easy. So you have to make sure to get all these elements in place to make sure that the finished solution is as bug-free as possible and provide, as I mentioned before, accuracy in data.
Finally, where do you see the data warehouse headed in the future? What major advances (if any) do you foresee in this space?
Well, touching on that – I think there’s going to be a much bigger focus on AI and the machine learning part of that in particular. The challenge when you’re building out a data warehouse is collecting data, cleaning it, then landing it someplace whether it’s a data lake or relational database. For customers, it could take months if not years to collect all that data. And the icing on the cake becomes the reporting off of that, where you’re using something like PowerBI to slice and dice datasets. At the moment, that’s where a lot of customers still are, in the process of collecting their data.
Most of the work in this space is follows a hybrid approach because a lot of companies have data sources in the cloud, but a lot of systems are still on-prem and they need to pull all of them into the cloud which has its own challenges. Once you collect all the data, you can report off of it and get some solutions pretty quickly.
The next step is doing machine learning and creating models that can be trained using the data in the cloud. A lot of customers are not there yet, they’re still in the collection stage trying to get data in the data warehouse. So the big push will be, customers seeing the benefits of ML models in creating predictive analytics for things like customer churn or for when a part is going to fail (IoT device data has become really popular for this type of predictive maintenance). But again, it’s a pretty big jump to get there, because you’ll need data scientists and the necessary products. Sure, solutions have come a long way with automated machine learning to make this part easier, but junk in junk out still applies you still need someone who understands data science concepts in there.
So, I think in the next 2 years, you’re going to see a huge amount of work done to build those machine learning models to get even more value out of the data than you can out of historical reporting.
An End-to-End Solution for Modern Data Warehouse Development
Astera DW Builder offers a unified platform that users can leverage to streamline every aspect of their development process, from the initial collection, cleansing of data to designing reporting-ready data models that are suited to your data governance requirements, and of course the deployment of your data warehouse in the cloud.
With ADWB you don’t have to rely on a complex technology stack or experienced technical resources to get your implementation over the line. The product offers an intuitive drag-and-drop interface, supports speedy iteration, and works equally well with an array of source and destination systems. Contact our team to get started with Astera DW Builder today.