With the launch of Astera DW Builder just around the corner, we’ve been talking to some of the industry’s foremost thought leaders to get their take on the industry as it stands, how it is evolving, and where they see data warehouse automation fitting into the bigger picture.
For our second interview in this series, we sat down with Vincent Rainardi, a longtime veteran in the data warehousing space. He has worked on projects for clients such as Barclays, Lloyds Bank, Toyota, and Avery Dennison. Vincent has also written extensively on data warehousing and BI, authoring a book Building a Data Warehouse: With Examples in SQL Server and a popular blog on the topic.
In this discussion, we took a deep dive into Vincent’s varied experiences and insights on data warehousing, with a specific focus on the financial services industry, where he has built up substantial expertise over the years.
Could you tell me a little bit about your involvement in data warehousing? When did you first start working on these projects, and what roles have you played in the development process over the years?
I started working in data warehousing in 1996, building a data warehouse and BI for a car manufacturer. At that time, DW & BI was called DSS and EIS/MIS (Decision Support System and Executive/Management Information System). I then left DW/BI for a few years but got involved in this space again in 2005 when I read Ralph Kimball’s book on data warehousing (The Data Warehouse Toolkit) and became very passionate about it. After that, I wrote many articles on the topic for various websites and my blog. In 2007, I published a book covering many of my insights on data warehousing.
Since 2005 I have worked on several data warehousing projects, developing data warehouses for banks, asset managers, insurers, manufacturers, healthcare providers, retailers, and eCommerce companies. During implementations, my primary responsibilities are as a data warehouse architect, i.e., designing the data flow architecture, the data models, and the physical platforms including SQL Server, Oracle, Teradata, Hadoop, Azure.
I have also served in a variety of other roles, including as a business analyst (business requirements, functional specs), BI designer (defining dashboards and visualizations), BI developer (working with platforms such as QlikView, Spotfire, Tableau, SSRS, SSAS, ProClarity, Business Objects, Strategy Companion), ETL developer (SSIS, Informatica, Teradata utilities, SQL), tester, and project manager.
You’ve talked a little bit before about the differences between data warehousing and BI – how do you think these two concepts complement each other, and where do they differ in principle?
In practice, they are used interchangeably, but actually, they are different. BI is the front end (dashboards, reports, visualizations), and the data warehouse is the backend (database, storage, ETL, files). However, they do complement each other, and, in practice, they are used together. Today, the backend does not have to be a data warehouse; it can be a data lake. And the front end does not have to be BI; it can be machine learning.
You have a wealth of experience in financial services. What would you say are the most common use cases you see from organizations in industries like insurance or investment banking?
The use cases in insurance involve a lot of complex analysis. So typically, organizations will look to answer queries on the types of higher-level gross premium earned, written premiums, risk coverage, cost of risk, claim adjustments, claim timelines, sales pipelines, customer profitability, underwriting, actuarial rates, reinsurance risks, and cash flow.
In investment banking, these analyses will cover market risks, position exposure, trade activities, regulatory reporting, market research, liquidity, collateral, deal pipelines, client accounts, anti-money laundering, counterparties, fraud detection, portfolio management, and capital stress tests.
We’ve seen a significant increase in the volume and variety of data moving through enterprises over the past decade. How do you see the so-called traditional data warehouse evolving to handle these requirements?
So, traditional data warehousing has evolved to incorporate data lake and big data technologies. Unstructured data is stored in data lakes, including documents, images, video, IoT, and social media feeds, while structured data (data stored in tabular format) is stored in the data warehouse. Data lakes are usually built on Hadoop file system compared to the data warehouse, which is generally built on relational databases. Modern data warehouses also incorporate NoSQL, such as graph and document databases, as some data is more suited to these types of formats. That covers the variety bit.
As far as volume goes, in the early years of data warehousing, MPP databases like Teradata, Oracle Exadata, and Microsoft PDW were a popular way to handle large volumes of data while keeping the data relational. But nowadays, big data technologies are much more widely used.
What’s your take on the data lake? Does it have a place in the modern BI system? Does it complement the data warehouse, or is it a viable alternative?
Yes, it does. Modern data warehouses use data lakes as a staging area which enables users and frontend ML technologies to access raw data. As long as you’re not looking to integrate data, the data lake is a viable alternative. This type of use case is often seen in projects where data is siloed (separated from other functions) and development time is minimal, in which case data lakes are very effective. Good examples for this would be a machine learning-based fraud detection project in a bank or a project where a specific regulatory reporting requirement needs to be met within three months. When these siloes are removed later on (i.e., when other enterprise functions need to access the prediction result), then the model output can be stored in the data warehouse for downstream users.
However, in most enterprise scenarios, you will need to integrate data. For example, an insurer or asset manager will need to create a unified view from data on securities, issuers, policies, and customers to develop an accurate picture of their enterprise.
You’ve said in the past that understanding the business might be the most critical part of data warehouse development. How would you go about ensuring that business requirements are faithfully translated into the data architecture?
Yes, business understanding is very important. You can’t possibly design a good data model without a good understanding of the business. Imagine if you were tasked to design an ESG (environmental, social, governance) data warehouse, but you had no idea what ESG was. Or an underwriting data mart, but you never worked in insurance. You can’t possibly know what the fact and dimension tables should look like. A good data architect must be interested in the nitty-gritty of the business. Let’s say you’re designing a performance attribution mart for an investment manager. It is essential that you learn about the difference between arithmetic and geometric returns, yield curves, and interaction factors. Sadly, some data architects are not interested in these essential business details. They are more interested in the technical bits.
The best way to ensure that the business requirements are incorporated well into the data model is by running various business scenarios, testing if every scenario can be satisfied by querying the data model using a simple SQL query. If for that, you need a complex query, the data model is not well designed. This is usually the downfall of a data lake; they don’t satisfy complicated business scenarios. And this is the crux of the whole thing: to write those scenarios well; you need to have a good understanding of the business.
In an earlier blog post, you wrote about how many financial services organizations are still using Excel to generate reports and analyses. Do you see this practice changing with more sophisticated visualization tools becoming commonplace, or will we continue seeing this process for the foreseeable future?
Financial services companies will continue to use Excel for analysis and for reporting too. This is human nature because employees in this domain tend to have strong Excel skills. I experienced this preference first-hand at various banks and investment companies. Even organizations using platforms like Tableau, Qlik, or Power BI asked for an “export to Excel” facility for further analysis.
Although the standard reports and dashboards produced by visualization tools are important (they contain formalized data in an easily digestible format), users need to perform their own analysis in many cases. For that, they ask for the raw data behind those dashboards in a tool they’re far more familiar with – Excel. With the raw data at hand, they can group the data differently, apply certain filters, etc., all according to their needs.
Remember, these are financial analysts, number savvy, intelligent people – analyzing data and crunching numbers is what they do. Consuming static reports will never satisfy their entire needs, which is why It generally isn’t a good idea to try to cater to every possible need in the BI tool.
There has been a big push to start moving data warehouses to the cloud on platforms such as Google BigQuery or Amazon Redshift in recent years. What’s your take on the advantages of cloud deployment? Would you still recommend sticking with on-premise databases in some cases?
The main advantages are the outsourcing of human resources and the flexibility in computing resources. With a cloud data warehouse, you don’t need to pay a lot of people in-house to maintain the servers, storage, network, security, backup, and disaster recovery processes. If you need a new database server, you won’t need to wait three weeks to purchase, configure, install, attach storage, etc. Instead, you get increased server capacity on-demand, and you pay by the hour.
This value proposition is not only interesting for small companies but also big corporations who often ask themselves, “what business should I focus on?” Whether you are a retailer, telecom, or financial company, the answer is certainly not “the server business” nor “the IT infrastructure business.” So you outsource. With cloud infrastructure, you can increase and decrease computing power and memory as and when you need. You turn off test and dev servers when they are not in use at night, and you pay by the hour.
When it comes to powerful data warehouse servers, the cost is prohibitive to set it up in-house (on-prem). The required employee skills are prohibitive, too, as there are many different specialist skills required, from SAN to security, from Docker to Yarn. So I would recommend a cloud-based deployment rather than an on-premise data warehouse. That is the argument for using Google BigQuery, Amazon Redshift, and Azure Dedicated SQL Pool (formerly SQL DW, PDW, APS).
However, suppose it is not a greenfield project, and you have to enhance an existing data warehouse that is on-premise with many on-prem systems connected to it. In that case, the cost to migrate all of those systems/applications to the cloud is massive. Let’s say for argument’s sake that it would cost an additional $2 million and three years to complete this kind of project. It does not make good business sense to spend that $2m for no additional business features. Keep it on-prem, do the enhancement on-prem, and in the meantime, slowly build your new system in the cloud. Currently, roughly 25% of IT budgets go towards maintaining existing systems and infrastructure. This is the pool that you should draw from to drive a slow migration to the cloud.
I’ve seen you touch on the big ETL vs. ELT debate before. When it comes to loading data to the data warehouse, is there an approach you prefer? Or does it need to be decided on a case-by-case basis?
It needs to be decided on a case-by-case basis. It depends on the infrastructure. It depends on the data lake. It depends on the data ingestion tool. And it depends on the data warehouse platform. Ultimately, it doesn’t matter whether you’re using ETL, ELT, or ELTLT; you can do the data ingestion in various different ways. You can stage it once, twice, or none. Transform it once, twice, or none. You just need to choose the most appropriate approach for each pipeline and each situation.
Whether it is ETL or ELT, the “must-have” on data ingestion is automated testing. Nimrod Avissar explains it well here. That is the approach I prefer on ETL/ELT: automated testing approach.
As far as the data warehouse development process itself, do you feel that agile or waterfall are better-able to meet constantly changing BI requirements?
It is not about agile or waterfall (agile, of course, who wants to commit a big chunk in one go?). The data warehouse development process is about development operations. This engineering discipline is mature now. You need to have a release pipeline, a sprint board, backlog, source control, change control, error logging, and incident management. A data warehouse development team needs to have DevOps automation, i.e., the release process should be automated, from the source control, pull request into the release branch, deploy into a test, running automated testing and deployment into production.
CI/CD used to be associated with application development, not data warehouse development. But now, it is a must-have. If you run a data warehouse development team without it, you incur a lot of additional costs every month. Or worse, sacrificing the data warehouse build quality.
On your blog, you’ve written about some key elements that determine the effectiveness of your data warehouse, including data availability, data security, data quality, and of course, query/load performance. Which would you say is the most crucial factor?
The most important element that determines the effectiveness of a data warehouse is whether or not it addresses business problems and use cases. At the end of the day, you could be using your DW to create client reporting, management dashboards, regulatory reports, or general business analyses. Still, if it fails to address your business problem, then it just isn’t effective. In this case, failure can usually be pegged to one of three things:
- Lack of particular data, e.g., data on an important product or customer,
- Data quality issues, e.g., incorrect data,
- Difficulty in querying data, e.g., the data is in the warehouse but not showing up in reports.
I would say that incorrect data is particularly damaging for the business, especially when they have taken action based on that data. For example, if you used the warehouse’s performance data to create a fact sheet and misstated the fund’s performance, it is both a reputational risk and a huge financial risk. Not only can this result in a substantial penalty from financial authorities, but your clients might head for the exit based on the misleading figures you’ve quoted, causing massive outflows and an overall grim business future.
In comparison, if a one-hour query ran for the whole day, that’s nothing, almost no financial damage. The same with not having particular data (data availability), that’s not as massive a risk compared to incorrect data being published/sent to the authorities or clients. On the other hand, if your clients’ data is stolen, that too can damage the client’s trust, causing client exodus, which could leave you with no business for the foreseeable future.
I’d say that data quality and data security are the two major issues you need to safeguard against.
You’ve also discussed some of the critical steps in building a data warehouse, i.e., data modeling, creating processes for data loading and delivery, and then architecting these processes to ensure maximum efficiency. Which would you say is the most challenging part of data warehouse development?
The most difficult part of data warehouse development is not data modeling, data loading, or process efficiency but determining what architecture is the most suitable.
If you chose to build a data lake, whereas what you really need is a dimensional data warehouse (because of its integrated nature), then you were going in the wrong direction from the start. If you decided to use a particular ingesting technology that none of your employees had the skill to operate, you’ve made a significant investment that has increased your risk and created avoidable delays in implementation. If you decided to buy a pre-designed data warehouse, but the data model didn’t suit you, it was like building a skyscraper on a weak foundation.
What we don’t see behind the scenes is what usually makes the warehouse a success, i.e., the architecture choice, the development operations, and the data quality checks.
Where do you see data warehouse automation fitting in here? How can it help cut down the effort and time spent on some of these tasks? Where do you see the big value as far as a purpose-built tool goes?
The value of data warehouse automation is the automatic loading. For example, a lot of asset management companies use widely used off-the-shelf investment management systems to manage their portfolios. A dimensional data warehouse that can automatically load data from them has a lot of value. Many large manufacturing companies use SAP as their core operational system. A data warehouse that can automatically load data from SAP is valuable because you don’t have to develop the pipelines for loading the data, which takes a lot of time. In all cases, you must be able to alter the dimensional model as you see fit.
The next best thing is to try to automate the data ingestion. Integrating multiple data sources into one dimension is complex (e.g., customer dimension, company dimension, security dimension), and automating it is even more difficult. We could be dealing with vertical files or incremental files, CSV, or Excel files. We need to be able to pivot, aggregate, and coalesce that data.
Also, we must not assume that the data source is a file that is ready to be processed; it might need to be decrypted, unzipped, renamed, or downloaded first. The core of integrating data from multiple sources involves a process called “data matching,” and it could be a waterfall matching on several criteria. The second core process is called “data enrichment,” e.g., if the security does not have an S&P rating, a risk country, or GICS sector, then we enrich it from an external source such as Bloomberg or FactSet.
The core value of a data warehouse is the integrated dimension which is created from many sources. A data warehouse automation tool needs to be able to construct this dimension, not a country or currency dimension, but a customer or security dimension made up of several source tables.
Which features and functionalities do you think are absolutely essential for an effective data warehouse automation tool?
The functionalities which are absolutely essential for an effective data warehouse automation tool are:
- It must keep up to date with ERP, CRM, or IMS (e.g., SAP, Salesforce, or ThinkFolio) as they change versions
- It must have the ability to integrate multiple sources to create complex dimensions, i.e., the customer dimension
- It must have the ability to pre-process data files, e.g., to retrieve or download files for processing.
- It must allow for automated testing and deployment
Finally, where do you see the data warehouse headed in the future? What significant advances (if any) do you foresee in this space?
Some people think they can replace two years of data warehouse development with a three-month data lake development. This is deceiving because the customer or product data being loaded into the latter architecture is not integrated, so it is not queryable. And they might forget to build data quality checks or operational readiness (e.g., monitoring tools) into the system, which would otherwise take 30% of the whole effort. Granted, some projects do not need integration, and in this case, a data lake can be used, but you still need to build data quality checks and operational readiness.
What I see coming in the future is the automation of testing and the automation of data quality checks. It is not too difficult to construct a tool that watches every incoming file and performs checks on specified columns, comparing it with the last few days’ data. The other thing which could be built is a BI tool that is CI/CD friendly, e.g., XML-based, so it can be deployed from the source control to production automatically because it is text-based.
And finally, up to now, there is no single data governance tool in the market that can scan databases, ETL, and BI tools and automatically create data lineage from source files to the report. It might not be possible to construct this tool for every single database, ETL, and BI tool out there, but it is possible to do the major ones. The key to data governance is automation.
Astera DW Builder – An Agile Answer to Your Data Warehouse Development Challenges
Astera DW Builder is a unified platform that enables you to quickly deploy and consume metadata-enriched data models driven by end-user requirements.
The product brings a range of out-of-box functionalities you can leverage to build an agile data warehouse that truly serves your business’s needs. From implementing business rules and naming conventions at the logical level to creating complex data pipelines that can automatically retrieve data from your chosen source systems, transform them according to your data quality requirements, and load them to a cloud or on-premise database of your choice, it’s all made possible through our no-code solution.
Now, we’re giving you the chance to get an exclusive first look at the speed and intuitiveness of ADWB for yourself in an end-to-end launch event that promises to change the way you think about data warehousing. Watch the exclusive launch event here.