As anybody who follows my blog knows, I am not a fan of vertical scaling. I don't like solutions that can only be implemented in a single address and storage space. Unfortunately, there are analytical problems that need a holistic view of data. This is very typical of data warehousing applications. As a result, data warehouses are expensive, often out of the reach of smaller organizations. But there may be an alternative that is less expensive and horizontally scalable. What is this great revelation? Processing streams of events using an Event Stream Processor (ESP).
ESPs analyze streams of events using a language similar to SQL. In the same manner that databases and data warehouses use SQL to analyze data tables, ESPs use their query language to analyze streams of events. The simplest way to understand an ESP is to think of events as rows in a table and the attributes of an event as the columns. Each event type is the equivalent of a table. From this perspective, it becomes straightforward to see how an ESP works. But how does this relate to replacing data warehouses?
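To make the table analogy concrete, here is a minimal sketch in plain Python. It is not the API of any real ESP product; the names OrderEvent and LengthWindowSum and the window size are assumptions made up for illustration. It behaves roughly like a standing query of the form "SELECT customer, SUM(amount) FROM OrderEvent GROUP BY customer" over the last N events.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderEvent:
    """Hypothetical event type: think of it as a table with two columns."""
    customer: str
    amount_cents: int


class LengthWindowSum:
    """Standing query: per-customer sum of amount over the last `size` events."""

    def __init__(self, size):
        self.size = size
        self.window = deque()           # the events ("rows") currently in scope
        self.totals = defaultdict(int)  # running SUM(amount) grouped by customer

    def on_event(self, event):
        # Add the new row to the window and to its group's running total.
        self.window.append(event)
        self.totals[event.customer] += event.amount_cents
        # Evict the oldest row once the window is over capacity.
        if len(self.window) > self.size:
            old = self.window.popleft()
            self.totals[old.customer] -= old.amount_cents
            # Drop the group entirely when none of its rows remain in the window.
            if not any(e.customer == old.customer for e in self.window):
                del self.totals[old.customer]
        return dict(self.totals)
```

Each call to on_event returns the current result set, so the answer is updated per event instead of being recomputed by a scan over stored rows.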
Data warehouse analysis involves aggregating information along a variety of axes as well as inverting relationships in the data. The goal is to provide the business with different perspectives on what the customers are doing. In order to do this, data is loaded into the warehouse periodically. Typically, daily ETL processes are run against the production databases to keep the warehouse fresh. This process, though, has a couple of issues beyond the cost of the warehouse infrastructure. First, the ETL places a significant load on your production databases. If your business has nice offline windows for the ETL, that's great, but if not, managing the scale becomes a challenge. Second, the freshness of the warehouse is typically 24 hours behind or more. As your business grows this lag will grow as well.
ESPs address this by analyzing changes to your data as they occur. Rather than running batch ETLs, you stream business events as the state of your data changes. This creates a more manageable scaling model for your production system, because the business analytics extracts are spread throughout the transaction day. ESPs can also be horizontally scaled, providing a more cost-effective solution for your business. And since the ESP performs the analysis in real time, the business metrics can be current and remain that way as the business grows.
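As a rough illustration of streaming events at the moment state changes, the sketch below publishes an event on every write and keeps a metric up to date incrementally. Everything in it (EventBus, RevenueByRegion, place_order, the event fields) is a hypothetical in-process stand-in, not a real messaging or ESP API. The batch_revenue function is included only to show the full scan that a batch extract would need to arrive at the same numbers.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process stand-in for a real event transport."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, event):
        for handler in self._subscribers:
            handler(event)


class RevenueByRegion:
    """Metric maintained incrementally, one event at a time."""

    def __init__(self):
        self.totals = defaultdict(int)

    def __call__(self, event):
        if event["type"] == "order_placed":
            self.totals[event["region"]] += event["amount_cents"]


def place_order(orders, bus, order_id, region, amount_cents):
    """Write to the 'production' store, then emit the matching business event."""
    orders[order_id] = {"region": region, "amount_cents": amount_cents}
    bus.publish({
        "type": "order_placed",
        "order_id": order_id,
        "region": region,
        "amount_cents": amount_cents,
    })


def batch_revenue(orders):
    """Batch-style equivalent: a full scan of the production store."""
    totals = defaultdict(int)
    for row in orders.values():
        totals[row["region"]] += row["amount_cents"]
    return dict(totals)
```

The per-event work is small and happens alongside each transaction, which is the sense in which the extract load is spread across the day rather than concentrated in a nightly window.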
Does this spell the end of data warehouses? Well, maybe, but there is one challenge with the ESP approach. While it is able to provide analytics cost effectively, it does not provide the ability to perform historical analysis. If you know what you want, then ESP will deliver the results from the current point in time forward. But what if you want a different perspective on your business activity and you want it over the past 3 months? One solution is to create a framework for capturing and replaying transactions, but this can be expensive. This becomes a matter of deciding the business value of performing the historical analysis.
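The capture-and-replay idea can be sketched as an append-only log plus a query that is defined after the fact. This is only a toy under stated assumptions: EventLog, units_by_product, and the event fields are invented names, and a real framework would need durable storage, which is where the expense comes from.

```python
from datetime import date


class EventLog:
    """Hypothetical append-only capture of every event, kept for later replay."""

    def __init__(self):
        self._events = []

    def append(self, when, payload):
        self._events.append((when, payload))

    def replay(self, since):
        """Yield captured events dated on or after `since`, in arrival order."""
        for when, payload in self._events:
            if when >= since:
                yield payload


def units_by_product(events):
    """A 'new perspective' written after the events were captured."""
    totals = {}
    for event in events:
        totals[event["product"]] = totals.get(event["product"], 0) + event["units"]
    return totals
```

Running units_by_product(log.replay(since=some_date)) answers a question over past activity that nobody thought to ask when the events first flowed through.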
Whether you choose to use a data warehouse or not, ESP is definitely worth investigating as a way of delivering business analytics more cost effectively.
I think another use of ESP could be to replace ETL while keeping the data warehouse intact. This would remove the ETL-related bottlenecks while still preserving the ability to do historical analysis.
Posted by: Tahir Akhtar | Tuesday, September 23, 2008 at 01:21 AM
I think this solution could have some performance issues.
While any change to database should be intercepted...
Is there any reference for further study...?
Posted by: Farshid Zaker | Saturday, November 01, 2008 at 01:48 AM
Some feedback on your post:
"Unfortunately, there are analytical problems that need a holistic view of data. This is very typical of data warehousing applications. As a result, data warehouses are expensive, often out of the reach of smaller organizations."
Having a holistic view of data does not mean that a data warehouse has to be expensive and out of reach of smaller organizations. With a little bit of knowledge, some good books and open source software, even small organizations can build a functional data warehouse on a limited budget.
"ETL places a significant load on your production databases."
If it does then your ETL process is designed poorly. ETL should occur in a staging environment independent of the production environment. Most databases support online replication which can provide an easy means for keeping a staging environment in sync and ready to be processed.
"As your business grows this lag will grow as well."
Yes, this is an issue with ETL systems that rely on bulk processing. However, ETL has been evolving and is moving towards extracting relevant information from transaction logs and only processing deltas, which greatly reduces latency.
"While it is able to provide analytics cost effectively, it does not provide the ability to perform historical analysis"
This is a pretty big deficiency but as you say there are ways around it. The expense of recreating the transaction logs is a one-time cost and can be done in non-production environments.
My biggest issue with your post, though, is that it glosses over the usability issues that come with trying to work directly with operational data. Data warehouses are about more than just optimizing for analytics from a technical perspective; they are also about creating an easy-to-understand schema that business folks can use directly in a self-service fashion. This means converting normalized operational data into denormalized, business-process-oriented structures. Unless your event processor is doing that, you are missing one of the greatest advantages of data warehouses when they are done right.
Posted by: Anthony Eden | Tuesday, November 04, 2008 at 12:45 PM