domingo, dezembro 16, 2007

Dimensional Relational vs. OLAP: The Final Deployment Conundrum

The choice between deploying relational tables or OLAP cubes is not a trivial matter. Weigh these 34 pros and cons of each approach early in the design of your extract-transform-load system.


By Ralph Kimball
Abril 27, 2007

Ralph Kimball Ralph Kimball

It has become fashionable to regard the final deployment step of building an extract-transform-load (ETL) system as a mere tactical choice between delivering dimensional relational tables or OLAP cubes into the end-user environment. But is this choice quite so superficial? Should we defer thinking about this choice until just before roll-out? In this article we take a hard look at this final deployment conundrum and urge you to resolve this question very early in the design process Business intelligence (BI) developers have largely accepted the premise that data is most user-friendly when delivered in a dimensional format. If you define the data warehouse as the platform for all forms of BI, then the last step of the ETL process in the data warehouse is exposing the data in a dimensional format. Many BI developers have recognized that a properly designed set of dimensional relational tables can be transformed into an OLAP cube in a virtual one-to-one mapping. For various reasons explained below, I recommend that all OLAP cubes be built directly from dimensional models. The dimension tables in such a relational schema become the OLAP cube dimensions, often referred to as the edges of the cube. The fact tables from the relational schema provide the content of the OLAP cube cells.

While there can be some small semantic differences between the relational dimensions and the OLAP cube dimensions, the profound overlap between these two approaches has made it tempting to regard the final deployment choice as a tactical maneuver executed at the very end of the data warehouse development. Worse yet, the argument is sometimes made that BI applications can be switched between relational and OLAP implementations because of this similarity.

Not so fast! Under the right circumstances, the ETL pipeline can be mostly insulated from the final deployment choice, but the relational vs. OLAP choice is multi-faceted decision with lots of issues to consider. Let's look at the advantages and disadvantages of both choices before we jump on either bandwagon.

    Dimensional Relational Advantages

  1. Relational database structures are largely vendor independent and especially dimensional structures are pretty easily ported. However, ETL command line scripts and proprietary code like PL/SQL may not be very portable.
  2. All the main relational DBMS's have high-volume bulk data loaders, and they're especially effective if you turn off transaction logging.
  3. A wide variety of SQL-generating tools from many vendors are able to directly access the data. Usually these tools can be repurposed to point at a new relational DBMS.
  4. SQL expertise is widely available in the marketplace, since the main features of SQL have long been standardized, and SQL is routinely taught at the college level.
  5. Hand-coded SQL is generally readable, although the SQL emitted by high-end BI tools is overwhelming and cannot reasonably be changed by the BI applications development staff.
  6. There are many different ways to control DBMS performance, including schema designs, indexes, aggregates and materialized views. Dimensional relational structures, because of their predictable characteristics, have well-understood performance tuning techniques, although these techniques vary somewhat by vendor.
  7. Relational databases are extremely stable and are suitable for serious archiving and backup.
  8. Relational database structures are not vulnerable to catastrophic invalidation, unlike OLAP cubes that may involuntarily rebuild themselves if a SCD Type 1 change is made to a dimension.
  9. Database sizes are pretty much unlimited, and single dimensional fact tables in the multi-terabyte range are increasingly common.
  10. High-end relational databases can join many large tables.
  11. Hybrid (mixed workload) applications involving both querying and updating are usually easy to construct.

Relational Disadvantages

  1. SQL is a truly horrible language for powerful analysis and complex applications.
  2. SQL is severely asymmetric: you can perform complex constraints and calculations generally only WITHIN records, not across records.
  3. In spite of performance tuning capabilities, it is still too easy to lose control of performance.

OLAP Advantages

 

  1. Generally much better performance than relational when the cubes are designed correctly, with less need to play complex performance tuning games compared to relational.
  2. Much more powerful analytic capabilities than relational. MDX, for example, has more powerful semantics for traversing ragged unbalanced hierarchies, such as organization charts, than does SQL.
  3. Vendor-supplied OLAP tools for reporting and querying have historically been superior to relational tools, although I'm impressed by the steady investment relational tool vendors have made in improving ease of use and feature sets.
  4. OLAP doesn't suffer from the symmetry problem that limits SQL.
  5. The decision to load OLAP directly from dimensional relational tables doesn't affect the ETL backroom very much: the deployment step occurs at the very end.
  6. Certain loading scenarios can be very fast.
  7. Certain vertical industries, especially financial services, have developed awesome OLAP solutions.
  8. OLAP encourages more complex security scenarios, especially for ad hoc access. By comparison, it's difficult to set up a relational database to protect detailed data (sales by sales rep), but provide more open access to summarized data (sales by region). This is especially true for ad hoc access on the relational side. Security is significantly more powerful on OLAP because of the semantics about parents and children inherent in the access languages.

OLAP Disadvantages

  1. The big objection to OLAP is the proprietary, non-standard character of each vendor's OLAP offering.
  2. Don't expect to port an OLAP implementation to another vendor's product. Discard everything including all your application development
  3. There is no accepted, universally implemented access language for OLAP, although Microsoft's MDX is the closest thing to a standard access language. Significantly, Oracle has not embraced MDX, preferring to rely on SQL for all forms of database access.
  4. MDX in its full glory may be too complex for IT personnel to write by hand, or understand a complex application. But to be fair, I have looked at incomprehensible SQL many times!
  5. Historically there has been much less industry expertise in MDX than in SQL although Microsoft has successfully sponsored a cottage industry of training organizations.
  6. OLAP application development expertise is fragmented by vendor.
  7. OLAP cubes can be catastrophically invalidated if you are not careful, i.e. a Type 1 SCD change to a dimension may cause all the cubes using that dimension to rebuild!
  8. OLAP cubes are not considered stable enough for serious archiving and backup: this is a strong reason for creating a set of dimensional relational tables duplicating the content of the cube for those purposes.
  9. OLAP vendors have certain size limits not present in relational implementations including possibly the number of members in a dimension, the number of distinct values at various levels in a hierarchy and the overall size of the cube.

When you must rebuild a cube, it may be a VERY time consuming process.

Equally Easy in Either Approach

The basic foundations of dimensional modeling, including slowly changing dimensions, the three fundamental grains of fact tables, and achieving integration through conformed dimensions, are equally easy to deliver with either approach, especially since these dimensional structures are carried to the very last step of the ETL pipeline.

    Making the Final Choice

    So how do senior management and the enterprise BI system designers resolve the final deployment conundrum: Dimensional relational or OLAP? As I hope you appreciate, there is no slam dunk answer since there are significant advantages and disadvantages for both approaches. But let's consider two extremes. If you are a large distributed enterprise with a number of different database vendors and you are struggling to establish more commonality across your BI deployments, as well as creating an enterprise wide pool of BI development expertise without being beholden to any single vendor, then I recommend dimensional relational. On the other hand, if you are looking for the most potent single-subject solution with high performance and killer analytics, and you are confident that you can source the development expertise you need, then I recommend OLAP supported by underlying dimensional relational schemas. Otherwise, it depends.

    Ralph Kimball, founder of the Kimball Group, teaches dimensional data warehouse and ETL design through Kimball University and reviews large warehouses. He has four best-selling data warehousing books in print, including The Data Warehouse ETL Toolkit. Write to him at ralph@kimballgroup.com.

    Nenhum comentário: