Friday, June 28, 2013

OLAP with Mondrian

If you ever wanted to do OLAP analysis, first thing that would needed to be done is selection of the tools. There are several good tools out there but I would like to write a little bit about Mondrian OLAP as it comes in form of open source community edition and commercially supported project.

Pentaho Analysis Services Community Edition also known as Mondrian, is an Online Analytical Processing (OLAP) server that enables business users to analyze large quantities of data in real-time. The need to analyze the data comes from data feeds that are detrimental for successful planning and strategies when making decisions about past, present and future sales/expenditures. A great introduction to Mondrian can be found on Pentaho's site so I will skip some of the explanation and get to the point of describing why I liked this product.

To begin with, Mondrian concept is based on ROLAP which means that it draws its data from relational database and it uses relational model for its operations. A good comparison between different concepts can be found here (MOLAP, ROLAP, And HOLAP). To draw on one segment of description, ROLAP methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

A few other concepts that you might want to understand before using Mondrian are:

  1. You will use MDX query language to work with your data. Multidimensional Expressions (MDX) is a query language for OLAP databases, much like SQL is a query language for relational databases. It is also a calculation language, with syntax similar to spreadsheet formulas. You can find a few online tutorials to learn about it and if you have tools like JPivot or Pivot4J, you will only need a starting MDX query where olap4j and JPivot will build up queries based on your navigation through UI.
  2. olap4j is a common API for any OLAP server, so you can write an analytic application on one server and easily switch it to another. Built on that API, there is a growing collection of tools and components. If you ever wanted to write a custom API do work directly with Mondrian, you would use this library.
  3. Mondrian schema is a single XML file where you will create OLAP constructs like Cubes or Dimensions. You will load the schema when connecting to Mondrian and you will write your MDX queries to be executed against this schema.
  4. Mondrian comes with two tools, Mondrian Schema Workbench, that you will use to build schema, and Aggregation Designer, that you will use to create aggregate tables based on your MDX query.
  5. There are two front ends for Mondrian, JPivot (older, fast and more stable) and Pivot4J (newer, with JSF, but slower).
Edit: I did view Pivot4J and Analytics UI as one and that would not be completely fair to say that Pivot4J core is slower than JPivot as it is comprised from two parts. Anyway, many thanks to Xavier who is lead on the Pivot4J project for explaining the difference and for quickly addressing some of the issues that I had. I feel that Pivot4J will be great product considering that it started less than a year ago.

Schema Workbench JPivot Table Drill Through Table

Now that we understand some basic concepts, here are a couple of pointers on how to start building your OLAP solution.

First thing is that you will probably want to create a separate web project just for OLAP purpose as depending on your facts table size, you might need a lot of memory. Mixing Mondrian with your application can unnecessarily slow things down. Mondrian (used with Java 5 and up) has the ability to report high memory consumption and gracefully shut down the query before exception is thrown (well, exception is still thrown but not OutOfMemoryError exception).

You can create link back to the OLAP servers and integrate them using IFrames or popups, for example (making sure that you have SSO, of course). If you are using JPivot I would not worry too much about integration with JSF as it comes with WCF (Web Component Framework). I found it a bit difficult to integrate with JSF anyway and hence the need for new project Pivot4j with full JSF support.

Next item on the list is creating a solid fact table. Mondrian schema will use one fact table per Cube. Designing a good fact table means creating proper foreign keys and measure columns and assigning adequate indexes. It also means thinking about implementation and implementing, for example, Star or Snowflake schema. Generally, having too many branches of tables will be more complex to work with (two tables per dimension should be good enough, and if you have more, just create views as needed) as modeling in current workbench will require you to work directly in XML (more than one join). Fact table should have basic dimensions where you will then create hierarchies and levels. Mondrian schema is pretty flexible, so it can work with tables and views, and allows for many complex constructs, but for maintenance purpose, aggregating data to appropriate level will prove easier to design and faster to use. Aggregation Designer is used for that purpose. For OLAP, speed and flexibility are important concepts and we should try to optimize things as much as possible.

If we take FoodMart example, we can analyze a schema and potential usage of data. So, lets examine Sales Cube for example. If we open FoodMart.xml in Schema Workbench we can then expand the Cube Sales node.

We will find following sub-nodes:

  1. Fact Table - with several aggregate commands defined. This will shape the fact table to desired cube. 
  2. Dimension Usage - these are dimensions that you will share between cubes and this tag allows you to construct dimensions outside of the cube and reference it in the cube. This increases re-usability.
  3. Measures - like sales, cost etc. These are going to be numbers that you will work with. This will give you different perspective on performance using different hierarchies/dimensions combinations.
  4. Calculated Memebers - these do not exist in the fact table but are calculated from existing measures.

If we get into one of the dimensions (e.g. Promotions), we will find that each dimension has link to fact table using a foreign key column. Usually, dimension will have only one hierarchy unless we can link directly to two or more tables from the fact table foreign key columns. Please note that we can also create a dimension from the fact table if more data is stored in the fact table (other than measures and keys). Each hierarchy has a primary key that links this hierarchy with a dimension. Dimension foreign key matches hierarchy table primary key. Each hierarchy must have a table, view or a join. Each join can be nested e.g. (T-J(T-J(T-T))). Each hierarchy has to have one or more levels. Each level can have property (like employee id, has name which is property). Each level represents one column (unless it has properties which can expand this). You will need to specify a column type. When using joins, you can create aliases for easier navigation. For usage on many other XML elements, you can check this page.

Sample MDX:
select {[Measures].[Unit Sales], [Measures].[Store Cost], [Measures].[Store Sales]} ON COLUMNS,
  Hierarchize({([Promotion Media].[All Media], [Product].[All Products])}) ON ROWS
from [Sales]
where [Time].[1997]

When starting to work with a schema, you will be able to put dimensions on columns and rows in addition to the measures. Once you start drilling into dimensions, a different breakdowns will appear as a combination of different levels and hierarchies. With each level you can get different measures. You will be able to change positions on rows and columns until you are happy with the data presented.

Designing meaningful information for your business is a key to success. You should be able to use this information to better understand current sales and predict future trends. Once you are at acceptable level you can drill down into data where Mondrian will display appropriate detailed information based on selected levels and slices (time in this example). This would be a report level. To achieve this level of flexibility without an OLAP, you would need to build many different reports.

Mondrian also has the ability to cache the data and this cache is maintained per connection. There are many options that you can set using mondrian.properties file which can greatly improve the performance.

I hope that this gives you a brief overview of what you can accomplish with the Mondrian. If you are looking for a more detailed explanation, please reference Pentaho documentation as I found it very extensive and helpful.

If you have any suggestions, please send me a note.

Regards

3 comments:

  1. Hi, I'm the developer of the Pivot4J project, and find your post very intuitive.

    By the way, it came to my notice that you seem to have a performance issue with my project.

    I'd appreciate much if you could visit our project page at GitHub and post an issue there.

    I believe for the most cases it should perform not much worse than JPivot, so I'd like to further investigate the issue for you and our future users if possible.

    Thanks for the nice article, and mentioning of my project. It really helps us to move forward!

    Regards,

    Xavier Cho

    ReplyDelete
  2. Hi Xavier,

    Thanks for the comment. Glad to know that people are reading my blogs :)...

    I found that your tool is fantastic successor for JPivot but yes, I had some issues with performance. I will be glad to post my observations on GitHub and hopefully I will be able to use the Pivot4J as is as I found it to be very intuitive.

    Regards

    ReplyDelete
  3. Nice write up for a starter in Open Source OLAP like me. Have been using SSAS till now.

    ReplyDelete