Sunday, September 30, 2018

Spring Boot (1.5) OAuth2 Server in Enterprise environment

The problem

If we want to have an array of microservices and support user interaction through delegated authorization, this implementation would be one of the options to consider or at least review. We have to first understand the differences between OAuth2 and e.g. OIDC before we continue explaining how to achieve OAuth2 implementation in Spring Boot in a way that is stateless and integrated so many microservices can rely on this implementation.  OAuth 2.0 is the industry-standard protocol for authorization where OpenID Connect 1.0 is a simple identity layer on top of the OAuth 2.0 protocol. It allows Clients to verify the identity of the End-User based on the authentication performed by an Authorization Server, as well as to obtain basic profile information about the End-User in an interoperable and REST-like manner. This does not mean that OAuth2 cannot do authorization, it is just that OIDC is better suited for this work.

We are going to focus on OAuth2 implementation in this article. There are several good references for reading that you can look for before venturing to create your own implementation (one, two, three, four, five, six). My reasons to do this were several, like not having OAuth2 service available, need for microservices in our architecture, uncertain structure of the client-server architecture and several other. Since we already worked with Spring Boot implementing this solution was the next best thing we could do to move our project further towards our goals.

There are two ways we can handle tokens in OAuth2 and those are plain token and JWT token. For our purposes we chose plain token implementation. The difference was that plain token needs to be verified by OAuth2 service every time it is accessed and JWT can be stored in the resource and verified by the public keys provided. Either way we can have a working solution but implementing plain tokens was a simpler and faster way to go.

Our requirements were to provide solution for the stateless authentication/authorization for the we client, server had to have a small footprint, had to be scalable, we had to see the tokens/users that we generated, we had to have revoke token capability, we had to provide automated solution to obtain the token to integration testing, microservices had to be able to both authenticate themselves and take user authenticated tokens, we had to have ability to connect to LDAP or database and ability to later support SSO from third party provider. There is always an opportunity to implement different solutions but this was something that could potentially play well into the future banking architecture and plans.

Server configuration

To start off we chose Spring Boot OAuth2 and created spring boot application. There were several configurations that we needed to implement to make our server to perform authorization and authentication.

  • WebSecurityConfigurerAdapter (to define /login, /logout, swagger, filters - we also can use @EnableOAuth2Client to additionally configure SSO client)
  • GlobalAuthenticationConfigurerAdapter (to define user details service, BCrypt password encoder and init method to distinguish between LDAP and database source). This adapter was needed as there are several filters to read users depending on the flow invoked. 
    • auth.ldapAuthentication() was starting point for LDAP
    • auth.userDetailsService(...) was starting point for user details service and password encoding
  • LdapAuthoritiesPopulator bean for LDAP custom authorities (used repository to load authorities based on user authentication)
  • AuthorizationServerConfigurerAdapter (to define OAuth2 server infrastrusture, including custom SQL queries (as we needed DB access for stateless solution across servers). This included tables like oauth_access_token, oauth_refresh_token, oauth_code and oauth_client_details. Tables are used depending on the flow invoked. Action involved overriding TokenStore, ClientDetailsService, AuthorizationCodeServices, configure(AuthorizationServerEndpointsConfigurer endpoints), configure(AuthorizationServerSecurityConfigurer security), DefaultTokenServices and configure(ClientDetailsServiceConfigurer clients) - with @EnableAuthorizationServer.
  • ResourceServerConfigurerAdapter (to define adapter that will server as entry point and configuration for any custom APIs) - with @EnableResourceServer.
  • We also needed to expose API for the user verification where we publish Principal object (this will be used by the microservices to obtain user details)
It is very important to note that adapter ordering is extremely important and that you may loose a lot of time investigating why something is not working just because of this. The order (lowest to the highest) should be Web (needed for authorization_code and implicit and stateful - because of the login page and authentication) -> OAuth2 (needed for all grants but stateless for password, refresh_token and client_credentials grants) -> Resource (needed for APIs).

We opted to use authorization_code without secret and refresh token for user authentication, client_credentials for the server and microservices that needed to authenticate themselves and password grant for the integration test cases (as this is the easiest way to obtain the token for specific user). Our client_credentials added a default role for the client and the rest added a default role for the user. This way, every authenticated client/user will have a default role to start with. This was a sure way to distinguish between human user and server API. The one problem that we still needed to solve for is propagation of the tokens in the layers of services. It is not a good practice to propagate same token between different horizontal layers due to the loss of the identification of the service doing the authorization.

Client configuration

Before we start with the microservices it is important to say that all services are defined as resources (using @EnableResourceServer annotation). This automatically means that we can for one identify a resource and enable its usage to the client for the OAuth2 configuration, and second, we can setup verification URL for the token. In order for any microservice to identify itself, we have two options for the configuration. First in the application.yml and the other programmatic (in case we need to obtain the token itself). For the first option it is useful to use it on any service that needs to implement verify_token, that is, whenever we receive the token our API will send the request to validate the token and populate user details in the SecurityContext in the spring. This is achieved in the security.oauth2.client and security.oauth2.resource entries. There we have to specify our given client id, secret, verify_token URL, resource id, user details URL and a few other parameters. The second option is to obtain token programatically, for example in the spring integration layer, where declarative approach might be difficult, non existent or depending on the extensive logic. In this case the approach is to create a client code that is annotated by the @EnableOAuth2Client. In our case this was done in the ClientHttpRequestFactory. Obtaining the token is achieved by the OAuthClient and OAuthClientRequest to our authorization server using client_credentials grant. 

In the end the goal we had was achieved and all flows are now functional and serving the purpose. The next thing we need to worry about is switching the framework to the OIDC coupled with OAuth2. This will, perhaps, be a good topic for one of the next blogs.

Sunday, April 15, 2018

MyBatis Paging, Sorting and Filtering

When considering which database persistence framework to use to complete the goal of development, we must take in consideration several factors:
  • Knowledge of your team and ability of the team leads to help out with problems
  • Available documentation
  • Maturity of the framework
  • Availability of the helper classes or supporting frameworks (e.g. how much custom functionality we need to create)
  • Underlying structure of the database and it's complexity (if exists)
  • Portability (if required)
  • "Top down" vs "bottom up" approach driven by either business requirement or existing technology
  • Whether we want or have to write SQL statements and how complex they need to be to fulfill the business goals
  • Do we "own" the data model or is this vendor maintained data model
As you can see, reaching the decision of what to use can be based on the  experience or trial and run errors.

In one of my recent projects, we reached the decision to use MyBatis as the framework that will enable us to fulfill most of the goals set upon us by the business and existing applications and database topology. In the enterprise environment, you may be faced with the decisions that span not only to your immediate application, but that may involve several others and their data models. We needed to do just that. Read data from the various data sources, integrate with several different web services (both REST and SOAP) and provide unified DB and API interface (facade). This interface, needed to bring web services and various databases together to work as one and with improved performance. Introducing ESB layer was needed but also we needed to create uniform data model (that we can model our facade objects on). MyBatis was perfect tool for this. The only problem we faced was the lack of dynamic paging, sorting and filtering (PSF) functionality, given that we needed to combine results from the different databases (where some of them had awkward design, to say the least). Hibernate was dead in the water here. We ended up using PL/SQL and SQL from various sources, using pipelines and extensive logic to bring order into the data models. This solution worked very well, and in the end, we only needed to implement and expose the Search + PSF to the API and clients. We chose this supporting framework to help us build dynamic additions to the query: squiggle-sql. In their own words:
Squiggle is a little Java library for dynamically generating SQL SELECT statements. It's sweet spot is for applications that need to build up complicated queries with criteria that changes at runtime. Ordinarily it can be quite painful to figure out how to build this string. Squiggle takes much of this pain away.
 This worked perfectly as we could expose REST API JSON with PSF parameters and reuse this through various interfaces. To avoid SQL Injection (as we were building these custom), we needed to use enumerations to match exact constructs, avoid certain risky operations like OR 1=1 or comments in filters, limit the field types and lengths. Overall, we achieved a good mix of security and usability with flexible interface.

We started first by defining the interface in JSON that must have Paging (or streaming), Sorting and Filtering functionality. Again, it is important to limit any functionality with Constants or Enums to make sure all constructs can exactly be matched to operations or underlying supporting Beans. This is important security feature.

Executing Paging in database is relatively straight forward in Oracle by adding this OFFSET <x> ROWS FETCH FIRST <y> ROWS ONLY. The other thing that is needed is to get total number of rows returned from database in order for GUI to calculate how many rows are present. This generally requires executing the statement second time without the row limiting clause or we could write a WITH statement in Oracle and execute once and link into broader query that can execute count on first occurrence and limit rows on second. Important constraint is that the page can start with 0 and up and you should not allow returning of huge pages. If you require a single result that has all records, this may be achieved by secondary API that is limited in use and users that can access it. The reason is that if parallel multiple request are executed on a huge underlying data set, it may lead on DoS type of attack.

Sorting can have multiple columns, so this means that we must match any columns for the sort with ascending or descending parameters and participating Bean properties. Sorting is added as ORDER BY clause. Important feature is to match ORDER BY clause by Enum and columns by underlying Bean. Bean that is exposed to the API should only carry fields that are necessary for API to function properly and to satisfy a business need. Any other database functionality regarding the tables should be hidden behind a services and transfer objects (ref). Order clause can be applied to the complex queries, e.g. multiple set joined by creating an encapsulating select statement around original request.

Filtering may be the trickiest one to implement due to wide range of criteria that can be applied. Same as before, using Enums to define operations and limiting search capability to the filter to use only single fields names and only AND operation is important for the aspect of security and speed. Allowing OR or free statement entry may be dangerous option for execution.

Overall, what we achieved is that now MyBatis has a potential to achieve some of the things that Hibernate has out-of-the-box.

Saturday, May 13, 2017

Spring Boot with JSF

I have open sourced an application, recently, that I created for my own time tracking and invoicing and published it on GitHub under the Apache 2.0 license. When I started working on it, I thought about which technology should I use and decided on Spring, more precisely Spring Boot, because that was something that is natural to me as I have used this for many years. But this would only serve a purpose for the middle tier and I still needed to think about what to use for the front end. Trend for the  Spring Boot is to use AngularJS or similar frameworks that would tie in it's capabilities and potentially be deployable on the platforms like Cloud Foundry or similar cloud based solutions. This is due to the fact that these are the stateless solutions, and due to their scalability and ease of deployment they are a good fit for the Spring Boot. However, once we start distinguishing between web applications and enterprise applications, depending on what they need to achieve, I believe that  JSF still plays a big role due to its robustness and number of out of the box components that we can use. PrimeFaces framework is something I have used extensively in the past years and it played well with older Spring implementations, but Spring Boot was not meant to work with this out of the box. I have found a project on the GitHub called JoinFaces to help me with this. JoinFaces incorporate best practices and frameworks from JSF world and allow it to work together with the Spring Boot. In an environment where we will need scalability and multiple server deployments, we would still need a solution to share sessions or create sticky client connections, but for my purpose, this was ideal. So here is the stack that I used:
  • JoinFaces
  • Spring Boot
  • Hibernate (with QueryDSL)
  • H2 database
  • BIRT reporting engine
These are the basic libraries that allowed me to create an easy deployable solution for the local computer or cloud based application depending on your need. This application still has only one deployable file, but we can easily abstract and create mid tier with business services if we needed micro services type of an app. So why use these components? Let's address my thinking behind each one of these:


This is the project that ties JSF to Spring Boot and it can be found on this location. In their own words: This project enables JSF usage inside JAR packaged Spring Boot Application. It autoconfigures PrimeFaces, PrimeFaces Extensions, BootsFaces, ButterFaces, RichFaces, OmniFaces, AngularFaces, Mojarra and MyFaces libraries to run at embedded Tomcat, Jetty or Undertow servlet containers. It also aims to solve JSF and Spring Boot integration features. Current version includes JSF and CDI annotations support and Spring Security JSF Facelet Tag support.

Since it includes PrimeFaces and this is my favourite JSF engine, it was perfect for what I was trying to do. It is also up to date and well maintained.

Spring Boot

Spring Boot is the new flavor and trend for the development if you prefer this framework. This is what Maven was to Ant when it came out. The goal here is to pre-configure as many parameters as possible and autodetect dependencies. The whole Cloud Foundry was created to make development and deployment as easy as possible in the enterprise environment. In their own words: Spring Boot makes it easy to create stand-alone, production-grade Spring based Applications that you can "just run". We take an opinionated view of the Spring platform and third-party libraries so you can get started with minimum fuss. Most Spring Boot applications need very little Spring configuration.

Spring in general was a revelation when it came out. It made developing in Java something that you can actually like. Using this as a wiring framework for the libraries is a really good fit in my opinion.

Hibernate (with QueryDSL)

Hibernate was always one of my favourite ORM frameworks. In the world where good programmers are difficult to find and where Java developers do not know or understand how to write a proper SQL, Hibernate offers to bridge this gap. Another good thing is that it will adapt to any supported database you can use. The bad thing is that you may not be able to use database specific things (e.g. hierarchy queries) or where you have complex and demanding queries. Hibernate also may not be a good option for optimisation of the complex queries. If your existing database model is also not 'by the book', you may experience additional problems. In their own words, Hibernate is: Hibernate ORM enables developers to more easily write applications whose data outlives the application process. As an Object/Relational Mapping (ORM) framework, Hibernate is concerned with data persistence as it applies to relational databases (via JDBC). 

You may fall back to JDBC from Hibernate in case you need to write database specific queries or write native type of queries.

QueryDSL is a good addition to JPA as it enables for an easy filtering and query structuring where this may be required. I found it to be very helpful. In their own words: Querydsl is a framework which enables the construction of type-safe SQL-like queries for multiple backends including JPA, MongoDB and SQL in Java. Instead of writing queries as inline strings or externalizing them into XML files they are constructed via a fluent API.

H2 database

I have gone through several databases (Derby, H2 and HyperSQL) for this project and H2 for me was easiest to setup, backup, run in shared mode and just support a lot from the SQL standard. Very fast to initialise and super easy to understand. It integrated well into the application and so far I have not seen any issues with it. In their own words: Welcome to H2, the Java SQL database. The main features of H2 are: Very fast, open source, JDBC API, Embedded and server modes; in-memory databases, Browser based Console application, Small footprint: around 1.5 MB jar file size.

BIRT reporting engine

For invoices, I required an open source reporting engine that was free. I looked into two: Jasper Report and BIRT. I started with Jasper as it seemed easier to integrate but once I started doing  development for the subqueries is when the things got 'interested'. They both have a good UI for designing reports but I found BIRT to be easier and faster to work with. The main difference is that Jasper enables absolute positioning of the elements and BIRT does not. With some planning, this really is not relevant. In their own words: BIRT is an open source technology platform used to create data visualizations and reports that can be embedded into rich client and web applications.

In the next few blogs, I will try to explain what I did, how and why. You may find the application that I called My Open Invoice on GitHub.


Wednesday, December 16, 2015

Books 2

Here is the new batch of the books that I have read and liked.

[Homeland] Cory Doctorow wrote a sequel to Little Brother. His protagonist, Marcus, continues the fight against the police state. If you enjoyed the first book, this is certainly going to entertain you. You can get the book here.

[Lost Symbol] The Lost Symbol is a masterstroke of storytelling that finds famed symbologist Robert Langdon in a deadly race through a real-world labyrinth of codes, secrets, and unseen truths . . . all under the watchful eye of Brown’s most terrifying villain to date. Set within the hidden chambers, tunnels, and temples of Washington, D.C., The Lost Symbol is an intelligent, lightning-paced story with surprises at every turn. You can get the book here.

[Inferno] Harvard professor of symbology Robert Langdon awakens in a hospital in the middle of the night. Disoriented and suffering from a head wound, he recalls nothing of the last thirty-six hours, including how he got there . . . or the origin of the macabre object that his doctors discover hidden in his belongings. Langdon's world soon erupts into chaos, and he finds himself on the run in Florence with a stoic young woman, Sienna Brooks, whose clever maneuvering saves his life. Langdon quickly realizes that he is in possession of a series of disturbing codes created by a brilliant scientist-a genius whose obsession with the end of the world is matched only by his passion for one of the most influential masterpieces ever written-Dante Alighieri's dark epic poem The Inferno. Racing through such timeless locations as the Palazzo Vecchio, the Boboli Gardens, and the Duomo, Langdon and Brooks discover a network of hidden passageways and ancient secrets, as well as a terrifying new scientific paradigm that will be used either to vastly improve the quality of life on earth . . . or to devastate it. You can get the book here.

[A.I. Apocalypse] Leon Tsarev is a high school student set on getting into a great college program, until his uncle, a member of the Russian mob, coerces him into developing a new computer virus for the mob’s botnet - the slave army of computers they used to commit digital crimes.

The evolutionary virus Leon creates, based on biological principles, is successful -- too successful. All the world’s computers are infected. Everything from cars to payment systems and, of course, computers and smart phones stop functioning, and with them go essential functions including emergency services, transportation, and the food supply. Billions may die.

But evolution never stops. The virus continues to evolve, developing intelligence, communication, and finally an entire civilization. Some may be friendly to humans, but others are not.

Leon and his companions must race against time and the military to find a way to either befriend or eliminate the virus race and restore the world’s computer infrastructure. You can get the book here.

[Influx] Are smartphones really humanity’s most significant innovation since the moon landings? Or can something else explain why the bold visions of the 20th century—fusion power, genetic enhancements, artificial intelligence, cures for common diseases, extended human life, and a host of other world-changing advances—have remained beyond our grasp? Why has the high-tech future that seemed imminent in the 1960s failed to arrive?

Perhaps it did arrive…but only for a select few. You can get the book here.

[Flash Boys] In Michael Lewis's game-changing bestseller, a small group of Wall Street iconoclasts realize that the U.S. stock market has been rigged for the benefit of insiders. They band together—some of them walking away from seven-figure salaries—to investigate, expose, and reform the insidious new ways that Wall Street generates profits. If you have any contact with the market, even a retirement account, this story is happening to you. You can get the book here.

[The Money Bubble] The US, Europe and Japan are making financial mistakes that will soon cause a crisis of historic proportions. This book explains those mistakes and the likely shape of the crisis, and offers advice to those hoping to protect themselves and profit from what's coming. You can get the book here.

[Beginning Python] is not a bad book to remind yourself of the basics. It is written for Python 2.4.

This tutorial offers readers a thorough introduction to programming in Python 2.4, the portable, interpreted, object-oriented programming language that combines power with clear syntax
Beginning programmers will quickly learn to develop robust, reliable, and reusable Python applications for Web development, scientific applications, and system tasks for users or administrators
Discusses the basics of installing Python as well as the new features of Python release 2.4, which make it easier for users to create scientific and Web applications
Features examples of various operating systems throughout the book, including Linux, Mac OS X/BSD, and Windows XP

This is it for now and if you liked my selection, please drop me a note or recommend another book.

Thursday, April 23, 2015

Aspose document generation


In today's business world reporting comes out as the end result of the functional application and it serves the purpose of giving introspection into the system, functionality, current and future needs and much more. Having a good system to produce reports is a challenging requirement as one has to balance functionality, requirements, price, potential support needs, easiness of use and widespread acceptance in development community, documentation, performance, interoperability etc. There is a huge potential of various products on the Internet when searching for a suitable framework to satisfy such needs.

During one of my last projects, we needed to accommodate several requirements in finding a product that can do it all :). I believe that we have come across a product that in our testing and POC proved to be a right choice. That product is Aspose.

Aspose is a leading vendor of .NET, Java, Cloud and Android APIs, SharePoint components and rendering extensions for Microsoft SQL Server Reporting Services and JasperReports. They provide various products for working with Word, Excel, Images etc. To make the process easier, Aspose offers a license for all of the products called Total license. The pricing is very competitive too.

The benefit of using the Aspose is that end reports or document templates can be designed by business users in form of Excel of Word document and Aspose can be used to programmatically populate these documents. Documentation is very good with a lot of examples. Support is done primarily through the forums. The response times for support requests at the time I was writing this were very good even for unpaid support.

Usefulness of the Aspose comes from the versatility of the product, in a way that we can for example, start working with Word and add Excel details or pull parts of it, convert and standardize both and produce combined PDF, and then insert images or graphs on the go. Product comes in a form of library and supports both Java and .NET worlds. Library use is straight forward and license is loaded through the code before invocation of the function calls. Documents can be processed from the files or from the streams (if you keep them in the database). Library supports multithreading and in our tests proved to be very consistent and without apparent memory leaks. Data population uses Mail Merge functionality from Word, and supports both plain fields and repeating fields (like tables). Nesting tables require some playing around design in Word documents but it is doable and result looks very good. The product can run in both Windows and *NIX world and having Word or Excel installation in not needed.

I also have to add that libraries can be converted into OSGi friendly libraries and they work quite well in OSGi container too, whether you package them as JAR dependency or expose them as an OSGi bundle.

When working with Aspose and Mail Merge, data stream is established through the java.sql.Statement and has to be passed like this. We have tested functionality on both Oracle and DB2 and both work really well. Minor overriding needs to be done in case of BLOB fields to accommodate output based on the client library that you want to use.

A sample generic application (you can find this code on Aspose site in various sections):

public class AsposeDemoImpl {

    private final static Logger logger = LoggerFactory.getLogger(AsposeDemoImpl.class);

    private DataSource dataSource;

    public DataSource getDataSource() {
        return dataSource;

    public void setDataSource(DataSource dataSource) {
        this.dataSource = dataSource;

    public String generateDocument(String document) {"Generate document service is starting");

        Connection connection = null;

        try {
            connection = dataSource.getConnection();

            Document doc = new Document("WordTemplate.docx");

            doc.getMailMerge().setFieldMergingCallback(new HandleMergeOracleClob());

            DataSet dataSet = new DataSet();

            ResultSet resultSet = createStatement().executeQuery(
                    "SELECT * from DUAL"

            DataTable s1 = new DataTable(resultSet, "LogicalNameInTemplate");





        } catch (Exception e) {
        } finally {
            try {
                if (connection != null && !connection.isClosed())
            } catch (SQLException e) {

        return "DONE";

    private Statement createStatement() throws Exception {
        return getDataSource().getConnection().createStatement();

    private Statement createStatementFO() throws Exception {
        return getDataSource().getConnection().createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);

    private Statement createStatementSI() throws Exception {
        return getDataSource().getConnection().createStatement(ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);

    private class HandleMergeOracleClob implements IFieldMergingCallback {

        public void fieldMerging(FieldMergingArgs e) throws Exception {
            if ("TEXTFEILD".equalsIgnoreCase(e.getFieldName()) && "TABLE".equalsIgnoreCase(e.getTableName())) {
                //TODO update content with text field - text is retrieved from BLOB

        public void imageFieldMerging(ImageFieldMergingArgs e) throws Exception {
            if ("IMAGEFIELD".equalsIgnoreCase(e.getFieldName()) && "TABLE".equalsIgnoreCase(e.getTableName())) {
                //TODO update content with image manipluation - image is retrieved from BLOB

Tuesday, November 18, 2014

ThinkPad t440s + Ubuntu

Several weeks ago I have purchased a ThinkPad t440s as I needed a new laptop for development. I was looking at several options including Carbon X1 and MacBook Pro. Having been using Ubuntu for quite some time I decided that I do not want to go with Mac. I also recently switched from iPhone to Nexus 5 so it did not seem like a good choice to me. And it was expensive. Between X1 and t440s, after some research, I found that t440s will suit my needs much better. I also use Lenovo at my work, but unlike that one (W520) which looks and feels like a big brick, t440s is a really elegant laptop. I was worried a bit that there might be some incompatibility with Ubuntu based on some reviews, but this was not the case. The main issue everyone was complaining about was the new touchpad. It supports multi gesture and it has entire area clickable. Ubuntu works almost perfectly with it (few glitches here and there but nothing major). In fact, after switching back to traditional one, it seems very unnatural to use the old one. You can do almost everything with one hand easily with the new touchpad.

I also upgraded my t440s with SSD hard drive and additional 8GB memory and added better wireless network card, but other than that, it was good to go. I opted to go with lower resolution monitor because I just like bigger letters on my screen as some of the applications do not scale well on high resolution monitors. Default Ubuntu installation took some time to organise to my liking, but after that it was just fantastic! Right now, startup time is ~10sec and sleep and hibernate work great! Two batteries last forever (6-7 hours) and they are very light. I also found this cool wallpaper to use from


Note: Now after 2 years of usage, I can say that everything stands for this laptop that I wrote except for the battery. Unfortunately, its capacity is down to 50% for both batteries and barely last 2.5 hrs.

Friday, June 28, 2013

OLAP with Mondrian

If you ever wanted to do OLAP analysis, first thing that would needed to be done is selection of the tools. There are several good tools out there but I would like to write a little bit about Mondrian OLAP as it comes in form of open source community edition and commercially supported project.

Pentaho Analysis Services Community Edition also known as Mondrian, is an Online Analytical Processing (OLAP) server that enables business users to analyze large quantities of data in real-time. The need to analyze the data comes from data feeds that are detrimental for successful planning and strategies when making decisions about past, present and future sales/expenditures. A great introduction to Mondrian can be found on Pentaho's site so I will skip some of the explanation and get to the point of describing why I liked this product.

To begin with, Mondrian concept is based on ROLAP which means that it draws its data from relational database and it uses relational model for its operations. A good comparison between different concepts can be found here (MOLAP, ROLAP, And HOLAP). To draw on one segment of description, ROLAP methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP's slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement.

A few other concepts that you might want to understand before using Mondrian are:

  1. You will use MDX query language to work with your data. Multidimensional Expressions (MDX) is a query language for OLAP databases, much like SQL is a query language for relational databases. It is also a calculation language, with syntax similar to spreadsheet formulas. You can find a few online tutorials to learn about it and if you have tools like JPivot or Pivot4J, you will only need a starting MDX query where olap4j and JPivot will build up queries based on your navigation through UI.
  2. olap4j is a common API for any OLAP server, so you can write an analytic application on one server and easily switch it to another. Built on that API, there is a growing collection of tools and components. If you ever wanted to write a custom API do work directly with Mondrian, you would use this library.
  3. Mondrian schema is a single XML file where you will create OLAP constructs like Cubes or Dimensions. You will load the schema when connecting to Mondrian and you will write your MDX queries to be executed against this schema.
  4. Mondrian comes with two tools, Mondrian Schema Workbench, that you will use to build schema, and Aggregation Designer, that you will use to create aggregate tables based on your MDX query.
  5. There are two front ends for Mondrian, JPivot (older, fast and more stable) and Pivot4J (newer, with JSF, but slower).
Edit: I did view Pivot4J and Analytics UI as one and that would not be completely fair to say that Pivot4J core is slower than JPivot as it is comprised from two parts. Anyway, many thanks to Xavier who is lead on the Pivot4J project for explaining the difference and for quickly addressing some of the issues that I had. I feel that Pivot4J will be great product considering that it started less than a year ago.

Schema Workbench JPivot Table Drill Through Table

Now that we understand some basic concepts, here are a couple of pointers on how to start building your OLAP solution.

First thing is that you will probably want to create a separate web project just for OLAP purpose as depending on your facts table size, you might need a lot of memory. Mixing Mondrian with your application can unnecessarily slow things down. Mondrian (used with Java 5 and up) has the ability to report high memory consumption and gracefully shut down the query before exception is thrown (well, exception is still thrown but not OutOfMemoryError exception).

You can create link back to the OLAP servers and integrate them using IFrames or popups, for example (making sure that you have SSO, of course). If you are using JPivot I would not worry too much about integration with JSF as it comes with WCF (Web Component Framework). I found it a bit difficult to integrate with JSF anyway and hence the need for new project Pivot4j with full JSF support.

Next item on the list is creating a solid fact table. Mondrian schema will use one fact table per Cube. Designing a good fact table means creating proper foreign keys and measure columns and assigning adequate indexes. It also means thinking about implementation and implementing, for example, Star or Snowflake schema. Generally, having too many branches of tables will be more complex to work with (two tables per dimension should be good enough, and if you have more, just create views as needed) as modeling in current workbench will require you to work directly in XML (more than one join). Fact table should have basic dimensions where you will then create hierarchies and levels. Mondrian schema is pretty flexible, so it can work with tables and views, and allows for many complex constructs, but for maintenance purpose, aggregating data to appropriate level will prove easier to design and faster to use. Aggregation Designer is used for that purpose. For OLAP, speed and flexibility are important concepts and we should try to optimize things as much as possible.

If we take FoodMart example, we can analyze a schema and potential usage of data. So, lets examine Sales Cube for example. If we open FoodMart.xml in Schema Workbench we can then expand the Cube Sales node.

We will find following sub-nodes:

  1. Fact Table - with several aggregate commands defined. This will shape the fact table to desired cube. 
  2. Dimension Usage - these are dimensions that you will share between cubes and this tag allows you to construct dimensions outside of the cube and reference it in the cube. This increases re-usability.
  3. Measures - like sales, cost etc. These are going to be numbers that you will work with. This will give you different perspective on performance using different hierarchies/dimensions combinations.
  4. Calculated Memebers - these do not exist in the fact table but are calculated from existing measures.

If we get into one of the dimensions (e.g. Promotions), we will find that each dimension has link to fact table using a foreign key column. Usually, dimension will have only one hierarchy unless we can link directly to two or more tables from the fact table foreign key columns. Please note that we can also create a dimension from the fact table if more data is stored in the fact table (other than measures and keys). Each hierarchy has a primary key that links this hierarchy with a dimension. Dimension foreign key matches hierarchy table primary key. Each hierarchy must have a table, view or a join. Each join can be nested e.g. (T-J(T-J(T-T))). Each hierarchy has to have one or more levels. Each level can have property (like employee id, has name which is property). Each level represents one column (unless it has properties which can expand this). You will need to specify a column type. When using joins, you can create aliases for easier navigation. For usage on many other XML elements, you can check this page.

Sample MDX:
select {[Measures].[Unit Sales], [Measures].[Store Cost], [Measures].[Store Sales]} ON COLUMNS,
  Hierarchize({([Promotion Media].[All Media], [Product].[All Products])}) ON ROWS
from [Sales]
where [Time].[1997]

When starting to work with a schema, you will be able to put dimensions on columns and rows in addition to the measures. Once you start drilling into dimensions, a different breakdowns will appear as a combination of different levels and hierarchies. With each level you can get different measures. You will be able to change positions on rows and columns until you are happy with the data presented.

Designing meaningful information for your business is a key to success. You should be able to use this information to better understand current sales and predict future trends. Once you are at acceptable level you can drill down into data where Mondrian will display appropriate detailed information based on selected levels and slices (time in this example). This would be a report level. To achieve this level of flexibility without an OLAP, you would need to build many different reports.

Mondrian also has the ability to cache the data and this cache is maintained per connection. There are many options that you can set using file which can greatly improve the performance.

I hope that this gives you a brief overview of what you can accomplish with the Mondrian. If you are looking for a more detailed explanation, please reference Pentaho documentation as I found it very extensive and helpful.

If you have any suggestions, please send me a note.