The process of normalizing the source system data into a Spider Schema is very similar to normalizing the data for the Star Schema. The best way to demonstrate this is to follow through an actual de-normalization of source system data into the Spider Schema while pointing out the differences between it and the traditional Star Schema.
Our sample project will have an imaginary source system we will call “System X”. System X represents an Insurance Companies ERP system, which allows them to issue Policies to their Customers, and track Claims. For those who are not familiar with Insurance based nomenclature I will add in my understanding of each topic as we go along.
When Insurance Companies issue Policies, it is typically for a Risk (driver of the car), and that Risk is typically covered by a number of Coverage’s. For example the Risk might be John Smith, who owns a 1966 Corvette, and is covered from Un-Insured drivers, Liability, Comprehensive and Collision. Additionally, since John Smith has a very clean driving record, with no accidents or traffic violations for the past 5 years, he qualifies for the Safe Driver Deduction. Lastly each Policy is not for an indefinite amount of time. So then each Policy is good for a Policy Term, or amount of time.
To summarize, we have source system data that contains Policies for a Policy Term, which are issued for Risks, and those Risks are covered by any number of Coverage’s, of which some Risks qualify for Deductions due to their driving history.
If you were going to normalize this into the Star Schema, you would likely end up with a Premium Fact Table (amount of money the customer paid for the Policy), and the following Dimensions: Policy, Policy Term, Risk, Coverage, and Deductions. This would leave us with a Fact Table with a Primary Key, the Foreign Keys from all the Dimensional Tables we created, and the Metric or Measure: Premium.
If you were going to normalize this into the Spider Schema, you would likely end up with a Premium Fact Table (amount of money the customer paid for the Policy), and the following Dimensions: Policy, Policy Term, Risk, Coverage, and Deductions. This would leave us with an Intermediate Dimension table with the Foreign Keys from all the Dimensional Tables we created, and a Fact Table with a Primary Key, a single Foreign Key reference to the Intermediate Dimension, and the Metric or Measure Premium.
Now let us assume that someday our Risk or Insured driver has an accident, and therefore a Claim is opened. We then need to include the Claim data into the Data Warehouse. For each Claim entered into the system, the Policy, Policy Term, Insured (Risk), Coverage’s in affect at the time, will be associated with the Claim. Although the Claim data will likely have its own Metric or Measure (Transaction Amount), and although each Claim will have other data associated with it such as an Adjuster, Claim Type and or Status, and the Claimant; the Policy information will also be associated.
Therefore if we created a data model for the Policy and Claim data for the Star Schema Data Warehouse, we would likely end up with a schema similar to this:
We created a new Fact Table called Risk Fact and added that to the Policy Data set since we have measures that have a different level of granularity for those metrics. Also we created a new Fact for the Claim Metrics. We also added in the Claim based Dimensions and associated that to the Claim Fact. However, the most important thing to notice is that all the Policy Dimensional Foreign Keys are repeated in the Policy, Risk and Claim Fact.
However, In the Spider Schema, we already have an Intermediate Policy Dimension, of which we can reuse the Primary Key of this for our Fact Table, instead of repeating the Policy Foreign Keys all over again. Additionally, we would then create a Claim Intermediate Dimension, as it applies to each Claim that is made in the source system. Here is what that data model would look like in a Spider Schema:
Notice how neither Fact Table has anything but the related Intermediate Dimension Foreign Keys in it. Additionally the Policy Intermediate Key is reused in the Fact Risk, and Fact Claim Tables. This means as data is added to the Data Warehouse in the Spider Schema, the Intermediate Dimension represents a set of logical data. Since that data is likely to be repeated later in the Data Warehouse with other data sets, it can be reused. It also means the data savings varies based upon the reuse of the common dimensions. Most importantly the data is no longer siloed, and it allows for an expansion of data naturally.
There are no known issues deploying the Spider Schema in a Data Warehouse or an OLAP (On Line Analytical Processing) software application. All reporting tools seem not only to accept the Spider Schema, but it seems to resolve a number of issues especially around Dimensional Relationships, in particular Hierarchies as they no longer are needed to allow members either within the same or other dimensions to be drilled through too.