dwSavvy ... platform for service oriented bulk data processing ...
Read about Service Oriented Bulk Processing  |  Try Live Demo!
Home > savvySolution
savvySolution
dwSavvy Solution centers around component re-usability and metadata driven ETL. The ultimate goals is to identify data integration patterns, apply best practices to them and build a data service to satisfy dynamically changing needs. Although, the specific solution described here works well for several data warehousing situations, note that not only can it be customized to your specific needs but can also be easily extended by adding new data services.

Core Services: Services like Clustering, fault tolerance, scheduling, process monitoring, process control, data QA, error handling, logging, archiving etc are fundamental to the platform. Core services are automatically layered upon any custom data service.

Data Services: Data profiling, dimensional processing, fact processing, net change processing, bulk loader, schema synchronization etc.

Visualize Solution: Yellow widgets denote data services.



Overview

dwSavvy comes with a sample data warehouse implementation. This architecture works well for most data warehousing situations involving high data volumes. Following were some of the key design consideration:

1. Data Quality: Verifying data at every step of the ETL hop is key to maintaining data quality
• savvyProfiler: Every extracted file/table can be profiled for anomalies. Data profiling activity is becoming quite popular to understand new data sets/sources — aids ETL development. Data profiling, in addition to being useful during the initial data analysis, can play a very important role in proactively monitoring data quality on an ongoing basis.

With savvyProfiler you could very easily profile the data as a part of your load cycle, ones your file/table metadata is loaded into the repository its only a matter of applying “profiling” (process) data service to a metaunit (file/table). savvyProfiler remembers all the historical profile results and provides a web based graphical UI for trend analysis. Also, you could choose which data elements to profile.

On an ongoing basis, it may be too much of an overhead to profile every data element within your data set. Typically, it’s a selected few columns that cause concern. With savvyProfiler, with only a few clicks, you could easily decide which columns to profile on a daily basis.

• Coming Soon: savvyProfiler will be able to raise alerts based on predefined tolerances. Apart from standard data profiling calculations — like count, sum, average, frequency distribution, standard deviation etc— you could provide custom data quality calculations as “clover expressions” or SQL statements for setting alerts as well as trending.

2. High Availability: Global economy is pushing data warehouses to be available 24 x 7
• Amount of data to be processed and loaded is getting larger and the load window getting smaller. Fortunately storage is becoming cheaper. dwSavvy has two mirrored areas (set A and set B) that can be switched with a “commit” data process. Basically, while users hit set A, ETL loads set B and vise versa.

3. ETL Server: dwSavvy is designed so most of the ETL, data aggregation and other data integration steps happens within itself
• Its better to take most, if not all, of the processing outside the database. This gives database uninterrupted bandwidth to service user queries.

Data Movement in dwSavvy

1. Extract: Extract all the data (files or tables) in the extract zone. Use CloverETL to create extract graphs to extract / move your data to the extract area — Directory path to this area can be configured at the load unit level. For several load units (data marts or tenants) different extract areas can be defined. If you have a need to automate your extraction process it can built fairly easily — please contact us.

2. Conform/Aggregate: Contains custom business rules to consolidate facts, dimensions and aggregates. This is the only portion that requires coding, everything else is done for you. A significant portion of your dwSavvy project time shall be spent developing this portion. Note that any portion of the project that you need reused across projects or customer base can be converted into a data service.

• Dimensions: Use CloverETL (Use their easy to use GUI) to build your consolidation graph. This graph shall read from various source files and spit out a consolidated dimension file that follows a pre-configured naming convention — default is mu_<dimensionName>_curr i.e prefix_metaUnitName_postfix (prefix and postfix are configurable at the load unit level. For instance, if my category dimension has mu_name = category, this file shall be called mu_category_curr). Data service Dim Versioning takes care of the rest:

• Dimensional metadata needs to be defined in the repository (Meta Unit and Meta Unit columns; meta unit type = dimension).

• Each column (Meta Unit Columns), apart from usual attributes like column name, data type, size etc. must have column tracking type (col_trk_type = type1 or type2). Ones these attributes are set all the versioning is done automatically — automatic history tracking for type 2 columns.

• Dimensional version can be using with either incremental or a full refresh file.

• Facts: Similar to dimensions, you need to build CloverETL graphs to consolidate all your facts, from various sources, into a single integrated fact format — this implements all your business rules but it doesn’t contain dimensional keys yet. The output fact file should follow a pre-configured naming convention (mu_<MU_NAME>_fact), as in the case of dimensions. Data service Fact Processing takes care of the rest:

• Fact metadata needs to be defined in the repository (Meta Unit and Meta Unit columns; Meta Unit Type = fact).

• Each FK into the fact table from dimensions should be defined as a child of dimensional PK.

• Data service Fact processing automatically replaces all the natural keys by surrogate keys from the dimensions.

• FK violation: The process creates a file with all the fact records where a corresponding dimensional lookup failed. The records don’t get rejected; corresponding FK is replaced by –1.

• Note that depending on the environment and requirements this behavior could be different. These records could be rejected too. This service can easily be customized for your specific needs.

• At this point you have all your dimensions and facts in the file system on the ETL server (dwSavvy server).

3. Load/Refresh: This phase of the warehouse is responsible for completing the data and metadata refresh in the database.
• Data service Sync Metadata is responsible for synchronizing metadata in dwSavvy to that in the database.

• Create any dimension, fact or any table that doesn’t already exist in the database.

• Alter any dimension, fact or any table that needs altered as a result of metadata change in deSavvy.

• Benefits:
1. Adding a new columns to an existing dimension or a fact is as simple as (1) add a new column to the meta unit in dwSavvy (GUI based) and (2) alter the consolidation graph to populate this new column.

2. Adding a new dimension or fact is similarly simple (1) add corresponding meta unit (2) add the corresponding consolidation graph and register it with dwSavvy (as a meta unit).

• Data Service Data Refresh is responsible for refreshing the database with the latest data warehouse load. It takes final dimension and fact files on the ETL server and syncs them up with the database.

• Data Service Data Commit is responsible for making the data refresh available to the user community. Basically, users access the data mart using a Data Mart Façade — database views. Current implementation of dwSavvy maintains 2 mirrored data marts, set-A and set-B. While the Façade points to set-A, set-B is used for data refresh and vice-versa. Data Commit data service should be the last process that gets executed in a batch, and should only be executed after all quality checks in your warehouse have validated the current load.

Data Quality in dwSavvy

Data Quality is central is to dwSavvy. Data Service Data Profiling can be activated at the meta unit level or column level for any file, table, dimension or fact. Note that data profiling is also bundled as an independent product. Statistics and profiling is a beautiful way of validating data quality. dwSavvy allows you to do this at the batch level, as well as across the batch for trending.






Home  |  Terms & Condition  |  Contact Us
email:  sales@dwsavvy.com  ,  services@dwsavvy.com  ,  support@dwsavvy.com