dwSavvy ... platform for service oriented bulk data processing ...
Platform Fundamentals
New to dwSavvy? Start Here ....

ETL Engine CloverETL: Behind the scenes, dwSavvy uses the Clover.ETL Java libraries for all its core data processing needs. Please refer to Clover’s site for details. An example Clover graph, taken from Clover’s site for reference, was shown here:

[Figure: example Clover graph, from Clover’s site]



Platform Independence: dwSavvy is platform and database independent and has been tested on both Windows and Linux.

Cluster: dwSavvy is a massively parallel, heterogeneous environment for running ETL jobs (Windows nodes can be mixed with Linux servers). The cluster is only a virtual concept; there is no “master”. The master concept applies only to the extent that one of the nodes (cluster members) needs to host the metadata repository (MySQL by default; any JDBC-compliant database can be used). If the database is viewed as an independent entity, dwSavvy is truly without a master: all the nodes run as peers. Each node needs the exact same binaries. Each node has a configuration file with connection information for the repository and self-identification information such as node name and IP address. Each node needs to be registered with the repository. A single cluster instance can be used for multiple data marts/tenants, with cluster capacity apportioned as appropriate (see Load Unit below).

Node: When registering each node with the repository, it is important to configure “maximum parallel jobs” and “total available memory”. The first parameter sets the maximum number of ETL jobs that can run in parallel on the node; the second tells the node how much memory in total is available for ETL jobs. Each node is designed to run as many ETL jobs in parallel as it can, and these parameters ensure maximum utilization of node capacity without the node getting bogged down. A node can be configured as either “Active” or “Inactive”. Only “Active” nodes participate in hosting ETL jobs. A node can be activated or inactivated dynamically, for whatever reason (maintenance, for example), even while the cluster is processing jobs. When a node is inactivated in the middle of processing jobs, it completes its current assignments before inactivating itself. This feature enables nodes to be added to or removed from the cluster on the fly without impacting the ongoing batch processing.
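The two node parameters and the active/inactive switch can be pictured as a simple admission check. The following is a minimal sketch under stated assumptions, not dwSavvy’s actual code: the class name NodeSketch, its methods, and the idea that each job declares a memory requirement in megabytes are all hypothetical.

```java
/** Hypothetical sketch of a node's admission check; not dwSavvy's real implementation. */
final class NodeSketch {
    private final int maxParallelJobs;   // the "maximum parallel jobs" setting
    private final long totalMemoryMb;    // the "total available memory" setting
    private int runningJobs = 0;
    private long usedMemoryMb = 0;
    private boolean active = true;       // "Active" or "Inactive"

    NodeSketch(int maxParallelJobs, long totalMemoryMb) {
        this.maxParallelJobs = maxParallelJobs;
        this.totalMemoryMb = totalMemoryMb;
    }

    /** Accepts a job only if the node is active and both limits still allow it. */
    synchronized boolean tryAccept(long jobMemoryMb) {
        if (!active) return false;                                     // inactive nodes take no new work
        if (runningJobs >= maxParallelJobs) return false;              // job-count limit reached
        if (usedMemoryMb + jobMemoryMb > totalMemoryMb) return false;  // memory limit reached
        runningJobs++;
        usedMemoryMb += jobMemoryMb;
        return true;
    }

    /** Releases the capacity held by a job; works even after the node was inactivated. */
    synchronized void finish(long jobMemoryMb) {
        runningJobs--;
        usedMemoryMb -= jobMemoryMb;
    }

    synchronized void setActive(boolean active) { this.active = active; }

    synchronized int runningJobs() { return runningJobs; }
}
```

In this sketch, inactivating a node only blocks new assignments; jobs that are already running still call finish, which mirrors the "completes its current assignments" behavior described above.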

Load Unit: dwSavvy is designed for multi-tenancy: a single dwSavvy cluster instance with a single repository can support the data mart (data processing) infrastructure for multiple tenants (departments, customers, etc.). A load unit defines a single logical data mart. Repository objects are scoped to a load unit (see the Lu_id attribute under Metaunit), which helps keep metadata for different tenants separate. When configuring a load unit, it’s important to define its “allocated capacity” within the cluster. For instance, if you have 2 load units and you allocate 20 (%) to the first and 80 (%) to the second, the cluster makes sure the second gets 80% of the cluster capacity.
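The 20/80 example can be worked through as simple arithmetic. The sketch below assumes, purely for illustration, that cluster capacity is counted in parallel job slots and that shares are rounded down; the document does not say how dwSavvy measures or enforces capacity, and the class and load unit names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: split a cluster's job slots by each load unit's allocated percentage. */
final class LoadUnitShares {
    static Map<String, Integer> apportion(int totalSlots, Map<String, Integer> percentByLoadUnit) {
        int sum = 0;
        for (int percent : percentByLoadUnit.values()) sum += percent;
        if (sum > 100) throw new IllegalArgumentException("allocations exceed 100%");

        Map<String, Integer> slots = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : percentByLoadUnit.entrySet()) {
            // Integer division rounds down; an assumption, not documented behavior.
            slots.put(e.getKey(), totalSlots * e.getValue() / 100);
        }
        return slots;
    }
}
```

With 10 slots in the cluster and allocations of 20 and 80, this sketch gives the first load unit 2 slots and the second 8.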

Batch: Every job (the lowest unit of work) runs in the context of a batch. A batch denotes a logical grouping of jobs that need to run in unison to complete a load (daily, weekly, monthly, etc.). For instance, one might have 4 jobs that load 4 dimensions and 2 jobs that load 2 facts; these 6 jobs together would constitute a batch. A batch can be configured to start manually or on a schedule. A batch prevents typical production support mishaps from occurring. For instance, the development team typically never does production support and documentation is never sufficient, so the support team makes all possible errors in running loads: duplicate runs, skipped cleanup, not running all the required jobs, not running jobs in the right sequence, etc. In a sufficiently large data center there are typically hundreds of jobs that need to follow a strict regimen of sequence and timing, which makes support extremely complicated. Throw outsourcing into the mix, with several vendors (not) working together, and you have a nightmare.
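One way to picture how a batch guards against duplicate runs and incomplete loads is sketched below. This is only an illustration under assumptions: the document does not describe how dwSavvy implements these checks, and the class BatchSketch and its methods are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a batch that rejects duplicate runs and tracks completion. */
final class BatchSketch {
    private final Map<String, Boolean> finishedByJob = new LinkedHashMap<>();
    private boolean running = false;

    BatchSketch(List<String> jobNames) {
        for (String name : jobNames) finishedByJob.put(name, false);
    }

    /** Rejects a second start while a run is still in progress. */
    void start() {
        if (running) throw new IllegalStateException("batch is already running");
        running = true;
        finishedByJob.replaceAll((name, done) -> false);  // reset progress for the new run
    }

    /** Records a finished job; the run ends only when every job in the batch has finished. */
    void markFinished(String jobName) {
        if (!running) throw new IllegalStateException("batch is not running");
        if (!finishedByJob.containsKey(jobName)) {
            throw new IllegalArgumentException("unknown job: " + jobName);
        }
        finishedByJob.put(jobName, true);
        if (!finishedByJob.containsValue(false)) running = false;
    }

    boolean isRunning() { return running; }
}
```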

Metaunit: Any object registered in the metadata repository. The platform comes packaged with metaunit types such as dimension, fact, file, table, Clover graph, and script (any executable); new ones can be added as needed. Here are some of the supporting attributes:
• Mu_id: unique key.

• Lu_id: Load Unit identifier.

• Mu_type_id: Metaunit type — dimension, fact, file, table etc.

• Db_conn_id: Connection information for database tables

• Mu_name: Unique identifier

• Mu_desc: Description

• QA: Y or N.

QA is central to dwSavvy. Each data service (process) has an opportunity to perform QA on itself (health check). If you are writing your own data service, we recommend building QA. Even if the data service is designed to perform QA, it can be switched off using this attribute — for very large (or robust) objects one might want to disable QA.

• File_flag: If the metaunit is a file, there are special related attributes like File_path, format_fixed (y/n).

• Table_name: database table name — if the metaunit is related to a database table.

• Priority: Processing priority.

The dwSavvy cluster uses this attribute to prioritize processing among jobs that can run in parallel when capacity is limited.

• Active: Y or N.

Metaunits that have been inactivated don’t participate in the batch load. This makes it very convenient to skip all the jobs related to a metaunit.

• Profile: Y or N. Control to perform data profiling on any metaunit.
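The Active and Priority attributes above can be illustrated with a small selection routine. This is a sketch under assumptions: the class names are hypothetical, only a few of the attributes are modeled, and it assumes that a lower Priority number is processed first, which the document does not specify.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical in-memory view of a few metaunit attributes (Mu_id, Lu_id, Mu_name, Priority, Active). */
final class MetaunitSketch {
    final int muId;
    final int luId;
    final String muName;
    final int priority;
    final boolean active;

    MetaunitSketch(int muId, int luId, String muName, int priority, boolean active) {
        this.muId = muId;
        this.luId = luId;
        this.muName = muName;
        this.priority = priority;
        this.active = active;
    }
}

final class BatchCandidates {
    /** Drops inactive metaunits, then orders the rest by priority (lower number first: an assumption). */
    static List<MetaunitSketch> select(List<MetaunitSketch> all) {
        List<MetaunitSketch> selected = new ArrayList<>();
        for (MetaunitSketch m : all) {
            if (m.active) selected.add(m);
        }
        selected.sort(Comparator.comparingInt((MetaunitSketch m) -> m.priority));
        return selected;
    }
}
```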


Process (data service): A process is an ERC (Extreme Re-usable Component), or data service. Anyone can build a process by implementing a Java interface and registering it with the metadata repository. Why is this concept any different from what other ETL tools on the market provide? The dwSavvy platform is built from the ground up to create ETL (data services) that can run on any metadata. For instance, if you need to load a file into a database table with a typical ETL engine like Informatica, you would create a mapping and define the file and table metadata (object metadata) at design time, which works only as long as the file and table formats do not change. With dwSavvy you could build a process to do the same, except that it can seamlessly handle file and table format changes on the fly. A process is more like a template: although you connect a “metaunit” (file, table, etc.) to a “process” to define a job, the metadata binding happens not at design time but at run time. Imagine what this can do for your data warehouse environment. With other ETL platforms, if file formats change, a code change is required. Not with dwSavvy: all you need to do is update the metadata in the repository, and all the processes that depend on this metadata adapt automatically.
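The run-time binding idea can be sketched in a few lines of Java. The interface below is hypothetical: the document does not show the actual interface a dwSavvy process implements, so ProcessSketch, DelimitedLoader, and the comma-delimited record format are assumptions made only to illustrate that column metadata is supplied when the process runs rather than compiled into it.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical process contract; the real dwSavvy interface is not shown in this document. */
interface ProcessSketch {
    Map<String, String> run(List<String> columnNames, String inputRecord);
}

/** Maps a comma-delimited record onto whatever columns the metadata lists at run time. */
final class DelimitedLoader implements ProcessSketch {
    @Override
    public Map<String, String> run(List<String> columnNames, String inputRecord) {
        String[] values = inputRecord.split(",", -1);  // -1 keeps trailing empty fields
        if (values.length != columnNames.size()) {
            throw new IllegalArgumentException(
                "expected " + columnNames.size() + " fields but found " + values.length);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < values.length; i++) {
            row.put(columnNames.get(i), values[i]);
        }
        return row;
    }
}
```

In this sketch the same DelimitedLoader handles a two-column file and, after the column list in the metadata is updated, a three-column file, with no change to the process code.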

Job: A job is the lowest unit of work in dwSavvy. Jobs run only in the context of a batch. They bind the process with the required parameters and metaunit (see Process (data service)) at run time. Jobs can be scheduled, and dependencies can be set between them.
• Jobs first get “assigned” to a node.

• They start (“started”)

• Either fail or finish (“failed” or “finished”)

• If “failed”, they can be restarted by manually updating their status to “fixed”.

• Note their
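The statuses listed above can be modeled as a small state machine. This is a sketch, not dwSavvy’s implementation: the type names are hypothetical, and the transition from “fixed” back to “assigned” is an assumption about how a restart proceeds, since the document only says that a failed job is restarted by setting its status to “fixed”.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

/** Job statuses named in the text. */
enum JobStatus { ASSIGNED, STARTED, FAILED, FINISHED, FIXED }

/** Hypothetical sketch of the job lifecycle as a table of allowed status changes. */
final class JobLifecycle {
    private static final Map<JobStatus, EnumSet<JobStatus>> ALLOWED = new EnumMap<>(JobStatus.class);
    static {
        ALLOWED.put(JobStatus.ASSIGNED, EnumSet.of(JobStatus.STARTED));
        ALLOWED.put(JobStatus.STARTED, EnumSet.of(JobStatus.FAILED, JobStatus.FINISHED));
        ALLOWED.put(JobStatus.FAILED, EnumSet.of(JobStatus.FIXED));        // manual update
        ALLOWED.put(JobStatus.FIXED, EnumSet.of(JobStatus.ASSIGNED));      // assumption: restart re-enters here
        ALLOWED.put(JobStatus.FINISHED, EnumSet.noneOf(JobStatus.class));  // terminal
    }

    private JobStatus status = JobStatus.ASSIGNED;  // jobs first get "assigned" to a node

    JobStatus status() { return status; }

    /** Moves to the next status, rejecting any change the table does not allow. */
    void moveTo(JobStatus next) {
        if (!ALLOWED.get(status).contains(next)) {
            throw new IllegalStateException(status + " -> " + next + " is not allowed");
        }
        status = next;
    }
}
```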



Click here for PDF | HTML User Guides (ver 1.4) - or - to read more about savvyPlatform






email:  sales@dwsavvy.com  ,  services@dwsavvy.com  ,  support@dwsavvy.com