Standing Up a Cloud Data Intelligence Platform
16 Feb '21

Project Spotlight Part 2: Integrating Collibra and Informatica for Data Quality

In the second blog of this three-part series, we will be looking at how DTSQUARED helped our client integrate two data tools – Collibra Data Intelligence Platform and Informatica Data Quality.

Case study

DTSQUARED worked with a client to deliver their long-term data strategy, and one of the challenges to solve along the way was how to integrate the tool chosen to manage and govern data with the tool chosen to enforce data quality.

Collibra helped the organisation govern various aspects of their data: cataloguing assets, capturing data lineage, and giving data owners and stewards a means to manage their data. Data quality requirements were also defined and captured in Collibra, each relating to a business rule that might involve more than one application and usually more than one entity. While data quality rules can be defined in Collibra and the outcomes of running them viewed there, the tool itself does not enforce those rules in the source database.

Informatica Data Quality was chosen by the client to enforce the rules being defined by the business in Collibra. The results of data quality checks, scheduled to run at pre-determined intervals, surfaced in Collibra so the business knew almost immediately when issues occurred. Users were not only alerted to these updates; the data quality errors themselves were uploaded to a report so the business could review them and decide on action. The key business benefit was that users needed only one product, Collibra, to see the results of data quality checks implemented on another platform, Informatica.

The solution

An integration between Collibra and Informatica was required. One of the first steps in the process was to design the metadata model in Collibra that would be used to define the rules and hold the results. 

Collibra comes with out-of-the-box functionality to recognise asset types that allow these artifacts to be modelled. DTSQUARED worked with the client to ensure these asset types worked with the other parts of the metamodel being implemented.

For our use cases, reporting attributes were mapped to data elements (held in source database columns), which in turn had business rules associated with them to monitor quality. These business rules were subsequently developed into technical rules that performed the quality monitoring. Joining all of these up, from reporting attribute through to source database column, made it possible to visualise end-to-end report-to-source lineage and provide a view of the quality of the data used in each report.
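To make that chain concrete, the sketch below models it as plain data structures. All asset, table, and column names are illustrative assumptions, not the client's actual metamodel.

    # A minimal sketch of the metamodel chain used for report-to-source
    # lineage; requires Python 3.9+. Names are illustrative only.
    from dataclasses import dataclass, field

    @dataclass
    class TechnicalRule:          # implemented in Informatica Data Quality
        name: str

    @dataclass
    class BusinessRule:           # defined by the business in Collibra
        name: str
        technical_rules: list[TechnicalRule] = field(default_factory=list)

    @dataclass
    class DataElement:            # a source database column
        table: str
        column: str
        business_rules: list[BusinessRule] = field(default_factory=list)

    @dataclass
    class ReportingAttribute:     # a field on a business report
        name: str
        data_elements: list[DataElement] = field(default_factory=list)

    # Walking the chain end to end gives report-to-source lineage,
    # plus the quality rules guarding each hop.
    attr = ReportingAttribute(
        "Customer Country",
        [DataElement("CRM.CUSTOMER", "COUNTRY_CODE",
                     [BusinessRule("Country code must be a valid ISO code",
                                   [TechnicalRule("chk_country_iso")])])],
    )
    for element in attr.data_elements:
        for rule in element.business_rules:
            print(attr.name, "->", f"{element.table}.{element.column}", "->", rule.name)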

Several workflows were designed and developed by DTSQUARED to help the client put controls around the process of defining rules and getting them approved and developed. A custom approval workflow notified Informatica Data Quality developers when a rule was ready for development. To give the developer all the information needed to implement the rule, a custom extract was developed to hold the specifics, such as the tables and columns the rule was based on.
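The extract itself was essentially a structured rule specification. The example below shows the kind of shape such an extract might take; every field name and value here is an assumption for illustration, not the client's actual schema.

    # Illustrative shape of the rule-specification extract handed to an
    # Informatica developer; all fields are hypothetical.
    import json

    rule_spec = {
        "rule_id": "DQR-0042",
        "rule_name": "Customer email must be populated",
        "dimension": "Completeness",
        "source_system": "CRM",
        "tables": ["CRM.CUSTOMER"],
        "columns": ["EMAIL_ADDRESS"],
        "pass_condition": "EMAIL_ADDRESS IS NOT NULL AND EMAIL_ADDRESS <> ''",
        "schedule": "hourly",
        "approved_by": "data.steward@example.com",
    }

    print(json.dumps(rule_spec, indent=2))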

Once a data quality rule was implemented in Informatica, its results needed to be passed back to Collibra. This integration was done by developing a custom task in the workflow that called Collibra's API to log the most recent results of the data quality rule. However, this was not without issues: Collibra has a maximum history it can store, and running the rules several times a day would soon breach that limit.
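A minimal sketch of what such a callback might look like follows, assuming Collibra's REST API and a generic attributes endpoint; the exact endpoint path, attribute type identifier, and authentication details are assumptions and would differ in a real deployment.

    # Sketch of pushing a rule result back to Collibra over REST.
    # Endpoint and payload fields are assumed, not confirmed from the project.
    import requests

    COLLIBRA = "https://collibra.example.com/rest/2.0"   # hypothetical host

    def log_rule_result(session: requests.Session, asset_id: str, score: float) -> None:
        """Attach the latest data quality score to a rule asset in Collibra."""
        resp = session.post(
            f"{COLLIBRA}/attributes",            # assumed attributes endpoint
            json={
                "assetId": asset_id,             # the Data Quality Rule asset
                "typeId": "dq-score-attr-type",  # hypothetical attribute type ID
                "value": score,
            },
            timeout=30,
        )
        resp.raise_for_status()                  # authentication omitted for brevity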

To overcome this while still building a meaningful history of scores, the intra-day scores were replaced each day with the last one, reflecting an end-of-day position. Then, after each month end, the daily scores were replaced with the month-end view. With this mechanism, the quality dashboard provided the essential real-time view the business needed to take remedial action quickly, while also showing trends over a long period to demonstrate how quality was improving over time.
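A sketch of that compaction logic, with all implementation details assumed, might look like this:

    # Compaction sketch: keep only the last intra-day score per day, then
    # only the last daily score per month once the month has closed.
    from datetime import date, datetime

    def compact_to_daily(scores: dict[datetime, float]) -> dict[date, float]:
        """Keep the last score recorded each day (the end-of-day position)."""
        daily: dict[date, float] = {}
        for ts in sorted(scores):
            daily[ts.date()] = scores[ts]   # later timestamps overwrite earlier ones
        return daily

    def compact_to_monthly(daily: dict[date, float]) -> dict[tuple[int, int], float]:
        """After month end, keep only the last daily score of each month."""
        monthly: dict[tuple[int, int], float] = {}
        for day in sorted(daily):
            monthly[(day.year, day.month)] = daily[day]
        return monthly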

Also, with many rules being run concurrently, viewing the results of each data quality rule individually was not practical, so results needed to be rolled up to a level a user could understand. DTSQUARED helped the client develop aggregation rules to ensure metrics were accurately represented across each of the data quality dimensions, and traceable to a single data policy within the organisation. In some cases the rules being developed helped shape and update the policy, so both a top-down approach (policy first, rules second) and a bottom-up approach (rules first, policy second) were used. In time, the definitions converged, ensuring a well-rounded data policy was implemented by a broad set of data quality rules.
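As an illustration, rolling individual rule results up into per-dimension scores could look like the sketch below; the rule names and the simple averaging scheme are assumptions, not the client's actual aggregation rules.

    # Illustrative roll-up of rule pass rates into per-dimension scores.
    from collections import defaultdict
    from statistics import mean

    rule_results = [
        # (rule name, data quality dimension, pass rate)
        ("chk_email_populated", "Completeness", 0.97),
        ("chk_country_iso",     "Accuracy",     1.00),
        ("chk_dob_format",      "Accuracy",     0.97),
    ]

    def dimension_scores(results):
        by_dim = defaultdict(list)
        for _, dim, pass_rate in results:
            by_dim[dim].append(pass_rate)
        return {dim: mean(rates) for dim, rates in by_dim.items()}

    print(dimension_scores(rule_results))
    # {'Completeness': 0.97, 'Accuracy': 0.985}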

Another interesting observation was that reporting performance as a single measure – one overall data quality score – could be misinterpreted, and could even be misleading. The risk depended on how many rules were written under each dimension and whether the total score was calculated as an average of the dimensional scores. For example, suppose 50 of 52 data quality rules sit under the Accuracy dimension and all pass, while 1 of the 2 Completeness rules fails: that gives 100% in Accuracy but only 50% in Completeness, and an average-of-dimensions total of 75% even though 51 of 52 rules passed. There are many ways to configure data quality metrics in Collibra to avoid this, but it was a warning to make sure each number was understood before it was presented to influential business users.
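The arithmetic behind that example is worth seeing side by side. The sketch below compares an average-of-dimensions total with a rule-weighted one for the same 52 rules:

    # The worked example from the text: 50 passing Accuracy rules and
    # 2 Completeness rules with 1 failure. Aggregation choice changes the story.
    accuracy_rules = [1.0] * 50          # all 50 Accuracy rules pass
    completeness_rules = [1.0, 0.0]      # one of two Completeness rules fails

    acc = sum(accuracy_rules) / len(accuracy_rules)            # 100%
    comp = sum(completeness_rules) / len(completeness_rules)   # 50%

    avg_of_dimensions = (acc + comp) / 2                       # 75.0%
    all_rules = accuracy_rules + completeness_rules
    rule_weighted = sum(all_rules) / len(all_rules)            # ~98.1%

    print(f"average of dimensions: {avg_of_dimensions:.1%}")
    print(f"rule-weighted overall: {rule_weighted:.1%}")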

Handover 

Once the process to define data quality rules in Collibra was implemented, the rule creation workflows were in place, and the automated interface providing business users with results in Collibra was developed, DTSQUARED handed the solution over to the client.

All documentation for workflows, asset models, organisation structure, integrations, and processes was written with the aim of providing a transparent view of what we had developed, enabling our client to develop it further should they wish to do so. The source code for all the customised Collibra workflows was also passed to the client.

Post-handover, the work did not stop. Enhancements to the process are already in development, notably the integration of reference data between Collibra and Informatica. Because reference data existed in source systems but was mastered in Collibra, any data quality rules written in Informatica needed to be generic, keeping development time to a minimum.
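As an illustration of that generic approach, a single parameterised check could validate any column against any code list mastered in Collibra, so new reference sets need no new rule development; the sketch below is an assumption about the shape of such a rule, not the client's implementation.

    # A generic, parameterised reference data check (illustrative only).
    def check_against_code_list(values, code_list):
        """Return the share of values found in the mastered code list."""
        valid = sum(1 for v in values if v in code_list)
        return valid / len(values) if values else 1.0

    country_codes = {"GB", "US", "DE"}   # e.g. a code list mastered in Collibra
    print(check_against_code_list(["GB", "US", "XX"], country_codes))  # ~0.667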

DTSQUARED worked successfully with this client to integrate Collibra and Informatica Data Quality, achieving the goal of giving business users direct, up-to-date insights into data quality within the organisation in a single tool. This has enabled them to make quick decisions based on their data, with confidence, taking a big step forward in their journey to become a data-centric business.

Next week, in the final instalment of our Project Spotlight series, we will be sharing some of the key lessons learned on delivering value and solving business problems with Data Management Tooling.

You can sign up here to get the next blog straight to your inbox, or speak to one of our experts today.

    Want to talk to one of our team? Contact Us.

    https://www.linkedin.com/company/dt-squared
