Standing Up a Cloud Data Intelligence Platform
03 Feb '21

Project Spotlight Part 1: Standing Up a Cloud Data Intelligence Platform

In this three-part blog series, we look back over one of our recent client projects, in which we advised on and delivered the client’s multi-year data strategy. Sharing lessons learnt along the way, we will highlight findings, processes and successes that can be applied across other industries and businesses. In this first part, we look at how DTSQUARED helped the client get Collibra up and running in the Cloud, a key element of their strategy.

Cloud Basics

DTSQUARED were engaged to support the client’s strategic multi-year journey towards achieving their data ambitions. DTSQUARED has a long and successful track record of helping clients do exactly this, and were chosen for our ability to share best practice and bring tangible benefits to businesses. Part of this project was standing up Collibra, the cloud data intelligence platform. The cloud platform was the chosen solution, but the requirement to integrate on-premises systems meant the overall architecture was a hybrid cloud, highlighting the importance of flexibility and customisation in getting the best from any platform.

The first step along the journey was to provision environments. Collibra makes two available as a standard rollout: one for development and testing, and one for production. Each vanilla cloud environment consists of the application used by business users, and a console application used for administrative and deployment tasks. The client needed to ingest metadata from various sources hosted on-premises, so for each environment a Collibra jobserver was also set up on-premises. The jobserver acted as the secure gateway between the client’s on-premises data sources and the cloud-based Collibra application.

Authentication between the on-premises jobserver and the Collibra application in the cloud was set up as per Collibra’s guidelines. For authentication between the jobserver and the source databases, it was decided to use an account managed within the organisation’s Active Directory rather than a local database account. Collibra’s jobserver architecture allows this, and it meant credentials for on-premises databases never left the network and were never stored in the cloud.
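To illustrate the pattern, here is a minimal sketch of what directory-based authentication to a source database can look like, assuming a SQL Server source accessed via pyodbc. The server name, database and query are hypothetical placeholders, not the client’s actual setup.

```python
import pyodbc

# Connect using the Active Directory identity of the account running this
# process (for example, the jobserver's service account). No database
# credential is stored anywhere; server and database names are hypothetical.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=onprem-sql01.corp.example.com;"
    "DATABASE=CustomerData;"
    "Trusted_Connection=yes;"
)

# A trivial query to confirm connectivity through the integrated login.
cursor = conn.cursor()
cursor.execute("SELECT TOP 5 name FROM sys.tables")
for row in cursor.fetchall():
    print(row.name)
```

Because the login piggybacks on the service account running the process, no password ever needs to be written to configuration or sent to the cloud.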

Ownership of the application components was established, and processes for operational activities such as backups and disaster recovery were put in place. To match the cadence of Collibra’s cloud platform release schedule, key dates and timeframes for upgrade windows were published, giving users visibility of upcoming changes to the platform. The jobserver was only used when a data source refresh job ran, so jobserver upgrades were aligned both with Collibra cloud updates and with the data source refresh schedules.

The performance of the data ingestion jobs was scrutinised in preparation for ingesting metadata from much larger data sources in future. A performance tuning exercise resulted in a 60% decrease in run time for the largest refresh job, which meant the schedule of refreshes could be compacted into a smaller window aligned to the client’s working schedules.

Customisation

With the platform set up and the cloud components securely communicating with the components on-premises, Collibra was configured to meet the first use cases put forward by the client. We always recommend working initially in small, bite-size chunks, both to deliver benefits quickly and to support the up-skilling of staff in a focused way. We have learnt that trying to tackle complex issues straight away, without solid foundations in place, only leads to downstream problems. When setting out, we focus on developing the organisation structure, which lays the foundations for how many of the other operational processes will work. In this case, a federated model was chosen to give divisions within the company control over the data artifacts relating to their part of the business, while following the same set of standards, responsibilities and processes used throughout the organisation. Centrally controlled and hybrid models also exist, suited to the different needs of different clients.
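As a flavour of how such a structure could be scripted, the sketch below creates a parent community with one child community per division through Collibra’s REST Core API. The base URL, credentials and division names are hypothetical, and the endpoint paths and payload fields should be verified against your Collibra version.

```python
import requests

# Hypothetical Collibra instance and service account; verify the endpoint
# paths and payload fields against your version's REST Core API docs.
BASE = "https://yourcompany.collibra.com/rest/2.0"
session = requests.Session()
session.auth = ("svc-governance", "********")

# One parent community holds the shared standards, responsibilities and
# processes; each division gets a child community it controls.
parent = session.post(f"{BASE}/communities",
                      json={"name": "Enterprise Data Governance"})
parent_id = parent.json()["id"]

for division in ["Retail", "Commercial", "Operations"]:  # illustrative names
    session.post(f"{BASE}/communities",
                 json={"name": f"{division} Data Community",
                       "parentId": parent_id})
```

Nesting divisional communities under a shared parent is what lets each division manage its own artifacts while inheriting the organisation-wide standards.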

The asset model designed for the organisation was a subset of what Collibra offers out of the box; this was intentional, as the model was to be customised to fit its purpose exactly, and nothing more. Even where it is possible to model all data, we do not advise modelling everything unless it is clear the business would gain value from it.

The next part of customisation was pinning down the most important data elements to focus on: the critical data elements. Not too many elements should be defined as critical from the start, otherwise the benefit of putting each under tight controls is diminished. The definition of “critical” was therefore important to agree on. Early on, these rules should be flexible enough to allow change, but as maturity increases they become more and more set in stone. Simple filters on views of the metadata ingested by the jobserver also helped the organisation get a handle on the volume of data elements it was managing. Designating critical data elements does not mean everything else is left out of ingestion; the rest is still ingested but hidden from certain views, ensuring Collibra remained the full and true source of information for anyone who needed it. Over time, what is deemed critical can change, and because everything is ingested, classifications can be amended easily.
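The sketch below illustrates the “ingest everything, surface the critical” idea in plain Python: every ingested column is kept, but the default view filters down to an agreed register of critical elements. All schema, table and column names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ColumnAsset:
    schema: str
    table: str
    column: str

# Agreed register of critical data elements, kept deliberately small so
# each element can be placed under tight controls.
CRITICAL_REGISTER = {
    ("crm", "customers", "post_code"),
    ("crm", "customers", "date_of_birth"),
}

def is_critical(asset: ColumnAsset) -> bool:
    return (asset.schema, asset.table, asset.column) in CRITICAL_REGISTER

# Everything stays ingested; the default view is just a filter over it.
ingested = [
    ColumnAsset("crm", "customers", "post_code"),
    ColumnAsset("crm", "customers", "marketing_flag"),
    ColumnAsset("crm", "customers", "date_of_birth"),
]
default_view = [a for a in ingested if is_critical(a)]
print([a.column for a in default_view])  # ['post_code', 'date_of_birth']
```

Widening the register later reclassifies columns instantly, without re-ingesting anything, which is exactly why ingesting everything up front pays off.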

Results

Data profiling functionality in Collibra provided useful insight into the metadata being ingested, revealing outliers in certain fields, such as an alternative use for a field designed to store a post code. Dashboards were set up to provide the typical business user with the information they might need, with different sets of users seeing additional information based on their role; for example, a data steward wanted to see the latest rules being proposed, to prepare for the next data quality rules they needed to develop.
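As a flavour of the kind of check that surfaces such outliers, here is a minimal sketch that flags values in a post-code field that do not match the expected UK format; the sample values are illustrative.

```python
import re

# Simplified UK post code pattern, e.g. "SW1A 1AA" or "M1 1AE".
UK_POSTCODE = re.compile(r"^[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}$", re.IGNORECASE)

sample_values = ["SW1A 1AA", "EC2V 7HN", "see notes field", "N/A", "M1 1AE"]

# Values that fail the pattern are candidate outliers: either bad data or,
# as in this case, evidence the field is being used for something else.
outliers = [v for v in sample_values if not UK_POSTCODE.match(v.strip())]
print(outliers)  # ['see notes field', 'N/A']
```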

What stood out in terms of usefulness and flexibility were the data lineage views in Collibra. Technical lineage on its own was interesting, but it came to life when shown end-to-end and related to business data assets. A business user inspecting a report could see the data provenance of a reporting attribute, and lineage also provided a useful way of tracking what remained to be done before a report was covered by adequate data quality controls. It also had the unexpected consequence of challenging the data in reports, prompting people to really question whether it was needed. Too often we see reports created with a single purpose in mind; that purpose changes over time, and much of the data becomes unnecessary or irrelevant. Reducing the content of reports to just what is needed helps both the users and those creating the reports, and ensures people only have access to the data they need to do their job. This increased control not only assists with regulatory compliance but makes for a more efficient organisation.

Standing up a data intelligence platform in the cloud and connecting it to data sources secured on-premises is more than a technical exercise. It relies on partnership and trust, with DTSQUARED working closely with clients to understand which of Collibra’s strengths should be leveraged most to draw real business value from the platform.

Next week, we will be showcasing how we built on these foundations and helped the client integrate Collibra and Informatica for Data Quality. 

You can sign up here to get the next blog straight to your inbox, or speak to one of our experts today.

Want to talk to one of our team? Contact Us.

https://www.linkedin.com/company/dt-squared