DataOps Applied to Data Management

Angel Llosa
4 min read · Aug 31, 2021

Why DataOps?

Data is becoming increasingly important as companies move towards data-driven approaches, where data becomes the main tool for achieving business objectives. This generates a series of effects and needs around data that are difficult to cover with current approaches:

  • Growing volume of data: there are more and more data repositories. Tools such as CRMs, sandboxes for data scientists, big data platforms, etc. are added to the operational and informational data, making the unified management of all this data ever more complicated.
  • Data-driven company: moving towards such an approach makes it more important to manage data so that it is governed and of high quality, since strategic decisions will be based on it.
  • Data democratization: Access to data must be spread throughout the company, but always in a secure and controlled manner.

All these topics make managing and governing data and its applications increasingly complicated and resource-intensive. It is therefore necessary to move towards an approach such as DataOps, which seeks to govern the entire life cycle of data and its applications through automation and agility.

Traditional definition

The most accepted definition of DataOps comes from a 2015 article by Tamr. It says that DataOps applies to the entire data lifecycle, from data preparation to reporting, and recognizes the interconnected nature of the data team and information technology operations.

The main benefits this approach provides are:

  • Manage many sources of data, numerous data pipelines, and a wide variety of transformations.
  • Increase velocity, reliability, and quality of data analytics.
  • Reuse and reproduce work
  • Continuously show value to business customers
  • Reduce time to market

To achieve these benefits, it's necessary to apply agility and automation to all the processes involved in the data lifecycle, working on the automation of its two main areas:

  • Data Management
  • Data Applications

In this article, we will cover the Data Management topic.

Data Management

In the field of data management, there are several areas where it is possible to work with the objective of automating or streamlining the tasks involved in the processes. These areas are described below, together with a proposal for how to apply a DataOps approach to each:

Data Integration

There is a lot of effort dedicated to integrating data between the different repositories in a company. The most common use cases are:

  • Maintain data consistency between repositories (Transactional, Informational, MDM, etc.)
  • Self-service data acquisition for analytics and AI

For this reason, it's a recommended area to try to automate. There are several ways to do it. For example:

  • Automated workflow generation from metadata: it's possible to generate the workflows that integrate data between repositories from the metadata of the source (a minimal sketch follows below).
  • Augmented data integration: recommendations for new workflows and performance optimizations, based on applying AI to metadata.
  • Automated discovery of reusable artifacts and optimizations.
  • Pipeline as code: use code management tools to version and deploy workflows.
  • Recommendation assistants for workflow designers.

There are commercial and open-source tools that provide these capabilities, which can accelerate the development of new integration workflows and optimize their maintenance.
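
To make the first point above more concrete, here is a minimal sketch in Python of generating an integration workflow from source metadata. The table, columns, and generated SQL are hypothetical; a real implementation would read the metadata from a catalog and emit workflows for the orchestration tool in use.

```python
# Minimal sketch: generate an ingestion workflow from source metadata.
# The metadata below is hypothetical; a real setup would pull it from
# a metadata catalog instead of an inline dictionary.
source_metadata = {
    "table": "crm_customers",
    "columns": ["customer_id", "name", "email", "created_at"],
    "target": "analytics.dim_customer",
}

def generate_ingestion_sql(meta: dict) -> str:
    """Derive a simple load statement from the source metadata."""
    cols = ", ".join(meta["columns"])
    return (
        f"INSERT INTO {meta['target']} ({cols}) "
        f"SELECT {cols} FROM {meta['table']};"
    )

# The generated statement can be versioned in git ("pipeline as code")
# and deployed with the same CI/CD tooling as application code.
print(generate_ingestion_sql(source_metadata))
```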

Metadata Management

There is a huge volume of information in metadata. With modern tools, it's possible to extract intelligence from this metadata to identify and trace the data stored in all the enterprise repositories.

There are several types of metadata:

  • Functional metadata: Table descriptions, column descriptions…
  • Technical metadata: Table names, column names…
  • Operational metadata: ETL logs, DB logs…

With open-source or commercial inference tools (like Tamr), it's possible to derive from this metadata the traceability between repositories, and to know where the data is, how it is processed, and where to get it from.
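
As a toy illustration, the Python sketch below infers lineage from operational metadata. The log format is invented for the example; real ETL and database logs vary by tool, and products like Tamr apply much richer inference.

```python
# Toy sketch: infer table lineage from (invented) ETL log lines.
etl_logs = [
    "LOAD staging.customers FROM crm.customers",
    "LOAD analytics.dim_customer FROM staging.customers",
]

def build_lineage(logs):
    """Map each target table to the source it was loaded from."""
    lineage = {}
    for line in logs:
        target, source = line.removeprefix("LOAD ").split(" FROM ")
        lineage[target] = source
    return lineage

def trace_upstream(table, lineage):
    """Walk the lineage map back to the original source."""
    path = [table]
    while table in lineage:
        table = lineage[table]
        path.append(table)
    return path

lineage = build_lineage(etl_logs)
print(trace_upstream("analytics.dim_customer", lineage))
# ['analytics.dim_customer', 'staging.customers', 'crm.customers']
```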

Data Quality

Data Quality enables:

  • Cleansing and managing data, while making it available across your organization.
  • Reliable data for decision-making.

Among the tasks related to maintaining data quality, there are several use cases to automate, such as data assurance, profiling, standardization, cleansing, matching, or parsing.

As in the previous case, there are open-source tools that help automate this, with traditional approaches such as regular expressions and rules, or more advanced techniques such as machine learning.
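
As a small illustration of the rule-based approach, here is a Python sketch that profiles records against regular-expression rules. The fields, rules, and records are invented for the example; in practice the rules would come from configuration and run inside the data pipelines.

```python
import re

# Invented regex rules; real rule sets would be loaded from configuration.
RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "postal_code": re.compile(r"^\d{5}$"),
}

def profile(records):
    """Return the fraction of valid values per rule-covered field."""
    stats = {}
    for field, pattern in RULES.items():
        values = [r.get(field, "") for r in records]
        valid = sum(1 for v in values if pattern.match(v))
        stats[field] = valid / len(values) if values else 0.0
    return stats

records = [
    {"email": "ana@example.com", "postal_code": "28001"},
    {"email": "not-an-email", "postal_code": "2800"},
]
print(profile(records))  # {'email': 0.5, 'postal_code': 0.5}
```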

Enterprise tools such as Informatica or Talend are also working on these types of approaches.

Data Security

The importance of securing your data is well known, but with the increase in data repositories and their volume, and the new challenges of the cloud, there is a need to automate data security as much as possible.

There are open-source and commercial tools focused on data privacy management and on data encryption and masking, which automate the tasks related to these security concerns.
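
As an illustration, this Python sketch pseudonymizes a sensitive field with a salted hash. The salt and field names are assumptions made for the example; in practice the salt would live in a secrets manager and the masking rules would be driven by a privacy catalog.

```python
import hashlib

# Illustrative salt only; store a real salt in a secrets manager.
SALT = b"replace-with-a-managed-secret"

def pseudonymize(value: str) -> str:
    """Replace a sensitive value with a stable, irreversible token."""
    return hashlib.sha256(SALT + value.encode()).hexdigest()[:16]

row = {"customer_id": "C-1001", "email": "ana@example.com"}
masked = {**row, "email": pseudonymize(row["email"])}
print(masked)  # the email is now a 16-character hash token
```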

Data Democratization

All these approaches facilitate the democratization of data throughout a company, enabling access to it in a secure and controlled manner.

There are several lines of work in the field of how information is consumed. New technologies provide new ways to access data without the need for technical skills. For example:

  • NQL: Natural Query Language gives the possibility to query data easily, without technical capabilities, using natural language (a toy sketch follows this list).
  • Auto-analyze: a step further are the auto-analyze capabilities, which automatically generate analyses around a topic or a dataset.
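
To give a flavor of the NQL idea, here is a deliberately tiny Python sketch that maps one constrained question onto SQL. The grammar, table, and columns are invented; real natural-language query products rely on NLP models rather than keyword matching.

```python
# Toy sketch of natural-language querying: keyword matching onto SQL.
# Real NQL tools use NLP models; this handles only one question shape.
def to_sql(question: str) -> str:
    q = question.lower()
    if "total" in q and "sales" in q and "region" in q:
        return "SELECT region, SUM(amount) FROM sales GROUP BY region;"
    raise ValueError("question not covered by this toy grammar")

print(to_sql("What are the total sales by region?"))
# SELECT region, SUM(amount) FROM sales GROUP BY region;
```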

Conclusions

There are many areas in the data management cycle that can be automated. You can proceed step by step or try to address them all at once. In my opinion, it's better to begin with the pain points, define KPIs, and evaluate the ROI. This approach will give you an idea of the complexity and cost of implementing these types of measures, as well as a learning experience with a small investment.
