Migrate Your Data, Not Your Data Quality Issues

Data teams tackle data migration projects when they need to permanently transfer data from one storage system to another. The benefits of these projects are typically reduced costs that come with consolidating systems, and increased productivity, such as faster queries on unified data.

However, data migration projects come with risks. Historically speaking, almost 85% of data migration projects overran time and budget⁽¹⁾. More often than not, the root cause of these failures is a lack of investment in systems and processes that ensure reliable, high-quality data. When newly-migrated data appears to be inaccurate in a new storage system, end users are likely to distrust the data and go back to the old way of working.

In an effort to mitigate that outcome, this article discusses several data migration best practices that use Soda to help address data migration risks.

Decide on your approach

The first thing to think about is your approach. Are you going with an incremental approach, or a big bang? Do you plan on running both systems in parallel for a while and if so, for how long? To decide on an approach, it’s best to create a stakeholder map that helps you understand who is using the system, for what purpose, and when each team can invest the time to migrate.

Plan your data migration

What follows are some of the steps to consider adding to your next data migration project planning.

Catalog data to be migrated

Document schemas and data types; Understand the variety of data in the system
Document volumes; Understand the volume of data processed daily

Define scope, priority, and stakeholders

Stack-rank datasets; What’s most important to do first?
Identify technical stakeholders; Which data and infrastructure engineers need to be involved?
Identify domain experts (SMEs); Which business SMEs can you leverage?
Identify key data consumers; Which consumers will do acceptance testing?

Document legacy data quality issues; Understand the problems of the past

Evaluate and pick tools

CI/CD; Which tool will you use to store and release code?
Ingestion; Which tool will move data from source to target?
Warehousing; Which SQL engine will you use?
Transformation; Which tool will build derived models and views?
Cataloging; Which tool will document data and stakeholders?
Quality; Which tool will evaluate data quality and debug issues?

Design first iteration

Select datasets; Take into account other projects
Profile and add baseline checks; Iterate on quality checks with the stakeholders
Map source to target types; Convert to more specific types, where possible
Migrate data transformation logic; Rewrite the transformation logic in the new tool
Write schema and reconciliation checks; Write checks that compare source to target
Share data quality results; Share the results of data quality checks with stakeholders

Test user acceptance; Ask SMEs and consumers to test

Decommission; Proceed team by team, or use-case by use-case

Report on TCO; Analyze and share the results of the project

Manage data quality

If data quality is the most common root cause of data migration failure, then let’s unpack what can go wrong and why.

Data quality issue	Potential root causes	SodaCL Checks
Incomplete, missing records	Some data was left out during ingestion and transformation	Row count anomaly Count per categorical Missing values Valid values Metric reconciliation Record reconciliation
Incomplete, missing columns	A number of columns were excluded or were completely empty and automatically dropped	Schema
Duplicate rows	Data was mistakenly duplicated because no key was added to the join	Duplicate

‍

Data teams can measure the success of a data migration project by aggregating the check results. SodaCL reconciliation checks provide the most important measure of accuracy as they compare the source and target tables to identify differences at a metric and record level.

For example, use a metric reconciliation check to evaluate the percentage of missing values in a column, and a record reconciliation check to validate that employee name is the same in both the source and target tables. These checks make sure that the state of data in one system exactly matches the state in the other system. Sharing the result of these metrics with everyone increases transparency and trust in the migrated data.

You can implement Soda reconciliation checks as in the following SodaCL example which compares only the data specified in the list of columns, mapped according to order in which they are listed.

With more than six inconsistencies between data in the columns, the data quality check fails. The check result output in Soda Cloud identifies data quality reconciliation issues that you can address problems early, ideally in a Staging environment, so as to prevent migrating data quality issues into Production as you migrate the data itself.

‍Transition to the new source

Many teams are reluctant to change for a variety of reasons. Therefore, it’s very important to work closely with your stakeholders, like subject matter experts and consumers.

Make sure they plan time to test and migrate their processes to the new data source. Check in with them as they progress, and when they are finished, decommission the datasets in the old data source. Best practice dictates that you run both systems in parallel for a short while, and work iteratively towards complete transition. “Big bang” transitions seldom work out well.

To further motivate data teams to make the transition, you can also introduce new data management capabilities that further drive the adoption of the new data source. You can, for example, embark on a project to improve data cataloging, quality, security, retention, or access management. The easier the tools and processes, the more likely users will migrate.

Measure twice, cut once

This old adage holds up well in the digital world of data migration: measure data quality in Staging using SodaCL reconciliation checks before cutting over to Production. We could even add “An ounce of prevention is worth a pound of cure” or, “Garbage in, garbage out”. These proverbs prove that we’re all addressing the same kinds of challenges, and those of us who make the effort to prevent data quality issues early in a data migration project will fare much better.

Sign up for a Soda Cloud account, or install the latest version of Soda Library today to set your team up for success with reconciliation checks. See Soda documentation for details.

Migrate Your Data, Not Your Data Quality Issues

Decide on your approach

Plan your data migration

Manage data quality

‍Transition to the new source

Measure twice, cut once

Sign Up

Share

Migrate Your Data, Not Your Data Quality Issues

Sign Up

Stay Connected