Multi-System Data Synchronization

Posted by

Overview

A frequent design issue in programming is synchronizing data between two systems. This issue commonly pops up if you have internal applications that need the same data as a third party system, such as a marketing application. Depending on your architecture, you can potentially have your two systems be stemming from the same source of data, but often times you don’t have direct access.

Synchronizing data between multiple systems can be handled one of three ways, all dependent on the concept of who is the source of truth and the flow of the data. You can either synchronize system A to have what system B has (where system B is the source of truth), system B to system A (where system A is the source of truth), or a hybrid solution which involves updating both sides conditionally.

One Way Data Flow With Single Source of Truth (A to B or B to A)

The first two situations are the most straight forward, as you are only going in one direction. Where it can be confusing, however, is what that sync looks like. System A could be pushing data to System B, or System B could be pulling data from System A, and it would both be an “A to B sync”, and System A is the source of truth.

Actual synchronization can happen a lot of different ways. If System A is making an API post call to System B to update System B’s data, then that would be an A to B sync. However, if System B is making an API get call to System A, it would still be an A to B sync. In both of these scenarios, the data flow is in the same direction, but the facilitation of the communication is moving from one system to the other. From personal experience, this choice is usually not up for debate, especially when dealing with a third party company you are trying to integrate with.

In the database world, a secondary copy of a database might be needed for a wide array of things, that with Microsoft SQL Server, is called SQL Server Replication (with the secondary database being called a “replicated database”). Naturally, the primary SQL server would be the System A and secondary System B for an A to B sync. More info about replication and use cases can be found HERE.

SQL database replication is a great case study, because it can be handled in different ways. According to Microsoft Documentation about the types of replication, there are Transactional, Merge, and Snapshot replications, where the replication action is in effect a data synchronization. Taken directly from the docs:

Data changes are not tracked for snapshot replication; each time a snapshot is applied, it completely overwrites the existing data. Transactional replication tracks changes through the SQL Server transaction log, and merge replication tracks changes through triggers and metadata tables.

One or Two Way Data Flow With No Source of Truth (Hybrid)

Hybrid solutions are often harder to implement correctly. Usually, this scenario occurs when updates to the data can happen on either side, and you want both sides to get the most up to date data. An easy way to handle this is by having some way to determine what can be called a revision. Revisions can be determined multiple ways, such as a revision number, a status, or an actual last updated date and time. So for example, if system A has data that is on revision 4 (or last updated a week ago), and system B has data that is on revision 6 (or last updated yesterday), you would know that you have to update system A to match system B.

Part of the complexity of the hybrid solution stems from the fact that since data is flowing both ways, the communication between the two systems can be dictated by one or both systems. For example, if both systems have a way to receive an update, then both systems can send an update to the other system (System A to B and System B to A).

In the following example, the phone numbers for John Doe and Jane Doe are different between the two systems. In the first step, we see that Jane Doe has a more recent updated date in System A, and as such, update System B with the new phone number (555-3210). Then, in the second step, we see that John Doe has a more recent last updated date in System B, so we update System A with the new phone number (123-4567).

In the situation of using “Last Updated” time stamps, it is important that the date and time should be updated to the other systems update stamp, as simply updating the last updated date to be “now” would continually show the data as not being in sync. In this regard, using a version number might be easier to avoid confusion and/or potential bugs.

A great example of a hybrid synchronization is git version control, especially in the “Subversion-Style Workflow“. Communication between a developer and the shared repository is dictated by the developer, and changes can happen on both systems. Any developer can push data to the shared repository as well as pull data back from the shared repository, where changes (commits) are ordered by time. If a developer has changes locally when attempting to pull new data from the shared repository, git will attempt to merge any changes into the local code base, and anything that can’t be auto-merged is presented to the developer as a merge conflict to be manually fixed.

Workflow A

Conclusion

I originally started writing this blog post after fixing a cumbersome windows service (written by someone else) that was attempting to do a hybrid synchronization between our system and a third party system, but did not have any sort of versioning of data. After rewriting the service to instead have an A to B sync, with my data as the source of truth, the application became simpler, more logical, and actually remains in sync without issue. That’s not to say that a hybrid sync doesn’t make sense or work better in some situations, but thinking about how the data is flowing during the intended synchronization helped me improve a piece of software for the better.

Leave a Reply

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.