How Bigbasket Tackled Data Migration Problem

by midhunsukumaran

“Without data, you’re just another person with an opinion.” W. Edwards Deming, an American statistician

We all know the significance of data in the modern tech-savvy world, so I’m not going to talk about that here. Instead, I will address a related issue that we all deal with in our day-to-day work: having quality data available on our workstations.

Whether you are a developer, a test engineer, or an analyst, and whatever title you hold, if you work with data, you need that data on your workstation to make the work easier. If the data is stored in files on another system, you can transfer it over a network or on a disk.

What if it resides in a database?

Let’s say it resides in a relational database like MySQL. What would you do in such a scenario?

You would probably think about taking a database dump, transferring that file through one of the channels mentioned above, and restoring it into your database, right? But this dump-and-restore business comes at a cost. Here are a few of the problems:

  • Time consuming: Taking a dump of a 10 GB database and restoring it on your machine is going to take ages!
  • Resource constraints: Restoring a DB is a memory- and CPU-intensive operation. Your workstation may not always have resources to spare, and you may have to sit idle while the operation runs.
  • Loss of local data: You might have some valuable data in your local DB, and restoring the DB from a new dump will make you lose it.
  • All I asked for was a flower, but you gave me a garden!: This is the usual scenario. You already have a fully functional DB, and all you need from the other machine is 50 rows of data spread across tables. Restoring the whole DB just complicates things!

Just to make it more fun, let me complicate it a bit more. What if we are working on an application that has many pre-save and post-save calls around its CRUD operations, which store some data in secondary storage or compute derived data? Simply restoring the DB won’t help you reproduce such side operations. And the data transfer problem isn’t a one-time thing. It keeps haunting you, especially when you work with different people and different teams, all of which end up creating dependencies on other components and their data.

To tackle this problem, you need a smart system that fetches just the patch of data you need from the source DB, transfers it over a network, and loads it without losing your valuable local data, while routing all these operations through your application framework so that every pre-save and post-save operation you have defined is triggered. Moreover, all of this should happen quickly.

“Gain something else without losing something you have!” It sounds a bit greedy, but sometimes greediness leads you to a better, more creative solution.

This is a story with a similar plot: how Bigbasket tackled the data transfer problem using the native Django ORM framework, which led to the evolution of the Migration Forest data structure.

How does it work?

As mentioned earlier, two entities are involved in this process: the source server and the target server. We have developed a tool that lets us export selected pieces of data from the source server. The data is dumped into a JSON file, which is stored in cloud storage. On the target machine, we can choose a particular source machine and JSON data file and import it. I will get into the details with the following example.

Assume we have an app called “college” in our Django project. Its models are related as follows: a “Student” belongs to a “ClassRoom”, which belongs to a “Course”; a “ClassRoom” has many “Teacher(s)”; and each Teacher belongs to a Subject and a ClassRoom.

So for a Student to exist, it needs an entry in the ClassRoom table, because there is a foreign key constraint attached to it. Similarly, a foreign key constraint exists between ClassRoom and Course, and there is a one-to-many relationship between ClassRoom and Teacher. If you want to migrate a Student from a source DB to a target DB, either the target DB must already have the corresponding ClassRoom entry or we must migrate that entry too. In the same way, we need to migrate the corresponding Course entry if it doesn’t exist in the target DB. As all the models are linked in some way, we can migrate all the related pieces of data in a single shot.

Assume a scenario where we need to migrate one Student instance (call it S1) that belongs to a ClassRoom (CR1), which belongs to a Course (C1), and CR1 has Teachers T1 and T2.

The following cases are possible:

  • Target DB has C1
  • Target DB has CR1 (obviously C1 exists as there is a Foreign Key constraint on it)
  • Target DB has nothing

We need to migrate only S1 and CR1 in the first case, just S1 in the second case, and everything in the third case. This indicates that the person doing the migration should have the option to select what to migrate and what not to. So we came up with the UI design below, in which the user can select or unselect data at any level.
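The three cases can be worked through in a few lines of Python. This is only a sketch: the `PARENT` mapping and the `records_to_migrate` function are hypothetical names made up for illustration, the Teachers are left out for brevity, and the real tool lets the user override the selection by hand.

```python
# Hypothetical foreign-key chain for the example: S1 -> CR1 -> C1.
PARENT = {"S1": "CR1", "CR1": "C1", "C1": None}

def records_to_migrate(instance, target_has):
    """Walk up the foreign-key chain and stop at the first parent
    that the target DB already has."""
    to_migrate = []
    current = instance
    while current is not None and current not in target_has:
        to_migrate.append(current)
        current = PARENT[current]
    return to_migrate

case_1 = records_to_migrate("S1", {"C1"})         # target has C1
case_2 = records_to_migrate("S1", {"C1", "CR1"})  # target has CR1 (and so C1)
case_3 = records_to_migrate("S1", set())          # target has nothing

print(case_1)  # ['S1', 'CR1']
print(case_2)  # ['S1']
print(case_3)  # ['S1', 'CR1', 'C1']
```

The lists are ordered child first. On import they would have to be created in reverse, so that every parent exists before the row that points at it.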

Assume this is what the source server looks like:

This is what the export page looks like. As mentioned earlier, we can select or unselect data at every level.

On the other side, on our target machine, this is what the import page looks like. You can see a preview, which is a skeleton of the exported data.

After importing the data, this is what the target machine’s DB looks like:
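To summarize this section in code, the export and import round trip might look roughly like the sketch below. It is not the actual tool: plain dictionaries stand in for the source and target databases, `export_records` and `import_records` are hypothetical names, and the real tool goes through the Django ORM and cloud storage rather than an in-memory string.

```python
import json

def export_records(source_db, selected):
    """Serialize only the selected (table, pk) pairs into a JSON string."""
    payload = [
        {"table": table, "pk": pk, "fields": source_db[table][pk]}
        for table, pk in selected
    ]
    return json.dumps(payload)

def import_records(target_db, dump):
    """Insert or update each exported record.
    Rows that are not mentioned in the dump are left untouched."""
    for record in json.loads(dump):
        table = target_db.setdefault(record["table"], {})
        table[record["pk"]] = record["fields"]
    return target_db

source = {
    "student": {
        1: {"name": "S1", "class_room": 10},
        2: {"name": "S2", "class_room": 10},
    }
}
target = {"student": {99: {"name": "local-only", "class_room": 10}}}

dump = export_records(source, [("student", 1)])  # export just one student
import_records(target, dump)

print(sorted(target["student"]))  # [1, 99]
```

Only the selected row travels, and the local row with key 99 survives the import, which is the property a full dump and restore cannot give you.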

How are we doing it?

Here comes the Migration Forest data structure. Wondering why this name? We all know a forest is a collection of trees, and in the same way our MigrationForest is a collection of MigrationTree(s). A MigrationTree stores the information of a single selected instance of the root model.

In our earlier example, Student is the root model, and information retrieval starts from the root model. A MigrationTree has one root, which is an instance of MigrationNode. A MigrationNode stores all the information of a model instance.
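The containment described so far (a forest holds trees, and each tree holds one root node) can be pictured with a few dataclasses. The class names come from this post, but the fields (`model`, `pk`, `fields`, `trees`) and the `add` method are assumptions made for illustration, not the tool's real definitions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class MigrationNode:
    """All the information of one model instance (field names are assumed)."""
    model: str
    pk: Any
    fields: Dict[str, Any] = field(default_factory=dict)

@dataclass
class MigrationTree:
    """One selected instance of the root model."""
    root: MigrationNode

@dataclass
class MigrationForest:
    """A collection of MigrationTree objects, one per selected root instance."""
    trees: List[MigrationTree] = field(default_factory=list)

    def add(self, root: MigrationNode) -> None:
        self.trees.append(MigrationTree(root=root))

# Two selected Student instances give a forest with two trees.
forest = MigrationForest()
forest.add(MigrationNode("Student", 1, {"name": "S1"}))
forest.add(MigrationNode("Student", 2, {"name": "S2"}))

print(len(forest.trees))  # 2
```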

As mentioned, an attribute of a model instance can hold plain data such as an integer or a string, or it can be a foreign key (the parent of this node, ClassRoom in our example), a many-to-many relationship with another model, or a one-to-many (reverse many-to-one) relationship with its own children. We use an instance of ExistingNode to represent a foreign key relation when the user doesn’t opt to migrate a particular parent node. In that case, we assume the same entry exists on the target server. This reference is required because foreign key values can’t be null.
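A rough sketch of how an ExistingNode could stand in for an unselected parent follows. The class names are from this post; the `foreign_keys` field and the `build_parent` helper are hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Union

@dataclass
class ExistingNode:
    """Placeholder for a parent that is assumed to exist on the target already."""
    model: str
    pk: Any

@dataclass
class MigrationNode:
    """A model instance that will actually be migrated (fields are assumed)."""
    model: str
    pk: Any
    fields: Dict[str, Any] = field(default_factory=dict)
    foreign_keys: Dict[str, Union["MigrationNode", ExistingNode]] = field(
        default_factory=dict
    )

def build_parent(model, pk, fields, selected):
    """Return a full node when the user selected the parent for migration,
    and a lightweight reference to it otherwise."""
    if selected:
        return MigrationNode(model, pk, fields)
    return ExistingNode(model, pk)

# The user unselected CR1, so S1 only carries a reference to it.
class_room = build_parent("ClassRoom", 10, {"name": "CR1"}, selected=False)
student = MigrationNode("Student", 1, {"name": "S1"}, {"class_room": class_room})

print(type(student.foreign_keys["class_room"]).__name__)  # ExistingNode
```

Either way the Student ends up with a value for its foreign key, so the constraint is satisfied on import as long as the assumption about the target holds.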

We use a list of MigrationNode(s) to store the data of an M2M reverse relationship and a list of ChildTree(s) to store the data of any one-to-many relationship. A ChildTree contains a ReferNode, which refers to the parent node. This is used to avoid endless recursion: since the parent node may be created in the same call, the child node must not trigger the creation of the referred parent node again.
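The recursion problem is easier to see in code. In the sketch below, the class names come from this post, while the fields, the `import_node` function, and the dictionary standing in for the target DB are assumptions made for illustration. The real import goes through the Django ORM.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class ReferNode:
    """Points back at a parent that is created in the same import call."""
    model: str
    pk: Any

@dataclass
class MigrationNode:
    model: str
    pk: Any
    fields: Dict[str, Any] = field(default_factory=dict)
    foreign_keys: Dict[str, Any] = field(default_factory=dict)
    child_trees: List["ChildTree"] = field(default_factory=list)

@dataclass
class ChildTree:
    refer: ReferNode      # reference to the parent node
    root: MigrationNode   # the child instance itself

def import_node(node, db):
    """Create the node's parents, then the node, then its children."""
    row = dict(node.fields)
    for name, parent in node.foreign_keys.items():
        if isinstance(parent, MigrationNode):
            import_node(parent, db)  # a full parent node is created first
        # A ReferNode only contributes its key. The parent is not created
        # again, so the recursion stops here.
        row[name] = parent.pk
    db[(node.model, node.pk)] = row
    for child in node.child_trees:
        import_node(child.root, db)

class_room = MigrationNode("ClassRoom", 10, {"name": "CR1"})
for pk, name in [(1, "T1"), (2, "T2")]:
    refer = ReferNode("ClassRoom", 10)
    teacher = MigrationNode("Teacher", pk, {"name": name}, {"class_room": refer})
    class_room.child_trees.append(ChildTree(refer=refer, root=teacher))

db = {}
import_node(class_room, db)
print(len(db))  # 3
```

If each Teacher held the full ClassRoom node instead of a ReferNode, importing the Teacher would import the ClassRoom, which would import its Teachers again, and so on without end.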

How can you make use of it?

This will be open sourced soon! Keep in touch with us for further updates.

 

2 comments

SWAROOP November 15, 2018 - 1:36 pm

Is it published on GitHub or any other open source community?

midhunsukumaran February 11, 2019 - 1:31 pm

Soon we will publish it on our GitHub repository.
