Data migrations in Rails - Introduction and patterns
What are data migrations?
In Rails, data migrations specifically refer to operations that alter the existing data within the database, as opposed to schema migrations that change the database structure (like adding or removing tables, columns, or indexes). Data migrations might be necessary when you need to:
- Clean up or normalize data.
- Populate new columns with derived or default values.
- Migrate data from one column or table to another.
- Perform bulk updates to records.
- Convert data formats (e.g., changing date formats or encoding).
- Seed a database with initial data after a schema migration.
Data migrations are typically written as Ruby code inside a migration file, using Active Record methods to manipulate the data.
Here's an example of a data migration that updates the data within the users table using the Rails console:
$ rails c
User.where(email: nil).find_each do |user|
  user.update(email: generate_email_for(user))
end
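The snippet calls generate_email_for, which this article does not define. Here is a minimal sketch of what such a helper might look like, assuming a placeholder address derived from the record's id is acceptable. The address format, the reserved example.invalid domain, and the DemoUser struct standing in for the model are all assumptions for illustration:

```ruby
# Hypothetical helper: builds a placeholder email from the record's id.
# The format and the reserved "example.invalid" domain are assumptions.
def generate_email_for(user)
  "user-#{user.id}@example.invalid"
end

# Plain-Ruby stand-in for a User record so the sketch runs outside Rails.
DemoUser = Struct.new(:id, :email)

user = DemoUser.new(42, nil)
user.email = generate_email_for(user)
puts user.email
```

In a real application the helper would likely need to guarantee uniqueness and satisfy whatever validations the model defines.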
Data migration patterns in Rails
We can perform data migrations directly in the Rails console, although this is typically reserved for quick fixes, one-off updates, or work done during development and testing. It is not a good idea for complex data changes. We need to apply patterns to data migrations to handle errors as well as data changes.
There are many ways to run a data migration in Rails: seeds, standalone data scripts, raw SQL, changes embedded in a schema migration, and so on. In this article, I cover just three useful patterns that are easy to execute:
- Data migrations with db/migrate
- Data migrations with a Rake task
- Data migrations with the data_migrate gem
Data migrations with db/migrate
Placing data migrations in the db/migrate directory alongside schema migrations is a common practice in Rails applications. You can quickly take advantage of core Rails functionality and start modifying records across environments. But whether it is considered good or not depends on the context and the complexity of the data migration.
How to do it?
Let's start with creating a migration
$ rails generate migration ChangeUserEmail
invoke active_record
create db/migrate/20240216022840_change_user_email.rb
In that file, you can write a statement like this one:
class ChangeUserEmail < ActiveRecord::Migration[7.0]
  def up
    User.where(email: nil).find_each do |user|
      user.update(email: generate_email_for(user))
    end
  end
  def down
    # no-op
    # We cannot revert this migration because we cannot know the original email address
  end
end
Now you can run this migration to migrate your data:
$ rails db:migrate
== 20240216022840 ChangeUserEmail: migrating ==================================
== 20240216022840 ChangeUserEmail: migrated (0.0407s) =========================
It's done. You run your migration and then you forget about it. Later, you have to rename user to admin_user, so the User class no longer exists. The next time you need to set up a new development environment, you create your database, you run your migrations, and you get a failure in your console: Rails does not know what User is.
Pros
- Simplicity: For small and simple data changes, including data migrations in the db/migrate directory can be straightforward and convenient. You rely on core Rails logic that makes sure you don’t run a migration more than once.
- Version Control: Keeping data migrations with schema migrations ensures that they are version-controlled and applied in the correct order.
- Consistency: Running rails db:migrate will apply both schema and data migrations, keeping your database schema and data in sync.
Cons
- Coupling: Data migrations can become coupled with schema changes, which might not be ideal if you need to run them independently.
- Reusability: Data migrations in db/migrate are not easily reusable across different environments or databases.
- Performance: Large data migrations can be slow and resource-intensive, potentially causing downtime if run during a deployment.
- Maintenance: As the application evolves, the models and business logic used in old data migrations may change, potentially breaking these migrations if they need to be rerun.
Data migrations with a Rake task
If you have gotten bitten by the first approach before, you know there is a better way. You can write a bit more code and solve this problem using plain old rake tasks that follow a new set of conventions.
Temporary rake tasks allow us to decouple a deployment from completed migrations. They give us more control of the data manipulation process by encapsulating it in one place. The downside is that we need to remember to either add the rake task to our deployment script or run it manually after deployment. We will also need to clean up after ourselves and remove the temporary rake task once the changes have been deployed and applied.
How to do it?
Make a rake task file
# lib/tasks/temporary/users.rake
namespace :users do
  desc "generate an email for the user that didn't have email"
  task generate_email: :environment do
    users = User.where(email: nil)
    puts "Going to update #{users.count} users"
    ActiveRecord::Base.transaction do
      users.each do |user|
        user.update(email: generate_email_for(user))
      end
    end
    puts " All done now!"
  end
end
Check your task
$ rails -T
...
rails users:generate_email # generate an email for the user that didn't have email
...
Execute rake task to migrate data
$ rails users:generate_email
** Execute users:generate_email
Going to update 10 users
All done now!
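The task above loads every matching user and updates them inside a single transaction, which can be heavy on a large table. The idea of working through the records in fixed-size batches can be sketched in plain Ruby as follows. The Row struct, the each_in_batches helper, and the batch size are hypothetical stand-ins so the example runs without a database; in a Rails app, Active Record's find_each and in_batches serve this purpose.

```ruby
# Plain-Ruby stand-in for a database row (hypothetical, for illustration only).
Row = Struct.new(:id, :email)

# Yields the rows in slices of batch_size together with a 1-based batch number,
# so only one slice is handled at a time.
def each_in_batches(rows, batch_size:)
  rows.each_slice(batch_size).with_index(1) do |batch, number|
    yield batch, number
  end
end

rows = (1..10).map { |id| Row.new(id, nil) }
batch_sizes = []

each_in_batches(rows, batch_size: 4) do |batch, number|
  # Placeholder update; a real task would call the model's update here.
  batch.each { |row| row.email = "user-#{row.id}@example.invalid" }
  batch_sizes << batch.size
  puts "Batch #{number}: updated #{batch.size} rows"
end
```

Smaller batches keep memory use and lock duration bounded, at the cost of losing the all-or-nothing behaviour of one big transaction.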
Pros
- Isolation from Schema Migrations: Rake tasks allow you to separate data migrations from schema migrations, reducing the complexity of individual migrations and minimizing the risk of issues during deployment.
- Flexibility: Rake tasks can be written to perform complex data manipulation and can be run independently of the deployment process, giving you more control over when and how the data migration occurs.
- Reusability: A well-written Rake task can be reused across different environments or even in different projects with similar data structures.
- Testing: It's easier to test Rake tasks in isolation without affecting the database schema, which can be beneficial for complex data migrations.
- Performance: You can optimize Rake tasks for performance, such as by processing data in batches, which is important for large datasets.
- Version Control: Rake tasks can be version-controlled along with the rest of your application's code, providing a record of the data migration and the ability to revert to previous versions if necessary.
Cons
- Lack of Standardization: Unlike schema migrations, which have a standard structure and are automatically timestamped and ordered, Rake tasks for data migration can vary widely in style and organization, potentially leading to maintenance challenges.
- Manual Intervention: Rake tasks typically require manual intervention to run, which can be a drawback in automated deployment processes.
- Risk of Data Loss: If not carefully managed, Rake tasks can lead to data loss, especially if they are not thoroughly tested before being run in production.
- No Rollback Mechanism: Unlike schema migrations, Rake tasks do not have a built-in mechanism for rolling back changes. You would need to write a separate Rake task to undo the migration if necessary.
- Potential for Code Duplication: You may end up writing similar code in multiple Rake tasks, leading to duplication and the associated maintenance issues.
- Dependency on Application Environment: Rake tasks depend on the Rails application environment, which means they can be affected by changes in the application code or configuration.
Data migrations with the data_migrate gem
I know what you are thinking: “There has to be a gem for this!”
You are right! You could use data_migrate for all your data migrations. Just like Rails with the schema_migrations table, data_migrate uses a table called data_migrations to keep track of new and old migrations.
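The bookkeeping behind such a tracking table can be sketched in plain Ruby. This is a toy model, not the gem's implementation: the MigrationRunner class, its in-memory Set of applied versions, and the sample version numbers are all hypothetical, and only illustrate why a migration that has been recorded is not run a second time.

```ruby
require "set"

# Toy model of version tracking; NOT how data_migrate is implemented.
class MigrationRunner
  attr_reader :applied_versions

  def initialize
    # Stands in for the rows of a tracking table such as data_migrations.
    @applied_versions = Set.new
  end

  # migrations: hash of version string => callable.
  # Runs pending migrations in ascending version order, records each one,
  # and returns the versions that were run during this call.
  def migrate(migrations)
    pending = migrations.keys.sort.reject { |version| @applied_versions.include?(version) }
    pending.each do |version|
      migrations.fetch(version).call
      @applied_versions << version
    end
    pending
  end
end

log = []
migrations = {
  "20240216030000" => -> { log << "second" }, # made-up sample version
  "20240216022840" => -> { log << "first" }
}

runner = MigrationRunner.new
first_run = runner.migrate(migrations)
second_run = runner.migrate(migrations)
puts "first run:  #{first_run.inspect}"
puts "second run: #{second_run.inspect}"
```

The second call finds nothing pending, which is the property that makes it safe to run the migrate task on every deployment.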
How to do it?
Add the gem to your project, then run bundle install and you are ready to go.
# Gemfile
gem 'data_migrate'
Check the rake tasks for more detail:
$ rake -T data
rake data:abort_if_pending_migrations # Raises an error if there are pending data migrations
rake data:dump # Create a db/data_schema.rb file that stores the current data version
rake data:forward # Pushes the schema to the next version (specify steps w/ STEP=n)
rake data:migrate # Migrate data migrations (options: VERSION=x, VERBOSE=false)
rake data:migrate:down # Runs the "down" for a given migration VERSION
rake data:migrate:redo # Rollbacks the database one migration and re migrate up (options: STEP=x, VERSION=x)
rake data:migrate:status # Display status of data migrations
rake data:migrate:up # Runs the "up" for a given migration VERSION
rake data:rollback # Rolls the schema back to the previous version (specify steps w/ STEP=n)
rake data:schema:load # Load data_schema.rb file into the database without running the data migrations
rake data:version # Retrieves the current schema version number for data migrations
rake db:abort_if_pending_migrations:with_data # Raises an error if there are pending migrations or data migrations
rake db:forward:with_data # Pushes the schema to the next version (specify steps w/ STEP=n)
rake db:migrate:down:with_data # Runs the "down" for a given migration VERSION
rake db:migrate:redo:with_data # Rollbacks the database one migration and re migrate up (options: STEP=x, VERSION=x)
rake db:migrate:status:with_data # Display status of data and schema migrations
rake db:migrate:up:with_data # Runs the "up" for a given migration VERSION
rake db:migrate:with_data # Migrate the database data and schema (options: VERSION=x, VERBOSE=false)
rake db:rollback:with_data # Rolls the schema back to the previous version (specify steps w/ STEP=n)
rake db:schema:load:with_data # Load both schema.rb and data_schema.rb file into the database
rake db:structure:load:with_data # Load both structure.sql and data_schema.rb file into the database
rake db:version:with_data # Retrieves the current schema version numbers for data and schema migrations
You can generate a data migration as you would a schema migration:
$ rails g data_migration generate_email_for_users
That will add a new data migration to the db/data directory. You will need to define the up and down methods:
class GenerateEmailForUsers < ActiveRecord::Migration[5.1]
  def up
    User.where(email: nil).find_each do |user|
      user.update(email: generate_email_for(user))
    end
  end
  def down
    User.find_each do |user|
      # do something
    end
  end
end
To make sure that your data migrations don’t fall out of date, you can set up your build to run rake db:schema:load:with_data before your test suite.
Pros
- Separation of Concerns: Keeps data migrations separate from schema migrations, which can help organize and manage changes more effectively.
- Integration with Rails Migrations: The gem integrates with Rails migrations and extends the existing migration functionality, making it familiar and easy to use for Rails developers.
- Version Control: Data migrations are timestamped and version-controlled, similar to schema migrations, ensuring that they are applied in the correct order.
- Rollback Capability: Provides a way to write reversible data migrations, allowing you to roll back changes if needed.
- Tracking: Data migrations are tracked in the data_migrations table, which helps prevent the same migration from being run multiple times.
- Consistency Across Environments: Ensures that data migrations are run consistently across different environments, which is important for maintaining data integrity.
- Automated Deployment: Can be included as part of the automated deployment process, running data migrations with rake data:migrate or alongside schema migrations with rake db:migrate:with_data.
Cons
- Additional Dependency: Adds an extra gem to your project, which means one more dependency to keep updated and secure.
- Complexity: Introduces additional complexity to your migration system, which may not be necessary for simple or infrequent data migrations.
- Learning Curve: There is a slight learning curve associated with using the gem, especially for developers who are not familiar with it.
- Overhead: For small projects or projects with very few data migrations, the overhead of adding a gem might not be justified.
- Potential for Misuse: Developers might be tempted to use data migrations for changes that should be handled by other means, such as seed data or one-off scripts.
- Risk of Slower Deployments: If not managed properly, including data migrations in your deployment process can slow down deployments, especially if the migrations are large or complex.
What’s the best strategy?
- For small, infrequent data changes that are directly related to a schema migration, using the db/migrate directory is often sufficient and convenient.
- For one-off data migrations, especially those that are complex or need to be run outside of the deployment process, a Rake task is a good choice. It provides the flexibility to run the migration exactly when and how you want.
- For larger applications with regular data migrations, or when you want to keep a clear history of data changes separate from schema changes, the data_migrate gem can be very useful. It provides structure and integrates well with the existing Rails migration system.
Ultimately, the "best" strategy is the one that aligns with your project's needs, your team's workflow, and the specific requirements of the data migration at hand. It's also not uncommon for projects to use a combination of these strategies as appropriate for different situations.