Database CI/CD Basics (T-SQL Tuesday #177)

Page content

This month’s invitation is from Mala Mahadevan (b) asking us how we manage the database code. Since this is my passion project, I have a few basic tips to share.

I have yet to see a perfect implementation but even a partial one benefits you. My experience is mostly with enterprise-scaled environments - large servers, minimal downtime, brownfield development and also mostly single-tenant.

But these concepts and building blocks should be generic enough.

Throughout this article, I’ll use the word schema. It’s an overloaded term but in this case, I’m referring to a general “shape” of the database (object definitions with their results sets, columns, etc.) and not the schema returned by sys.schemas.

With that out of the way, everything can be summed up by this meme:

SRGrafo pitchfork meme use same principals for database code as you would for application code

Source of Truth ๐Ÿ”—︎

You can’t handle the source of Truth โ€” Me

We should take inspiration from application development because it’s very similar. Database code is also a code.

Mr Incredible meme “Math is math” but with code instead

It’s just a tiny bit harder because you have to persist the state, so you can’t just drop the DB and rebuild it from scratch for any change.

If you want automatic deployments, you have to store your scripts somewhere. The hint is in the name: source control should store source code.

And when I say source control I mean git. Even though no one understands git, you’ll soon enough learn the golden path scenario of about 5-6 commands. And if it doesn’t work, you’ll nuke the branch and start again until it does (it’s the only way to be sure).

You should strive to:

  • Move the source of Truth from the production database to the source code (no half-measures)
  • Create an empty environment just from your source-controlled scripts
    • This will get you one step away from developing on a shared environment and one step closer to isolation

How should I version my scripts? ๐Ÿ”—︎

There are two main schools of thought: state-based and migration-based (or a hybrid). In the end eve

State-based approach ๐Ÿ”—︎

  • You define the start and end state (e.g. script out the table as is and how it should look in the end)
  • Usually, a schema compare tool generates the migration for you (e.g. alter tableโ€ฆ add column)
  • Comparing different schemas (dev <-> test or dev <-> prod) will produce different migration scripts
  • State-based approach is great for collaboration
    • You can see the desired state
    • Since source code is usually stored as one script file per object, you can see git conflicts when two people try to edit at the same time
  • The migration scripts are only as good as the schema compare tool. It often struggles with renames, online deployments or anything beyond AdventureWorks-level complexity.

Migration-based approach ๐Ÿ”—︎

  • Mostly the opposite of the state-based approach
  • You write numbered migration scripts and apply them in order
  • You store a version table on the target and only apply the missing migrations
  • Better control over deployment script (manually created)
  • Unless you run the migration scripts against an environment, you don’t see the desired state
  • Conflicts are harder to detect as changes against the same object can be in different script files
  • Possible out-of-order merges and deploys

Hybrid approach ๐Ÿ”—︎

  • The best of both worlds
    • state-based approach for source control
    • migration-based approach for deployments
  • More moving parts mean it’s more brittle

In the end, it’s all migration scripts and you can switch back and forth between the approaches to some extent.

And the rest ๐Ÿ”—︎

There is much more to cover. From the top of my head:

  • git branch strategy
  • monorepo vs manyrepo
    • Store application code along with the database code vs separate storage
    • Store all databases in a single repo vs one repo per DB
    • Variations of the above
    • You have to pick your poison either scaling or dependency hell
  • Unit testing
  • Hotfixes and schema drift
  • Empty schema vs minimal dataset vs curated dataset vs generated data
  • Online deployments
  • Rollback vs rollforward
  • Deployment feature flags
  • Free tools, paid tools, homegrown tools
  • Monitoring and alerting

As you can see this is a pretty complex topic which doesn’t have a silver bullet answer yet and the tools and approaches are still evolving. I hope you found this crash course useful.

Thank you for reading.