How we dealt with evolving our data model in Redshift

  • How do we know which data model version each Redshift cluster has?
  • What about different developers applying data model changes to the same database?
  • How do we keep the data model in sync across different Redshift clusters and databases?
  • What about our ETL queries? What if a data model change breaks some of them?
  • NOTE: we had two AWS environments (QA & PRO) and two associated GitHub branches (dev & master).
What happens when we push a data model change to our GitHub repo?
  1. We started versioning the data model with a tool called Flyway, which we integrated into our CI/CD Jenkins job. Flyway has been popular for some years for versioning application data models, but it is far less common for maintaining a DWH. It is based on changesets, the minimum unit of change, and we run it every time a change is pushed (a minimal changeset sketch follows the repo layout below).
Figure: folder structure of our GitHub repo for the data model — Flyway changesets live in the incremental subfolder, absolute SQL scripts in the absolute subfolder.
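
As an illustration of what a changeset in the incremental folder looks like, here is a minimal sketch. The file name, schema, and table are hypothetical; the V<version>__<description>.sql naming convention and the schema history table Flyway maintains are standard Flyway behaviour.

```sql
-- incremental/V12__add_country_to_panelists.sql  (hypothetical changeset)
-- Flyway applies pending changesets in version order and records each one in
-- its schema history table, which is how every cluster "knows" which data
-- model version it is on.
ALTER TABLE analytics.panelists
    ADD COLUMN country VARCHAR(2);
```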
Process to copy data between environments and keep it up to date
  • A Jenkins job that periodically exports tables from Redshift PRO to an S3 bucket in the QA account (with the option to select specific tables or filter them with queries), as sketched below.
  • A Jenkins job that imports those tables from the S3 bucket into Redshift in QA (triggered by the previous one).
  • A Jenkins job that keeps the database copies in QA up to date with schema changes from PRO (triggered whenever a new data model version is pushed to master).
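
The export and import jobs essentially boil down to an UNLOAD/COPY pair like the sketch below. The bucket, IAM role ARNs, and table name are hypothetical, and cross-account access from the PRO cluster to the QA bucket is assumed to be configured already; UNLOAD and COPY both default to pipe-delimited text, so the two commands match without extra format options.

```sql
-- Export job (runs on Redshift PRO): dump a table, optionally filtered by a
-- query, to an S3 bucket in the QA account. Names and ARNs are hypothetical.
UNLOAD ('SELECT * FROM analytics.panelists WHERE signup_date >= ''2021-01-01''')
TO 's3://qa-data-sync/panelists/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-pro-unload'
GZIP
ALLOWOVERWRITE;

-- Import job (runs on Redshift QA): reload the exported files into the copy.
TRUNCATE analytics.panelists;
COPY analytics.panelists
FROM 's3://qa-data-sync/panelists/part_'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-qa-copy'
GZIP;
```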
