How did we deal with evolving our data model in Redshift
As in many other companies, when we started using Redshift at Netquest we didn’t think about how to deal with database changes. We just applied the needed changes (first on QA environment!) as we evolved the data model and just maintained the data model scripts (DDL & DML) up-to-date.
But after a few iterations we just realized this was not scalable and maintainable. Some of the problems we had were:
- How do we know which data model version has each Redshift cluster?
- What about different developers applying data model changes to the same database?
- How to keep in sync the data model in different Redshift clusters/databases?
- What about our ETL queries? What if we change the data model and this affects some of our queries?
In order to overcome all these problems, we re-designed the development & deployment cycle as follows:
- NOTE: We had 2 AWS environments (QA & PRO) and 2 associated github branches (dev & master)
- We started versioning the data model by using a tool called Flyway, which we integrated into our CI/CD Jenkins job. Flyway has become quite popular for versioning applications data model since some years ago, but not so much for maintaining a DWH. It is based on changesets, which are the minimum unit of change. We execute it each time a change is pushed.
2. Apart from the Flyway changesets, we also keep the absolute data model up-to-date. In order to validate it, we re-create an empty, temporary db with this data model (via another jenkins job) each time a new change is pushed, so that we can check if scripts are working.
3. ETL queries are also validated at each change using a flyway callback against an empty db. This is because with some queries in Redshift it may take some time to generate the query plan because it needs to materialize temp tables (see: https://docs.aws.amazon.com/redshift/latest/dg/r_EXPLAIN.html). We keep this empty db up-to-date with Flyway.
4. Data model scripts (both absolute & incremental) are stored in S3 so that they can be used from anywhere (any other jenkins job).
5. Whenever a changeset involves modifying a database object, we usually face with the problem of regenerating the associated dependencies of that object. In order to do so, we have developed a command-line utility to get all dependencies from a Redshift object (direct & transitive) and generate the necessary drops & creates in the correct order. This makes this process much more agile.
Very related to this is the need of copying data between different environments (mainly from PRO to QA). This is because we need real data in QA databases so that we can really validate deployments before going into production.
NOTE: when we started our corporate datawarehouse we decided not to incorporate PII data from source systems so that it would be much easier and secure for us. In general data/business analysts and other data consumers (for instance data scientists) do not need PII data for their analysis (they can do the same with our internal identifiers). This means that we can copy data from PRO into QA without any data masking.
In order to be able to copy data from Redshift in PRO environment to QA environment, we have:
- A Jenkins job to periodically export tables from Redshift PRO to S3 bucket in QA account (with the possibility to select specific tables or filtering with queries)
- A Jenkins job to import tables from S3 bucket to Redshift in QA (triggered from previous one).
- A Jenkins job to periodically update database copies in QA with schema updates from PRO (triggered whenever a new data model version is pushed to master).
With this configuration any developer can easily automate the copying of certain tables from PRO to QA and do not worry about data model updates.
We want to explore if tools like DBT or the usage of stored procedures can help us in evolving our current etl queries and Redshift views in a composable & reusable way.
Also we want to try out some data modelling tools like Aginity or DBSchema to see if they can help us in better documenting our data model and automatically generating the flyway changesets via version comparison.
More info: https://martinfowler.com/articles/evodb.html