Recently, I was working on some tricky, destructive database-manipulation code that we needed in CrankWheel’s Accordium product to automate GDPR data-erasure requests.
Because Accordium is an e-signature product, it’s not as simple as just deleting all data related to the user who requests that their personally identifiable information be removed. We still need to be able to attest to whether a given contract was signed using our system, should we be brought into a legal dispute over it.
For that reason, we delete some data outright, but much of it we instead redact: we strip the personal information and store a 14-byte portion of a SHA-256 hash of the original data. This truncated hash is enough to validate against documentation that parties to a contract may hold (they will have the finalized PDF of the contract document that we sent them), so we can, for example, confirm that it’s virtually guaranteed that a given name and email address belonged to parties to the contract according to our system. At the same time, it is short enough that a malicious party couldn’t reconstitute the original data with certainty, for example via a dictionary attack.
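As an illustration of the general idea (the input format and the redact helper here are my own sketch, not our production code): a 14-byte truncation of a SHA-256 hash is just the first 28 hex characters of the 64-character digest.

```shell
# Hypothetical sketch of truncated-hash redaction.
# 14 bytes = 28 hex characters of the SHA-256 digest.
redact() {
  printf '%s' "$1" | sha256sum | cut -c1-28
}

# The verifier recomputes the hash from data they already hold
# (e.g. a name and email from the finalized PDF) and compares.
redact 'Jane Doe <jane@example.com>'
```

Anyone holding the original data can recompute and compare the truncated hash, but 112 bits of a hash is far too little to recover the input from the hash alone.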
Anyhoo, while working on this I found myself needing to test against data generated by our system, and with all the audit data we create, it would have been an extreme amount of work to build it all up programmatically. So I ended up building a database of example data by going through the steps in our system, taking a snapshot of that database, and then reverting to that snapshot after running some tests.
My initial workflow for this was to use pg_dump and then build a new database by piping the dump file into psql (I could also have used pg_restore, but it’s basically the same thing). This takes a while, as the database is rebuilt from scratch, all indexes recalculated, and more.
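For reference, that slow dump-and-restore cycle looks roughly like this (database and file names are placeholders):

```shell
# Snapshot: dump the example database to a plain-SQL file.
pg_dump mydb > snapshot.sql

# Restore: drop and recreate the database, then replay the dump.
# Every table is reloaded and every index rebuilt, which is what
# makes this take tens of seconds or more.
dropdb mydb
createdb mydb
psql mydb < snapshot.sql
```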
I looked around for a faster tool and found Stellar, a Python tool that works well with PostgreSQL (and has some support for MySQL) and uses a PostgreSQL template database for each snapshot. This drops the restore time from tens of seconds, or minutes depending on how big your snapshot is, to a couple of seconds.
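The PostgreSQL feature this builds on is the ability to create a database from a template, which copies it at the storage level rather than replaying SQL. A hand-rolled version of the same trick (names are placeholders) would look something like:

```shell
# Take a "snapshot" by copying the working database into a new one.
psql -c "CREATE DATABASE mydb_snapshot TEMPLATE mydb;"

# Restore by dropping the working database and copying the snapshot
# back. Note: no connections may be open to either database while
# CREATE DATABASE ... TEMPLATE runs.
psql -c "DROP DATABASE mydb;"
psql -c "CREATE DATABASE mydb TEMPLATE mydb_snapshot;"
```

This skips the SQL replay and index rebuilds entirely, which is why it is so much faster than restoring a dump.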
Stellar hasn’t been updated since 2018, but it works well. I’m using Python 3 so instead of the installation instructions they provide, I used the following:
python3 -m pip install stellar
python3 -m pip install psycopg2
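After installation, day-to-day use is a handful of subcommands. As I understand the CLI from Stellar’s README (I haven’t exercised every subcommand), the loop looks like this, with the snapshot name being my own placeholder:

```shell
# Run in your project directory; init records which database to track.
stellar init

# Take a named snapshot of the current database state.
stellar snapshot before-gdpr-tests

# ...run the destructive erasure tests...

# Revert to the snapshot in a couple of seconds.
stellar restore before-gdpr-tests
```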
For my current workflow, Stellar has been a big help. Hopefully this short writeup helps somebody out there to find it!