My company is working to develop an approach for dealing with sensitive data in our reporting. Our contact at Looker asked that I share our situation here, as apparently Looker is generating more and more enterprise customers who are all rolling their own solutions to deal with issues around sensitive data, and any light we shed on our situation may help to inform your product roadmap.
We have personally identifying information in our DB (names/emails/address etc.), and we have highly sensitive data pertaining to that user as well. In our production environment the sensitive data remains encrypted in the DB. Historically, when we need to run reports on that data, we run the report from the app which is able to decrypt the data, aggregate, and generate the necessary reports.
Now that Looker is gaining traction, certain business users would love to build out some reports in looker against that sensitive data. As it stands, this isn’t possible as the data is encrypted. The feeling here is that data needs to be encrypted at rest in the DB, as in case of a breach where someone were to grab the entire DB, then at least the sensitive data would remain secure.
There are a number of different approaches and techniques we’re considering including:
1. DB-level encryption.
We are using MySQL, and are thinking about doing some ETL into Redshift though that is quite early goings. MySQL does offer encrypt/decrypt functions, and perhaps Redshift could make use of the pgpcrypt module for encryption though I’m not sure.
Using DB-level encryption, we could potentially have the DB encrypt the data, and then have a Looker dimension that uses the decrypt method. In order for this to work, looker would need to know the key to decrypt, and so there is a whole host of security issues there. Perhaps if only certain users could pass a parameter through on each request that could work, but we’d need to make sure these keys don’t show up in any logs etc.
2. Obfuscate identifying information from any analytics DB
In the ETL process, we could decrypt the sensitive data, which has analytical value, and obfuscate or remove any identifying information which, for the most part, is not useful in aggregate analytics. This is probably the most straight forward solution, and I imagine is what we will pursue in the short term. There are a couple of downsides to this approach, however.
Access in Looker to identifying information is lost. This could be an issue for certain departments that need to generate audit reports etc. detailing all of the users and their details.
If a malicious user were to gain access to the analytics database and the production database, then they could merge the two and have everything in cleartext.
3. Leave everything in plaintext in the DB, and restrict access per user
Apparently this is what Looker does with its internal instance. From what I understand, you use different database connections to the same database, where the users of the different connections have visibility on different tables. That allows, for instance, users logged in as HR or Finance to see tables with data that should not be visible to Marketing or Tech users.
I don’t think this would be an acceptable solution to us, as it seems that if the Looker admin were to be compromised then all data would be compromised. A breach of the analytics DB would also end up exposing everything in plaintext this way as well.
We are still brainstorming how we will approach this. If anyone has suggestions or references to best practices in this area we would love to hear them!