Handling encrypted data

  • 6 October 2015
  • 3 replies


My company is working to develop an approach for dealing with sensitive data in our reporting. Our contact at Looker asked that I share our situation here, as apparently Looker is attracting more and more enterprise customers who are all rolling their own solutions to the problems around sensitive data, and any light we shed on our situation may help inform your product roadmap.

We have personally identifying information in our DB (names/emails/addresses etc.), and we have highly sensitive data pertaining to those users as well. In our production environment the sensitive data remains encrypted in the DB. Historically, when we need to run reports on that data, we run the report from the app, which is able to decrypt the data, aggregate it, and generate the necessary reports.

Now that Looker is gaining traction, certain business users would love to build out some reports in Looker against that sensitive data. As it stands, this isn’t possible because the data is encrypted. The feeling here is that the data needs to be encrypted at rest in the DB so that, in the event of a breach where someone grabs the entire DB, the sensitive data at least remains secure.

There are a number of different approaches and techniques we’re considering, including:

1. DB-level encryption.

We are using MySQL, and are thinking about doing some ETL into Redshift, though that work is still in its early stages. MySQL does offer encrypt/decrypt functions, and perhaps Redshift could make use of the pgcrypto module for encryption, though I’m not sure.

Using DB-level encryption, we could potentially have the DB encrypt the data, and then have a Looker dimension that uses the decrypt function. For this to work, Looker would need to know the decryption key, which raises a whole host of security issues. Perhaps it could work if only certain users could pass the key through as a parameter on each request, but we’d need to make sure these keys don’t show up in any logs etc.
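As a minimal, runnable sketch of the decrypt-in-a-dimension idea (all table, column, and key names here are hypothetical): the example below uses SQLite with a toy XOR "cipher" registered as a user-defined function, standing in for MySQL's AES_DECRYPT so it runs without a database server. The point it illustrates is binding the key as a query parameter rather than splicing it into the SQL text, so the key never appears in logs that record only the statement:

```python
import sqlite3

# Toy stand-in for MySQL's AES_DECRYPT, so the example runs without a MySQL
# server. Do NOT use XOR as a real cipher; it is for illustration only.
def xor_cipher(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

conn = sqlite3.connect(":memory:")
conn.create_function(
    "toy_decrypt", 2,
    lambda blob, key: xor_cipher(blob, key.encode()).decode(),
)

key = "s3cret"  # hypothetical per-request key supplied by the user
conn.execute("CREATE TABLE users (name_enc BLOB)")
conn.execute("INSERT INTO users VALUES (?)", (xor_cipher(b"Alice", key.encode()),))

# The key is passed as a bound parameter, not spliced into the SQL string,
# so it never shows up in query logs that capture only the statement text.
row = conn.execute("SELECT toy_decrypt(name_enc, ?) FROM users", (key,)).fetchone()
print(row[0])  # prints "Alice"
```

In a real deployment the stand-in would be MySQL's AES_DECRYPT (or pgcrypto's pgp_sym_decrypt on Postgres), and the per-request key would need the same keep-out-of-logs care on the database side as well.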

2. Obfuscate identifying information from any analytics DB

In the ETL process, we could decrypt the sensitive data, which has analytical value, and obfuscate or remove any identifying information, which, for the most part, is not useful in aggregate analytics. This is probably the most straightforward solution, and I imagine it is what we will pursue in the short term. There are a couple of downsides to this approach, however.

Access in Looker to identifying information is lost. This could be an issue for certain departments that need to generate audit reports detailing all of the users and their details.

If a malicious user were to gain access to the analytics database and the production database, then they could merge the two and have everything in cleartext.
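A minimal sketch of the obfuscation step, assuming a Python-based ETL and hypothetical field names: replacing each identifier with a keyed HMAC means rows for the same user still group and join consistently, while someone holding only the analytics DB (without the ETL-side secret) cannot feasibly reverse or dictionary-attack the values:

```python
import hashlib
import hmac

# Assumption: this secret lives only in the ETL environment, never in the
# analytics DB, so analytics-DB access alone is not enough to de-anonymize.
ETL_SECRET = b"etl-only-secret"

def pseudonymize(value: str) -> str:
    """Replace an identifying value with a stable keyed hash."""
    return hmac.new(ETL_SECRET, value.encode(), hashlib.sha256).hexdigest()

# Hypothetical row passing through the ETL pipeline:
row = {"email": "jane@example.com", "purchases": 7}
row["email"] = pseudonymize(row["email"])  # now a 64-char hex token
```

The same input always maps to the same token, so aggregate analytics (counts, joins, group-bys per user) keep working; only the mapping back to a real person is lost without the secret.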

3. Leave everything in plaintext in the DB, and restrict access per user

Apparently this is what Looker does with its internal instance. From what I understand, you use different database connections to the same database, where the users of the different connections have visibility on different tables. That allows, for instance, users logged in as HR or Finance to see tables with data that should not be visible to Marketing or Tech users.

I don’t think this would be an acceptable solution for us, as it seems that if the Looker admin account were compromised, then all data would be compromised. A breach of the analytics DB would also expose everything in plaintext this way.

We are still brainstorming how we will approach this. If anyone has suggestions or references to best practices in this area we would love to hear them!

3 replies


In the AWS ecosystem, when you first create your Redshift data warehouse cluster, you are presented with a choice to encrypt your database.

KMS encryption will be transparent to your end users, but the contents of the disk will be encrypted (in case someone grabs the entire disk). Anyone connected to the DB, though, will still see unencrypted data, so they could potentially still dump the entire DB.

The best step here is actually not only encryption, but a private VPC with no outside access to Redshift. That way, if you are hosting Looker yourself within this VPC, Looker will still be able to access your data, but nothing on the outside can.

I’m sure similar setups are possible for other databases and clouds - not just Redshift.


Thanks @segahm. In our case we would like Looker to be available publicly for doing embeds, so the VPC probably isn’t the best option. What would be really cool down the road as a Looker feature would be to make use of something like MySQL’s decrypt functions or the Postgres pgcrypto module. That way data could remain encrypted inside the database, and you could require a Looker user to supply a decryption password after login to grant access to those fields. Perhaps you could mark the fields as encrypted in the LookML.
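The "supply a decryption password after login" flow could look something like the sketch below (purely hypothetical, not an existing Looker feature; the passphrase, salt, and iteration count are all placeholder values): derive a key from the supplied passphrase with PBKDF2 and compare it in constant time against a stored verifier before exposing any encrypted fields for that session.

```python
import hashlib
import hmac

# Hypothetical stored values: a per-user salt and a verifier derived from
# the user's real passphrase ("correct horse" stands in for it here).
STORED_SALT = b"per-user-salt"
STORED_VERIFIER = hashlib.pbkdf2_hmac("sha256", b"correct horse", STORED_SALT, 100_000)

def unlock(passphrase: str) -> bool:
    """Return True only if the passphrase re-derives the stored verifier."""
    candidate = hashlib.pbkdf2_hmac("sha256", passphrase.encode(), STORED_SALT, 100_000)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(candidate, STORED_VERIFIER)
```

Only after unlock() succeeds would the session's derived key be handed to something like AES_DECRYPT for the fields marked as encrypted, and it would be held in memory for the session rather than written anywhere.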

How can data read from a table be encrypted in Looker, at the field level and at the file level?