Your company's data stack

  • 20 January 2016
  • 9 replies

Userlevel 2

Just curious what other companies are using for their data stack and some of the pros/cons? My company is thinking about transitioning from what we currently use so wanted to see what others have tried.

9 replies

Userlevel 2

Production DBs: Postgres

Analytic DB: SQL Server

Analytic workbench: RStudio Server

Some of the issues include integrations with third-party vendors and data latency between the analytic db and the production db. It’s been challenging to setup every integration in SQL server. We’ve tried Segment/Fivetran but the way they ETL data isn’t always in the format or detail we’d like.

We’re exploring Redshift right now.

We use:

  • RDS - Postgres 9.4.5

  • We use postgres as our data warehouse. Our data volume isn’t large enough to warrant the need for a MPP solution such as Redshift, and we like functions that aren’t available in redshift (DISTINCT ON clause, jsonb, regex, materialized views, etc).

  • Salesforce

  • Custom batch script that runs on cron every 5 minutes that replicates our data using the SOAP SOQL Api. We manually generate the SOQL statements and load into Postgres.

  • Adwords

  • Same thing, custom batch script that queries the tables we need and loads the data. Makes sure to go back 60-90 days every day so that we don’t miss data that gets retroactively populated.

  • ETL Pipeline

  • Our production store is in mongo, so we have a pipeline which in realtime tails the Mongo oplogs and replicates the data to dw. We are working on version 2 where we were separate the producer and consumer and put a streaming queue (Kafka, or Kenesis) in the middle so we can better manage schema changes. This is the most intensive process, very similar to Alooma.

  • Series of Materialized Views

  • We have a series of materialized views that get run every 5 minutes to denormalize data and make it more efficient to load into looker.

we did a few posts on the asana eng blog about this topic -

overall, things are scaling reasonably well - although it’s very hard in our B2C2B “consumer-y” freemium model - we get millions of email addresses, and we have to lean heavily on automation, much of it inhouse. i think a key learning for us is that when you reach “big enough data” it is very hard to keep any third party data repositories - like email automation systems, or even salesforce - reliably in sync. tools like Looker that just connect to our first-party owned data are much more leveraged 🙂

Userlevel 2

Scott, I’m curious what data you pull from Google Calendar? In particular if this has to do with areas like CSMs and the onboarding process. .


Userlevel 2

Love this conversation 😃 I spent probably a week compiling all of what I could at the time find on this topic around the internet. The results are here. Asana’s infrastructure write-ups are definitely referenced, as are those at Braintree, Netflix, Spotify, Zulily, and more. I profiled a total of 11 companies.

Would definitely recommend folks check this out, and I’m happy to answer any questions from what I learned.

Userlevel 2

Thanks for sharing @jthandy!

Userlevel 4

Nice post from Donors Choose on their data stack here!

If you’re comparing data warehouses, check out this benchmark. Cost & performance comparison for Amazon Redshift, Google Bigquery, and Snowflake.

Our data stack looks like:

Application and Data: SQL Server: Amazon S3; Amazon Redshift; Node.js; Firebase.

Utilities: Google Analytics; Discourse; Dropbox; Shopify.

Business Tools: Jira; MailChimp; Zendesk; Salesforce Sales Cloud; Skyvia (for loading data to Redshift to analyze it with Looker)