Just curious what other companies are using for their data stack and some of the pros/cons? My company is thinking about transitioning from what we currently use so wanted to see what others have tried.
Production DBs: Postgres
Analytic DB: SQL Server
Analytic workbench: RStudio Server
Some of the issues include integrations with third-party vendors and data latency between the analytic db and the production db. It’s been challenging to setup every integration in SQL server. We’ve tried Segment/Fivetran but the way they ETL data isn’t always in the format or detail we’d like.
We’re exploring Redshift right now.
We use:
RDS - Postgres 9.4.5
We use postgres as our data warehouse. Our data volume isn’t large enough to warrant the need for a MPP solution such as Redshift, and we like functions that aren’t available in redshift (DISTINCT ON clause, jsonb, regex, materialized views, etc).
Salesforce
Custom batch script that runs on cron every 5 minutes that replicates our data using the SOAP SOQL Api. We manually generate the SOQL statements and load into Postgres.
Adwords
Same thing, custom batch script that queries the tables we need and loads the data. Makes sure to go back 60-90 days every day so that we don’t miss data that gets retroactively populated.
ETL Pipeline
Our production store is in mongo, so we have a pipeline which in realtime tails the Mongo oplogs and replicates the data to dw. We are working on version 2 where we were separate the producer and consumer and put a streaming queue (Kafka, or Kenesis) in the middle so we can better manage schema changes. This is the most intensive process, very similar to Alooma.
Series of Materialized Views
We have a series of materialized views that get run every 5 minutes to denormalize data and make it more efficient to load into looker.
we did a few posts on the asana eng blog about this topic -
https://blog.asana.com/2014/11/stable-accessible-data-infrastructure-startup/
https://blog.asana.com/2014/11/great-data-delivery/
overall, things are scaling reasonably well - although it’s very hard in our B2C2B “consumer-y” freemium model - we get millions of email addresses, and we have to lean heavily on automation, much of it inhouse. i think a key learning for us is that when you reach “big enough data” it is very hard to keep any third party data repositories - like email automation systems, or even salesforce - reliably in sync. tools like Looker that just connect to our first-party owned data are much more leveraged 🙂
Scott, I’m curious what data you pull from Google Calendar? In particular if this has to do with areas like CSMs and the onboarding process. .
Thanks!
Love this conversation 😃 I spent probably a week compiling all of what I could at the time find on this topic around the internet. The results are here. Asana’s infrastructure write-ups are definitely referenced, as are those at Braintree, Netflix, Spotify, Zulily, and more. I profiled a total of 11 companies.
Would definitely recommend folks check this out, and I’m happy to answer any questions from what I learned.
If you’re comparing data warehouses, check out this benchmark. Cost & performance comparison for Amazon Redshift, Google Bigquery, and Snowflake. https://blog.fivetran.com/warehouse-benchmark-dce9f4c529c1
Our data stack looks like:
Application and Data: SQL Server: Amazon S3; Amazon Redshift; Node.js; Firebase.
Utilities: Google Analytics; Discourse; Dropbox; Shopify.
Business Tools: Jira; MailChimp; Zendesk; Salesforce Sales Cloud; Skyvia (for loading data to Redshift to analyze it with Looker)
Reply
Enter your username or e-mail address. We'll send you an e-mail with instructions to reset your password.