Your company's data stack

  • 20 January 2016
  • 9 replies
  • 167 views

Userlevel 2

Just curious what other companies are using for their data stack and some of the pros/cons? My company is thinking about transitioning from what we currently use so wanted to see what others have tried.


9 replies

Userlevel 2

Production DBs: Postgres

Analytic DB: SQL Server

Analytic workbench: RStudio Server


Some of the issues include integrations with third-party vendors and data latency between the analytic db and the production db. It’s been challenging to setup every integration in SQL server. We’ve tried Segment/Fivetran but the way they ETL data isn’t always in the format or detail we’d like.


We’re exploring Redshift right now.

We use:




  • RDS - Postgres 9.4.5




  • We use postgres as our data warehouse. Our data volume isn’t large enough to warrant the need for a MPP solution such as Redshift, and we like functions that aren’t available in redshift (DISTINCT ON clause, jsonb, regex, materialized views, etc).




  • Salesforce




  • Custom batch script that runs on cron every 5 minutes that replicates our data using the SOAP SOQL Api. We manually generate the SOQL statements and load into Postgres.




  • Adwords




  • Same thing, custom batch script that queries the tables we need and loads the data. Makes sure to go back 60-90 days every day so that we don’t miss data that gets retroactively populated.




  • ETL Pipeline




  • Our production store is in mongo, so we have a pipeline which in realtime tails the Mongo oplogs and replicates the data to dw. We are working on version 2 where we were separate the producer and consumer and put a streaming queue (Kafka, or Kenesis) in the middle so we can better manage schema changes. This is the most intensive process, very similar to Alooma.




  • Series of Materialized Views




  • We have a series of materialized views that get run every 5 minutes to denormalize data and make it more efficient to load into looker.



we did a few posts on the asana eng blog about this topic -


https://blog.asana.com/2014/11/stable-accessible-data-infrastructure-startup/

https://blog.asana.com/2014/11/great-data-delivery/


overall, things are scaling reasonably well - although it’s very hard in our B2C2B “consumer-y” freemium model - we get millions of email addresses, and we have to lean heavily on automation, much of it inhouse. i think a key learning for us is that when you reach “big enough data” it is very hard to keep any third party data repositories - like email automation systems, or even salesforce - reliably in sync. tools like Looker that just connect to our first-party owned data are much more leveraged 🙂

Userlevel 2

Scott, I’m curious what data you pull from Google Calendar? In particular if this has to do with areas like CSMs and the onboarding process. .


Thanks!

Userlevel 2

Love this conversation 😃 I spent probably a week compiling all of what I could at the time find on this topic around the internet. The results are here. Asana’s infrastructure write-ups are definitely referenced, as are those at Braintree, Netflix, Spotify, Zulily, and more. I profiled a total of 11 companies.


Would definitely recommend folks check this out, and I’m happy to answer any questions from what I learned.

Userlevel 2

Thanks for sharing @jthandy!

Userlevel 4

Nice post from Donors Choose on their data stack here!

If you’re comparing data warehouses, check out this benchmark. Cost & performance comparison for Amazon Redshift, Google Bigquery, and Snowflake. https://blog.fivetran.com/warehouse-benchmark-dce9f4c529c1

Our data stack looks like:


Application and Data: SQL Server: Amazon S3; Amazon Redshift; Node.js; Firebase.


Utilities: Google Analytics; Discourse; Dropbox; Shopify.


Business Tools: Jira; MailChimp; Zendesk; Salesforce Sales Cloud; Skyvia (for loading data to Redshift to analyze it with Looker)

Reply