Your company's data stack

  • 20 January 2016
  • 9 replies
  • 167 views

Userlevel 2

Just curious what other companies are using for their data stack and some of the pros/cons? My company is thinking about transitioning from what we currently use so wanted to see what others have tried.


9 replies

Userlevel 2

Production DBs: Postgres


Analytic DB: SQL Server


Analytic workbench: RStudio Server



Some of the issues include integrations with third-party vendors and data latency between the analytic db and the production db. It’s been challenging to setup every integration in SQL server. We’ve tried Segment/Fivetran but the way they ETL data isn’t always in the format or detail we’d like.



We’re exploring Redshift right now.

We use:







  • RDS - Postgres 9.4.5







  • We use postgres as our data warehouse. Our data volume isn’t large enough to warrant the need for a MPP solution such as Redshift, and we like functions that aren’t available in redshift (DISTINCT ON clause, jsonb, regex, materialized views, etc).







  • Salesforce







  • Custom batch script that runs on cron every 5 minutes that replicates our data using the SOAP SOQL Api. We manually generate the SOQL statements and load into Postgres.







  • Adwords







  • Same thing, custom batch script that queries the tables we need and loads the data. Makes sure to go back 60-90 days every day so that we don’t miss data that gets retroactively populated.







  • ETL Pipeline







  • Our production store is in mongo, so we have a pipeline which in realtime tails the Mongo oplogs and replicates the data to dw. We are working on version 2 where we were separate the producer and consumer and put a streaming queue (Kafka, or Kenesis) in the middle so we can better manage schema changes. This is the most intensive process, very similar to Alooma.







  • Series of Materialized Views







  • We have a series of materialized views that get run every 5 minutes to denormalize data and make it more efficient to load into looker.





we did a few posts on the asana eng blog about this topic -



https://blog.asana.com/2014/11/stable-accessible-data-infrastructure-startup/


https://blog.asana.com/2014/11/great-data-delivery/



overall, things are scaling reasonably well - although it’s very hard in our B2C2B “consumer-y” freemium model - we get millions of email addresses, and we have to lean heavily on automation, much of it inhouse. i think a key learning for us is that when you reach “big enough data” it is very hard to keep any third party data repositories - like email automation systems, or even salesforce - reliably in sync. tools like Looker that just connect to our first-party owned data are much more leveraged 🙂

Userlevel 2


Scott, I’m curious what data you pull from Google Calendar? In particular if this has to do with areas like CSMs and the onboarding process. .



Thanks!

Userlevel 2

Love this conversation 😃 I spent probably a week compiling all of what I could at the time find on this topic around the internet. The results are here. Asana’s infrastructure write-ups are definitely referenced, as are those at Braintree, Netflix, Spotify, Zulily, and more. I profiled a total of 11 companies.



Would definitely recommend folks check this out, and I’m happy to answer any questions from what I learned.

Userlevel 2

Thanks for sharing @jthandy!

Userlevel 4

Nice post from Donors Choose on their data stack here!

If you’re comparing data warehouses, check out this benchmark. Cost & performance comparison for Amazon Redshift, Google Bigquery, and Snowflake. https://blog.fivetran.com/warehouse-benchmark-dce9f4c529c1

Our data stack looks like:



Application and Data: SQL Server: Amazon S3; Amazon Redshift; Node.js; Firebase.



Utilities: Google Analytics; Discourse; Dropbox; Shopify.



Business Tools: Jira; MailChimp; Zendesk; Salesforce Sales Cloud; Skyvia (for loading data to Redshift to analyze it with Looker)

Reply