ETL Tool Recommendations



I am thinking about getting a commercial ETL tool. We have a bunch of data sources (Facebook, Salesforce, NetSuite etc.). They are important and right now we have jobs scheduled in all kinds of places that are owned by multiple employees. In an effort to operationalize this, I am looking at some ETL tools.


Do you guys have any recommendations? Does anyone know of “the Looker of ETL tools?”


Thanks,

Lucas


38 replies

We’ve had very good experiences with DataVirtuality (datavirtuality.com), which in your case has pre-built connectors to Facebook, Salesforce, and many others. They now also offer a native Looker connector (still a bit beta, though).

If you want it to go into PostgreSQL, MySQL, or Redshift, check out Fivetran. They’re a hot new player in the space. I’ve been really impressed with their speed and effectiveness.


Another name that comes to mind is Xplenty, although I know less about them.


I’ve also heard good things about Datavirtuality.

Alooma is also a great option; check them out: https://www.alooma.io/

I’ve had a great experience with Fivetran. They pair well with Looker because they focus primarily on the E and L steps of ETL, leaving the T step to Looker, which makes it easy with PDTs (persistent derived tables).

@weitzenfeld: Once you got set up, were there any issues or failures that required your attention?

@gruen: nope, it has been very smooth sailing.

@kevin or @weitzenfeld - have you seen any options similar to Fivetran that work with Oracle on AWS RDS, moving data to Redshift? I was looking at Attunity but am curious if you know of other options.

@zhill I haven’t - I imagine one of the bigger players like Informatica would work. FWIW - Fivetran is going to support Oracle in 3-6ish months, if you can wait that long.

@zhill Xplenty might support that, although their website doesn’t seem to suggest anything. And as @weitzenfeld mentioned, Informatica (Cloud) might also be a viable option.

We are pretty dependent on RDS Oracle instances, so we decided to build our own tool to pull out the data. We are in the testing phase of our ETL app and it’s looking promising.

RJMetrics has launched a new product, RJMetrics Pipeline, into open beta, and we are partnering with Looker.


It is a self-serve product, and you can set up your first sync in 5 minutes. The first 5 million rows/events per month are free, forever.


We would love it if you tried it out and gave us your thoughts.


Thanks,

Nouras


@kevin or @weitzenfeld Are you all still in the “I would recommend Fivetran” camp? We’ve tested them out for some of our other tools, such as HubSpot, and it seems okay (some tables missing / partial loads), but we’d like faster refresh speeds.


Another player is Segment, which has a Salesforce pipeline in beta with intentions to add many more integrations soon. We’re testing them out, so I can’t speak to functionality yet.


What integrations does Segment allow you to ETL into Redshift, @Bridge?

I’m in the same boat - In addition to the ones mentioned above we’re looking at:



Alooma.com looks the coolest so far by a long shot, but they are very new and quite expensive.

Yes, I would still recommend Fivetran, but I haven’t tried any of the (growing number of) alternatives.

Our most time-sensitive integration is with MySQL, and Fivetran refreshes it every 10-15 minutes. For some integrations, the limiting factor is the third party - e.g. I wouldn’t mind if Salesforce synced more frequently, but SFDC limits your daily API hits, so it’s not really an option.


Just to chime in, when we compared Segment to FiveTran for Salesforce, the amount of data (based on # tables) available was far more comprehensive in FiveTran than Segment.

If anyone needs scripts for replicating Salesforce to Postgres (easily transferable to other environments), let me know. They aren’t the best-written code, but they do the job.
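For anyone curious what the core of such a script looks like, here is a minimal sketch of one step: turning the records a Salesforce REST query returns into a parameterized Postgres INSERT. The table name is illustrative, and a real script would execute the statement with a driver such as psycopg2 against live query results.

```python
def records_to_insert(table, records):
    """Turn Salesforce REST query records into one parameterized
    INSERT statement plus row tuples. Salesforce responses carry an
    'attributes' key per record that must be stripped first."""
    cleaned = [{k: v for k, v in r.items() if k != "attributes"} for r in records]
    columns = sorted(cleaned[0])
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    rows = [tuple(r[c] for c in columns) for r in cleaned]
    return sql, rows

# Example with the shape of a raw Salesforce query response record:
sql, rows = records_to_insert(
    "sf_account",
    [{"attributes": {"type": "Account"}, "Id": "001", "Name": "Acme"}],
)
```

The statement and rows would then go to `cursor.executemany(sql, rows)`; the hard parts this sketch skips are pagination, incremental syncs, and schema drift.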


A lot of names are thrown around here, so I feel this thread needs a bit more structure.


First of all, ETL solutions can roughly be categorized as:



  • connectors

  • job workflow management


Most of what has been discussed above refers to connectors. In the job-workflow category there are a lot more free, open-source solutions - these are the ones that focus on scheduling, dynamic interdependencies between data flows, and complex streaming requirements. Some paid solutions offer both job flows and connectors. And, of course, behind most connectors there is some implementation of job workflow management - it is just not exposed to the end user.
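To make the job-workflow side concrete: at its core, workflow management means resolving the dependency graph between jobs into a valid run order. A minimal sketch using Python’s standard library (the job names are illustrative):

```python
from graphlib import TopologicalSorter

# Illustrative job graph: each job maps to the set of jobs it depends on.
jobs = {
    "load_salesforce": set(),
    "load_netsuite": set(),
    "build_revenue_model": {"load_salesforce", "load_netsuite"},
    "refresh_dashboards": {"build_revenue_model"},
}

def run_order(dependency_graph):
    """Return jobs in an order that respects every dependency."""
    return list(TopologicalSorter(dependency_graph).static_order())

order = run_order(jobs)
```

Real schedulers add retries, alerting, and parallelism on top, but the dependency resolution is the piece that ad-hoc cron jobs scattered across a team never get.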


Second, the devil is in the details. It’s typically not enough to just have the data appear in your data warehouse initially. You want to test a number of scenarios. So learn about the solution’s ability to handle:



  • updates

  • streaming data

  • schema changes

  • datatype changes

  • lost records (just because you don’t see them does not mean they don’t happen)

  • etc
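A quick way to exercise the schema-change and datatype-change cases above is to diff the column/type map you expect against what the source currently reports, before loading. A minimal sketch (the column names and types are illustrative):

```python
def diff_schema(expected, observed):
    """Compare two {column: type} mappings and report drift, so a
    sync job can decide whether to adapt or alert a human."""
    added = {c: t for c, t in observed.items() if c not in expected}
    dropped = {c: t for c, t in expected.items() if c not in observed}
    retyped = {
        c: (expected[c], observed[c])
        for c in expected.keys() & observed.keys()
        if expected[c] != observed[c]
    }
    return {"added": added, "dropped": dropped, "retyped": retyped}

# Example: a column added, one dropped, and one retyped upstream.
drift = diff_schema(
    {"id": "int", "amount": "numeric", "note": "text"},
    {"id": "int", "amount": "text", "created_at": "timestamp"},
)
```

How a given vendor reacts to each kind of drift (auto-migrate, quarantine, or silently drop) is exactly the detail worth testing before you commit.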


As a Looker employee, it would be unethical for me to recommend any one solution. That said, this is a very popular question, so I have previously led some research internally to understand the landscape. The criteria that we identified and that might be useful to others were:



  • Time to “Ready Data”

  • Data Source Diversity

  • Transparency / Monitoring Capability

  • Support for Redshift features or other Data Warehouses (BigQuery, Vertica)

  • Affordability

  • Simplicity


Finally, in some cases you might have requirements such as HIPAA compliance. There you will likely be limiting your cloud vendors to only those who host a separate instance for you (just as you do with Looker).


Of course, I am happy to provide guidance in individual cases where I understand the data. So @brad @Bridge @Lucas, PM me if you have further questions specific to your datasets.

Most of the tools listed here only support Redshift as the destination warehouse.


We’re a Google customer - can anyone recommend a vendor that supports BigQuery as the warehouse?


I’m sourcing from MySQL and am interested in “snapshotting” certain data in daily intervals.
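One common way to do that kind of daily snapshotting, regardless of which vendor moves the data, is a scheduled job that copies the table into a date-stamped twin each day. A minimal sketch (the table-naming convention is illustrative):

```python
from datetime import date

def snapshot_statement(source_table, on=None):
    """Build a CREATE TABLE ... AS SELECT that freezes a table's
    current rows into a date-stamped snapshot table
    (e.g. orders_snap_20170301)."""
    day = on or date.today()
    target = f"{source_table}_snap_{day:%Y%m%d}"
    return f"CREATE TABLE {target} AS SELECT * FROM {source_table}"

stmt = snapshot_statement("orders", on=date(2017, 3, 1))
```

A scheduler runs this once per day against the warehouse; downstream queries can then compare snapshots to see how the data looked at any point in time.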


Hey @aaron_bostick, good to see you.


I’ve been spending lots of time with BigQuery and it is pretty awesome. Here is a simple script to move data from MySQL to BigQuery.


https://gist.github.com/shantanuo/4466298


Data loading in BigQuery is pretty easy and straightforward.


Loading Data into BigQuery

Hey guys! @segahm is right when he says that there’s a lot more to consider when it comes to choosing an ETL solution.


Some other thoughts: you also want to make sure that you find a tool with some flexibility. A big problem with many of the options available today is that the user can’t customize their data delivery schedule, extract specific relevant data, or connect all of their sources.


In other words, they can’t choose what data they want to move, where they want it to go or how they want it to look and this can be a really frustrating limitation.


@zhill @kevin, Xplenty lets you easily integrate, transform, and process data from more than 100 data stores and SaaS applications. Their platform is also very straightforward and user-friendly - no code and no developers. www.xplenty.com


Appreciate the kind words in this thread about Fivetran (of which I am CEO). I just want to chime in to say that we do now support Oracle as a source and BigQuery as a destination, to agree with @segahm’s general formulation, and to share a couple of general observations we’ve made after seeing many companies’ projects:


First, the key questions you should ask about any ETL connector are:



  • What happens when a column is added in the source?

  • What happens when the TYPE of a column is changed in the source?

  • What happens when a row is deleted in the source?
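The deleted-row question is the one homegrown scripts most often get wrong, because a vanished row produces no event to sync. One common approach, sketched below, is to compare key sets and soft-delete the difference (the table name, key column, and `_deleted` flag column are all illustrative):

```python
def find_deleted_keys(source_keys, destination_keys):
    """Rows present in the warehouse but gone from the source were
    deleted upstream; a connector can hard-delete them or (safer)
    flag them so history is preserved."""
    return sorted(set(destination_keys) - set(source_keys))

def soft_delete_statement(table, key_column, deleted_keys):
    """Build a parameterized UPDATE marking deleted rows."""
    placeholders = ", ".join(["%s"] * len(deleted_keys))
    sql = f"UPDATE {table} SET _deleted = TRUE WHERE {key_column} IN ({placeholders})"
    return sql, deleted_keys

gone = find_deleted_keys(source_keys=[1, 2, 4], destination_keys=[1, 2, 3, 4])
sql, params = soft_delete_statement("sf_account", "id", gone)
```

The full-key comparison is expensive on large tables, which is why vendors vary in how (and how often) they detect deletes - another detail worth asking about.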


Second, nearly everyone should use SOME commercial ETL tool rather than build it themselves. It’s not that hard to write some scripts that sync your data; the hard part is keeping up with schema changes, and the REALLY hard part is dealing with turnover on your data team. For whatever reason, data engineering teams experience a lot of turnover. In my experience, there are only two scenarios where it CAN make sense to write your own ETL:



  1. Your source schema is extremely stable so you rarely have to update your scripts.

  2. ETL is part of your core business and you have a dedicated team that will maintain and monitor your ETL.


Lastly, as has been pointed out, Fivetran does not support transformation, because we believe it to be an anti-feature: in the long run, it’s a lot easier to maintain Looker derived tables. But not everyone agrees, and this is an important difference between the no-transformation vendors (us, Stitch) and our pro-transformation friends (Alooma, ETLeap, Boomi, …).

For Redshift, Matillion has been pretty sweet . . .

This is a great discussion. Here’s one more aspect that may be helpful as you’re evaluating options:


The trend we’re seeing at Stitch (where I’m co-founder & CEO) is that, in addition to popular data sources like MySQL and Salesforce, more and more people have a few data sources that are much less common. There’s been an explosion in the number of tools that companies use to run their business, with a 2017 survey of the marketing technology landscape finding over 5,000 different products (up from 150 in 2011).


This is obviously important if your current ETL needs include a data source that no commercial ETL vendors support. However, it’s also important to choose a tool that will enable you to incorporate whatever new sources you will need in the future.


To do this, make sure that you choose a vendor that has a great API for importing data (which is true of many of the folks mentioned on this thread). With that, you can always write a script that pushes data to their API to be consolidated with your other data sources.
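The push-a-script-at-their-API approach usually boils down to batching records under the vendor’s payload limit and POSTing each batch. A sketch of the batching half (the size cap is illustrative, not any particular vendor’s limit):

```python
import json

def batches(records, max_bytes=1_000_000):
    """Split records into chunks whose serialized size stays under
    an API's payload limit."""
    chunk, size = [], 0
    for record in records:
        encoded = len(json.dumps(record))
        if chunk and size + encoded > max_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(record)
        size += encoded
    if chunk:
        yield chunk

# Each batch would then be POSTed to the vendor's import endpoint,
# e.g. requests.post(url, json=batch), with retry/backoff handling.
parts = list(batches([{"id": i} for i in range(3)], max_bytes=20))
```

The batching is the easy part; as the drawbacks below note, the ongoing operational work around such scripts is the real cost.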


There are some drawbacks to this approach that you should keep in mind:



  • If you’re reading this thread, you’re trying to get away from building ETL scripts in the first place

  • You’re on the hook for maintaining these scripts, which is a much bigger job than the initial build

  • You then have to do the operational work to make sure that these scripts run indefinitely and notify the right people when there is a problem with authorization, data volume, API quotas, etc. This is the biggest job of all.


This is why we launched Singer.io, an open source project for ETL integrations. There are more than 20 open source integrations to data sources (or “taps” as we call them), and more are being built all of the time. If you’re building a new one, you can reuse code from the existing taps and helper utilities. We also have partners who can build custom taps.


These taps are maintained by Stitch and the community, so you don’t have to bear that burden on your own. Finally, we can incorporate any integration written to the Singer specification directly into our system, so you can run Singer taps on your own hardware or from within Stitch.
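For a feel of what a Singer tap emits, here is a minimal sketch of the two core message types from the public Singer specification: a SCHEMA message declaring a stream’s shape, followed by RECORD messages carrying rows, each written as one JSON line to stdout. The `users` stream here is illustrative:

```python
import json

def schema_message(stream, schema, key_properties):
    """A Singer SCHEMA message declares a stream's JSON Schema."""
    return json.dumps({
        "type": "SCHEMA",
        "stream": stream,
        "schema": schema,
        "key_properties": key_properties,
    })

def record_message(stream, record):
    """A Singer RECORD message carries one row of data."""
    return json.dumps({"type": "RECORD", "stream": stream, "record": record})

lines = [
    schema_message(
        "users",
        {"type": "object", "properties": {"id": {"type": "integer"}}},
        ["id"],
    ),
    record_message("users", {"id": 1}),
]
```

Because the format is just JSON over stdout, any target that speaks the spec can consume any tap, which is what makes the taps composable across Stitch and self-hosted runs.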
