ETL Tool Recommendations



Show first post

38 replies

To see how easy it is to use Fivetran with Looker, check out this video: https://www.youtube.com/watch?v=fWfyVtAQ6WI&feature=youtu.be

Etleap does exactly that: ETL data from Salesforce, Facebook, NetSuite, S3, Marketo, MySQL, and many others sources into a Redshift data warehouse.



A common use case we hear at Etleap (I’m on the team) is wishing to give analysts the ability to connect and manage the data pipelines they need to get more done in Looker. Otherwise they would need to depend on the engineering team to set up and maintain these pipelines, which can take weeks and break often. Meanwhile, engineering are happy to free their time from creating and maintaining these pipelines.



Since you can do transformations within Looker, for some companies it’s enough to have a data pipeline service that just ingests data and loads it into the warehouse (that is, ETL without the T, or ELT), such as Segment, Stitch, and Fivetran.



Load first, ask questions transform later.



However, as complexity of data goes up, the need for transformations outside the warehouse (and Looker) does as well.



What’s more, if your company has strict security requirements around data—think health, financial services, enterprise software providers, etc—loading everything sight-unseen into a warehouse is a non-starter. For cases like that, being able to transform data before it hits the warehouse (and Looker) can be the difference between passing and failing a security review.



All that to say…



It depends on the specific use case and requirements.



Hopefully this helps you think through that and points you to a few good options.

On my opinion, there are quite a lot of cool and interesting tools. I personally use Skyvia and I am completely satisfied with it.


In my company we use Skyvia to load data from Salesforce to Redshift on a daily basis, it is enough for our analytics department.

I’m looking for some ETL (or ELT) /DW Automation tool recommendations too - but from a slightly different perspective.



Source is SQL Server, destination is PostgreSQL with the option of Redshift or Snowflake at a later date. After something that plays well with Looker maybe even generating some base LookML from it’s internal models.



The real tricky parts of us are





  1. Easy to use - we want our autonomous development teams to be able to manage their own destination schema and ETL. - And they are not ETL developers but know SQL every well.


  2. We are in a multi-tenant environment. - Around 400+ source databases to a similar number of destination databases. (but all same-schema)


Userlevel 2

Multi-tenant is really tricky. We (Fivetran) still don’t have a great out-of-the-box solution for sharded DBs, though we have some things coming out later this year that will drastically improve the situation. You should look at all the vendors, and I recommend talking to DAS42, a consultancy we’ve worked with on sharded DBs. Some key questions are:





  1. Is the set of 400 source databases relatively fixed, or is it adjusting dynamically all the time?


  2. Do all 400 source databases have the exact same schema, or is it possible for the schemas to differ?


  3. What’s the total data size?




DO NOT try to put 400 SQL Servers into a single Postgres destination. It will die. Really, your only option here is going to be Snowflake. Postgres will get killed by the data volume, Redshift will get killed by the concurrency. BigQuery might be a workable option, though you didn’t mention it.

@George_Fraser it’s 400 and increasing at a small rate each year.



There is no requirement to aggregate data across the 400 source DB’s into a single DW - ie, we could have 400 DW’s. 400 ETL processes with each customer reporting off their own DW using Looker.



Obviously automation is key here - we don’t want to manually manage 400 schemas, etl definitions etc.



The 400 source DB’s have the exact same schema, however the destination DW’s schemas will be ~90% the same schema. There are some attribute-value pair tables & other flexible data structures that will be loaded into customer specific dimensions. This transforms can all be rule/code driven (ability for us to hook custom code into the pipeline is a must)



Most source data is <10gb per tenant with a few up around the 150gb - nothing over 250gb.



Total size in the largest region is <2tb. - This is spread across 6 regions.



Current thinking is a bunch of Postgres RDS instances. - Snowflake would be REALLY nice to for this use-case, however data sovereignty/control adds another layer of complexity - still do-able if it’s the best option.

Most of the tools mentioned in this thread should be able to go from SQL Server to PostgreSQL (or Redshift or Snowflake).



Some of Looker’s partners have generated Looker Blocks for popular sources (for example, Stitch has contributed ones for Facebook Ads, Salesforce, and others), but I’m not sure you’ll find a solution from an ETL vendor that generates LookML based on your custom DB schema.



For easy-to-use and SQL-based, you want to go ELT rather than ETL, and then use dbt for transformation. ELT tools eliminate a lot of the complexity of traditional ETL by separating out the SQL. Dbt lets you manage SQL-based models and transforms.



Multi-tenant: Do you actually want to go to 400 destinations? If you’re going to be using Looker on top of the data, it would probably make your life easier if you consolidated the data to a single destination, or as few as possible.



Either way, if you have 400+ source DBs (and possibly another 400+ destinations), you may want to set up these connections programmatically rather than via a UI.

Automation is an absolute must - which is why we love Looker so much (we can generate LookML models in code where required)



dbt’s model-as-code & transforms-as-code aligns well to our philosophy.



Looking forward to getting my hands dirty with dbt.



Thank you for the suggestion.

Renta is my favorite ETL tools, try them: https://renta.im/

I also finished now validation for my company between 3 of the mentioned above.



The chosen one was Rivery (Rivery.io). we found it the most easier to implement especially when working with snowflake.



We got access to variety of APIs we need, the only one we missed they promised to deploy in 1 week, meanwhile they helped us to build temporary API by using their Custom API feature.



The main value for us was the logic layer where we can run our transformations & algorithms.



The POC & trial was really good, hope also to have same experience in real life 🙂

We originally gave Talend a shot, but since have settled comfortably on Apache Airflow.


The latter wold require more development skills, though.


I’ve found this list to be quite helpful in searching for Big Data tools




Good discussion here.



Full disclosure, I work for Bedrock Data. Our newest product, Fusion, could help folks here by getting data from applications (NetSuite, Salesforce, Marketo, HubSpot, Jira, Shopify, etc.)., into a MySQL warehouse you can connect to Looker. Good if you’re worried about hitting API limits and want to get up and running with LookML dashboards lickety-split.

I am thinking about getting a commercial ETL tool. We have a bunch of data sources (Facebook, Salesforce, NetSuite etc.). They are important and right now we have jobs scheduled in all kinds of places that are owned by multiple employees. In an effort to operationalize this, I am looking at some ETL tools.

 

Do you guys have any recommendations? Does anyone know of “the Looker of ETL tools?”

 

 

Thanks,

Lucas

 

You can try an ETL tools like https://www.windsor.ai/fivetran-alternative/   , going to help you a lot , gives you feature like ,  no of reports , no of data connectors , users , dashboards budget optimizer , api + database access . etc and many more https://www.windsor.ai/pricing/ 

Reply