
Database Overview Series I: Transactional Databases

28 April 2016

Selecting the right analytics database for your business is a daunting task. This series covers the different types of databases we’ve encountered at Looker and how to get the most out of them as an analytics organization. We hope this guide will help you select or refine the best database for your analytics requirements. The first type we will cover is transactional databases.



Transactional databases are the most prevalent choice for analytics. They are convenient, familiar, and accessible. If an organization does not have a pre-existing analytics stack, this option is the fastest and easiest to set up.



## Popular transactional databases


MySQL / MariaDB / Aurora


PostgreSQL


SQL Server / [SQL Database](https://azure.microsoft.com/en-us/services/sql-database)


Oracle Database 12c



## Architecture



When transactional databases are used for analytics, they are most often configured as a slave (read replica) of the production database. This setup is widely documented and fairly accessible for most DBAs or engineers to implement, and cloud database vendors offer easy ways to provision the slave. It is preferable because it keeps analytical queries off the production database, which should be dedicated to writing data, and the slave stays close to real-time sync with its master.
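
As a rough illustration, this is what pointing a MySQL slave at its master looks like; the hostnames and credentials below are placeholders, and cloud-managed replicas hide these steps behind a console or API.

```sql
-- A minimal sketch of configuring a MySQL slave; hostnames and credentials
-- are placeholders. Newer MySQL releases spell these statements
-- CHANGE REPLICATION SOURCE TO and START REPLICA.
CHANGE MASTER TO
  MASTER_HOST = 'prod-db.internal',
  MASTER_USER = 'repl',
  MASTER_PASSWORD = '********',
  MASTER_AUTO_POSITION = 1;  -- GTID-based positioning

START SLAVE;
```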



Another setup we see in the wild is an analytics database set up as its own master, sitting downstream of the production database or various other sources. This requires building out an ETL pipeline from the different data sources into the central analytics database.
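
A hedged sketch of one incremental load step in such a pipeline, assuming illustrative staging.orders and analytics.orders tables:

```sql
-- One incremental ETL step into a central analytics database. The table and
-- column names (staging.orders, analytics.orders, created_at) are
-- illustrative, not from the article.
INSERT INTO analytics.orders (id, user_id, total, created_at)
SELECT id, user_id, total, created_at
FROM staging.orders s
WHERE s.created_at > (
  SELECT COALESCE(MAX(created_at), TIMESTAMP '1970-01-01 00:00:00')
  FROM analytics.orders
);
```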



## Constraints



Some transactional databases do not allow writes on the slave because they use binary (streaming) replication to keep the slave in sync with its master. This constraint prevents materializing views on the slave and incorporating other data sources into it. PostgreSQL is the most commonly used analytics database with this constraint.
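
For example, on a PostgreSQL streaming-replication standby you can confirm the read-only state and see the kind of error a write attempt produces; the view name below is hypothetical.

```sql
-- pg_is_in_recovery() reports whether this server is acting as a read-only
-- standby.
SELECT pg_is_in_recovery();

-- Any write on the standby is rejected, e.g. attempting to materialize a view:
--   CREATE MATERIALIZED VIEW daily_orders AS SELECT ... ;
--   ERROR:  cannot execute CREATE MATERIALIZED VIEW in a read-only transaction
```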



Transactional databases are mostly row stores, which are expensive to read from when rows are wide. This must be taken into consideration when working with large or sparse datasets.



## Optimization



The lowest-hanging fruit for optimizing your database is to lower the transaction isolation level. Dropping it to read committed will decrease replication lag and ease locking contention. This configuration is safe on a read-only machine because there are no write transactions running on analytical slaves or ETL dump destinations.
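
For example, an analytics session can be dropped to read committed with a single statement; MySQL syntax is shown, and PostgreSQL exposes the equivalent as a session setting.

```sql
-- Illustrative only: lowering the isolation level for an analytics session.
-- In PostgreSQL the equivalent session setting is:
--   SET default_transaction_isolation = 'read committed';
SET SESSION TRANSACTION ISOLATION LEVEL READ COMMITTED;
```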



The greatest performance boost for your transactional database is setting up proper indexes. Indexes greatly improve the query plans the database can choose when running JOINs and WHERE filters. Foreign keys on commonly joined tables and timestamp columns are generally the most important to index for analytical queries. This reduces the number of rows scanned for JOINs and for time-filtered reports, which are the most common patterns in analytical workloads.
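
For example, on a hypothetical orders table you would index the foreign key used in JOINs and the timestamp used in date filters:

```sql
-- Hypothetical orders table: index the foreign key used in JOINs and the
-- timestamp used in time-filtered reports.
CREATE INDEX idx_orders_user_id    ON orders (user_id);
CREATE INDEX idx_orders_created_at ON orders (created_at);
```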



Reducing the redundancy of the data is also a great way to improve read performance. Storing data in third normal form is one common pattern for achieving this. It can dramatically reduce the width of commonly read rows by moving less frequently used data into its own tables, which can be JOINed in when appropriate.
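
A rough sketch of what this looks like, using hypothetical users and user_profiles tables:

```sql
-- Narrowing a wide table: frequently read columns stay on users, while rarely
-- read profile text moves to its own table. All names here are hypothetical.
CREATE TABLE users (
  id         BIGINT PRIMARY KEY,
  email      VARCHAR(255),
  created_at TIMESTAMP
);

CREATE TABLE user_profiles (
  user_id     BIGINT PRIMARY KEY REFERENCES users (id),
  bio         TEXT,
  preferences TEXT
);

-- JOIN the detail table back in only when a report actually needs it:
SELECT u.email, p.bio
FROM users u
JOIN user_profiles p ON p.user_id = u.id;
```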



If the nature of your data is a sparse matrix, this may severely impact performance as well. The most common sparse datasets are things like medical records, server logs, and generally unstructured data. Columnar databases are designed at their core to handle this case, but in a row store there are some techniques we can use as well. One is to split the columns of a wide table out into multiple tables, which reduces the number of columns touched per query. Setting up an entity-attribute-value (EAV) table may also help with extremely sparse data, but it increases the complexity of analytical queries. A tool with a robust modeling layer such as Looker can handle this strategy well, but handwritten SQL can become unwieldy.
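
As a rough illustration, here is what an EAV layout and the pivot query it forces might look like; the table and attribute names are hypothetical:

```sql
-- A hypothetical entity-attribute-value (EAV) layout for sparse data.
CREATE TABLE patient_attributes (
  patient_id BIGINT,
  attribute  VARCHAR(64),
  value      TEXT
);

-- Reassembling a row requires one conditional aggregate per attribute, which
-- is why a modeling layer helps keep this maintainable:
SELECT
  patient_id,
  MAX(CASE WHEN attribute = 'blood_type' THEN value END) AS blood_type,
  MAX(CASE WHEN attribute = 'allergy'    THEN value END) AS allergy
FROM patient_attributes
GROUP BY patient_id;
```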



Finally, a great way to optimize queries is to pre-compute expensive calculations before query time. Creating transformations that roll raw rows up into aggregates or DISTINCT counts is a great way to speed up queries. The main tradeoff is that the rolled-up data must be constantly maintained and refreshed as new data is added.
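
One hedged sketch of such a rollup, using an illustrative orders table:

```sql
-- A pre-aggregated rollup table; table and column names are illustrative.
-- It must be rebuilt or incrementally refreshed as new rows land.
CREATE TABLE daily_order_rollup AS
SELECT
  DATE(created_at)        AS order_date,
  COUNT(*)                AS order_count,
  COUNT(DISTINCT user_id) AS distinct_buyers,
  SUM(total)              AS revenue
FROM orders
GROUP BY DATE(created_at);
```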



## What are they amazing at?!



Transactional databases are the best way to get started with an analytical database for making general business decisions. Working with OLTP data, or any low to moderate volume of data, for high-level analytics is a perfect use case for transactional databases.



They are also valuable for operational use cases that need real-time analysis, because they keep up with production data in real time. Customer success or operations teams that need real-time data will always be pulling the latest records from the transactional database.



## What are they not great at?



Transactional databases will start hitting performance limitations and require a great deal of tuning once you reach large-scale datasets. At this point, it is a good idea to start looking at MPP or dedicated analytical databases, covered in the next article in this series.

