LookR: Example Code for Predictive Analytics With ...

marcell1 · 05-13-2016 03:07 PM

Sometimes, doing SQL-based analytics just won’t get us the power that statistics and machine learning can offer. For scenarios like this, other tools like Python or R can be extremely powerful. That said, MPP database technologies will always be faster when it comes to data transformations. Additionally, having a single set of easily extensible definitions of dimensions and measures is invaluable for data science teams. Thus, we can combine the ability to easily define/execute transformations (Looker) with a statistical capabilities of a software package ®, to get the best of both worlds.

Furthermore, we can then operationalize the results of a statistical model by bringing that data back into Looker, slicing it, visualizing it, and using it in all of our relevant decision making processes.

Below are snippets code needed to be able to together a simple predictive model (I’m using the CRAN library’s random forest algorithm) and put it into production with Looker. This will use Lookers, integration with R, LookR. DISCLAIMER: This is not a lesson in statistical modeling / machine learning methods. I assume you know how to do that, before reading this.

Let’s start by getting our relevant libraries within R:

# My R script
library(RJDBC)
library(plyr)
library("rjson") 
library("rJava")
library("randomForest")
library("LookR")

Great, once we’re set with that, we can run the Looker setup to connect to our instance. You’ll need to generate an api id and secret ahead of time, then plug that into the LookR function:

looker_setup(id = "myapiid",
             secret = "myapisecret",
             api_path = "https://mycompany.looker.com:19999/api/3.0"
)

Now that we’ve established our credentials, we can start accessing data from Looker. We have two options, running a query, or running a look. Let’s run a query first to get data for training:

getTrainingData <- function(){
  returnValue <- run_inline_query(model = "my_model", 
                                 view = "my_explore", 
                                 fields = c(
                                   "my_view_1.input_variabl_a",
                                   "my_view_1.input_variabl_b",
                                   "my_view_2.input_variabl_c",
                                   "my_view_2.input_variabl_d",
                                   "my_view_2.input_variabl_e.",
                                   "my_view_3.output_variable"
                                 ),
                                 filters = list(
                                   c("my_view_1.vilter_variable_a", "my_desired_value_a"),
                                   c("my_view_1.vilter_variable_b", "my_desired_value_b")
                                 )
  )
}

We can also run a simple look. Note that look limits are enforced, so use caution with this method:

getProdData <- function(){
  returnValue <- run_look(42)
}

Now we’re going to get into some R/statistic specifics. I wanted to include this here so that you can get the full example code for a simple predictive model. All this function is doing is taking our training data and splitting it into a training and testing set. If you want to add some checks to accuracy before moving into production, you would use this test set.

# GLOBAL VARIABLES
mySeed = 321

splitData <- function(data,sampleSize,seed) {
  smp_size <- floor(0.75 * nrow(data))
  if (!is.null(seed)) set.seed(seed)
  set.seed(seed)
  train_ind <- sample(seq_len(nrow(data)), size = smp_size)
  train <- data
  test <- data[-train_ind, , drop = FALSE]
  return(list(train,test))
}

Alright, let’s train our model with some data! This is a very simple way of giving our random forest some data. We start with the output variable (what we’re trying to predict), and list out the input variables. The output of this function is not a data frame, but the machine learning model itself:

trainForest <- function(thisData){
  rf <- randomForest( my_view_3.output_variable ~ 
                      my_view_1.input_variabl_a +
                      my_view_1.input_variabl_b +
                      my_view_2.input_variabl_c +
                      my_view_2.input_variabl_d +
                      my_view_2.input_variabl_3
                      , data=thisData)
}

Great, now we have a fantastic model, rf, let’s use that to actual score our production data (data that the model hasn’t seen before), and merge our predictions with that table as another column.

scoreForest <- function(rf,thisData){
  thisData$win_score <- predict(rf,thisData, "prob")
  thisData$predict <- predict(rf,thisData)
  return(thisData)
}

Okay, we’ve gone over all of the pieces, now let’s put them together into larger functions that call of those in the proper order. There are two main components for this, training and production:

mainTrain <- function(){
  myTrainData <- getTrainingData() 
  mySplitData <- splitData(myTrainData,.7,mySeed)
  trainData <- mySplitData[[1]]
  testData <- mySplitData[[2]]
  rf <- trainForest(trainData)
  testValues <- scoreForest(rf,testData)
  return(rf)
}


mainProd <- function(randomForst){
  myProdData <- getProdData()
  scoredData <- scoreForest(randomForst,myProdData)
  cleanData <- cleanColumns(scoredData)
  writeToDatabase(cleanData,Sys.Date())
  return(cleanData)
}

I jumped the gun a little bit. I added a “writeToDatabase” function in there that I have’t defined yet. I’ll define that below. In this case, I’m writing the data to redshift. WARNING: THIS IS HOW TO DO IT FOR REDSHIFT ONLY!!! FOR OTHER DATABASES, PLEASE SEARCH FOR THE APPROPRIATE API ONLINE:

writeToDatabase <-function(dataframe,date){
  dataframe$score_date <- rep(date,nrow(dataframe))
  driver <- JDBC("com.amazon.redshift.jdbc41.Driver", "RedshiftJDBC41-1.1.9.1009.jar", identifier.quote="`")
  url <- "jdbc:redshift://MYCOMPANY-REDSHIFT-URL:5439/mycompany?user=USERNAME&password=MYPASSWORD" 
  
  con <- dbConnect(driver, url)
  
  dbWriteTable(con, "public.my_predictions_table", dataframe, overwrite=FALSE, append=TRUE) 
  dbDisconnect(con)
}

Fantastic, we’ve come full circle. We can get data out of looker, train a model on that, get another set of data, run the model on it, add the predictions on as a new column, then write this back to the DB! That’s quite a lot, but that’s when the fun really begins. We can then model around this table in Looker, and combine these

So, let’s actually put this all together in one neat function that we can call to do the entire process. Note that this will retrain the model every time it is called, which may be a good thing if you want to keep your model fully up to date:

main <- function(){
  random_forest_model <- mainTrain()
  mainProd(random_forest_model)
}

main()

Voila! You can get the example code here.

weitzenfeld

This pattern has been huge for us. It means we only have to do data munging / feature generation once - in Looker. We can write model outputs back into Redshift, and users can see model outputs right next to model inputs.

Andrew_Kraemer

This has has significantly increased our ability to do on-the-fly analysis for us. Looker is amazing, but sometimes you need to roll your sleeves and do some manual analysis.

I also use it to power some shiny apps.

A few requests of lookR, though:

I’d love to be able to use the run_look() function without having to worrying about row limits.
If that won’t happen, can we use the query id to pull data from the sql runner? (resulting work flow: create & maintain complex query using a explore/look -> paste query without limit in sql runner -> run_query() in R)
All fields come back as factors. Any way dates can come back as dates and measures as numeric?

LookR: Example Code for Predictive Analytics With Looker & R