Parsing JSON arrays coming from MongoDB in BigQuery

Parsing JSON, and especially arrays of JSON, is usually a pain, so I just wanted to share what I use to do this. In the following example, I used the Dataflow MongoDB to BigQuery template, which is a no-code way of piping your data from MongoDB to BigQuery. In this case, the flattened option was not selected.

So this template created a table called delivery_orders in BigQuery with two columns: one called id, and one called source_data, which contains a JSON blob with the following structure:

{
"id": "gkdlvjhkfdld",
  "user_id": "a7087yv9a8fds7vya9,
  "billing_address": {
    "name": "john doe",
    "street": "7 fun plaza",
    "number": "193",
    "zip": "5123456",
    "city": "zombieland"
  },
  "orders": [
    {
      "store_id": "9bb9dc1b1a58c5aa7",
      "products": [
        {
          "gtin": "6234623462346",
          "quantity": 12,
          "price": "11.99",
          "status": "DELIVERED"
        },
        {
          "gtin": "4547413453466",
          "quantity": 1,
          "price": "1.99",
          "status": "DELIVERED"
        }
      ],
      "total": "13.98"
    },
    {
      "storeId": "a0545518b9adc6",
      "products": [
        {
          "gtin": "4260745675612",
          "quantity": 1,
          "price": "6.9",
          "status": "DELIVERED"
        }
      ],
      "total": "6.9"
    }
  ]
}

We notice that the delivery_orders blob includes an array (a repeating pattern of data with the same structure) for all the orders. It has a variable size, as it can contain anywhere from 1 to x orders, and it starts and ends with square brackets. Within an order we can also have a variable number of products, also stored in an array.

When creating the view in Looker, I would get the following:

view: delivery_orders {
  sql_table_name: `mongodb.delivery_orders`  ;;

  dimension: id {
    primary_key: yes
    hidden: yes
    type: string
    sql: ${TABLE}.id ;;
  }

  dimension: source_data {
    hidden: yes
    type: string
    sql: ${TABLE}.source_data ;;
  }
}

To get all the first-level fields, like user_id, I just need to add the following to my delivery_orders view:

dimension: user_id {
    type: string
    sql: REPLACE(JSON_EXTRACT(${source_data}, '$.user_id'), "\"", "") ;;
  }

The REPLACE is used to remove the double quotes that JSON_EXTRACT keeps around string values.
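If you want to check this expression outside of Looker first, you can run the raw SQL directly in BigQuery. This is only a sketch, assuming the mongodb.delivery_orders table from the view above; the billing_address.city path is just there to show that nested objects work the same way, and JSON_EXTRACT_SCALAR is BigQuery's built-in way to get the value without the surrounding quotes in one step:

SELECT
  id,
  REPLACE(JSON_EXTRACT(source_data, '$.user_id'), '"', '') AS user_id,
  JSON_EXTRACT_SCALAR(source_data, '$.user_id') AS user_id_scalar,  -- same result, no REPLACE needed
  REPLACE(JSON_EXTRACT(source_data, '$.billing_address.city'), '"', '') AS billing_city
FROM `mongodb.delivery_orders`;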

Now, to parse the array of orders, here is the process. In the delivery_orders view, create a new dimension called orders which will include all the orders:

dimension: orders {
    hidden: yes
    type: string
    sql: JSON_EXTRACT_ARRAY(${source_data}, '$.orders') ;;
  }

This will just extract the JSON into an array, which BigQuery understands.
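As a quick illustration of what this extraction returns (again just a sketch against the same mongodb.delivery_orders table), JSON_EXTRACT_ARRAY produces an ARRAY&lt;STRING&gt; with one JSON string per order:

SELECT
  id,
  JSON_EXTRACT_ARRAY(source_data, '$.orders') AS orders,  -- ARRAY<STRING>, one element per order object
  ARRAY_LENGTH(JSON_EXTRACT_ARRAY(source_data, '$.orders')) AS order_count
FROM `mongodb.delivery_orders`;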

Now, to get all the orders from the array, we are going to create a new view called orders:

view: orders {
  sql_table_name: `orders`;;

  dimension: products {
    hidden: yes
    type: string
    sql: JSON_EXTRACT_ARRAY(${TABLE}, '$.products') ;;
  }

  dimension: total {
    type: number
    sql: cast(REPLACE(JSON_EXTRACT(${TABLE}, '$.total'),"\"","") as float64) ;;
  }

  dimension: store_id {
    type: string
    sql: cast(REPLACE(JSON_EXTRACT(${TABLE}, '$.store_id'),"\"","") as string) ;;
  }
}

Notice that this view also includes an array for all the products included in an order. I am also creating dimensions for all the “first-level” fields like total or store_id.

Now, to flatten this orders view within my delivery_orders explore, I write the following code in my model:

explore: delivery_orders {
  join: orders {
    sql: LEFT JOIN UNNEST(${delivery_orders.orders}) orders ;;
    relationship: one_to_many
  }
}

Make sure that the orders alias at the end, just before the semicolons, matches the name of your view.
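To make that alias requirement concrete, here is roughly the query this join produces when you select the order total, written as standalone BigQuery SQL you could run yourself (a sketch only, reusing the same mongodb.delivery_orders table). The unnested element gets the alias orders, which is why the orders view's ${TABLE} references resolve to a single order's JSON string:

SELECT
  delivery_orders.id,
  CAST(REPLACE(JSON_EXTRACT(orders, '$.total'), '"', '') AS FLOAT64) AS orders_total
FROM `mongodb.delivery_orders` AS delivery_orders
-- each element of the orders array becomes one row, aliased as `orders`
LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(delivery_orders.source_data, '$.orders')) AS orders;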

Now for the products, I am also creating a new view:

view: products {
  sql_table_name: `products` ;;

  dimension: gtin {
    type: string
    sql: cast(REPLACE(JSON_EXTRACT(${TABLE}, '$.gtin'),"\"","") as string) ;;
  }

  dimension: price {
    type: number
    sql: cast(REPLACE(JSON_EXTRACT(${TABLE}, '$.price'),"\"","") as float64) ;;
  }

  dimension: status {
    suggest_persist_for: "24 hours"
    type: string
    sql: cast(REPLACE(JSON_EXTRACT(${TABLE}, '$.status'),"\"","") as string) ;;
  }

  dimension: quantity {
    type: number
    sql: cast(REPLACE(JSON_EXTRACT(${TABLE}, '$.quantity'),"\"","") as float64) ;;
  }
}

And I can add this to my explore in the model:

explore: delivery_orders {
  join: orders {
    sql: LEFT JOIN UNNEST(${delivery_orders.orders}) orders ;;
    relationship: one_to_many
  }

  join: products {
    view_label: "Products"
    sql: LEFT JOIN UNNEST(${orders.products}) products ;;
    relationship: one_to_many
  }
}

And that's it. Now BigQuery can easily access all the data from your JSON!
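For reference, the whole chain can be reproduced as one standalone BigQuery query, one row per product, which is handy for checking the Looker results against the raw table (again a sketch, using the same mongodb.delivery_orders table and the field names from the sample document):

SELECT
  d.id,
  REPLACE(JSON_EXTRACT(d.source_data, '$.user_id'), '"', '') AS user_id,
  REPLACE(JSON_EXTRACT(o, '$.store_id'), '"', '') AS store_id,
  CAST(REPLACE(JSON_EXTRACT(o, '$.total'), '"', '') AS FLOAT64) AS order_total,
  REPLACE(JSON_EXTRACT(p, '$.gtin'), '"', '') AS gtin,
  CAST(JSON_EXTRACT(p, '$.quantity') AS INT64) AS quantity,  -- numbers come back without quotes
  CAST(REPLACE(JSON_EXTRACT(p, '$.price'), '"', '') AS FLOAT64) AS price,
  REPLACE(JSON_EXTRACT(p, '$.status'), '"', '') AS status
FROM `mongodb.delivery_orders` AS d
LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(d.source_data, '$.orders')) AS o   -- one row per order
LEFT JOIN UNNEST(JSON_EXTRACT_ARRAY(o, '$.products')) AS p;            -- one row per product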
