This content, written by Zachary Aragosa, was initially posted in Looker Blog on Nov 14, 2018. The content is subject to limited support.
Amazon’s Athena database supports a wide array of geospatial functionality that allows for building complex analysis with any data containing geographies or individual locations. With Looker, you can query data directly from Athena and leverage all of their geospatial functionality to give users the ability to work with massive geospatial data sets.
A particularly powerful geospatial data set that is available to the public is data. Since this data is available in a public S3 bucket, you’ll be able to very easily and start querying it. While it’ll be easy to get up and running right away running basic SELECT
queries, the complex structure of the dataset means that writing manual SQL queries every time you have a new question is going to be a challenge.
This is where combining with Looker’s LookML modeling layer becomes incredibly powerful. To demonstrate some of this functionality, I thought it would be interesting to explore the natural environment of Southern California by county.
Geospatial data example with Amazon Athena
OpenStreetMap data contains a tag for “” that describes mountains, beaches, capes, trees, etc. around the world. Here’s an example:
Location | Nature Type | Name | Tags | |
---|---|---|---|---|
52 | -0.062714,-78.8328985 | beach | Playa del Río Saloya | {natural=beach, surface=sand, name=Playa del Río Saloya} |
53 | -0.0647157,-77.9743917 | wetland | río volteado | {natural=wetland, name=río volteado} |
54 | -0.0657521,-80.1585446 | beach | La Boca de tabuga | {natural=beach, name=La Boca de tabuga} |
55 | -0.0665063,-78.3998779 | peak | Gualagüincha | {is_in:country_code=EC, natural=peak, gns_uni=-1372397, gns_classification=MT, name=Gualagüincha, source=GNS, is_in:state=Provincia de Pichincha, is_in:country=Ecuador} |
In order to compare which county has the most “Nature”, I loaded up the boundaryshapes
of into Athena. The boundary shape field can be read as a type ‘Polygon’ in Athena and represents a set of points.
name | boundaryshape | |
---|---|---|
1 | San Francisco | 00 00 00 00 03 05 00 00 00 45 d6 1a 4a ed a0 5e c0 23 a0 |
2 | Madera | 00 00 00 00 03 05 00 00 00 84 d8 99 42 e7 22 5e c0 c0 79 |
3 | San Mateo | 00 00 00 00 03 05 00 00 00 56 0c 57 07 40 a1 5e c0 65 69 |
Identifying nature
To limit the OpenStreetMap data down to just its natural elements and improve query performance, we can focus our dataset by creating a in Looker to shrink the data set considerably.
view: nature {
derived_table: {
sql:
SELECT
tags,
id,
lat,
lon,
FROM planet
WHERE type = 'node' and
tags['natural'] is not null
;;
}
}
Joining geographies
To figure out which of our natural points fall within in each county, we can join the counties
table with the nature
table using Athena’s ST_CONTAINS
function. ST_CONTAINS
returns TRUE
when a specific nature point falls within the boundaries of each county’s polygon:
explore: nature {
join: counties {
relationship: many_to_one
sql_on: st_contains(${counties.boundaryshape}, ST_POINT(${nature.lon}, ${nature.lat})) ;;
}
}
Unnesting data
Since our nature data is highly nested, we can use Athena’s un-nesting functions to parse out the details about each point of interest and codify that logic in LookML.
dimension: type {
type: string
sql: ${TABLE}.tags['type'] ;;
description: "Nature Type as Defined by OpenStreetMaps"
}
Diving into nature
From here, we can start to analyze the geographic features of southern California and was able to compare the number of beaches by county. Turns out Looker’s headquarters in Santa Cruz County has the second most beaches:
San Bernardino County, not surprisingly, contains the most peaks over 1000 feet in elevation.
Drilling into this data, I found that the highest peak in San Bernardino is San Gorgonio Mountain at 3,502’.
Nature Name | Nature Elevation | |
---|---|---|
1 | San Gorgonio Mountain | 3,502 |
2 | Jepson Peak | 3,416 |
3 | Anderson Peak | 3,304 |
4 | Charlton Peak | 3,291 |
5 | Shields Peak | 3,261 |
Creating a more complex geospatial data analysis
Say you were interested in planning a hiking vacation in California, an important decision for your trip will be picking a hotel that is closest to the highest peaks you intend to climb. Luckily, we can pull hotel information from OpenStreetMap by filtering our data on WHERE tags['tourism'] = 'hotel'
.
ID | Tags | Lon | Lat | |
---|---|---|---|---|
1 | 595970835 | {addr:housenumber=715, addr:country=US, name=The Beach Cottages, tourism=hotel, source=SanGIS Addresses Public Domain (http://www.sangis.org/), addr:street=Thomas Avenue, addr:postcode=92109, addr:city=San Diego} | -117.255362 | 32.79297 |
2 | 595978669 | {is_in:country_code=US, addr:housenumber=1558, is_in:state_code=CA, addr:country=US, name=Sheraton Carlsbad Resort and Spa, tourism=hotel, source=SanGIS Addresses Public Domain (http://www.sangis.org/), addr:street=East Balboa Court, is_in:state=California, addr:postcode=92008, is_in:country=United States of America, addr:city=Carlsbad} | -117.311829 | 33.134073 |
3 | 596004616 | {addr:housenumber=4767, addr:country=US, name=Capri by the Sea by All Seasons Resort Lodging, tourism=hotel, source=SanGIS Addresses Public Domain (http://www.sangis.org/), addr:street=Ocean Boulevard} | -117.2583404 | 32.8004225 |
4 | 596022506 | {rooms=23, internet_access=wlan, addr:state=CA, addr:country=US, internet_access:fee=no, tourism=hotel, stars=3, source=SanGIS Addresses Public Domain (http://www.sangis.org/), addr:postcode=92101, addr:city=San Diego, addr:housenumber=505, smoking=no, name=Found Hotel San Diego, addr:street=West Grape Street} | -117.16751 | 32.725654 |
In Athena, I used the ST_BUFFER()
function to find the natural elements nearby each hotel in the data set.
explore: hotel {
join: nature {
relationship: many_to_one
sql_on: st_contains(ST_BUFFER(ST_POINT(${hotel.lon},${hotel.lat}), .25),
ST_POINT(${nature.lon}, ${nature.lat})) ;;
}
}
Using LookML, you can create a metric that will allow you to count only the natural elements that have peaks greater than 1000 feet high.
measure: high_peak_count {
type: count
filters: {
field: nature_type
value: "peak"
}
filters: {
field: elevation
value: ">1000"
}
drill_fields: [detail*]
}
Then, we could use Looker’s explore section to build a query that counts the number of high peaks within a short drive from each hotel.
It turns out that one region in California in particular, the Big Sur area, offers 4 hotels that are each within a short drive of over 20 high peaks. The Big Sur Lodge seems like it might be a great place to take a hiking vacation.
We can drill into the count of high peaks and plot the exact locations of each peak.
To pinpoint exactly which mountains are closest to my hotel, we can use the ST_DISTANCE
function to calculate distance as a crow flies between two points. (In degrees):
dimension: distance_hotel_to_nature {
type: number
value_format_name: decimal_2
sql: ST_DISTANCE(ST_POINT(${nature.lon}, ${nature.lat} ), ST_POINT(${lon}, ${lat})) ;;
}
Name | Elevation | Distance Hotel to Nature | |
---|---|---|---|
1 | Manuel Peak | 1,074 | 0.03 |
2 | Port Summit | 1,050 | 0.04 |
3 | Pico Blanco | 1,126 | 0.07 |
4 | Ventana Double Cone | 1,477 | 0.09 |
5 | Mount Olmstead | 1,094 | 0.09 |
Conclusion
Geospatial datasets like OpenStreetMap can help users answer many important and challenging questions about our environment. However, historically, geospatial data was notoriously cumbersome and clunky to work with. Data scientists were always required to meticulously transform the data for every new question asked by a non-technical user. Now, with a combination of Amazon Athena’s geospatial functionality and LookML, analysts can create an environment that enables non-technical end users to explore these types of data sets and answer questions on their own.