Using Amazon Redshift’s new Spectrum Feature

bruce_sandell · 05-16-2017 05:04 PM

At the AWS Summit on Wednesday, April 19th, 2017, Amazon announced a new Redshift feature called Spectrum. You can take full advantage of Redshift Spectrum’s amazing performance from within Looker.

Spectrum significantly extends the functionality and ease of use of Redshift by letting users access data stored in S3 without having to load it into Redshift first. You can even join S3 data to data stored in Redshift, and the Redshift optimizer will take care of maximizing your query performance, optimizing both the S3 and Redshift portions of your query.

Prior to Spectrum, you were limited to the storage and compute resources that had been dedicated to your Redshift cluster. Spectrum now provides federated queries for all of your data stored in S3 and allocates the necessary resources based on the size of the query. These resources are not tied to your Redshift cluster, but are dynamically allocated by AWS based on the requirements of your query. And, perhaps most importantly, taking advantage of the new Spectrum feature is a seamless experience for end-users.

Pricing for data stored in and queried against your existing, relational Redshift cluster will not change. Queries against data that is stored in S3 will be charged on a per-query basis. As I write this, the cost is $5 per terabyte.

In order to use Spectrum, you’ll need to be sure that you are running at least version 1.0.1294 of Redshift. To verify the version of Redshift that you are using you can run this command from psql:

select version();

Looker release 4.14 offers full support for Redshift Spectrum. Earlier versions of Looker can also support Spectrum, but some manual steps are required. If you are interested in running Spectrum on an older version of Looker please contact your Customer Success Manager for the details.

Currently, Redshift is only able to access S3 data that is in the same region as the Redshift cluster.

In order for Redshift to access the data in S3, you’ll need to complete the following steps:
1. Create an IAM Role for Amazon Redshift.
2. Associate the IAM Role with your cluster.
3. Create an External Schema.
4. Create some external tables.
5. Query your tables.

Details of all of these steps can be found in Amazon’s article “Getting Started With Amazon Redshift Spectrum”.

Creating an external schema requires that you have an existing Hive Metastore (if you were using EMR, for instance) or an Athena Data Catalog. Creating an Athena Data Catalog is easy to do and is free to create. For more information on getting started with Athena, please refer to Amazon’s article “What is Amazon Athena”. The database specified in your “create external schema” statement must already exist in Athena or Hive.

To create an external schema using the IAM role created in step 1 above, run the following commands from psql:

create external schema if not exists s3
from data catalog database 'default' region 'us-west-2' 
iam_role 'arn:aws:iam::1111222233334444:role/my-redshift-role';

GRANT USAGE on SCHEMA s3 to <username that will be connecting>;

To create an external table in the newly created external schema, run the following command from psql:

CREATE EXTERNAL TABLE inventory_items (
    id int,
    product_id int,
    created_at timestamp,
    sold_at timestamp,
    cost double,
    product_category string,
    product_name string,
    product_brand string,
    product_retail_price double,
    product_department string,
    product_sku string,
    product_distribution_center_id int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE
LOCATION 's3://my-test-data/inventory_items';

To query data from the newly created external table, you’ll need to specify the name of the external schema when referencing the table. From psql or the Looker SQL Runner utility, run the following command:

SELECT product_name, product_category, product_brand from s3.inventory_items;

To access the Spectrum tables from within Looker just access the “Create Views from Tables” option (as you normally would) and look for the name of your external schema:

You can now access S3 data using Redshift Spectrum. This includes joining S3 data to data stored in Redshift itself with the Redshift optimizer maximizing both parts of the query.

References

Amazon Redshift - Cloud Data Warehouse - Amazon Web Services

Learn about Amazon Redshift cloud data warehouse. Amazon Redshift is a fast, simple, cost-effective data warehousing service. Amazon Redshift gives you the best of high performance data warehouses with the unlimited flexibility and scalability of...

http://docs.aws.amazon.com/redshift/latest/dg/c-getting-started-using-spectrum.html
http://docs.aws.amazon.com/athena/latest/ug/what-is.html

Rich1000

So so good, this makes half our ETL process redundant. My only worry is the bill!