This content, written by Dillon Morrison, was initially posted in Looker Blog on Aug 21, 2017. The content is subject to limited support.
As the breadth of AWS products and services continues to grow (hundreds of new products last year alone!), customers are able to move ever-increasing portions of their tech stack and core infrastructure to the cloud.
From a pricing perspective, this can be incredibly attractive. Rather than paying heavy upfront costs for large on-premise systems, customers can pay for services on-demand or reserve them for specific periods of time, and automatically scale resources as needed. While often cheaper, the complexity of pricing each service by the hour naturally lends itself to a more complicated and nuanced cost-structure, making it difficult to fully understand where you’re spending the most money, and how to reduce costs.
While AWS’s built-in cost reporting is great for aggregated views, conducting analysis on the raw data using the flexibility and power of SQL allows for much richer detail and insight, and is ultimately the better choice for the long term. Thankfully, with Amazon Athena, monitoring and managing these costs is now easier than ever. With Athena, there’s no need to create hundreds of Excel reports, move data around, or deploy clusters to house and process data. Analysis can be performed directly on raw data in S3. Conveniently, Amazon exports raw cost and usage data directly into a user-specified S3 bucket, making it simple to start querying with Athena quickly. This makes continuous monitoring of costs virtually seamless, since there is no infrastructure to manage. Instead, users can leverage the power of the Athena SQL engine to easily perform ad-hoc analysis and data discovery without needing to set up a data warehouse.
Once the data pipeline is established, the cost and usage data (the recommended billing data, per AWS documentation) provides a wealth of information about usage of AWS services and the associated costs. Whether you need the report segmented by product type, user identity, or region, it can be sliced any number of ways to properly allocate costs for any of your business needs. You can then drill into any specific line item to see even further detail, such as the selected operating system, tenancy, and purchase option (on-demand, spot, or reserved).
Athena utilizes Apache Hive’s data definition language to create tables, and the Presto querying engine to process queries. By default, the Cost and Usage report exports CSV files, which you can compress using gzip (recommended for performance). There are some additional configuration options for tuning performance further, which we discuss below.
In this blog post, we’ll walk through setting up the data pipeline for Cost and Usage Reports, S3, and Athena, and discuss some of the most common levers for cost savings.
1. Setting up S3 and Cost and Usage reports
First, you’ll want to create a new S3 bucket. Then, you’ll need to enable the Cost and Usage report (AWS provides clear instructions on how to do this). Check the boxes to “Include ResourceID” and receive “Hourly” reports. All options are prompted in the report-creation window. Lastly, be sure to assign the appropriate IAM permissions to the bucket for the report. The permission policy can be found in step two of the report creation process.
2. Configuring the S3 bucket for Athena querying
The Cost and Usage report dumps CSV files into the specified bucket. As with any AWS service, make sure that you’ve granted Athena the appropriate permissions to access that bucket.
In addition to the CSV, AWS also creates a JSON manifest file for each report. Athena requires that all of the files in the S3 bucket are in the same format, so we need to get rid of all these manifest files. If you’re looking to get started with Athena quickly, you can simply go into your S3 bucket and delete the manifest file manually, skip the automation described below, and move on to step 3.
If you want to automate the process of removing the manifest file each time a new report is dumped into S3 (recommended, especially as you scale), there are a few additional steps. The folks at Concurrency Labs wrote a great overview and set of scripts for this.
These scripts take the data from an input bucket, remove anything unnecessary, and dump it into a new output bucket. We can utilize AWS Lambda to trigger this process whenever new data is dropped into S3, or on a nightly basis, or whatever makes most sense for your use-case, depending on how often you’re querying the data. Please note that enabling the “hourly” report means that data is reported at the hour-level of granularity, not that a new file is generated every hour.
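As a rough illustration of the “remove anything unnecessary” step, here is a minimal Python sketch that picks the report data files out of a list of S3 object keys and leaves the JSON manifests behind. It is not the Concurrency Labs script; the function name, the sample keys, and the assumption that data files end in `.csv` or `.csv.gz` are all hypothetical.

```python
def select_report_files(keys):
    """Return only the report data files from a list of S3 object keys.

    Assumption: data files end in .csv or .csv.gz, and everything else
    (such as the JSON manifest) should be left out of the output bucket.
    """
    data_suffixes = (".csv", ".csv.gz")
    return [key for key in keys if key.lower().endswith(data_suffixes)]


# Hypothetical keys, shaped like a Cost and Usage report delivery.
keys = [
    "reports/my-report/20170801-20170901/my-report-Manifest.json",
    "reports/my-report/20170801-20170901/abc123/my-report-1.csv.gz",
    "reports/my-report/20170801-20170901/abc123/my-report-Manifest.json",
]

print(select_report_files(keys))
```

In a Lambda function, the same filter could be applied to the keys listed in the input bucket before copying the surviving objects to the output bucket.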
Following these scripts, you’ll notice that we’re adding a date partition field, which isn’t necessary but increases query performance. In addition to compression (taken care of for us automatically) and partitioning, the third lever for performance improvements is converting the data from CSV to a columnar format like ORC or Parquet. We can also automate this process using Lambda whenever new data is dropped in our S3 bucket. Amazon discusses columnar conversion at length, and provides walkthrough examples.
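To make the partitioning idea concrete, the sketch below derives a Hive-style partition prefix from a billing-period folder name. It assumes the folder is named as a `yyyymmdd-yyyymmdd` date range (for example `20170801-20170901`); the function name and the `year=/month=` layout are illustrative choices, not something the scripts above are known to use.

```python
from datetime import datetime


def partition_prefix(billing_period):
    """Turn a billing period such as '20170801-20170901' into 'year=2017/month=08/'.

    Assumption: the period is a start and end date joined by a hyphen,
    and partitioning by the start month is what the queries need.
    """
    start_text, _end_text = billing_period.split("-")
    start = datetime.strptime(start_text, "%Y%m%d")
    return f"year={start.year}/month={start.month:02d}/"


print(partition_prefix("20170801-20170901"))
```

Writing each month’s files under its own prefix lets Athena skip whole months when a query filters on the partition columns.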
As a long-term solution, best practice is to use compression, partitioning, and conversion. However, for purposes of this walkthrough, we’re not going to worry about them so we can get up-and-running quicker.
3. Set up an AWS Athena querying engine
In your AWS console, navigate to the Athena service, and click “Get Started”. Follow the tutorial and set up a new database (we’ve called ours “aws_optimizer” in this example). Don’t worry about configuring your initial table, per the tutorial instructions. We’ll be creating a new table for cost and usage analysis. Once you’ve walked through the tutorial steps, you’ll be able to access the Athena interface, and can begin running Hive DDL statements to create new tables.
For Cost and Usage, we recommend using the DDL statement below. Since our data is in CSV format, we don’t need to use a SerDe; we can simply specify the field delimiter, escape character, and line terminator, along with the file format (“TEXTFILE”). Note that AWS does have an OpenCSV SerDe as well, if you prefer to use that.
CREATE EXTERNAL TABLE IF NOT EXISTS cost_and_usage (
  identity_LineItemId String, identity_TimeInterval String,
  bill_InvoiceId String, bill_BillingEntity String, bill_BillType String,
  bill_PayerAccountId String, bill_BillingPeriodStartDate String, bill_BillingPeriodEndDate String,
  lineItem_UsageAccountId String, lineItem_LineItemType String,
  lineItem_UsageStartDate String, lineItem_UsageEndDate String,
  lineItem_ProductCode String, lineItem_UsageType String, lineItem_Operation String,
  lineItem_AvailabilityZone String, lineItem_ResourceId String, lineItem_UsageAmount String,
  lineItem_NormalizationFactor String, lineItem_NormalizedUsageAmount String, lineItem_CurrencyCode String,
  lineItem_UnblendedRate String, lineItem_UnblendedCost String,
  lineItem_BlendedRate String, lineItem_BlendedCost String,
  lineItem_LineItemDescription String, lineItem_TaxType String,
  product_ProductName String, product_accountAssistance String, product_architecturalReview String,
  product_architectureSupport String, product_availability String, product_bestPractices String,
  product_cacheEngine String, product_caseSeverityresponseTimes String, product_clockSpeed String,
  product_currentGeneration String, product_customerServiceAndCommunities String, product_databaseEdition String,
  product_databaseEngine String, product_dedicatedEbsThroughput String, product_deploymentOption String,
  product_description String, product_durability String, product_ebsOptimized String,
  product_ecu String, product_endpointType String, product_engineCode String,
  product_enhancedNetworkingSupported String, product_executionFrequency String, product_executionLocation String,
  product_feeCode String, product_feeDescription String, product_freeQueryTypes String,
  product_freeTrial String, product_frequencyMode String, product_fromLocation String,
  product_fromLocationType String, product_group String, product_groupDescription String,
  product_includedServices String, product_instanceFamily String, product_instanceType String,
  product_io String, product_launchSupport String, product_licenseModel String,
  product_location String, product_locationType String, product_maxIopsBurstPerformance String,
  product_maxIopsvolume String, product_maxThroughputvolume String, product_maxVolumeSize String,
  product_maximumStorageVolume String, product_memory String, product_messageDeliveryFrequency String,
  product_messageDeliveryOrder String, product_minVolumeSize String, product_minimumStorageVolume String,
  product_networkPerformance String, product_operatingSystem String, product_operation String,
  product_operationsSupport String, product_physicalProcessor String, product_preInstalledSw String,
  product_proactiveGuidance String, product_processorArchitecture String, product_processorFeatures String,
  product_productFamily String, product_programmaticCaseManagement String, product_provisioned String,
  product_queueType String, product_requestDescription String, product_requestType String,
  product_routingTarget String, product_routingType String, product_servicecode String,
  product_sku String, product_softwareType String, product_storage String,
  product_storageClass String, product_storageMedia String, product_technicalSupport String,
  product_tenancy String, product_thirdpartySoftwareSupport String, product_toLocation String,
  product_toLocationType String, product_training String, product_transferType String,
  product_usageFamily String, product_usagetype String, product_vcpu String,
  product_version String, product_volumeType String, product_whoCanOpenCases String,
  pricing_LeaseContractLength String, pricing_OfferingClass String, pricing_PurchaseOption String,
  pricing_publicOnDemandCost String, pricing_publicOnDemandRate String, pricing_term String, pricing_unit String,
  reservation_AvailabilityZone String, reservation_NormalizedUnitsPerReservation String,
  reservation_NumberOfReservations String, reservation_ReservationARN String,
  reservation_TotalReservedNormalizedUnits String, reservation_TotalReservedUnits String,
  reservation_UnitsPerReservation String,
  resourceTags_userName String, resourceTags_usercostcategory String
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ',' ESCAPED BY '\\'
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION 's3://<<your bucket name>>';
Once you’ve successfully executed the command, you should see a new table named “cost_and_usage” with the below properties. Now we’re ready to start executing queries and running analysis!
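One caveat with the plain delimited row format is worth seeing in miniature: it splits on every comma and does not interpret quotes, whereas a CSV-aware parser (which is the role the OpenCSV SerDe plays) keeps a quoted field together. The Python sketch below imitates both behaviors on a made-up line; the sample values are hypothetical and the snippet only mimics the two parsing styles, it does not call Athena.

```python
import csv
import io

# Hypothetical report line with a comma inside a quoted description field.
line = 'i-123,"Linux, on-demand",0.10\n'

# Delimiter-only parsing: every comma starts a new field.
naive_fields = line.strip().split(",")

# Quote-aware parsing: the quoted description stays in one field.
parsed_fields = next(csv.reader(io.StringIO(line)))

print(len(naive_fields), naive_fields)
print(len(parsed_fields), parsed_fields)
```

If your line item descriptions or tag values can contain commas, columns after the affected field will shift under the delimited format, which is a reasonable argument for trying the OpenCSV SerDe instead.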
Major cost saving levers
Now that we have our data pipeline configured, we can dive into the most popular use-cases for cost savings. In this blog, we’ll focus on:
- Purchasing of Reserved vs On-Demand instances
- Data Transfer costs
- Allocating costs over Users or other Attributes (denoted with resource tags)
On-demand, spot, and reserved instances
Purchasing Reserved Instances vs On-Demand instances is arguably going to be the biggest cost lever for heavy AWS users (Reserved Instances run up to 75% cheaper!). AWS offers three options for purchasing instances: On-Demand, Spot (variable cost), and Reserved. On-Demand instances allow you to simply pay as you use, Spot instances allow you to bid on spare Amazon EC2 computing capacity, and Reserved instances allow you to pay for an instance for a specific, allotted period of time. When purchasing a Reserved instance, you can also choose to pay all-upfront, partial-upfront, or monthly. The more you pay upfront, the greater the discount.
If your company has been using AWS for some time now, you should have a good sense of your overall instance usage on a per-month or per-day basis. Rather than paying for these instances On-Demand, you should try to forecast the number of instances you’ll need, and reserve them with upfront payments. The total amount of usage with reserved instances versus overall usage with all instances is called your coverage ratio. It’s important not to confuse your coverage ratio with your RI utilization. Utilization represents the share of reserved hours that were actually used. Don’t worry about exceeding capacity: you can still set up auto-scaling preferences so that more instances get added whenever your coverage or utilization crosses a certain threshold (we often see a target of 80% for both coverage and utilization among savvy customers).
Calculating the reserved costs and coverage can be a bit tricky with the level of granularity provided by the Cost and Usage Report. The below query shows your total cost over the last 6 months, broken out by reserved vs non-reserved instance usage. You can substitute the usage field for the cost field if you’d prefer to view it by usage. Please note that you’ll only have data for the time period since the Cost and Usage report was enabled, so this query will only show a few days if you’re just getting started.
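The difference between coverage and utilization is easiest to see with numbers. The following Python sketch computes both from three hour counts; the function names and the sample figures are made up for illustration.

```python
def coverage_ratio(reserved_hours_used, total_hours_used):
    """Share of all instance hours that ran on Reserved Instances."""
    if total_hours_used == 0:
        return 0.0
    return reserved_hours_used / total_hours_used


def ri_utilization(reserved_hours_used, reserved_hours_purchased):
    """Share of purchased reserved hours that were actually used."""
    if reserved_hours_purchased == 0:
        return 0.0
    return reserved_hours_used / reserved_hours_purchased


# Hypothetical month: 1,000 instance hours in total, 600 of them covered
# by reservations, out of 750 reserved hours that were paid for.
coverage = coverage_ratio(600, 1000)
utilization = ri_utilization(600, 750)
print(f"coverage={coverage:.0%} utilization={utilization:.0%}")
```

In this made-up month, coverage is below the 80% target mentioned above (more usage could be reserved) while utilization sits right at it (most of what was reserved is being used).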
-- Cost columns are declared as String in the table definition above, so they are cast to DOUBLE before summing.
SELECT
  DATE_FORMAT(from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate), '%Y-%m') AS "cost_and_usage.usage_start_month",
  COALESCE(SUM(TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE)), 0) AS "cost_and_usage.total_blended_cost",
  COALESCE(SUM(CASE WHEN cost_and_usage.lineitem_lineitemtype IN ('DiscountedUsage', 'RIFee', 'Fee') THEN TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE) ELSE NULL END), 0) AS "cost_and_usage.total_reserved_blended_cost",
  1.0 * COALESCE(SUM(CASE WHEN cost_and_usage.lineitem_lineitemtype IN ('DiscountedUsage', 'RIFee', 'Fee') THEN TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE) ELSE NULL END), 0)
    / NULLIF(COALESCE(SUM(TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE)), 0), 0) AS "cost_and_usage.percent_spend_on_ris",
  COALESCE(SUM(CASE WHEN cost_and_usage.lineitem_lineitemtype IN ('DiscountedUsage', 'RIFee', 'Fee') THEN NULL ELSE TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE) END), 0) AS "cost_and_usage.total_non_reserved_blended_cost",
  1.0 * COALESCE(SUM(CASE WHEN cost_and_usage.lineitem_lineitemtype IN ('DiscountedUsage', 'RIFee', 'Fee') THEN NULL ELSE TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE) END), 0)
    / NULLIF(COALESCE(SUM(TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE)), 0), 0) AS "cost_and_usage.percent_spend_on_non_ris"
FROM aws_optimizer.cost_and_usage AS cost_and_usage
WHERE from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate) >= DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE)))
  AND from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate) < DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))
GROUP BY 1
ORDER BY 2 DESC
LIMIT 500
The resulting table should look something like the image below (we’re surfacing tables through Looker, though the same table would result from querying via command line or any other interface).
It’s an iterative process to understand the appropriate number of Reserved instances to meet your business needs. Once you’ve properly integrated Reserved instances into your purchasing patterns, the savings can be significant. If your coverage is consistently below 70%, you should seriously consider adjusting your purchase types and opt for more Reserved instances.
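For readers who want to sanity-check the reserved versus non-reserved split outside of Athena, here is a small Python sketch of the same classification: line items of type `DiscountedUsage`, `RIFee`, or `Fee` count as reserved spend, and everything else counts as non-reserved. The function name and the sample rows are invented for illustration.

```python
from collections import defaultdict

# Line item types treated as Reserved Instance spend, as in the query above.
RI_LINE_ITEM_TYPES = {"DiscountedUsage", "RIFee", "Fee"}


def ri_spend_share_by_month(rows):
    """Return {month: share of blended cost on RI line items}.

    Each row is (usage_start_iso_timestamp, line_item_type, blended_cost).
    """
    totals = defaultdict(lambda: {"ri": 0.0, "total": 0.0})
    for usage_start, line_item_type, cost in rows:
        month = usage_start[:7]  # 'YYYY-MM' prefix of the ISO timestamp
        totals[month]["total"] += cost
        if line_item_type in RI_LINE_ITEM_TYPES:
            totals[month]["ri"] += cost
    return {
        month: (t["ri"] / t["total"] if t["total"] else None)
        for month, t in totals.items()
    }


# Hypothetical line items.
rows = [
    ("2017-08-01T00:00:00Z", "Usage", 60.0),
    ("2017-08-01T01:00:00Z", "DiscountedUsage", 30.0),
    ("2017-08-02T00:00:00Z", "RIFee", 10.0),
    ("2017-07-15T00:00:00Z", "Usage", 50.0),
]
shares = ri_spend_share_by_month(rows)
print(shares)
```

Note that this is a share of spend, not of hours, so it is a proxy for coverage rather than the coverage ratio itself.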
Data transfer costs
Storing data on AWS is comparatively cheap; a large share of the charges comes from moving and processing that data. Depending on the size, volume, and location of your data movement, you could end up paying a sizable portion of your monthly bill on transfer costs alone! There are several different prices for transferring data, broken out largely by transfers between regions and availability zones. Transfers between regions are the most costly (from $0.02-$0.12/GB), followed by transfers between Availability Zones ($0.01/GB). Transfers within the same region and same availability zone are free unless using elastic or public IP addresses, in which case there is a cost ($0.01/GB). You can find more detailed information in the AWS pricing documentation. With this in mind, there are several simple strategies for helping reduce costs here.
First, you should ensure that whenever two or more AWS services are exchanging data, those AWS resources are located in the same region. Transferring data between AWS regions has a cost of $0.02-$0.12 per GB depending on the region. The more you can localize the services to one specific region, the lower your costs will be.
Second, be careful that you’re routing data directly within AWS services and IPs, and minimize the number of transfers occurring out of AWS to the open internet. Sending data to external sources is, by far, the most costly and least performant mechanism of data transfer, costing anywhere from $0.09-$0.12 per GB. You should avoid these transfers as much as possible.
Lastly, data transferred between private IP addresses is cheaper than data transferred over elastic or public IPs. There’s no field in this report that denotes what type of IP a service uses, but it’s a good consideration when thinking through your architecture and launching new instances.
The below query provides a table depicting the total costs for each AWS product, broken out by transfer cost type. Substitute the “lineitem_productcode” field in the query to segment the costs by any other attribute. If you notice any unusually high spikes in cost, you’ll need to dig deeper to understand what’s driving that spike: location, volume, and so on. Drill down into specific costs by including “product_usagetype” and “product_transfertype” in your query to identify the types of transfer costs that are driving up your bill.
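To get a feel for how these rates add up, the sketch below turns the per-GB figures quoted in this post into a low and high cost estimate for a given volume. The rate table simply restates the numbers above (which may be out of date), and the function and category names are hypothetical.

```python
# Per-GB rate ranges (low, high) in USD, as quoted in this post.
RATES_PER_GB = {
    "inter_region": (0.02, 0.12),
    "inter_az": (0.01, 0.01),
    "internet_out": (0.09, 0.12),
    "same_az_private_ip": (0.0, 0.0),
    "same_az_public_ip": (0.01, 0.01),
}


def estimate_transfer_cost(gigabytes, transfer_type):
    """Return a (low, high) cost estimate in USD for moving `gigabytes` of data."""
    low_rate, high_rate = RATES_PER_GB[transfer_type]
    return gigabytes * low_rate, gigabytes * high_rate


# Hypothetical workload: 1,000 GB per month.
for transfer_type in RATES_PER_GB:
    low, high = estimate_transfer_cost(1000, transfer_type)
    print(f"{transfer_type}: ${low:.2f} to ${high:.2f}")
```

Even at this modest volume, the same terabyte costs nothing inside one availability zone over private IPs and a noticeable amount once it crosses regions or leaves AWS.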
-- Usage and cost columns are declared as String in the table definition above, so they are cast to DOUBLE before summing.
SELECT
  cost_and_usage.lineitem_productcode AS "cost_and_usage.product_code",
  COALESCE(SUM(TRY_CAST(cost_and_usage.lineitem_usageamount AS DOUBLE)), 0) AS "cost_and_usage.total_usage_amount",
  COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer') THEN TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE) ELSE NULL END), 0) AS "cost_and_usage.total_data_transfer_cost",
  COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-In') THEN TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE) ELSE NULL END), 0) AS "cost_and_usage.total_inbound_data_transfer_cost",
  COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer-Out') THEN TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE) ELSE NULL END), 0) AS "cost_and_usage.total_outbound_data_transfer_cost"
FROM aws_optimizer.cost_and_usage AS cost_and_usage
WHERE from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate) >= DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE)))
  AND from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate) < DATE_ADD('month', 6, DATE_ADD('month', -5, DATE_TRUNC('MONTH', CAST(NOW() AS DATE))))
GROUP BY 1
ORDER BY 3 DESC
LIMIT 500
When moving between regions or over the open web, many data transfer costs also include the origin and destination location of the data movement. Using a BI tool with mapping capabilities, we can get a nice visual of data flows (we’re using Looker in this example). The point at the center of the map is used to represent external data flows over the open internet.
Analysis by tags
AWS provides the option to apply custom tags to individual resources, so you can allocate costs over whatever customized segment makes the most sense for your business. For a SaaS company that hosts software for customers on AWS, maybe you’d want to tag the size of each customer. The below query uses custom tags to display the reserved, data transfer, and total cost for each AWS service, broken out by tag categories, over the last 30 days. You’ll want to substitute “resourcetags_customersegment” with the name of your own tag column (for example, “resourcetags_usercostcategory” in the table defined above).
-- Replace resourcetags_customersegment with the name of your own tag column.
-- Cost columns are declared as String in the table definition above, so they are cast to DOUBLE before summing.
SELECT * FROM (
  SELECT *,
    DENSE_RANK() OVER (ORDER BY z___min_rank) AS z___pivot_row_rank,
    RANK() OVER (PARTITION BY z__pivot_col_rank ORDER BY z___min_rank) AS z__pivot_col_ordering
  FROM (
    SELECT *,
      MIN(z___rank) OVER (PARTITION BY "cost_and_usage.product_code") AS z___min_rank
    FROM (
      SELECT *,
        RANK() OVER (ORDER BY
          CASE WHEN z__pivot_col_rank = 1 THEN (CASE WHEN "cost_and_usage.total_blended_cost" IS NOT NULL THEN 0 ELSE 1 END) ELSE 2 END,
          CASE WHEN z__pivot_col_rank = 1 THEN "cost_and_usage.total_blended_cost" ELSE NULL END DESC,
          "cost_and_usage.total_blended_cost" DESC,
          z__pivot_col_rank,
          "cost_and_usage.product_code") AS z___rank
      FROM (
        SELECT *,
          DENSE_RANK() OVER (ORDER BY CASE WHEN "cost_and_usage.customer_segment" IS NULL THEN 1 ELSE 0 END, "cost_and_usage.customer_segment") AS z__pivot_col_rank
        FROM (
          SELECT
            cost_and_usage.lineitem_productcode AS "cost_and_usage.product_code",
            cost_and_usage.resourcetags_customersegment AS "cost_and_usage.customer_segment",
            COALESCE(SUM(CASE WHEN cost_and_usage.lineitem_lineitemtype IN ('DiscountedUsage', 'RIFee', 'Fee') THEN TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE) ELSE NULL END), 0) AS "cost_and_usage.total_reserved_blended_cost",
            COALESCE(SUM(CASE WHEN REGEXP_LIKE(cost_and_usage.product_usagetype, 'DataTransfer') THEN TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE) ELSE NULL END), 0) AS "cost_and_usage.total_data_transfer_cost",
            COALESCE(SUM(TRY_CAST(cost_and_usage.lineitem_blendedcost AS DOUBLE)), 0) AS "cost_and_usage.total_blended_cost"
          FROM aws_optimizer.cost_and_usage AS cost_and_usage
          WHERE from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate) >= DATE_ADD('day', -29, CAST(NOW() AS DATE))
            AND from_iso8601_timestamp(cost_and_usage.lineitem_usagestartdate) < DATE_ADD('day', 30, DATE_ADD('day', -29, CAST(NOW() AS DATE)))
          GROUP BY 1, 2
        ) ww
      ) bb
      WHERE z__pivot_col_rank <= 16384
    ) aa
  ) xx
) zz
WHERE z___pivot_row_rank <= 500 OR z__pivot_col_ordering = 1
ORDER BY z___pivot_row_rank
The resulting table in this example looks like the results below. In this example, you can tell that we’re making poor use of Reserved Instances because they represent such a small portion of our overall costs.
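Stripped of the ranking columns, the pivot in the query above amounts to summing cost per product and per tag value. A minimal Python sketch of that idea follows; the function name, the `(untagged)` label, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def cost_by_product_and_tag(rows, untagged_label="(untagged)"):
    """Sum cost per product code and tag value.

    Each row is (product_code, tag_value, blended_cost). Empty or missing
    tag values are grouped under `untagged_label`.
    """
    pivot = defaultdict(lambda: defaultdict(float))
    for product_code, tag_value, cost in rows:
        pivot[product_code][tag_value or untagged_label] += cost
    return {product: dict(tags) for product, tags in pivot.items()}


# Hypothetical line items tagged by customer segment.
rows = [
    ("AmazonEC2", "enterprise", 120.0),
    ("AmazonEC2", "smb", 30.0),
    ("AmazonEC2", "", 10.0),
    ("AmazonS3", "enterprise", 5.0),
    ("AmazonEC2", "enterprise", 80.0),
]
pivot = cost_by_product_and_tag(rows)
print(pivot)
```

Keeping an explicit bucket for untagged resources is a deliberate choice here: a large untagged total is often the first sign that the tagging policy is not being applied consistently.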
Saving costs on your AWS spend will always be an iterative, ongoing process. Hopefully with these queries alone, you can start to understand your spending patterns and identify opportunities for savings. However, this is just a peek into the many opportunities available through analysis of the Cost and Usage report. Each company is different, with unique needs and usage patterns. To achieve maximum cost savings, I encourage you to set up an analytics environment that enables your team to explore all potential cuts and slices of your usage data, whenever it’s necessary. Exploring different trends and spikes across regions, services, user types, etc. will help you gain comprehensive understanding of your major cost levers and consistently implement new cost reduction strategies.
If you’re already a Looker customer, all of this analysis, additional pre-configured dashboards, and much more are available to you in Looker.
Want to learn more? I’ll be talking about these cost optimization strategies on Tuesday, August 29th, where we’ll be discussing different considerations for optimizing your AWS usage.