Snowflake’s cost model is a complex one. With many feature offerings, and as many ethereal compute resources as one could desire, it’s easy to spend, and easy to lose that spend in the ether. The question that keeps a product manager awake at night — staring at a budget burning faster than snow melting in spring — is, “What query is causing this cost spike?” For engineers, the query is the atomic unit of work, cost, and optimisation. It is what we can control, tune, or shut down.

Until recently, Snowflake made it difficult to directly connect query-level costs with overall monthly spending. While the snowflake.account_usage schema provided some clues, figuring out the cost per query still felt like a lonely detective job. But then, Snowflake introduced the QUERY_ATTRIBUTION_HISTORY view, which promised to change that. It even had the magic column we’d all been waiting for: credits_attributed_compute.
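On paper, getting a per-query cost number became a simple lookup. Here is a minimal sketch of what that looks like (the view and column names are from Snowflake’s documentation; the connection details, seven-day window, and limit are placeholders of my own):

import snowflake.connector

# Hypothetical connection details; replace with your own account and auth.
conn = snowflake.connector.connect(account="my_account", user="me", authenticator="externalbrowser")
cur = conn.cursor()

cur.execute("""
    SELECT query_id,
           warehouse_name,
           start_time,
           credits_attributed_compute
    FROM snowflake.account_usage.query_attribution_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY credits_attributed_compute DESC NULLS LAST
    LIMIT 20
""")
for query_id, warehouse, started_at, credits in cur.fetchall():
    print(query_id, warehouse, started_at, credits)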

At Alvin, where we handle complex cost breakdowns for our customers, we were excited — finally, a reliable number straight from Snowflake! We could shift our focus to delivering deeper insights rather than piecing together cost data. But alas, a glance at the documentation dashed our hopes. The column description reads:

“credits_attributed_compute | NUMBER | Number of credits attributed to this query. Includes only the credit usage for the query execution and doesn’t include any warehouse idle time.”

That strange omission, and why it’s the wrong choice from a user perspective, is the disappointing topic of this article.

What You Are About to Read

In this article, we focus on understanding query-level costs in Snowflake and the puzzling decisions Snowflake has made around cost attribution. Specifically, we will:

  • Outline the problem: Why query-level cost tracking is essential and where Snowflake’s approach falls short.
  • Examine Snowflake’s design choices: We’ll touch on the decisions that seem almost deliberately designed to obscure costs.
  • Explore a demonstrative example: Showcasing how specific queries — short, frequent, large in text, or large in data — can drive different costs.
  • Discuss potential solutions:
      • For the specific example
      • For the broader issue of query cost tracking and optimisation

We won’t dive into the following topics here, but you can explore them in more detail via the provided links:

  • The complex calculation of cost per query by attributing credits
  • The role of storage costs, Snowflake’s second major cost driver
  • Query tagging and Data Warehouse (DWH) metadata
  • A catalog of forms that cost-driving queries can take
  • Snowflake’s new, somewhat obtuse cost models — like serverless compute

The Problem

To understand the issue with Snowflake’s new design choice, we first need to align on the goal of query-level cost attribution. If the goal were simply to assign the exact cost to a query, Snowflake’s provided model might be optimal — perhaps even using clever internal mechanisms to allocate costs with fine precision. However, I argue that “correct” cost assignment is merely an implementation detail. What we truly want is to understand cost changes in our Snowflake environment and, most importantly, identify actionable items to reduce costs. In essence, it should be possible to say: “Fixing these lines of code (that we own) will have the biggest impact on reducing cost.”

This second point is crucial in any form of data analysis. Knowing that overall costs are increasing across the organization isn’t helpful unless we can pinpoint actionable steps without needing to start a whole new deep dive.

So, what’s the big issue with Snowflake’s model? Skipping over “idle time” for warehouses isn’t inherently wrong — we can still view those costs in idle time or warehouse reports. However, in my experience, these reports either don’t exist, or if they do, they are poorly maintained and not as closely monitored. The issues with attribution without accounting for idle time are:

  1. Costs get split into categories that aren’t analyzed together.
  2. Idle time is disconnected from the queries or code that caused it.
  3. A unified alternative exists: a cohesive view that includes idle time is possible without any loss of fidelity.

Understanding Idle Time in Snowflake

In this section, we will delve into the technical details surrounding idle time in Snowflake and how it impacts cost attribution. This exploration will help clarify the second point regarding the disconnect between idle time and the queries that trigger it.

To understand this, let’s review the typical events that lead to warehouse idle time:

  1. The warehouse is currently inactive.
  2. The user selects a warehouse with the command: USE WAREHOUSE X. (This does not yet start any warehouse.)
  3. The user runs a query that requires compute resources.
  4. Snowflake’s AUTO RESUME triggers, causing the warehouse to start automatically in response to the query without requiring any additional steps.
  5. The query runs, and results are returned.
  6. The warehouse enters an inactive state again.
  7. If no further queries are executed, AUTO SUSPEND triggers, shutting the warehouse down after a predefined period.
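Steps 4 and 7 are governed by two warehouse parameters, AUTO_RESUME and AUTO_SUSPEND. Here is a minimal sketch of setting and inspecting them (the warehouse name and values are illustrative, not a recommendation):

import snowflake.connector

# Hypothetical connection details; replace with your own account and auth.
conn = snowflake.connector.connect(account="my_account", user="me", authenticator="externalbrowser")
cur = conn.cursor()

# AUTO_RESUME controls step 4 (boot on the first query); AUTO_SUSPEND controls
# step 7 (seconds of inactivity before shutdown).
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS probe_wh
      WAREHOUSE_SIZE = 'XSMALL'
      AUTO_RESUME = TRUE
      AUTO_SUSPEND = 60
      INITIALLY_SUSPENDED = TRUE
""")

# Inspect the current settings and state (STARTED / SUSPENDED).
cur.execute("SHOW WAREHOUSES LIKE 'probe_wh'")
print(cur.fetchall())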

Considering this sequence, together with some other information from Snowflake, several points begin to reveal Snowflake’s underlying logic:

  1. Idle time typically occurs only after a query has completed.
  2. If multiple queries are executed, the likelihood of significant idle time, when distributed across all queries, is low.
  3. Snowflake sets the default Auto Suspend time to five minutes. This is an unusual duration, and as a Snowflake user, you have likely adjusted it to one minute. This adjustment is often touted as one of the simplest ways to reduce costs, alongside using TRANSIENT tables and disabling Time Travel (which is enabled by default).
  4. Snowflake does not guarantee a shutdown in intervals of less than 30 seconds.
  5. Snowflake charges a minimum of 60 seconds for warehouse usage, regardless of how long the query actually runs.
  6. Snowflake provides the snowflake.account_usage.WAREHOUSE_EVENTS_HISTORY view, which details changes in the warehouse state, including the query that caused the warehouse to boot.

A cynical interpretation of these facts suggests that Snowflake has intentionally set a high default Auto Suspend time, established a minimum billing increment of 60 seconds, and kept idle time out of query-level attribution — all to encourage users to spend more without clearly tracing where that spending originates.
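Point 6 deserves a closer look, because WAREHOUSE_EVENTS_HISTORY records the query that triggered each resume. Below is a minimal sketch of pulling recent warehouse state changes (the view is documented by Snowflake; the exact event names and the seven-day window are assumptions worth verifying in your own account):

import snowflake.connector

# Hypothetical connection details; replace with your own account and auth.
conn = snowflake.connector.connect(account="my_account", user="me", authenticator="externalbrowser")
cur = conn.cursor()

# Recent warehouse state changes, with the query_id that triggered each resume.
cur.execute("""
    SELECT timestamp,
           warehouse_name,
           event_name,       -- e.g. RESUME_WAREHOUSE / SUSPEND_WAREHOUSE
           event_reason,
           query_id          -- populated when a query caused the event
    FROM snowflake.account_usage.warehouse_events_history
    WHERE timestamp >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    ORDER BY warehouse_name, timestamp
""")
for row in cur.fetchall():
    print(row)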

A Worked Example: The Cost of Probing Queries

To illustrate the potential pitfalls of Snowflake’s cost attribution, let’s consider a practical example involving a function that performs an expensive upsert operation. Below is a sample implementation:

import snowflake.connector

def do_expensive_upsert(source_table: str, target_table: str):
    with snowflake.connector.connect() as conn:  # credentials from your connection config
        cursor = conn.cursor()
        cursor.execute("USE WAREHOUSE XXL")
        # Probe query: has the source moved past the target since the last run?
        cursor.execute(f"""
          SELECT
            (SELECT MAX(updated_at) FROM {source_table}) AS source_max,
            (SELECT MAX(updated_at) FROM {target_table}) AS target_max,
            source_max > target_max AS needs_upsert
        """)
        needs_upsert = cursor.fetchone()[-1]
        if not needs_upsert:
            # Skip our expensive work
            return
        _do_expensive_upsert(conn, source_table, target_table)

This pattern of checking for work before executing an expensive operation is common in various scenarios, including data loading, schema migrations, and table rebuilds. However, when implemented in a system like Airflow, it can lead to unexpected costs. Consider this unfortunate sequence of events when managing multiple upserts:

  1. Airflow task 1 of X starts: The first task begins its execution.
  2. Warehouse boots up: The task triggers the warehouse to start.
  3. Early return: The task checks if an upsert is necessary and returns quickly if not.
  4. Warehouse idles: The warehouse remains idle for a minute.
  5. Next Airflow task starts: The next task takes a full minute to begin (a common scenario in a busy Airflow cluster).
  6. Warehouse boots up again: The warehouse is triggered to start once more.
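To put rough numbers on that sequence, suppose the probe runs on a 2XL-sized warehouse (32 credits per hour at list rates) and finishes in two seconds. The arithmetic below is my own illustration, not something Snowflake reports:

CREDITS_PER_HOUR_2XL = 32            # list rate for a 2XL warehouse
credits_per_second = CREDITS_PER_HOUR_2XL / 3600

probe_runtime_s = 2                  # the probe itself
auto_suspend_s = 60                  # idle time before AUTO_SUSPEND kicks in
billed_minimum_s = 60                # Snowflake bills at least 60 seconds per resume

# What QUERY_ATTRIBUTION_HISTORY attributes to the probe (execution only):
attributed = probe_runtime_s * credits_per_second                    # ~0.018 credits

# What the warehouse actually burns for one probe-and-idle cycle:
seconds_running = max(probe_runtime_s + auto_suspend_s, billed_minimum_s)
incurred = seconds_running * credits_per_second                      # ~0.55 credits

print(f"attributed {attributed:.3f} vs incurred ~{incurred:.2f} credits "
      f"({incurred / attributed:.0f}x)")

With the five-minute default Auto Suspend, the same cycle burns roughly 2.7 credits, while the probe still shows up in the attribution view as 0.018.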

I have witnessed this pattern repeatedly across various clients, and it’s reasonable to assume that Snowflake has too. This approach not only drives up costs but does so in several obscured ways:

  1. Fast and cheap individual queries: Each individual query runs quickly and appears cost-effective, making them less likely targets for optimisation.
  2. Probing queries proliferate: Because these probing queries are perceived as necessary checks, they can be run multiple times, spreading the cost across different queries and times of day.
  3. Difficulty in grouping queries: Each query generates a unique query_hash, complicating efforts to group and analyze costs effectively.
  4. Small warehouse for probing queries: Utilizing a small warehouse for these probe queries might mitigate some costs, but it’s not a solution that readily comes to mind for most users.
  5. Airflow gaps go unnoticed: The gaps in execution times within Airflow are rarely scrutinized unless a team has a pressing real-time requirement, meaning they often remain unoptimized.

  6. Disconnection from idle time: Even if you leverage the new snowflake.account_usage.QUERY_ATTRIBUTION_HISTORY view, the probing query — which significantly contributes to costs — will be disconnected from the idle time it creates.
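You can see the size of that disconnect in your own account by comparing attributed credits against what each warehouse actually metered. A rough sketch (both views are documented by Snowflake; the seven-day window and warehouse-level grain are my own simplification):

import snowflake.connector

# Hypothetical connection details; replace with your own account and auth.
conn = snowflake.connector.connect(account="my_account", user="me", authenticator="externalbrowser")
cur = conn.cursor()

# Per warehouse: credits attributed to queries vs. credits actually metered.
# The gap is largely idle time and minimum-billing overhead.
cur.execute("""
    WITH attributed AS (
        SELECT warehouse_name, SUM(credits_attributed_compute) AS credits_attributed
        FROM snowflake.account_usage.query_attribution_history
        WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
        GROUP BY warehouse_name
    ), metered AS (
        SELECT warehouse_name, SUM(credits_used_compute) AS credits_metered
        FROM snowflake.account_usage.warehouse_metering_history
        WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
        GROUP BY warehouse_name
    )
    SELECT m.warehouse_name,
           m.credits_metered,
           COALESCE(a.credits_attributed, 0) AS credits_attributed,
           m.credits_metered - COALESCE(a.credits_attributed, 0) AS unattributed_gap
    FROM metered m
    LEFT JOIN attributed a USING (warehouse_name)
    ORDER BY unattributed_gap DESC
""")
for row in cur.fetchall():
    print(row)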

Solutions

To address the cost attribution challenges we’ve discussed, the primary solution is to include idle time — that is, all costs — when distributing or attributing expenses to queries. While one might debate the “perfect” method for this distribution, any form of attribution that incorporates idle time aligns more closely with the spirit of cost analysis than leaving it out entirely. When costs are properly attributed, any spikes will at least be reflected somewhere, making it easier to identify and address them.

Beyond attributing all costs, there are additional strategies we can implement, as well as considerations to keep in mind:

Establish a Metric for Credits Per Second: Tracking credits per second can provide valuable insight into your cost structure. Without idle time in the attribution, this metric simply sits at the warehouse’s nominal rate; once all costs are attributed, a query’s credits per second can easily exceed the rate of the warehouse it ran on. If this occurs, it is a strong indicator that something is amiss in your query management or resource allocation.
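One way to approximate this today, without Snowflake attributing idle time for you, is to divide each warehouse’s metered credits by the seconds of query work it actually served. A hedged sketch (the views are Snowflake’s; treating TOTAL_ELAPSED_TIME as “seconds of work” ignores concurrency and is a simplification of mine):

import snowflake.connector

# Hypothetical connection details; replace with your own account and auth.
conn = snowflake.connector.connect(account="my_account", user="me", authenticator="externalbrowser")
cur = conn.cursor()

# Effective credits per second of query work, per warehouse. A value well above
# the warehouse's nominal rate means idle time or minimum billing is inflating spend.
cur.execute("""
    WITH work AS (
        SELECT warehouse_name,
               SUM(total_elapsed_time) / 1000 AS query_seconds
        FROM snowflake.account_usage.query_history
        WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
          AND warehouse_name IS NOT NULL
        GROUP BY warehouse_name
    ), metered AS (
        SELECT warehouse_name,
               SUM(credits_used_compute) AS credits
        FROM snowflake.account_usage.warehouse_metering_history
        WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
        GROUP BY warehouse_name
    )
    SELECT m.warehouse_name,
           m.credits / NULLIF(w.query_seconds, 0) AS effective_credits_per_second
    FROM metered m
    JOIN work w USING (warehouse_name)
    ORDER BY effective_credits_per_second DESC
""")
for row in cur.fetchall():
    print(row)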

Implement Query Tagging for Better Grouping: Tagging queries enhances the ability to group and analyze costs effectively. Since many cost drivers run multiple times a day, they might not surface in isolated query analyses. Effective grouping helps us pinpoint issues, even if they are new or unfamiliar. Consider the following tagging strategies (a sketch of applying them follows the list):

  • Model: This tag is typically applied by tools like dbt to encapsulate the cluster of queries that collectively contribute to building a business concept (e.g., a specific table).
  • Technical Purpose: In contrast to the model tag, this one describes what the query aims to achieve. For instance, the same probing query might be utilised across various models or contexts.
  • Git/Source Information: This is particularly important in this context, as it helps identify the file path of the query in your version control system. This visibility facilitates collaboration and troubleshooting.
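Snowflake exposes this through the QUERY_TAG session parameter, which then shows up alongside each query in the account_usage views. A minimal sketch of tagging along the three dimensions above (the JSON keys and example values are my own convention, not a standard):

import json
import snowflake.connector

# Hypothetical connection details; replace with your own account and auth.
conn = snowflake.connector.connect(account="my_account", user="me", authenticator="externalbrowser")
cur = conn.cursor()

# Every statement in this session is now tagged; the tag is queryable later
# via query_history.query_tag.
tag = {
    "model": "orders_daily",                     # business concept being built
    "purpose": "probe_needs_upsert",             # technical purpose of the query
    "git_path": "dags/upserts/orders_daily.py",  # where the code lives in version control
}
cur.execute("ALTER SESSION SET QUERY_TAG = %s", (json.dumps(tag),))

cur.execute("SELECT MAX(updated_at) FROM analytics.orders")  # runs with the tag attached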

Monitor Warehouse Settings: Regularly review and adjust your warehouse settings, such as the Auto Suspend time. Setting this value appropriately can minimize idle time and, consequently, costs. While a one-minute setting is often recommended, assess the specific usage patterns of your organization to find the best fit.
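For the Auto Suspend piece specifically, this is a one-line change per warehouse. A minimal sketch (the warehouse name and the one-minute value are illustrative, not a blanket recommendation):

import snowflake.connector

# Hypothetical connection details; replace with your own account and auth.
conn = snowflake.connector.connect(account="my_account", user="me", authenticator="externalbrowser")
cur = conn.cursor()

# AUTO_SUSPEND is specified in seconds; weigh a low value against the cost of
# losing warm caches and triggering more frequent resumes.
cur.execute("ALTER WAREHOUSE transform_wh SET AUTO_SUSPEND = 60")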

Optimize Query Performance: Continuously review and optimise the performance of queries. Investigate which queries consume the most resources and explore optimisation strategies, such as improving partition pruning with clustering keys, adjusting join operations, or even rewriting queries to reduce complexity.
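A practical first pass is to group QUERY_HISTORY by the parameterized query hash, so repeated runs of the same statement add up rather than hiding as many individually cheap rows. A rough sketch (elapsed time is used here as a proxy for resource consumption, which is an assumption, not a cost figure):

import snowflake.connector

# Hypothetical connection details; replace with your own account and auth.
conn = snowflake.connector.connect(account="my_account", user="me", authenticator="externalbrowser")
cur = conn.cursor()

# Heaviest recurring statements over the last week.
cur.execute("""
    SELECT query_parameterized_hash,
           ANY_VALUE(query_text) AS example_text,
           COUNT(*) AS runs,
           SUM(total_elapsed_time) / 1000 AS total_seconds
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
    GROUP BY query_parameterized_hash
    ORDER BY total_seconds DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)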

Engage in Cost Reviews: Schedule regular cost reviews as part of your team’s workflow. This can help ensure that everyone is aware of spending patterns and can collaborate on identifying potential areas for optimisation.

Try Alvin: alvin.ai is a multi-cloud data observability and cost optimisation platform that attributes query cost in the way I’ve described in this article, and provides actionable insights to drive down costs. You can try it out for free to see how it compares to Snowflake’s version in your environment.

By adopting these solutions, you can create a more accurate and actionable understanding of costs within your Snowflake environment. These strategies will not only help reduce costs but also improve the efficiency and effectiveness of your data operations.