Cost anomalies
Overview
DoiT Cloud cost anomaly detection offers end-to-end monitoring of spikes in your Google Cloud, Amazon Web Services, and Microsoft Azure costs across your projects and services.
The detection service leverages time series modeling to monitor billing data and analyze the trend of spending in your cloud environment. It identifies billing patterns across DoiT customers, forecasts your cloud spending, and is continuously refined to provide even more accurate results.
Billing records that don't align with your anticipated spending behavior are identified as potential anomalies. The DoiT console also provides insights into which resources are causing the anomalies and helps you take corrective actions if necessary.
- FinOps Foundation: Managing Cloud Cost Anomalies
Before you begin
The data analysis begins as soon as you sign up. However, for anomaly detection to work properly, a sufficient amount of historical data is required. We set a minimum threshold of 14 days of data so that the machine learning model can establish a reliable pattern for the normal range.
If anomaly detection is critical to your operation, we recommend waiting out this 14-day period before making significant changes to your cloud spending. No spend is classified as an anomaly during the first 14 days.
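As an illustration of this warm-up rule, here is a minimal Python sketch (names and structure are ours, not part of the DoiT console) that checks whether an account has accumulated the 14 days of billing history the model needs:

```python
from datetime import date, timedelta

MIN_HISTORY_DAYS = 14  # minimum billing history before anomaly detection is reliable

def has_enough_history(first_billing_date: date, today: date | None = None) -> bool:
    """Return True once at least 14 days of billing data have accumulated."""
    today = today or date.today()
    return today - first_billing_date >= timedelta(days=MIN_HISTORY_DAYS)

# An account whose first billing record is 10 days old is still in the warm-up period.
print(has_enough_history(date.today() - timedelta(days=10)))  # False
```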
Access anomalies
Required permissions
- Attributions Manager, Anomalies Viewer, Cloud Analytics
View the anomaly list
To view the list of detected cost anomalies, select Governance from the top navigation menu, and then select Cost anomalies.
The DoiT Platform stores all the historical cost anomalies. You can change the Time range or use anomaly properties to filter the results.
Anomaly properties
Each anomaly entry on the Cost anomalies page provides the following information:
- Start Time: The start time of the hourly usage window in which the aggregated cost exceeds the predefined threshold and is considered a potential anomaly. The time value comes from the billing data reported by the cloud providers: for AWS it is lineItem/UsageStartDate (UTC); for Google Cloud it is usage_start_time (PT); for Azure it is the property usageStart (UTC). See also Time zone.
- Status: (valid for anomalies detected after December 11, 2023) Shows the status of an anomaly and whether it has been acknowledged. See Dynamic updates and Acknowledge a cost anomaly for more information.
- Project/Account: See Hierarchy groups: Project/Account ID. This field shows All if the anomaly was detected at the service level instead of at the SKU level.
- Service: See Resource metadata: Service.
- SKU: The Stock Keeping Unit of a service. See Resource metadata: SKU. This field shows All if the anomaly was detected at the service level instead of at the SKU level.
- Severity: The severity level of the anomaly. There are three severity levels: Information, Warning, and Critical. They're defined by DoiT in accordance with the extent to which the actual cost deviates from the established pattern.
- Cost of anomaly: The difference between the actual cost and the maximum cost in the normal range.
- Anomaly: A thumbnail image of the anomaly chart.
- Details: Select the View button in this column to view the details of a specific anomaly.
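For reference, the fields above can be thought of as one record per anomaly. The following dataclass is a hypothetical illustration of that shape (field names are ours, not the DoiT API):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CostAnomaly:
    """Hypothetical record mirroring the columns on the Cost anomalies page."""
    start_time: datetime        # start of the hourly usage window (provider time zone)
    status: str                 # e.g. "Active" or "Inactive", plus acknowledgement state
    project_or_account: str     # "All" for service-level anomalies
    service: str
    sku: str                    # "All" for service-level anomalies
    severity: str               # "Information", "Warning", or "Critical"
    cost_of_anomaly: float      # actual cost minus the upper bound of the normal range
    chart_thumbnail_url: str    # thumbnail image of the anomaly chart
```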
How it works
The anomaly detection system leverages a time-series model to continuously monitor and evaluate cost and usage data at both the SKU level and the service level. It is designed to detect significant cost spikes by analyzing historical patterns and comparing them against current usage trends. When an anomalous cost spike is detected, an anomaly is triggered.
To speed up your time to resolution, the anomaly detection system works primarily at the SKU level, which means most detected cost anomalies are SKU-level anomalies.
Service-level monitoring serves mainly as a supplement to catch early spikes caused by newly created projects or new SKUs. When a new project is created or a new SKU starts incurring costs, a new time series is identified and starts collecting cost data. However, the new time series will not produce SKU-level anomaly candidates in the first few days because it lacks sufficient historical data points. While the "normal" spend for the new time series is yet to be established, the newly incurred costs may already cause an anomalous spike at the service level.
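The interplay between the two levels can be summarized with a small sketch. Assuming a 14-day warm-up for a new SKU series (an assumption for illustration; the exact warm-up is internal to the detection system), the logic looks roughly like this:

```python
MIN_HISTORY_DAYS = 14  # assumed warm-up before a new SKU series can be modeled

def levels_checked(sku_history_days: int) -> list[str]:
    """Illustrative only: a new SKU always contributes to the service-level check,
    but gets its own SKU-level check once enough history has accumulated."""
    levels = ["service"]
    if sku_history_days >= MIN_HISTORY_DAYS:
        levels.append("sku")
    return levels

print(levels_checked(3))   # ['service']        -> only a service-level spike can surface
print(levels_checked(30))  # ['service', 'sku'] -> SKU-level detection is active
```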
Evaluation scope
Data samples evaluated by the system are partitioned as follows:
- per billing account
- per project/account
- per service
- per SKU
- per attribution (if applicable)
For SKU-level anomalies, the anomaly detection system evaluates anomalies per SKU, per service, and per project/account across regions; for service-level anomalies, the system evaluates anomalies per service, across projects and SKUs. The anomaly detection system doesn't evaluate the combined costs of multiple services.
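As an illustration of this partitioning, the following pandas sketch (column names are assumptions, not the actual billing schema) builds the two kinds of series the detector evaluates; attribution-based partitions are omitted for brevity:

```python
import pandas as pd

def partition_for_evaluation(billing: pd.DataFrame) -> dict[str, pd.Series]:
    """Illustrative partitioning of billing rows into the series the detector evaluates.
    Assumed columns: billing_account, project, service, sku, region, day, cost."""
    # SKU-level series: per billing account, project/account, service, and SKU, across regions.
    sku_series = billing.groupby(
        ["billing_account", "project", "service", "sku", "day"]
    )["cost"].sum()
    # Service-level series: per billing account and service, across projects and SKUs.
    service_series = billing.groupby(
        ["billing_account", "service", "day"]
    )["cost"].sum()
    return {"sku_level": sku_series, "service_level": service_series}
```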
Criteria
To be classified as an anomaly, the spend of a SKU or service must meet all the following criteria:
- The daily spend reaches the minimum threshold:
  - SKU-level anomalies: US$50
  - Service-level anomalies: US$250
- The daily spend exceeds monthly seasonality.
- The daily spend exceeds the upper bound of the system's normal range (or acceptable range).
The anomaly detection system uses a model fitted on data from the preceding period to forecast expected spend. The normal range is determined by a DoiT-specific confidence interval, that is, the range within which a given percentage of expected values should fall; a 90% confidence interval, for example, covers 90% of the expected values.
The normal range is depicted as a shaded area on cost anomaly charts.
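Putting the three criteria together, here is a minimal sketch. The real system fits a time-series model and uses a DoiT-specific confidence interval; this sketch approximates the normal range with a simple quantile band and monthly seasonality with a crude day-of-month lookback, purely for illustration:

```python
import numpy as np

SKU_MIN_DAILY_SPEND = 50.0       # US$, from the criteria above
SERVICE_MIN_DAILY_SPEND = 250.0  # US$, from the criteria above

def is_anomalous(daily_spend: float, history: np.ndarray, level: str = "sku") -> bool:
    """Illustrative check of the three criteria against `history`, the daily
    spend of the preceding period (oldest first)."""
    threshold = SKU_MIN_DAILY_SPEND if level == "sku" else SERVICE_MIN_DAILY_SPEND
    if daily_spend < threshold:                 # 1. minimum daily spend
        return False
    same_day_prior_months = history[-30::-30]   # crude monthly-seasonality proxy
    if same_day_prior_months.size and daily_spend <= same_day_prior_months.max():
        return False                            # 2. must exceed monthly seasonality
    upper_bound = np.quantile(history, 0.95)    # 3. must exceed the normal range's upper bound
    return daily_spend > upper_bound
```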
Detection latency
In most cases, an anomaly is reported within 12 hours of the aggregated cost exceeding the predefined threshold.
The anomaly detection engine checks usage and cost data at regular intervals. For SKU-level anomalies, the check runs hourly; for service-level anomalies, it runs every six hours. The detection latency mainly depends on the varying intervals at which cloud providers report usage and cost data.
See also AWS cost data latency in DoiT console and Google Cloud's frequency of data loads.
Dynamic updates
An ongoing anomaly is regarded as an Active anomaly. The detection system keeps monitoring active anomalies, constantly updating them with the latest available cost data.
An anomaly becomes Inactive when either of the following conditions is met:
- The cost falls back into the new normal range.
- The anomaly has reached the maximum active period of 7 days.
You can find more information about when an anomaly becomes Active and Inactive on cost anomaly charts.
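A minimal sketch of these transition rules, assuming the current upper bound of the normal range is available for the series (function and parameter names are ours):

```python
from datetime import datetime, timedelta

MAX_ACTIVE_PERIOD = timedelta(days=7)  # maximum time an anomaly stays active

def update_status(started_at: datetime, now: datetime,
                  latest_cost: float, normal_upper_bound: float) -> str:
    """Illustrative only: an anomaly stays Active until the cost returns to the
    (possibly updated) normal range or the 7-day maximum active period elapses."""
    if latest_cost <= normal_upper_bound:
        return "Inactive"  # cost fell back into the new normal range
    if now - started_at >= MAX_ACTIVE_PERIOD:
        return "Inactive"  # reached the maximum active period
    return "Active"
```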
Cost anomaly alerts
When an anomaly is detected, we check the billing account, service, and project ID to make sure an alert is triggered only if no other alert has already been sent for the same context.
This means that if a SKU-level alert has already been sent for a service, a service-level alert will not be triggered for that service, and vice versa.
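Conceptually, the deduplication behaves like the following sketch (an assumption about the mechanics, not DoiT's actual implementation):

```python
_sent_alerts: set[tuple[str, str, str]] = set()  # (billing_account, service, project)

def should_alert(billing_account: str, service: str, project: str) -> bool:
    """Illustrative only: at most one alert per billing account, service, and
    project, so a SKU-level alert suppresses a later service-level alert for
    the same context, and vice versa."""
    key = (billing_account, service, project)
    if key in _sent_alerts:
        return False
    _sent_alerts.add(key)
    return True
```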
Spikes at the beginning of the month
Some services apply bulk billing on the first day of the month, resulting in a disproportionate cost spike compared to the rest of the month. However, if the amount is in line with previous months, it is not considered an anomaly.
The anomaly detection model takes this into account. When evaluating costs at the beginning of the month, it conducts a month-to-month comparison in addition to daily cost modeling. The services assessed this way are listed below.
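As an illustration of the month-to-month comparison, the sketch below compares the first-day charge with the same charge in previous months instead of with ordinary daily costs; the tolerance factor is an assumption for illustration:

```python
import numpy as np

def first_of_month_spike_is_anomalous(first_day_cost: float,
                                      prior_first_day_costs: list[float],
                                      tolerance: float = 1.2) -> bool:
    """Illustrative only: a recurring bulk charge on the first of the month is
    not flagged if it is in line with the first-day charges of previous months."""
    if not prior_first_day_costs:
        return False  # nothing to compare against yet
    typical = float(np.median(prior_first_day_costs))
    return first_day_cost > typical * tolerance
```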
Services with cost spikes at the beginning of the month
- Amazon Cognito
- Amazon ElastiCache
- Amazon OpenSearch Service
- Amazon Redshift
- Amazon Relational Database Service
- Amazon Route 53
- Amazon WorkSpaces
- AWS Certificate Manager
- AWS Data Transfer
- AWS Elemental MediaConvert
- AWS Identity and Access Management Access Analyzer
- AWS Outposts
- AWS Shield
- Bright Data Enterprise non metered
- Cloud Navigator
- Compute Engine (selected SKUs)
- Contact Center Telecommunications (service sold by AMCS, LLC)
- Coralogix
- Datadog
- Datadog Pro
- Directions API
- Drata Security & Compliance Automation Platform
- Fastly for GCP Marketplace
- Geocoding API
- Geolocation API
- Grafana Cloud observability: Grafana, Prometheus metrics, logs, traces
- HYCU R-Cloud™ Platform
- Identity Platform
- JFrog DevOps Platform - Enterprise X
- JFrog Software Supply Chain Platform
- Looker Studio
- Maps API
- Office LTSC Professional Plus 2021
- Places API
- Plerion Cloud Security Platform (Contract)
- Snyk: Developer Security Platform
- Twilio Segment
- Vantage Cloud Cost Platform - Enterprise
- WIZ Cloud Infrastructure Security Platform
Interactive demo
Try out our interactive demo for a hands-on walk-through experience.
If the demo doesn't display properly, try expanding your browser window or opening the demo in a new tab.