Cost anomalies
Overview
DoiT Cloud cost anomaly detection offers end-to-end monitoring of spikes in your Google Cloud, Amazon Web Services, and Microsoft Azure costs across your projects and services.
The detection service leverages time series modeling to monitor billing data and analyze the trend of spending in your cloud environment. It identifies billing patterns across DoiT customers, forecasts your cloud spending, and is continuously refined to provide even more accurate results.
Billing records that don't align with your anticipated spending behavior are identified as potential anomalies. The DoiT console also provides insights into which resources are causing the anomalies and helps you take corrective actions if necessary.
- FinOps Foundation: Managing Cloud Cost Anomalies
Before you begin
The data analysis begins as soon as you sign up. However, for anomaly detection to work properly, a sufficient amount of historical data is required. We set a minimum threshold of 14 days of data so that the machine learning model can establish a reliable pattern for the normal range.
If anomaly detection is critical to your operation, we recommend waiting out this 14-day period before making significant changes to your cloud spending. No spend is classified as an anomaly during the first 14 days.
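As an illustration of this warm-up rule, here is a minimal Python sketch (names and structure are ours, not part of the DoiT console) that checks whether an account has accumulated the 14 days of billing history the model needs:

```python
from datetime import date, timedelta

MIN_HISTORY_DAYS = 14  # minimum billing history before anomaly detection is reliable

def has_enough_history(first_billing_date: date, today: date | None = None) -> bool:
    """Return True once at least 14 days of billing data have accumulated."""
    today = today or date.today()
    return today - first_billing_date >= timedelta(days=MIN_HISTORY_DAYS)

# An account whose first billing record is 10 days old is still in the warm-up period.
print(has_enough_history(date.today() - timedelta(days=10)))  # False
```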
Access anomalies
Required permissions
- Attributions Manager, Anomalies Viewer, Cloud Analytics
View the anomaly list
To view the list of detected cost anomalies, select Governance from the top navigation menu, and then select Cost anomalies.
The DoiT Platform stores all the historical cost anomalies. You can change the Time range or use anomaly properties to filter the results.
Anomaly properties
Each anomaly entry on the Cost anomalies page provides the following information:
- Start Time: The start time of the hourly usage window in which the aggregated cost exceeds the predefined threshold and is considered a potential anomaly. The time value comes from the billing data reported by the cloud providers: for AWS it is lineItem/UsageStartDate (UTC); for Google Cloud it is usage_start_time (PT); for Azure it is the property usageStart (UTC). See also Time zone.
- Status: (valid for anomalies detected after December 11, 2023) Shows the status of an anomaly and whether it has been acknowledged. See Dynamic updates and Acknowledge a cost anomaly for more information.
- Project/Account: See Hierarchy groups: Project/Account ID. This field shows All if the anomaly was detected at the service level instead of at the SKU level.
- Service: See Resource metadata: Service.
- SKU: The Stock Keeping Unit of a service. See Resource metadata: SKU. This field shows All if the anomaly was detected at the service level instead of at the SKU level.
- Severity: The severity level of the anomaly. There are three severity levels: Information, Warning, and Critical. They're defined by DoiT in accordance with the extent to which the actual cost deviates from the established pattern.
- Cost of anomaly: The difference between the actual cost and the maximum cost in the normal range.
- Anomaly: A thumbnail image of the anomaly chart.
- Details: Select the View button in this column to view the details of a specific anomaly.
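For reference, the fields above can be thought of as one record per anomaly. The following dataclass is a hypothetical illustration of that shape (field names are ours, not the DoiT API):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CostAnomaly:
    """Hypothetical record mirroring the columns on the Cost anomalies page."""
    start_time: datetime        # start of the hourly usage window (provider time zone)
    status: str                 # e.g. "Active" or "Inactive", plus acknowledgement state
    project_or_account: str     # "All" for service-level anomalies
    service: str
    sku: str                    # "All" for service-level anomalies
    severity: str               # "Information", "Warning", or "Critical"
    cost_of_anomaly: float      # actual cost minus the upper bound of the normal range
    chart_thumbnail_url: str    # thumbnail image of the anomaly chart
```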
How it works
The anomaly detection system leverages a time-series model to continuously monitor and evaluate cost and usage data at both the SKU level and the service level. It is designed to detect significant cost spikes by analyzing historical patterns and comparing them against current usage trends. When an anomalous cost spike is detected, an anomaly is triggered.
To speed up your time to resolution, the anomaly detection system works primarily at the SKU level, which means most detected cost anomalies are SKU-level anomalies.
Service-level monitoring serves mainly as a supplement to catch early spikes caused by newly created projects or new SKUs. When a new project is created or a new SKU starts incurring costs, a new time series is identified and starts collecting cost data. However, the new time series will not produce SKU-level anomaly candidates in the first few days because it lacks sufficient historical data points. While the "normal" spend for the new time series is yet to be established, the newly incurred costs may already cause an anomalous spike at the service level.
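The interplay between the two levels can be summarized with a small sketch. Assuming a 14-day warm-up for a new SKU series (an assumption for illustration; the exact warm-up is internal to the detection system), the logic looks roughly like this:

```python
MIN_HISTORY_DAYS = 14  # assumed warm-up before a new SKU series can be modeled

def levels_checked(sku_history_days: int) -> list[str]:
    """Illustrative only: a new SKU always contributes to the service-level check,
    but gets its own SKU-level check once enough history has accumulated."""
    levels = ["service"]
    if sku_history_days >= MIN_HISTORY_DAYS:
        levels.append("sku")
    return levels

print(levels_checked(3))   # ['service']        -> only a service-level spike can surface
print(levels_checked(30))  # ['service', 'sku'] -> SKU-level detection is active
```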
Evaluation scope
Data samples evaluated by the system are partitioned as follows:
- per billing account
- per project/account
- per service
- per SKU
- per attribution (if applicable)
For SKU-level anomalies, the anomaly detection system evaluates anomalies per SKU, per service, and per project/account across regions; for service-level anomalies, the system evaluates anomalies per service, across projects and SKUs. The anomaly detection system doesn't evaluate the combined costs of multiple services.
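As an illustration of this partitioning, the following pandas sketch (column names are assumptions, not the actual billing schema) builds the two kinds of series the detector evaluates; attribution-based partitions are omitted for brevity:

```python
import pandas as pd

def partition_for_evaluation(billing: pd.DataFrame) -> dict[str, pd.Series]:
    """Illustrative partitioning of billing rows into the series the detector evaluates.
    Assumed columns: billing_account, project, service, sku, region, day, cost."""
    # SKU-level series: per billing account, project/account, service, and SKU, across regions.
    sku_series = billing.groupby(
        ["billing_account", "project", "service", "sku", "day"]
    )["cost"].sum()
    # Service-level series: per billing account and service, across projects and SKUs.
    service_series = billing.groupby(
        ["billing_account", "service", "day"]
    )["cost"].sum()
    return {"sku_level": sku_series, "service_level": service_series}
```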
Criteria
To be classified as an anomaly, the spend of a SKU or service must meet all the following criteria:
- The daily spend reaches the minimum threshold:
  - SKU-level anomalies: US$50
  - Service-level anomalies: US$250
- The daily spend exceeds monthly seasonality.
- The daily spend exceeds the upper bound of the system's normal range (or acceptable range).
The anomaly detection system uses a model fitted on data from the preceding period to forecast expected spend. The normal range is determined by a DoiT-specific confidence interval, that is, the range within which a given percentage of expected values should fall; a 90% confidence interval, for example, covers 90% of the expected values.
The normal range is depicted as a shaded area on cost anomaly charts.
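Putting the three criteria together, here is a minimal sketch. The real system fits a time-series model and uses a DoiT-specific confidence interval; this sketch approximates the normal range with a simple quantile band and monthly seasonality with a crude day-of-month lookback, purely for illustration:

```python
import numpy as np

SKU_MIN_DAILY_SPEND = 50.0       # US$, from the criteria above
SERVICE_MIN_DAILY_SPEND = 250.0  # US$, from the criteria above

def is_anomalous(daily_spend: float, history: np.ndarray, level: str = "sku") -> bool:
    """Illustrative check of the three criteria against `history`, the daily
    spend of the preceding period (oldest first)."""
    threshold = SKU_MIN_DAILY_SPEND if level == "sku" else SERVICE_MIN_DAILY_SPEND
    if daily_spend < threshold:                 # 1. minimum daily spend
        return False
    same_day_prior_months = history[-30::-30]   # crude monthly-seasonality proxy
    if same_day_prior_months.size and daily_spend <= same_day_prior_months.max():
        return False                            # 2. must exceed monthly seasonality
    upper_bound = np.quantile(history, 0.95)    # 3. must exceed the normal range's upper bound
    return daily_spend > upper_bound
```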
Detection latency
In most cases, an anomaly is reported within 12 hours of the aggregated cost exceeding the predefined threshold.
The anomaly detection engine checks usage and cost data at regular intervals. For SKU-level anomalies, the check runs hourly; for service-level anomalies, it runs every six hours. The detection latency mainly depends on the varying intervals at which cloud providers report usage and cost data.
See also AWS cost data latency in DoiT console and Google Cloud's frequency of data loads.
Dynamic updates
An ongoing anomaly is regarded as an Active anomaly. The detection system keeps monitoring active anomalies, constantly updating them with the latest available cost data.
An anomaly becomes Inactive when either of the following conditions is met:
- The cost falls back into the new normal range.
- The anomaly has reached the maximum active period of 7 days.
You can find more information about when an anomaly becomes Active and Inactive on cost anomaly charts.
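A minimal sketch of these transition rules, assuming the current upper bound of the normal range is available for the series (function and parameter names are ours):

```python
from datetime import datetime, timedelta

MAX_ACTIVE_PERIOD = timedelta(days=7)  # maximum time an anomaly stays active

def update_status(started_at: datetime, now: datetime,
                  latest_cost: float, normal_upper_bound: float) -> str:
    """Illustrative only: an anomaly stays Active until the cost returns to the
    (possibly updated) normal range or the 7-day maximum active period elapses."""
    if latest_cost <= normal_upper_bound:
        return "Inactive"  # cost fell back into the new normal range
    if now - started_at >= MAX_ACTIVE_PERIOD:
        return "Inactive"  # reached the maximum active period
    return "Active"
```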
Cost anomaly alerts
When an anomaly is detected, we check the billing account, service, and project ID to make sure an alert is triggered only if no other alert has already been sent for the same context.
This means that if a SKU-level alert has already been sent for a service, a service-level alert will not be triggered for that service, and vice versa.
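Conceptually, the deduplication behaves like the following sketch (an assumption about the mechanics, not DoiT's actual implementation):

```python
_sent_alerts: set[tuple[str, str, str]] = set()  # (billing_account, service, project)

def should_alert(billing_account: str, service: str, project: str) -> bool:
    """Illustrative only: at most one alert per billing account, service, and
    project, so a SKU-level alert suppresses a later service-level alert for
    the same context, and vice versa."""
    key = (billing_account, service, project)
    if key in _sent_alerts:
        return False
    _sent_alerts.add(key)
    return True
```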
Spikes at the beginning of the month
Some services apply bulk billing on the first day of the month, resulting in a disproportionate cost spike compared to the rest of the month. However, if the amount is in line with previous months, it is not considered an anomaly.
The anomaly detection model takes this into account. When evaluating costs at the beginning of the month, it conducts a month-to-month comparison in addition to daily cost modeling. The services assessed this way are listed below.
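As an illustration of the month-to-month comparison, the sketch below compares the first-day charge with the same charge in previous months instead of with ordinary daily costs; the tolerance factor is an assumption for illustration:

```python
import numpy as np

def first_of_month_spike_is_anomalous(first_day_cost: float,
                                      prior_first_day_costs: list[float],
                                      tolerance: float = 1.2) -> bool:
    """Illustrative only: a recurring bulk charge on the first of the month is
    not flagged if it is in line with the first-day charges of previous months."""
    if not prior_first_day_costs:
        return False  # nothing to compare against yet
    typical = float(np.median(prior_first_day_costs))
    return first_day_cost > typical * tolerance
```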
Services with cost spikes at the beginning of the month
- Amazon Cognito
- Amazon ElastiCache
- Amazon OpenSearch Service
- Amazon Redshift
- Amazon Relational Database Service
- Amazon Route 53
- Amazon WorkSpaces
- AWS Certificate Manager
- AWS Data Transfer
- AWS Elemental MediaConvert
- AWS Identity and Access Management Access Analyzer
- AWS Outposts
- AWS Shield
- Bright Data Enterprise non metered
- Cloud Navigator
- Compute Engine (selected SKUs)
- Contact Center Telecommunications (service sold by AMCS, LLC)
- Coralogix
- Datadog
- Datadog Pro
- Directions API
- Drata Security & Compliance Automation Platform
- Fastly for GCP Marketplace
- Geocoding API
- Geolocation API
- Grafana Cloud observability: Grafana, Prometheus metrics, logs, traces
- HYCU R-Cloud™ Platform
- Identity Platform
- JFrog DevOps Platform - Enterprise X
- JFrog Software Supply Chain Platform
- Looker Studio
- Maps API
- Office LTSC Professional Plus 2021
- Places API
- Plerion Cloud Security Platform (Contract)
- Snyk: Developer Security Platform
- Twilio Segment
- Vantage Cloud Cost Platform - Enterprise
- WIZ Cloud Infrastructure Security Platform
Interactive demo
Try out our interactive demo for a hands-on walk-through experience.
If the demo doesn't display properly, try expanding your browser window or opening the demo in a new tab.