Limits by Plan
Rate limits are applied per API key on a fixed monthly window: the counter resets on the first day of each calendar month (UTC). Each request that counts toward the limit is one unit, regardless of the model used or tokens consumed.
| Plan | Monthly Requests | Requests / Minute | Concurrent Requests |
|---|---|---|---|
| Free | 200 | 10 | 1 |
| Starter | 1,000 | 30 | 3 |
| Pro | Unlimited | 120 | 10 |
| Enterprise | Unlimited | Custom | Custom |
What Counts as a Request
- Each call to /v1/chat/completions counts as one request, whether streaming or non-streaming.
- Polling endpoints (e.g., agent status) count as one request per poll.
- Requests that fail with a 4xx client error still count toward your limit.
- Requests that fail with a 5xx server error do not count toward your limit.
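The counting rules above can be mirrored in a local usage estimate. The following is a minimal sketch, not the API's own accounting: the function names are hypothetical, and the treatment of status codes outside the 2xx, 4xx, and 5xx ranges (and of 429 responses themselves) is an assumption, since the rules above do not cover them.

```python
from typing import Iterable


def counts_toward_limit(status_code: int) -> bool:
    """Return True if a response with this status counts against the quota.

    Per the rules above: successful and 4xx responses count, 5xx do not.
    Assumption: anything below 500 (including 3xx and 429) is counted.
    """
    return status_code < 500


def estimate_usage(status_codes: Iterable[int]) -> int:
    """Estimate how many quota units a series of responses consumed."""
    return sum(1 for code in status_codes if counts_toward_limit(code))
```

The X-RateLimit-Remaining header remains the authoritative figure; a local estimate like this is only useful for sanity checks or for logging between requests.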
Rate Limit Headers
Every API response includes headers that tell you your current rate limit status. Use these to monitor your consumption and implement client-side throttling.
| Header | Type | Description |
|---|---|---|
| X-RateLimit-Limit | integer | Maximum number of requests allowed in the current period. |
| X-RateLimit-Remaining | integer | Number of requests remaining in the current period. |
| X-RateLimit-Reset | integer | Unix timestamp (seconds) when the current period resets. |
| X-RateLimit-Limit-Minute | integer | Maximum requests allowed per minute. |
| X-RateLimit-Remaining-Minute | integer | Requests remaining in the current minute window. |
| Retry-After | integer | Seconds to wait before retrying (only present on 429 responses). |
Example Response Headers
HTTP/2 200 OK
Content-Type: application/json
X-RateLimit-Limit: 1000
X-RateLimit-Remaining: 847
X-RateLimit-Reset: 1714521600
X-RateLimit-Limit-Minute: 30
X-RateLimit-Remaining-Minute: 28
The X-RateLimit-Limit and X-RateLimit-Remaining headers reflect your monthly quota. The minute-level headers reflect the short-term burst limit.
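To use these headers for client-side throttling, it helps to parse them into typed values first. The sketch below is one way to do that; `RateLimitStatus` and `parse_rate_limit_headers` are hypothetical names, and it assumes every header carries an integer value. How the monthly headers are populated on plans with an Unlimited monthly quota is not specified here, so that case is not handled.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Mapping


@dataclass(frozen=True)
class RateLimitStatus:
    monthly_limit: int
    monthly_remaining: int
    resets_at: datetime  # timezone-aware, UTC
    minute_limit: int
    minute_remaining: int


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitStatus:
    """Convert the rate limit headers of a response into typed fields."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    h = {name.lower(): value for name, value in headers.items()}
    return RateLimitStatus(
        monthly_limit=int(h["x-ratelimit-limit"]),
        monthly_remaining=int(h["x-ratelimit-remaining"]),
        # The reset value is a Unix timestamp in seconds.
        resets_at=datetime.fromtimestamp(
            int(h["x-ratelimit-reset"]), tz=timezone.utc
        ),
        minute_limit=int(h["x-ratelimit-limit-minute"]),
        minute_remaining=int(h["x-ratelimit-remaining-minute"]),
    )
```

Applied to the example headers above, `resets_at` is 2024-05-01 00:00:00 UTC, which matches the first-of-the-month reset.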
Handling 429 Responses
When you exceed either the monthly or per-minute limit, the API returns a 429 Too Many Requests response with a JSON error body and a Retry-After header.
429 Response Body
{
  "error": {
    "type": "rate_limit_exceeded",
    "message": "You have exceeded your per-minute request limit. Please retry after 12 seconds.",
    "retry_after": 12
  }
}
Retry Strategy
Implement exponential backoff with jitter when you receive a 429. The Retry-After header gives you the minimum wait time, but adding jitter prevents thundering herd problems when multiple clients are rate-limited simultaneously.
Python Example
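A minimal sketch of this strategy, under stated assumptions: `send` is a hypothetical callable that wraps your HTTP client and returns the status code together with the parsed Retry-After value (or `None` if absent). No particular HTTP library is assumed, and the base delay, cap, and retry count are illustrative defaults rather than recommended values.

```python
import random
import time
from typing import Callable, Optional, Tuple

# (status_code, retry_after_seconds or None); a hypothetical response shape.
Response = Tuple[int, Optional[float]]


def backoff_delay(
    attempt: int,
    retry_after: Optional[float],
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to sleep before retrying after failed attempt `attempt` (0-based)."""
    # The exponential term doubles per attempt, bounded by `cap`.
    exponential = min(cap, base * (2 ** attempt))
    # Retry-After is the minimum wait; random jitter is added on top of it.
    minimum = retry_after if retry_after is not None else 0.0
    return minimum + rng() * exponential


def call_with_retry(
    send: Callable[[], Response],
    max_retries: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` until it returns a non-429 status or retries run out."""
    for attempt in range(max_retries + 1):
        status, retry_after = send()
        if status != 429:
            return status
        if attempt < max_retries:
            sleep(backoff_delay(attempt, retry_after))
    raise RuntimeError("still rate limited after %d retries" % max_retries)
```

Adding the jitter on top of Retry-After, rather than randomizing the whole delay, keeps every retry at or above the server's stated minimum while still spreading clients apart.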
Best Practices
Follow these practices to stay within your rate limits and build resilient integrations.
- Monitor the X-RateLimit-Remaining header and slow down requests when it drops below 10% of your limit.
- Use streaming responses for chat completions: a single streaming request counts once, whereas every poll for results counts as a separate request.
- Cache responses when appropriate. If multiple users ask the same question, serve the cached result instead of making duplicate API calls.
- Avoid redundant calls. Call the models list endpoint once and cache the result rather than calling it before every completion request.
- Distribute requests evenly across the minute window instead of sending bursts.
- Set up alerts in your monitoring system when X-RateLimit-Remaining drops below a threshold.
- On the Free plan, consider upgrading to Starter if you regularly hit the ceiling of 200 requests per month.
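Two of the practices above, spacing requests evenly across the minute and slowing down when the remaining quota drops below 10%, can be sketched as a small client-side helper. `RequestPacer` and `is_running_low` are hypothetical names, and the even-spacing approach is one simple choice among several (a token bucket would also work).

```python
import time
from typing import Callable, Optional


class RequestPacer:
    """Spaces calls evenly so at most `per_minute_limit` start per minute."""

    def __init__(
        self,
        per_minute_limit: int,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.interval = 60.0 / per_minute_limit  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._next_slot: Optional[float] = None

    def wait(self) -> None:
        """Block until the next evenly spaced slot, then reserve the one after."""
        now = self._clock()
        if self._next_slot is not None and now < self._next_slot:
            self._sleep(self._next_slot - now)
            now = self._next_slot
        self._next_slot = now + self.interval


def is_running_low(remaining: int, limit: int, threshold: float = 0.10) -> bool:
    """True when the remaining quota is below `threshold` of the limit."""
    return remaining / limit < threshold
```

On the Starter plan, for example, `RequestPacer(30)` leaves 2 seconds between requests. Call `pacer.wait()` before each request, and use `is_running_low` with the values from the X-RateLimit-Remaining and X-RateLimit-Limit headers to decide when to back off further or raise an alert.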