Doc with explanation of different status codes and error conditions (#…

…37945) * Doc with explanation of different status codes and error conditions * Update StatusCodes.md * Update StatusCodes.md * Update StatusCodes.md * Update StatusCodes.md * Update StatusCodes.md * Update StatusCodes.md
Azure · Dec 7, 2023 · 1691f99 · 1691f99
1 parent 3a0c2f1
commit 1691f99
Showing 1 changed file with 61 additions and 0 deletions.
diff --git a/sdk/cosmos/azure-cosmos/docs/StatusCodes.md b/sdk/cosmos/azure-cosmos/docs/StatusCodes.md
@@ -0,0 +1,61 @@
+# Cosmos DB Java SDK - Status Codes
+
+## Overview
+
+Below is a list of the different status code / sub status code combinations that admins and/or developers can experience in their application or when looking at diagnostics/metrics when using the Cosmos DB Java SDK.
+Several of these error conditions never surface to the application, because the SDK has built-in retry logic to recover from them. These status codes can still show up in diagnostics and/or Micrometer metrics. Common questions from users about these status codes include:
+- what do these error conditions mean - even when they are automatically handled by the SDK?
+- how long should a certain error condition happen before self-recovery happens?
+
+The column "Expected to be transient" indicates whether an error condition is expected to be seen only for a short period of time (with auto-mitigation by the SDK and/or the service within seconds, or at most a few minutes). For error conditions that might not go away automatically, the description contains additional context on what to watch out for if the error is seen for longer periods of time. For errors that are supposed to be "transient", it will be necessary to file a support ticket if the error condition persists for an extended period of time. The "Additional info" column also points to known issues, if available, and possible mitigations.
+
+## Out of scope
+This document intentionally does not go into detail on how resilient applications should react to certain error conditions (and whether/how retries are recommended). Prescriptive guidance for developers is available here: [Designing resilient applications with Azure Cosmos DB SDKs](https://learn.microsoft.com/azure/cosmos-db/nosql/conceptual-resilient-sdk-applications)
+
+## Status codes
+
+| Status code | Substatus code | Expected to be transient | Additional info |
+| -----------------:|-----------------------:|:-----------------------:|:----------------------|
+|200|0|No| `OK`|
+|201|0|No| `Created` - returned for createItem or upsertItem when a new document was created|
+|204|0|No| `No Content` - returned when no payload would ever be returned - like for delete operations|
+|207|0|No| `Multi-Status` - returned for transactional batch or bulk operations when some of the item operations have failed and others succeeded. The API allows checking status codes of item operations.|
+|304|0|No| `Not Modified` - will be returned for `ChangeFeed` operations to indicate that there are no more changes|
+|400|\*|No| `Bad Request` - indicates that the client violated some protocol constraint. See [Bad Request TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-bad-request) for more details.|
+|400|1001|No| `Bad Request/Partition key mismatch` - indicates that the PartitionKey defined in the point operation does not match the partition key value that the service extracts from the document's payload based on the `PartitionKeyDefinition` of the container. See [Bad Request TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-bad-request) for more details.|
+|400|1004|No| `Bad Request/CrossPartitionQueryNotServable` - indicates that the client attempted to execute a cross-partition query, which cannot be processed with the current SDK version. Usually this means that the query uses a query construct, which is not yet supported in the SDK version being used. Upgrading the SDK might help to address the problem. See [Bad Request TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-bad-request) for more details.|
+|401|0|No|`Unauthorized` - indicates that the client used invalid credentials. The most frequent scenario is key rotation, when customers rotate the key that is currently in use. Key rotation needs to be replicated across regions, which can take up to a few minutes. During this time a `401 Unauthorized` can be returned whether the client is using the old or the new key, because the replication is still in progress. The best way to do key rotation is to rotate a key only after it is no longer used by applications - that is why a primary and a secondary key exist for both writable and read-only keys. More details can be found here - [key rotation best practices](https://learn.microsoft.com/azure/cosmos-db/secure-access-to-data?tabs=using-primary-key#key-rotation). In addition, a `401` could mean an invalid key when using `MasterKey`-based authentication, a time-synchronization issue, or, when using AAD, that the AAD credentials are not set up correctly. See [Unauthorized TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-unauthorized) for more details.|
+|403|\*|No|`Forbidden` - indicates that the service rejected the request due to missing permissions. See [Forbidden TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-forbidden)|
+|403|3|Yes (up to a few minutes)|`Forbidden/WriteForbidden` - indicates that the client attempted a write operation against a read-only region in a single write region set-up.|
+|403|1008|Yes (up to a few minutes)|`Forbidden/AccountNotFound` - indicates that the client attempted a read or write operation against a replica that did not have information about the database account.|
+|403|5300|No|`Forbidden/AADForMetadata` - indicates that the client attempted a metadata operation (like creating, deleting or modifying a container/database) when using AAD authentication. This is not possible via the data plane SDK. To execute control plane operations with AAD authentication, please use the management SDK. See [Forbidden TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-forbidden#non-data-operations-are-not-allowed) for more details.|
+|403|1014|No|`Forbidden/LogicalPartitionExceedsStorage` - indicates that the data size of a logical partition exceeds the service quota (currently 20 GB). See [Forbidden TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-forbidden#partition-key-exceeding-storage) and also the [Azure Cosmos DB Service quotas](https://learn.microsoft.com/azure/cosmos-db/concepts-limits#provisioned-throughput) for more details.|
+|404|0|No|`Not found` - Indicates that the resource the client tried to read does not exist (on the replica being contacted). Depending on the consistency level used, this could be a transient error condition: when using less than strong consistency, the application needs to be able to gracefully handle temporarily seeing 404/0 from some replica even after the document got created. See [Not found TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-not-found) for more details.|
+|404|1002|In most cases|`Not Found/Read session not available` - Indicates that a client uses session consistency and reached a replica that has a replication lag and has not caught up to the requested session token. In many cases this error condition will be transient. But there are certain situations in which it could persist for longer periods of time - either a wrong session token is being provided in the application, or in a multi-write region set-up operations are regularly directed to different regions.|
+|404|1003|Yes (up to a few minutes)|`Not Found/Owner resource does not exist` - Indicates that a client attempted to process an operation on a resource whose parent does not exist. For example, an attempt to do a point operation on a document when the container does not exist (yet). Can be transient when attempting document operations immediately after creating a container etc. - but when not transient, it usually means a bug in your application.|
+|404|1024|x|`Not Found/Incorrect Container resource id` - Indicates that a client attempted to use a container that has recently been deleted and recreated. So, the cached container id in the client is stale - and identifies the previously deleted container. The SDK will trigger retries - in general applications need to be able to tolerate that container deletion and immediate recreation will take up to a few seconds/minutes to be replicated across all regions.|
+|408|\*|Yes|`Request timeout` - Indicates a timeout for an attempted operation. See [Request timeout TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-java-sdk-request-timeout) for more details.|
+|408|20008|Yes, unless unrealistic e2e timeout is used|`Request timeout/End-to-end timeout exceeded` - Indicates that the application defined end-to-end timeout was exceeded when processing an operation. This will usually be a transient error condition - exceptions are when the application defines unrealistic end-to-end timeouts - for example when executing a query that could very well take a few seconds because it is relatively inefficient or when the end-to-end timeout is lower than the to-be-expected network transit time between the application's location and the Cosmos DB service endpoint.|
+|408|20901|No|`Request timeout/Negative End-to-end timeout provided` - Indicates that the application defined a negative end-to-end timeout. This indicates a bug in your application.|
+|409|0|No|`Conflict` - Indicates that the attempt to insert (or upsert) a new document cannot be processed because another document with the same identity (partition key value + value of `id` property) exists or a unique key constraint would be violated.|
+|410|\*|Yes|`Gone` - indicates transient error conditions that could happen while replicas get moved to a different node or partitions get split/merged. The SDK will retry these error conditions and usually mitigate them without even surfacing them to the application. If these errors get surfaced to the application as `CosmosException` with status code `410` or `503`, these errors should always be transient.|
+|410|1000|x|`Not Found/Incorrect Container resource id` - Indicates that a client attempted to use a container that has recently been deleted and recreated. So, the cached container id in the client is stale - and identifies the previously deleted container. The SDK will trigger retries - in general applications need to be able to tolerate that container deletion and immediate recreation will take up to a few seconds/minutes to be replicated across all regions.|
+|410|21010|Yes|`Service timeout` - Indicates that an operation timed out at the service. See [Request timeout TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-request-timeout) for more details. This error will be mapped to a CosmosException with status code `503` when surfacing it to the application after exceeding SDK-retries.|
+|410|21006|Yes|`Global strong write barrier not met` - Indicates that synchronous replication of a write operation in a multi-region account with strong consistency did not complete. This error should always be transient and will be mapped to a CosmosException with status code `503` when surfacing it to the application after exceeding SDK-retries.|
+|410|21007|Yes|`Read quorum not met` - Indicates that no read quorum could be achieved when using strong or bounded staleness consistency. This error should always be transient and will be mapped to a CosmosException with status code `503` when surfacing it to the application after exceeding SDK-retries.|
+|412|0|No|`Precondition failed` - The document has been modified since the application read it (and retrieved the etag that was used as the precondition for the write operation). This is the typical optimistic concurrency signal - and needs to be gracefully handled in your application. The usual pattern is to re-read the document, apply the same changes and retry the write with the updated etag. See [Precondition failed TSG - trouble-shooting guide](https://aka.ms/CosmosDB/sql/errors/precondition-failed) for more details.|
+|413|\*|No| `Request entity too large` - indicates that the client attempted to create or update a document with a payload that is too large. See [Azure Cosmos DB Service quotas](https://learn.microsoft.com/azure/cosmos-db/concepts-limits#per-item-limits) for more details.|
+|429|3200|Depends on app RU/s usage|`User throttling` - Indicates that the operations being processed by your Cosmos DB account exceed the provisioned throughput RU/s. Mitigation can be done by either scaling-up - or improving the efficiency especially of queries to reduce the RU/s consumption. See [Throttling TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-request-rate-too-large) for more details.|
+|429|3201|Yes|`Metadata throttling` - Indicates that metadata operations are being throttled. Increasing provisioned throughput (RU/s) won't help - this usually indicates a bug in your application where metadata calls are triggered extensively or you are not using a singleton pattern for `CosmosClient`/`CosmosAsyncClient`. See [Throttling TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-request-rate-too-large) for more details.|
+|429|< 3200|Yes (up to a few minutes)|`SLA violating throttling` - Indicates service-side throttling that will count against the service's SLA. These errors should always be transient.|
+|449|0|Yes|`RetryWith` - Indicates a concurrent attempt to change documents server-side - for example via patch or stored procedure invocation. The `449` status code will be automatically retried by the SDK. This condition should always be transient as long as the application is not excessively doing concurrent changes to documents.|
+|500|0|Unknown|`Internal Server error` - Indicates an unexpected and unqualified internal service error.|
+|502|0|Unknown|`Bad gateway` - Indicates that an HTTP proxy you are using is misbehaving. Any `502` or `504` is a clear signal that the actual problem is not in Cosmos DB but in the proxy being used. In general HTTP proxies are not recommended for any production workload.|
+|503|\*|Yes|`Service unavailable` - Indicates that either a service issue occurred or the client, even after retries, is not able to successfully process an operation. See [Service unavailable TSG - trouble-shooting guide](https://learn.microsoft.com/azure/cosmos-db/nosql/troubleshoot-service-unavailable) for more details.|
+|503|21001|Yes|`Name cache is stale` - Indicates that a container was deleted and recreated - and the client's cache still has the old container metadata. This error indicates that the client even after refreshing the cache got the container metadata of the "old" container. Usually it indicates that the replication of the new container metadata across all regions took longer than usual. This error should always be transient.|
+|503|21002|Yes|`Partition key range gone` - Indicates that a partition split or merge happened and the client even after several retries was not able to get the metadata for the new partition. This error indicates a delay of replication of partition key range metadata and should always be transient.|
+|503|21003|Yes|`Completing split` - Indicates that a partition split or merge is pending and committing the split takes longer than expected. This error should always be transient and will be mapped to a CosmosException with status code `503` when surfacing it to the application after exceeding SDK-retries.|
+|503|21004|Yes|`Completing migration` - Indicates that a partition migration due to load-balancing is pending and takes longer than expected. This error should always be transient and will be mapped to a CosmosException with status code `503` when surfacing it to the application after exceeding SDK-retries.|
+|410/503|21005|Yes|`Serverside 410` - Indicates that a replica returns a 410 - usually during initialization of the replica. This error should always be transient and will be mapped to a CosmosException with status code `503` when surfacing it to the application after exceeding SDK-retries.|
+|503|21008|Yes|`Service unavailable` - Indicates that a replica returned `503` service unavailable. This error should always be transient and will surface as a CosmosException with status code `503` after exceeding SDK-retries.|
+|504|0|Unknown|`Gateway timeout` - Indicates that an HTTP proxy you are using timed out. Any `502` or `504` is a clear signal that the actual problem is not in Cosmos DB but in the proxy being used. In general HTTP proxies are not recommended for any production workload.|
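
To make the "Expected to be transient" column easier to use when reading diagnostics or metrics, the table above can be encoded as a small lookup. The following is only a sketch under stated assumptions: `StatusCodeTriage`, its `Transience` enum and the `classify` method are hypothetical names invented for illustration and are not part of the azure-cosmos SDK. It covers only a subset of the rows, treats "In most cases" and the other conditional entries as `DEPENDS`, and reports anything not listed (including the rows marked `x`) as `UNKNOWN`.

```java
import java.util.Map;

/** Hypothetical helper (not part of azure-cosmos): encodes part of the status code table. */
final class StatusCodeTriage {

    /** Mirrors the "Expected to be transient" column; DEPENDS covers the conditional rows. */
    enum Transience { TRANSIENT, NOT_TRANSIENT, DEPENDS, UNKNOWN }

    // Rows with an explicit sub-status code, keyed as "status/subStatus".
    private static final Map<String, Transience> EXACT = Map.ofEntries(
        Map.entry("401/0", Transience.NOT_TRANSIENT),
        Map.entry("403/3", Transience.TRANSIENT),
        Map.entry("403/1008", Transience.TRANSIENT),
        Map.entry("404/0", Transience.NOT_TRANSIENT),
        Map.entry("404/1002", Transience.DEPENDS),      // "In most cases"
        Map.entry("404/1003", Transience.TRANSIENT),
        Map.entry("408/20008", Transience.DEPENDS),     // unless an unrealistic e2e timeout is used
        Map.entry("408/20901", Transience.NOT_TRANSIENT),
        Map.entry("409/0", Transience.NOT_TRANSIENT),
        Map.entry("412/0", Transience.NOT_TRANSIENT),
        Map.entry("429/3200", Transience.DEPENDS),      // depends on app RU/s usage
        Map.entry("429/3201", Transience.TRANSIENT),
        Map.entry("449/0", Transience.TRANSIENT)
    );

    // Rows whose sub-status column is "*": applied when no exact row matches.
    private static final Map<Integer, Transience> BY_STATUS = Map.of(
        400, Transience.NOT_TRANSIENT,
        403, Transience.NOT_TRANSIENT,
        408, Transience.TRANSIENT,
        410, Transience.TRANSIENT,
        413, Transience.NOT_TRANSIENT,
        503, Transience.TRANSIENT
    );

    private StatusCodeTriage() { }

    static Transience classify(int statusCode, int subStatusCode) {
        // Success and "Not Modified" rows (200, 201, 204, 207, 304) are all marked "No".
        if (statusCode < 400) {
            return Transience.NOT_TRANSIENT;
        }
        Transience exact = EXACT.get(statusCode + "/" + subStatusCode);
        if (exact != null) {
            return exact;
        }
        // The "429 / < 3200" row: SLA violating throttling.
        if (statusCode == 429 && subStatusCode < 3200) {
            return Transience.TRANSIENT;
        }
        // 500, 502, 504 and any pair not encoded here fall through to UNKNOWN.
        return BY_STATUS.getOrDefault(statusCode, Transience.UNKNOWN);
    }
}
```

In an application, the two integers would typically come from `CosmosException.getStatusCode()` and `CosmosException.getSubStatusCode()`, or from the status and sub-status values reported in diagnostics. A result of `TRANSIENT` only restates the table's expectation; per the overview, a "transient" error that persists for an extended period still warrants a support ticket.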