Operations
Failure Modes
Read
If the gateway is down, your queries are down. The document cache is stateless and can scale to zero without disrupting reads, and no other component affects the read path.
Write
The primary failure mode for writes is an Aerospike stop-writes condition during a multi-stage pipeline job. Staged documents stay warm in the cache but do not contain vector data. If the staged data exceeds the Aerospike drive allocation, Aerospike stops accepting writes and your pipeline degrades to S3-backed chunk reads. The operator can restart Aerospike to recover, at the cost of losing the document cache contents. Pipeline workers resume automatically: staged chunk bodies are durable in S3, pending state is in PostgreSQL, and the gateway refills Aerospike from S3 after reconnect.
In the Helm chart, the document cache restarts automatically on stop-writes by default
(documentCache.autoRestartOnStopWrites: true) and clears its Aerospike
backing file on pod start (documentCache.storage.resetOnStart: true). That
makes a pod restart a valid stop-writes recovery action for the Layer-owned
cache. S3 and PostgreSQL must remain healthy; they are the durable recovery
boundary.
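
As a values fragment, the two defaults named above would look roughly like the sketch below. The key paths are taken from the text; the nesting is inferred from the dotted names, and the chart may define other keys under documentCache that are not shown here.

```yaml
# Sketch of the relevant Helm values; nesting inferred from the dotted key names.
documentCache:
  # Restart the cache pod when Aerospike enters stop-writes (default per the text).
  autoRestartOnStopWrites: true
  storage:
    # Clear the Aerospike backing file on pod start (default per the text).
    resetOnStart: true
```

With both values left at their defaults no override is needed; the fragment mainly shows where to look when changing either behavior.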