High Availability & Disaster Recovery

Overview

This document describes the High Availability and Disaster Recovery options available for on-premises deployments of Sestek products. It covers the supported deployment models, the resilience characteristics of each model, and the responsibilities that fall to both Sestek and the customer's infrastructure teams.

High Availability (HA) addresses continuity within a single deployment site, keeping services running through component failures, pod restarts, and software updates. Disaster Recovery (DR) addresses a broader failure scenario in which an entire site becomes unavailable, requiring traffic to be redirected to a secondary site.

High Availability and Disaster Recovery are the two core technical components of a business continuity strategy. The two capabilities are complementary and can be adopted independently or together. The appropriate model is selected based on the customer's business continuity requirements, infrastructure constraints, and operational preferences.

High Availability

Sestek products achieve High Availability at the application layer through Kubernetes-native capabilities.

Availability Modes

Baseline

All Sestek products run on Kubernetes. At this baseline level, the platform provides self-healing capabilities: Kubernetes automatically detects failed pods and restarts them, keeping services operational without manual intervention. Software updates are applied using a rolling update strategy, which deploys changes incrementally to minimize service impact. Baseline is suitable for deployments where short recovery times after a pod failure are acceptable, and where the overhead of additional replicas or clusters is not justified by the business requirements.

Multi-Replica

In this mode, application pods, including web, API, and application logic, as well as cache and message queue pods, run as multiple concurrent replicas within the same Kubernetes cluster. If any replica fails, the remaining ones continue to serve requests without interruption. Multi-Replica is recommended for deployments where pod-level availability is important but a full cluster outage is an acceptable risk. It provides a significant improvement over Baseline without the complexity of running multiple clusters.

Multi-Cluster

In this mode, two independent Kubernetes clusters operate simultaneously, protecting against infrastructure-level failures that would impact an entire cluster. Multi-Cluster is recommended for deployments that serve real-time, user-facing interactions where any cluster-level failure would result in unacceptable downtime. Multi-Cluster deployments have the following infrastructure requirements:

Stateful components: All stateful components (database, S3-compatible object storage, Redis, RabbitMQ, Qdrant) must be deployed outside the Kubernetes clusters and on a single site. This ensures that both clusters share the same data, preventing consistency issues that would arise if each cluster maintained its own independent state. For component ownership, refer to the Component Responsibility Matrix: https://docs.knovvu.com/docs/component-responsibility-matrix
Load balancing and session management: The customer is responsible for providing load balancing and sticky session handling across the clusters.
GitOps: A shared Git repository must be accessible from both clusters.
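
To make the sticky session requirement above concrete, the following is a minimal Python sketch of one possible routing rule: a session is pinned to a cluster by hashing its identifier, and moves to the other cluster only when its preferred cluster is unhealthy. The function name `pick_cluster`, the cluster names, and the hashing approach are illustrative assumptions. The actual mechanism is chosen and operated by the customer on their own load balancer.

```python
import hashlib


def pick_cluster(session_id: str, clusters: list[str], healthy: set[str]) -> str:
    """Return the cluster that should serve this session (hypothetical helper).

    The same session_id always maps to the same preferred cluster, which is
    what makes the session "sticky". If the preferred cluster is not healthy,
    the first healthy cluster in the list is used instead.
    """
    candidates = [c for c in clusters if c in healthy]
    if not candidates:
        raise RuntimeError("no healthy cluster available")

    # Derive a stable index from the session identifier.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    preferred = clusters[int.from_bytes(digest[:8], "big") % len(clusters)]

    return preferred if preferred in healthy else candidates[0]
```

With two clusters, a session keeps hitting the same cluster for as long as that cluster is healthy and is redirected to the remaining cluster otherwise.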

[Figure: HA diagram]

Capability                 Single Cluster / Baseline   Single Cluster / Multi-Replica   Multi-Cluster / Baseline   Multi-Cluster / Multi-Replica
Kubernetes Clusters        1                           1                                2                          2
Pod Replicas per Service   Single                      Multiple                         Single                     Multiple
Self-Healing               ✓                           ✓                                ✓                          ✓
Pod-level Failover         ✗                           ✓                                ✗                          ✓
Cluster-level Failover     ✗                           ✗                                ✓                          ✓
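
The comparison above reduces to two independent choices: how many clusters are deployed, and whether each service runs multiple replicas. As an illustration only, the Python sketch below encodes that rule. The names `HaConfig` and `capabilities` are hypothetical and are not part of any Sestek tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HaConfig:
    clusters: int         # number of Kubernetes clusters (1 or 2 in the table)
    multi_replica: bool   # True if each service runs multiple pod replicas


def capabilities(cfg: HaConfig) -> dict[str, bool]:
    """Return the resilience capabilities implied by a configuration."""
    if cfg.clusters < 1:
        raise ValueError("at least one cluster is required")
    return {
        "self_healing": True,                         # provided in every mode
        "pod_level_failover": cfg.multi_replica,      # needs multiple replicas
        "cluster_level_failover": cfg.clusters >= 2,  # needs a second cluster
    }
```

For example, `capabilities(HaConfig(clusters=2, multi_replica=False))` reports cluster-level failover but not pod-level failover, matching the Multi-Cluster / Baseline column.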

Infrastructure Requirements

Multi-Replica and Multi-Cluster deployments require additional hardware capacity. Sestek will share the necessary sizing requirements based on the selected configuration. Procurement and provisioning of the infrastructure is the customer's responsibility. The cost of deploying and configuring the additional clusters is reflected in the project's implementation effort estimate.

Disaster Recovery

Disaster Recovery (DR) addresses the ability to restore service from a second, geographically or infrastructurally separate site following a major failure, such as a data center outage or catastrophic hardware failure, that renders the primary site unavailable.

The model follows an Active/Passive pattern:

Primary site: Actively handles all production traffic.
Secondary site: Remains in standby, kept in sync via data replication.

Availability Mode per Site

The availability mode and the number of clusters deployed within each site are determined by the customer. However, all clusters across the Primary and Secondary sites must be configured and sized identically. This ensures that any cluster on the Secondary site can fully handle production traffic upon failover, and that all clusters receive identical GitOps configuration.
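
The requirement that all clusters be configured and sized identically lends itself to an automated check. The Python sketch below compares cluster descriptions and reports any differences. The function name `find_mismatches` and the specification fields used in the example (`worker_nodes`, `cpu_per_node`, `replicas_per_service`) are assumptions for illustration. The real sizing parameters are those shared by Sestek for the selected configuration.

```python
def find_mismatches(clusters: dict[str, dict[str, object]]) -> dict[str, dict[str, tuple]]:
    """Compare every cluster spec against the first one.

    Returns, per cluster, the keys whose values differ from the reference
    cluster as (reference_value, actual_value) pairs. An empty result means
    all clusters are described identically.
    """
    names = list(clusters)
    if not names:
        return {}

    reference = clusters[names[0]]
    mismatches: dict[str, dict[str, tuple]] = {}
    for name in names[1:]:
        spec = clusters[name]
        diff = {
            key: (reference.get(key), spec.get(key))
            for key in sorted(set(reference) | set(spec))
            if reference.get(key) != spec.get(key)
        }
        if diff:
            mismatches[name] = diff
    return mismatches


# Example with hypothetical sizing fields.
specs = {
    "primary-a": {"worker_nodes": 6, "cpu_per_node": 16, "replicas_per_service": 2},
    "primary-b": {"worker_nodes": 6, "cpu_per_node": 16, "replicas_per_service": 2},
    "secondary": {"worker_nodes": 4, "cpu_per_node": 16, "replicas_per_service": 2},
}
report = find_mismatches(specs)  # flags "secondary" for its worker_nodes value
```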

Data Replication

The customer is responsible for replicating the database and S3-compatible object storage from the Primary site to the Secondary site. The replication approach should be aligned with the Recovery Point Objective (RPO) acceptable to the business. In a failback scenario, data accumulated on the Secondary site must be replicated back to the Primary site before traffic is switched. The customer is responsible for managing this reverse replication.
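
One way to check that a replication approach is aligned with the RPO is to compare the replication lag against the target: if the Primary site were lost now, the data written since the last successfully replicated change would be lost. The sketch below is a minimal Python illustration of that comparison. The function names and the way the last replicated timestamp is obtained are assumptions, since they depend on the database and storage products in use.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(last_replicated_at: datetime, now: datetime) -> timedelta:
    """Time window of changes that would be lost if the Primary failed at `now`."""
    return max(now - last_replicated_at, timedelta(0))


def rpo_met(last_replicated_at: datetime, now: datetime, rpo: timedelta) -> bool:
    """True if the current replication lag is within the agreed RPO."""
    return replication_lag(last_replicated_at, now) <= rpo


# Example: a 15-minute RPO with the last replicated change 20 minutes old.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last = now - timedelta(minutes=20)
within_target = rpo_met(last, now, rpo=timedelta(minutes=15))  # False
```

The same check applies in the reverse direction during failback, using the Secondary site as the source.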

Network-Layer Failover

The customer is responsible for operating the failover at the network layer. The specific mechanism, such as a DNS update, load balancer reconfiguration, or equivalent, is determined by the customer based on their network infrastructure. This includes detecting a failure on the Primary site and initiating the promotion of the Secondary site to active. The Recovery Time Objective (RTO) is directly dependent on how quickly the customer can execute this failover process.
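
As a rough illustration of how detection and switchover contribute to the RTO, the Python sketch below triggers failover after a number of consecutive failed health probes and adds up the main delays. The class and function names, the threshold-based detection rule, and the breakdown into detection time, switchover time, and DNS TTL are assumptions. The actual process is defined by the customer's network tooling.

```python
from dataclasses import dataclass


@dataclass
class FailoverDetector:
    """Signals failover after `failure_threshold` consecutive failed probes."""
    failure_threshold: int = 3
    consecutive_failures: int = 0

    def record(self, healthy: bool) -> bool:
        """Record one probe result; return True when failover should start."""
        if healthy:
            self.consecutive_failures = 0   # any success resets the count
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures >= self.failure_threshold


def estimated_rto_seconds(probe_interval_s: float, failure_threshold: int,
                          switch_time_s: float, dns_ttl_s: float = 0.0) -> float:
    """Rough RTO estimate: detection time + switchover time + DNS cache expiry."""
    detection_s = probe_interval_s * failure_threshold
    return detection_s + switch_time_s + dns_ttl_s


detector = FailoverDetector(failure_threshold=3)
probes = [False, False, True, False, False, False]
decisions = [detector.record(p) for p in probes]  # only the last probe triggers
rto = estimated_rto_seconds(probe_interval_s=10, failure_threshold=3,
                            switch_time_s=120, dns_ttl_s=60)
```

Requiring several consecutive failures avoids failing over on a single transient error, at the cost of a longer detection time.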

GitOps

A shared Git repository must be accessible from both sites. This ensures consistent application configuration and version state regardless of which site is active.

Infrastructure Requirements

A DR deployment requires a fully independent Secondary site. The exact hardware requirements for each site depend on the HA mode selected for that site. Sestek will share the necessary sizing requirements based on the selected configuration. Procurement and provisioning of the infrastructure for both sites is the customer's responsibility. The cost of deploying and configuring the additional clusters is reflected in the project's implementation effort estimate.

Combined HA and DR Example

The deployment illustrated below is intentionally comprehensive, combining various HA and DR options in a single deployment to show how they work together. Most customers will not require all of these options and will select a configuration that matches their specific availability and recovery requirements.

The active site runs two independent Kubernetes clusters in Multi-Cluster mode. Each cluster runs application pods in Multi-Replica configuration. All stateful components (Redis, RabbitMQ, Qdrant, the database, and S3-compatible object storage) are deployed outside the clusters on a single site to ensure data consistency across both clusters.

The passive site runs a single Kubernetes cluster, configured and sized identically to each of the clusters on the active site, running application pods in Multi-Replica configuration. Redis, RabbitMQ, and Qdrant are deployed outside the cluster, while the database and S3-compatible object storage are replicated from the active site.
