Architecture

Introduction to

The () provides an enterprise-grade Kubernetes-based platform that enables organizations to build, deploy, and manage applications consistently across hybrid and multi-cloud environments. integrates core Kubernetes capabilities with enhanced management, observability, and security services, offering a unified control plane and flexible workload clusters.

The architecture follows a hub-and-spoke model, consisting of a global cluster and multiple workload clusters. This design provides centralized governance while allowing independent workload execution and scalability.

For canonical definitions of platform-wide terms such as global cluster, workload cluster, and cluster plugin, see Glossary.

Core Architectural Components

Global Cluster

The global cluster serves as the centralized management and control hub of . It provides platform-wide services such as authentication, policy management, cluster lifecycle operations, and observability, and it coordinates multi-cluster management and cross-cluster functionality.

Key components include:

Gateway: Acts as the main entry point to the platform. It manages API requests from the UI, CLI (kubectl), and automation tools, routing them to the appropriate backend services.
Authentication and Authorization (Auth): Integrates with external Identity Providers (IdPs) to provide Single Sign-On (SSO) and role-based access control (RBAC).
Web Console: Provides a web-based interface for . It accesses platform APIs through the gateway.
Cluster Management: Handles the registration, provisioning, and lifecycle management of workload clusters.
Services:
  - Operator Lifecycle Manager (OLM) and Cluster Plugins: Manage the installation, updates, and lifecycle of operators and cluster extensions.
  - Internal Image Registry: Offers an out-of-the-box, integrated container image repository with role-based access.
  - Observability: Provides centralized logging, metrics, and tracing for both the global and workload clusters.
  - Cluster Proxy: Enables secure communication between the global cluster and workload clusters.

Workload Cluster

Workload clusters are Kubernetes-based environments managed by the global cluster. Each workload cluster runs isolated application workloads and inherits governance and configuration from the central control plane.

External Integrations

Identity Provider (IdP): Supports federated authentication via standard protocols (OIDC, SAML) for unified user management.
API and CLI Access: Users can interact with through RESTful APIs, the web console, or command-line tools such as kubectl and ac.
Load Balancer (VIP/DNS/SLB): Provides high availability and traffic distribution for the Gateway and ingress endpoints of the global and workload clusters.

Scalability and High Availability

is designed for horizontal scalability and high availability:

Each component can be deployed redundantly to eliminate single points of failure.
The global cluster supports managing dozens to hundreds of workload clusters.
Workload clusters can scale independently according to workload demand.
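Routing traffic through a VIP or DNS name and Ingress keeps client-facing endpoints stable, so traffic can fail over to healthy instances without client-side changes.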

Functional Perspective

()'s complete functionality consists of Core plus extensions, which are built on one of two technical stacks: Operator and Cluster Plugin.

Core

The minimal deliverable unit of , providing core capabilities such as cluster management, container orchestration, projects, and user administration.
- Meets the highest security standards
- Delivers maximum stability
- Offers the longest support lifecycle

Extensions

Extensions in both the Operator and Cluster Plugin stacks can be classified into:
- Aligned: Lifecycle strategy consisting of multiple maintenance streams, aligned with .
- Agnostic: Lifecycle strategy consisting of multiple maintenance streams, released independently from .

For more details about extensions, see Extend.

Technical Perspective

Platform Component Runtime

All platform components run as containers within a Kubernetes management cluster (the global cluster).

High Availability Architecture

The global cluster typically consists of at least three control plane nodes and multiple worker nodes
High availability of etcd is central to cluster HA; see Key Component High Availability Mechanisms for details
Load balancing can be provided by an external load balancer or a self-built VIP inside the cluster

Request Routing

Client requests first pass through the load balancer or self-built VIP
Requests are forwarded to ALB (the platform's default Kubernetes Ingress Gateway) running on designated ingress nodes (or control-plane nodes if configured)
ALB routes traffic to the target component pods according to configured rules
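The last routing step has ALB match each request against configured rules. As an illustration only, the sketch below models one common rule-evaluation scheme, exact host match plus longest path-prefix match on "/" boundaries; the `Rule` fields, hostnames, and backend names are hypothetical and do not describe ALB's actual rule schema or precedence.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Rule:
    host: str         # exact host to match (hypothetical rule field)
    path_prefix: str  # path prefix to match, on "/" boundaries
    backend: str      # target service for matching requests

def prefix_matches(prefix: str, path: str) -> bool:
    """True if `path` equals `prefix` or lies under it on a '/' boundary."""
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")

def route(rules: list[Rule], host: str, path: str) -> Optional[str]:
    """Return the backend of the matching rule with the longest prefix."""
    candidates = [
        r for r in rules
        if r.host == host and prefix_matches(r.path_prefix, path)
    ]
    if not candidates:
        return None  # a real gateway would typically answer 404 here
    return max(candidates, key=lambda r: len(r.path_prefix)).backend

# Hypothetical rules for illustration.
rules = [
    Rule("platform.example.com", "/", "web-console"),
    Rule("platform.example.com", "/api", "gateway-api"),
    Rule("platform.example.com", "/api/auth", "auth"),
]
print(route(rules, "platform.example.com", "/api/auth/login"))  # auth
print(route(rules, "platform.example.com", "/apis"))  # web-console
```

Matching on "/" boundaries keeps a rule for "/api" from accidentally capturing "/apis".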

Replica Strategy

Core components run with at least two replicas
Key components (such as registry, MinIO, ALB) run with three replicas

Fault Tolerance & Self-healing

Achieved through cooperation between kubelet, kube-controller-manager, kube-scheduler, kube-proxy, ALB, and other components
Includes health checks, failover, and traffic redirection
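Self-healing in Kubernetes follows the controller pattern: a controller repeatedly compares desired state with observed state and acts to close the gap. In practice kubelet restarts containers that fail liveness probes, and controllers such as the ReplicaSet controller create replacement pods. The sketch below collapses this into one simplified reconciliation pass; `Pod` and `reconcile` are hypothetical names, not Kubernetes APIs, and scheduling, backoff, and rate limiting are omitted.

```python
from dataclasses import dataclass

@dataclass
class Pod:
    name: str
    healthy: bool  # stands in for the result of health checks

def reconcile(desired: int, pods: list[Pod]) -> tuple[list[str], int]:
    """One simplified reconciliation pass.

    Returns (pods_to_delete, pods_to_create): unhealthy pods are removed
    and replaced, and surplus healthy pods beyond `desired` are removed.
    """
    unhealthy = [p.name for p in pods if not p.healthy]
    healthy = [p.name for p in pods if p.healthy]
    surplus = healthy[desired:]  # scale down if there are too many replicas
    to_create = max(desired - len(healthy), 0)
    return unhealthy + surplus, to_create

pods = [Pod("registry-0", True), Pod("registry-1", False), Pod("registry-2", True)]
print(reconcile(3, pods))  # (['registry-1'], 1)
```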

Data Storage & Recovery

Control-plane configuration and platform state are stored in etcd as Kubernetes resources
After a catastrophic failure, the cluster can be recovered from etcd snapshots

Primary / Standby Disaster Recovery

Two separate global clusters: Primary Cluster and Standby Cluster
The disaster recovery mechanism is based on real-time synchronization of etcd data from the Primary Cluster to the Standby Cluster.
If the Primary Cluster becomes unavailable due to a failure, services can quickly switch to the Standby Cluster.

Key Component High Availability Mechanisms

etcd

Deployed on three (or five) control plane nodes
Uses the Raft consensus protocol for leader election and data replication
Three-node deployments tolerate up to one node failure; five-node deployments tolerate up to two
Supports snapshot backups to local storage and to remote S3 storage
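The failure-tolerance figures above follow from Raft's majority quorum: a cluster of n members needs floor(n/2) + 1 of them to elect a leader and commit writes, so it can lose the remaining members and keep working. The small helper below makes that arithmetic explicit and shows why an even member count adds no tolerance over the next smaller odd count.

```python
def quorum(members: int) -> int:
    """Majority of members Raft needs to elect a leader and commit entries."""
    if members < 1:
        raise ValueError("members must be >= 1")
    return members // 2 + 1

def tolerated_failures(members: int) -> int:
    """Members that can fail while the cluster still has a quorum."""
    return members - quorum(members)

for n in (1, 3, 4, 5):
    print(n, quorum(n), tolerated_failures(n))
# 1 1 0
# 3 2 1
# 4 3 1
# 5 3 2
```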

Monitoring Components

Prometheus: Multiple instances, deduplication with Thanos Query, and cross-region redundancy
VictoriaMetrics: Cluster mode with distributed VMStorage, VMInsert, and VMSelect components
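Thanos Query deduplicates series from redundant Prometheus instances whose labels are identical except for a replica label. The sketch below is a deliberately simplified model of that idea, not Thanos's actual algorithm, which uses its own heuristics to choose between replicas and handle gaps. The label name `replica` and the sample data are assumptions for illustration.

```python
from collections import defaultdict

def dedup(series, replica_label="replica"):
    """Merge series whose label sets differ only by `replica_label`.

    `series` maps a frozenset of (label, value) pairs to {timestamp: value}.
    Simplification: for each timestamp, the first replica (by name) that has
    a sample wins, and other replicas only fill its gaps.
    """
    groups = defaultdict(list)
    for labels, samples in series.items():
        replica = dict(labels).get(replica_label, "")
        key = frozenset((k, v) for k, v in labels if k != replica_label)
        groups[key].append((replica, samples))

    merged = {}
    for key, members in groups.items():
        members.sort(key=lambda m: m[0])  # deterministic replica order
        out = {}
        for _, samples in members:
            for ts, value in samples.items():
                out.setdefault(ts, value)  # keep existing sample, fill gaps only
        merged[key] = dict(sorted(out.items()))
    return merged

# Two hypothetical Prometheus replicas scraping the same target.
a = frozenset({("__name__", "up"), ("job", "alb"), ("replica", "prom-0")})
b = frozenset({("__name__", "up"), ("job", "alb"), ("replica", "prom-1")})
result = dedup({a: {10: 1.0, 20: 1.0}, b: {10: 1.0, 20: 0.0, 30: 1.0}})
print(len(result))  # 1
print(next(iter(result.values())))  # {10: 1.0, 20: 1.0, 30: 1.0}
```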

Logging Components

Nevermore collects logs and audit data
Kafka / Elasticsearch / Razor / Lanaya are deployed in distributed and multi-replica modes

Networking Components (CNI)

Kube-OVN / Calico / Flannel: Achieve HA via stateless DaemonSets or triple-replica control plane components

ALB

Operator deployed with three replicas, leader election enabled
Instance-level health checks and load balancing

Self-built VIP

High-availability virtual IP based on Keepalived
Supports heartbeat detection and active-standby failover
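Keepalived implements VRRP, in which the node with the highest priority holds the virtual IP and a backup takes over when the holder's periodic advertisements stop arriving. The sketch below models that decision in a simplified, stateless way. The `Node` fields, the 3-second timeout, and the -20 health-check penalty are hypothetical values, and real VRRP details (preemption settings, advertisement timers, gratuitous ARP) are omitted. Ties on priority go to the higher IP address, as VRRP specifies.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Optional

@dataclass
class Node:
    name: str
    ip: IPv4Address
    priority: int            # configured VRRP priority
    last_heartbeat: float    # time (s) the node's last advertisement was seen
    check_ok: bool = True    # result of a hypothetical health-check script
    check_weight: int = -20  # hypothetical priority penalty on a failed check

def effective_priority(node: Node) -> int:
    return node.priority + (0 if node.check_ok else node.check_weight)

def vip_holder(nodes: list[Node], now: float, dead_after: float = 3.0) -> Optional[str]:
    """Return the node that should hold the VIP, or None if all are silent."""
    alive = [n for n in nodes if now - n.last_heartbeat <= dead_after]
    if not alive:
        return None
    # Highest effective priority wins; ties go to the higher IP, as in VRRP.
    return max(alive, key=lambda n: (effective_priority(n), n.ip)).name

nodes = [
    Node("cp-1", IPv4Address("10.0.0.11"), 150, last_heartbeat=100.0),
    Node("cp-2", IPv4Address("10.0.0.12"), 100, last_heartbeat=100.0),
    Node("cp-3", IPv4Address("10.0.0.13"), 100, last_heartbeat=100.0),
]
print(vip_holder(nodes, now=101.0))  # cp-1
nodes[0].last_heartbeat = 90.0  # cp-1 stops advertising
print(vip_holder(nodes, now=101.0))  # cp-3: priorities tie, higher IP wins
```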

Harbor

ALB-based load balancing
PostgreSQL with Patroni HA
Redis Sentinel mode
Stateless services deployed in multiple replicas

Registry and MinIO

Registry deployed with three replicas
MinIO in distributed mode with erasure coding, data redundancy, and automatic recovery
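Erasure coding splits each object into data shards and parity shards spread across the drives of an erasure set, and any collection of shards equal in number to the data shards can rebuild the object. The arithmetic below sketches what that means for read availability and usable capacity; the 16-drive, 4-parity layout is a hypothetical example, and MinIO's write-quorum rules are not modeled.

```python
from fractions import Fraction

def erasure_profile(drives: int, parity: int) -> dict:
    """Summarize an erasure set of `drives` shards, `parity` of them parity.

    Each object is split into K = drives - parity data shards plus `parity`
    parity shards; any K shards suffice to reconstruct it.
    """
    if not 0 < parity <= drives // 2:
        raise ValueError("parity must be between 1 and half the drive count")
    data = drives - parity
    return {
        "data_shards": data,
        "parity_shards": parity,
        "drive_failures_tolerated_for_reads": parity,
        "usable_fraction": Fraction(data, drives),
    }

p = erasure_profile(16, 4)
print(p["drive_failures_tolerated_for_reads"], p["usable_fraction"])  # 4 3/4
```

More parity shards raise fault tolerance at the cost of usable capacity.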
