Core Trading System Benchmarking and Simulation Platform for Capacity Planning at a Stock Exchange
Industry
Capital Markets
Technologies
Angular, Redis, Golang, Kubernetes
About Client

A leading stock exchange responsible for managing billions in daily equity and derivatives transactions. The client operates in an extremely high-frequency trading environment where system performance, scalability, and stability are mission-critical — especially during peak trading windows.

Problem Statement

With increasing trading volumes and heightened volatility driven by algorithmic trading, the client needed a robust solution to:

  • Accurately simulate and replay real-world trading loads.
  • Perform capacity planning and identify potential bottlenecks in the trading infrastructure.
  • Stress test their trading system components, from order management to network throughput.
  • Analyze scalability and latency across end-to-end trade flows.

Legacy benchmarking tools could not simulate realistic high-load scenarios or provide fine-grained analysis of system behaviour under stress, and they were not equipped for modern market dynamics, especially the bursty nature of algo-driven trading and the concurrent flow of equity and derivatives orders.

Additional challenges included:

  • Receiving and processing hundreds of thousands of trade messages per second over raw TCP.
  • Ensuring real-time visualization and rule-based alerting for the client team.
  • Enforcing strict network isolation with zero reliance on public cloud or internet connectivity.
Oneture's Role

Oneture partnered with the client to design and build a Custom Benchmarking Simulation Platform — combining our expertise in Capital Markets, high-performance computing, and real-time systems.

Oneture designed and deployed a fully air-gapped Kubernetes cluster optimized for concurrent, low-latency trade ingestion and multi-asset class (Equity + Derivatives) support. We built all services in Golang for performance and deployed a scalable, pod-based TCP ingestion system inside the cluster. 

Key features of our engagement included:

  • Co-designing the platform architecture in close collaboration with the client’s technology and infrastructure teams.
  • Building a future-proof platform that allows the client full code ownership under a Build-Operate-Transfer (BOT) model.
  • Leveraging cloud-native capabilities for faster time-to-market while validating hardware feasibility for eventual production deployment.
  • Developing both load generation and time-warped historical order replay capabilities to simulate real-world trading peaks.
Solution

Time-Warping of Historical Order Data

  • Collected historical trade order data and applied "time-warping" algorithms to compress or expand timestamps.
  • Enabled realistic simulation of trading days under various volatility conditions.
  • Supported three replay modes: as-is, time-compressed (to stress test system limits), and time-expanded (to simulate prolonged high-load conditions).
  • Maintained data integrity while handling edge cases such as zero time deltas (handled explicitly in the sketch below).
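
To make the mechanics concrete, here is a minimal Go sketch of the time-warping idea, assuming a simplified Order struct and a single scalar warp factor (both hypothetical; the production schema and algorithms are richer). Inter-arrival deltas are scaled by the factor, and zero or negative deltas are clamped so ordering stays intact.

```go
package main

import (
	"fmt"
	"time"
)

// Order is a hypothetical, simplified representation of a historical
// trade order; the real platform's schema is richer.
type Order struct {
	ID string
	TS time.Time
}

// WarpTimestamps rescales the inter-arrival gaps between consecutive
// orders: factor < 1 compresses the day (stress test), factor > 1
// expands it (prolonged high load), factor == 1 replays as-is.
// Zero or negative deltas are clamped to zero so ordering is preserved.
func WarpTimestamps(orders []Order, factor float64) []Order {
	if len(orders) == 0 {
		return orders
	}
	warped := make([]Order, len(orders))
	warped[0] = orders[0]
	for i := 1; i < len(orders); i++ {
		delta := orders[i].TS.Sub(orders[i-1].TS)
		if delta < 0 {
			delta = 0 // guard against out-of-order source data
		}
		scaled := time.Duration(float64(delta) * factor)
		warped[i] = orders[i]
		warped[i].TS = warped[i-1].TS.Add(scaled)
	}
	return warped
}

func main() {
	base := time.Date(2024, 1, 2, 9, 15, 0, 0, time.UTC)
	orders := []Order{
		{"A", base},
		{"B", base.Add(100 * time.Millisecond)},
		{"C", base.Add(100 * time.Millisecond)}, // zero-delta edge case
		{"D", base.Add(400 * time.Millisecond)},
	}
	for _, o := range WarpTimestamps(orders, 0.5) { // 2x compression
		fmt.Println(o.ID, o.TS.Format("15:04:05.000"))
	}
}
```

Running this with a factor of 0.5 replays the same sequence at twice the original speed; a factor of 2.0 stretches it into a prolonged high-load run.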

Massive Load Generation

  • Successfully simulated loads of up to XXX million order entries per second.
  • Developed custom code to generate controlled, incremental load scenarios, enabling soak testing and stepwise capacity testing (the ramp pattern is sketched after this list).
  • Conducted detailed network throughput feasibility analysis:
    • Validated machine configurations (e.g. 15–20 Gbps network capacity) to achieve target loads.
    • Optimized packet sizes and TCP configuration for maximal throughput.
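
A hedged Go sketch of the stepwise ramp pattern is below; the sendOrder callback, rates, and step sizes are placeholders, and a real generator at millions of orders per second would batch multiple sends per tick rather than ticking once per order.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// rampLoad steps the send rate from start to max orders/sec in fixed
// increments, holding each step for holdFor. This is the shape of a
// stepwise capacity test; sendOrder is a stand-in for the real
// order-submission path. At production rates a generator would batch
// sends per tick; one tick per order is used here for clarity.
func rampLoad(ctx context.Context, start, max, step int, holdFor time.Duration, sendOrder func()) {
	for rate := start; rate <= max; rate += step {
		fmt.Printf("holding %d orders/sec\n", rate)
		tick := time.NewTicker(time.Second / time.Duration(rate))
		deadline := time.Now().Add(holdFor)
		for time.Now().Before(deadline) {
			select {
			case <-ctx.Done():
				tick.Stop()
				return
			case <-tick.C:
				sendOrder()
			}
		}
		tick.Stop()
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	sent := 0
	rampLoad(ctx, 100, 300, 100, time.Second, func() { sent++ })
	fmt.Println("orders sent:", sent)
}
```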

Cloud-Native Architecture

  • Built a base version on cloud infrastructure for rapid prototyping and validation.
  • Designed the platform to integrate with client’s on-premise systems for production deployments.
  • Ensured full compliance with client’s data security, confidentiality, and regulatory requirements.

Granular Observability

  • Incorporated real-time metrics, dashboards, and logs using Prometheus, Grafana, and Loki.
  • Provided full visibility into system performance, transaction latencies, and pod-level resource utilization.
  • Enabled historical playback and forensic analysis of any test run.

The architecture centers on a Kubernetes (RKE2) cluster running on RHEL 9, with one control-plane node and 10 worker nodes. Key technical components:

a. Golang-based TCP Ingestion System

  • Developed a fleet of TCP client pods using Golang’s net package and goroutines to handle thousands of concurrent TCP connections.
  • Each pod runs an event-driven listener for trade orders from external Processing Engines over Layer 4.
  • Used Goroutine pools, channel buffering, and context-based cancellation for fault-tolerant message processing.
  • Orders are classified in real time into Equity and Derivatives segments, parsed, and pushed to the Order Processing Engine (see the sketch below).
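
The following condensed Go sketch shows the ingestion pattern: accepted connections are handled concurrently, each message is classified by segment, and a buffered channel with context-based cancellation hands orders downstream. The newline-delimited protocol and the "EQ|"/"DR|" prefixes are illustrative assumptions rather than the exchange's wire format, and the goroutine-per-connection model here stands in for the pooled design used in production.

```go
package main

import (
	"bufio"
	"context"
	"log"
	"net"
	"strings"
)

// ingest accepts TCP connections and fans incoming order lines out to a
// buffered channel until the context is cancelled.
func ingest(ctx context.Context, addr string, orders chan<- string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	go func() { <-ctx.Done(); ln.Close() }() // unblock Accept on shutdown
	for {
		conn, err := ln.Accept()
		if err != nil {
			if ctx.Err() != nil {
				return nil // clean shutdown
			}
			return err
		}
		go handle(ctx, conn, orders)
	}
}

func handle(ctx context.Context, conn net.Conn, orders chan<- string) {
	defer conn.Close()
	sc := bufio.NewScanner(conn)
	for sc.Scan() {
		msg := sc.Text()
		// Classify by trading segment before handing off downstream.
		segment := "unknown"
		switch {
		case strings.HasPrefix(msg, "EQ|"):
			segment = "equity"
		case strings.HasPrefix(msg, "DR|"):
			segment = "derivatives"
		}
		select {
		case orders <- segment + ":" + msg:
		case <-ctx.Done():
			return
		}
	}
	if err := sc.Err(); err != nil {
		log.Printf("conn %s: %v", conn.RemoteAddr(), err)
	}
}

func main() {
	orders := make(chan string, 1024) // buffered hand-off channel
	go func() {
		for o := range orders {
			log.Println("received", o)
		}
	}()
	if err := ingest(context.Background(), ":9000", orders); err != nil {
		log.Fatal(err)
	}
}
```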

b. Kubernetes-Native Pod Scaling

  • Deployed TCP clients as StatefulSets for sticky sessions and horizontal scalability.
  • Configured custom resource limits and affinity rules to isolate workloads by trading segment.
  • Used native Kubernetes autoscaling (HPA + custom metrics) to spin up new client pods during trading peaks.
  • The cluster enabled segregation of traffic flows per trader ID, preventing noisy-neighbor problems during high volatility.

c. Real-Time Monitoring and Visualization

  • Prometheus scrapes pod-level metrics and node-level resource usage (a minimal instrumentation sketch follows this list).
  • Grafana dashboards give the client team immediate visibility into order volumes, message latency, and pod saturation.
  • Loki aggregates structured logs for every incoming connection and parsed order — useful for historical playback and debugging.
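
As an illustration of the metrics side, here is a minimal Go instrumentation sketch using the Prometheus client library; the metric names, labels, and histogram buckets are invented for the example rather than the platform's actual schema.

```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Orders ingested, broken down by trading segment.
	ordersIngested = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "orders_ingested_total",
		Help: "Orders ingested, by trading segment.",
	}, []string{"segment"})

	// Per-order ingestion latency, in seconds.
	ingestLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "order_ingest_latency_seconds",
		Help:    "Per-order ingestion latency.",
		Buckets: prometheus.ExponentialBuckets(50e-6, 2, 12), // 50µs to ~100ms
	})
)

func main() {
	// Simulated ingestion loop standing in for the real TCP pipeline.
	go func() {
		for {
			start := time.Now()
			time.Sleep(time.Duration(rand.Intn(200)) * time.Microsecond)
			ordersIngested.WithLabelValues("equity").Inc()
			ingestLatency.Observe(time.Since(start).Seconds())
		}
	}()
	// Prometheus scrapes this endpoint; Grafana dashboards sit on top.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```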

Value Delivered

  • Enabled the client to simulate high-stress scenarios and validate system stability before production rollout.
  • Helped identify performance bottlenecks early, allowing targeted infrastructure upgrades.
  • Reduced business risk during volatile trading days by ensuring system resilience.
  • Provided a reusable platform for ongoing capacity planning and future scalability studies.
  • Delivered full knowledge transfer and platform ownership to the client under the BOT model.
  • Achieved sub-millisecond trade ingestion latency even at peak loads of XXX orders/sec.
  • Eliminated throughput bottlenecks through seamless horizontal scaling of TCP pods.
  • Provided full observability into market flows and TCP streams, with real-time alerting via Grafana and no dependencies on external tools or SaaS.
  • Enabled client teams to correlate Equity and Derivatives order behaviour in near real time, essential for regulatory alerts and market integrity.

Technologies Used

Kubernetes (RKE2), Golang, TCP, Prometheus, Grafana, Loki, Angular, Redis

Lessons Learned

Golang’s low-level socket handling and efficient memory management made it ideal for building low-latency TCP clients.

Kubernetes' native autoscaling and self-healing features were critical for supporting highly volatile trading volumes.

Observability (often overlooked) proved essential for gaining trust from client IT teams and for audit compliance.

Custom Kubernetes controllers (built using controller-runtime) gave us flexibility to manage pod lifecycle based on trading hours and trader load patterns.
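
As a rough illustration of that pattern, the sketch below uses controller-runtime to resize a hypothetical TCP-client StatefulSet based on time of day; the policy, market hours, and replica counts are invented for the example.

```go
package main

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// TradingHoursReconciler resizes a TCP-client StatefulSet so that more
// ingestion pods run during (illustrative) market hours.
type TradingHoursReconciler struct {
	client.Client
}

func (r *TradingHoursReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var sts appsv1.StatefulSet
	if err := r.Get(ctx, req.NamespacedName, &sts); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Illustrative policy: 10 replicas during market hours, 2 otherwise.
	replicas := int32(2)
	if h := time.Now().Hour(); h >= 9 && h < 16 {
		replicas = 10
	}
	if sts.Spec.Replicas == nil || *sts.Spec.Replicas != replicas {
		sts.Spec.Replicas = &replicas
		if err := r.Update(ctx, &sts); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{RequeueAfter: time.Minute}, nil
}

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{})
	if err != nil {
		panic(err)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.StatefulSet{}).
		Complete(&TradingHoursReconciler{Client: mgr.GetClient()}); err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```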

TCP in Kubernetes is viable at scale, but it requires careful tuning of pod networking, socket reuse, and client-server handshake design.

Time-Warping Provides Realistic Scenarios: Simple replay of historical data is not enough — time-warping techniques were essential to simulate extreme volatility and stress conditions.

Scaling TCP at High Concurrency Needs Deep Tuning: Achieving a stable load of XXX million orders per second required careful design of TCP socket handling, buffer management, and connection pooling.

Network Throughput Is a Key Constraint: Many assumed system limits came from CPU or storage, but network bandwidth was often the first saturation point; precise throughput modelling and testing were necessary.
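
A back-of-the-envelope version of that throughput modelling, in Go, with placeholder message size and order rate rather than the client's real figures:

```go
package main

import "fmt"

// Feasibility check of the kind used to validate the 15–20 Gbps machine
// configurations. Message size and rate are illustrative placeholders.
func main() {
	const (
		msgBytes   = 256       // assumed average order message size
		ratePerSec = 5_000_000 // assumed target order rate
		nicGbps    = 20.0      // provisioned network capacity
	)
	requiredGbps := float64(msgBytes*ratePerSec) * 8 / 1e9
	fmt.Printf("required: %.1f Gbps of %.0f Gbps available (%.0f%% utilization)\n",
		requiredGbps, nicGbps, 100*requiredGbps/nicGbps)
}
```

Doubling either the assumed message size or the order rate in this model pushes the link past capacity, which is why network bandwidth, rather than CPU or storage, was often the first saturation point.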

Observability Drives Confidence: Building in full telemetry — including metrics, logs, and visual dashboards — not only accelerated debugging but also helped client stakeholders build trust in the platform.

Cloud as a Proving Ground: Using cloud infrastructure for early-stage feasibility allowed faster iteration without disrupting production environments while still validating hardware sizing for on-premise rollout.

BOT Model Ensures Long-Term Client Ownership: Structuring the engagement with Build-Operate-Transfer allowed for full client ownership and IP transfer, ensuring sustainability beyond initial delivery.