Building High-Performance, High-Concurrency, and High-Availability Systems: A Full-Stack Developer's Journey

By MASON Joey (https://x.com/JoeyJoeMA)
Introduction
Throughout my career as a full-stack developer, I have deeply realized that building scalable web applications is a multidimensional challenge, much like constructing a complex skyscraper. It requires not only a robust frontend interface but also a resilient and flexible backend infrastructure. The key to success lies in a solid foundation across the tech stack, meticulous planning, continuous monitoring, and proactive maintenance. For any successful web application—especially those aiming for large scale—high performance, high concurrency, and high availability are the three critical pillars. In this article, I share my experiences tackling these challenges and the practical solutions I've implemented.
The evolution of software development is a constant battle with complexity. Complexity can be divided into business complexity and technical complexity. Business complexity mainly involves modeling the real world through abstraction and design, while technical complexity revolves around solving the "three highs": high performance, high concurrency, and high availability. Consumer-facing (C-end) businesses usually prioritize technical complexity to handle massive user traffic and ensure a smooth experience; enterprise (B-end) or merchant (M-end) systems focus more on modeling complex business logic, though scaling also brings technical challenges. This article spans both areas, sharing my "three highs" experience in C-end user interfaces and B/M-end backend systems, combined with real-world logistics platform practices.
1. Foundation: Code Organization, Architecture, and System Understanding
Building scalable applications—whether focused on frontend user experience or backend data processing—starts with a well-thought-out foundation, including code organization, architectural patterns, and understanding system types.
1.1 Frontend Project Structure
A well-organized frontend codebase is the first step toward maintainable and high-performance applications. It supports efficient team collaboration and simplifies onboarding for new developers. For Next.js projects, I typically organize the source code directory as follows to logically separate concerns:
// Project structure - frontend perspective
src/
├── components/
│ ├── common/ # Highly reusable components (e.g., buttons, inputs, modals)
│ ├── features/ # Feature-specific components (e.g., login forms, product cards)
│ └── layouts/ # Components defining page layouts (e.g., app layout, auth layout)
├── hooks/ # Custom React hooks encapsulating logic (e.g., useAuth, useFetch)
├── lib/ # Utility functions and helper modules (e.g., formatting, validation)
├── pages/ # Next.js pages as routing entry points
├── services/ # API service layer handling backend communication
├── store/ # State management (e.g., Zustand, Redux)
└── types/ # TypeScript type definitions for clarity and safety
This structure promotes modularity and helps developers quickly locate code.
1.2 Feature-First Architecture (Frontend and Backend)
Organizing code by feature rather than by technical type (i.e., folders that lump together all components, all services, or all hooks) often improves maintainability in large applications. This principle applies to both frontend code and backend microservices.
// features/auth/components/LoginForm.tsx - Frontend feature example
import { useAuth } from '../hooks/useAuth' // Feature-specific hook
import { useForm } from '../../../hooks/useForm' // Common hook
export function LoginForm() {
// The useAuth hook encapsulates login logic, possibly interacting with backend auth services
const { login, isLoading, error } = useAuth()
// The useForm hook manages form state and submission
const { handleChange, handleSubmit, values, errors } = useForm({
initialValues: { email: '', password: '' },
// onSubmit calls the feature-specific login function
onSubmit: async (values) => {
console.log('Submitting login:', values.email);
await login(values.email, values.password);
},
validate: (values) => {
const errors: any = {};
if (!values.email) errors.email = 'Email is required';
if (!values.password) errors.password = 'Password is required';
return errors;
}
});
return (
<form onSubmit={handleSubmit} className="space-y-4">
<div>
<label htmlFor="email">Email:</label>
<input
id="email"
name="email"
type="email"
value={values.email}
onChange={handleChange}
className={`border p-2 w-full ${errors.email ? 'border-red-500' : ''}`}
/>
{errors.email && <p className="text-red-500 text-sm">{errors.email}</p>}
</div>
<div>
<label htmlFor="password">Password:</label>
<input
id="password"
name="password"
type="password"
value={values.password}
onChange={handleChange}
className={`border p-2 w-full ${errors.password ? 'border-red-500' : ''}`}
/>
{errors.password && <p className="text-red-500 text-sm">{errors.password}</p>}
</div>
{error && <p className="text-red-500 text-sm">{error}</p>}
<button type="submit" disabled={isLoading} className="bg-blue-500 text-white p-2 rounded w-full disabled:opacity-50">
{isLoading ? 'Logging in...' : 'Login'}
</button>
</form>
);
}
On the backend, this translates to defining clear service boundaries based on business domains (e.g., auth-service, order-service, product-service). This approach is often guided by Domain-Driven Design (DDD), enhancing maintainability and scalability. In logistics platform development, DDD helps build complex backend systems around core business capabilities (such as "order fulfillment," "transportation," or "inventory"). For example, we divide the system into product domain, order domain, payment/settlement domain, and fulfillment domain, each defined by business processes (e.g., merchant order placement, courier pickup, user delivery confirmation).
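To make these boundaries concrete, here is a minimal sketch of bounded-context contracts for the domains named above. All interface and method names are illustrative assumptions for this sketch, not the platform's real APIs:
// domains/contracts.ts - illustrative domain boundaries (names are assumptions)
type OrderId = string;

interface ProductDomain {
  // Virtual products such as delivery services
  getService(skuId: string): Promise<{ skuId: string; name: string; price: number }>;
}

interface OrderDomain {
  placeOrder(merchantId: string, skuId: string): Promise<OrderId>;
  updateStatus(orderId: OrderId, status: 'created' | 'picked_up' | 'delivered'): Promise<void>;
}

interface FulfillmentDomain {
  assignCourier(orderId: OrderId): Promise<void>;
  confirmDelivery(orderId: OrderId): Promise<void>;
}
Keeping each contract inside its own domain means a change to fulfillment logic cannot silently leak into ordering or payment code.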
1.3 Understanding System Types
Understanding the characteristics of different system types is key to foundational planning and determines how to address performance, concurrency, and availability.
- Online systems: Characterized by real-time request-response interactions, where low latency (response time) is critical, such as fetching user profiles, placing orders, or searching for products.
- Offline systems: Also known as batch processing systems, these handle large data jobs on a schedule. Throughput (the amount of data processed per unit time) is the key metric; examples include generating daily reports, data migration, and analytics jobs (e.g., daily order volume, monthly active users).
- Near real-time systems: Process data streams with low latency, reacting to events shortly after they occur. These are event-driven architectures and stream processing systems, such as handling sensor data, pushing real-time notifications, or updating search indexes based on database changes.
Each system type requires different architectural patterns, resource allocation, and optimization strategies. In logistics platforms, online systems handle real-time order creation, offline systems generate daily transport reports, and near real-time systems use message queues (like JMQ, Kafka) to update caches or notify users.
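As a concrete illustration of the near real-time pattern, here is a minimal sketch of a consumer that refreshes a cache from a change-event stream. It assumes the kafkajs and ioredis libraries; the topic and key names are illustrative:
// cache-updater.ts - near real-time sketch (assumes kafkajs and ioredis)
import { Kafka } from 'kafkajs';
import Redis from 'ioredis';

const kafka = new Kafka({ clientId: 'cache-updater', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'cache-updater-group' });
const redis = new Redis();

async function run() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'order-changes', fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ message }) => {
      if (!message.value) return;
      const event = JSON.parse(message.value.toString());
      // Refresh the cached order so online reads see the change within seconds
      await redis.set(`order:${event.orderId}`, JSON.stringify(event.order), 'EX', 3600);
    },
  });
}

run().catch(console.error);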
2. High Performance: Multi-Layered Acceleration Methods
High performance ensures that applications respond quickly on both the client and server sides. Performance bottlenecks can occur in frontend rendering, database queries, or network latency. The three main factors affecting performance are computation (e.g., complex logic, full GC), communication (e.g., slow downstream services), and storage (e.g., large tables, slow SQL, suboptimal ES shard settings). Optimization is approached from both read and write perspectives, combining frontend and backend techniques.
2.1 Frontend Performance Optimization
Frontend optimization directly impacts user experience and perceived performance.
Code Splitting and Bundle Optimization
Large JavaScript bundles slow down page loads. Code splitting breaks the main bundle into smaller, on-demand chunks, significantly improving initial load times. Next.js automatically handles page-level splitting, but explicit dynamic imports are useful for specific components.
// pages/dashboard.tsx - Frontend code splitting example
import dynamic from 'next/dynamic'
import LoadingSpinner from '../components/common/LoadingSpinner'; // Common loading component
// Lazily load non-critical or resource-intensive components
const Analytics = dynamic(() => import('../components/features/Analytics'), {
loading: () => <LoadingSpinner />, // Show animation while loading
ssr: false // Client-side only
});
const Reports = dynamic(() => import('../components/features/Reports'), {
loading: () => <LoadingSpinner />,
ssr: false
});
const SettingsPanel = dynamic(() => import('../components/features/SettingsPanel'), {
loading: () => <LoadingSpinner />,
ssr: false
});
export default function Dashboard() {
return (
<div className="p-6">
<h1 className="text-2xl font-bold mb-4">Dashboard Overview</h1>
<div className="grid grid-cols-1 md:grid-cols-2 gap-6">
<Analytics />
<Reports />
<SettingsPanel />
</div>
</div>
);
}
Use Webpack Bundle Analyzer to visualize bundle contents and identify optimization areas. Minimizing dependencies and choosing lightweight libraries also improves performance.
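A minimal sketch of wiring the analyzer into a Next.js project, assuming the @next/bundle-analyzer package, enabled only when an ANALYZE flag is set:
// next.config.js - bundle analysis sketch (assumes @next/bundle-analyzer)
const withBundleAnalyzer = require('@next/bundle-analyzer')({
  enabled: process.env.ANALYZE === 'true', // opt in via ANALYZE=true next build
});

module.exports = withBundleAnalyzer({
  // ...your existing Next.js config
});
Running ANALYZE=true next build then produces an interactive treemap of each bundle, making oversized dependencies obvious.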
Image and Asset Optimization
Images are often the heaviest resources on a web page. Optimization includes choosing appropriate formats (WebP, AVIF), compressing images, using responsive images (picture or srcset), and lazy loading. Next.js's Image component implements these best practices automatically.
// components/OptimizedImage.tsx - Frontend image optimization example
import Image from 'next/image'
import { useState } from 'react'
interface OptimizedImageProps {
src: string;
alt: string;
width?: number;
height?: number;
className?: string;
[key: string]: any; // Allow other props
}
export function OptimizedImage({ src, alt, className, ...props }: OptimizedImageProps) {
const [isLoading, setIsLoading] = useState(true);
return (
<div className={`relative overflow-hidden ${className || ''}`}>
<Image
src={src}
alt={alt}
{...props}
onLoadingComplete={() => setIsLoading(false)}
className={
`duration-700 ease-in-out ${isLoading ? 'scale-110 blur-2xl grayscale' : 'scale-100 blur-0 grayscale-0'}`
}
/>
{isLoading && (
<div className="absolute inset-0 bg-gray-200 animate-pulse" />
)}
</div>
);
}
2.2 Backend Performance Optimization
Backend performance determines how quickly the server processes requests, limited by computation, communication (network calls to other services or databases), and storage access.
Database Optimization: The Performance Pillar
The database is often the backend bottleneck. Efficient schema design and query optimization are crucial.
- Indexes: Create indexes on frequently queried columns to speed up reads. Use SQL's EXPLAIN to analyze query plans and identify missing indexes or inefficient patterns.
- Schema design: Choose appropriate data types, normalize or denormalize based on access patterns, and avoid large tables.
- Query optimization: Write efficient queries, minimize data scans, use joins effectively, and avoid N+1 problems (e.g., ORM preloading).
// services/database.ts - Database query example (using Prisma ORM)
import { PrismaClient } from '@prisma/client';
const prisma = new PrismaClient(); // Reuse one client per process; per-request connect/disconnect wastes connections
export async function getOptimizedUserData(userId: string) { // Fetch optimized user data
try {
const user = await prisma.user.findUnique({
where: { id: userId },
include: {
profile: true, // Preload user profile
posts: {
take: 10, // Limit number of posts
orderBy: { createdAt: 'desc' }, // Sort by creation date
include: { comments: true } // Preload comments
}
}
});
return user;
} catch (error) {
console.error("Error fetching optimized user data:", error);
throw error;
}
}
For large-scale systems, large tables or inefficient SQL (e.g., missing indexes, complex joins) degrade performance. MySQL's EXPLAIN or Elasticsearch query analysis tools can diagnose issues. For Elasticsearch, optimizing shard size, shard count, and index strategies is critical.
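As a hedged sketch of the shard-sizing point, this is what setting shard and replica counts at index creation can look like with the @elastic/elasticsearch client. The index name and counts are illustrative; a common rule of thumb is to keep each shard in the tens of gigabytes:
// elasticsearch-index.ts - shard settings sketch (names and counts are illustrative)
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

async function createOrdersIndex() {
  await client.indices.create({
    index: 'orders-v1',
    settings: {
      number_of_shards: 3,   // fixed at creation time, so plan for data growth
      number_of_replicas: 1, // one replica per primary for redundancy and read capacity
    },
  });
}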
Read Optimization: Caching and Database Strategies
Caching is the most effective way to improve read performance, but must be combined with the database for reliability. The caching strategy depends on whether the system is read-heavy or write-heavy.
- Read-heavy systems: After synchronously updating the database, invalidate the cache. Reads fetch from the database and repopulate the cache, ensuring consistency and performance. For example, in a retail platform, product details are cached in Redis, with MySQL as the data source; updates invalidate the cache.
// Pseudocode: Read-heavy system cache-database sync
async function updateProduct(productId, productData) {
await db.update('products', productId, productData); // Sync update to DB
await redis.del(`product:${productId}`); // Delete cache
}
async function getProduct(productId) {
let product = await redis.get(`product:${productId}`); // Try cache
if (product) return product;
product = await db.get('products', productId); // Cache miss, fetch from DB
await redis.set(`product:${productId}`, product, { EX: 3600 }); // Repopulate cache, 1h expiry
return product;
}
- Write-heavy systems: Synchronously update the cache, asynchronously update the database. The cache absorbs write traffic, and a queue ensures eventual DB consistency. For example, in logistics order storage, Redis handles instant writes, and a queue asynchronously updates the database.
// Pseudocode: Write-heavy system cache-database async update
async function createOrder(orderData) {
const orderId = generateOrderId();
await redis.set(`order:${orderId}`, orderData, { EX: 86400 }); // Cache for 1 day
await sendMessageToQueue('order_update_queue', { orderId, orderData }); // Async DB update
return { orderId, message: 'Order created' };
}
async function processOrderUpdate(message) {
const { orderId, orderData } = message;
try {
await db.insert('orders', orderId, orderData);
console.log(`Order ${orderId} persisted to DB`);
} catch (error) {
console.error(`Failed to persist order ${orderId}:`, error);
// Log error, send to dead-letter queue, or trigger alert
}
}
Choose a strategy based on read/write patterns and consistency requirements. Read-heavy systems prioritize DB consistency; write-heavy systems leverage cache for performance.
Write Optimization: Asynchronous Processing
For traffic spikes (e.g., e-commerce flash sales), synchronously processing complex logic (like inventory deduction, payment) can overwhelm the system. Asynchronous processing decouples the request-response loop via message queues. In flash sale scenarios, the order API quickly validates requests, stores them in cache (e.g., Redis for inventory), pushes to a queue (e.g., JMQ), and worker processes handle inventory deduction, payment, and notifications, ensuring fast API responses.
// Pseudocode: Flash sale async order processing
async function placeSeckillOrder(requestData) {
if (!isValid(requestData)) return { status: 400, message: 'Invalid request' }; // Basic validation
const skuId = requestData.skuId;
const stock = await redis.get(`stock:${skuId}`); // Quick stock check
if (!stock || stock <= 0) return { status: 409, message: 'Out of stock' };
const orderId = await db.createPendingOrder(requestData); // Create pending order
await sendMessageToQueue('seckill_order_queue', { orderId, skuId, ...requestData }); // Push to queue
return { status: 202, orderId, message: 'Order accepted, processing...' }; // Fast response
}
async function processSeckillOrder(message) {
const { orderId, skuId, userId } = message;
try {
const remaining = await redis.decr(`stock:${skuId}`); // Deduct stock atomically
if (remaining < 0) {
await redis.incr(`stock:${skuId}`); // Restore the over-deducted unit
await db.updateOrderStatus(orderId, 'failed', 'Out of stock');
return;
}
await processPayment(orderId); // Handle payment
await db.updateOrderStatus(orderId, 'confirmed'); // Update order status
await sendSMS(userId, `Order ${orderId} confirmed, please pay.`); // Send notification
} catch (error) {
console.error(`Failed to process order ${orderId}:`, error);
await db.updateOrderStatus(orderId, 'failed', error.message);
}
}
This approach uses queues to smooth traffic spikes, ensuring the system efficiently handles surges while maintaining low latency.
3. High Concurrency: Handling Multiple Users Simultaneously
Concurrency is the system's ability to handle multiple requests or user interactions at the same time. A system may perform well for a single user but crash under thousands of concurrent users. Achieving high concurrency requires both single-machine performance optimization and cluster scaling: the former improves processing speed, the latter increases throughput via three-dimensional scaling.
3.1 Real-Time Features and Frontend Concurrency
For real-time features (e.g., chat, live data streams, collaboration tools), the frontend must efficiently manage concurrent connections for a responsive experience.
WebSocket Implementation
WebSocket provides persistent, bidirectional communication between client and server, ideal for real-time features. The client must reliably manage connection state, retries, and message handling.
// hooks/useWebSocket.ts - Frontend WebSocket hook example
import { useEffect, useRef, useCallback, useState } from 'react';
interface UseWebSocketOptions {
onOpen?: (event: Event) => void; // Connection open callback
onMessage?: (event: MessageEvent) => void; // Message received callback
onError?: (event: Event) => void; // Error callback
onClose?: (event: CloseEvent) => void; // Connection closed callback
reconnectAttempts?: number; // Number of reconnect attempts
reconnectInterval?: number; // Reconnect interval (ms)
}
export function useWebSocket(url: string, options?: UseWebSocketOptions) {
const {
onOpen, onMessage, onError, onClose,
reconnectAttempts = 5,
reconnectInterval = 1000
} = options || {};
const ws = useRef<WebSocket | null>(null);
const reconnectTimeout = useRef<NodeJS.Timeout | null>(null);
const attemptCount = useRef(0);
const [isConnected, setIsConnected] = useState(false);
const connect = useCallback(() => {
if (ws.current && (ws.current.readyState === WebSocket.OPEN || ws.current.readyState === WebSocket.CONNECTING)) {
return; // Already connected or connecting
}
console.log(`Attempting to connect WebSocket to ${url}... Attempt ${attemptCount.current + 1}`);
ws.current = new WebSocket(url);
ws.current.onopen = (event) => {
console.log('WebSocket connected');
setIsConnected(true);
attemptCount.current = 0;
if (reconnectTimeout.current) {
clearTimeout(reconnectTimeout.current);
reconnectTimeout.current = null;
}
onOpen?.(event);
};
ws.current.onmessage = onMessage;
ws.current.onerror = onError;
ws.current.onclose = (event) => {
console.log('WebSocket closed', event.code, event.reason);
setIsConnected(false);
onClose?.(event);
if (attemptCount.current < reconnectAttempts) {
attemptCount.current++;
reconnectTimeout.current = setTimeout(connect, reconnectInterval);
} else {
console.error('WebSocket failed to reconnect multiple times');
}
};
return () => {
console.log('Cleaning up WebSocket connection');
if (ws.current) {
ws.current.onopen = null;
ws.current.onmessage = null;
ws.current.onerror = null;
ws.current.onclose = null;
if (ws.current.readyState === WebSocket.OPEN || ws.current.readyState === WebSocket.CONNECTING) {
ws.current.close();
}
}
if (reconnectTimeout.current) {
clearTimeout(reconnectTimeout.current);
reconnectTimeout.current = null;
}
};
}, [url, onOpen, onMessage, onError, onClose, reconnectAttempts, reconnectInterval]);
useEffect(() => {
const cleanup = connect();
return cleanup;
}, [connect]);
const sendMessage = useCallback((message: string | object) => {
if (ws.current && ws.current.readyState === WebSocket.OPEN) {
const payload = typeof message === 'object' ? JSON.stringify(message) : message;
ws.current.send(payload);
} else {
console.warn('WebSocket not connected, cannot send message');
}
}, []);
return { ws: ws.current, isConnected, sendMessage };
}
3.2 State Management and Optimistic Updates (Frontend)
Efficient frontend state management, especially for concurrent operations or real-time updates, affects responsiveness. Optimistic updates immediately update the UI (assuming success), sync with the server in the background, and roll back if failed, improving perceived performance.
// hooks/useOptimisticUpdate.ts - Frontend optimistic update hook example
import { useState, useCallback } from 'react'
interface OptimisticUpdateConfig<T, U> {
updateFn: (data: T) => Promise<U>; // Async server update function
onMutate?: (data: T) => void; // Optimistically update UI
onSuccess?: (result: U, data: T) => void; // Server success callback
onError?: (error: Error, data: T) => void; // Server failure callback
}
export function useOptimisticUpdate<T, U = void>(
config: OptimisticUpdateConfig<T, U>
) {
const { updateFn, onMutate, onSuccess, onError } = config;
const [isUpdating, setIsUpdating] = useState(false);
const [error, setError] = useState<Error | null>(null);
const update = useCallback(async (data: T) => {
setIsUpdating(true);
setError(null);
onMutate?.(data); // Optimistically update UI
try {
const result = await updateFn(data); // Server update
onSuccess?.(result, data); // Success callback
} catch (err) {
const updateError = err instanceof Error ? err : new Error(String(err));
setError(updateError);
onError?.(updateError, data); // Failure callback, rollback UI
} finally {
setIsUpdating(false);
}
}, [updateFn, onMutate, onSuccess, onError]);
return { update, isUpdating, error };
}
3.3 Backend and System Concurrency: Scaling Strategies
Backend concurrency requires distributing load across multiple instances and managing shared resource access. Through X, Y, and Z axis scaling, throughput is increased.
Scaling Dimensions (X, Y, Z)
- X-axis (horizontal scaling): Add identical application or storage instances, with a load balancer distributing requests. Stateless application services can be quickly scaled via deployment platforms (e.g., for major sales events). Storage scaling requires data migration and shard rule adjustments, commonly used to handle traffic surges.
// Pseudocode: Application layer horizontal scaling
async function scaleApplication(groupId, instanceCount) {
await deploymentPlatform.scale(groupId, { instances: instanceCount });
console.log(`Scaled group ${groupId} to ${instanceCount} instances`);
}
- Y-axis (functional decomposition): Break a monolith into microservices, independently scaling high-load services based on DDD. For example, an e-commerce monolith is split into order, payment, and inventory microservices. In logistics, DDD divides the system into order management and fulfillment domains, improving maintainability and scalability.
// Pseudocode: DDD-based microservice division
async function createOrder(orderData) {
validateOrder(orderData);
const orderId = await db.insert('orders', orderData);
await notifyFulfillmentService(orderId);
return orderId;
}
async function processFulfillment(orderId) {
const order = await db.get('orders', orderId);
assignCourier(order);
updateOrderStatus(orderId, 'assigned');
}
- Z-axis (data partitioning and cell-based architecture): Partition data based on attributes (e.g., user ID, region), routing to dedicated service clusters or "cells," localizing traffic and data to reduce contention. For example, in a logistics platform, the Beijing cell serves Beijing users, the Shanghai cell serves Shanghai users—like warehouse logistics—improving concurrency and availability.
// Pseudocode: Cell-based request routing
async function routeRequest(userId, requestData) {
const unit = getUnitByUserId(userId); // Route based on user ID
return await unit.processRequest(requestData);
}
Hot Key Handling
Hot keys (e.g., popular SKUs during promotions) can overload cache or DB shards, causing performance degradation or failures. Solutions include:
- Local cache: Store hot keys in application memory (e.g., Guava cache) for direct reads, reducing distributed cache or DB access.
// Pseudocode: Local cache for hot keys
async function getHotProduct(skuId) {
let product = localCache.get(skuId);
if (product) return product;
product = await redis.get(`product:${skuId}`);
if (product) {
localCache.set(skuId, product);
return product;
}
product = await db.get('products', skuId);
await redis.set(`product:${skuId}`, product, { EX: 3600 });
localCache.set(skuId, product);
return product;
}
- Key randomization: Add a random suffix to hot keys (e.g., product:123_rand42) to spread copies of the same value across multiple shards. A two-digit suffix spreads one hot key across up to 100 cache keys, balancing the load.
// Pseudocode: Key randomization for hot key distribution
async function getProductWithRandomKey(skuId) {
const rand = Math.floor(Math.random() * 100);
const key = `product:${skuId}_${rand}`;
let product = await redis.get(key);
if (!product) {
product = await db.get('products', skuId);
await redis.set(key, product, { EX: 3600 });
}
return product;
}
DDD Practice in Retail Logistics Platforms
In logistics platforms, DDD defines service boundaries through deep understanding of business processes. Business flows include:
- Forward flow: The merchant selects a service (e.g., courier delivery) and places an order; the service provider accepts it and assigns a courier; the courier picks up, weighs, and charges for the parcel; the merchant pays; the goods are delivered to the user.
- Reverse flow: The user requests a return, the merchant approves and places an order, and the rest of the process mirrors the forward flow.
Based on this, the system is divided into product domain (managing virtual products like delivery services), order domain (order creation and updates), payment/settlement domain, and fulfillment domain, supporting B-end logistics transactions and providing virtual products (e.g., courier services) to improve concurrency.
4. High Availability: Ensuring Reliability and Resilience
High availability (HA) ensures the system remains accessible and functional even when components fail, measuring uptime. It requires redundancy and protection mechanisms to prevent small failures from escalating, spanning application, storage, and deployment layers.
4.1 Frontend Availability: Error Handling and Resilience
The frontend must remain functional or gracefully degrade when the backend is unavailable or errors occur.
Global Error Boundaries
In React applications, global error boundaries catch unhandled errors in the component tree, preventing app crashes and displaying fallback UIs.
// components/ErrorBoundary.tsx - Frontend error boundary example
import React, { Component, ErrorInfo, ReactNode } from 'react'
import { logError } from '../services/errorTracking' // Error tracking service
interface Props {
children: ReactNode;
fallback: ReactNode; // UI to render on error
onError?: (error: Error, errorInfo: ErrorInfo) => void; // Error callback
}
interface State {
hasError: boolean;
error: Error | null;
}
class ErrorBoundary extends Component<Props, State> {
state: State = { hasError: false, error: null };
static getDerivedStateFromError(error: Error): State {
return { hasError: true, error }; // Update state to show fallback UI
}
componentDidCatch(error: Error, errorInfo: ErrorInfo) {
console.error("Unhandled error:", error);
logError(error, errorInfo); // Log error
this.props.onError?.(error, errorInfo);
}
render() {
if (this.state.hasError) {
return this.props.fallback;
}
return this.props.children;
}
}
export default ErrorBoundary;
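A short usage sketch, wrapping a feature subtree so a render error falls back to a friendly message instead of a blank page (file paths are illustrative):
// app-shell.tsx - usage sketch for the ErrorBoundary above
import React from 'react';
import ErrorBoundary from './components/ErrorBoundary';
import Dashboard from './pages/dashboard'; // any feature subtree

export function AppShell() {
  return (
    <ErrorBoundary fallback={<p>Something went wrong. Please refresh the page.</p>}>
      <Dashboard />
    </ErrorBoundary>
  );
}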
Offline Support
Service Workers provide offline capabilities, enhancing resilience during network issues.
// public/sw.js - Service Worker example (cache-first strategy)
const CACHE_NAME = 'app-cache-v1';
const STATIC_ASSETS = [
'/',
'/index.html',
'/styles.css',
'/app.js',
'/manifest.json',
];
// Install: Pre-cache static assets
self.addEventListener('install', (event) => {
console.log('[Service Worker] Installing');
event.waitUntil(
caches.open(CACHE_NAME)
.then(cache => {
console.log('[Service Worker] Pre-caching static assets');
return cache.addAll(STATIC_ASSETS);
})
.catch(error => {
console.error('[Service Worker] Pre-cache failed', error);
})
);
self.skipWaiting();
});
// Activate: Clean up old caches
self.addEventListener('activate', (event) => {
console.log('[Service Worker] Activating');
event.waitUntil(
caches.keys().then(cacheNames => {
return Promise.all(
cacheNames.map(cacheName => {
if (cacheName !== CACHE_NAME) {
console.log('[Service Worker] Deleting old cache:', cacheName);
return caches.delete(cacheName);
}
return Promise.resolve();
})
);
})
);
self.clients.claim();
});
// Fetch: Intercept network requests
self.addEventListener('fetch', (event) => {
event.respondWith(
caches.match(event.request)
.then(response => {
if (response) {
return response;
}
return fetch(event.request).then(response => {
if (!response || response.status !== 200 || response.type !== 'basic') {
return response;
}
const responseToCache = response.clone();
caches.open(CACHE_NAME).then(cache => {
cache.put(event.request, responseToCache);
});
return response;
});
})
.catch(error => {
console.error('[Service Worker] Fetch failed:', event.request.url, error);
throw error;
})
);
});
4.2 Backend and System High Availability: Redundancy and Protection
In high-availability system design, backend services must remain available to the outside world even when some components fail, avoiding single points of failure. In addition to basic multi-instance redundancy, various protection mechanisms are needed at the service level to ensure resilience and self-healing under high load, abnormal traffic, or downstream dependency failures.
Redundancy Mechanisms
- Multi-instance deployment: Horizontally scale services across multiple servers/containers, using a load balancer to route requests to healthy instances, enabling automatic failover.
- Multi-data center/availability zone: Distribute services and data across different geographic locations to prevent data center-level failures from affecting overall availability.
- Master-slave/multi-master replication: Storage layers (DB, cache) use master-slave or multi-master replication; if the master fails, failover to a slave keeps data available (a minimal read/write-splitting sketch follows this list).
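A minimal sketch of routing on top of master-slave replication, assuming a generic query interface: writes go to the master, reads are spread across slaves, and production drivers add health checks, replication-lag awareness, and automatic failover on top of this:
// replicated-db.ts - read/write splitting sketch (interface is an assumption)
interface DbConnection {
  query(sql: string, params?: unknown[]): Promise<unknown>;
}

class ReplicatedDb {
  constructor(
    private readonly primary: DbConnection,
    private readonly replicas: DbConnection[],
  ) {}

  write(sql: string, params?: unknown[]): Promise<unknown> {
    return this.primary.query(sql, params); // all writes go to the master
  }

  read(sql: string, params?: unknown[]): Promise<unknown> {
    // Naive random pick; real drivers also health-check replicas and fail over
    const replica = this.replicas[Math.floor(Math.random() * this.replicas.length)];
    return replica.query(sql, params);
  }
}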
Protection Mechanisms
Rate Limiting
Rate limiting prevents services from being overwhelmed by traffic spikes or malicious requests and is the first line of defense for high-availability systems. Common algorithms include:
- Fixed Window Counter: Counts requests in a fixed time window; excess requests are rejected. Simple to implement but may have spikes at window boundaries. Suitable for scenarios insensitive to traffic bursts.
- Sliding Window: Divides the window into smaller intervals, smoothing out spikes. Suitable for scenarios requiring strict traffic smoothing.
- Leaky Bucket: Requests enter a "bucket" at any rate, but are processed at a fixed rate. Excess requests are dropped or queued. Good for smoothing traffic, but bucket size and rate must be tuned.
- Token Bucket: Tokens are generated at a fixed rate; requests consume tokens to proceed. Allows short bursts while controlling overall rate. More flexible and widely used in API gateways and microservices.
Example:
// Pseudocode: Token bucket rate limiting
class TokenBucket {
constructor(capacity, rate) {
this.capacity = capacity; // Bucket size
this.tokens = capacity; // Current tokens
this.rate = rate; // Tokens added per second
this.lastRefill = Date.now();
}
tryConsume(tokens) {
this.refill();
if (this.tokens >= tokens) {
this.tokens -= tokens;
return true;
}
return false;
}
refill() {
const now = Date.now();
const elapsed = (now - this.lastRefill) / 1000;
this.tokens = Math.min(this.capacity, this.tokens + elapsed * this.rate);
this.lastRefill = now;
}
}
const limiter = new TokenBucket(100, 10); // 100 tokens, 10/sec
async function handleRequest(req) {
if (limiter.tryConsume(1)) {
return await processRequest(req);
}
return { status: 429, message: 'Rate limit exceeded' };
}
In practice, internal platforms such as JSF (Java Service Framework) support synchronous rate limiting, while JMQ (message queue) supports asynchronous flow control, covering different business scenarios. Gateways like Nginx and Envoy can enforce global rate limits; inside a service, libraries like Guava RateLimiter or Resilience4j provide local rate limiting.
Circuit Breaking & Fallback
- Circuit breaker: When downstream services experience many timeouts or errors, the circuit "opens," immediately returning fallback responses to avoid thread pileups. After a period, it attempts recovery; if downstream is healthy, it closes.
- Fallback: When a service is unavailable or times out, return simplified data, cached data, or friendly messages to ensure core functionality remains available and non-core features degrade gracefully.
We use manual fallback (e.g., Ducc config switches) and automatic fallback (Hystrix, Sentinel, etc.).
// Pseudocode: Circuit breaker
class CircuitBreaker {
constructor(threshold, timeout) {
this.state = 'CLOSED';
this.failures = 0;
this.threshold = threshold;
this.timeout = timeout;
this.resetTime = 0; // Timestamp after which a half-open probe is allowed
}
async call(protectedCall) {
if (this.state === 'OPEN') {
if (Date.now() < this.resetTime) {
return fallbackResponse();
}
this.state = 'HALF_OPEN';
}
try {
const result = await protectedCall();
this.reset();
return result;
} catch (error) {
this.failures++;
if (this.failures >= this.threshold) {
this.state = 'OPEN';
this.resetTime = Date.now() + this.timeout;
}
return fallbackResponse();
}
}
reset() {
this.state = 'CLOSED';
this.failures = 0;
}
}
const breaker = new CircuitBreaker(5, 30000); // 5 failures to open, 30s recovery
async function callDownstream() {
return await breaker.call(() => downstreamService());
}
Timeouts & Retries
- Timeout control: Set reasonable timeouts for each downstream call to prevent indefinite waits. Timeouts should decrease at each level (funnel principle), with upstream timeouts longer than downstream to avoid resource pileups.
- Retry mechanism: Automatically retry transient errors (e.g., network glitches), but set max retries and backoff to avoid retry storms. Write operations must be idempotent.
// Pseudocode: Timeout setting
async function callServiceA() {
return await withTimeout(500, async () => {
return await callServiceB();
});
}
async function callServiceB() {
return await withTimeout(400, async () => {
return await callServiceC();
});
}
async function withTimeout(ms, fn) {
const timeout = new Promise((_, reject) =>
setTimeout(() => reject(new Error('Timeout')), ms)
);
return Promise.race([fn(), timeout]); // Start the call and race it against the timer
}
Retries: Handle transient network issues, but limit to avoid storms. Writes must be idempotent. Read retries are usually safe; write retries require unique request IDs or transaction checks.
// Pseudocode: Request with retry
async function callServiceWithRetry(request, maxRetries = 3) {
for (let attempt = 1; attempt <= maxRetries; attempt++) {
try {
return await callService(request);
} catch (error) {
if (attempt === maxRetries) throw error;
await sleep(100 * Math.pow(2, attempt)); // Exponential backoff
}
}
}
async function callService(request) {
if (request.isWrite && !isIdempotent(request)) {
throw new Error('Non-idempotent write operation');
}
return await downstreamService(request);
}
Compatibility: Forward and backward compatibility prevent deployment issues. Forward compatibility means old code can tolerate data produced by new code (verified through rollback testing); backward compatibility means new code can handle data produced by old code (verified by replaying old traffic). Test both by feeding new-format data to the old system and replaying old traffic against the new system.
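A minimal sketch of the tolerant-reader style that keeps both directions compatible; the event shape and field names are illustrative assumptions:
// order-event.ts - tolerant reader sketch (field names are assumptions)
interface OrderEvent {
  orderId: string;
  status: string;
  channel?: string; // field added in a newer version; old producers omit it
}

function parseOrderEvent(raw: string): OrderEvent {
  const data = JSON.parse(raw);
  return {
    orderId: data.orderId,
    status: data.status,
    channel: data.channel ?? 'default', // default in old data keeps new code working
    // Unknown extra fields from newer producers are simply ignored, not rejected
  };
}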
Isolation Strategies
Isolation minimizes the impact of failures:
- System type isolation: Online, offline, and near real-time systems run on separate infrastructure. In logistics, online services (jdl-uep-main) are separated from offline/near real-time services (jdl-uep-worker).
- Environment isolation: Strict separation of dev, test, staging, and production environments; no cross-environment calls.
- Data isolation: By business (retail, ISV), environment (test, prod), and access frequency (hot/cold data). Full-link stress tests use shadow tables to avoid polluting production data.
- Core/non-core flow isolation: Core flows (orders, payments) get priority resources; non-core flows (notifications) are decoupled via queues.
- Read/write isolation: CQRS at the app layer, master-slave DB at the storage layer, with reads routed to slaves.
- Thread pool isolation: Different tasks or dependencies use separate thread pools so a slow call cannot exhaust shared resources (see the bulkhead sketch after this list).
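Node.js services have no per-dependency thread pools, but the same bulkhead effect can be approximated by capping in-flight calls per dependency. A minimal sketch, with limits and service names chosen purely for illustration:
// bulkhead.ts - per-dependency concurrency cap (a JS analogue of thread pool isolation)
class Bulkhead {
  private active = 0;
  private queue: Array<() => void> = [];
  constructor(private readonly limit: number) {}

  private async acquire(): Promise<void> {
    while (this.active >= this.limit) {
      // Park until a slot frees up, then re-check the limit
      await new Promise<void>(resolve => this.queue.push(resolve));
    }
    this.active++;
  }

  async run<T>(task: () => Promise<T>): Promise<T> {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake one waiter to contend for the freed slot
    }
  }
}

// Separate bulkheads per dependency: slow payment calls cannot starve search traffic
const paymentBulkhead = new Bulkhead(20);
const searchBulkhead = new Bulkhead(50);
// Usage: await paymentBulkhead.run(() => callPaymentService(order)); // helper assumed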
Storage Layer High Availability: Replication and Partitioning
- Replication:
- Master-slave: Master handles writes, replicates to slaves, which serve reads and failover (e.g., MySQL, Redis).
- Multi-master: Multiple nodes accept writes and replicate to each other; higher write availability but more conflict complexity.
- Masterless: Clients write to multiple nodes, read from multiple nodes to resolve stale data (e.g., Cassandra).
- Partitioning (Sharding):
- Range partitioning: Keys assigned by range, good for range queries but prone to hot spots.
- Hash partitioning: Keys assigned by hash, balances load but less efficient for range queries.
Examples:
- Redis cluster: Data is split into 16,384 hash slots assigned to masters, each with a slave for failover (a slot-routing sketch follows this list).
- Elasticsearch: Indexes split into primary and replica shards; primaries handle reads/writes, replicas handle reads and backup.
- Kafka: Topics partitioned; leader partitions handle reads/writes, followers provide redundancy.
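To make slot-based partitioning concrete, here is a minimal sketch in the style of Redis Cluster. Redis itself uses CRC16(key) mod 16384; a simple string hash stands in for CRC16 here, and the slot ranges mirror a default three-master split:
// hash-slots.ts - slot routing sketch (hash function and node names are stand-ins)
const SLOT_COUNT = 16384;

function hashKey(key: string): number {
  let h = 0;
  for (const ch of key) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function slotFor(key: string): number {
  return hashKey(key) % SLOT_COUNT;
}

// Each master owns a contiguous slot range, as in a default three-master cluster
const owners = [
  { node: 'master-a', from: 0, to: 5460 },
  { node: 'master-b', from: 5461, to: 10922 },
  { node: 'master-c', from: 10923, to: 16383 },
];

function nodeFor(key: string): string {
  const slot = slotFor(key);
  return owners.find(o => slot >= o.from && slot <= o.to)!.node;
}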
Deployment Layer High Availability
- Multi-machine redundancy: Services run on multiple servers.
- Multi-data center: Services and storage (MySQL, Redis) deployed across data centers; Elasticsearch is single-DC (with plans for multi-DC).
- Geo-redundancy/cell-based: Regional cells serve local users, improving performance and availability.
Current deployment:
- App containers: Docker and Kubernetes, running on AWS and GCP.
- MySQL/Redis: Dual data center.
- Elasticsearch: Single data center.
5. Monitoring and Analysis: The Eyes and Ears of the System
Comprehensive monitoring, logging, and analysis provide visibility, detect issues, diagnose causes, and measure optimization effects.
Performance Monitoring
Track key metrics like latency, throughput, and error rates. Frontend uses browser performance APIs; backend monitors CPU, memory, network, disk I/O, DB query times, etc.
// lib/performance.ts - Frontend performance tracking example
export function trackPerformance(metricName: string, startMarkName?: string, endMarkName?: string) {
if (typeof window === 'undefined' || !window.performance) {
console.warn("Performance API unavailable");
return;
}
const sMark = startMarkName || `${metricName}-start`;
const eMark = endMarkName || `${metricName}-end`;
try {
if (!startMarkName) {
window.performance.mark(sMark);
}
if (!endMarkName) {
window.performance.mark(eMark);
}
window.performance.measure(metricName, sMark, eMark);
const measureEntries = window.performance.getEntriesByName(metricName);
const measure = measureEntries.length > 0 ? measureEntries[0] : null;
if (measure && 'duration' in measure) {
console.log(`${metricName}: ${measure.duration.toFixed(2)} ms`);
} else {
console.warn(`Performance measure "${metricName}" not found`);
}
window.performance.clearMarks(sMark);
window.performance.clearMarks(eMark);
window.performance.clearMeasures(metricName);
} catch (error) {
console.error(`Error tracking performance for "${metricName}":`, error);
}
}
Error Tracking and Logging
Frontend and backend errors must be captured, logged, and analyzed.
// services/errorTracking.ts - Error tracking service example
interface ErrorInfo {
componentStack?: string; // React error info
[key: string]: any; // Extra context
}
export function logError(error: Error, errorInfo?: ErrorInfo) {
console.error('Captured error:', error);
if (errorInfo) {
console.error('Error info:', errorInfo);
}
if (typeof window !== 'undefined' && process.env.NEXT_PUBLIC_SENTRY_DSN) {
console.log("Frontend error will be sent to Sentry...");
} else {
console.log("Backend error will be sent to centralized logs...");
}
}
Centralized logging systems (e.g., ELK Stack) aggregate distributed service logs; tracing tools (e.g., OpenTelemetry, Zipkin) visualize request flows to locate bottlenecks or failures.
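As a hedged sketch of what instrumenting a request looks like with the OpenTelemetry API (SDK setup and exporters are omitted, and the createOrder helper is the assumed business function from earlier examples):
// tracing.ts - OpenTelemetry span sketch (SDK configuration assumed elsewhere)
import { trace, SpanStatusCode } from '@opentelemetry/api';

declare function createOrder(orderData: { skuId: string }): Promise<string>; // assumed helper

const tracer = trace.getTracer('order-service');

async function createOrderTraced(orderData: { skuId: string }): Promise<string> {
  return tracer.startActiveSpan('createOrder', async (span) => {
    try {
      span.setAttribute('order.skuId', orderData.skuId); // searchable in the trace UI
      return await createOrder(orderData);
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end(); // always close the span so the trace completes
    }
  });
}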
6. System Architecture Evolution: The Journey to Resilience
Building "three-high" systems is a gradual process. From monoliths to SOA, microservices, and service mesh, each step improves modularity, scalability, and resilience. Logistics platforms evolve from monoliths to DDD-based microservices, driven by business complexity and traffic.
Deployment evolves from single server to multi-server, single data center to multi-data center, and multi-region, with redundancy and load balancing ensuring high availability.
Data Isolation Strategies:
- Business isolation: Separate tenants (e.g., retail, ISV).
- Environment isolation: Separate dev, test, staging, and production.
- Hot/cold data separation: Frequently accessed data is cached; infrequently accessed data is archived to OSS.
7. Practical Implementation Examples: Turning Concepts into Reality
7.1 Read Optimization: Cache and Database Integration
Read-heavy systems synchronously update the database and invalidate the cache; write-heavy systems synchronously update the cache and asynchronously update the database (see 2.2).
7.2 Write Optimization: Asynchronous Processing for Traffic Peaks
Flash sale scenarios use async processing: quick validation, queue processing, and cache-managed inventory (see 2.2).
7.3 Distributed System Hot Key Handling
Local cache and key randomization distribute hot key traffic (see 3.3).
7.4 System Isolation Practices
System, environment, data, core/non-core, read/write, and thread pool isolation limit the impact of failures (see 4.2).
8. Future Considerations: The Ongoing Journey
- Continuous evolution: Adapt to new patterns, tools, and infrastructure.
- Business-technology alignment: Technology serves business needs.
- Scalability planning: Proactively design for future growth.
- Monitoring and observability: Invest in monitoring, logging, tracing, and alerting.
- Security and compliance: Secure design, authentication, and compliance to protect user data.
The journey of a full-stack developer building "three-high" systems is full of challenges and opportunities. By adhering to core principles and continuous learning, we can build applications that meet scale demands and deliver exceptional experiences.