01 Undergraduate Research Proposal · Course AG

Stabilizing the Edge

A Two-Layer Genetic Algorithm Framework for PPO in High-Noise Environments

Applied to the "Pongpong" Taiwanese Mahjong Self-Play Platform

Hong-Rui Su (C112156233) | National Kaohsiung University of Science and Technology, Dept. of CSIE | April 2026
02 The Foundation

A High-Throughput Self-Play Ecosystem

Engine
3,000
steps / second
High-performance Go engine, 256 concurrent environments, shared-memory gRPC. Measured mean = 3,001 sps across 736 updates.
Policy
10M
parameters
MahjongTransformer: 4 layers × 8 heads × d=256; input 4,522-dim (133 × 34 tiles); 109-action discrete space.
Live Deployment
2,814
human matches
pongpong-online production deployment. 10 users, 40 days of human-vs-bot matches for cross-validating sync-eval results.
03 The Incident

Achieving and Systematically Losing the Edge

Six training checkpoints evaluated via sync-eval vs. 3× rotating-dealer expert opponents. All numbers are measured (n = 10K or 100K).

Checkpoint | Step | WR % | Score/game | Deal-in % | ΔWR vs baseline | ΔScore vs baseline
Expert rotating baseline | — | 24.49 | +0.013 | 18.61 | — | —
BC transformer | 0 | 23.66 | −0.05 | 19.01 | −0.83 | −0.06
S1-best (production model) | ~10M | 25.35 | +0.14 | 16.29 | +0.86 | +0.13
S2-early | 20M | 21.54 | −0.13 | 17.28 | −2.95 | −0.14
S2-peak (pool) | 160M | 24.51 | +0.06 | 15.63 | +0.02 | +0.05
S2-mid | 180M | 20.64 | −0.23 | 15.16 | −3.85 | −0.24
S2-end | 202M | 19.83 | −0.27 | 15.13 | −4.66 | −0.28
S1 learned robust defense. Deal-in rate drops from 18.61% (expert) to 16.29% (S1-best) — a significant −2.32pp improvement. Meanwhile, avg fan on win stays comparable to expert (2.65 vs 2.62). The edge comes from defensive skill, not bigger-hand pursuit.
S2-end deal-in continues to drop (15.13%), yet overall score collapses. This is not complete forgetting but "over-defensive paralysis" — the model amplifies the "don't discard dangerous tiles" reflex at the cost of all offensive capability.
04 The Root Cause

Irreducible Noise Triggering Structural Collapse

The "Valuebench" Diagnostic

Var(G) — total return variance19.1945
E[Var(G|O)] — irreducible noise17.5812 (91.6%)
Var(E[G|O]) — explainable1.6133 (8.4%)
Even an omniscient oracle critic seeing all concealed tiles and the full wall order can explain at most 8.4% of return variance. The remaining 91.6% is irreducible environmental noise.
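The Valuebench split is an instance of the Law of Total Variance: Var(G) = E[Var(G|O)] + Var(E[G|O]). A minimal sketch of that decomposition, assuming logged (oracle observation, return) pairs; `total_variance_decomposition` is a hypothetical helper, and O is treated here as a discrete, repeatable observation for illustration (the real diagnostic presumably estimates E[G|O] with an oracle critic):

```python
import random
import statistics

def total_variance_decomposition(samples):
    """Split Var(G) into E[Var(G|O)] + Var(E[G|O]) from
    (observation, return) pairs.  Illustrative sketch only."""
    by_obs = {}
    for o, g in samples:
        by_obs.setdefault(o, []).append(g)

    returns = [g for _, g in samples]
    n = len(returns)
    grand_mean = statistics.fmean(returns)

    var_total = statistics.pvariance(returns)                  # Var(G)
    within = sum(len(gs) * statistics.pvariance(gs)            # E[Var(G|O)]
                 for gs in by_obs.values()) / n
    between = sum(len(gs) * (statistics.fmean(gs) - grand_mean) ** 2
                  for gs in by_obs.values()) / n               # Var(E[G|O])
    return var_total, within, between

# Synthetic check: 5 oracle states, noisy returns around each.
rng = random.Random(0)
samples = [(o, rng.gauss(o, 2.0)) for o in range(5) for _ in range(200)]
var_total, within, between = total_variance_decomposition(samples)
```

The identity var_total = within + between holds exactly with population variances; on Pongpong's data the within-observation term is the 17.58 (91.6%) irreducible share.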

The Collapse Trigger (S2 Tensorboard)

Current setting: lr_cycle_steps=50M, cycle_mult=2.0, so cycles start near steps 0, 50M, and 150M. The restart of the 3rd cycle, near step 150M, triggers:

LR jump: 1e-6 → 3e-5 (30× increase)
KL: 0.0009 → 0.0098 (×10.9)
Entropy: 0.61 → 0.91 (+0.30)
→ policy collapses within ~15M steps.
Conclusion: When 91.6% of return variance is irreducible, PPO training gains almost no leverage from value signal. In this regime, hyperparameters dictate not convergence speed but whether an acquired edge survives long-term training.
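The restart timing follows directly from the schedule. A minimal sketch (assuming an SGDR-style cosine schedule with warm restarts; the function name is hypothetical, parameter values mirror the S2 run): with lr_cycle_steps=50M and cycle_mult=2.0, cycles start at 0, 50M, and 150M, so the jump from lr_min back to lr_max lands near step 150M.

```python
import math

def cosine_restart_lr(step, lr_max=3e-5, lr_min=1e-6,
                      cycle_steps=50_000_000, cycle_mult=2.0):
    """SGDR-style schedule: cosine decay lr_max -> lr_min within each
    cycle; each new cycle is cycle_mult x longer and restarts at lr_max.
    Illustrative sketch, not the training code."""
    start, length = 0, cycle_steps
    while step >= start + length:      # find the cycle containing `step`
        start += length
        length = int(length * cycle_mult)
    t = (step - start) / length        # progress within the cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

Just before step 150M the LR has decayed to lr_min = 1e-6; at 150M it snaps back to lr_max = 3e-5, the 30× jump in the table above.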
05 The Protocol

A Two-Layer Genetic Algorithm Wrapper

Outer Loop · Genetic Algorithm

1
Initialize Population
12 chromosomes; each encodes a PPO hyperparameter configuration (9 core + 3 optional critic dims).
2
Evaluate Fitness
Inner loop runs PPO self-play for K=1M steps, then measures score/game vs 3× expert (n=3K per point + CRN).
3
Selection
Tournament selection (k=3) fills the mating pool.
4
Crossover + Mutation
SBX (η=20) + polynomial mutation (η=20, p=1/n). Gene-type-aware handling for log-scale continuous vs categorical.
5
Elitism + Next Generation
Top 2 pass directly to next gen. 8 generations total, or stop when best fitness stagnates for 3 generations.
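Steps 1–5 can be sketched on a single real-valued gene; a toy quadratic fitness stands in for the inner-loop PPO evaluation, and all names are illustrative (the real search operates on the full mixed-type chromosome):

```python
import random

rng = random.Random(42)  # fixed seed for reproducibility

def tournament(pop, fits, k=3):
    """Step 3: best of k randomly drawn individuals."""
    best = max(rng.sample(range(len(pop)), k), key=lambda i: fits[i])
    return pop[best]

def sbx(x, y, low, high, eta=20.0):
    """Step 4a: simulated binary crossover for one real gene."""
    u = rng.random()
    beta = (2 * u) ** (1 / (eta + 1)) if u <= 0.5 else (1 / (2 * (1 - u))) ** (1 / (eta + 1))
    clip = lambda v: min(max(v, low), high)
    return (clip(0.5 * ((1 + beta) * x + (1 - beta) * y)),
            clip(0.5 * ((1 - beta) * x + (1 + beta) * y)))

def poly_mutate(x, low, high, eta=20.0):
    """Step 4b: polynomial mutation (p = 1/n is 1 here: one gene)."""
    u = rng.random()
    delta = (2 * u) ** (1 / (eta + 1)) - 1 if u < 0.5 else 1 - (2 * (1 - u)) ** (1 / (eta + 1))
    return min(max(x + delta * (high - low), low), high)

def evolve(fitness, low, high, pop_size=12, gens=8, elite=2):
    """Steps 1-5; `fitness` stands in for the 1M-step PPO inner loop."""
    pop = [rng.uniform(low, high) for _ in range(pop_size)]          # step 1
    for _ in range(gens):
        fits = [fitness(x) for x in pop]                             # step 2
        nxt = [x for _, x in sorted(zip(fits, pop), reverse=True)][:elite]  # step 5
        while len(nxt) < pop_size:
            c1, c2 = sbx(tournament(pop, fits), tournament(pop, fits), low, high)
            nxt += [poly_mutate(c1, low, high), poly_mutate(c2, low, high)]
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

# Toy run: search log10(learning_rate) in [-6, -3] for an assumed
# optimum at -4.5; the real fitness is the tail-minimum score/game.
best = evolve(lambda x: -(x + 4.5) ** 2, -6.0, -3.0)
```

Elitism guarantees the best-so-far individual survives each generation, so the returned value is the best chromosome ever evaluated.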

Inner Loop · PPO Self-Play (per fitness eval)

a
Load BC Checkpoint
Darwinian fresh start: every individual starts from the same BC weights (eliminates weight-inheritance as a hidden variable).
b
Train with Chromosome HPs
256 parallel envs via shared memory; opponent pool snapshots every 200K steps.
c
Snapshot at Tail Points
Save checkpoints at {0.8K, 0.9K, K} for the tail-minimum fitness computation.
d
Eval Each Snapshot
vs 3× expert, n=3K/point; all individuals in a generation share the same 3K seeded games (CRN).
e
Compute Tail-Minimum Fitness
f(c) = min over the 3 tail points. Degradation is penalized immediately.
Core concept: the GA searches not for peak performance alone, but for hyperparameters that force long-term stabilization in the presence of 91.6% irreducible noise.
06 The Search Space

9 Core + 3 Optional Critic Dimensions

Mixed-type encoding: log-scale continuous / linear continuous / integer / categorical.

Zone 1: Core Training Hyperparameters (9 Dimensions)

Gene | Range | Encoding | Design Rationale
learning_rate (cosine init) | [1e-6, 1e-3] | log-scale real | Primary stability lever
lr_cycle_steps | [10M, 100M] | log-scale integer | Directly addresses the S2 collapse cause
lr_min | [1e-7, 1e-5] | log-scale real | Controls restart jump magnitude
clip_range (ε) | [0.05, 0.4] | linear real | Trust region size
entropy_coef | [0.001, 0.1] | log-scale real | Exploration strength (critical under critic-free)
ppo_epochs | [1, 5] | integer | Data reuse per rollout
max_grad_norm | [0.1, 2.0] | linear real | Per-step update cap
minibatch_size | {512, 1024, 2048, 4096, 8192} | categorical | Effective sample size (1024 yields only ~74 independent advantage estimates)
num_steps (per env) | {512, 1024, 2048, 4096} | categorical | Rollout length × update frequency

Zone 2: Optional Critic Design (1 Categorical + 2 Conditional)

Gene | Options / Range | Encoding | Design Rationale
critic_mode | {none, reward_predictor, standalone_critic} | categorical | Let the GA decide whether and how to introduce value signal
gae_lambda | [0.9, 0.99] — active iff critic_mode ≠ none | linear real | Bias-variance tradeoff
discount_gamma | [0.95, 0.999] — active iff reward_predictor enabled | linear real | Bellman φ-shaping
Differentiation from prior GA+PPO work: Studies such as [ARZ+23] assume a fixed critic-based PPO with GAE. This study treats critic design itself as a searchable gene, letting the GA dynamically negotiate the bias-variance tradeoff based on environmental hostility.
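A minimal sketch of how such a mixed-type chromosome might decode into native hyperparameters, including the conditional Zone 2 genes. The gene names and the [0, 1] storage convention are assumptions for illustration, and only a subset of the 12 dimensions is shown; ranges mirror the tables above.

```python
import math

def decode(u):
    """Map a chromosome stored as gene_name -> value in [0, 1]
    onto native hyperparameter types.  Illustrative sketch."""
    log_real = lambda x, lo, hi: 10 ** (math.log10(lo) + x * (math.log10(hi) - math.log10(lo)))
    cat = lambda x, opts: opts[min(int(x * len(opts)), len(opts) - 1)]

    hp = {
        "learning_rate": log_real(u["lr"], 1e-6, 1e-3),                    # log-scale real
        "lr_cycle_steps": int(round(log_real(u["cycle"], 10e6, 100e6))),   # log-scale int
        "clip_range": 0.05 + u["clip"] * (0.4 - 0.05),                     # linear real
        "minibatch_size": cat(u["mb"], [512, 1024, 2048, 4096, 8192]),     # categorical
        "critic_mode": cat(u["critic"], ["none", "reward_predictor", "standalone_critic"]),
    }
    # Zone 2 conditional genes: expressed only when a critic exists.
    if hp["critic_mode"] != "none":
        hp["gae_lambda"] = 0.9 + u["lam"] * (0.99 - 0.9)
    if hp["critic_mode"] == "reward_predictor":
        hp["discount_gamma"] = 0.95 + u["gamma"] * (0.999 - 0.95)
    return hp
```

Keeping genes in [0, 1] lets SBX and polynomial mutation operate uniformly, while the decoder handles log scaling, rounding, categories, and conditional activation.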
07 Fitness Design

The "Tail-Minimum" Fitness Protocol

Our S2 training produced score curves that oscillate: rising past expert, crashing, briefly recovering, crashing again. A GA that only measures the score at the very last step can be fooled by a chromosome that just happens to be mid-rebound at measurement time. We need a fitness that explicitly rewards stability, not luck at the finish line.

Concrete failure mode — imagine two candidates. Candidate α crashes at 70% of training, bounces back at 95%, ends high. Candidate β grows slowly but stays high through the entire tail. Candidate α has higher final score but will almost certainly collapse again when deployed. We want the GA to prefer β.
Vulnerable · Ignores Collapse

Candidate A: Final Score (Standard)

fA(c) = Score(c, at the final step K)

Only looks at the last dot. The mid-training crash is invisible to fitness — so a chromosome that is secretly unstable but lucky at step K wins.
1× eval cost · misleading.

Resilient · Proposed Solution

Candidate B: Tail-Minimum

fB(c) = min{ Score(c, 0.8K), Score(c, 0.9K), Score(c, K) }

Takes the worst of three late checkpoints. Any chromosome that dips during the tail window is punished. One unlucky step is enough to tank its fitness.
3× eval cost · directly penalizes degradation.

Takeaway for this proposal: the fitness function is the GA's definition of "good". By encoding "stable tail" into fitness, we're asking the GA to find hyperparameters that produce deployable models, not hyperparameters that look impressive on a single snapshot — the exact lesson from our S2 incident.
Future work: E5 will compare A vs. B vs. C (explicit degradation penalty) vs. D (NSGA-II multi-objective) vs. E (composite scalar). Requires ~600–800 additional GPU-hours — beyond this study's 150–200 hr budget.
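The α/β scenario above can be made concrete in a few lines (the trajectories are made-up illustrative numbers, not measured data):

```python
def final_score(scores):
    """Candidate A: fitness = score at the final checkpoint only."""
    return scores[-1]

def tail_minimum(scores, tail=3):
    """Candidate B: fitness = worst of the last `tail` checkpoints
    (the snapshots at 0.8K, 0.9K and K)."""
    return min(scores[-tail:])

# Illustrative score trajectories over 5 checkpoints:
alpha = [0.05, 0.10, -0.20, -0.25, 0.12]   # crashes late, lucky rebound at K
beta  = [0.02, 0.06, 0.09, 0.10, 0.10]     # slower, but stable through the tail
```

Candidate A ranks α above β (0.12 vs 0.10); tail-minimum ranks β above α (0.09 vs −0.25), which is the ordering we want the GA to optimize.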
08 Budget Barrier

Variance Reduction & Proxy Evaluation

Common Random Numbers (CRN)

Per-game return variance σ² = 19.19 (from Valuebench). To reliably distinguish individuals whose fitness differs by Δ = 0.10 score/game, independent sampling requires n ≥ 15,000 games per eval.

CRN solution: all individuals in a given generation play on a fixed set of 3,000 wall-shuffling seeds. With luck noise correlation ρ ≈ 0.5, paired-test SE is reduced by √(1−ρ) = 0.71× — equivalent to n = 6K under independent sampling.
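The CRN arithmetic can be checked directly: when two individuals share the same seeds, Var(X − Y) = 2σ²(1 − ρ), so the standard error of the mean paired difference shrinks by √(1 − ρ) relative to independent sampling. A sketch of the formulas only, not the evaluation harness:

```python
import math

def paired_se(var, rho, n):
    """SE of the mean paired difference between two individuals that
    share the same n wall seeds: Var(X - Y) = 2*var*(1 - rho)."""
    return math.sqrt(2 * var * (1 - rho) / n)

def independent_se(var, n):
    """SE of the mean difference under independent sampling."""
    return math.sqrt(2 * var / n)

# Deck numbers: sigma^2 = 19.19, rho = 0.5, n = 3,000 shared seeds.
se_crn = paired_se(19.19, 0.5, 3000)
```

With ρ = 0.5, the 3K-seed paired SE equals the independent-sampling SE at n = 6K, matching the "equivalent to n = 6K" claim.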

Fitness Proxy Budget

We can't afford to fully train every candidate. Each is trained only to 1M steps (≈10% of the step count at which S1-best peaked) as a ranking proxy. The bet: good hyperparameters show their edge early.

K (fitness proxy) | 1M steps
Avg episode length | 13.88 actions
Training games / indiv | ~72,000
Eval games / indiv (3 × 3K) | 9,000
Time / individual | ~2 GPU-hr
Generation (12 indiv) | ~24 GPU-hr
Total (96 evals, 12 × 8) | 150–200 GPU-hr
3–4 weeks on a single RTX 4070 Ti (8 hr/day) completes the full core GA search.
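The budget rows above reduce to a short arithmetic check (pure arithmetic, no project code assumed):

```python
# Worked check of the budget table.
K = 1_000_000                 # proxy training steps per individual
ep_len = 13.88                # avg actions per game
train_games = K / ep_len      # ~72,000 self-play games per individual
eval_games = 3 * 3_000        # (3 x 3K) CRN eval games per individual
evals = 12 * 8                # population x generations = 96 fitness evals
total_hours = evals * 2       # ~2 GPU-hr per individual
```

At 8 hr/day on one GPU, ~192 GPU-hr is 24 working days, i.e. the quoted 3–4 weeks.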
09 Roadmap

Experimental Design & Hypothesis Mapping

Core Experiments (within current scope)

Experiment | Hypothesis | Method
E1 · GA vs. Hand-Tuned Baseline | H1, H2 | Best GA individual vs. current hand-tuned S2 baseline vs. random search; measured via score/game @ 50M steps. Tests whether the outer wrapper prevents S2 degradation and stabilizes the +0.13 edge.
E3 · Proxy Reliability | H4 | Spearman correlation between truncated fitness (K ∈ {0.5M, 1M, 2M}) and full-length training (≥50M). Target ρ > 0.6.
E4 · Sensitivity Analysis | H5 | Per-gene distributions in the converged population; correlation between each gene and best fitness. Tests whether LR schedule dominates stability.
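The E3 proxy-reliability statistic is a plain Spearman rank correlation between the two fitness orderings. A dependency-free sketch (assuming no tied fitness values; the real analysis could equally use `scipy.stats.spearmanr`):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (no ties): does the 1M-step proxy
    preserve the full-training ranking of candidates?"""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical proxy vs full-training scores for 5 candidates:
proxy = [0.1, -0.2, 0.05, 0.3, -0.1]
full  = [0.12, -0.25, 0.00, 0.28, -0.05]
rho = spearman_rho(proxy, full)
```

E3 passes if ρ exceeds the 0.6 target across the truncation lengths K ∈ {0.5M, 1M, 2M}.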

Future Work (⏳ additional compute required)

Experiment | Hypothesis | Additional cost / deferral reason
E2 · HPO Method Comparison (GA vs. BO vs. Grid) | H3 | ~1,000 GPU-hr
E5 · Fitness Design Sensitivity Matrix | H6 | ~600–800 GPU-hr
10 Scientific Contributions

Expected Contributions

  1. Empirical · Valuebench Methodology

     A reproducible method for isolating theoretical value-function ceilings in extreme-noise, imperfect-information games via conditional variance decomposition (Law of Total Variance). Pongpong's ceiling = 8.4% explainable variance.

  2. Architectural · Dynamic Critic Design

     Elevates critic design (critic-free / reward predictor / standalone) from a hardcoded assumption to a categorical genetic variable — letting the GA negotiate the bias-variance tradeoff based on environmental hostility.

  3. Methodological · Tail-Minimum GA Framework

     A "Tail-Minimum" fitness framework that rescues deep RL agents from long-term structural collapse — fitness design directly penalizes tail degradation instead of rewarding lucky final snapshots.
The core thesis: When 8.4% of return variance is explainable and 91.6% is irreducible, PPO training success depends not on "finding a better critic" or "training a larger model," but on "finding a training recipe under which an acquired thin edge survives long-term training." The GA searches for exactly that recipe.
Reference deck for Figma re-design. All numbers verified against /home/ray/project/pongpong/figma_data/*.csv.
Source of truth: 09_key_numbers_single_source.csv