01 Undergraduate Research Proposal · Course AG

Stabilizing the Edge

A Two-Layer Genetic Algorithm Framework for PPO in High-Noise Environments

Applied to the "Pongpong" Taiwanese Mahjong Self-Play Platform

Hong-Rui Su (C112156233) | National Kaohsiung University of Science and Technology, Dept. of CSIE | April 2026
02 The Foundation

A High-Throughput Self-Play Ecosystem

Engine
3,000
steps / second
High-performance Go engine, 256 concurrent environments, shared-memory gRPC. Measured mean = 3,001 sps across 736 updates.
Policy
10M
parameters
MahjongTransformer: 4 layers × 8 heads × d=256; input 4,522-dim (133 × 34 tiles); 109-action discrete space.
Live Deployment
2,814
human matches
pongpong-online production deployment. 10 users, 40 days of human-vs-bot matches for cross-validating sync-eval results.
03 The Incident

Achieving and Systematically Losing the Edge

Six training checkpoints evaluated via sync-eval vs. 3× rotating-dealer expert opponents. All numbers are measured (n = 10K or 100K).

Checkpoint | Step | WR % | Score/game | Deal-in % | ΔWR vs baseline | ΔScore vs baseline
Expert rotating baseline | — | 24.49 | +0.013 | 18.61 | — | —
BC transformer | 0 | 23.66 | −0.05 | 19.01 | −0.83 | −0.06
S1-best (production model) | ~10M | 25.35 | +0.14 | 16.29 | +0.86 | +0.13
S2-early | 20M | 21.54 | −0.13 | 17.28 | −2.95 | −0.14
S2-peak (pool) | 160M | 24.51 | +0.06 | 15.63 | +0.02 | +0.05
S2-mid | 180M | 20.64 | −0.23 | 15.16 | −3.85 | −0.24
S2-end | 202M | 19.83 | −0.27 | 15.13 | −4.66 | −0.28
S1 learned robust defense. Deal-in rate drops from 18.61% (expert) to 16.29% (S1-best) — a significant −2.32pp improvement. Meanwhile, avg fan on win stays comparable to expert (2.65 vs 2.62). The edge comes from defensive skill, not bigger-hand pursuit.
S2-end deal-in continues to drop (15.13%), yet overall score collapses. This is not complete forgetting but "over-defensive paralysis" — the model amplifies the "don't discard dangerous tiles" reflex at the cost of all offensive capability.
04 The Root Cause

Irreducible Noise Triggering Structural Collapse

The "Valuebench" Diagnostic

Var(G) — total return variance19.1945
E[Var(G|O)] — irreducible noise17.5812 (91.6%)
Var(E[G|O]) — explainable1.6133 (8.4%)
Even an omniscient oracle critic seeing all concealed tiles and the full wall order can explain at most 8.4% of return variance. The remaining 91.6% is irreducible environmental noise.
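The Valuebench split is an instance of the Law of Total Variance: Var(G) = E[Var(G|O)] + Var(E[G|O]). A minimal sketch of that decomposition, assuming logged (oracle observation, return) pairs; `total_variance_decomposition` is a hypothetical helper, and O is treated here as a discrete, repeatable observation for illustration (the real diagnostic presumably estimates E[G|O] with an oracle critic):

```python
import random
import statistics

def total_variance_decomposition(samples):
    """Split Var(G) into E[Var(G|O)] + Var(E[G|O]) from
    (observation, return) pairs.  Illustrative sketch only."""
    by_obs = {}
    for o, g in samples:
        by_obs.setdefault(o, []).append(g)

    returns = [g for _, g in samples]
    n = len(returns)
    grand_mean = statistics.fmean(returns)

    var_total = statistics.pvariance(returns)                  # Var(G)
    within = sum(len(gs) * statistics.pvariance(gs)            # E[Var(G|O)]
                 for gs in by_obs.values()) / n
    between = sum(len(gs) * (statistics.fmean(gs) - grand_mean) ** 2
                  for gs in by_obs.values()) / n               # Var(E[G|O])
    return var_total, within, between

# Synthetic check: 5 oracle states, noisy returns around each.
rng = random.Random(0)
samples = [(o, rng.gauss(o, 2.0)) for o in range(5) for _ in range(200)]
var_total, within, between = total_variance_decomposition(samples)
```

The identity var_total = within + between holds exactly with population variances; on Pongpong's data the within-observation term is the 17.58 (91.6%) irreducible share.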

The Collapse Trigger (S2 Tensorboard)

Current setting: lr_cycle_steps=50M, cycle_mult=2.0, so cycles start near steps 0, 50M, and 150M. The restart of the 3rd cycle, near step 150M, triggers:

LR jump: 1e-6 → 3e-5 (30× increase)
KL: 0.0009 → 0.0098 (×10.9)
Entropy: 0.61 → 0.91 (+0.30)
→ policy collapses within ~15M steps.
Conclusion: When 91.6% of return variance is irreducible, PPO training gains almost no leverage from value signal. In this regime, hyperparameters dictate not convergence speed but whether an acquired edge survives long-term training.
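The restart timing follows directly from the schedule. A minimal sketch (assuming an SGDR-style cosine schedule with warm restarts; the function name is hypothetical, parameter values mirror the S2 run): with lr_cycle_steps=50M and cycle_mult=2.0, cycles start at 0, 50M, and 150M, so the jump from lr_min back to lr_max lands near step 150M.

```python
import math

def cosine_restart_lr(step, lr_max=3e-5, lr_min=1e-6,
                      cycle_steps=50_000_000, cycle_mult=2.0):
    """SGDR-style schedule: cosine decay lr_max -> lr_min within each
    cycle; each new cycle is cycle_mult x longer and restarts at lr_max.
    Illustrative sketch, not the training code."""
    start, length = 0, cycle_steps
    while step >= start + length:      # find the cycle containing `step`
        start += length
        length = int(length * cycle_mult)
    t = (step - start) / length        # progress within the cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t))
```

Just before step 150M the LR has decayed to lr_min = 1e-6; at 150M it snaps back to lr_max = 3e-5, the 30× jump in the table above.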
05 The Protocol

A Two-Layer Genetic Algorithm Wrapper

Outer Loop · Genetic Algorithm

1
Initialize Population
12 chromosomes; each encodes a PPO hyperparameter configuration (9 core + 3 optional critic dims).
2
Evaluate Fitness
Inner loop runs PPO self-play for K=1M steps, then measures score/game vs 3× expert (n=3K per point + CRN).
3
Selection
Tournament selection (k=3) fills the mating pool.
4
Crossover + Mutation
SBX (η=20) + polynomial mutation (η=20, p=1/n). Gene-type-aware handling for log-scale continuous vs categorical.
5
Elitism + Next Generation
Top 2 pass directly to next gen. 8 generations total, or stop when best fitness stagnates for 3 generations.
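Steps 1–5 can be sketched on a single real-valued gene; a toy quadratic fitness stands in for the inner-loop PPO evaluation, and all names are illustrative (the real search operates on the full mixed-type chromosome):

```python
import random

rng = random.Random(42)  # fixed seed for reproducibility

def tournament(pop, fits, k=3):
    """Step 3: best of k randomly drawn individuals."""
    best = max(rng.sample(range(len(pop)), k), key=lambda i: fits[i])
    return pop[best]

def sbx(x, y, low, high, eta=20.0):
    """Step 4a: simulated binary crossover for one real gene."""
    u = rng.random()
    beta = (2 * u) ** (1 / (eta + 1)) if u <= 0.5 else (1 / (2 * (1 - u))) ** (1 / (eta + 1))
    clip = lambda v: min(max(v, low), high)
    return (clip(0.5 * ((1 + beta) * x + (1 - beta) * y)),
            clip(0.5 * ((1 - beta) * x + (1 + beta) * y)))

def poly_mutate(x, low, high, eta=20.0):
    """Step 4b: polynomial mutation (p = 1/n is 1 here: one gene)."""
    u = rng.random()
    delta = (2 * u) ** (1 / (eta + 1)) - 1 if u < 0.5 else 1 - (2 * (1 - u)) ** (1 / (eta + 1))
    return min(max(x + delta * (high - low), low), high)

def evolve(fitness, low, high, pop_size=12, gens=8, elite=2):
    """Steps 1-5; `fitness` stands in for the 1M-step PPO inner loop."""
    pop = [rng.uniform(low, high) for _ in range(pop_size)]          # step 1
    for _ in range(gens):
        fits = [fitness(x) for x in pop]                             # step 2
        nxt = [x for _, x in sorted(zip(fits, pop), reverse=True)][:elite]  # step 5
        while len(nxt) < pop_size:
            c1, c2 = sbx(tournament(pop, fits), tournament(pop, fits), low, high)
            nxt += [poly_mutate(c1, low, high), poly_mutate(c2, low, high)]
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

# Toy run: search log10(learning_rate) in [-6, -3] for an assumed
# optimum at -4.5; the real fitness is the tail-minimum score/game.
best = evolve(lambda x: -(x + 4.5) ** 2, -6.0, -3.0)
```

Elitism guarantees the best-so-far individual survives each generation, so the returned value is the best chromosome ever evaluated.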

Inner Loop · PPO Self-Play (per fitness eval)

a
Load BC Checkpoint
Darwinian fresh start: every individual starts from the same BC weights (eliminates weight-inheritance as a hidden variable).
b
Train with Chromosome HPs
256 parallel envs via shared memory; opponent pool snapshots every 200K steps.
c
Snapshot at Tail Points
Save checkpoints at {0.8K, 0.9K, K} for the tail-minimum fitness computation.
d
Eval Each Snapshot
vs 3× expert, n=3K/point; all individuals in a generation share the same 3K seeded games (CRN).
e
Compute Tail-Minimum Fitness
f(c) = min over the 3 tail points. Degradation is penalized immediately.
Core concept: the GA searches not for peak performance alone, but for hyperparameters that force long-term stabilization in the presence of 91.6% irreducible noise.
06 The Search Space

9 Core + 3 Optional Critic Dimensions

Mixed-type encoding: log-scale continuous / linear continuous / integer / categorical.

Zone 1: Core Training Hyperparameters (9 Dimensions)

Gene | Range | Encoding | Design Rationale
learning_rate (cosine init) | [1e-6, 1e-3] | log-scale real | Primary stability lever
lr_cycle_steps | [10M, 100M] | log-scale integer | Directly addresses the S2 collapse cause
lr_min | [1e-7, 1e-5] | log-scale real | Controls restart jump magnitude
clip_range (ε) | [0.05, 0.4] | linear real | Trust region size
entropy_coef | [0.001, 0.1] | log-scale real | Exploration strength (critical under critic-free)
ppo_epochs | [1, 5] | integer | Data reuse per rollout
max_grad_norm | [0.1, 2.0] | linear real | Per-step update cap
minibatch_size | {512, 1024, 2048, 4096, 8192} | categorical | Effective sample size (1024 yields only ~74 independent advantage estimates)
num_steps (per env) | {512, 1024, 2048, 4096} | categorical | Rollout length × update frequency

Zone 2: Optional Critic Design (1 Categorical + 2 Conditional)

Gene | Options / Range | Encoding | Design Rationale
critic_mode | {none, reward_predictor, standalone_critic} | categorical | Let the GA decide whether and how to introduce value signal
gae_lambda | [0.9, 0.99] — active iff critic_mode ≠ none | linear real | Bias-variance tradeoff
discount_gamma | [0.95, 0.999] — active iff reward_predictor enabled | linear real | Bellman φ-shaping
Differentiation from prior GA+PPO work: Studies such as [ARZ+23] assume a fixed critic-based PPO with GAE. This study treats critic design itself as a searchable gene, letting the GA dynamically negotiate the bias-variance tradeoff based on environmental hostility.
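A minimal sketch of how such a mixed-type chromosome might decode into native hyperparameters, including the conditional Zone 2 genes. The gene names and the [0, 1] storage convention are assumptions for illustration, and only a subset of the 12 dimensions is shown; ranges mirror the tables above.

```python
import math

def decode(u):
    """Map a chromosome stored as gene_name -> value in [0, 1]
    onto native hyperparameter types.  Illustrative sketch."""
    log_real = lambda x, lo, hi: 10 ** (math.log10(lo) + x * (math.log10(hi) - math.log10(lo)))
    cat = lambda x, opts: opts[min(int(x * len(opts)), len(opts) - 1)]

    hp = {
        "learning_rate": log_real(u["lr"], 1e-6, 1e-3),                    # log-scale real
        "lr_cycle_steps": int(round(log_real(u["cycle"], 10e6, 100e6))),   # log-scale int
        "clip_range": 0.05 + u["clip"] * (0.4 - 0.05),                     # linear real
        "minibatch_size": cat(u["mb"], [512, 1024, 2048, 4096, 8192]),     # categorical
        "critic_mode": cat(u["critic"], ["none", "reward_predictor", "standalone_critic"]),
    }
    # Zone 2 conditional genes: expressed only when a critic exists.
    if hp["critic_mode"] != "none":
        hp["gae_lambda"] = 0.9 + u["lam"] * (0.99 - 0.9)
    if hp["critic_mode"] == "reward_predictor":
        hp["discount_gamma"] = 0.95 + u["gamma"] * (0.999 - 0.95)
    return hp
```

Keeping genes in [0, 1] lets SBX and polynomial mutation operate uniformly, while the decoder handles log scaling, rounding, categories, and conditional activation.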
07 Fitness Design

The "Tail-Minimum" Fitness Protocol

Our S2 training produced score curves that oscillate: rising past expert, crashing, briefly recovering, crashing again. A GA that only measures the score at the very last step can be fooled by a chromosome that just happens to be mid-rebound at measurement time. We need a fitness that explicitly rewards stability, not luck at the finish line.

Concrete failure mode — imagine two candidates. Candidate α crashes at 70% of training, bounces back at 95%, ends high. Candidate β grows slowly but stays high through the entire tail. Candidate α has higher final score but will almost certainly collapse again when deployed. We want the GA to prefer β.
Vulnerable · Ignores Collapse

Candidate A: Final Score (Standard)

fA(c) = Score(c, at the final step K)

Only looks at the last dot. The mid-training crash is invisible to fitness — so a chromosome that is secretly unstable but lucky at step K wins.
1× eval cost · misleading.

Resilient · Proposed Solution

Candidate B: Tail-Minimum

fB(c) = min{ Score(c, 0.8K), Score(c, 0.9K), Score(c, K) }

Takes the worst of three late checkpoints. Any chromosome that dips during the tail window is punished. One unlucky step is enough to tank its fitness.
3× eval cost · directly penalizes degradation.

Takeaway for this proposal: the fitness function is the GA's definition of "good". By encoding "stable tail" into fitness, we're asking the GA to find hyperparameters that produce deployable models, not hyperparameters that look impressive on a single snapshot — the exact lesson from our S2 incident.
Future work: E5 will compare A vs. B vs. C (explicit degradation penalty) vs. D (NSGA-II multi-objective) vs. E (composite scalar). Requires ~600–800 additional GPU-hours — beyond this study's 150–200 hr budget.
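The α/β scenario above can be made concrete in a few lines (the trajectories are made-up illustrative numbers, not measured data):

```python
def final_score(scores):
    """Candidate A: fitness = score at the final checkpoint only."""
    return scores[-1]

def tail_minimum(scores, tail=3):
    """Candidate B: fitness = worst of the last `tail` checkpoints
    (the snapshots at 0.8K, 0.9K and K)."""
    return min(scores[-tail:])

# Illustrative score trajectories over 5 checkpoints:
alpha = [0.05, 0.10, -0.20, -0.25, 0.12]   # crashes late, lucky rebound at K
beta  = [0.02, 0.06, 0.09, 0.10, 0.10]     # slower, but stable through the tail
```

Candidate A ranks α above β (0.12 vs 0.10); tail-minimum ranks β above α (0.09 vs −0.25), which is the ordering we want the GA to optimize.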
08 Budget Barrier

Variance Reduction & Proxy Evaluation

Common Random Numbers (CRN)

Per-game return variance σ² = 19.19 (from Valuebench). To reliably distinguish individuals whose fitness differs by Δ = 0.10 score/game, independent sampling requires n ≥ 15,000 games per eval.

CRN solution: all individuals in a given generation play on a fixed set of 3,000 wall-shuffling seeds. With luck noise correlation ρ ≈ 0.5, paired-test SE is reduced by √(1−ρ) = 0.71× — equivalent to n = 6K under independent sampling.
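The CRN arithmetic can be checked directly: when two individuals share the same seeds, Var(X − Y) = 2σ²(1 − ρ), so the standard error of the mean paired difference shrinks by √(1 − ρ) relative to independent sampling. A sketch of the formulas only, not the evaluation harness:

```python
import math

def paired_se(var, rho, n):
    """SE of the mean paired difference between two individuals that
    share the same n wall seeds: Var(X - Y) = 2*var*(1 - rho)."""
    return math.sqrt(2 * var * (1 - rho) / n)

def independent_se(var, n):
    """SE of the mean difference under independent sampling."""
    return math.sqrt(2 * var / n)

# Deck numbers: sigma^2 = 19.19, rho = 0.5, n = 3,000 shared seeds.
se_crn = paired_se(19.19, 0.5, 3000)
```

With ρ = 0.5, the 3K-seed paired SE equals the independent-sampling SE at n = 6K, matching the "equivalent to n = 6K" claim.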

Fitness Proxy Budget

We can't afford to fully train every candidate. Each is trained only to 1M steps (≈10% of the step count at which S1-best peaked) as a ranking proxy. The bet: good hyperparameters show their edge early.

K (fitness proxy) | 1M steps
Avg episode length | 13.88 actions
Training games / indiv | ~72,000
Eval games / indiv (3 × 3K) | 9,000
Time / individual | ~2 GPU-hr
Generation (12 indiv) | ~24 GPU-hr
Total (96 evals, 12 × 8) | 150–200 GPU-hr
3–4 weeks on a single RTX 4070 Ti (8 hr/day) completes the full core GA search.
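The budget rows above reduce to a short arithmetic check (pure arithmetic, no project code assumed):

```python
# Worked check of the budget table.
K = 1_000_000                 # proxy training steps per individual
ep_len = 13.88                # avg actions per game
train_games = K / ep_len      # ~72,000 self-play games per individual
eval_games = 3 * 3_000        # (3 x 3K) CRN eval games per individual
evals = 12 * 8                # population x generations = 96 fitness evals
total_hours = evals * 2       # ~2 GPU-hr per individual
```

At 8 hr/day on one GPU, ~192 GPU-hr is 24 working days, i.e. the quoted 3–4 weeks.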
09 Roadmap

Experimental Design & Hypothesis Mapping

Core Experiments (within current scope)

Experiment | Hypothesis | Method
E1 · GA vs. Hand-Tuned Baseline | H1, H2 | Best GA individual vs. current hand-tuned S2 baseline vs. random search; measured via score/game @ 50M steps. Tests whether the outer wrapper prevents S2 degradation and stabilizes the +0.13 edge.
E3 · Proxy Reliability | H4 | Spearman correlation between truncated fitness (K ∈ {0.5M, 1M, 2M}) and full-length training (≥50M). Target ρ > 0.6.
E4 · Sensitivity Analysis | H5 | Per-gene distributions in the converged population; correlation between each gene and best fitness. Tests whether LR schedule dominates stability.
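The E3 proxy-reliability statistic is a plain Spearman rank correlation between the two fitness orderings. A dependency-free sketch (assuming no tied fitness values; the real analysis could equally use `scipy.stats.spearmanr`):

```python
def spearman_rho(xs, ys):
    """Spearman rank correlation (no ties): does the 1M-step proxy
    preserve the full-training ranking of candidates?"""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical proxy vs full-training scores for 5 candidates:
proxy = [0.1, -0.2, 0.05, 0.3, -0.1]
full  = [0.12, -0.25, 0.00, 0.28, -0.05]
rho = spearman_rho(proxy, full)
```

E3 passes if ρ exceeds the 0.6 target across the truncation lengths K ∈ {0.5M, 1M, 2M}.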

Future Work (⏳ additional compute required)

Experiment | Hypothesis | Additional cost / deferral reason
E2 · HPO Method Comparison (GA vs. BO vs. Grid) | H3 | ~1,000 GPU-hr
E5 · Fitness Design Sensitivity Matrix | H6 | ~600–800 GPU-hr
10 Scientific Contributions

Expected Contributions

  1. Empirical · Valuebench Methodology

     A reproducible method for isolating theoretical value-function ceilings in extreme-noise, imperfect-information games via conditional variance decomposition (Law of Total Variance). Pongpong's ceiling = 8.4% explainable variance.

  2. Architectural · Dynamic Critic Design

     Elevates critic design (critic-free / reward predictor / standalone) from a hardcoded assumption to a categorical genetic variable — letting the GA negotiate the bias-variance tradeoff based on environmental hostility.

  3. Methodological · Tail-Minimum GA Framework

     A "Tail-Minimum" fitness framework that rescues deep RL agents from long-term structural collapse — fitness design directly penalizes tail degradation instead of rewarding lucky final snapshots.
The core thesis: When 8.4% of return variance is explainable and 91.6% is irreducible, PPO training success depends not on "finding a better critic" or "training a larger model," but on "finding a training recipe under which an acquired thin edge survives long-term training." The GA searches for exactly that recipe.
Reference deck for Figma re-design. All numbers verified against /home/ray/project/pongpong/figma_data/*.csv.
Source of truth: 09_key_numbers_single_source.csv