01 Parameter Visualization

A Chromosome Is a Mahjong Hand

Every GA chromosome encodes a full PPO training configuration — 12 hyperparameters, one tile each. Drawing a good hand means finding a training recipe that keeps the agent's edge alive through 200M+ steps of self-play noise.

Metaphor: Suits group related hyperparameters (learning-rate family, PPO objective, batch structure, critic design). Tile value within a suit represents position along that gene's search range. The GA "draws" and refines hands until it finds one that survives long-term training.
02 Legend

How to Read the Hand

Each hyperparameter family maps to a mahjong suit. Tile value within the suit = position along the search range.

Man · 萬子: Learning-rate family
learning_rate · lr_cycle_steps · lr_min  |  3 genes
Pin · 筒子: PPO objective family
clip_range · entropy_coef · ppo_epochs  |  3 genes
Sou · 索子: Batch & gradient family
max_grad_norm · minibatch_size · num_steps  |  3 genes
Honor · 風牌: Critic design (optional)
critic_mode · gae_lambda · discount_gamma  |  1 categorical + 2 conditional
The active tile (green border, slightly lifted) shows the current S2 hand-tuned value, or the GA-selected value in §4. Faded tiles are the other candidate positions along that gene's search range.
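The suit grouping in the legend can be written down as a small lookup table. This is only a sketch: the gene names come from the legend, while `SUITS`, `ALL_GENES`, and `suit_of` are hypothetical names chosen for illustration.

```python
# Suit-to-gene grouping copied from the legend.
# SUITS, ALL_GENES and suit_of are illustrative names, not project identifiers.
SUITS = {
    "man": ["learning_rate", "lr_cycle_steps", "lr_min"],        # learning-rate family
    "pin": ["clip_range", "entropy_coef", "ppo_epochs"],         # PPO objective family
    "sou": ["max_grad_norm", "minibatch_size", "num_steps"],     # batch & gradient family
    "honor": ["critic_mode", "gae_lambda", "discount_gamma"],    # critic design (optional)
}

# Flat gene order: one tile per gene.
ALL_GENES = [gene for genes in SUITS.values() for gene in genes]


def suit_of(gene: str) -> str:
    """Return the suit a gene belongs to, or raise KeyError for unknown genes."""
    for suit, genes in SUITS.items():
        if gene in genes:
            return suit
    raise KeyError(gene)


print(len(ALL_GENES))           # 12
print(suit_of("entropy_coef"))  # pin
```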
03 Search Space

12 Genes · Current S2 Hand-Tuned Values

Each row shows the discretized search range (up to 5 positions along the range; categorical and conditional genes show fewer). The highlighted tile is the current hand-tuned value.
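One way to decide which tile to highlight for a given value is to pick the nearest tile, measuring distance in the log domain for log-scale genes. The sketch below assumes that rule; `nearest_tile` is a hypothetical helper, and the tile list is the `learning_rate` row from this section.

```python
import math


def nearest_tile(value, tiles, log_scale=False):
    """Index of the tile closest to `value` (distance in log10 if log_scale)."""
    to_axis = math.log10 if log_scale else (lambda x: x)
    target = to_axis(value)
    return min(range(len(tiles)), key=lambda i: abs(to_axis(tiles[i]) - target))


LR_TILES = [1e-6, 1e-5, 3e-5, 1e-4, 1e-3]  # learning_rate row

print(nearest_tile(3e-5, LR_TILES, log_scale=True))  # 2, the hand-tuned tile
```

The choice of axis matters for values between tiles: under these assumptions, 6e-5 snaps to the 1e-4 tile in log space but to the 3e-5 tile in linear space.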

Man · 萬子: Learning-Rate Family

learning_rate · [1e-6, 1e-3] · log-scale
Cosine-schedule starting LR; primary stability lever.
Tiles: 1e-6 | 1e-5 | 3e-5 ◀ | 1e-4 | 1e-3
current = 3e-5

lr_cycle_steps · [10M, 100M] · log-scale
Cosine warm-restart period; directly tied to the S2 collapse trigger.
Tiles: 10M | 25M | 50M ◀ | 75M | 100M
current = 50M (collapse hotspot)

lr_min · [1e-7, 1e-5] · log-scale
Minimum LR at end of cosine cycle; shapes restart jump magnitude.
Tiles: 1e-7 | 3e-7 | 1e-6 ◀ | 3e-6 | 1e-5
current = 1e-6

Pin · 筒子: PPO Objective Family

clip_range (ε) · [0.05, 0.4] · linear
PPO trust region size; tight clip = conservative updates.
Tiles: 0.05 | 0.1 | 0.2 ◀ | 0.3 | 0.4
current = 0.2

entropy_coef · [0.001, 0.1] · log-scale
Exploration pressure; critical under critic-free setup.
Tiles: 0.001 | 0.003 | 0.01 ◀ | 0.03 | 0.1
current = 0.01

ppo_epochs · [1, 5] · integer
Number of PPO passes per rollout; more = more data reuse.
Tiles: 1 | 2 ◀ | 3 | 4 | 5
current = 2

Sou · 索子: Batch & Gradient Family

max_grad_norm · [0.1, 2.0] · linear
Global gradient clipping threshold.
Tiles: 0.1 | 0.5 ◀ | 1.0 | 1.5 | 2.0
current = 0.5

minibatch_size · {512, 1024, 2048, 4096, 8192}
Critical: at 1024, only ~74 independent advantage estimates per step (critic-free).
Tiles: 512 | 1024 ◀ | 2048 | 4096 | 8192
current = 1024 (low eff-sample)

num_steps (per env) · {512, 1024, 2048, 4096}
Rollout length per env; affects update frequency vs data freshness.
Tiles: 512 | 1024 ◀ | 2048 | 4096
current = 1024

Honor · 風牌: Critic Design (Optional · GA-Searchable)

critic_mode · categorical · 3 options
Let GA decide whether to reintroduce a value signal.
Tiles: 東 / none ◀ | 南 / reward_pred | 西 / standalone
current = none (critic-free)

gae_lambda · [0.9, 0.99] · active if critic_mode ≠ none
GAE bias-variance tradeoff; inactive in current critic-free setup.
Tiles: 0.90 | 0.95 | 0.97 | 0.99
inactive

discount_gamma · [0.95, 0.999] · active if reward_predictor
Only active when reward_predictor shaping is enabled.
Tiles: 0.95 | 0.97 | 0.99 | 0.999
inactive
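The two conditional genes can be handled by keeping all 12 values in the chromosome and masking out the inactive ones before building a training config. The sketch below assumes that the "reward_predictor" condition corresponds to `critic_mode == "reward_pred"`; that mapping, the `active_genes` helper, and the placeholder values for the inactive genes are assumptions for illustration.

```python
def active_genes(chromosome: dict) -> dict:
    """Drop conditional genes that the current critic_mode leaves inactive."""
    mode = chromosome["critic_mode"]
    inactive = set()
    if mode == "none":
        inactive.add("gae_lambda")        # active only if critic_mode != none
    if mode != "reward_pred":
        inactive.add("discount_gamma")    # assumed: active only with reward_pred
    return {k: v for k, v in chromosome.items() if k not in inactive}


# Current S2 hand-tuned values; gae_lambda and discount_gamma are placeholders
# that are carried along but ignored while critic_mode is "none".
hand_1 = {
    "learning_rate": 3e-5, "lr_cycle_steps": 50_000_000, "lr_min": 1e-6,
    "clip_range": 0.2, "entropy_coef": 0.01, "ppo_epochs": 2,
    "max_grad_norm": 0.5, "minibatch_size": 1024, "num_steps": 1024,
    "critic_mode": "none", "gae_lambda": 0.95, "discount_gamma": 0.97,
}

print(len(active_genes(hand_1)))  # 10
```

Keeping inactive genes in the chromosome means crossover can still pass them on, so a later switch of `critic_mode` starts from inherited values rather than fresh random ones.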
04 Sample Chromosomes

Three Possible "Hands"

Each hand is a complete PPO configuration. The GA evolves these hands across generations.

Hand 1 · S2 Hand-Tuned · score: −0.27 / game
lr 3e-5 | cyc 50M | min 1e-6 | clip .2 | ent .01 | epoch 2 | gn .5 | mb 1024 | ns 1024 | none
Degraded hand: collapses around step 150M, loses the edge entirely.
Hand 2 · GA Candidate A (hypothetical)
lr 1e-5 | cyc 75M | min 3e-7 | clip .1 | ent .03 | epoch 2 | gn 1.0 | mb 4096 | ns 2048 | none
Slower, safer updates + bigger batch (higher effective sample size).
Hand 3 · GA Candidate B (hypothetical)
lr 3e-5 | cyc 100M | min 1e-6 | clip .2 | ent .005 | epoch 3 | gn .5 | mb 2048 | ns 1024 | reward_pred | λ=.95 | γ=.97
GA activates reward-predictor critic + longer LR cycle.
Evolution narrative: The GA starts with a random population of 12 hands (12 tiles each), evaluates each hand over 1M training steps, then breeds stronger hands through crossover and mutation. After 8 generations × 12 individuals = 96 hands played, the winning hand remains: stable and edge-preserving.
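The evaluation budget in the narrative (8 generations × 12 individuals = 96 hands) can be checked with a skeleton loop. This is a sketch only: the selection scheme (top half become parents) is an assumption the text does not specify, and the `toy_*` functions are stand-ins so the loop runs without any PPO training.

```python
import random

POP_SIZE = 12      # individuals per generation (from the narrative)
GENERATIONS = 8    # generations (from the narrative)


def evolve(random_hand, evaluate, breed, rng):
    """Generational GA loop; returns (best_score, best_hand, n_evaluations)."""
    population = [random_hand(rng) for _ in range(POP_SIZE)]
    best_score, best_hand, n_evals = float("-inf"), None, 0
    for generation in range(GENERATIONS):
        scored = sorted(((evaluate(h), h) for h in population),
                        key=lambda pair: pair[0], reverse=True)
        n_evals += len(scored)
        if scored[0][0] > best_score:
            best_score, best_hand = scored[0]
        if generation == GENERATIONS - 1:
            break  # nothing to breed after the final evaluation
        # Truncation selection (assumption): the top half become parents.
        parents = [h for _, h in scored[:POP_SIZE // 2]]
        population = [breed(rng.choice(parents), rng.choice(parents), rng)
                      for _ in range(POP_SIZE)]
    return best_score, best_hand, n_evals


# Toy stand-ins: a one-gene "hand" and an arbitrary fitness, not a real result.
def toy_random_hand(rng):
    return {"clip_range": rng.uniform(0.05, 0.4)}


def toy_evaluate(hand):
    return -abs(hand["clip_range"] - 0.2)


def toy_breed(a, b, rng):
    child = 0.5 * (a["clip_range"] + b["clip_range"]) + rng.gauss(0.0, 0.01)
    return {"clip_range": min(max(child, 0.05), 0.4)}


best_score, best_hand, n_evals = evolve(
    toy_random_hand, toy_evaluate, toy_breed, random.Random(0))
print(n_evals)  # 96
```

In a real run, `evaluate` would be the expensive step (a training run per hand), which is why the total count of 96 evaluations is the number that drives the compute budget.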
05 Encoding Types

Mixed-Type Gene Encoding

Log-Scale Continuous

learning_rate · lr_cycle_steps · lr_min · entropy_coef

These genes are evenly spaced in the log domain, so crossover and mutation operate in log-space and the result is mapped back to linear scale. This keeps exploration balanced across orders of magnitude.
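A minimal sketch of what operating in log-space means for a gene such as `learning_rate`. The helper names are hypothetical, and the Gaussian perturbation in `mutate_log` is a stand-in chosen for brevity; the deck does not say which mutation operator is applied to log-scale genes.

```python
import math
import random


def log_grid(low, high, n):
    """n points evenly spaced in log10 between low and high (inclusive)."""
    lo, hi = math.log10(low), math.log10(high)
    return [10.0 ** (lo + (hi - lo) * i / (n - 1)) for i in range(n)]


def log_midpoint(a, b):
    """Midpoint taken in log space, i.e. the geometric mean of a and b."""
    return 10.0 ** (0.5 * (math.log10(a) + math.log10(b)))


def mutate_log(value, low, high, rng, sigma=0.1):
    """Perturb in log10 space (Gaussian step, an assumption), clamp, map back."""
    lo, hi = math.log10(low), math.log10(high)
    u = math.log10(value) + rng.gauss(0.0, sigma * (hi - lo))
    return 10.0 ** min(max(u, lo), hi)


# Log-space midpoint of [1e-6, 1e-3] is about 3.16e-5; the linear midpoint
# would be about 5e-4, which ignores the lower three decades almost entirely.
print(log_midpoint(1e-6, 1e-3))
print(mutate_log(3e-5, 1e-6, 1e-3, random.Random(0)))
```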

Linear Continuous

clip_range · max_grad_norm · gae_lambda · discount_gamma

SBX (Simulated Binary Crossover, η=20) + polynomial mutation (η=20) in original linear space.
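For reference, a compact sketch of SBX and polynomial mutation in their basic textbook forms, with η = 20 as stated above. Clamping to the gene bounds after each operator is an assumption, and the function names are illustrative; an existing GA library may implement bounded variants that differ in detail.

```python
import random


def sbx_beta(u, eta):
    """Spread factor for SBX given a uniform draw u in [0, 1)."""
    if u <= 0.5:
        return (2.0 * u) ** (1.0 / (eta + 1.0))
    return (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))


def sbx(p1, p2, low, high, rng, eta=20.0):
    """Simulated Binary Crossover on one gene; children are clamped to bounds."""
    beta = sbx_beta(rng.random(), eta)
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    clamp = lambda x: min(max(x, low), high)
    return clamp(c1), clamp(c2)


def polynomial_mutation(x, low, high, rng, eta=20.0):
    """Basic polynomial mutation: a small step scaled by the gene's range."""
    u = rng.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
    return min(max(x + delta * (high - low), low), high)


rng = random.Random(0)
# clip_range from Hand 2 (0.1) and Hand 1 (0.2), bounds [0.05, 0.4].
print(sbx(0.1, 0.2, 0.05, 0.4, rng))
print(polynomial_mutation(0.2, 0.05, 0.4, rng))
```

A large η concentrates children near their parents: with η = 20 the spread factor stays close to 1 for most draws, so the operators make small local moves rather than long jumps.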

Integer / Power-of-Two

ppo_epochs · minibatch_size · num_steps

SBX with rounding to the nearest valid integer or power of two. Values are snapped after both crossover and mutation.
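The snapping step might look like the sketch below. Measuring "nearest" in log2 space for the power-of-two genes is an assumption (the text does not say which metric is used), and both helper names are hypothetical.

```python
import math


def snap_int(x, low, high):
    """Round to the nearest integer and clamp to [low, high]."""
    return int(min(max(round(x), low), high))


def snap_to_choices(x, choices):
    """Pick the allowed value nearest to x, measured in log2 space (assumption)."""
    x = max(x, 1.0)  # guard against non-positive inputs before taking log2
    return min(choices, key=lambda c: abs(math.log2(c) - math.log2(x)))


MINIBATCH_SIZES = [512, 1024, 2048, 4096, 8192]

print(snap_int(2.6, 1, 5))                     # 3
print(snap_to_choices(1500, MINIBATCH_SIZES))  # 2048
```

The metric changes the outcome for in-between values: 1500 is closer to 2048 in log2 space but closer to 1024 in linear space, so the choice should be made deliberately.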

Categorical

critic_mode (3 options: 東 / 南 / 西)

Uniform crossover (take one parent's value) + mutation (replace with a random alternative with probability 1/n_genes).
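The categorical operators are simple enough to sketch directly. The option strings come from the `critic_mode` row; the function names and the 50/50 parent choice are assumptions for illustration.

```python
import random

CRITIC_MODES = ["none", "reward_pred", "standalone"]  # 東 / 南 / 西


def uniform_crossover(a, b, rng):
    """Child takes one parent's value, each with probability 0.5 (assumed)."""
    return a if rng.random() < 0.5 else b


def mutate_categorical(value, options, n_genes, rng):
    """With probability 1/n_genes, replace value with a different option."""
    if rng.random() < 1.0 / n_genes:
        return rng.choice([o for o in options if o != value])
    return value


rng = random.Random(0)
child = uniform_crossover("none", "reward_pred", rng)
child = mutate_categorical(child, CRITIC_MODES, n_genes=12, rng=rng)
print(child)
```

With 12 genes, a mutation rate of 1/n_genes means roughly one gene per chromosome changes per mutation pass on average, assuming the same rate is applied to every gene.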

Parameter visualization deck for Figma re-design. Tile assets from pongpong-online/web/src/assets/tiles/.
Search space: see 06_search_space_correct.csv. Chromosome encoding: see 07_ga_execution_plan.csv.