01 System Overview

Pongpong

A Taiwanese 16-tile Mahjong self-play reinforcement-learning platform. Go-powered engine · Transformer policy · Live production deployment with real human players.

idx = 0 → 一萬 (1 Man)
idx = 9 → 一筒 (1 Pin)
idx = 18 → 一索 (1 Sou)
idx = 27 → 東 (East wind)
idx = 31 → 中 (Red dragon)
idx = 34 → 春 (flower)
Every tile is assigned a stable integer idx (0–33 for scored tiles, 34–41 for flowers). The observation tensor and action space both index into this same space — when the policy outputs argmax = 4, it means "discard 五萬".
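As a sketch of this indexing (the authoritative table is pkg/mahjong/tile.go; the helper name and the dragon ordering after 中 are assumptions):

```python
# Hypothetical helper mirroring the idx -> tile mapping described above.
# idx 31 = 中 per the pattern examples; the 發/白 ordering is an assumption.
SUITS = ["Man", "Pin", "Sou"]
HONORS = ["東 East", "南 South", "西 West", "北 North",
          "中 Red", "發 Green", "白 White"]

def tile_name(idx: int) -> str:
    if 0 <= idx <= 26:            # suited tiles: 9 ranks × 3 suits
        return f"{idx % 9 + 1} {SUITS[idx // 9]}"
    if 27 <= idx <= 33:           # winds (27-30) and dragons (31-33)
        return HONORS[idx - 27]
    if 34 <= idx <= 41:           # flowers: bonus tiles, not scored into patterns
        return f"Flower {idx - 33}"
    raise ValueError(f"invalid tile idx {idx}")

print(tile_name(4))   # the argmax = 4 example above: "5 Man" (五萬)
```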
02 Tile Universe

34 Scored Tile Types + 8 Flowers — each with a numeric index

Indices below each tile match pkg/mahjong/tile.go. The integer is the tile — it feeds every observation channel and every action logit.

Man · 萬子 (Characters) — idx 0–8

idx: 0 · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8

Pin · 筒子 (Dots) — idx 9–17

idx: 9 · 10 · 11 · 12 · 13 · 14 · 15 · 16 · 17

Sou · 索子 (Bamboo) — idx 18–26

idx: 18 · 19 · 20 · 21 · 22 · 23 · 24 · 25 · 26

Honors · 字牌 — Winds (idx 27–30) + Dragons (idx 31–33)

idx: 27 · 28 · 29 · 30 · 31 · 32 · 33

Flowers · 花牌 (Bonus) — idx 34–41, drawn & replaced, not scored into patterns

idx: 34 · 35 · 36 · 37 · 38 · 39 · 40 · 41
34 — Scored Tile Types (Man 9 + Pin 9 + Sou 9 + Winds 4 + Dragons 3)
4 — Copies per Scored Tile (136 scored tiles total in the wall)
144 — Wall Size incl. flowers (136 scored + 8 flowers × 1)
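A quick sanity check of this composition (a sketch; indices as defined in §02):

```python
# 144-tile wall: 4 copies of each of the 34 scored types (idx 0-33),
# plus one copy of each of the 8 flowers (idx 34-41).
wall = [idx for idx in range(34) for _ in range(4)] + list(range(34, 42))

assert len(wall) == 144        # 136 scored + 8 flowers
assert wall.count(0) == 4      # four copies of every scored tile
assert wall.count(41) == 1     # flowers are singletons
```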
03 Game Setup

Four Players, 16 Tiles Each

Taiwan 16-tile Mahjong: each player holds 16 tiles (the dealer starts with 17). Draw one, discard one; turns proceed counterclockwise around the table. Flowers are replaced automatically on draw. The observation is rotated so the learner is always rel_seat = 0.

rel_seat 0 · 自己 (self — the Learner)
rel_seat 1 · 下家 (downstream, next to act)
rel_seat 2 · 對家 (across)
rel_seat 3 · 上家 (upstream)
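The rotation can be sketched as a one-liner (function name hypothetical):

```python
def rel_seat(abs_seat: int, learner_seat: int) -> int:
    """Rotate absolute seats so the learner is always rel_seat 0.

    rel 1 = 下家 (downstream), rel 2 = 對家 (across), rel 3 = 上家 (upstream).
    """
    return (abs_seat - learner_seat) % 4

# e.g. if the learner sits in absolute seat 2:
assert rel_seat(2, 2) == 0   # self
assert rel_seat(3, 2) == 1   # downstream
assert rel_seat(1, 2) == 3   # upstream
```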
~14
Learner Actions / Game
Measured from S2 tensorboard (n=736 updates)
~60
Draws per Round
Roughly 15 per seat until wall exhausts
~2.6
Avg Fan on Win
Expert self-play baseline (100K games)
04 Winning Patterns

Hu · 胡牌 = 5 Melds + 1 Eye Pair

17 tiles total: 5 melds (15 tiles) + 1 eye pair (2 tiles). The 17th tile — self-drawn (自摸) or claimed from a discard — completes the final meld or the pair.

Sequence · 順子 (3 consecutive, same suit)
0
1
2
一二三萬 — idx {0, 1, 2}
Triplet · 刻子 (3 copies of the same tile)
13
13
13
五筒 × 3 — idx {13, 13, 13}
Quad · 槓 (4 copies — declared via kan, draws a replacement tile)
22
22
22
22
五索 × 4 — counts as one meld + replacement draw
Eye Pair · 對子 (the "eye")
31
31
中 × 2 — idx {31, 31}
Complete Winning Hand · 一盤胡牌
idx: 0 1 2 · 3 4 5 · 11 12 13 · 20 20 20 · 24 25 26 · 31 31
5 melds (一二三萬 · 四五六萬 · 三四五筒 · 三索刻 · 七八九索) + 1 eye pair (中 × 2) = 17 tiles total
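The 5-melds-plus-a-pair structure can be checked mechanically. Below is a minimal recursive decomposition checker — a sketch only, not the engine's implementation (pkg/mahjong), and it ignores kans and any special hands:

```python
def can_win(counts: list[int], melds_needed: int = 5) -> bool:
    """True if `counts` (34-length tile histogram) splits into
    `melds_needed` melds plus one eye pair."""

    def remove_melds(c: list[int], need: int) -> bool:
        if need == 0:
            return all(v == 0 for v in c)
        i = next((k for k, v in enumerate(c) if v > 0), None)
        if i is None:
            return False
        if c[i] >= 3:                                  # try a triplet (刻子)
            c[i] -= 3
            ok = remove_melds(c, need - 1)
            c[i] += 3
            if ok:
                return True
        # try a sequence (順子): suited tiles only, never across a suit boundary
        if i < 27 and i % 9 <= 6 and c[i + 1] > 0 and c[i + 2] > 0:
            for j in (i, i + 1, i + 2):
                c[j] -= 1
            ok = remove_melds(c, need - 1)
            for j in (i, i + 1, i + 2):
                c[j] += 1
            if ok:
                return True
        return False

    for p in range(34):                                # choose the eye pair
        if counts[p] >= 2:
            counts[p] -= 2
            ok = remove_melds(counts, melds_needed)
            counts[p] += 2
            if ok:
                return True
    return False

# A 17-tile Taiwanese winning hand: 一二三萬 · 四五六萬 · 三四五筒 · 三索刻 · 七八九索 + 中中
hand = [0, 1, 2, 3, 4, 5, 11, 12, 13, 20, 20, 20, 24, 25, 26, 31, 31]
counts = [0] * 34
for t in hand:
    counts[t] += 1
assert can_win(counts)
```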
Fan scoring: base 1 fan. Examples: 門清 (fully concealed, +1) · 自摸 (self-draw, +3) · 大三元 (big three dragons, +8) · 大四喜 (big four winds, +16). Payment = (1 + total_fan) units from each loser. All reward signals to the RL agent are scaled versions of this fan payment (zero-sum across the 4 seats).
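Under the stated rule, settlement can be sketched as follows (simplified: it ignores the dealer 2× multiplier and discarder liability mentioned elsewhere in this deck):

```python
def settle(total_fan: int, winner: int) -> list[int]:
    """Zero-sum settlement sketch: each of the three losers pays
    (1 + total_fan) units to the winner."""
    unit = 1 + total_fan
    deltas = [-unit] * 4
    deltas[winner] = 3 * unit
    return deltas

deltas = settle(total_fan=2, winner=0)
assert deltas == [9, -3, -3, -3]
assert sum(deltas) == 0          # zero-sum across the 4 seats
```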
05 Observation Representation

A Concrete Example — from Game State to 34 × 133 Tensor

The observation is laid out as obs[ch × 34 + tile_idx]. Each of the 34 scored-tile indices is a token; each of the 133 channels is a feature of that token. Below we show a tiny game state and trace exactly which values get written.

Example game state (learner = rel_seat 0)

Learner hand (8 tiles for brevity): idx 0, 0, 0, 4, 4, 13, 31, 31
Own discards (rel_seat 0): idx 8
rel_seat 1 (下家) discards: idx 4, 25
Round wind: 東 (idx 27) · Seat wind: 東 (idx 27) · Dealer: self · Draws left: 56

How it becomes the tensor — channels 0–20 (first per-tile block)

Each row is one channel. Each column is one of the 34 tile tokens. Only the 12 columns with non-zero entries are shown; the remaining 22 columns are all zero in this example. Shading legend: 0.25 · 0.5 · 0.75 · 1.0

tile_idx →                 0     1     2     3     4     5     6     7     8     13    25    31
Hand ≥ 1 (ch 0)            1     0     0     0     1     0     0     0     0     1     0     1
Hand ≥ 2 (ch 1)            1     0     0     0     1     0     0     0     0     0     0     1
Hand ≥ 3 (ch 2)            1     0     0     0     0     0     0     0     0     0     0     0
Hand ≥ 4 (ch 3)            0     0     0     0     0     0     0     0     0     0     0     0
Own discards /4 (ch 4)     0     0     0     0     0     0     0     0     0.25  0     0     0
rel_1 discards /4 (ch 5)   0     0     0     0     0.25  0     0     0     0     0     0.25  0
Visible /4 (ch 20)         0.75  0     0     0     0.75  0     0     0     0.25  0.25  0.25  0.5
Read this: column "idx 0" (一萬) shows ≥1, ≥2, ≥3 all firing — so the learner holds 3 copies of 一萬. Column "idx 4" (五萬) shows ≥1, ≥2 only — so 2 copies. Column "idx 31" (中) shows ≥1, ≥2 — so 2 copies. Channel 20 sums up everything visible: 0.75 for 一萬 (3 in hand), 0.75 for 五萬 (2 in hand + 1 seen in rel_1's discards), etc.
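The traced channels can be reproduced in a few lines (a sketch of the stated obs[ch × 34 + tile_idx] layout; numpy used for brevity):

```python
import numpy as np

hand = [0, 0, 0, 4, 4, 13, 31, 31]       # learner hand from the example
own_discards = [8]
rel1_discards = [4, 25]

obs = np.zeros(133 * 34, dtype=np.float32)

def write(ch: int, tile: int, value: float) -> None:
    obs[ch * 34 + tile] = value           # obs[ch * 34 + tile_idx] layout

counts = np.bincount(hand, minlength=34)
for tile in range(34):
    for k in range(4):                    # ch 0-3: have >=1 / >=2 / >=3 / >=4
        write(k, tile, float(counts[tile] >= k + 1))

for tile in own_discards:                 # ch 4: own discard pile / 4
    write(4, tile, own_discards.count(tile) / 4)

for tile in rel1_discards:                # ch 5: rel_1 discard pile / 4
    write(5, tile, rel1_discards.count(tile) / 4)

visible = counts + np.bincount(own_discards + rel1_discards, minlength=34)
for tile in range(34):                    # ch 20: everything visible / 4
    write(20, tile, visible[tile] / 4)

assert obs[0 * 34 + 0] == 1.0 and obs[2 * 34 + 0] == 1.0   # 3 copies of 一萬
assert obs[20 * 34 + 4] == 0.75                            # 2 in hand + 1 discarded
```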

Complete channel layout (133 total)

ch 0–3 · per-tile — Hand, count-threshold encoding
4 binary layers: have ≥1 / ≥2 / ≥3 / ≥4 of this tile. A redundant-but-useful trick: lets the network read "pair vs triplet vs quad" without doing arithmetic.
ch 4–7 · per-tile — Discard piles, 4 seats
Each cell = (count in that seat's discard pile) / 4. Seats are rotated: ch 4 = own, ch 5 = rel_1 (下家), ch 6 = rel_2 (對家), ch 7 = rel_3 (上家).
ch 8–19 · per-tile — Open / semi-open melds, 4 seats × 3 meld types
For each seat: one channel each for chi / pon / kan tiles. Opponents' concealed kans are broadcast uniformly (their identity stays hidden).
ch 20 · per-tile — Globally visible count
(tiles visible anywhere: all discards + all open melds + own hand) / 4. A single "how many of this tile are still in play" feature.
ch 21–23 · broadcast — Last-discard source flag
One-hot over rel_1 / rel_2 / rel_3 indicating who just discarded. Ch 68 (per-tile) says what they discarded.
ch 24 · per-tile — Round wind one-hot
Set on one of idx 27–30 (東南西北, East/South/West/North). Indicates the current round wind.
ch 25 · per-tile — Seat wind one-hot
Set on idx 27–30 — the agent's seat-specific wind (可加台, worth extra fan).
ch 26 · broadcast — Dealer flag
1 across all 34 tokens if the learner is the current dealer, else 0. The dealer wins/loses at 2× the score.
ch 27 · broadcast — Draws left / 80
How deep into the wall we are. Controls risk tolerance (defend harder late).
ch 28 · broadcast — First-round flag
1 if still in the dealer's first draw cycle (for 天胡 / 地胡 heavenly/earthly-hand eligibility).
ch 29 · broadcast — After-kan flag
1 if the current draw came from a kan replacement (for 槓上開花, win after kan).
ch 30–33 · broadcast — Flower counts per seat / 8
rel_0..3 flower counts. Flowers auto-draw replacements and contribute fan at scoring time.
ch 34–45 · per-tile — Meld source (who contributed)
4 seats × 3 relative from-directions. Marks which tile, and from whom, each open meld was claimed.
ch 46 · broadcast — 門清 (fully concealed) flag
1 if no open melds — enables the +1 fan bonus for winning without calls.
ch 47–50 · broadcast — Suit ratios (Man / Pin / Sou / Honor)
Fraction of the learner's total tiles in each suit. Lets the network quickly spot a one-suit push (清一色).
ch 51, 52 · broadcast — Concealed-triplet count / pair count
Normalized counts of "≥3" and "≥2" in hand. A quick structural summary.
ch 53–59 · broadcast — Dragon + wind tile counts / 4
Per-honor counts (hand + melds): dragons in 53–55, winds in 56–59. For 三元 / 四喜 (dragon-set / wind-set) reasoning.
ch 60 · broadcast — Chi count / 4
How many 順子 (sequence) melds the learner has already called.
ch 61 · broadcast — Current shanten / 8
Minimum tile swaps to reach tenpai. 0 = tenpai, 1 = one away, etc.
ch 62, 63 · per-tile — Shanten-if-discard / ukeire-if-discard
For each tile: the resulting shanten and tile acceptance (ukeire) if that tile were discarded. This is the key signal for discard choice.
ch 64 · per-tile — Tenpai wait mask
1 on each tile the learner is waiting on (only set when current shanten = 0).
ch 65–67 · broadcast — Opponent hand sizes / 16
rel_1 / rel_2 / rel_3 hand counts. Reveals whether an opponent has recently called (shrinking their hand).
ch 68 · per-tile — Last-discard identity one-hot
1 on the tile that was just discarded (available for pon/chi/win).
ch 69–132 · per-tile — Discard history (last 16 turns × 4 seats = 64 channels)
One channel per seat per discard slot. Value = 0 (no discard), 1 (手切 tedashi, discarded from hand), or 2 (摸切 tsumogiri, discarded the fresh draw). Gives the transformer temporal context to infer opponents' intentions.
Total: 69 game-state channels + 64 discard-history channels = 133 channels × 34 tile tokens = 4,522-dim observation. Two scopes coexist: per-tile channels write real per-tile values; broadcast channels copy the same scalar across all 34 tokens (treated as a "global feature for this step"). The transformer reshapes this to (34, 133) and runs self-attention over tile tokens.
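The two scopes and the final (34, 133) token view can be sketched as follows (channel choices are illustrative; values taken from the §05 example):

```python
import numpy as np

obs = np.zeros((133, 34), dtype=np.float32)    # (channel, tile) layout

# Per-tile channel: genuinely different value per tile token,
# e.g. ch 64 tenpai wait mask (waits on idx 3 and 6 here, illustrative).
obs[64, [3, 6]] = 1.0

# Broadcast channel: one scalar copied across all 34 tokens,
# e.g. ch 27 draws left / 80 with 56 draws remaining.
obs[27, :] = 56 / 80

tokens = obs.T                                  # (34, 133): one feature vector per tile token
assert tokens.shape == (34, 133)
assert np.allclose(tokens[:, 27], 0.7)          # broadcast value visible to every token
```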
06 Action Space

109 Discrete Actions

The model outputs a 109-dim logit vector. An action mask removes illegal moves before the softmax (their logits are set to −∞), so the policy can never emit an illegal action.

0–33 — Discard (one action per tile idx) · 34 actions
34 — Hu · 胡牌 · 1 action — declare a winning hand (self-drawn 自摸 or off an opponent's discard 放槍).
35 — Pass · 過 · 1 action — decline any available reaction; let the turn proceed.
36 — Pon · 碰 · 1 action — triplet from an opponent's discard (e.g. hold idx 13 × 2, claim the third).
37 — MinKan · 明槓 · 1 action — quad from an opponent's discard (e.g. hold idx 22 × 3, claim the fourth).
38–40 — Chi · 吃 · 3 actions — sequence from upstream's discard; the three actions place the claimed tile as the low / mid / high member (e.g. claim into 4-5-6 as the 4, the 5, or the 6).
41–74 — AnKan · 暗槓 · 34 actions — concealed quad from hand (one per tile idx).
75–108 — KaKan · 加槓 · 34 actions — upgrade an open pon by adding the 4th copy from hand.
Total: 34 + 1 + 1 + 1 + 1 + 3 + 34 + 34 = 109.
Action masking: illegal action logits are set to −∞ before softmax. The number of legal actions varies by step type: on a reaction step, typically 2–10; on a discard step, one legal discard per distinct hand tile (typically 4–14 distinct tiles).
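The masking step can be sketched without the model (numpy here; the trainer's PyTorch equivalent is masked_fill(~legal_mask, −inf) followed by softmax):

```python
import numpy as np

def masked_softmax(logits: np.ndarray, legal: np.ndarray) -> np.ndarray:
    """Set illegal logits to -inf, then softmax — illegal actions
    receive exactly zero probability mass."""
    masked = np.where(legal, logits, -np.inf)
    z = masked - masked.max()          # numerically stable softmax
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
logits = rng.normal(size=109)
legal = np.zeros(109, dtype=bool)
legal[[0, 4, 13, 31, 35]] = True       # e.g. a few discards + Pass (action 35)

p = masked_softmax(logits, legal)
assert np.isclose(p.sum(), 1.0)
assert p[~legal].sum() == 0.0          # no mass on illegal actions
```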
07 Model Architecture

MahjongTransformer · 10M Parameters

Self-attention over the 34 tile tokens lets the policy reason about arbitrary tile-to-tile relationships (sequences, triplets, honor pairs, defensive patterns). Below: the full tensor pipeline, attention visualization, and inside-the-layer breakdown.

Input
34 × 133
reshaped 4,522-dim observation
Transformer Encoder
4L × 8H × d=256
pre-norm; ff_dim=1024; no dropout
Output
109 logits
flatten(34 × 256) → Linear → masked softmax

7.1 · Full Tensor Pipeline

Each block is a tensor labeled with its shape (batch dim B omitted). Each dashed box between tensors is the operation that transforms one into the next.

(B, 4522)
raw observation — flat 1-D vector
.view(−1, 133, 34)
(B, 133, 34)
restore (channel, tile) layout
.transpose(1, 2)
(B, 34, 133)
34 tokens × 133 features each
input_proj : Linear(133 → 256), shared across all 34 tokens
(B, 34, 256)
tokens projected to d_model = 256
+ pos_embed (1, 34, 256) — learnable per-tile position
(B, 34, 256)
token embeddings ready for attention
encoder — 4 × TransformerEncoderLayer (8 heads, ff_dim=1024)
(B, 34, 256)
contextualized token representations
LayerNorm
(B, 34, 256)
normalized output
.flatten(1) — concatenate all 34 tokens
(B, 8704)
34 × 256 = 8704 features
policy_head : Linear(8704 → 109)
(B, 109)
raw action logits
masked_fill(~legal_mask, −∞) → softmax
(B, 109)
legal action distribution
```python
# python/pongpong_client/model.py — MahjongTransformer._encode
def _encode(self, obs: torch.Tensor) -> torch.Tensor:
    x = obs.view(-1, 133, 34).transpose(1, 2)  # (B, 34, 133)
    x = self.input_proj(x) + self.pos_embed    # (B, 34, 256)
    x = self.encoder(x)                        # (B, 34, 256)
    return self.norm(x)                        # (B, 34, 256)
```
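For context, the whole §7.1 pipeline fits in one small module. The sketch below follows the stated shapes and hyper-parameters (dropout 0, pre-norm, ff_dim=1024); class and attribute names mirror the snippet above, but the actual file may differ:

```python
import torch
import torch.nn as nn

class MahjongTransformer(nn.Module):
    """Minimal sketch of the §7.1 architecture (names illustrative)."""

    def __init__(self, d_model: int = 256, n_layers: int = 4, n_heads: int = 8):
        super().__init__()
        self.input_proj = nn.Linear(133, d_model)              # shared per-token projection
        self.pos_embed = nn.Parameter(torch.zeros(1, 34, d_model))
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=1024, dropout=0.0,
            batch_first=True, norm_first=True)                 # pre-norm, no dropout
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.norm = nn.LayerNorm(d_model)
        self.policy_head = nn.Linear(34 * d_model, 109)

    def forward(self, obs: torch.Tensor, legal_mask: torch.Tensor) -> torch.Tensor:
        x = obs.view(-1, 133, 34).transpose(1, 2)              # (B, 34, 133)
        x = self.input_proj(x) + self.pos_embed                # (B, 34, 256)
        x = self.norm(self.encoder(x))                         # (B, 34, 256)
        logits = self.policy_head(x.flatten(1))                # (B, 109)
        return logits.masked_fill(~legal_mask, float("-inf"))  # ready for softmax
```

With these hyper-parameters the module lands near the ~4.2 M parameters tallied in §7.5.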

7.2 · Token = Tile Type, Not Time Step

Unlike NLP Transformers where tokens are words and sequence length varies, here the sequence length is always 34. Each token corresponds to one scored tile type. The 133-dim per-token feature vector packs "what this tile means in the current state" — hand counts, discards, meld participation, shanten-if-discarded, etc.

34 input tokens — one per scored tile idx (0 … 33), each carrying its 133-d feature vector
↓ Linear(133 → 256) · shared weights across all 34 tokens
tok₀ … tok₃₃ — each projected to 256-d
↓ + pos_embed (34 × 256) → enters encoder

7.3 · Self-Attention Across 34 Tile Tokens

At every encoder layer, each token computes attention weights over all other tokens. Below shows the query token 五萬 (idx 4) and its attention distribution — thicker edges = higher attention.

[Attention figure: query token tok₄ (五萬) fans out to all 34 key tokens. Thick edges = strong attention to its sequence partners (四萬, 六萬); thin edges = weak attention to unrelated tiles.]
What self-attention learns automatically: without hand-coded rules, the network learns that 五萬 should attend strongly to 三萬, 四萬, 六萬, 七萬 (neighbors for sequence building), to its own column (hand-count features), and to recent discards of the same tile (defensive reasoning). Honor tiles learn to attend to their triplet partners (三元 / 風). These patterns are what MLP / CNN architectures fail to capture efficiently.

7.4 · Inside One Encoder Layer (× 4 stacked)

Pre-norm Transformer block with residual connections. Each of the 4 stacked layers has identical structure; parameters are independent.

Input (B, 34, 256)
↓ LayerNorm
Multi-Head Self-Attention 8 heads × 32d = 256
↓ Dropout → + residual
↓ LayerNorm
Feed-Forward : Linear(256 → 1024) → GELU → Linear(1024 → 256)
↓ Dropout → + residual
Output (B, 34, 256)
MULTI-HEAD ATTENTION
The 256-d token is split into 8 heads of 32-d each. Each head computes its own Q/K/V projection and attention softmax, producing 8 independent "views" of tile relationships. Results are concatenated back to 256-d.
params per layer ≈ 4 × (256 × 256) = 262K
FEED-FORWARD
Position-wise MLP: expands each token to 1024-d, applies GELU, projects back to 256-d. Operates independently per token — no cross-token mixing here (attention did that).
params per layer ≈ 2 × (256 × 1024) = 524K

7.5 · Parameter Budget

Component            Shape                                        Params
input_proj           Linear(133 → 256)                            34 K
pos_embed            Param(1, 34, 256)                            9 K
encoder × 4 layers   attn 262 K + FFN 524 K + norms, per layer    ~3.2 M
LayerNorm (final)    Norm(256)                                    0.5 K
policy_head          Linear(8704 → 109)                           949 K
Total                                                             ~4.2 M
Note: the S2 self-play run uses d_model=256 plus an auxiliary danger_head; with the older, wider configs (d=384, 6 layers) the parameter count reaches the ~10M figure quoted elsewhere. The structure is identical.
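The table's numbers can be reproduced with back-of-envelope arithmetic (biases and LayerNorm parameters included; exact values assume the standard PyTorch layer parameterization):

```python
d, ff, layers = 256, 1024, 4

input_proj = 133 * d + d                 # 34,304  ≈ 34 K
pos_embed = 34 * d                       # 8,704   ≈ 9 K
attn = 4 * d * d + 4 * d                 # Q/K/V/out projections ≈ 262 K
ffn = 2 * d * ff + ff + d                # two linears ≈ 524 K
norms = 2 * 2 * d                        # two LayerNorms per layer
encoder = layers * (attn + ffn + norms)  # ≈ 3.2 M
final_norm = 2 * d
policy_head = 34 * d * 109 + 109         # 948,845 ≈ 949 K

total = input_proj + pos_embed + encoder + final_norm + policy_head
assert 4.0e6 < total < 4.3e6             # ≈ 4.2 M, matching the table
```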
Architecture choice rationale: three architectures were tested at similar parameter scale: MLP, CNN, Transformer. Only the Transformer reached expert-parity. Self-attention's ability to capture any tile-to-tile relationship (cross-suit globality + long-range defense reasoning) is the key unlock for Mahjong decision-making — and it is the single most important design lever behind the S1-best milestone.
08 Training Infrastructure

256 Parallel Environments · 3,000 Steps/Second

Go Engine
256 envs
shared-memory gRPC, complete Taiwan 16-tile rules
Python PPO Trainer
PyTorch + AMP
opponent pool, rollout, update loop
Live Deployment
pongpong-online
Web + WebSocket, 2,814 human matches
3,000
Steps / Second
Measured mean on single RTX 4070 Ti
202M+
Total Training Steps
S2 self-play accumulated (ongoing)
2,814
Real Human Matches
40-day live deployment, 10 users
Training pipeline stages: Behavioral Cloning (BC) from expert traces → S1 self-play (curriculum against growing opponent pool) → S2 self-play (critic-free, leave-one-out baseline). Full pipeline detailed in the Incident-trajectory deck.
Pongpong system visualization · tile assets from pongpong-online/web/src/assets/tiles/
Channel layout verified against pkg/mahjong/observe.go (total = 69 + 4×16 = 133).
Companion deck to ppt_reference_en.html (the Incident narrative).