A Taiwanese 16-tile Mahjong self-play reinforcement-learning platform.
Go-powered engine · Transformer policy · Live production deployment with real human players.
idx = 0 → 一萬
idx = 9 → 一筒
idx = 18 → 一索
idx = 27 → 東
idx = 31 → 中
idx = 34 → 春 (花)
Every tile is assigned a stable integer idx (0–33 for scored tiles, 34–41 for flowers).
The observation tensor and action space both index into this same space — when the policy outputs
argmax = 4, it means "discard 五萬".
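As a minimal sketch of this indexing scheme (the helper name and dict below are illustrative, not the actual API of pkg/mahjong/tile.go, which is Go), the suit bases listed above can be expressed as:

```python
# Illustrative tile-index mapping; bases taken from the tile table above.
# tile_idx() is a hypothetical helper, not an engine identifier.
SUIT_BASE = {"man": 0, "pin": 9, "sou": 18, "wind": 27, "dragon": 31, "flower": 34}

def tile_idx(suit: str, rank: int) -> int:
    """Stable integer index for a tile, e.g. ("man", 5) -> 4 (五萬)."""
    return SUIT_BASE[suit] + (rank - 1)

assert tile_idx("man", 1) == 0      # 一萬
assert tile_idx("pin", 1) == 9      # 一筒
assert tile_idx("sou", 1) == 18     # 一索
assert tile_idx("wind", 1) == 27    # 東
assert tile_idx("dragon", 1) == 31  # 中
assert tile_idx("man", 5) == 4      # 五萬, the argmax = 4 example above
```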
02 · Tile Universe
34 Scored Tiles + 8 Flowers — each with a numeric index
Indices below each tile match pkg/mahjong/tile.go. The integer is the tile — it feeds every observation channel and every action logit.
Flowers · 花牌 bonus tiles (idx 34–41, drawn & replaced — not scored into patterns)
34 春 · 35 夏 · 36 秋 · 37 冬 · 38 梅 · 39 蘭 · 40 菊 · 41 竹
34 Scored Tile Types · Man 9 + Pin 9 + Sou 9 + Winds 4 + Dragons 3
4 Copies per Scored Tile · 136 scored tiles total in the wall
144 Wall Size (incl. flowers) · 136 scored + 8 flowers × 1
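The counts above can be checked in a few lines (a sketch, not the engine's actual wall-building code):

```python
# 34 scored tile types x 4 copies, plus 8 unique flower tiles.
scored = [idx for idx in range(34) for _ in range(4)]   # 136 scored tiles
flowers = list(range(34, 42))                           # 春夏秋冬梅蘭菊竹, 1 copy each
wall = scored + flowers

assert len(scored) == 136
assert len(flowers) == 8
assert len(wall) == 144                                 # full wall incl. flowers
```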
03 · Game Setup
Four Players, 16 Tiles Each
Taiwan 16-tile Mahjong: each player holds 16 tiles (dealer 17). Draw one, discard one, rotate clockwise. Flowers auto-replace. The observation is rotated so the learner is always rel_seat = 0.
Fan scoring: base 1 fan. Examples: 門清 (+1) · 自摸 (+3) · 大三元 (+8) · 大四喜 (+16). Payment = (1 + total_fan) per unit from each loser. All reward signals to the RL agent are scaled versions of this fan payment (zero-sum across 4 seats).
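A hedged sketch of the payment rule, using only the example fan values quoted above (the full fan table lives in the Go engine; dealer 2× multipliers and other scoring details are omitted here):

```python
# Example fan values from the text; the real table is much larger.
FAN = {"門清": 1, "自摸": 3, "大三元": 8, "大四喜": 16}

def payment_units(patterns):
    """Units each loser pays: 1 + total fan (the base fan is the leading 1)."""
    return 1 + sum(FAN[p] for p in patterns)

# Self-drawn (自摸) concealed (門清) win: all three other seats pay the winner.
units = payment_units(["門清", "自摸"])          # 1 + (1 + 3) = 5 units per loser
rewards = [3 * units, -units, -units, -units]    # winner in seat 0

assert units == 5
assert sum(rewards) == 0                         # zero-sum across the 4 seats
```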
05 · Observation Representation
A Concrete Example — from Game State to 34 × 133 Tensor
The observation is laid out as obs[ch × 34 + tile_idx]. Each of the 34 scored-tile indices is a token; each of the 133 channels is a feature of that token. Below we show a tiny game state and trace exactly which values get written.
How it becomes the tensor — channels 0–20 (first per-tile block)
Each row is one channel. Each column is one of the 34 tile tokens. Only the first 12 columns shown below; the other 22 are all zero in this example.
Colours: 0.25 · 0.5 · 0.75 · 1.0
| ch \ tile_idx            | 0    | 1 | 2 | 3 | 4    | 5 | 6 | 7 | 8    | 13   | 25   | 31  |
|--------------------------|------|---|---|---|------|---|---|---|------|------|------|-----|
| Hand ≥ 1 (ch 0)          | 1    | 0 | 0 | 0 | 1    | 0 | 0 | 0 | 0    | 1    | 0    | 1   |
| Hand ≥ 2 (ch 1)          | 1    | 0 | 0 | 0 | 1    | 0 | 0 | 0 | 0    | 0    | 0    | 1   |
| Hand ≥ 3 (ch 2)          | 1    | 0 | 0 | 0 | 0    | 0 | 0 | 0 | 0    | 0    | 0    | 0   |
| Hand ≥ 4 (ch 3)          | 0    | 0 | 0 | 0 | 0    | 0 | 0 | 0 | 0    | 0    | 0    | 0   |
| Own discards /4 (ch 4)   | 0    | 0 | 0 | 0 | 0    | 0 | 0 | 0 | 0.25 | 0    | 0    | 0   |
| rel_1 discards /4 (ch 5) | 0    | 0 | 0 | 0 | 0.25 | 0 | 0 | 0 | 0    | 0    | 0.25 | 0   |
| Visible /4 (ch 20)       | 0.75 | 0 | 0 | 0 | 0.75 | 0 | 0 | 0 | 0.25 | 0.25 | 0.25 | 0.5 |
Read this: column "idx 0" (一萬) shows ≥1, ≥2, ≥3 all firing — so the learner holds 3 copies of 一萬.
Column "idx 4" (五萬) shows ≥1, ≥2 only — so 2 copies. Column "idx 31" (中) shows ≥1, ≥2 — so 2 copies.
Channel 20 sums up everything visible: 0.75 for 一萬 (3 in hand), 0.75 for 五萬 (2 in hand + 1 seen in rel_1's discards), etc.
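The worked columns above can be reproduced in a few lines; the dict and function names below are illustrative, not engine identifiers:

```python
# The tiny game state from the example (tile_idx -> count).
hand = {0: 3, 4: 2, 13: 1, 31: 2}    # 一萬 x3, 五萬 x2, 五筒 x1, 中 x2
own_discards = {8: 1}                # 九萬 discarded by the learner
rel1_discards = {4: 1, 25: 1}        # 五萬 and 八索 discarded by rel_1

def threshold_channels(counts, tile):
    """Channels 0-3: have >= 1 / >= 2 / >= 3 / >= 4 copies of this tile."""
    n = counts.get(tile, 0)
    return [1.0 if n >= k else 0.0 for k in (1, 2, 3, 4)]

def visible(tile):
    """Channel 20: all copies visible anywhere (hand + all discards) / 4."""
    n = (hand.get(tile, 0) + own_discards.get(tile, 0)
         + rel1_discards.get(tile, 0))
    return n / 4.0

assert threshold_channels(hand, 0) == [1, 1, 1, 0]    # 3 copies of 一萬
assert threshold_channels(hand, 4) == [1, 1, 0, 0]    # 2 copies of 五萬
assert threshold_channels(hand, 31) == [1, 1, 0, 0]   # 2 copies of 中
assert visible(0) == 0.75 and visible(4) == 0.75 and visible(31) == 0.5
```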
Complete channel layout (133 total)
ch 0–3
Hand — count-threshold encoding (per-tile)
4 binary layers: have ≥1 / ≥2 / ≥3 / ≥4 of this tile. Redundant-but-useful trick: lets the network read "pair vs triplet vs quad" without doing arithmetic.
ch 4–7
Discard piles, 4 seats (per-tile)
Each cell = (count in that seat's discard pile) / 4. Seats are rotated: ch 4 = own, ch 5 = rel_1 (下家), ch 6 = rel_2 (對家), ch 7 = rel_3 (上家).
ch 8–19
Open / semi-open melds, 4 seats × 3 meld types (per-tile)
For each seat: one channel each for chi / pon / kan tiles. Concealed kans of opponents are broadcast uniformly (hidden identity).
ch 20
Globally visible count (per-tile)
(tiles visible everywhere: all discards + all open melds + own hand) / 4. A single "how many of this tile are still in play" feature.
ch 21–23
Last-discard source flag (broadcast)
One-hot over rel_1 / rel_2 / rel_3 indicating who just discarded. Ch 68 (per-tile) says what they discarded.
ch 24
Round wind one-hot (per-tile)
Set on one of idx 27–30 (東南西北). Indicates current round wind.
ch 25
Seat wind one-hot (per-tile)
Set on one of idx 27–30 — the agent's seat-specific wind (可加台: it can add fan at scoring).
ch 26
Dealer flag (broadcast)
1 across all 34 tokens if the learner is the current dealer, else 0. Dealer loses/wins 2× score.
ch 27
Draws left / 80 (broadcast)
How deep into the wall we are. Controls risk tolerance (defend harder late).
ch 28
First round flag (broadcast)
1 if still in the dealer's first draw cycle (for 天胡 / 地胡 — "heavenly" / "earthly" hand — eligibility).
ch 29
After-kan flag (broadcast)
1 if the current draw came from a kan replacement (for 槓上開花, winning on the replacement tile after a kan).
ch 30–33
Flower counts per seat / 8 (broadcast)
rel_0..3 flower counts. Flowers auto-draw replacements and contribute fan at scoring time.
ch 34–45
Meld source — who contributed (per-tile)
4 seats × 3 relative from-directions. Marks which tile, and from whom, each open meld was claimed.
ch 46
門清 (concealed) flag (broadcast)
1 if no open melds — enables +1 fan bonus for winning without calls.
ch 47–50
Suit ratios Man / Pin / Sou / Honor (broadcast)
Fraction of learner's total tiles in each suit. Lets the network quickly see "going one-suit (清一色)".
ch 51, 52
Concealed-triplet count / Pair count (broadcast)
Normalized counts of "≥3" and "≥2" in hand. Quick structural summary.
ch 69–132
Discard history, 4 seats × 16 slots
For each seat and each of the last 16 discard slots, one channel. Value = 0 (not discarded), 1 (手切, tedashi — discarded from hand), or 2 (摸切, tsumogiri — discarded the freshly drawn tile). This gives the transformer temporal context for inferring opponents' intentions.
Total: 69 game-state channels + 64 discard-history channels = 133 channels × 34 tile tokens = 4,522-dim observation.
Two scopes coexist: per-tile channels write real per-tile values; broadcast channels copy the same scalar across all 34 tokens (treated as a "global feature for this step"). The transformer reshapes this to (34, 133) and runs self-attention over tile tokens.
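A minimal sketch of the two scopes under the obs[ch × 34 + tile_idx] layout (the writer helpers below are illustrative, not engine identifiers):

```python
C, T = 133, 34                       # channels, tile tokens
obs = [0.0] * (C * T)                # flat 4,522-dim observation

def write_per_tile(ch, tile, value):
    """Per-tile scope: one real value per token."""
    obs[ch * T + tile] = value

def write_broadcast(ch, value):
    """Broadcast scope: the same scalar copied across all 34 tokens."""
    for tile in range(T):
        obs[ch * T + tile] = value

write_per_tile(0, 4, 1.0)            # "have >= 1 copy of 五萬" (ch 0)
write_broadcast(26, 1.0)             # dealer flag (ch 26)

# The token view the transformer sees: token t = its 133-dim feature column.
token_4 = [obs[ch * T + 4] for ch in range(C)]
assert len(token_4) == C and token_4[0] == 1.0 and token_4[26] == 1.0
assert all(obs[26 * T + t] == 1.0 for t in range(T))   # broadcast hit every token
```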
06 · Action Space
109 Discrete Actions
Model outputs a 109-dim logit vector. An action mask zeros out illegal moves before softmax, so the policy can never emit an illegal action.
| Action IDs | Action | Count |
|------------|--------|-------|
| 0–33 | Discard — one action per tile idx | 34 |
| 34 | Hu · 胡牌 — declare a winning hand (self-drawn 自摸 or from an opponent's discard 放槍) | 1 |
| 35 | Pass · 過 — decline any available reaction; let the turn proceed | 1 |
| 36 | Pon · 碰 — triplet from an opponent's discard | 1 |
| 37 | MinKan · 明槓 — quad from an opponent's discard | 1 |
| 38–40 | Chi · 吃 — sequence from upstream's discard, 3 positions: low / mid / high | 3 |
| 41–74 | AnKan · 暗槓 — concealed quad from hand, one per tile idx | 34 |
| 75–108 | KaKan · 加槓 — upgrade an open pon by adding the 4th copy from hand | 34 |
Action masking: illegal action logits are set to −∞ before softmax. Legal-action count per step is typically 2–10 (usually discard-only: 1 legal discard per distinct hand tile, typically 4–14 distinct tiles).
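The masking step can be sketched as follows (a pure-Python softmax for clarity; the actual trainer applies this to logit tensors):

```python
import math

def masked_softmax(logits, legal):
    """Set illegal logits to -inf, then softmax: illegal actions get zero mass."""
    masked = [x if ok else -math.inf for x, ok in zip(logits, legal)]
    m = max(masked)                              # subtract max for stability
    exps = [math.exp(x - m) for x in masked]     # exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 0.5, -1.0, 3.0]
legal = [True, False, True, False]               # e.g. only two discards legal
probs = masked_softmax(logits, legal)

assert probs[1] == 0.0 and probs[3] == 0.0       # illegal actions never sampled
assert abs(sum(probs) - 1.0) < 1e-9              # still a valid distribution
```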
07 · Model Architecture
MahjongTransformer · 10M Parameters
Self-attention over the 34 tile tokens lets the policy reason about arbitrary tile-to-tile relationships (sequences, triplets, honor pairs, defensive patterns). Below: the full tensor pipeline, attention visualization, and inside-the-layer breakdown.
Input
34 × 133
reshaped 4,522-dim observation
→
Transformer Encoder
4L × 8H × d=256
pre-norm; ff_dim=1024; no dropout
→
Output
109 logits
flatten(34 × 256) → Linear → masked softmax
7.1 · Full Tensor Pipeline
Each block is a tensor labeled with its shape (batch dim B omitted). Each dashed box between tensors is the operation that transforms one into the next.
(B, 4522)
raw observation — flat 1-D vector
.view(−1, 133, 34)
(B, 133, 34)
restore (channel, tile) layout
.transpose(1, 2)
(B, 34, 133)
34 tokens × 133 features each
input_proj : Linear(133 → 256), shared across all 34 tokens
(B, 34, 256)
tokens projected to d_model = 256
+ pos_embed (1, 34, 256) — learnable per-tile position
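The pipeline above, sketched with stock PyTorch (module names follow the deck's labels; this is not the actual MahjongTransformer source):

```python
import torch
import torch.nn as nn

B = 2
input_proj = nn.Linear(133, 256)                    # shared across all 34 tokens
pos_embed = nn.Parameter(torch.zeros(1, 34, 256))   # learnable per-tile position

obs = torch.randn(B, 4522)           # (B, 4522)   flat observation
x = obs.view(-1, 133, 34)            # (B, 133, 34) restore (channel, tile) layout
x = x.transpose(1, 2)                # (B, 34, 133) 34 tokens x 133 features
x = input_proj(x) + pos_embed        # (B, 34, 256) enters the encoder
assert x.shape == (B, 34, 256)
```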
Unlike NLP Transformers where tokens are words and sequence length varies, here the sequence length is always 34. Each token corresponds to one scored tile type. The 133-dim per-token feature vector packs "what this tile means in the current state" — hand counts, discards, meld participation, shanten-if-discarded, etc.
[Token diagram] 34 raw tokens (tile_idx 0–33, each a 133-dim feature column) → Linear(133 → 256), shared weights across all 34 tokens → tok0 … tok33, each 256-d → + pos_embed (34 × 256) → enters encoder.
7.3 · Self-Attention Across 34 Tile Tokens
At every encoder layer, each token computes attention weights over all other tokens. Below shows the query token 五萬 (idx 4) and its attention distribution — thicker edges = higher attention.
strong attention (sequence / triplet partners)
weak attention (unrelated tiles)
What self-attention learns automatically: without hand-coded rules, the network learns that 五萬 should attend strongly to 三萬, 四萬, 六萬, 七萬 (neighbors for sequence building), to its own column (hand-count features), and to recent discards of the same tile (defensive reasoning). Honor tiles learn to attend to their triplet partners (三元 / 風). These patterns are what MLP / CNN architectures fail to capture efficiently.
7.4 · Inside One Encoder Layer (× 4 stacked)
Pre-norm Transformer block with residual connections. Each of the 4 stacked layers has identical structure; parameters are independent.
MULTI-HEAD SELF-ATTENTION
The 256-d token is split into 8 heads of 32-d each. Each head computes its own Q/K/V projection and attention softmax, producing 8 independent "views" of tile relationships. Results are concatenated back to 256-d.
params per layer ≈ 4 × (256 × 256) = 262K
FEED-FORWARD
Position-wise MLP: expands each token to 1024-d, applies GELU, projects back to 256-d. Operates independently per token — no cross-token mixing here (attention did that).
params per layer ≈ 2 × (256 × 1024) = 524K
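One way to instantiate the described pre-norm block with stock PyTorch (a sketch; the actual MahjongTransformer may define its own layer modules):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=256, nhead=8,                # 8 heads x 32-d each
    dim_feedforward=1024,                # position-wise FFN expansion
    dropout=0.0, activation="gelu",
    batch_first=True, norm_first=True,   # pre-norm + residual connections
)
# num_layers deep-copies the layer: identical structure, independent parameters.
encoder = nn.TransformerEncoder(layer, num_layers=4)

x = torch.randn(2, 34, 256)              # (B, tile tokens, d_model)
y = encoder(x)
assert y.shape == x.shape                # shape-preserving over the 34 tokens
```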
7.5 · Parameter Budget
Component
Shape
Params
input_proj
Linear(133 → 256)
34 K
pos_embed
Param(1, 34, 256)
9 K
encoder × 4 layers
attn 262K + FFN 524K + norms per layer
~3.2 M
LayerNorm (final)
Norm(256)
0.5 K
policy_head
Linear(8704 → 109)
949 K
Total
~4.2 M
Note: the S2 self-play run uses d_model=256 plus an auxiliary danger_head; with the older, wider config (d=384, 6 layers) the parameter count reaches the ~10M quoted elsewhere. The structure is identical.
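The budget table can be re-derived by hand (weights plus biases; rounding to the nearest K explains the figures above):

```python
input_proj = 133 * 256 + 256                 # ~34 K
pos_embed = 34 * 256                         # ~9 K
attn = 4 * (256 * 256 + 256)                 # Q/K/V/out projections, ~262 K
ffn = 256 * 1024 + 1024 + 1024 * 256 + 256   # expand + project back, ~524 K
norms = 2 * 2 * 256                          # two LayerNorms (scale + shift)
encoder = 4 * (attn + ffn + norms)           # 4 stacked layers, ~3.2 M
final_norm = 2 * 256                         # ~0.5 K
policy_head = 34 * 256 * 109 + 109           # Linear(8704 -> 109), ~949 K

total = input_proj + pos_embed + encoder + final_norm + policy_head
assert 4.1e6 < total < 4.3e6                 # ~4.2 M, matching the table
```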
Architecture choice rationale: three architectures were tested at similar parameter scale: MLP, CNN, Transformer. Only the Transformer reached expert-parity. Self-attention's ability to capture any tile-to-tile relationship (cross-suit globality + long-range defense reasoning) is the key unlock for Mahjong decision-making — and it is the single most important design lever behind the S1-best milestone.
08 · Training Infrastructure
256 Parallel Environments · 3,000 Steps/Second
Go Engine
256 envs
shared-memory gRPC, complete 台灣 16-tile rules
⇄
Python PPO Trainer
PyTorch + AMP
opponent pool, rollout, update loop
→
Live Deployment
pongpong-online
Web + WebSocket, 2,814 human matches
3,000
Steps / Second
Measured mean on single RTX 4070 Ti
202M+
Total Training Steps
S2 self-play accumulated (ongoing)
2,814
Real Human Matches
40-day live deployment, 10 users
Training pipeline stages: Behavioral Cloning (BC) from expert traces → S1 self-play (curriculum against growing opponent pool) → S2 self-play (critic-free, leave-one-out baseline). Full pipeline detailed in the Incident-trajectory deck.
Pongpong system visualization · tile assets from pongpong-online/web/src/assets/tiles/
Channel layout verified against pkg/mahjong/observe.go (total = 69 + 4×16 = 133).
Companion deck to ppt_reference_en.html (the Incident narrative).