
Chinese Private Use Area code points

Mingye Wang edited this page Sep 15, 2016 · 10 revisions

Many decoders for legacy Chinese encodings produce Private Use Area (PUA) code points for certain characters. These assignments cause problems because several incompatible PUA agreements exist. Almost all of these characters now have formal Unicode assignments, so the PUA is no longer needed to express them, and for consistency it is generally desirable to normalize them to their formal code points. ChineseUtils.normalize warns about PUA code points found in strings.
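
As a rough illustration of what detecting such code points involves, the sketch below reports BMP Private Use Area characters (U+E000–U+F8FF) found in a string. The helper name `find_pua` is made up for this example; it is not part of ChineseUtils and says nothing about how ChineseUtils.normalize is implemented.

```python
def find_pua(text):
    """Return (index, 'U+XXXX') pairs for BMP Private Use Area characters."""
    return [(i, 'U+%04X' % ord(ch))
            for i, ch in enumerate(text)
            if 0xE000 <= ord(ch) <= 0xF8FF]


# One PUA character between two ASCII letters.
print(find_pua('a\ue816b'))  # prints [(1, 'U+E816')]
```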

This article contains a Python 3 script that replaces such code points with formal ones.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
'''Normalizes PUA code points generated by decoders. Released under CC0.'''
import argparse
import os
import sys

conv = {}

GB 18030 & GBK

GB 18030 & GBK used a total of 95 PUA code points, and even the latest (2005) version of GB 18030 contains 24 PUA code points.

The PUA code point ranges are U+E7C7–U+E7C8, U+E7E7–U+E7F3, and U+E816–U+E864.

# CC0,
_gbk_table_bmp = ((
    '\ue85b\ue85c\ue85d\ue85e\ue85f\ue860\ue861\ue862\ue863\ue864'), (
#   '︐︒︑︓︔︕︖︗ ︘' # presentation forms
_gbk_table = (('\ue816\ue817\ue818\ue831\ue83b\ue855'), ('𠂇𠂉𠃌𡗗𢦏𤇾'))
conv['gbk_bmp'] = str.maketrans(*_gbk_table_bmp)
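conv['gbk'] = dict(conv['gbk_bmp'])
# The full table also covers characters assigned outside the BMP.
conv['gbk'].update(str.maketrans(*_gbk_table))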
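
To see how such a table is applied, the standalone sketch below builds a translation table from the six non-BMP pairs in `_gbk_table` and runs it over a string with `str.translate`. The names `pua`, `formal`, `table`, and `sample` exist only in this example.

```python
# The six GBK PUA code points whose formal assignments lie outside the BMP,
# copied from _gbk_table in the script.
pua = '\ue816\ue817\ue818\ue831\ue83b\ue855'
formal = '𠂇𠂉𠃌𡗗𢦏𤇾'

# str.maketrans pairs the two strings up character by character.
table = str.maketrans(pua, formal)

sample = 'x\ue816y\ue855z'
normalized = sample.translate(table)
print(normalized)
print([hex(ord(ch)) for ch in normalized])
```

Characters that are not keys in the table pass through unchanged, so the same call is safe on text that contains no PUA code points at all.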


Big5 & HKSCS

The HKSCS extension of Big5 uses EUDA areas in Big5, which are mapped to PUA code points by naive Big5 decoders. However, as the characters in HKSCS are well-defined, publishers of the HKSCS have provided separate mappings for the extended parts where available. PUA-free mappings are available for the 2004 and 2008 versions of the standard.

PUA code points used for the Big5 EUDA run from U+E000 to U+F848. Big5-HKSCS does not assign the EUDA rows 81 40–86 FE, so the corresponding 6 × 157 = 942 code points (U+EEB8–U+F265 under the mapping below) are unused.
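
The arithmetic behind these figures can be checked with a short standalone sketch. It assumes the block order and lead-byte ranges used by `big5_euda_pua` in this article; `CELLS_PER_ROW` and `BLOCKS` are names invented for the example.

```python
# Each Big5 lead byte covers 157 trail bytes: 0x40-0x7E and 0xA1-0xFE.
CELLS_PER_ROW = (0x7E - 0x40 + 1) + (0xFE - 0xA1 + 1)

# EUDA blocks in the order they are laid out in the PUA, as
# (first lead byte, last lead byte, cells skipped in the first row).
BLOCKS = [
    (0xFA, 0xFE, 0),
    (0x8E, 0xA0, 0),
    (0x81, 0x8D, 0),
    (0xC6, 0xC8, 63),  # C6 starts at trail byte 0xA1, skipping 0x40-0x7E
]

start = 0xE000
for first, last, skipped in BLOCKS:
    size = (last - first + 1) * CELLS_PER_ROW - skipped
    print('%02X-%02X: U+%04X-U+%04X (%d code points)'
          % (first, last, start, start + size - 1, size))
    start += size

last_code_point = start - 1
print('last: U+%04X' % last_code_point)  # prints last: U+F848
```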

def big5_euda_pua(byteseq):
    '''Maps a Big5 EUDA code (four hex digits) to its PUA code point.'''
    H = int(byteseq[0:2], 16)  # lead byte
    L = int(byteseq[2:4], 16)  # trail byte
    if L < 0x40 or (L > 0x7e and L < 0xa1) or L == 0xff:
        raise ValueError(byteseq)  # Not valid Big5

    def _big5_euda_pua_row(L):
        # Position within a 157-cell row: 0x40-0x7E, then 0xA1-0xFE.
        return (L - 0x40) if (L < 0x80) else (L - 0x62)

    if H >= 0x81 and H <= 0x8D:
        return 0xeeb8 + (157 * (H - 0x81)) + _big5_euda_pua_row(L)
    elif H >= 0x8E and H <= 0xA0:
        return 0xe311 + (157 * (H - 0x8e)) + _big5_euda_pua_row(L)
    elif (H >= 0xC7 or (H == 0xC6 and L >= 0xA1)) and H <= 0xC8:
        return 0xf672 + (157 * (H - 0xc6)) + _big5_euda_pua_row(L)
    elif H >= 0xFA and H <= 0xFE:
        return 0xe000 + (157 * (H - 0xfa)) + _big5_euda_pua_row(L)
    else:
        return None  # DummyVal

The following mapping generation depends on hkscs-2008-big5-iso.txt.

# The table seems too long to be included here.
# You can dump the tables out to replace this chunk of code.
conv['hkscs_bmp'] = {}
conv['hkscs'] = {}
try:
    with open('hkscs-2008-big5-iso.txt') as hkscs_chart:
        import re
        for entry in hkscs_chart:
            try:
                c_big5, _, _, c_uni = entry.split()
                _ = int(c_big5, 16)
            except ValueError:
                continue  # Header, blank, or otherwise non-data line

            pua = big5_euda_pua(c_big5)
            if pua is None:
                continue  # Not in a EUDA range
            if c_uni[0] == '<':
                # Code point sequence, assumed to look like <00CA,0304>
                seq = ''.join(map(lambda hex: chr(int(hex, 16)),
                                  re.findall('[0-9A-Fa-f]+', c_uni)))
                conv['hkscs_bmp'][pua] = seq
                conv['hkscs'][pua] = seq
            else:
                cp_uni = int(c_uni, 16)
                if cp_uni < 0xFFFF:
                    conv['hkscs_bmp'][pua] = cp_uni
                conv['hkscs'][pua] = cp_uni
except Exception:
    import traceback
    print("Failed to load HKSCS:")
    traceback.print_exc()
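
The loader distinguishes two shapes of Unicode column: a single hexadecimal code point, and a value starting with `<`. The standalone sketch below shows one way to turn either shape into a string. It assumes the bracketed form is a comma-separated list of hex code points. The function name `parse_uni_field` and the sample values are illustrative only and are not quoted from hkscs-2008-big5-iso.txt.

```python
import re


def parse_uni_field(field):
    """Turn one Unicode column value into a str.

    Accepts a single hex code point ("20021") or a bracketed
    sequence of code points ("<00CA,0304>").
    """
    if field.startswith('<'):
        hexes = re.findall(r'[0-9A-Fa-f]+', field)
    else:
        hexes = [field]
    return ''.join(chr(int(h, 16)) for h in hexes)


# ascii() makes the resulting code points visible in the output.
print(ascii(parse_uni_field('<00CA,0304>')))
print(ascii(parse_uni_field('20021')))
```

A `str.translate` table accepts either an integer code point or a string as a value, so a multi-character result like the first one can be stored directly.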



GCCS

GCCS, the precursor of HKSCS, included a few characters that were later unified with others in HKSCS. A compatibility mapping to Big5-HKSCS can be used to map the PUA code points generated by the Big5 EUDA mapping.

# Omitted, rarely needed.
conv['gccs_hkscs'] should be a str.translate dict that maps obsolete GCCS
code points (as EUDA-PUA) to unified HKSCS code points (non-PUA) when available.

The BMP version of this mapping should provide the PUA code point of the
corresponding HKSCS character if the actual code point is found outside of BMP.


# aliases
conv['zh'] = dict(conv['hkscs'])
conv['bmp'] = dict(conv['hkscs_bmp'])

parser = argparse.ArgumentParser(description='Normalize PUA code points '
                                             'for Chinese encodings.')
parser.add_argument('--conv', nargs='?', type=str, default='zh',
                    help=('mappings to use, sorted by fallback priority, '
                          'separated by commas (","). Defaults to "zh". '
                          'Available mappings: ' + ', '.join(conv.keys())))
parser.add_argument('--inplace', action='store_true', default=False,
                    help='perform in-place conversion, suppress stdout')
parser.add_argument('--isuffix', type=str, default='', nargs='?',
                    help='suffix for backup file in in-place mode')
parser.add_argument('files', nargs='*', type=str)
args = parser.parse_args()

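realconv = {}
# Apply lower-priority mappings first so that earlier names override them.
for k in reversed(args.conv.split(',')):
    realconv.update(conv[k])


def cat_conv(f):
    '''Writes the converted contents of f to stdout.'''
    for ln in f:
        sys.stdout.write(ln.translate(realconv))


if args.files:
    if args.inplace:
        for file in args.files:
            f = open(file, 'r')
            s = f.read().translate(realconv)
            f.close()

            if args.isuffix:
                os.rename(file, file + args.isuffix)  # keep a backup

            f = open(file, 'w')
            f.write(s)
            f.close()
    else:
        for file in args.files:
            with open(file) as f:
                cat_conv(f)
else:
    cat_conv(sys.stdin)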
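
The `--conv` option lists mappings "sorted by fallback priority", and the script walks that list in reverse. The standalone sketch below shows why: updating a dict from the lowest-priority table first lets the first-named table win wherever two tables disagree. The tables and the helper `merge_by_priority` are invented for the example.

```python
# Two made-up tables that disagree on U+E000.
tables = {
    'first': {0xE000: 'A'},
    'second': {0xE000: 'B', 0xE001: 'C'},
}


def merge_by_priority(names):
    """Merge tables so that names listed earlier override later ones."""
    merged = {}
    for name in reversed(names.split(',')):
        merged.update(tables[name])
    return merged


merged = merge_by_priority('first,second')
print('\ue000\ue001'.translate(merged))  # prints AC
```

U+E000 takes its value from `first`, while U+E001, which only `second` defines, falls back to `second`.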

This page was previously named "Chinese Private Use Area codepoints".
