# original source for character list: http://www.gov.cn/gzdt/att/att/site1/20130819/tygfhzb.pdf (通用规范汉字表, Table of General Standard Chinese Characters)
# text version taken from http://xh.5156edu.com/page/z6211m4474j19255.html with manual fixes between chars 3649-3668; this was made into table.txt

# original source for pronunciation data: ftp://ftp.cuhk.hk/pub/chinese/ifcss/software/data/Uni2Pinyin.gz
# converted with: for lines in {1..6500..100}; do echo $lines-$((lines+99))/6500 1>&2; for code in $(unicode --max 100 --brief "$(sed -n "$lines,$((lines+99))p" < table.txt | tr -d '\n')" | g 2 | cut -d+ -f2); do pinyin="$(grep ^$code Uni2Pinyin)"; if [ $? -ne 0 ]; then echo $'\t?'; else echo "$pinyin"; fi ;done | cut -d$'\t' -f2-; done > pinyin-table.txt

# manual fixes for pronunciation data:

# tone change rules (变调, tone sandhi)
一,+yi4,+yi2
不,+bu2,+bu5

# common tone change, e.g. in 想一想, 试一试
一,+yi5

# common pronunciation was missing
么,+me5
几,+ji3
识,+shi2,shi4
传,+chuan2
陆,+lu4
鲜,+xian3
强,+jiang4
迹,+ji4
绩,+ji4
还,+huan2
脏,+zang1
盖,+gai4
仔,+zai3
构,+gou4

# uncommon characters, pronunciations sourced from https://zidian.911cha.com/ and cross-referenced with zh.wiktionary.org
㧐,?=>song3
䏝,?=>chun2,zhuan3
㤘,?=>chu4,cu4,zhou4
𠳐,?=>bang1
䓖,?=>qiong2
𥻗,?=>cha2
㸆,?=>kao4

# uncommon character, the previous source doesn't have it, sourced from https://baike.baidu.com/
䦃,?=>zhuo1

# style of writing pinyin changed
嗯,eng4:eng2:eng3 (was ng, made eng)

# fixed mistaken pronunciation
触,hong2=>chu4
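
# The lookup step of the conversion command above (find a code point in
# Uni2Pinyin, print "?" when it is missing) can be sketched as a small
# shell function. This is only a sketch: the name lookup_pinyin is
# hypothetical, and it assumes the Uni2Pinyin layout implied by the
# command (hex code point, a tab, then tab-separated readings). Unlike
# `grep ^$code`, it compares the whole first field, so a code that is
# merely a prefix of a longer one does not match.

```shell
# lookup_pinyin CODE FILE  (hypothetical helper, not part of the original workflow)
# Print the readings for hex code point CODE from a Uni2Pinyin-style FILE.
# Assumed layout per line: CODE<TAB>reading<TAB>reading...
# Prints "?" when CODE has no entry, mirroring the "?" placeholder above.
lookup_pinyin() {
    local code="$1" file="$2"
    awk -F'\t' -v code="$code" '
        toupper($1) == toupper(code) {
            sub(/^[^\t]*\t/, "")   # drop the code column, keep the readings
            print
            found = 1
            exit
        }
        END { if (!found) print "?" }
    ' "$file"
}
```

# Usage, assuming Uni2Pinyin has been unpacked into the current directory:
#   lookup_pinyin 4E00 Uni2Pinyin    # readings listed for 一 (U+4E00), or "?"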