Find common Chinese characters using pinyin.

Click the image to try it now:

Hanzi search tool showing common Chinese characters pronounced "shi"

What

It’s a tool that lets you enter pinyin, optionally with a tone number, and shows only the commonly used Chinese characters with that pronunciation.

It lets you switch between the top 3500 characters (frequently used) and the top 6500 (relatively common).

Some other features include showing or hiding all pinyin pronunciations for the characters, and showing or searching by the index of the character from the source Table of General Standard Chinese Characters.

It currently only deals with simplified characters, because that’s what the source deals with and what I’m trying to learn.

Note that this is not a comprehensive Chinese learning tool. To learn Chinese you need to learn words, not just characters. Characters are just one layer in the Chinese language, and even full understanding of all Chinese characters would not mean you’ve learned the language.

Why I made this

While learning Chinese, I wanted to get an understanding of which Chinese characters are actually in common use. If you look in a dictionary for a list of Chinese characters, you will get a bewildering number of them, because dictionaries try to be comprehensive. Fortunately, most of these are only in historic use.

I wanted to clear away the forest of irrelevant (for my purposes) characters and get a list of the ones actually worth investing time in.

I made this to get an overview and easily be able to answer questions like “how many common characters share this pronunciation?”.

I also wanted it to work without having to wait for a web page to load for each query. I made it work using a fully offline search (you download the data when loading the page itself).

Sources and data

During my search I found the official Table of General Standard Chinese Characters, published by the Chinese government.

Sadly, this list is a rasterized PDF which does not even contain the pronunciation of these characters.

Still, it seemed like a good authoritative reference of commonly used characters. I started searching around and found a text version of it on this Chinese webpage. While checking the data I found it had some mistakes around characters 3649 to 3668, which I had to fix manually.

To combine this with pronunciation data I used the Unicode Pinyin Table. I’ve added numerous pronunciations manually where I found them missing from that source, and I’ve logged my manual edits in edits-to-hanzi-pinyin.txt. Note that since I’m only a basic Chinese speaker myself, there are likely more mistakes. I’m open to using a more reliable source for pronunciation data if one is available.

Using this data I created the hanzi-pinyin.6500.csv and hanzi-pinyin.3500.csv files, which are the source of the Hanzi search tool.
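The merge step described above can be sketched roughly as follows. This is only an illustration, not the process actually used: the function name merge_readings, the in-memory dictionaries, and the row format "hanzi,reading reading ..." are all assumptions, since the exact layout of the CSV files is not described here.

```python
def merge_readings(characters, pinyin_table, manual_edits=None):
    """Join an ordered character list with a table of readings.

    All names and formats here are hypothetical:
      characters   -- list of hanzi in source-table order
      pinyin_table -- dict mapping hanzi to a list of readings like "shi2"
      manual_edits -- dict of extra readings added by hand
    Returns (rows, missing): CSV-style rows, and hanzi with no reading at all.
    """
    manual_edits = manual_edits or {}
    rows, missing = [], []
    for hanzi in characters:
        readings = list(pinyin_table.get(hanzi, []))
        # Manual edits only add readings the table lacks.
        for extra in manual_edits.get(hanzi, []):
            if extra not in readings:
                readings.append(extra)
        if readings:
            rows.append(f"{hanzi},{' '.join(readings)}")
        else:
            # Flag it so a reading can be looked up and added by hand.
            missing.append(hanzi)
    return rows, missing
```

Collecting the characters without any reading in a separate list makes it easy to see which ones still need a manual entry.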

How I made this

Originally I just used grep to search the hanzi-pinyin CSV files, but it became cumbersome to write regular expressions when I just wanted to think in pinyin.

I created the cnhz Python script for myself, originally as a simple wrapper to build the grep regexes for me so I could just write pinyin. I then added more features around it.
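To give an idea of what "building the regex from pinyin" can look like, here is a minimal sketch. It is not the actual cnhz code. It assumes a hypothetical row format of "hanzi,reading reading ..." where each reading ends in a tone digit from 1 to 5, and the function names are made up for this example.

```python
import re

def pinyin_to_regex(query):
    """Turn a query like "shi" or "shi2" into a regex for one CSV row.

    Assumed (hypothetical) row format: "<hanzi>,<reading> <reading> ...".
    """
    m = re.fullmatch(r"([a-zü]+)([1-5]?)", query.strip().lower())
    if m is None:
        raise ValueError(f"not a pinyin syllable: {query!r}")
    syllable, tone = m.groups()
    # Without a tone number, accept any tone digit.
    tone_part = tone if tone else "[1-5]"
    # The reading must be a whole field: it starts right after the comma
    # or a space, and is followed by a space or the end of the row.
    return re.compile(
        rf"^[^,]+,(?:.* )?{re.escape(syllable)}{tone_part}(?: |$)"
    )

def search(rows, query):
    """Return the hanzi of every row that has a matching reading."""
    pattern = pinyin_to_regex(query)
    return [row.split(",", 1)[0] for row in rows if pattern.search(row)]
```

With toy rows such as "是,shi4" and "石,dan4 shi2", search(rows, "shi") would return both characters, while search(rows, "shi4") would return only the first.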

As it got more useful I thought about how I could make it more accessible for other Chinese learners too. Most of them do not use Linux or have Python installed. I thought of rewriting it in Go to make it a single executable, but found the command line format would be awkward for most people. Also, Windows cmd.exe does not display Chinese characters well by default.

I eventually decided to turn it into a webpage. The CSV files were small enough that the entire database could be loaded into the browser’s memory, so the search runs completely client side, which makes the experience much smoother than loading an online dictionary entry.
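The idea of keeping the whole database in memory can be sketched like this. The real page does this in JavaScript, so the Python below only illustrates the approach. The names build_index and lookup are hypothetical, and the row format "hanzi,reading reading ..." is an assumption. The one grounded detail is that a row's position in the file is its index, so the top 3500 list is simply the first 3500 rows.

```python
from collections import defaultdict

def build_index(rows):
    """Map each reading ("shi4") and toneless syllable ("shi") to its hanzi.

    Assumes the hypothetical row format "<hanzi>,<reading> <reading> ...".
    The 1-based row number is stored as the rank of the character.
    """
    index = defaultdict(list)
    for rank, row in enumerate(rows, start=1):
        hanzi, readings = row.split(",", 1)
        seen = set()  # avoid listing a hanzi twice under the same key
        for reading in readings.split():
            for key in (reading, reading.rstrip("12345")):
                if key not in seen:
                    seen.add(key)
                    index[key].append((rank, hanzi))
    return index

def lookup(index, query, limit=3500):
    """Return hanzi for a query, restricted to the first `limit` rows."""
    matches = index.get(query.strip().lower(), [])
    return [hanzi for rank, hanzi in matches if rank <= limit]
```

Because the index is built once when the data is loaded, every later query is a plain dictionary lookup with no network round trip.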

Also, because I can easily host it on my own web page, which I pay for anyway, I don’t have any need to add advertisements or anything to make the experience worse.

Furthermore, the page can easily be downloaded and used offline by any learner in case my site goes down or for when you don’t have internet.

I tried to improve offline support by making the site a progressive web app. However, while adding a manifest was easy, I found that to add a service worker (which allows offline caching) I would have to move all assets, including a copy of my main CSS file, to /cn/hanzi/. Not having time to figure this out at the moment, I decided not to add a service worker for now.

By the way, the hanzi_list.js file was generated from the CSV sources using an awk script:

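# Number each CSV row and wrap the rows in JavaScript string constants.
< ../files/hanzi-pinyin-full.csv \
    awk \
    '# Open the string that holds the top 3500 characters.
     NR==1 { printf "const hanzi_3500 = \"" }
     # After row 3500, close it and start the string for the rest.
     NR==3501 { printf "\"; const hanzi_6500 = hanzi_3500 + \"" }
     # Emit "index,row" plus a literal \n and a JS line continuation.
     { print NR","$0"\\n\\" }
     END { print "\";\nconst hanzi_6500_split = hanzi_6500.split(\"\\n\");" }' \
    > ../js/hanzi_list.js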
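For readers who do not use awk, here is a rough Python equivalent of that script. It is a sketch written for this explanation, not something used to build the site. The function name render_hanzi_list_js and the split_at parameter are made up for the example.

```python
def render_hanzi_list_js(csv_lines, split_at=3500):
    """Build the text of hanzi_list.js from CSV rows (without newlines).

    Every row becomes "<index>,<row>" followed by a literal \\n and a
    backslash line continuation inside a JavaScript string literal.
    """
    out = []
    for n, line in enumerate(csv_lines, start=1):
        if n == 1:
            # Open the string holding the first `split_at` rows.
            prefix = 'const hanzi_3500 = "'
        elif n == split_at + 1:
            # Close it and start the string that extends it.
            prefix = '"; const hanzi_6500 = hanzi_3500 + "'
        else:
            prefix = ""
        out.append(f"{prefix}{n},{line}\\n\\\n")
    # Close the last string and split the full list into rows.
    out.append('";\nconst hanzi_6500_split = hanzi_6500.split("\\n");\n')
    return "".join(out)
```

Like the awk version, this only defines hanzi_6500 if the input has more than split_at rows.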

I hope you find my Chinese hanzi tool useful. If you have any questions or want to contact me, see my details below.