Kanji database 漢字データベース

The new web-accessible database of the 2136 Japanese Jōyō kanji

Produced in 2015 by the following four scholars:

Katsuo Tamaoka - Nagoya University, Japan
Shogo Makioka - Osaka Prefecture University, Japan
Sander Sanders - Kumulus Centre, Netherlands
Rinus G. Verdonschot - Waseda University, Japan

Background

In 1981, the Japanese government established a standardized list of 1945 basic Japanese kanji, titled the Jōyō Kanji-hyō (the list of commonly used kanji). The Jōyō Kanji-hyō has been used to standardize Japanese printed texts, including newspapers, magazines and educational materials. Two decades later, Tamaoka, Kirsner, Yanase, Miyaoka and Kawakami (2002) produced the first web-accessible database containing these 1945 standardized kanji. In 2004, using the Japanese lexical database of Amano and Kondo (1999, 2000), Tamaoka and Makioka calculated additional information (e.g. word frequencies based on the Asahi Newspaper from 1985 to 1998) and produced the fourth edition of the web-accessible kanji database, which added several mathematical indexes such as entropy, redundancy and symmetry. In 2010, however, the Japanese government revised the official Jōyō kanji list. It now contains 2136 kanji, which are to serve as the basis for official communication in Japanese. Details of the new Jōyō kanji are available in Japanese on the website of the Japanese Agency for Cultural Affairs: http://www.bunka.go.jp/kokugo_nihongo/pdf/jouyoukanjihyou_h22.pdf.
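The exact formulas behind the entropy and redundancy indexes mentioned above are not given on this page. As a rough illustration only, the sketch below assumes the standard Shannon definitions, applied to the frequencies of the items (for example, compound words) associated with one kanji. The function name and the example counts are hypothetical.

```python
import math

def entropy_and_redundancy(frequencies):
    """Return (entropy in bits, redundancy) for a list of non-negative counts.

    Assumes the standard Shannon definitions:
      entropy    H = -sum(p * log2(p))
      redundancy R = 1 - H / log2(n), where n is the number of non-zero items
    """
    counts = [f for f in frequencies if f > 0]
    total = sum(counts)
    if total == 0 or len(counts) < 2:
        # With zero or one item there is no uncertainty; redundancy is
        # undefined in that case, and this sketch simply returns 0.0.
        return 0.0, 0.0
    probabilities = [f / total for f in counts]
    entropy = -sum(p * math.log2(p) for p in probabilities)
    max_entropy = math.log2(len(counts))
    redundancy = 1.0 - entropy / max_entropy
    return entropy, redundancy

# Hypothetical token counts of four compound words that share one kanji.
example_counts = [50, 25, 15, 10]
entropy, redundancy = entropy_and_redundancy(example_counts)
print(f"entropy = {entropy:.3f} bits, redundancy = {redundancy:.3f}")
```

Under these assumptions, evenly distributed counts give maximal entropy and zero redundancy, while a distribution dominated by one item gives low entropy and redundancy close to 1.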

Present Database

Kanji lists play an important role in Japanese psycholinguistic research (e.g. Verdonschot et al., 2013). To make the detailed properties of the kanji in the new list available to researchers in psychology and linguistics, we have developed a new web-accessible kanji database based on an updated corpus (i.e. the Mainichi Newspaper from 2000 to 2010). The new kanji database also includes a wide range of important properties, such as kanji frequency, On- and Kun-reading frequencies, On-reading ratio, kanji productivity of two-kanji compounds, symmetry of kanji productivity, entropy, number of meanings, etc. This easy-to-use website (http://www.kanjidatabase.com/) was developed specifically to give effortless access to the database and allows for:

  1. Easy selection of kanji from the database according to user-defined criteria, and
  2. Pasting kanji (or even complete texts) and looking up specified properties of the pasted kanji in the database.
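The second function can be pictured with a small offline sketch. It does not reproduce the website's scripts: the table below is a hypothetical three-entry stand-in for the database, the `on_readings` column name is an assumption, and the character ranges in the pattern are one common approximation of "kanji".

```python
import re

# Hypothetical miniature stand-in for the database (the real one has 2136
# entries and many more properties per kanji).
KANJI_TABLE = {
    "日": {"on_readings": ["ニチ", "ジツ"]},
    "本": {"on_readings": ["ホン"]},
    "語": {"on_readings": ["ゴ"]},
}

# CJK Unified Ideographs plus Extensions A and B; kana and punctuation
# fall outside these ranges and are skipped.
KANJI_PATTERN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff\U00020000-\U0002a6df]")

def lookup_pasted_text(text, table):
    """Return (found, missing): properties of each distinct kanji in `text`
    that is present in `table`, and the distinct kanji that are not."""
    found = {}
    missing = []
    for kanji in KANJI_PATTERN.findall(text):
        if kanji in found or kanji in missing:
            continue  # report each kanji once, in order of first appearance
        if kanji in table:
            found[kanji] = table[kanji]
        else:
            missing.append(kanji)
    return found, missing

found, missing = lookup_pasted_text("日本語の新聞を読む", KANJI_TABLE)
print(list(found))   # kanji present in the stand-in table
print(missing)       # kanji in the text that the stand-in table lacks
```

Keeping the kanji in order of first appearance makes the result easy to compare with the pasted text.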

Resource

To create the current database of the 2136 Jōyō kanji, we used 11 years (2000 to 2010) of the all-Japanese version of the Mainichi Newspaper. The morphological parsing program MeCab 0.991 counted 477,264 distinct morphological units (type frequency) with a total token frequency of 299,695,840 in this newspaper corpus. Excluding proper nouns, the counts were 368,841 for type frequency and 282,816,611 for token frequency. Four kanji symbols are not included in Shift Japanese Industrial Standards (Shift-JIS) (i.e.,「𠮟」「塡」「剝」「頰」); these were replaced with the corresponding variant forms that are included in Shift-JIS (i.e.,「叱」「填」「剥」「頬」). Using the total token frequency of 282,816,611 morphological units, the present kanji database calculates the frequency of each of the 2136 commonly used Jōyō kanji.
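The replacement of the four kanji and the frequency count can be sketched as follows. Only the four character pairs come from the description above. The counting rule (each occurrence of a kanji in a morphological unit contributes that unit's token count), the function name, and the example counts are assumptions made for illustration, not the database's actual procedure.

```python
from collections import Counter

# The four replacements described above: form absent from Shift-JIS -> form
# included in Shift-JIS.
SHIFT_JIS_VARIANTS = str.maketrans({"𠮟": "叱", "塡": "填", "剝": "剥", "頰": "頬"})

def kanji_frequencies(morpheme_counts, target_kanji):
    """Sum token counts per kanji over a {morphological unit: token count} map.

    Assumption: every occurrence of a target kanji inside a unit adds that
    unit's token count to the kanji's total.
    """
    totals = Counter()
    for morpheme, token_count in morpheme_counts.items():
        normalized = morpheme.translate(SHIFT_JIS_VARIANTS)
        for character in normalized:
            if character in target_kanji:
                totals[character] += token_count
    return totals

# Hypothetical token counts, not taken from the Mainichi corpus.
example_counts = {"頰": 3, "頬": 2, "剝がす": 4}
totals = kanji_frequencies(example_counts, target_kanji={"頬", "剥"})
print(totals["頬"], totals["剥"])
```

Normalizing before counting means both written forms of a kanji are pooled under the Shift-JIS form, so the two variants do not split one kanji's frequency.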

Citation

Please cite the use of this database/website as: Tamaoka, K., Makioka, S., Sanders, S. & Verdonschot, R.G. (accepted). www.kanjidatabase.com: A new interactive online database for psychological and linguistic research on Japanese kanji and their compound words. Psychological Research.

Contact

If you have any questions regarding the new database of the 2136 Japanese Jōyō kanji, please contact: Katsuo Tamaoka (Nagoya University, Japan) at ktamaoka (at) lang.nagoya-u.ac.jp or Rinus G. Verdonschot (Waseda University, Japan) at rinusverdonschot (at) gmail.com.

References:

Version: February 15, 2016 10:49 scripts © S. Sanders