NAME
    Unicode::Japanese - Japanese Character Encoding Handler

SYNOPSIS
    use Unicode::Japanese;

    # convert utf8 -> sjis
    print Unicode::Japanese->new($str)->sjis;

    # convert sjis -> utf8
    print Unicode::Japanese->new($str,'sjis')->get;

    # convert sjis (imode_EMOJI) -> utf8
    print Unicode::Japanese->new($str,'sjis-imode')->get;

    # convert ZENKAKU (utf8) -> HANKAKU (utf8)
    print Unicode::Japanese->new($str)->z2h->get;

DESCRIPTION
    Module for conversion among Japanese character encodings.

FEATURES
    *   The instance stores its internal string in UTF-8.

    *   Supports both XS and Non-XS. Use XS for high performance, or
        Non-XS for ease of use (installation only requires copying
        Japanese.pm).

    *   Supports conversion between ZENKAKU and HANKAKU.

    *   Safely handles the "EMOJI" of mobile phones (DoCoMo i-mode, ASTEL
        dot-i and J-PHONE J-Sky) by mapping them onto the Unicode Private
        Use Area.

    *   Supports mutual conversion of EMOJI that show the same image
        between the different mobile phone standards.

    *   Treats Shift_JIS (SJIS) as MS-CP932. (Shift_JIS on MS-Windows
        (MS-SJIS/MS-CP932) differs from generic Shift_JIS encodings.)

    *   When converting Unicode to SJIS (and EUC-JP/JIS), characters that
        cannot be converted to SJIS (except "EMOJI") are escaped in
        "&#dddd;" format. "EMOJI" in the Unicode Private Use Area are
        replaced with '?'. When converting from Unicode to the SJIS of
        mobile phones, any character not supported by their standard is
        replaced with '?'.

METHODS
    $s = Unicode::Japanese->new($str [, $icode [, $encode]])
        Creates a new instance of Unicode::Japanese.

        If arguments are specified, they are passed through to the set
        method.

    $s->set($str [, $icode [, $encode]])
        $str: string
        $icode: character encoding, may be omitted (default = 'utf8')
        $encode: ASCII encoding, may be omitted.

        Sets a string in the instance. If '$icode' is omitted, the string
        is treated as UTF-8.

        To specify an encoding, choose from the following: 'jis', 'sjis',
        'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'utf16-be', 'utf16-le',
        'utf32', 'utf32-be', 'utf32-le', 'ascii', 'binary', 'sjis-imode',
        'sjis-doti', 'sjis-jsky'.

        '&#dddd;' will be converted to "EMOJI" when 'sjis-imode' or
        'sjis-doti' is specified.

        For automatic encoding detection, you MUST specify 'auto' so that
        the getcode() method is called automatically.

        For ASCII encoding, only 'base64' may be specified. With it, the
        string is base64-decoded before it is stored.

        To decode binary, specify 'binary' as the encoding.

    $str = $s->get
        $str: string (UTF-8)

        Gets the string in UTF-8.

    $code = $s->getcode($str)
        $str: string
        $code: character encoding name

        Detects the character encoding of *$str*.

        Notice: This method detects the encoding of *$str*, NOT the
        encoding of the string held in the instance.

        Character encodings are distinguished by the following algorithm:

        (In case of PurePerl)

        1   If a UTF-32 BOM is found, the encoding is utf32.

        2   If a UTF-16 BOM is found, the encoding is utf16.

        3   If it is valid UTF-32BE, the encoding is utf32-be.

        4   If it is valid UTF-32LE, the encoding is utf32-le.

        5   If it contains no non-ASCII characters, the encoding is ascii.
            (Control codes other than escape sequences are treated as
            ASCII.)

        6   If it includes ISO-2022-JP (JIS) escape sequences, the
            encoding is jis.

        7   If it includes "J-PHONE EMOJI", the encoding is sjis-jsky.

        8   If it is valid EUC-JP, the encoding is euc.

        9   If it is valid SJIS, the encoding is sjis.

        10  If it is valid SJIS with "EMOJI" of i-mode, the encoding is
            sjis-imode.

        11  If it is valid SJIS with "EMOJI" of dot-i, the encoding is
            sjis-doti.

        12  If it is valid UTF-8, the encoding is utf8.

        13  If none of the above is true, the encoding is unknown.

        (In case of XS)

        1   If a UTF-32 BOM is found, the encoding is utf32.

        2   If a UTF-16 BOM is found, the encoding is utf16.

        3   The string is checked by state transition against each of the
            encodings listed below.

            ascii / euc-jp / sjis / jis / utf8 / utf32-be / utf32-le /
            sjis-jsky / sjis-imode / sjis-doti

        4   If more than one encoding matches, the order below is applied
            for the final determination.

            utf32-be / utf32-le / ascii / jis / euc-jp / sjis / sjis-jsky
            / sjis-imode / sjis-doti / utf8

        5   If none of the above is true, the encoding is unknown.

        Regarding the algorithm, pay attention to the following:

        *   UTF-8 is occasionally detected as SJIS.

        *   UCS2 can NOT be detected automatically.

        *   UTF-16 can be detected only when the string has a BOM.

        *   "EMOJI" can be detected only when they are stored in binary,
            not in "&#dddd;" format. (If they are stored only in
            "&#dddd;" format, getcode() returns an incorrect result and
            the "EMOJI" will be corrupted.)

        Because XS and PurePerl use different algorithms, the detection
        result may differ between them. A string in SJIS that contains
        escape characters is detected as SJIS on PurePerl, but cannot be
        detected as SJIS on XS, because the XS algorithm cannot
        distinguish SJIS from SJIS-Jsky in that case. On XS, strings
        containing escape characters are likewise excluded from detection
        as EUC-JP.

    $str = $s->conv($ocode, $encode)
        $ocode: output character encoding (choose from 'jis', 'sjis',
        'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'binary')
        $encode: ASCII encoding, may be omitted.
        $str: string

        Gets the string converted to *$ocode*.

        For ASCII encoding, only 'base64' may be specified. With it, the
        string is returned encoded in base64.

    $s->tag2bin
        Replaces each "&#dddd;" substring in the string with the binary
        character it denotes.

    $s->z2h
        Converts ZENKAKU to HANKAKU.

    $s->h2z
        Converts HANKAKU to ZENKAKU.

    $s->hira2kata
        Converts HIRAGANA to KATAKANA.

    $s->kata2hira
        Converts KATAKANA to HIRAGANA.

    $str = $s->jis
        $str: string (JIS)

        Gets the string converted to ISO-2022-JP (JIS).

    $str = $s->euc
        $str: string (EUC-JP)

        Gets the string converted to EUC-JP.

    $str = $s->utf8
        $str: string (UTF-8)

        Gets the string converted to UTF-8.

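The "&#dddd;" notation used by tag2bin above, and emitted for characters that cannot be converted to SJIS, is a decimal character reference. The following is only a minimal sketch of that notation in core Perl. It is not the module's implementation; the helper names expand_decimal_refs and escape_non_ascii are hypothetical. The sketch escapes every non-ASCII character, whereas the module is described as escaping only characters that have no SJIS mapping.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Unicode::Japanese): expand decimal
# "&#dddd;" references into the characters they denote.
sub expand_decimal_refs {
    my ($text) = @_;
    $text =~ s/&#(\d+);/chr($1)/ge;
    return $text;
}

# Hypothetical inverse: escape every non-ASCII character as "&#dddd;".
# Assumption: the real module escapes only characters without an SJIS
# mapping; this sketch does not consult any mapping table.
sub escape_non_ascii {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf('&#%d;', ord($1))/ge;
    return $text;
}

my $escaped  = 'A&#12354;B';    # 12354 is U+3042, HIRAGANA LETTER A
my $expanded = expand_decimal_refs($escaped);

# Report code points instead of printing the wide character itself.
printf "%d characters, middle one is U+%04X\n",
    length($expanded), ord(substr($expanded, 1, 1));
print escape_non_ascii($expanded), "\n";
```

The round trip works on Perl character strings; converting the result to a byte encoding is a separate step that this sketch leaves out.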
    $str = $s->ucs2
        $str: string (UCS2)

        Gets the string converted to UCS2.

    $str = $s->ucs4
        $str: string (UCS4)

        Gets the string converted to UCS4.

    $str = $s->utf16
        $str: string (UTF-16)

        Gets the string converted to UTF-16 (big-endian). No BOM is
        added.

    $str = $s->sjis
        $str: string (SJIS)

        Gets the string converted to Shift_JIS (MS-SJIS/MS-CP932).

    $str = $s->sjis_imode
        $str: string (SJIS/imode_EMOJI)

        Gets the string converted to SJIS for i-mode.

    $str = $s->sjis_doti
        $str: string (SJIS/dot-i_EMOJI)

        Gets the string converted to SJIS for dot-i.

    $str = $s->sjis_jsky
        $str: string (SJIS/J-SKY_EMOJI)

        Gets the string converted to SJIS for J-SKY.

    @str = $s->strcut($len)
        $len: number of characters
        @str: strings

        Splits the string into pieces of length *$len*.

    $len = $s->strlen
        $len: `visual width' of the string

        Gets the length of the string. This method is offered as a
        substitute for Perl's built-in length(). ZENKAKU characters are
        counted as having a length of 2, regardless of whether the
        encoding is SJIS or UTF-8.

    $s->join_csv(@values);
        @values: data array

        Converts the array to a string in CSV format, then stores it in
        the instance. A newline ("\n") is appended to the end of the
        string.

    @values = $s->split_csv;
        @values: data array

        Splits the string, treating it as CSV format. Newlines ("\n")
        are removed before the split.

DESCRIPTION OF UNICODE MAPPING
    SJIS
        Mapped as MS-CP932. The mapping table at the following URL is
        used.

        ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

        If a character cannot be mapped from Unicode to SJIS, it is
        converted to "&#dddd;" format. When converting to SJIS for mobile
        phones, any unmapped character is converted to "?" instead.

    EUC-JP/JIS
        Converted to SJIS and then mapped to Unicode. Any non-SJIS
        character in the string will not be mapped correctly.

    DoCoMo i-mode
        The portion of F800 - F9FF that contains "EMOJI" is mapped to
        U+0FF800 - U+0FF9FF.

    ASTEL dot-i
        The portion of F000 - F4FF that contains "EMOJI" is mapped to
        U+0FF000 - U+0FF4FF.

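The i-mode and dot-i ranges given above both differ from their Unicode targets by the same constant, 0xF0000 (F800 to U+0FF800, F000 to U+0FF000). The sketch below only illustrates that arithmetic. It assumes the carrier code is the two-byte SJIS value read as a 16-bit integer, and the helper name emoji_to_unicode is hypothetical, not a function of this module.

```perl
use strict;
use warnings;

# Constant difference between the carrier ranges and the Private Use Area
# ranges quoted in the text (0xF800 + 0xF0000 == 0xFF800).
use constant PUA_OFFSET => 0xF0000;

# Hypothetical helper (not part of Unicode::Japanese): map an i-mode or
# dot-i EMOJI code to its Unicode code point. Returns undef when the code
# lies outside both ranges.
sub emoji_to_unicode {
    my ($code) = @_;
    my $is_imode = $code >= 0xF800 && $code <= 0xF9FF;    # DoCoMo i-mode
    my $is_doti  = $code >= 0xF000 && $code <= 0xF4FF;    # ASTEL dot-i
    return unless $is_imode || $is_doti;
    return $code + PUA_OFFSET;
}

for my $code (0xF89F, 0xF000, 0x8140) {
    my $ucs = emoji_to_unicode($code);
    printf "%04X -> %s\n", $code,
        defined $ucs ? sprintf('U+%06X', $ucs) : 'not an i-mode/dot-i EMOJI';
}
```

The resulting code points fall in Supplementary Private Use Area-A (U+F0000 - U+FFFFD), so they do not collide with assigned characters.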
    J-PHONE J-SKY
        "J-SKY EMOJI" are encoded as follows: the escape sequence "\e\$"
        (\x1b\x24), the first byte, the second byte, and "\x0f". A run of
        consecutive "EMOJI" that share the same first byte may be
        compressed by listing only the second bytes.

        4500 - 47FF is mapped to U+0FFB00 - U+0FFDFF, treating the first
        and second bytes together as one EMOJI character.

        Unicode::Japanese compresses "J-SKY_EMOJI" automatically when the
        first bytes of a sequence of "EMOJI" are identical.

PurePerl mode
    use Unicode::Japanese qw(PurePerl);

    If the module is loaded with the 'PurePerl' keyword, it works in
    Non-XS mode.

BUGS
    *   EUC-JP and JIS strings cannot be converted correctly when they
        include non-SJIS characters, because they are converted to SJIS
        before being converted to UTF-8.

    *   Some CP932 characters that are not in standard Shift_JIS (e.g.
        those not in Joyo Kanji) are not detected or converted. A string
        that includes such non-standard Shift_JIS characters is not
        detected as SJIS, and getcode() and all conversion methods will
        not work correctly.

    *   When using XS, character encoding detection of EUC-JP and SJIS
        (including all EMOJI) strings fails when they include "\e". In
        that case getcode() and all conversion methods will not work.

    *   The Japanese.pm file will be corrupted if it is transferred in
        FTP ASCII mode, because it has trailing binary data.

AUTHOR INFORMATION
    Copyright 2001-2002 SANO Taku (SAWATARI Mikage) and YAMASHINA Hio.
    All rights reserved.

    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.

    Bug reports and comments to: mikage@cpan.org. Thank you.

CREDITS
    Thanks very much to:

    NAKAYAMA Nao

    SUGIURA Tatsuki & Debian JP Project