NAME
    Unicode::Japanese - Japanese Character Encoding Handler

SYNOPSIS
    use Unicode::Japanese;

    # convert utf8 -> sjis

    print Unicode::Japanese->new($str)->sjis;

    # convert sjis -> utf8

    print Unicode::Japanese->new($str,'sjis')->get;

    # convert sjis (imode_EMOJI) -> utf8

    print Unicode::Japanese->new($str,'sjis-imode')->get;

    # convert ZENKAKU (utf8) -> HANKAKU (utf8)

    print Unicode::Japanese->new($str)->z2h->get;

DESCRIPTION
    Module for conversion among Japanese character encodings.

  FEATURES
    * The instance stores internal strings in UTF-8.

    * Supports both XS and Non-XS. Use XS for high performance, or Non-XS
      for ease of use (it only requires copying Japanese.pm).

    * Supports conversion between ZENKAKU and HANKAKU.

    * Safely handles the "EMOJI" of mobile phones (DoCoMo i-mode, ASTEL
      dot-i and J-PHONE J-Sky) by mapping them to the Unicode Private Use
      Area.

    * Supports converting an EMOJI to the EMOJI with the same image in
      another mobile phone standard.

    * Treats Shift_JIS (SJIS) as MS-CP932. (Shift_JIS on MS-Windows
      (MS-SJIS/MS-CP932) differs from generic Shift_JIS encodings.)

    * When converting Unicode to SJIS (and EUC-JP/JIS), characters that
      cannot be converted to SJIS (except "EMOJI") are escaped in "&#dddd;"
      format. "EMOJI" in the Unicode Private Use Area are replaced with
      '?'. When converting from Unicode to SJIS for mobile phones, any
      character outside the target standard is replaced with '?'.
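
    The escaping rule in the last item can be sketched in plain Perl. This
    is an illustration only, not the module's implementation:
    escape_unmappable and the $is_mappable predicate are hypothetical
    names, "&#dddd;" is assumed to carry the decimal code point, and the
    EMOJI area is assumed to be U+0FF000 - U+0FFDFF (the span covering the
    ranges listed under DESCRIPTION OF UNICODE MAPPING).

    ```perl
    use strict;
    use warnings;

    # Hypothetical helper: $text is a decoded character string and
    # $is_mappable->($codepoint) says whether the target encoding has it.
    sub escape_unmappable {
        my ($text, $is_mappable) = @_;
        my $out = '';
        for my $ch (split //, $text) {
            my $cp = ord $ch;
            if ($is_mappable->($cp)) {
                $out .= $ch;             # representable: keep as is
            }
            elsif ($cp >= 0x0FF000 && $cp <= 0x0FFDFF) {
                $out .= '?';             # EMOJI in the private use area
            }
            else {
                $out .= "&#$cp;";        # assumed decimal reference
            }
        }
        return $out;
    }

    # Toy predicate for the demonstration: only ASCII is "mappable".
    my $escaped = escape_unmappable("A\x{20AC}\x{FF800}", sub { $_[0] < 0x80 });
    print "$escaped\n";    # prints A&#8364;?
    ```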

METHODS
    $s = Unicode::Japanese->new($str [, $icode [, $encode]])
        Creates a new instance of Unicode::Japanese.

        If arguments are specified, they are passed through to the set
        method.

    $s->set($str [, $icode [, $encode]])

        $str: string
        $icode: character encoding, may be omitted (default = 'utf8')
        $encode: ASCII encoding, may be omitted.

        Sets a string in the instance. If '$icode' is omitted, the string
        is treated as UTF-8.

        To specify an encoding, choose from the following: 'jis', 'sjis',
        'euc', 'utf8', 'ucs2', 'ucs4', 'utf16', 'utf16-be', 'utf16-le',
        'utf32', 'utf32-be', 'utf32-le', 'ascii', 'binary', 'sjis-imode',
        'sjis-doti', 'sjis-jsky'.

        "&#dddd;" is converted to "EMOJI" when 'sjis-imode' or 'sjis-doti'
        is specified.

        For automatic encoding detection, you MUST specify 'auto'; the
        getcode() method is then called automatically.

        For ASCII encoding, only 'base64' may be specified. With it, the
        string will be decoded before storing.

        To decode binary, specify 'binary' as the encoding.
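
        A minimal usage sketch of the argument combinations described
        above. It uses only the documented set() and get() calls; the
        sample byte strings are assumptions chosen for illustration.

        ```perl
        use strict;
        use warnings;
        use Unicode::Japanese;

        my $s = Unicode::Japanese->new();

        # No $icode: the bytes are taken as UTF-8.
        $s->set('abc');

        # Explicit input encoding: "\x82\xA0" is HIRAGANA LETTER A in SJIS.
        $s->set("\x82\xA0", 'sjis');
        my $utf8 = $s->get;                  # internal UTF-8 form

        # Base64-wrapped input is decoded before it is stored.
        $s->set('YWJj', 'utf8', 'base64');   # 'YWJj' is base64 for 'abc'
        my $plain = $s->get;

        # Detection has to be requested explicitly with 'auto'.
        $s->set("\x82\xA0\x82\xA2", 'auto');
        ```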

    $str = $s->get

        $str: string (UTF-8)

        Gets the string in UTF-8.

    $code = $s->getcode($str)

        $str: string
        $code: character encoding name

        Detects the character encoding of *$str*.

        Notice: This method detects the encoding of *$str*, NOT of the
        string stored in the instance.

        Character encodings are distinguished by the following algorithm:

        (In case of PurePerl)

        1   If BOM of UTF-32 is found, the encoding is utf32.

        2   If BOM of UTF-16 is found, the encoding is utf16.

        3   If it is in proper UTF-32BE, the encoding is utf32-be.

        4   If it is in proper UTF-32LE, the encoding is utf32-le.

        5   If there are no non-ASCII characters, the encoding is ascii.
            (Control codes other than escape sequences are treated as
            ASCII.)

        6   If it includes ISO-2022-JP(JIS) escape sequences, the encoding
            is jis.

        7   If it includes "J-PHONE EMOJI", the encoding is sjis-jsky.

        8   If it is in proper EUC-JP, the encoding is euc.

        9   If it is in proper SJIS, the encoding is sjis.

        10  If it is in proper SJIS and "EMOJI" of i-mode, the encoding is
            sjis-imode.

        11  If it is in proper SJIS and "EMOJI" of dot-i, the encoding is
            sjis-doti.

        12  If it is in proper UTF-8, the encoding is utf8.

        13  If none above is true, the encoding is unknown.
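
        The ordering of the first checks matters, which a small sketch can
        show. It models only steps 1, 2, 5 and 6 of the PurePerl list above
        and is not the module's code: sketch_detect is a hypothetical name,
        and the set of ISO-2022-JP escape sequences it accepts is an
        assumption.

        ```perl
        use strict;
        use warnings;

        # Hypothetical helper covering only the BOM, ascii and jis checks.
        sub sketch_detect {
            my ($bytes) = @_;
            # UTF-32 first: its little-endian BOM begins with the UTF-16LE BOM.
            return 'utf32'
                if $bytes =~ /\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00)/;
            return 'utf16' if $bytes =~ /\A(?:\xFE\xFF|\xFF\xFE)/;
            # Seven-bit bytes only, and no ESC (0x1B).
            return 'ascii' if $bytes !~ /[\x1B\x80-\xFF]/;
            # Assumed designators: ESC $ @, ESC $ B, ESC ( B, ESC ( J.
            return 'jis' if $bytes =~ /\x1B(?:\$[\@B]|\([BJ])/;
            return 'undecided';    # the remaining steps are not modelled
        }

        print sketch_detect("\xFF\xFE\x00\x00A\x00\x00\x00"), "\n";   # utf32
        print sketch_detect("\xFF\xFEA\x00"), "\n";                    # utf16
        print sketch_detect("plain text"), "\n";                       # ascii
        print sketch_detect("\x1B\$B\x24\x22\x1B(B"), "\n";            # jis
        ```

        Testing the UTF-16 BOM before the UTF-32 one would report every
        little-endian UTF-32 string as utf16.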

        (In case of XS)

        1   If BOM of UTF-32 is found, the encoding is utf32.

        2   If BOM of UTF-16 is found, the encoding is utf16.

        3   The string is checked by state transition to see whether it is
            valid in any of the encodings listed below.

            ascii / euc-jp / sjis / jis / utf8 / utf32-be / utf32-le /
            sjis-jsky / sjis-imode / sjis-doti

        4   The order listed below is applied for the final determination.

            utf32-be / utf32-le / ascii / jis / euc-jp / sjis / sjis-jsky /
            sjis-imode / sjis-doti / utf8

        5   If none above is true, the encoding is unknown.

        Regarding the algorithm, pay attention to the following:

        * UTF-8 is occasionally detected as SJIS.

        * UCS2 can NOT be detected automatically.

        * UTF-16 can be detected only when the string has a BOM.

        * "EMOJI" can be detected only when stored in binary, not in
          "&#dddd;" format. (If stored only in "&#dddd;" format, getcode()
          returns an incorrect result, and the "EMOJI" will be corrupted.)

        Because XS and PurePerl use different algorithms, their detection
        results may differ. For example, an SJIS string that contains
        escape characters is detected as SJIS by PurePerl but cannot be
        detected as SJIS by XS, because the XS algorithm cannot distinguish
        SJIS from SJIS-Jsky in that case. XS is supposed to exclude EUC-JP
        strings with escape characters from detection in the same way.

    $str = $s->conv($ocode, $encode)

        $ocode: output character encoding (Choose from 'jis', 'sjis', 'euc',
        'utf8', 'ucs2', 'ucs4', 'utf16', 'binary')
        $encode: ASCII encoding, may be omitted.
        $str: string

        Gets a string converted to *$ocode*.

        For ASCII encoding, only 'base64' may be specified. With it, the
        string encoded in base64 will be returned.
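
        A short usage sketch of conv(), using only the encodings named
        above. The sample character is an assumption chosen because its
        byte values are well known.

        ```perl
        use strict;
        use warnings;
        use Unicode::Japanese;

        # "\xE3\x81\x82" is HIRAGANA LETTER A (U+3042) in UTF-8.
        my $s = Unicode::Japanese->new("\xE3\x81\x82");

        my $sjis = $s->conv('sjis');            # bytes 0x82 0xA0
        my $euc  = $s->conv('euc');             # bytes 0xA4 0xA2
        my $b64  = $s->conv('sjis', 'base64');  # SJIS bytes, base64-encoded

        printf "%s\n", unpack 'H*', $sjis;      # prints 82a0
        ```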

    $s->tag2bin
        Replaces each "&#dddd;" substring in the string with the character
        it represents, in binary form.
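
        The effect can be pictured with a plain-Perl sketch. It is not the
        module's implementation: sketch_tag2bin is a hypothetical name, and
        it assumes that dddd is a decimal code point and that the result is
        UTF-8, matching the instance's internal form.

        ```perl
        use strict;
        use warnings;
        use Encode qw(encode);

        # Hypothetical helper: expands decimal references in a byte string.
        sub sketch_tag2bin {
            my ($bytes) = @_;
            $bytes =~ s/&#(\d+);/encode('UTF-8', chr($1))/ge;
            return $bytes;
        }

        my $expanded = sketch_tag2bin('A&#12354;B');   # 12354 is U+3042
        printf "%s\n", unpack 'H*', $expanded;         # prints 41e3818242
        ```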

    $s->z2h
        Converts ZENKAKU to HANKAKU.

    $s->h2z
        Converts HANKAKU to ZENKAKU.

    $s->hira2kata
        Converts HIRAGANA to KATAKANA.

    $s->kata2hira
        Converts KATAKANA to HIRAGANA.
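
        A usage sketch for the width and kana conversions above. The z2h
        call is chained as in the SYNOPSIS; hira2kata is called as a
        separate statement so the sketch does not depend on its return
        value. The sample strings are assumptions.

        ```perl
        use strict;
        use warnings;
        use Unicode::Japanese;

        # UTF-8 byte strings, written as escapes to stay encoding-neutral.
        my $wide = "\xEF\xBC\xA1\xEF\xBC\xA2\xEF\xBC\x91";  # fullwidth "AB1"
        my $hira = "\xE3\x81\x82";                          # hiragana A

        my $narrow = Unicode::Japanese->new($wide)->z2h->get;

        my $s = Unicode::Japanese->new($hira);
        $s->hira2kata;          # converts the stored string
        my $kata = $s->get;     # katakana A, bytes E3 82 A2
        ```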

    $str = $s->jis
        $str: string (JIS)

        Gets the string converted to ISO-2022-JP(JIS).

    $str = $s->euc
        $str: string (EUC-JP)

        Gets the string converted to EUC-JP.

    $str = $s->utf8
        $str: string (UTF-8)

        Gets the string converted to UTF-8.

    $str = $s->ucs2
        $str: string (UCS2)

        Gets the string converted to UCS2.

    $str = $s->ucs4
        $str: string (UCS4)

        Gets the string converted to UCS4.

    $str = $s->utf16
        $str: string (UTF-16)

        Gets the string converted to UTF-16(big-endian). BOM is not added.

    $str = $s->sjis
        $str: string (SJIS)

        Gets the string converted to Shift_JIS(MS-SJIS/MS-CP932).

    $str = $s->sjis_imode
        $str: string (SJIS/imode_EMOJI)

        Gets the string converted to SJIS for i-mode.

    $str = $s->sjis_doti
        $str: string (SJIS/dot-i_EMOJI)

        Gets the string converted to SJIS for dot-i.

    $str = $s->sjis_jsky
        $str: string (SJIS/J-SKY_EMOJI)

        Gets the string converted to SJIS for J-SKY.

    @str = $s->strcut($len)

        $len: number of characters
        @str: strings

        Splits the string into pieces of length *$len*.

    $len = $s->strlen
        $len: `visual width' of the string

        Gets the length of the string. This method is offered as a
        substitute for perl's built-in length(). ZENKAKU characters are
        counted as length 2, regardless of whether the encoding is SJIS or
        UTF-8.
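
        The difference from the built-in length() can be seen in a short
        sketch. The sample text is an assumption; the commented values
        follow from the width rule stated above.

        ```perl
        use strict;
        use warnings;
        use Unicode::Japanese;

        # Hiragana a, i, u as UTF-8 bytes: 9 bytes, 3 ZENKAKU characters.
        my $text = "\xE3\x81\x82\xE3\x81\x84\xE3\x81\x86";

        my $bytes = length $text;                            # 9 (byte count)
        my $width = Unicode::Japanese->new($text)->strlen;   # 6 (3 x width 2)
        my $ascii = Unicode::Japanese->new('abc')->strlen;   # 3
        ```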

    $s->join_csv(@values);
        @values: data array

        Converts the array to a string in CSV format and stores it in the
        instance. A newline ("\n") is added at the end of the string.

    @values = $s->split_csv;
        @values: data array

        Splits the string, treating it as CSV format. Each newline ("\n")
        is removed before the split.

DESCRIPTION OF UNICODE MAPPING
    SJIS
      Mapped as MS-CP932. Mapping table in the following URL is used.

      ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

      If a character cannot be mapped to SJIS from Unicode, it will be
      converted to &#dddd; format.

      Also, any unmapped character will be converted into "?" when
      converting to SJIS for mobile phones.

    EUC-JP/JIS
      Converted to SJIS and then mapped to Unicode. Any non-SJIS character
      in the string will not be mapped correctly.

    DoCoMo i-mode
      The "EMOJI" portion in F800 - F9FF is mapped to U+0FF800 -
      U+0FF9FF.

    ASTEL dot-i
      The "EMOJI" portion in F000 - F4FF is mapped to U+0FF000 -
      U+0FF4FF.
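
      Both the i-mode and dot-i ranges above differ from their Unicode
      targets by the same fixed offset (0x0F0000), so the mapping can be
      sketched as simple arithmetic. emoji_to_unicode is a hypothetical
      name; the module itself may well use lookup tables instead.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: maps a two-byte carrier code to the private
      # use code point, or returns nothing outside both ranges.
      sub emoji_to_unicode {
          my ($code) = @_;
          return $code + 0x0F0000
              if ($code >= 0xF800 && $code <= 0xF9FF)     # DoCoMo i-mode
              || ($code >= 0xF000 && $code <= 0xF4FF);    # ASTEL dot-i
          return;
      }

      my $cp = emoji_to_unicode(unpack 'n', "\xF8\x9F");
      printf "U+%06X\n", $cp;    # prints U+0FF89F
      ```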

    J-PHONE J-SKY
      A "J-SKY EMOJI" is encoded as follows: the "\e\$" (\x1b\x24) escape
      sequence, the first byte, the second byte and "\x0f". A run of
      sequential "EMOJI" with identical first bytes may be compressed by
      listing only the second bytes after the shared first byte.

      4500 - 47FF is mapped to U+0FFB00 - U+0FFDFF, treating the first and
      the second bytes together as one EMOJI character.

      Unicode::Japanese will compress "J-SKY_EMOJI" automatically when the
      first bytes of a sequence of "EMOJI" are identical.
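
      The byte layout and the compression of runs can be sketched in plain
      Perl. This is an illustration under the description above, not the
      module's code: pack_jsky and jsky_to_unicode are hypothetical names,
      and the sample byte pairs are assumptions.

      ```perl
      use strict;
      use warnings;

      # Hypothetical helper: takes [first, second] byte pairs and emits
      # J-SKY sequences, sharing one escape and first byte across a run.
      sub pack_jsky {
          my @pairs = @_;
          my $out = '';
          my $open;                   # first byte of the open run, if any
          for my $pair (@pairs) {
              my ($first, $second) = @$pair;
              if (!defined $open || $open ne $first) {
                  $out .= "\x0f" if defined $open;   # close previous run
                  $out .= "\e\$" . $first;           # ESC $ plus first byte
                  $open = $first;
              }
              $out .= $second;
          }
          $out .= "\x0f" if defined $open;
          return $out;
      }

      # Code point for one pair, using the 4500 - 47FF range stated above.
      sub jsky_to_unicode {
          my ($first, $second) = @_;
          return unpack('n', $first . $second) - 0x4500 + 0x0FFB00;
      }

      my $packed = pack_jsky(['G', '!'], ['G', '"'], ['E', '!']);
      printf "%s\n", unpack 'H*', $packed;   # prints 1b244721220f1b2445210f
      printf "U+%06X\n", jsky_to_unicode('G', '!');   # prints U+0FFD21
      ```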

PurePerl mode
       use Unicode::Japanese qw(PurePerl);

    If the module is loaded with the 'PurePerl' keyword, it works in Non-XS
    mode.

BUGS
    * EUC-JP, JIS strings cannot be converted correctly when they include
      non-SJIS characters because they are converted to SJIS before being
      converted to UTF-8.

    * Some characters of CP932 that are not in standard Shift_JIS (e.g. not
      in Joyo Kanji) will not be detected or converted.

      When a string includes such non-standard Shift_JIS characters, it
      will not be detected as SJIS. Also, getcode() and all conversion
      methods will not work correctly.

    * When using XS, character encoding detection fails for EUC-JP and SJIS
      (including all EMOJI variants) strings that include "\e". Also,
      getcode() and all conversion methods will not work.

    * The Japanese.pm file will be corrupted if sent via FTP in ASCII mode,
      because it has trailing binary data.

AUTHOR INFORMATION
    Copyright 2001-2002 SANO Taku (SAWATARI Mikage) and YAMASHINA Hio. All
    rights reserved.

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    Bug reports and comments to: mikage@cpan.org. Thank you.

CREDITS
    Thanks very much to:

    NAKAYAMA Nao

    SUGIURA Tatsuki & Debian JP Project