DESCRIPTION
skf is yet another i18n-capable kanji filter,
designed for reading various CJK-coded files on the Net. It
converts input kanji texts or streams into a character
stream in the designated kanji code and writes the result to
standard output. Specifically, skf is designed to be
a versatile filter for reading documents in various code sets,
and does not have fancy features that are not directly
related to code conversion. |
| Like nkf, skf automatically recognizes the input file code when it is a kind of ISO-2022 code, and also recognizes Microsoft JIS (Shift_JIS) code and EUC if the input file is Japanese text and does not include X0201 kana. skf 1.9x can read various iso-2022 compliant charsets, including JIS Kanji code (X0208, X0212 and X0213), EUC encoding (euc-jp (with x-0213 support), euc-cn, euc-kr and euc-tw), ISO European latins (ISO-8859-1/2/3/4/5/6/7/10/11/14/15/16), BS 4730, NF Z 62-010 and X0201 kana with ESC-(-I, SS0, Locking shift. skf also supports some non-iso2022 compliant sets, including Microsoft Shift-JIS code, KOI-8-R/U, GB2312 (HZ), big5, VISCII (rfc1456, including VIQR), Unicode standard (UCS2/UTF-16, UTF7 and UTF8) and some vendor specific codes (KEIS83, JEF etc). Decoding features for some common encodings (MIME, Punycode and URI codepoint) are also supported. |
| Supported output codesets are X-0208/X-0212/X-0213 JIS, X-0201 JIS, ASCII, Microsoft Shift-JIS, EUC-jp/-kr/-cn, iso-2022-jp/kr, big5 and Unicode. |
| Unlike nkf, skf is designed to convert input code into some kind of human-readable form under a local environment (i.e. codeset), and has several extra conversion features. Such conversions include Windows/Macintosh specific code swap and old-new jis glyph change, html-format/TeX format conversion and variant unifications. |
| If file names are given, skf reads the files and writes the converted stream to stdout. If no file names are given, input is taken from stdin and output goes to stdout. Options are taken from the environment variables SKFENV and skfenv and then from the command line, in that order. Environment variables are not used when skf is running as root. |
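The order in which option sources are consulted can be pictured with a small Python sketch. It is an illustration only, not skf's implementation: the function name `gather_options` is hypothetical, and splitting the environment values on shell-style whitespace is an assumption the manual does not state.

```python
import shlex


def gather_options(environ, argv, is_root):
    """Return option words in the order described: SKFENV, skfenv, command line.

    Hypothetical helper. Splitting the variables with shlex is an assumption.
    """
    words = []
    if not is_root:  # environment variables are not used when running as root
        for name in ("SKFENV", "skfenv"):
            words.extend(shlex.split(environ.get(name, "")))
    words.extend(argv)
    return words


env = {"SKFENV": "--oc=euc-jp", "skfenv": "-u"}
print(gather_options(env, ["--oc=utf8"], is_root=False))
print(gather_options(env, ["--oc=utf8"], is_root=True))
```

The sketch only shows the ordering. How skf resolves conflicting options from different sources is not covered here.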
| skf does not use LOCALE-related environment variables for conversion, but error message output is controlled by the given locale. |
OPTIONS
|
skf-1.9 is written from scratch and inherits no code
from nkf. However, skf is intended to be a drop-in
replacement for nkf (v1.4) and has a subset of nkf
options. skf 1.9x recognizes the following options. |
| Buffering control |
| -b |
| Use buffered output. This is the default. |
| -u |
| Use unbuffered output. |
| Input/Output codeset options |
| --ic=input_code_set |
| Specify that the input codeset is input_code_set. Possible candidates are shown below. |
| --oc=output_code_set |
| Specify that the output codeset is output_code_set. Possible candidates are shown below. |
| Supported codeset |
| skf supports the following codesets. Codeset names are case insensitive. Note that iso-2022 escape-based input codesets (registered with IANA) are recognized automatically, and for this reason some codesets are treated as the same when specified as input. An o in the in column means the named codeset can be specified as input, and an x means it cannot. The out column has the same meaning for output. |
|
| Codeset explanations |
| iso-8859-1 |
| a.k.a. latin1. When specified as output, G0 = GL is ascii and G1 = GR is iso-8859-1. |
| iso-2022-jp, jis |
| Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS x0212 Supplementary Kanji. |
| jis-x0213 |
| Encoding is iso-2022-jp-3. For output, G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS x0213 plane2 Kanji. |
| jis-x0213-strict |
| Encoding is subset of iso-2022-jp-3-strict. For output, G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is not set. Output code as JIS x0208 whenever possible. JIS X-0213 input is automatically recognized. |
| jis-x0213-2004 |
| Encoding is iso-2022-jp-2003(2004). For output, G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS x0213 plane2 Kanji. |
| oldjis |
| Encoding is iso-2022-jp (JIS X-0208(1978)). G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS x0212 Supplementary Kanji. |
| euc-jp, euc |
| Encoding is 8-bit EUC using JIS X0208(1997) character set. G0 = GL is ascii, G1 = GR is JIS x0208, G2 is JIS x0201 kana and G3 is JIS x0212 Supplementary Kanji. |
| euc-x0213 |
| Encoding is 8-bit EUC-based JIS X0213(2000). G0 = GL is ascii, G1 = GR is X0213 plane 1, G2 is iso-8859-1 and G3 is JIS x0213 plane2 Kanji. |
| euc-jis-2004 |
| Encoding is 8-bit EUC-based JIS X0213(2004). G0 = GL is ascii, G1 = GR is X0213(2004) plane 1, G2 is iso-8859-1 and G3 is JIS x0213 plane2 Kanji. |
| euc-kr |
| Encoding is 8-bit EUC using KS X-1001 Wansung character set. G0 = GL is KS X1003, G1 = GR is KS X1001, G2 and G3 are not set. |
| euc7-kr iso-2022-kr |
| Encoding is iso-2022-kr (rfc1557). 7-bit EUC using KS X-1001 Wansung character set. G0 = GL is KS X1003, G1 is KS X1001, G2 and G3 are not set. |
| euc-cn |
| Encoding is 8-bit EUC using GB 2312 character set. G0 = GL is GB1988, G1 = GR is GB2312, G2 and G3 are not set. |
| euc7-cn |
| Encoding is 7-bit EUC using GB 2312 character set. G0 = GL is GB1988, G1 is GB2312, G2 and G3 are not set. |
| hz |
| Encoding is HZ encoded (rfc1842) GB 2312 character set. G0 = GL is GB1988, G1 = GR is GB2312, G2 and G3 are not set. |
| euc-tw |
| Encoding is EUC encoded CNS11643 Plane1/2. Subset of iso-2022-cn. G0 = GL is ascii, G1 = GR is CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is not set. |
| gb12345 |
| Encoding is 8-bit EUC using GB 12345 (GBF) character set. G0 = GL is GB1988, G1 = GR is GB12345, G2 and G3 are not set. |
| gbk |
| Encoding is GBK (a.k.a. cp936). G0 = GL is GB1988 and G1 = GR is GBK. G2 and G3 are not set. |
| big5 |
| Encoding is Big5 with ETen extension. Include Euro mapping. Uses ascii as latin part. |
| big5-cp950 |
| Encoding is Big5 (cp950) character set. Uses ascii as latin part. |
| VISCII (experimental) |
| Vietnamese VISCII (rfc1456). Not TCVN-5712. |
| VIQR (experimental) |
| Vietnamese VISCII with VIQR encoding (rfc1456). |
| sjis |
| Encoding is Shift-encoded JIS X0208(1997) character set. This code is same as cp932 as input, but gaiji area is not used for output. Uses JIS x-0201 latin as latin part. |
| sjis-x0213 |
| Encoding is Microsoft JIS using JIS X0213(2000) character set. |
| sjis-x0213-2004 |
| Encoding is Microsoft JIS using JIS X0213(2004) character set. 10 newly defined characters are added, but Unicode mapping is same as JIS X0213(2000). Uses JIS x-0201 latin as latin part. |
| sjis-cellular (experimental) |
| Encoding is Shift-encoded JIS X0208(1997) character set with NTT Docomo/Vodafone cellular phone glyph mapping. |
| cp932 |
| Encoding is Microsoft JIS with NEC gaiji area. |
| cp943 |
| Encoding is IBM cp943 (OS/2 code). |
| johab |
| Encoding is KS X1001(Johab). |
| ucs2 |
| Encoding is Unicode UTF-16 (v4.0). The default input/output byte order is little endian, and an input byte order mark is recognized. Output includes an endian mark by default unless --suppress-endian-mark is specified. Output covers the full Unicode range, using surrogate pairs, unless --limit-to-ucs2 is specified. |
| utf8 |
| Encoding is UTF-8 encoded Unicode (v4.0). Output doesn't include byte order mark unless --enable-endian-mark is specified. |
| utf7 |
| Encoding is UTF-7 encoded Unicode (v4.0). Output range is limited to UTF-16. |
| keis |
| Encoding is Hitachi KEIS83/90. |
| jef (experimental) |
| Encoding is Fujitsu JEF. Only basic part is supported. |
| koi8r |
| Russian KOI-8R code. |
| cp1251 |
| Eastern European cyrillic MS cp1251 code. |
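As a rough illustration of how some of the codesets listed above differ at the byte level, the following Python sketch encodes the same two kanji with Python's own codecs (not with skf). The codec names `iso2022_jp`, `euc_jp` and `shift_jis` belong to Python, and their tables may differ from skf's in vendor-specific areas.

```python
text = "\u65e5\u672c"  # the two kanji of "Nihon" (Japan)

# 7-bit JIS: ESC $ B designates JIS X0208 to G0, ESC ( B returns to ASCII.
jis = text.encode("iso2022_jp")

# EUC-JP: the same JIS X0208 code with the high bit set on both bytes (GR).
euc = text.encode("euc_jp")

# Shift-encoded JIS: the same characters packed by the Shift_JIS formula.
sjis = text.encode("shift_jis")

for name, data in (("jis", jis), ("euc-jp", euc), ("sjis", sjis)):
    print(f"{name:7s} {data.hex(' ')}")

# All three byte strings decode back to the same text.
assert jis.decode("iso2022_jp") == euc.decode("euc_jp") == sjis.decode("shift_jis")
```

Stripping the escape sequences from the JIS form and setting the high bit on each remaining byte gives the EUC-JP form, which is why both are described in terms of G0/G1 designations.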
| Shortcuts |
| -n -j |
| same as --oc=jis. |
| -s -x |
| same as --oc=sjis. |
| -a -e |
| same as --oc=euc-jp. |
| -q |
| same as --oc=ucs2. |
| -z |
| same as --oc=utf8. |
| -y |
| same as --oc=utf7. |
| -k |
| same as --oc=keis (experimental). |
| -A, -E |
| same as --ic=euc-jp. Assume input code set is EUC-JP. |
| -N |
| same as --ic=jis. Assume input code set is iso-2022-jp. |
| -S, -X |
| same as --ic=sjis. Assume input code set is Microsoft JIS. |
| -Q |
| same as --ic=ucs2. |
| -Y |
| same as --ic=utf7. |
| -Z |
| same as --ic=utf8. |
| -K |
| same as --ic=keis. |
| ISO-2022 Specific controls |
| After G0 to G3 have been set up according to the specified input codeset, the following options replace them with the assigned character set. |
| --set-g0=`char_set' |
| Set the predefined code set to plane 0 (G0). Supported `char_set' values are `ascii' (default), `x0201', `ksx1003' and `gb1988'. It is automatically invoked to GL (iso-2022-jp-1/2/3 assumption). This option works only with iso-2022-based input. The following options override the codeset-specified setting regardless of option order. |
| --set-g1=`char_set' |
| Set code set predefined to right plane (G1). Supported `char_set' is ascii, `x0201' (default), `iso8859-1', `iso8859-2', `iso8859-3', `iso8859-7', `iso8859-14', `iso8859-15', `koi8-r', `x0212', 'ks_x_1001' and 'gb_2312'. This option works with iso-2022-based input. |
| --set-g2=`char_set' |
| Set code set predefined to G2 plane. Supported `char_set' is `x0201' (default) `iso8859-1', `iso8859-2', `iso8859-3', `iso8859-7', `iso8859-14', `iso8859-15', `koi8-r', `x0212', 'ks_x_1001' and 'gb_2312'. This option works with iso-2022-based input. |
| --set-g3=`char_set' |
| Set code set predefined to G3 plane. Supported `char_set' is `x0201' (default) `iso8859-1', `iso8859-2', `iso8859-3', `iso8859-7', `iso8859-14', `iso8859-15', `koi8-r', `x0212', 'ks_x_1001' and 'gb_2312'. This option works with iso-2022-based input. |
| --euc-protect-g1 |
| In EUC input mode, suppress sequences to set a charset to G1. Such sequences are discarded. |
| --add-annon |
| Add announcer for JIS X-0208(1990) to X-0208 designate sequence. This option works only with iso-2022-based output. |
| JIS X-0212(Supplement Kanji code) Support |
| --x0212-enable |
| skf by default does not output JIS X-0212 code. This option enables use of the JIS X-0212 part. The output code set must be neither a Microsoft code nor KEIS. For Unicode variant encodings, this option is on by default. |
| Unicode coding specific control options |
| --use-compat |
| When output is one of the transformation formats of the Unicode standard, enable characters in the compatibility plane (0xfxxx). skf by default does not use these characters. |
| --use-ms-compat |
| When output is Unicode, make the translation Microsoft Windows compatible. This only affects some symbols in JIS Kanji, and adding the --use-compat option is recommended. |
| --use-cde-compat |
| When output is Unicode, make translation JIS X-0221-compatible. This codeset is same as CDE standard codeset. |
| --little-endian |
| When output is Unicode, use little endian byte-order. This is default. |
| --big-endian |
| When output is Unicode, use big endian byte-order. |
| --suppress-endian-mark |
| When output is UTF-16, do not use byte order marking. To make UTF-8N, use this option with --little-endian. This is off by default. |
| --enable-endian-mark |
| When output is UTF-8, output byte order marking. This is off by default. |
| --input-little-endian |
| When input is Unicode, assume input is little endian byte-ordered. This is default, but skf respects byte-order mark. |
| --input-big-endian |
| When input is Unicode, assume input is big endian byte-ordered. Note that skf respects byte-order mark. |
| --endian-protect |
| Do not use endian mark in the input stream. Endian mark is just discarded. |
| --use-replace-char |
| skf by default converts undefined (except 0x2xxx part) characters into "geta (U+3013)" code. This option specifies skf to use replacement char (0xfffc in UCS2) instead. |
| --limit-to-ucs2 |
| Do not use > 0x10000 area code in Unicode (i.e. limit code to ucs2 area). |
| --suppress-cjk-extension |
| Treat CJK extension A/B area as undefined. |
| --old-hangle-location |
| Treat the U+3400 area as Hangul (Unicode 1.0 compatibility). |
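The byte-order and surrogate-pair behavior described in this section can be illustrated with Python's standard codecs. This is a sketch of the UTF-16 format itself, not of skf's code. The sample character U+20B9F is chosen here only because it lies above 0xFFFF (in CJK extension B).

```python
import codecs

text = "A\U00020b9f"  # one ASCII letter plus one character above 0xFFFF

# Little-endian UTF-16 with a leading byte order mark (FF FE).
le_with_bom = codecs.BOM_UTF16_LE + text.encode("utf-16-le")

# Big-endian UTF-16 without a byte order mark.
be_no_bom = text.encode("utf-16-be")

print(le_with_bom.hex(" "))
print(be_no_bom.hex(" "))

# A reader that respects the mark recovers the text from either byte order.
assert le_with_bom.decode("utf-16") == text
assert (codecs.BOM_UTF16_BE + be_no_bom).decode("utf-16") == text

# U+20B9F does not fit in 16 bits, so it is stored as a surrogate pair.
high, low = be_no_bom[2:4], be_no_bom[4:6]
print(high.hex(), low.hex())
```

In UCS2-limited output no such pair can be produced, which is why characters above 0xFFFF need separate handling when --limit-to-ucs2 is in effect.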
| Codeset/Vendor Specific codeset handling flags |
| skf by default assumes machine specific parts of kanji code are Microsoft Windows compatible. Here are some options that control this behavior. |
| --disable-gaiji-support |
| Assume machine specific part is undefined. |
| --use-apple-gaiji |
| Assume machine specific part in input file is Macintosh (System 7,8,9 or OS X) compatible. |
| --dsbl-ibm-gaiji |
| Disable machine specific part in input file. |
| --disable-chart |
| Do not use Moji-keisen characters. This is for old Macintosh system (System 6.x or older) compatibility. |
| --disable-jis90 |
| Disable 2 added characters of JIS X-0208(1990). If this option is specified, these two characters are replaced by Kanji variants. This option is off by default. |
| --input-detect-jis78 |
| Distinguish JIS X-0208(1978) codeset and JIS X-0208(1983/90) codeset. This option is valid only when input encoding is JIS (ISO-2022). |
| Miscellaneous codeset related options |
| --old-nec-compat |
| Enable old NEC kanji sequence (ESC-K,H). Needs compile option -DOLD_NEC_COMPAT. |
| --no-utf7 |
| Assume input code set is *NOT* UTF-7 encoded Unicode. This option disables input utf7 testing. |
| Output conversion options |
| skf has various features to fit output file to local environment, and many of these are controlled by extended control switch described in this section. |
| --use-g0-ascii |
| set G0(=GL) for output encoding to ASCII, ignoring codeset designation. |
| X-0201 Kana conversions |
| skf by default converts X-0201 kana to X-0208 kana. To output X-0201 kana as is, use one of the following options. When output is designated as EUC or SJIS, these options enable X-0201 kana output in the way each code set provides. When Unicode output is specified, output of the equivalent kana part is controlled by --use-compat, not by the following switches. |
| --kana-jis7 |
| Use the SI/SO locking shift sequence to designate X-0201 kana. |
| --kana-jis8 |
| Output X-0201 kana using the 8-bit code right plane. |
| --kana-esci --kana-call |
| Use ESC-(-I to designate X-0201 kana. |
| --kana-enable |
| Use X-0201 kana when EUC (with G2) or SJIS output code is used. With JIS output, it is the same as --kana-call. |
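For readers unfamiliar with the distinction, the following Python sketch shows X-0201 (half-width) kana next to their X-0208 (full-width) equivalents. It uses Unicode NFKC normalization as a stand-in for the default conversion. That choice is an assumption made for illustration and is not how skf performs the mapping.

```python
import unicodedata

halfwidth = "\uff76\uff9e\uff85"  # half-width KA, voiced sound mark, NA

# NFKC folds half-width kana into full-width kana and composes the
# voiced sound mark with the preceding kana (KA + mark -> GA).
fullwidth = unicodedata.normalize("NFKC", halfwidth)
print(fullwidth, [hex(ord(c)) for c in fullwidth])

# Keeping X-0201 kana "as is" depends on the output codeset:
ka = "\uff76"
print(ka.encode("shift_jis").hex())  # one byte in the 8-bit right half
print(ka.encode("euc_jp").hex())     # single shift 2 (0x8e) plus the kana byte
```

The last two lines show why the text speaks of the ways provided by each code set: Shift_JIS stores the kana as a single 8-bit byte, while EUC-JP reaches it through G2.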
| URI/TeX conversion feature options |
| With Unicode(tm) family output codings, skf outputs non-ascii latin characters as they are. With other output codings, skf converts these characters using the following rules: |
|
(1) If the code is defined in the specified output codeset, it is output in that codeset.
(2) If one of the following html convert modes is enabled and the code is defined in the html/sgml codeset, it is converted to an entity reference or a code-point reference.
(3) If tex convert mode is enabled and the code is defined in the tex codeset, it is converted to tex format.
(4) If the code is a kind of combined ligature, it is shown as a set of characters.
(5) Otherwise, a kind of replacement character is shown, with a warning. |
| --convert-html --convert-sgml |
| Enable html convert mode. This mode is cleared by --reset. These two options are synonyms, and are treated as same option. |
| --convert-html-decimal |
| Enable html code-point decimal convert mode. This mode is cleared by --reset. |
| --convert-html-hexadecimal |
| Enable html code-point hexadecimal convert mode. This mode is cleared by --reset. |
| --convert-tex |
| Enable TeX convert mode. This mode is cleared by --reset. |
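The three html reference styles named above (entity reference, decimal code point, hexadecimal code point) look as follows when produced with Python's standard library. This is a sketch of the notations only. The exact spelling skf emits, such as hex digit case, is not specified here, and the helper names are hypothetical.

```python
from html.entities import codepoint2name


def to_decimal_refs(text):
    """Replace non-ASCII characters with decimal references (&#NNN;)."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def to_hex_refs(text):
    """Replace non-ASCII characters with hexadecimal references (&#xHH;)."""
    return "".join(c if ord(c) < 128 else f"&#x{ord(c):x};" for c in text)


def to_entity_refs(text):
    """Use a named entity when HTML defines one, else a decimal reference."""
    out = []
    for c in text:
        if ord(c) < 128:
            out.append(c)
        elif ord(c) in codepoint2name:
            out.append(f"&{codepoint2name[ord(c)]};")
        else:
            out.append(f"&#{ord(c)};")
    return "".join(out)


sample = "caf\u00e9"
print(to_entity_refs(sample), to_decimal_refs(sample), to_hex_refs(sample))
```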
| Encoding control options |
| --decode=`encoding scheme' |
| Specify the encoding scheme of the input stream. Supported encoding schemes are `hex' (CAP hex-code), 'mime' (MIME), 'mime_q' (MIME Q-encoding), 'mime_b' (MIME B-encoding), 'uri_encode' (URI character reference), 'puny' (ACE punycode) and 'hex_perc_encode' (URI percent notation). Base64 and rot13/47 decoding are also supported. When mime decoding is specified, the base text is assumed to be EUC encoded unless specified otherwise. |
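To make the encoding schemes named above concrete, the following Python sketch decodes one short sample of several of them with the standard library. It illustrates the wire formats only and says nothing about how skf implements --decode. The sample strings were chosen for this example.

```python
import codecs
from email.header import decode_header
from urllib.parse import unquote

# MIME B-encoding: an RFC 2047 encoded word carrying base64 data.
raw, charset = decode_header("=?UTF-8?B?5pel5pys?=")[0]
mime_text = raw.decode(charset)

# URI percent notation: each byte of the UTF-8 form written as %HH.
uri_text = unquote("%E6%97%A5%E6%9C%AC")

# ACE punycode: the label that follows the "xn--" prefix in a domain name.
puny_text = b"wgv71a119e".decode("punycode")

# rot13: letters rotated by 13 places.
rot_text = codecs.decode("Uryyb", "rot13")

print(mime_text, uri_text, puny_text, rot_text)
```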
| End of line control options |
| --lineend-thru |
| Output end of line code as it is. Also output ^Z code as it is. This is default. |
| --lineend-cr --lineend-mac |
| Use CR as end of line code. Also delete ^Z code from input stream. |
| --lineend-lf --lineend-unix |
| Use LF as end of line code. Also delete ^Z code from input stream. |
| --lineend-crlf --lineend-windows |
| Use CRLF as end of line code. Also delete ^Z code from input stream. |
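The end-of-line conversions above amount to a small byte-level rewrite, sketched below in Python. The function name `convert_line_ends` is hypothetical, and removing every ^Z (0x1a) byte is one reading of "delete ^Z code from input stream".

```python
import re


def convert_line_ends(data: bytes, eol: bytes) -> bytes:
    """Rewrite CRLF, CR and LF line ends to `eol` and drop ^Z bytes."""
    data = data.replace(b"\x1a", b"")  # ^Z, the old DOS end-of-file mark
    # CRLF is listed first so that it is replaced as a single unit.
    return re.sub(rb"\r\n|\r|\n", eol, data)


mixed = b"one\r\ntwo\rthree\n\x1a"
print(convert_line_ends(mixed, b"\n"))    # like --lineend-lf
print(convert_line_ends(mixed, b"\r\n"))  # like --lineend-crlf
```

With --lineend-thru, the default, no such rewrite takes place and the input bytes are passed through unchanged.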
| File control options |
| --filewise-detect --force-reset |
| Reset and re-detect input code set at the start of each file. |
| --linewise-detect |
| Reset and re-detect input code set at the start of each line. This option needs -DKUNIMOTO at compile time. |
| Misc. Control options |
| --suppress-space-convert |
| By default, skf converts an ideographic space into two ascii spaces. This option suppresses this behavior. |
| --reset |
| Reset all flags specified by extended controls and given input code. |
| --inquiry |
| Detect the input code and write the detection result to stdout. No filtering output is performed. |
| --show-filename |
| When inquiry(--inquiry) is on, this option adds each file name to output. Enabled by default when multiple input files are specified. |
| --invis-strip |
| Delete all escape sequences not belonging to ISO-2022 code extension. This is intended to replace invisstrip command bundled in inews package. |
| --html-sanitize |
| Convert several characters in an HTML document to entity reference expressions. Specifically, the characters "!#$&%()/<>:;? are escaped as entity expressions. |
| -I |
| Warn if input has unassigned code points. |
| -v |
| print version and exit. |
| -h |
| print brief help. |
| --show-supported-codeset |
| Display supported codeset and exit. |
| --show-supported-charset |
| Display supported character set and exit. |
FILES
| /usr/(local/)share/skf/lib/ (Unices) |
| /Program Files/skf/share/lib (MS Windows) |
| These directories are where external codeset conversion tables go. The location that the current skf assumes is shown by the -h option. |
AUTHOR
| skf is written by Seiji Kaneko (skaneko@a2.mbn.or.jp), based on an idea from nkf, written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The X-0213 code table is derived from the work of earthian@tama.or.jp. |
ACKNOWLEDGEMENT
| skf is inspired by works or requests by |
| shinoda@cs.titech, kato@cs.titech, uematsu@cs.titech, void@global ohta@ricoh, Hinata(HKE) Ashizawa(CRL) Kunimoto(SDL) |
BUGS AND LIMITATIONS
|
1. skf can handle mixed coding with some limitations.
However, code detection easily fails for mixed code, and
giving an explicit input code set is strongly encouraged. In case of emergency, the --linewise-detect option may help. |
| 2. When using UCS2, UTF-16, UTF-8 and UTF-7, skf tries to detect the input code, but giving an explicit code set is encouraged. skf doesn't support UCS4, but does support UTF-16/UTF-32 (i.e. surrogate pairs). skf just passes composite characters to output; no further processing is performed. |
| 3. skf implements ISO-2022 with the following exceptions: |
| (1) GL 0x20 is always space. |
| (2) Sequences for setting codes to C1 and C2 are always ignored. |
| (3) If an unknown sequence is given to G0, G0 is set to ascii, and locking/single shift is cleared. |
| (4) Sequences for 96 character multibyte coding are ignored. |
| (5) Sequences for standard return, or for calling a coding system with or without standard return, may generate unpredictable results. |
| 4. Since skf by default tests input stream to detect utf7 coding, skf sometimes misdetects pure ascii text as utf7. If this occurs, use --no-utf7 option. |
| 5. Error output coding is controlled by LOCALE environment variables on UN*X systems. Since skf does not check whether stdout and stderr are redirected into the same stream, this case must be handled by the user. |
| 6. skf-1.91 converts KEIS/JIS X-0213 code using CJK-extension B and the CJK compatibility area. For this reason, X-0213 and KEIS conversion results vary depending on the --use-compat and --limit-to-ucs2 switches. |
| 7. Current external table format supports only UCS2 characters. |
| 8. JIS X-0207(1979) is not supported. JIS X-0211(1987) is designed to be supported (i.e. common terminal control sequence is transparently passed to output). |
| 9. Even if the unbuffered option (-u) is specified, some buffering related to code translation is still performed (in MIME, kana, VIQR etc.). |
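Item 4 above follows from the fact that every UTF-7 stream consists only of ASCII bytes, so a detector cannot tell the two apart from the byte values alone. The Python sketch below shows one byte string that is valid under both readings. It uses Python's `utf-7` codec and does not reproduce skf's detection logic.

```python
data = b"+ADw-b+AD4-"

# Read as plain ASCII, the bytes are taken literally.
as_ascii = data.decode("ascii")

# Read as UTF-7, each "+...-" run is base64-coded UTF-16.
as_utf7 = data.decode("utf-7")

print(repr(as_ascii))
print(repr(as_utf7))
```

Both decodings succeed without error, so only context, or an explicit option such as --no-utf7, can settle which one was meant.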
Note
| 1. Extended options have changed extensively since skf-1.3. Some archaic options (e.g. -B, -@ and -r) have been deleted from this version. |
| 2. From version 1.9, default code set assumed by skf has changed to JIS X-0208(1990) with Microsoft Japanese Windows gaiji (i.e. CP932). |
| 3. From version 1.9, skf supports iso8859 and other charset by using Unicode as internal code set. For this reason, skf-1.9 behaves differently from earlier versions. |
| 4. Code autodetection is not perfect by design. If it has failed to detect input code properly, please give input code information explicitly. |
| 5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are converted using JIS X-0124 and other convention. During this conversion, its byte length is not preserved. |
| 6. skf is intended to pass ANSI compatible terminal control code transparently, but this is not guaranteed. |
| 7. nkf's -i and -o options still work, but they are valid only for iso-2022-jp and are independent of codeset specifications. Using these options is strongly discouraged. |
| 8. There are some undocumented options. These options should be considered as highly experimental. |
Notice
| Unicode(TM) is a trademark of Unicode, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. Macintosh is a registered trademark of Apple Computer Inc. Vodafone is a trademark of Vodafone K.K. Other names and terms may be trademarks or registered trademarks of their respective owners. The trademark symbol (TM) is omitted in this manual page. |