View skf_1.92_eman.html

file info

file name: skf_1.92_eman
last update: 2004-05-02 23:20
type: HTML
editor: Seiji Kaneko
description: skf_1.92 English HTML man page
language: English

SKF MANPAGE

SKF (1)

NAME
SYNOPSIS
DESCRIPTION
OPTIONS
FILES
AUTHOR
ACKNOWLEDGEMENT
BUGS AND LIMITATIONS
Note
Notice

NAME

skf - simple Kanji Filter (v1.92)

SYNOPSIS

skf [-AEIJKNQRSXZabdehjknqrsuvxz] [ long_format_options ] [infiles..]

DESCRIPTION

skf is yet another i18n-capable kanji filter, designed for reading various CJK-coded files on the Net. It converts input kanji texts or streams into a character stream in the designated kanji code and writes the result to standard output. Specifically, skf is designed to be a versatile filter for reading documents in various code sets, and does not have fancy features that are not directly related to code conversion.
Like nkf, skf automatically recognizes the input file code when it is a kind of ISO-2022 code, and also recognizes Microsoft JIS (Shift_JIS) code and EUC if the input file is Japanese text and does not include X0201 kana. skf 1.9x can read various iso-2022 compliant charsets, including JIS Kanji code (X0208, X0212 and X0213), EUC encodings (euc-jp (with x-0213 support), euc-cn, euc-kr and euc-tw), ISO European Latin sets (ISO-8859-1/2/3/4/5/6/7/10/11/14/15/16), BS 4730, NF Z 62-010 and X0201 kana with ESC-(-I, SS0, Locking shift. skf also supports some sets that are not iso-2022 compliant, including Microsoft Shift-JIS code, KOI-8-R/U, GB2312 (HZ), big5, VISCII (rfc1456, including VIQR), the Unicode standard (UCS2/UTF-16, UTF7 and UTF8) and some vendor-specific codes (KEIS83, JEF etc). Decoding of some common encodings (MIME, Punycode and URI codepoint) is also supported.
Supported output codesets are X-0208/X-0212/X-0213 JIS, X-0201 JIS, ASCII, Microsoft Shift-JIS, EUC-jp/-kr/-cn, iso-2022-jp/kr, big5 and Unicode.
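As a rough illustration of what a codeset conversion involves, the following sketch uses Python's standard codecs rather than skf itself. The helper name convert is hypothetical. It decodes an ISO-2022-JP byte stream and re-encodes it as EUC-JP, Shift_JIS and UTF-8, which is roughly what --ic=jis combined with --oc=euc-jp, --oc=sjis or --oc=utf8 asks skf to do. Python's tables are not guaranteed to match skf's in vendor-specific areas.

```python
def convert(data: bytes, src: str, dst: str) -> bytes:
    """Decode data from codec src and re-encode it in codec dst."""
    return data.decode(src).encode(dst)

# "漢字" (kanji) as an ISO-2022-JP byte stream: ESC $ B designates
# JIS X 0208 and ESC ( B returns to ASCII at the end.
jis_bytes = "漢字".encode("iso2022_jp")

euc_bytes = convert(jis_bytes, "iso2022_jp", "euc_jp")
sjis_bytes = convert(jis_bytes, "iso2022_jp", "shift_jis")
utf8_bytes = convert(jis_bytes, "iso2022_jp", "utf_8")

for name, b in [("jis", jis_bytes), ("euc-jp", euc_bytes),
                ("sjis", sjis_bytes), ("utf8", utf8_bytes)]:
    print(f"{name:7s} {b.hex(' ')}")
```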
Unlike nkf, skf is designed to convert input code into some kind of human-readable form under a local environment (i.e. codeset), and has several extra conversion features. These conversions include Windows/Macintosh-specific code swaps, old-new JIS glyph changes, html-format/TeX-format conversion and variant unification.
If file names are given, skf reads the files and writes the converted stream to stdout. If no file names are given, input is taken from stdin and output goes to stdout. Options are taken from the environment variables SKFENV and skfenv and then from the command line, in this order. Environment variables are not used when skf is running as root.
skf does not use LOCALE-related environment variables for conversion, but error message output is controlled by the given locale settings.
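The option order described above (SKFENV, then skfenv, then the command line, with the environment skipped for root) can be sketched as follows. This is only a model of the stated order. The function name collect_options is hypothetical, and treating each variable as a whitespace-separated option string is an assumption, since the manual does not say how skf tokenizes them.

```python
import shlex

def collect_options(argv, environ, running_as_root):
    """Return options in the documented order: SKFENV, skfenv, command line."""
    options = []
    if not running_as_root:
        for var in ("SKFENV", "skfenv"):
            # Assumption: the variable holds whitespace-separated options.
            options.extend(shlex.split(environ.get(var, "")))
    options.extend(argv)
    return options

env = {"SKFENV": "--oc=euc-jp", "skfenv": "--lineend-lf"}
as_user = collect_options(["--ic=sjis", "in.txt"], env, running_as_root=False)
as_root = collect_options(["--ic=sjis", "in.txt"], env, running_as_root=True)
print(as_user)
print(as_root)
```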

OPTIONS

skf-1.9 is written from scratch, and inherits no code from nkf. However, skf is intended to be a drop-in replacement for nkf(v1.4) and has a subset of nkf options.
skf 1.9x recognizes the following options.
buffering control
-b
use buffered output. This is default.
-u
use unbuffered output.
Input/Output codeset options
--ic=
input_code_set
Specify input_code_set as the input codeset. Possible values are listed below.
--oc=
output_code_set
Specify output_code_set as the output codeset. Possible values are listed below.
Supported codeset
skf supports the following codesets. Codeset names are case insensitive. Note that iso-2022 escape-based input codesets (registered with IANA) are recognized automatically, and for this reason some codesets are treated as the same when specified as input. An o in the in column means the named codeset can be specified as input and an x means it cannot. The out column has the same meaning for output.
in  out  name               description
o   o    iso8859-1          ascii + iso-8859-1
o   o    koi-8r             koi-8r (Russian)
o   o    cp1251             Cyrillic latin MS cp1251
o   o    jis                iso-2022-jp (rfc1468 7bit JIS)
o   o    jis-x0213          iso-2022-jp-3 (JIS X-0213(2000))
o   o    jis-x0213-strict   iso-2022-jp-3-strict
o   o    jis-x0213-2004     iso-2022-jp-2004 (JIS X-0213(2004))
o   o    oldjis             iso-2022-jp-1978 (JIS X-0208(1978))
o   o    euc-jp             EUC-encoded JIS X-0208(1997)
o   o    euc-x0213          EUC-encoded JIS X-0213(2000)
o   o    euc-jis-2004       EUC-encoded JIS X-0213(2004)
o   o    euc-kr             EUC-encoded KS X-1001 Korean
o   o    euc7-kr            7bit EUC-encoded KS X-1001 Korean
o   o    johab              KS X-1001-johab Korean
o   o    euc-cn             EUC-encoded GB2312 Chinese
o   o    euc7-cn            7bit EUC-encoded GB2312 Chinese
o   o    hz                 HZ-encoded GB2312 Chinese
o   o    euc-tw             EUC-encoded CNS 11643 Chinese
o   o    gb12345            EUC-encoded GB12345 Chinese
o   o    gbk                GB2312 Extension (cp936)
o   o    big5               BIG5 (with Eten extension with EURO)
o   o    big5-cp950         BIG5 (Microsoft cp950 with EURO)
o   o    sjis               Shift-jis (Microsoft cp943)
o   o    sjis-x0213         Shift-jis-encoded JIS X-0213(2000)
o   o    sjis-x0213-2004    Shift-jis-encoded JIS X-0213(2004)
o   x    sjis-cellular      Shift-jis-encoded JIS X-0208 (with NTT Docomo, Vodafone phone glyph)
o   o    cp932              Shift-jis-encoded MS cp932
o   o    cp943              Shift-jis-encoded IBM cp943
o   o    viscii             VISCII (rfc1456) Vietnamese
o   o    viqr               VISCII (rfc1456-VIQR) Vietnamese
o   o    keis               Hitachi KEIS83/90
o   x    jef                Fujitsu JEF (basic support only)
o   o    ucs2               Unicode(TM) UCS-2/UTF-16LE
o   o    utf7               Unicode(TM) UTF-7
o   o    utf8               Unicode(TM) UTF-8
Codeset explanations
iso-8859-1
a.k.a. latin1. When specified as output, G0 = GL is ascii and G1 = GR is iso-8859-1.
iso-2022-jp, jis
Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS x0212 Supplementary Kanji.
jis-x0213
Encoding is iso-2022-jp-3. For output, G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS x0213 plane 2 Kanji.
jis-x0213-strict
Encoding is a subset of iso-2022-jp-3-strict. For output, G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is not set. Characters are output as JIS x0208 whenever possible. JIS X-0213 input is automatically recognized.
jis-x0213-2004
Encoding is iso-2022-jp-2004. For output, G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS x0213 plane 2 Kanji.
oldjis
Encoding is iso-2022-jp (JIS X-0208(1978)). G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS x0212 Supplementary Kanji.
euc-jp, euc
Encoding is 8-bit EUC using JIS X0208(1997) character set. G0 = GL is ascii, G1 = GR is JIS x0208, G2 is JIS x0201 kana and G3 is JIS x0212 Supplementary Kanji.
euc-x0213
Encoding is 8-bit EUC-based JIS X0213(2000). G0 = GL is ascii, G1 = GR is X0213 plane 1, G2 is iso-8859-1 and G3 is JIS x0213 plane2 Kanji.
euc-jis-2004
Encoding is 8-bit EUC-based JIS X0213(2004). G0 = GL is ascii, G1 = GR is X0213(2004) plane 1, G2 is iso-8859-1 and G3 is JIS x0213 plane2 Kanji.
euc-kr
Encoding is 8-bit EUC using the KS X-1001 Wansung character set. G0 = GL is KS X1003, G1 = GR is KS X1001, G2 and G3 are not set.
euc7-kr, iso-2022-kr
Encoding is iso-2022-kr (rfc1557), i.e. 7-bit EUC using the KS X-1001 Wansung character set. G0 = GL is KS X1003, G1 is KS X1001, G2 and G3 are not set.
euc-cn
Encoding is 8-bit EUC using the GB 2312 character set. G0 = GL is GB1988, G1 = GR is GB2312, G2 and G3 are not set.
euc7-cn
Encoding is 7-bit EUC using the GB 2312 character set. G0 = GL is GB1988, G1 is GB2312, G2 and G3 are not set.
hz
Encoding is the HZ-encoded (rfc1842) GB 2312 character set. G0 = GL is GB1988, G1 = GR is GB2312, G2 and G3 are not set.
euc-tw
Encoding is EUC-encoded CNS11643 plane 1/2, a subset of iso-2022-cn. G0 = GL is ascii, G1 = GR is CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is not set.
gb12345
Encoding is 8-bit EUC using the GB 12345 (GBF) character set. G0 = GL is GB1988, G1 = GR is GB12345, G2 and G3 are not set.
gbk
Encoding is GBK (a.k.a. cp936). G0 = GL is GB1988 and G1 = GR is GBK. G2 and G3 are not set.
big5
Encoding is Big5 with the ETen extension. Includes the Euro mapping. Uses ascii as the latin part.
big5-cp950
Encoding is the Big5 (cp950) character set. Uses ascii as the latin part.
VISCII (experimental)
Vietnamese VISCII (rfc1456). Not TCVN-5712.
VIQR (experimental)
Vietnamese VISCII with VIQR encoding (rfc1456).
sjis
Encoding is the Shift-encoded JIS X0208(1997) character set. As input this code is the same as cp932, but the gaiji area is not used for output. Uses JIS x-0201 latin as the latin part.
sjis-x0213
Encoding is Microsoft JIS using the JIS X0213(2000) character set.
sjis-x0213-2004
Encoding is Microsoft JIS using the JIS X0213(2004) character set. Ten newly defined characters are added, but the Unicode mapping is the same as for JIS X0213(2000). Uses JIS x-0201 latin as the latin part.
sjis-cellular (experimental)
Encoding is Shift-encoded JIS X0208(1997) character set with NTT Docomo/Vodafone cellular phone glyph mapping.
cp932
Encoding is Microsoft JIS with NEC gaiji area.
cp943
Encoding is IBM cp943 (OS/2 code).
johab
Encoding is KS X1001(Johab).
ucs2
Encoding is Unicode UTF-16 (v4.0). The default input/output byte order is little endian, and an input byte order mark is recognized. Output includes an endian mark by default unless --suppress-endian-mark is specified. Output covers the full Unicode range using surrogate pairs unless --limit-to-ucs2 is specified.
utf8
Encoding is UTF-8 encoded Unicode (v4.0). Output doesn't include a byte order mark unless --enable-endian-mark is specified.
utf7
Encoding is UTF-7 encoded Unicode (v4.0). Output range is limited to UTF-16.
keis
Encoding is Hitachi KEIS83/90.
jef (experimental)
Encoding is Fujitsu JEF. Only basic part is supported.
koi8r
Russian KOI-8R code.
cp1251
Eastern European cyrillic MS cp1251 code.
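The G0 to G3 assignments listed for euc-jp can be seen in the byte patterns of an EUC-JP stream. The sketch below uses Python's euc_jp and iso2022_jp codecs as a stand-in for skf. The sample characters are arbitrary choices for illustration.

```python
samples = {
    "G0 ascii": "A",
    "G1 JIS X0208": "漢",
    "G2 JIS X0201 kana": "ｶ",   # halfwidth katakana KA (U+FF76)
    "G3 JIS X0212": "丂",        # U+4E02, a JIS X 0212 supplementary kanji
}
encoded = {label: ch.encode("euc_jp") for label, ch in samples.items()}
for label, data in encoded.items():
    # G2 is reached with the single shift SS2 (0x8e), G3 with SS3 (0x8f).
    print(f"{label:18s} {data.hex(' ')}")

# The 7-bit JIS form of an X0208 character is the EUC form with the
# high bit of each byte cleared.
jis_pair = "漢".encode("iso2022_jp")[3:5]   # skip the 3-byte ESC $ B designation
euc_pair = "漢".encode("euc_jp")
print(jis_pair.hex(" "), "->", euc_pair.hex(" "))
```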
Shortcuts
-n -j
same as --oc=jis.
-s -x
same as --oc=sjis.
-a -e
same as --oc=euc-jp.
-q
same as --oc=ucs2.
-z
same as --oc=utf8.
-y
same as --oc=utf7.
-k
same as --oc=keis (experimental).
-A, -E
same as --ic=euc-jp. Assume input code set is EUC-JP.
-N
same as --ic=jis. Assume input code set is iso-2022-jp.
-S, -X
same as --ic=sjis. Assume input code set is Microsoft JIS.
-Q
same as --ic=ucs2.
-Y
same as --ic=utf7.
-Z
same as --ic=utf8.
-K
same as --ic=keis.
ISO-2022 Specific controls
These options replace the character sets assigned to G0-G3. The sets are first set up according to the specified input codeset, and then the character set given with the option is assigned.
--set-g0=`char_set'
Set the code set predefined to plane 0 (G0). Supported values of `char_set' are `ascii' (default), `x0201', `ksx1003' and `gb1988'. The set is automatically invoked to GL (iso-2022-jp-1/2/3 assumption). This option works only with iso-2022-based input. The following options override the codeset-specified settings regardless of option order.
--set-g1=`char_set'
Set the code set predefined to the right plane (G1). Supported values of `char_set' are `ascii', `x0201' (default), `iso8859-1', `iso8859-2', `iso8859-3', `iso8859-7', `iso8859-14', `iso8859-15', `koi8-r', `x0212', `ks_x_1001' and `gb_2312'. This option works with iso-2022-based input.
--set-g2=`char_set'
Set the code set predefined to the G2 plane. Supported values of `char_set' are `x0201' (default), `iso8859-1', `iso8859-2', `iso8859-3', `iso8859-7', `iso8859-14', `iso8859-15', `koi8-r', `x0212', `ks_x_1001' and `gb_2312'. This option works with iso-2022-based input.
--set-g3=`char_set'
Set the code set predefined to the G3 plane. Supported values of `char_set' are `x0201' (default), `iso8859-1', `iso8859-2', `iso8859-3', `iso8859-7', `iso8859-14', `iso8859-15', `koi8-r', `x0212', `ks_x_1001' and `gb_2312'. This option works with iso-2022-based input.
--euc-protect-g1
In EUC input mode, suppress sequences to set a charset to G1. Such sequences are discarded.
--add-annon
Add announcer for JIS X-0208(1990) to X-0208 designate sequence. This option works only with iso-2022-based output.
JIS X-0212(Supplement Kanji code) Support
--x0212-enable
By default skf does not output JIS X-0212 code. This option enables use of the JIS X-0212 part. It cannot be used when the output codeset is a Microsoft code or KEIS. For Unicode variant encodings, this option is on by default.
Unicode coding specific control options
--use-compat
When output is one of the transformation formats of the Unicode standard, enable characters in the compatibility plane (0xfxxx). By default skf does not use these characters.
--use-ms-compat
When output is Unicode, make the translation Microsoft Windows compatible. This only affects some symbols in JIS Kanji, and adding the --use-compat option is recommended.
--use-cde-compat
When output is Unicode, make the translation JIS X-0221 compatible. This codeset is the same as the CDE standard codeset.
--little-endian
When output is Unicode, use little endian byte-order. This is default.
--big-endian
When output is Unicode, use big endian byte-order.
--suppress-endian-mark
When output is UTF-16, do not write a byte order mark. To make UTF-8N, use this option with --little-endian. This is off by default.
--enable-endian-mark
When output is UTF-8, output byte order marking. This is off by default.
--input-little-endian
When input is Unicode, assume input is little endian byte-ordered. This is default, but skf respects byte-order mark.
--input-big-endian
When input is Unicode, assume input is big endian byte-ordered. Note that skf respects byte-order mark.
--endian-protect
Ignore any endian mark in the input stream. The mark is simply discarded.
--use-replace-char
By default skf converts undefined characters (except the 0x2xxx part) into the "geta (U+3013)" code. This option tells skf to use the replacement char (0xfffc in UCS2) instead.
--limit-to-ucs2
Do not use codes at or above 0x10000 in Unicode (i.e. limit output to the ucs2 area).
--suppress-cjk-extension
Treat the CJK extension A/B areas as undefined.
--old-hangle-location
Treat the U-3400 area as hangul (Unicode 1.0 compatibility).
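The byte order, endian mark and surrogate pair behaviour that these options control can be illustrated with Python's codecs. This is a sketch of the underlying Unicode mechanics, not of skf's implementation, and the helper fits_in_ucs2 is a hypothetical name for the check that --limit-to-ucs2 implies.

```python
import codecs

# 'A' plus U+20000, a CJK Extension B ideograph outside the 16-bit range.
text = "A\U00020000"

le = text.encode("utf-16-le")   # little endian, no byte order mark
be = text.encode("utf-16-be")   # big endian, no byte order mark
print(le.hex(" "))   # 41 00 40 d8 00 dc  (U+20000 becomes the pair D840 DC00)
print(be.hex(" "))   # 00 41 d8 40 dc 00

# A byte order mark is U+FEFF written in the stream's own byte order.
with_mark_le = codecs.BOM_UTF16_LE + le   # starts with ff fe
with_mark_be = codecs.BOM_UTF16_BE + be   # starts with fe ff

# A decoder that respects the mark picks the byte order from it.
decoded_le = with_mark_le.decode("utf-16")
decoded_be = with_mark_be.decode("utf-16")

def fits_in_ucs2(s: str) -> bool:
    """True if every code point is below 0x10000, so no surrogate pair is needed."""
    return all(ord(c) < 0x10000 for c in s)

print(fits_in_ucs2("A"), fits_in_ucs2(text))
```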
Codeset/Vendor Specific codeset handling flags
skf by default assumes machine specific parts of kanji code are Microsoft Windows compatible. Here are some options that control this behavior.
--disable-gaiji-support
Assume machine specific part is undefined.
--use-apple-gaiji
Assume machine specific part in input file is Macintosh (System 7,8,9 or OS X) compatible.
--dsbl-ibm-gaiji
Disable machine specific part in input file.
--disable-chart
Do not use Moji-keisen characters. This is for old Macintosh system (System 6.x or older) compatibility.
--disable-jis90
Disable 2 added characters of JIS X-0208(1990). If this option is specified, these two characters are replaced by Kanji variants. This option is off by default.
--input-detect-jis78
Distinguish JIS X-0208(1978) codeset and JIS X-0208(1983/90) codeset. This option is valid only when input encoding is JIS (ISO-2022).
Miscellaneous codeset related options
--old-nec-compat
Enable old NEC kanji sequence (ESC-K,H). Needs compile option -DOLD_NEC_COMPAT.
--no-utf7
Assume input code set is *NOT* UTF-7 encoded Unicode. This option disables input utf7 testing.
Output conversion options
skf has various features to fit the output file to the local environment, and many of these are controlled by the extended control switches described in this section.
--use-g0-ascii
Set G0 (= GL) for the output encoding to ASCII, ignoring the codeset designation.
X-0201 Kana conversions
By default skf converts X-0201 kana to X-0208 kana. To output X-0201 kana as it is, use one of the following options. When output is designated as EUC or SJIS, these options enable X-0201 kana output in the way each code set provides. When Unicode output is specified, output of the (equivalent) kana part is controlled by --use-compat, not by the following switches.
--kana-jis7
use SI/SO locking shift sequence to designate X-0201 kana.
--kana-jis8
output X-0201 kana using 8-bit code right plane.
--kana-esci --kana-call
use ESC-(-I to designate X-0201 kana.
--kana-enable
use X-0201 kana when EUC (with G2) or SJIS output code is used. When JIS output, it is same as --kana-call.
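To make the difference between X-0201 (halfwidth) and X-0208 (fullwidth) kana concrete, the sketch below uses Unicode NFKC normalization as a stand-in for the default conversion. This is an assumption for illustration only: NFKC also folds many other compatibility characters, and skf's own conversion table is not shown in this manual.

```python
import unicodedata

half = "ｶﾅ"   # halfwidth katakana, the JIS X 0201 kana
full = unicodedata.normalize("NFKC", half)   # fullwidth katakana "カナ"
print(half, "->", full)

# Halfwidth kana are single bytes in Shift_JIS and SS2-prefixed pairs
# (G2) in EUC-JP. Fullwidth kana are two-byte JIS X 0208 characters in both.
half_sjis = half.encode("shift_jis")
half_euc = half.encode("euc_jp")
full_sjis = full.encode("shift_jis")
full_euc = full.encode("euc_jp")
for label, data in [("half sjis", half_sjis), ("half euc", half_euc),
                    ("full sjis", full_sjis), ("full euc", full_euc)]:
    print(f"{label:10s} {data.hex(' ')}")
```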
URI/TeX conversion feature options
With Unicode(tm) family output codings, skf outputs the non-ascii latin character part as it is. With other output codings, skf converts these characters using the following rules:
(1) If the code is defined in the specified output codeset, it is output in this codeset.
(2) If one of the following html convert modes is enabled and the code is defined in the html/sgml codeset, it is converted to an entity reference or a codepoint reference.
(3) If tex convert mode is enabled and the code is defined in the tex codeset, it is converted to tex format.
(4) If the code is a kind of combined ligature, it is shown as a set of characters.
(5) Otherwise a kind of replacement character is shown, with a warning.
--convert-html --convert-sgml
Enable html convert mode. This mode is cleared by --reset. These two options are synonyms, and are treated as same option.
--convert-html-decimal
Enable html code-point decimal convert mode. This mode is cleared by --reset.
--convert-html-hexadecimal
Enable html code-point hexadecimal convert mode. This mode is cleared by --reset.
--convert-tex
Enable TeX convert mode. This mode is cleared by --reset.
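Rules (1) and (2) above, together with the three html modes, can be sketched in Python as follows. The function names and the mode strings "entity", "decimal" and "hex" are hypothetical, and the entity table is Python's html.entities, which may differ from the html/sgml codeset skf uses.

```python
from html.entities import codepoint2name

def to_reference(ch: str, mode: str) -> str:
    """Render one character as a named, hexadecimal or decimal reference."""
    cp = ord(ch)
    if mode == "entity" and cp in codepoint2name:
        return f"&{codepoint2name[cp]};"
    if mode == "hex":
        return f"&#x{cp:X};"
    return f"&#{cp};"   # decimal, also used when no named entity exists

def encode_with_references(text: str, codec: str, mode: str) -> str:
    """Keep characters the output codec can represent, reference the rest."""
    out = []
    for ch in text:
        try:
            ch.encode(codec)          # rule (1): defined in the output codeset
            out.append(ch)
        except UnicodeEncodeError:    # rule (2): fall back to a reference
            out.append(to_reference(ch, mode))
    return "".join(out)

print(encode_with_references("café 漢字", "shift_jis", "entity"))
print(encode_with_references("café", "ascii", "decimal"))
print(encode_with_references("café", "ascii", "hex"))
```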
Encoding control options
--decode=`encoding scheme'
Specify the encoding scheme of the input stream. Supported values of `encoding scheme' include `hex', `mime', `mime_q', `mime_b', `uri_encode', `puny' and `hex_perc_encode', which select, in that order, CAP hex-code, MIME, MIME Q-encoding, MIME B-encoding, URI character reference, ACE punycode and URI percent notation decoding. Base64 and rot13/47 decoding are also supported. When MIME decoding is specified, the base text is assumed to be EUC-encoded unless specified otherwise.
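For readers unfamiliar with these transfer encodings, the sketch below shows what several of them look like, using only Python's standard library. It demonstrates the encodings themselves and says nothing about how skf parses them.

```python
import base64
import codecs
from email.header import decode_header
from urllib.parse import quote, unquote

word = "日本語"

# MIME B-encoding: an RFC 2047 encoded-word built by hand, then decoded.
b_word = "=?UTF-8?B?" + base64.b64encode(word.encode("utf-8")).decode("ascii") + "?="
raw, charset = decode_header(b_word)[0]
mime_decoded = raw.decode(charset)
print(b_word, "->", mime_decoded)

# URI percent notation.
pct = quote(word)
pct_decoded = unquote(pct)
print(pct, "->", pct_decoded)

# Punycode, the ASCII-compatible encoding behind internationalized domain labels.
puny = "bücher".encode("punycode")
puny_decoded = puny.decode("punycode")
print(puny, "->", puny_decoded)

# rot13 is its own inverse.
rot = codecs.encode("Hello", "rot13")
print(rot, "->", codecs.encode(rot, "rot13"))
```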
End of line control options
--lineend-thru
Output end of line code as it is. Also output ^Z code as it is. This is default.
--lineend-cr --lineend-mac
Use CR as end of line code. Also delete ^Z code from input stream.
--lineend-lf --lineend-unix
Use LF as end of line code. Also delete ^Z code from input stream.
--lineend-crlf --lineend-windows
Use CRLF as end of line code. Also delete ^Z code from input stream.
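The effect of the --lineend-cr, --lineend-lf and --lineend-crlf options can be approximated by the sketch below. The function name is hypothetical, and working directly on bytes is an assumption that only holds for encodings where 0x0A, 0x0D and 0x1A never appear inside a multibyte character (it would not be safe for UTF-16, for example).

```python
def convert_line_ends(data: bytes, eol: bytes) -> bytes:
    """Rewrite CRLF, CR and LF line ends as eol and drop ^Z (0x1A) bytes."""
    data = data.replace(b"\x1a", b"")
    # Normalize to LF first so a CRLF pair is not counted as two line ends.
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return data.replace(b"\n", eol)

mixed = b"a\r\nb\rc\n\x1a"
print(convert_line_ends(mixed, b"\n"))     # like --lineend-lf
print(convert_line_ends(mixed, b"\r\n"))   # like --lineend-crlf
print(convert_line_ends(mixed, b"\r"))     # like --lineend-cr
```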
File control options
--filewise-detect --force-reset
Reset and re-detect input code set at the start of each file.
--linewise-detect
Reset and re-detect input code set at the start of each line. This option needs -DKUNIMOTO at compile time.
Misc. Control options
--suppress-space-convert
By default skf converts an ideographic space into two ascii spaces. This option suppresses this behavior.
--reset
Reset all flags specified by extended controls and given input code.
--inquiry
skf detects the input code and writes the detection result to stdout. No filtered output is produced.
--show-filename
When inquiry (--inquiry) is on, this option adds each file name to the output. Enabled by default when multiple input files are specified.
--invis-strip
Delete all escape sequences not belonging to ISO-2022 code extension. This is intended to replace invisstrip command bundled in inews package.
--html-sanitize
Convert several characters in an HTML document to entity reference expressions. Specifically, the characters "!#$&%()/<>:;? are escaped as entity expressions.
-I
Warn if input has unassigned code points.
-v
print version and exit.
-h
print brief help.
--show-supported-codeset
Display supported codeset and exit.
--show-supported-charset
Display supported character set and exit.
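The character set named under --html-sanitize can be escaped along the following lines. The manual does not say which reference form skf emits, so the decimal numeric references used here are an assumption, and the function name is hypothetical.

```python
# The characters listed for --html-sanitize, including the double quote.
SANITIZE_CHARS = '"!#$&%()/<>:;?'

def html_sanitize(text: str) -> str:
    """Replace each listed character with a decimal character reference."""
    # A single pass over the input, so the '&', '#' and ';' that the
    # references themselves contain are not escaped again.
    return "".join(f"&#{ord(c)};" if c in SANITIZE_CHARS else c for c in text)

print(html_sanitize("<a>"))
print(html_sanitize("x y"))
```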

FILES

/usr/(local/)share/skf/lib/ (Unices)
/Program Files/skf/share/lib (MS Windows)
These directories are where external codeset conversion tables go. The location that the current skf assumes is shown by the -h option.

AUTHOR

skf is written by Seiji Kaneko (skaneko@a2.mbn.or.jp), based on an idea from nkf written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The X-0213 code table is derived from the work of earthian@tama.or.jp.

ACKNOWLEDGEMENT

skf is inspired by works or requests by
shinoda@cs.titech, kato@cs.titech, uematsu@cs.titech, void@global ohta@ricoh, Hinata(HKE) Ashizawa(CRL) Kunimoto(SDL)

BUGS AND LIMITATIONS

1. skf can handle mixed coding with some limitations. However, code detection easily fails for mixed code, and giving an explicit input code set is strongly encouraged.
In case of emergency, the --linewise-detect option may help.
2. When the input is UCS2, UTF-16, UTF-8 or UTF-7, skf tries to detect the input code, but giving an explicit code set is encouraged. skf doesn't support UCS4, but does support UTF-16 surrogate pairs. skf just passes composite characters to the output. No further processing is performed.
3. skf implements ISO-2022 with the following exceptions:
(1) GL 0x20 is always space.
(2) Sequences for setting codes to C1 and C2 are always ignored.
(3) If an unknown sequence is given to G0, G0 is set to ascii, and locking/single shift is cleared.
(4) Sequences for 96-character multibyte coding are ignored.
(5) Sequences for standard return, or for calling a coding system with or without standard return, may generate unpredictable results.
4. Since skf by default tests the input stream to detect utf7 coding, skf sometimes misdetects pure ascii text as utf7. If this occurs, use the --no-utf7 option.
5. The coding of error output is controlled by the LOCALE environment variables on UN*X systems. skf does not check whether stdout and stderr are redirected into the same stream, so that case must be handled by the user.
6. skf-1.91 converts KEIS/JIS X-0213 code using the CJK extension B and CJK compatibility areas. For this reason, X-0213 and KEIS conversion results vary depending on the --use-compat and --limit-to-ucs2 switches.
7. The current external table format supports only UCS2 characters.
8. JIS X-0207(1979) is not supported. JIS X-0211(1987) is designed to be supported (i.e. common terminal control sequences are passed transparently to the output).
9. Even if the unbuffered option (-u) is specified, some buffering related to code translation is still performed (in MIME, kana, VIQR etc.).
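Why detection can fail is easy to show with a naive detector that tries each candidate codec in turn. This is a hypothetical sketch, not skf's detection algorithm. The two sample bytes below are valid both as Shift_JIS halfwidth kana and as one EUC-JP kanji, which matches the restriction that automatic recognition assumes input without X0201 kana.

```python
CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis", "utf_8")

def candidate_codecs(data: bytes) -> list:
    """Return every candidate codec that decodes data without error."""
    ok = []
    for codec in CANDIDATES:
        try:
            data.decode(codec)
        except UnicodeDecodeError:
            continue
        ok.append(codec)
    return ok

ambiguous = "ｶﾅ".encode("shift_jis")   # two halfwidth kana, bytes b6 c5
print(candidate_codecs(ambiguous))      # more than one codec accepts these bytes
print(repr(ambiguous.decode("shift_jis")), repr(ambiguous.decode("euc_jp")))

# A stream with ISO-2022 escape sequences is the easy case, because the
# escapes identify the character sets explicitly.
escaped = "漢字".encode("iso2022_jp")
print("iso2022_jp" in candidate_codecs(escaped))
```

When more than one candidate survives, only an explicit --ic= setting removes the ambiguity.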

Note

1. Extended options have changed extensively since skf-1.3. Some archaic options (e.g. -B, -@ and -r) have been deleted from this version.
2. From version 1.9, the default code set assumed by skf has changed to JIS X-0208(1990) with Microsoft Japanese Windows gaiji (i.e. CP932).
3. From version 1.9, skf supports iso8859 and other charsets by using Unicode as its internal code set. For this reason, skf-1.9 behaves differently from earlier versions.
4. Code autodetection is not perfect by design. If skf fails to detect the input code properly, please give the input code explicitly.
5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are converted using JIS X-0124 and other conventions. Byte length is not preserved by this conversion.
6. skf is intended to pass ANSI-compatible terminal control codes transparently, but this is not guaranteed.
7. nkf's -i and -o options still work, but they are valid only for iso-2022-jp and are independent of the codeset specifications. Using these options is strongly discouraged.
8. There are some undocumented options. These options should be considered highly experimental.

Notice

Unicode(TM) is a trademark of Unicode, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. Macintosh is a registered trademark of Apple Computer Inc. Vodafone is a trademark of Vodafone K.K. Other names and terms may be trademarks or registered trademarks of their respective owners. The trademark symbol (TM) is omitted in this manual page.
