File info
- file name: skf_1.94_man.txt
- last update: 2006-07-30 23:27
- type: Plain Text
- editor: Seiji Kaneko
- description: skf 1.94 English man page
- language: English
SKF(1) SKF(1)
NAME
skf - simple Kanji Filter (v1.94)
SYNOPSIS
skf [-AEIJKNQRSXZabdehjknqrsuvxz] [ long_format_options ] [infiles..]
DESCRIPTION
skf is yet another i18n-capable kanji filter, designed for reading
various CJK-coded files on the Net. It converts input kanji texts or
streams into a character stream in the designated codeset and writes
the result to standard output. Specifically, skf is designed to be a
versatile filter for reading documents in various code sets, and does
not have fancy features that are not directly related to code conversion.
Like nkf, skf automatically recognizes the input file code when it is a
kind of ISO-2022 compliant code, and also detects EUC-variant codes if
the input file is Japanese text without X0201 kanas. skf 1.9x can read
various iso-2022 compliant charsets, including JIS Kanji code (X0208,
X0212 and X0213), EUC encoding (euc-jp (with x-0213 support), euc-cn,
euc-kr and euc-tw), ISO European latins (ISO-8859-1 to 11,
13/14/15/16), BS 4730, NF Z 62-010 and X0201 kana with ESC-(-I, SS0,
Locking shift. skf also supports some non-iso2022 compliant sets,
including Microsoft Shift-JIS code, KOI-8-R/U, GB2312 (HZ), big5,
VISCII (rfc1456, including VIQR), the Unicode standard (UCS2/UTF-16,
UTF7 and UTF8), some MS codesets (cp1250 etc.) and some other
vendor-specific codes (KEIS83, JEF etc).
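The automatic recognition described above can be illustrated by trial
decoding. The following is a minimal sketch, not skf's actual detection
algorithm: the helper name guess_codeset and the candidate order are
assumptions made for this example, and only Python's standard codecs are
used. It tries the escape-based ISO-2022-JP decoder first, then EUC-JP,
then Shift_JIS.

```python
# Hypothetical helper for illustration only; this is NOT how skf detects codes.
# It tries strict decoders in a fixed order and reports the first that succeeds.
CANDIDATES = ("iso2022_jp", "euc_jp", "shift_jis")


def guess_codeset(data: bytes):
    """Return the first candidate codec that decodes `data` cleanly, or None."""
    for name in CANDIDATES:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None


if __name__ == "__main__":
    sample = "日本語のテキスト"
    for codec in CANDIDATES:
        print(codec, "->", guess_codeset(sample.encode(codec)))
```

Trial decoding of this kind becomes ambiguous once half-width (X0201)
kana appear, because their Shift_JIS byte values overlap the EUC range.
That is consistent with the restriction stated above, and it is a reason
to pass the input codeset explicitly when it is known.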
Supported output codesets include X-0208/X-0212/X-0213 JIS, X-0201 JIS,
ASCII, Microsoft Shift-JIS, EUC-jp/-kr/-cn, HZ, iso-2022-jp/kr, big5,
VISCII and Unicode.
skf also provides some basic decoding features for some common encodings
including MIME, Punycode and URI codepoint.
Unlike nkf, skf is designed to convert input code into some kind of
human-readable form under a local environment (i.e. codeset), and has
several extra conversion features like GNU recode. Such conversions
include Windows/Macintosh specific code swap and old-new jis glyph
change, html-format/TeX format conversion and variant unifications.
If file name(s) are given, skf reads the files and writes the converted
stream to stdout. If no file names are given, input is taken from stdin
and output goes to stdout. OPTIONS are taken from the environment
variables SKFENV and skfenv and from the command line, in this order.
Environment variables are not used when skf is running as a privileged
user. skf does not use LOCALE-related environment variables for
conversion, but error message output is controlled by the given LOCALES.
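The order in which option sources are read can be sketched as follows.
This is an illustration only, not skf's code: the function name
collect_options is hypothetical, and splitting the environment values
into shell-style words is an assumption, since the format of SKFENV is
not described here.

```python
import shlex


def collect_options(environ, argv, privileged=False):
    """Gather option words from SKFENV, then skfenv, then the command line.

    Assumption: environment values hold shell-style option words.
    Environment variables are skipped entirely for a privileged user.
    """
    words = []
    if not privileged:
        for var in ("SKFENV", "skfenv"):
            words.extend(shlex.split(environ.get(var, "")))
    words.extend(argv)
    return words


if __name__ == "__main__":
    env = {"SKFENV": "--oc=euc-jp", "skfenv": "-u"}
    print(collect_options(env, ["--oc=utf8"]))
    print(collect_options(env, ["--oc=utf8"], privileged=True))
```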
OPTIONS
skf-1.9 is written from scratch, and inherits no code from nkf. However,
skf is intended to be a drop-in replacement for nkf (v1.4) and supports
a similar set of commonly used nkf options.
skf 1.9x recognizes the following options. All are off by default unless
explicitly specified.
buffering control
-b use buffered output. This is default.
-u use unbuffered output. Code detection feature is disabled when
this option is on.
Input/Output codeset options
--ic= input_code_set
Specify input_code_set as the input codeset. Possible candidates
are shown below.
--oc= output_code_set
Specify output_code_set as the output codeset. Possible candidates
are shown below. The default codeset in the distribution package is
euc-jp, but it depends on a compile option. Default codeset is shown by
Supported codeset
skf recognizes the following codesets as input/output codesets. These
codeset names are case insensitive, and minus ('-') and underscore
('_') are ignored. Note that an iso-2022 escape-based input codeset
(registered with IANA) is recognized automatically, even when a
non-iso2022 codeset is specified. In the table, o in the in column means
the named codeset can be specified as input, and x means it cannot. The
out column has the same meaning for output.
in  out  name                description
o   o    iso8859-1           ascii + iso-8859-1 (latin-1)
o   o    iso8859-2           ascii + iso-8859-2 (latin-2)
o   o    iso8859-3           ascii + iso-8859-3 (latin-3)
o   o    iso8859-4           ascii + iso-8859-4 (latin-4)
o   o    iso8859-5           ascii + iso-8859-5 (Cyrillic)
o   o    iso8859-6           ascii + iso-8859-6 (Arabic)
o   o    iso8859-7           ascii + iso-8859-7 (Greek)
o   o    iso8859-8           ascii + iso-8859-8 (Hebrew)
o   o    iso8859-9           ascii + iso-8859-9 (latin-5)
o   o    iso8859-10          ascii + iso-8859-10 (latin-6)
o   o    iso8859-11          ascii + iso-8859-11 (Thai)
o   o    iso8859-13          ascii + iso-8859-13 (Baltic Rim)
o   o    iso8859-14          ascii + iso-8859-14 (Celtic)
o   o    iso8859-15          ascii + iso-8859-15 (Latin-9)
o   o    iso8859-16          ascii + iso-8859-16
o   o    koi-8r              koi-8r (Russian)
o   o    cp1251              Cyrillic latin MS cp1251
o   o    jis                 iso-2022-jp (rfc1468 7bit JIS)
o   o    iso-2022-jp-x0213   iso-2022-jp-3 (JIS X-0213:2000).
                             a.k.a. jis-x0213
o   o    jis-x0213-strict    iso-2022-jp-3-strict
o   o    iso-2022-jp-2004    iso-2022-jp-2004(JIS X-0213:2004)
                             a.k.a. jis-x0213-2004
o   o    oldjis              iso-2022-jp-1978(JIS X-0208:1978)
o   o    cp50220             JIS-encoded Microsoft codepage 932.
o   o    euc-jp              EUC-encoded JIS X-0208:1997
o   o    euc-x0213           EUC-encoded JIS X-0213:2000
o   o    euc-jis-2004        EUC-encoded JIS X-0213:2004
o   o    cp51932             EUC-encoded Microsoft codepage 932
o   o    euc-kr              EUC-encoded KS X-1001 Korean
o   o    euc7-kr             7bit EUC-encoded KS X-1001 Korean
o   o    johab               KS X-1001-johab Korean
o   o    euc-cn              EUC-encoded GB2312 Chinese
o   o    euc7-cn             7bit EUC-encoded GB2312 Chinese
o   o    hz                  HZ-encoded GB2312 Chinese
o   o    euc-tw              EUC-encoded CNS 11643 Chinese
o   o    gb12345             EUC-encoded GB12345 Chinese
o   o    gbk                 GB2312 Extension(cp936) Chinese
o   o    gb18030             GB18030 Chinese
o   o    big5                BIG5 (with Eten extension + EURO)
o   o    cp950               BIG5 (Microsoft cp950 + EURO)
o   o    big5p               BIG5 plus (with HKSCS)
o   o    sjis                Shift-jis (Microsoft cp943)
o   o    shift_jis-x0213     Shift-jis-encoded JIS X-0213:2000
o   o    shift_jis-2004      Shift-jis-encoded JIS X-0213:2004
o   x    sjis-cellular       Shift-jis-encoded JIS X-0208
                             with NTT Docomo, Vodafone phone glyph
o   o    cp932               Shift-jis-encoded MS cp932
o   o    cp50220             Jis-encoded MS cp50220
o   o    cp51932             EUC-jp-encoded MS cp51932
o   o    oldsjis             Shift-jis (JIS X-0208:1978)
o   o    viscii              VISCII (rfc1456) Vietnamese
o   o    viqr                VISCII (rfc1456-VIQR) Vietnamese
o   o    keis                Hitachi KEIS83/90
o   x    jef                 Fujitsu JEF (basic support only)
o   x    ibm930              IBM EBCDIC DBCS Japanese
o   x    ibm931              IBM EBCDIC DBCS Japanese w.latin
o   x    ibm933              IBM EBCDIC DBCS Korean
o   x    ibm935              IBM EBCDIC DBCS Simpl. Chinese
o   x    ibm937              IBM EBCDIC DBCS Trad. Chinese
o   o    ucs2                Unicode(TM) UCS-2/UTF-16LE
o   o    utf7                Unicode(TM) UTF-7
o   o    utf8                Unicode(TM) UTF-8
o   x    transparent         Transparent mode (see below)
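The name matching rule stated before the table (case insensitive, with
'-' and '_' ignored) can be sketched in a few lines. The function name
normalize_codeset_name is hypothetical and this is not skf's own lookup
code; it only shows which spellings compare equal under that rule.

```python
def normalize_codeset_name(name: str) -> str:
    """Fold case and drop '-' and '_' so equivalent spellings compare equal."""
    return name.lower().replace("-", "").replace("_", "")


if __name__ == "__main__":
    for spelling in ("EUC-JP", "euc_jp", "eucjp", "Shift_JIS-2004"):
        print(spelling, "->", normalize_codeset_name(spelling))
```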
Codeset explanations
iso-8859-*
a.k.a. latin*. When specified as output, G0 = GL is ascii and G1
= GR is iso-8859-*. 8bit encoding is used.
iso-2022-jp, jis
Encoding is iso-2022-jp-2 (RFC1554). G0 = GL is JIS x0201 roman,
G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS x0212
Supplementary Kanji.
jis-x0213, iso-2022-jp-3, iso-2022-jp-2003
Encoding is iso-2022-jp-3. For output, G0 = GL is JIS x0201 roman,
G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS
x0213 plane2 Kanji.
jis-x0213-strict
Encoding is subset of iso-2022-jp-3-strict (uses Plane 1 only).
For output, G0 = GL is JIS x0201 roman, G1 = GR is JIS x0201
kana, G2 is iso-8859-1 and G3 is not set. Output code as JIS
x0208 whenever possible. JIS X-0213 input is automatically rec-
ognized.
jis-x0213-2004, iso-2022-jp-2004
Encoding is iso-2022-jp-2003:2004. For output, G0 = GL is JIS
x0201 roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3
is JIS x0213 plane2 Kanji.
oldjis
Encoding is iso-2022-jp (JIS X-0208:1978). G0 = GL is JIS x0201
roman, G1 = GR is JIS x0201 kana, G2 is iso-8859-1 and G3 is JIS
x0212 Supplementary Kanji.
euc-jp, euc
Encoding is 8-bit EUC using JIS X-0208:1997 character set. G0 =
GL is ascii, G1 = GR is JIS x0208, G2 is JIS x0201 kana and G3
is JIS x0212 Supplementary Kanji.
euc-x0213, euc-jis-2003
Encoding is 8-bit EUC-based JIS X0213:2000. G0 = GL is ascii,
G1 = GR is X0213 plane 1, G2 is iso-8859-1 and G3 is JIS x0213
plane2 Kanji.
euc-jis-2004
Encoding is 8-bit EUC-based JIS X0213:2004. G0 = GL is ascii,
G1 = GR is X0213:2004 plane 1, G2 is iso-8859-1 and G3 is JIS
x0213 plane2 Kanji.
euc-kr
Encoding is 8-bit EUC using KS X-1001 Wansung character set. G0
= GL is KS X1003, G1 = GR is KS X1001, G2 and G3 are not set.
euc7-kr, iso-2022-kr
Encoding is iso-2022-kr (rfc1557): 7-bit EUC using KS X-1001
Wansung character set. G0 = GL is KS X1003, G1 is KS X1001, G2
and G3 are not set.
euc-cn
Encoding is 8-bit EUC using GB 2312 simplified Chinese character
set. G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3 are not set.
euc7-cn
Encoding is 7-bit EUC using GB 2312 simplified Chinese character
set. G0 = GL is ASCII, G1 is GB2312, G2 and G3 are not set.
hz
Encoding is HZ encoded (rfc1842) GB 2312 simplified Chinese
character set. G0 = GL is ASCII, G1 = GR is GB2312, G2 and G3
are not set.
euc-tw
Encoding is EUC encoded CNS11643 Plane1/2 traditional Chinese
character set. Subset of iso-2022-cn. G0 = GL is ASCII, G1 = GR
is CNS11643 plane 1, G2 is CNS11643 plane 2 and G3 is not set.
gb12345
Encoding is 8-bit EUC using GB 12345 (GBF) traditional Chinese
character set. G0 = GL is ASCII, G1 = GR is GB12345, G2 and G3
are not set.
gbk, cp936
Encoding is GBK simplified Chinese character set. G0 = GL is
ASCII and G1 = GR is GBK. G2 and G3 are not set.
gb18030
Encoding is GB18030 (ibm-1392, Windows cp54936) Chinese character
set. G0 = GL is ASCII and G1 = GR is GB18030. G2 and G3 are
not set.
big5
Encoding is Big5 traditional Chinese character set with ETen
extension. Includes the Euro mapping. Uses ASCII as latin part.
big5-cp950
Encoding is cp950-Big5 traditional Chinese character set. Uses
ASCII as latin part.
big5p
Encoding is cp950-Big5 traditional Chinese character set with
HKSCS extension. Uses ASCII as latin part.
VISCII (experimental)
Vietnamese VISCII (rfc1456) character set. Not TCVN-5712.
VIQR (experimental)
Vietnamese VISCII character set with VIQR encoding (rfc1456).
sjis
Encoding is Shift-encoded JIS X-0208:1997 character set. Note
that this is not cp932. Uses JIS x-0201 latin as latin(GL) part.
sjis-x0213, shift_jis-2003
Encoding is Microsoft JIS using JIS X0213:2000 character set.
sjis-x0213-2004, shift_jis-2004
Encoding is Microsoft JIS using JIS X0213:2004 character set.
10 newly defined character added, but Unicode mapping is same as
JIS X0213:2000. Uses JIS x-0201 latin as latin(GL) part.
sjis-cellular (experimental)
Encoding is Shift-encoded JIS X-0208:1997 character set with NTT
Docomo/Vodafone cellular phone glyph mapping.
cp932
Encoding is Microsoft SJIS cp932 with NEC/IBM gaiji area, based
on Windows XP mapping. Uses JIS x-0201 latin as latin(GL) part.
cp51932
Encoding is Microsoft EUC-based cp51932 with NEC/IBM gaiji area,
based on Windows XP mapping. Uses JIS x-0201 latin as EUC G2
part.
cp50220
Encoding is Microsoft JIS-based cp50220 with NEC/IBM gaiji area,
based on Windows XP mapping. For input, skf accepts cp50220,
50221 and 50222. Note that this codeset is NOT compatible with
iso-2022.
oldsjis
Encoding is Microsoft SJIS (JIS X-0208:1978 a.k.a. old JIS).
Uses JIS x-0201 latin as latin(GL) part.
johab
Encoding is KS X1001(Johab) character set. Uses KS X1003 latin
as latin(GL) part.
uhc
Encoding is UHC (cp949) character set. Uses KS X1003 latin as
latin(GL) part.
ucs2, utf16
Encoding is Unicode UTF-16 (v4.1). The default input/output byte
order is little endian, and an input byte order mark is recognized.
Output includes an endian mark by default unless
--disable-endian-mark is specified. Output range is within UTF-32
with surrogate pairs unless --limit-to-ucs2 is specified. Note that
ucs2 is not supported within the perl/ruby extension for either
input or output, because of a data structure limitation. Specifying
ucs2 there generates an error.
utf8
Encoding is UTF-8 encoded Unicode (v4.1). Output doesn't include
byte order mark unless --enable-endian-mark is specified. Out-
put range is within UTF-32 unless --limit-to-ucs2 is specified.
utf7
Encoding is UTF-7 encoded Unicode (v4.1). Output range is lim-
ited to UTF-16, and value above U+10000 is regarded as unde-
fined.
keis (experimental)
Encoding is Hitachi KEIS83/90. Output range is limited to EBCDIK
and JIS X-0208 area.
jef (experimental)
Encoding is Fujitsu JEF. Input only. Only basic part is sup-
ported.
ibm930 (experimental)
Encoding is IBM DBCS Japanese with EBCDIC Kana
ibm931 (experimental)
Encoding is IBM DBCS Japanese with EBCDIC latin (ibm037)
ibm933 (experimental)
Encoding is IBM DBCS Korean with EBCDIC Wansung character set
ibm935 (experimental)
Encoding is IBM DBCS Simplified Chinese with EBCDIC Chinese
ibm937 (experimental)
Encoding is IBM DBCS Traditional Chinese with EBCDIC Chinese
koi8r
Russian KOI-8R code.
cp1250
Central European latin Microsoft cp1250 code
cp1251
Eastern European cyrillic Microsoft cp1251 code.
transparent
Transparent mode. Various code control features, including folding
and line end code conversion, are ignored.
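The byte order and endian mark behavior described in the ucs2/utf16
entry can be made concrete with Python's standard codecs. This is a
sketch of the byte layout only, not skf's output routine: the function
encode_utf16 and its parameters are names invented for this example.
They mirror the stated defaults of little endian output with an endian
mark.

```python
import codecs


def encode_utf16(text, big_endian=False, endian_mark=True):
    """Encode text as UTF-16, optionally prefixed with a byte order mark."""
    codec = "utf-16-be" if big_endian else "utf-16-le"
    bom = codecs.BOM_UTF16_BE if big_endian else codecs.BOM_UTF16_LE
    body = text.encode(codec)
    return bom + body if endian_mark else body


if __name__ == "__main__":
    print(encode_utf16("A").hex())                      # default: LE with mark
    print(encode_utf16("A", big_endian=True).hex())     # BE with mark
    # A code point above U+FFFF is written as a surrogate pair (4 bytes).
    print(encode_utf16("\U00020000", endian_mark=False).hex())
```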
Shortcuts
-n -j same as --oc=jis
-s -x same as --oc=sjis
-a -e same as --oc=euc-jp
-q same as --oc=ucs2
-z same as --oc=utf8
-y same as --oc=utf7
-k same as --oc=keis
-A, -E same as --ic=euc-jp. Assume input code set is EUC-JP.
-N same as --ic=jis. Assume input code set is iso-2022-jp.
-S, -X same as --ic=sjis. Assume input code set is Microsoft JIS.
-Q same as --ic=ucs2.
-Y same as --ic=utf7.
-Z same as --ic=utf8.
-K same as --ic=keis.
ISO-2022 Specific controls
With these options, the character sets assigned to G0-G3 by the
specified input codeset are replaced by the given character sets.
--set-g0=`charset name'
Predefine specified code set to plane 0 (G0). Also set to GL at
initial state.
--set-g1=`charset name'
Predefine specified code set to right plane (G1). Also set to GR
at initial state.
--set-g2=`charset name'
Predefine specified code set to right plane (G2).
--set-g3=`charset name'
Predefine specified code set to right plane (G3).
The supported `charset name' values are as follows. 'o' means the
codeset can be specified for that plane; 'x' means it cannot.
g0  g1  g2  g3  codeset name  description
o   o   o   o   ascii         ANSI X3.4 ASCII
o   o   o   o   x0201         JIS X 0201 (latin part)
x   o   o   o   iso8859-1     ISO 8859-1 latin
x   o   o   o   iso8859-2     ISO 8859-2 latin
x   o   o   o   iso8859-3     ISO 8859-3 latin
x   o   o   o   iso8859-4     ISO 8859-4 latin
x   o   o   o   iso8859-5     ISO 8859-5 Cyrillic
x   o   o   o   iso8859-6     ISO 8859-6 Arabic
x   o   o   o   iso8859-7     ISO 8859-7 Greek-latin
x   o   o   o   iso8859-8     ISO 8859-8 Hebrew
x   o   o   o   iso8859-9     ISO 8859-9 latin
x   o   o   o   iso8859-10    ISO 8859-10 latin
x   o   o   o   iso8859-11    ISO 8859-11 Thai
x   o   o   o   iso8859-13    ISO 8859-13 latin
x   o   o   o   iso8859-14    ISO 8859-14 latin
x   o   o   o   iso8859-15    ISO 8859-15 latin
x   o   o   o   iso8859-16    ISO 8859-16 latin
x   o   o   o   tcvn5712      TCVN 5712 (Vietnamese)
x   o   o   o   ecma113       ECMA 113 Cyrillic
o   o   o   o   x0212         JIS X-0212:1990
o   o   o   o   x0208         JIS X-0208:1997
o   o   o   o   x0213         JIS X-0213 Plane 1:2000
o   o   o   o   x0213-2       JIS X-0213 Plane 2:2000
o   o   o   o   x0213n        JIS X-0213 Plane 1:2004
o   o   o   o   gb2312        Simplified Chinese GB2312
o   o   o   o   gb1988        Chinese GB1988(latin)
o   o   o   o   gb12345       Traditional Chinese GB12345
o   o   o   o   ksx1003       Korean KS X 1003(latin)
o   o   o   o   ksx1001       Korean KS X 1001
x   o   o   o   koi8-r        Cyrillic KOI-8R
x   o   o   o   koi8-u        Ukrainian Cyrillic KOI-8U
o   o   o   o   cns11643      Traditional Chinese CNS11643-1
x   o   o   o   viscii-r      RFC1456 VISCII (right plane)
o   o   o   o   viscii-l      RFC1456 VISCII (left plane)
o   o   o   o   vni           Vietnamese VNI
x   o   o   o   cp437         Microsoft cp437 (US latin)
x   o   o   o   cp737         Microsoft cp737
x   o   o   o   cp775         Microsoft cp775
x   o   o   o   cp850         Microsoft cp850
x   o   o   o   cp852         Microsoft cp852
x   o   o   o   cp855         Microsoft cp855
x   o   o   o   cp857         Microsoft cp857
x   o   o   o   cp860         Microsoft cp860
x   o   o   o   cp861         Microsoft cp861
x   o   o   o   cp862         Microsoft cp862
x   o   o   o   cp863         Microsoft cp863
x   o   o   o   cp864         Microsoft cp864
x   o   o   o   cp865         Microsoft cp865
x   o   o   o   cp866         Microsoft cp866
x   o   o   o   cp869         Microsoft cp869
x   o   o   o   cp874         Microsoft cp874
x   o   o   o   cp932         Microsoft cp932 (Japanese)
x   o   o   o   cp1250        Microsoft cp1250(Central Europe)
x   o   o   o   cp1251        Microsoft cp1251 (Cyrillic)
x   o   o   o   cp1252        Microsoft cp1252 (Latin-1)
x   o   o   o   cp1253        Microsoft cp1253 (Greek)
x   o   o   o   cp1254        Microsoft cp1254 (Turkish)
x   o   o   o   cp1255        Microsoft cp1255
x   o   o   o   cp1258        Microsoft cp1258
--euc-protect-g1
In EUC input mode, suppress sequences to set a charset to G1.
Such sequences are discarded.
--add-annon
Add announcer for JIS X-0208:1997 to X-0208 designate sequence.
This option works only with iso-2022-based output.
--disable-jis90
Disable 2 added characters of JIS X-0208:1997. If this option is
specified, these two characters are replaced by Kanji variants.
This option is off by default.
--input-detect-jis78
Distinguish JIS X-0208:1978 codeset and JIS X-0208:1997 codeset.
By default, these two charsets are both regarded as X-0208:1997. This
option is valid only when input encoding is JIS (ISO-2022).
JIS X-0212(Supplement Kanji code) Support
--x0212-enable
skf by default does not output JIS X-0212 code. This option
enables use of JIS X-0212 part. Output code set may be neither
Microsoft code nor KEIS. For Unicode variant encodings, this
option is on by default. This option is supported for backward
compatibility. May not be supported in future versions.
Unicode coding specific control options
--use-compat
When output is one of the transformation formats of the Unicode
standard, enable characters in the compatibility plane (0xfxxx). If
disabled, these characters are converted to variants or treated as
undefined.
--use-ms-compat
When output is Unicode, make the translation Microsoft Windows
compatible. This only affects some symbols in JIS-Kanji, and
adding the --use-compat option is recommended.
--use-cde-compat
When output is Unicode, make translation CDE standard codeset
compatible.
--little-endian
When output is Unicode, use little endian byte-order. This is
default.
--big-endian
When output is Unicode, use big endian byte-order.
--disable-endian-mark
When output is UTF-16, do not use byte order marking. To make
UTF-16N, use this option with --little-endian. This is off by
default.
--enable-endian-mark
When output is UTF-8, output byte order marking. This is off by
default.
--input-little-endian
When input is Unicode, assume input is little endian byte-
ordered. This is default, but skf respects byte-order mark.
--input-big-endian
When input is Unicode, assume input is big endian byte-ordered.
Note that skf respects byte-order mark.
--endian-protect
Do not use endian mark in the input stream. Endian mark is just
discarded. This is off by default.
--use-replace-char
skf by default converts undefined (except 0x2xxx part) charac-
ters into "geta (U+3013)" code in Japanese codeset. This option
specifies skf to use replacement char (U-fffc) instead.
--limit-to-ucs2
Do not use > 0x10000 area code in Unicode (i.e. limits code to
ucs2 area). This is off by default.
--disable-cjk-extension
Treat CJK extension A/B area as undefined. This is off (i.e.
these areas are enabled) by default.
--old-hangul-location
Treat input U-3400 area as hangul (Unicode 1.0 compatibility).
This is off by default.
Codeset/Vendor Specific codeset handling flags
skf by default assumes machine specific parts of kanji code are
Microsoft Windows compatible. Here are some options that control this
behavior. Options in this category, except --disable-chart, are valid
only when the output codeset is a Japanese codeset.
--use-apple-gaiji
Assume machine specific part in input file is Macintosh (System
7,8,9 or OS X) compatible.
--disable-ibm-gaiji
Disable machine specific part in input file.
--disable-chart
Do not use Moji-keisen characters. This is for old Macintosh
system (System 6.x or older) compatibility.
Miscellaneous codeset related options
--old-nec-compat
Enable old NEC kanji sequence (ESC-K,H). Needs compile option
--enable-oldnec at configuration.
--no-utf7
Assume input code set is *NOT* UTF-7 encoded Unicode. This
option disables input utf7 testing.
--no-kana
Assume input code set does *NOT* include JIS x0201 kana. Also
suppresses Unicode half width variants.
OUTPUT Conversions options
skf has various features to fit output files to local environment, and
many of these are controlled by extended control switch described in
this section.
--use-g0-ascii
set G0(=GL) for output encoding to ASCII, ignoring codeset des-
ignation.
X-0201 Kana/latin conversions
skf by default converts X-0201 kanas to X-0208 kanas. To output X-0201
kana as it is, use one of following options. When output is designated
to EUC or SJIS, these three options enable X-0201 kana output by ways
provided by each code set. When Unicode output is specified, (equiv.)
kana part output is controlled by --use-compat, not following switches.
Valid only when output codeset is non-Unicode Japanese codeset.
--kana-jis7
use SI/SO locking shift sequence to designate X-0201 kana. This
switch is valid for jis, jis-x0213 and cp50220 (i.e. cp50221)
encoding. For other codeset, this option is ignored.
--kana-jis8
output X-0201 kana using 8-bit code right plane. This switch is
valid for jis and jis-x0213 encoding. For other codeset, this
option is ignored.
--kana-esci --kana-call
use ESC-(-I to designate X-0201 kana. This switch is valid for
jis, jis-x0213 and cp50220 (i.e. cp50222) encoding. For other
codeset, this option is ignored.
--kana-enable
use X-0201 kana when EUC (with G2) or SJIS output code is used.
When JIS output, it is same as --kana-call.
URI/TeX conversion feature options
With Unicode(tm) family output codings, skf outputs the non-ascii latin
character part as it is, but with other output codings, skf converts
these characters using the following rules:
(1) If the code is defined in the specified output codeset, it is output
in this codeset.
(2) If one of the following html convert modes is enabled (i.e.
--convert-html --convert-sgml) and the code is defined in the html/sgml
codeset, it is converted to an entity reference or codepoint reference.
(3) If tex convert mode is enabled and the code is defined in the tex
codeset, it is converted to tex format.
(4) If the code is a kind of combined ligature, it is shown as a set of
characters.
(5) Otherwise, a kind of replacement character is shown, with a warning.
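The fallback order in rules (1), (2) and (5) can be sketched as below.
This is a simplified illustration, not skf's converter: the function
render_char is hypothetical, rules (3) and (4) are omitted, the entity
table is Python's html.entities rather than skf's own, and the '?'
returned at the end merely stands in for the replacement character.

```python
from html.entities import codepoint2name


def render_char(ch, codec="shift_jis", html=True):
    """Return bytes for one character, following a simplified fallback chain."""
    try:
        return ch.encode(codec)              # rule (1): defined in the codeset
    except UnicodeEncodeError:
        pass
    if html:                                 # rule (2): html convert mode
        name = codepoint2name.get(ord(ch))
        ref = f"&{name};" if name else f"&#{ord(ch)};"
        return ref.encode("ascii")
    return b"?"                              # stand-in for rule (5)


if __name__ == "__main__":
    for ch in ("あ", "é", "\u014f"):
        print(repr(ch), render_char(ch), render_char(ch, html=False))
```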
--convert-html --convert-sgml
Enable html convert mode. This mode is cleared by --reset. These
two options are synonyms, and are treated as same option.
--convert-html-decimal
Enable html code-point decimal convert mode. This mode is
cleared by --reset.
--convert-html-hexadecimal
Enable html code-point hexadecimal convert mode. This mode is
cleared by --reset.
--convert-tex
Enable TeX convert mode. This mode is cleared by --reset.
--use-iso8859-1
Enable iso-8859-1 output. Iso-8859-1 is invoked to G1 and set to
GR plane.
--use-iso8859-1-right
Enable 7-bit iso-8859-1 output. Iso-8859-1 is invoked to G1
plane.
Encoding control options
--decode=`encoding scheme'
Specify the encoding scheme of the input stream. Supported encoding
schemes are `hex', `mime', `mime_q', `mime_b', `uri_encode', `puny'
and `hex_perc_encode', which select CAP hex-code, mime, mime
Q-encoding, mime B-encoding, uri character reference, ACE punycode
and uri percent notation respectively. base64, Q-encoding, rfc2231
and rot13/47 are also listed as supported encodings. Only one decode
option is valid; if more than one is specified, the last one is
used. When mime decoding is specified, the base text is assumed to
be EUC encoded unless specified otherwise. Except for rot, which
assumes the input stream is Shift_JIS, EUC or iso-2022-jp, these
encodings assume the input stream is ascii (as defined in RFC2045).
Some encodings may co-exist with encoding, but this is not
guaranteed. In particular, if the input is UTF-16/UCS2 code, these
encodings are ignored by skf.
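For readers unfamiliar with the schemes named under --decode, the
following sketch decodes three of them (MIME encoded-words, URI percent
notation and punycode) with Python's standard library. It illustrates
the encodings themselves and is not skf's decoder; the helper names are
invented for this example.

```python
from email.header import decode_header
from urllib.parse import unquote


def decode_mime_word(word):
    """Decode a MIME encoded-word header value (B or Q encoding) to text."""
    parts = []
    for payload, charset in decode_header(word):
        if isinstance(payload, bytes):
            payload = payload.decode(charset or "ascii")
        parts.append(payload)
    return "".join(parts)


def decode_percent(text):
    """Decode URI percent notation, assuming UTF-8 octets."""
    return unquote(text)


def decode_punycode(label):
    """Decode a bare punycode label (without the 'xn--' ACE prefix)."""
    return label.encode("ascii").decode("punycode")


if __name__ == "__main__":
    print(decode_mime_word("=?UTF-8?B?5pel5pys6Kqe?="))
    print(decode_percent("%E6%97%A5%E6%9C%AC"))
    print(decode_punycode("日本語".encode("punycode").decode("ascii")))
```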
End of line control options
--lineend-thru
Output end of line code as it is. Also output ^Z code as it is.
This is default.
--lineend-cr --lineend-mac
Use CR as end of line code. Also delete ^Z code from input
stream.
--lineend-lf --lineend-unix
Use LF as end of line code. Also delete ^Z code from input
stream.
--lineend-crlf --lineend-windows
Use CR+LF as end of line code. Also delete ^Z code from input
stream. This option doesn't preserve original order of cr and
lf.
--input-cr
Assume input stream uses CR as end of line code.
--input-lf
Assume input stream uses LF as end of line code.
--input-crlf
Assume input stream uses CR+LF as end of line code.
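The effect of the --lineend-* options can be pictured with a small byte
filter. This is a sketch under stated assumptions, not skf's
implementation: the function name convert_lineend is hypothetical, any
of CR+LF, CR or LF is treated as one line end, and ^Z (0x1a) is removed
as the option descriptions state.

```python
import re

# CR+LF must be tried before the single-byte alternatives.
_LINE_END = re.compile(rb"\r\n|\r|\n")


def convert_lineend(data: bytes, eol: bytes = b"\n") -> bytes:
    """Rewrite every line end in `data` as `eol` and drop ^Z (0x1a) bytes."""
    data = data.replace(b"\x1a", b"")
    return _LINE_END.sub(lambda match: eol, data)


if __name__ == "__main__":
    mixed = b"unix\nmac\rwindows\r\nend\x1a"
    print(convert_lineend(mixed, b"\r\n"))
    print(convert_lineend(mixed, b"\n"))
```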
-F[line_length[-kinsoku]]
-f[line_length[-kinsoku]] -f[line_length[+kinsoku]]
Wrap input lines at line_length columns. The f option deletes
CR/LFs in the input, and the F option doesn't delete them. For Japanese
convention, both gyoutou-kinsoku (by burasage-gumi) and
gyoumatsu-kinsoku (by oidasi-gumi) are supported. The burasage
length is controlled by the kinsoku option. The default value for
line_length is 66, and it must be < 1000. The default value for kinsoku
is 5, and it must be <= 10. With the 'f' option, skf autodetects
paragraphs and retains some CR/LFs. The 2nd 'f' option format (with '+')
disables this behaviour. In nkf compatible mode, fold
behavior changes as follows.
(1) Default line_length is set to 60, and kinsoku value is 10.
(2) Alphanumeric characters become gyoutou-kinsoku characters.
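Folding with gyoutou-kinsoku by burasage (hanging) can be sketched as
follows. This is an illustration of the idea only and not skf's folding
code: the set NO_LINE_START is a small hypothetical sample of characters
that may not begin a line, wide characters are counted as two columns
using unicodedata, and existing line ends and gyoumatsu-kinsoku are not
handled.

```python
import unicodedata

# Illustrative subset of characters that must not start a line (assumption).
NO_LINE_START = set("、。，．）」』？！")


def char_width(ch):
    """Count East Asian wide/fullwidth characters as two columns."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1


def fold(text, line_length=66, kinsoku=5):
    """Wrap text at line_length columns, hanging up to `kinsoku` columns."""
    lines, current = [], []
    cols = hang = 0
    for ch in text:
        width = char_width(ch)
        if cols + width > line_length:
            if ch in NO_LINE_START and hang + width <= kinsoku:
                current.append(ch)      # hang it past the right margin
                hang += width
                cols = line_length      # the line is now full
                continue
            lines.append("".join(current))
            current, cols, hang = [], 0, 0
        current.append(ch)
        cols += width
    if current:
        lines.append("".join(current))
    return lines


if __name__ == "__main__":
    for line in fold("あいう。えお", line_length=6, kinsoku=2):
        print(line)
```

With kinsoku set to 0 the same input breaks before the full stop, which
shows what the hanging allowance is for.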
File control options
--filewise-detect --force-reset
Reset and re-detect input code set at the start of each file.
--linewise-detect
Reset and re-detect input code set at the start of each line.
This option needs -DKUNIMOTO at compile time.
Compatibility options
--nkf-compat
Interpret the following options in an nkf-compatible manner.
--skf-compat
Interpret the following options in the skf-native manner.
Misc. Control options
--disable-space-convert
skf by default, converts an ideographic space into two ascii
spaces. This option disables this behavior.
--html-sanitize
Convert several characters in HTML document to entity reference
expression. Specifically, "!#$&%()/<>:;?' is escaped by entity
expression.
--filewise-detect --force-reset
If multiple input files are given, detect input code for each
file.
--linewise-detect
Detect input code line-wise. Note this option weakens code
detect feature. Need compile option (at configure) --enable-
kunimoto.
--reset
Reset all flags specified by extended controls and given input
code.
--inquiry --guess
skf detects the code and writes the detection result to stdout. No
filtering output is performed. If multiple input files are given,
--show-filename is automatically enabled.
--hard-inquiry
Similar to --inquiry, but reports both the code and the line end
character.
--suppress-filename
When inquiry(--inquiry) is on, this option disables file name
output. This option overrides --show-filename.
--show-filename
When inquiry(--inquiry) is on, this option adds each file name
to output.
--invis-strip
Delete all escape sequences not belonging to ISO-2022 code
extension. This is intended to replace invisstrip command bun-
dled in inews package.
-I Warn if input has unassigned code points.
-v print version and exit.
-h --help
print brief help.
--show-supported-codeset
Display supported codeset (input) and exit.
--show-supported-charset
Display supported character set (output) and exit.
-%[debug_level]
Enable skf debugging. Debug level is one digit. 0 is the least
verbose, and with -%9 you'll get whole traces within skf. This
option needs compile option --enable-debug.
FILES
/usr/(local/)share/skf/lib/ (Unices)
/Program Files/skf/share/lib (MS Windows)
These directories are where external codeset conversion tables
go. The location that the current skf assumes is shown by the -h
option.
AUTHOR
skf is written by Seiji Kaneko (skaneko@a2.mbn.or.jp) based on an idea
from nkf written by Itaru Ichikawa (ichikawa@flab.fujitsu.co.jp). The
X-0213 code table is derived from the work of earthian@tama.or.jp. Some codeset
mapping is derived from various sources. Detailed origin is shown in
copyright document included in this distribution.
ACKNOWLEDGEMENT
skf is inspired by works or requests by shinoda@cs.titech,
kato@cs.titech, uematsu@cs.titech, void@global ohta@ricoh, Hinata(HKE)
Ashizawa(CRL) Kunimoto(SDL) Oohara(Univ of Kyoto), Jokagi(elf2000) and
naruse (at sourceforge.jp). Thanks.
BUGS AND LIMITATIONS
1. skf can handle mixed coding with some limitations. However, code
detection tends to fail for mixed code, and giving an explicit input
code set is strongly encouraged if the codeset is known beforehand.
If needed, the --linewise-detect option may help, but detection is more
likely to fail.
2. When using UCS2, UTF-16, UTF-8 and UTF-7, skf tries to detect the
input code, but giving an explicit code set is encouraged. skf doesn't
support UCS4, but does support UTF-16/UTF-32 (i.e. surrogate pairs).
skf just passes composite characters to output. No further
normalization is performed.
3. skf implements ISO-2022 with the following exceptions
i) GL 0x20 is always space, even when a 96-character codeset is invoked
to GL.
ii) Sequences for setting codes to C1 and C2 are always ignored.
iii) If an unknown sequence is given to G0, G0 is set to ascii, and
locking/single shift is cleared. An unknown sequence call to G1-G3 is
just ignored.
iv) Sequences for 96 character multibyte coding is ignored (Currently,
no codeset is registered).
v) Calling UTF-8, UTF-16 coding system from iso-2022 is supported, and
returns to previous coding system by standard return.
Callings and returns to/from other coding schemes are ignored.
vi) Because of cellular phone glyph support, several private (not
registered) codesets are defined in skf, and can be called by the
appropriate sequence.
4. Since skf by default tests input stream to detect utf7 coding, skf
sometimes misdetects pure ascii text as utf7. If this occurs, use
--no-utf7 option.
5. Error output coding is controlled by LOCALE environment variables on
UN*X systems. skf does not check whether stdout and stderr are
redirected into the same stream, so this case should be handled by the
user.
6. skf-1.9x converts KEIS/JIS X-0213 code using CJK-extension B and CJK
compatibility area. For this reason, X-0213 and KEIS convert result
varies depending on --use-compat and --limit-to-ucs2 switches.
7. JIS X-0207(1979) is not supported. JIS X-0211(1987) is designed to
be supported (i.e. common terminal control sequence will be transpar-
ently passed to output).
8. Even if unbuffer option(-u) is specified, some code-translation
related bufferings are still performed (in MIME, kana, VIQR etc.).
9. skf-1.9x recognizes and handles languages in iso639-1(alpha 2).
iso639-2 is not supported as a valid language set.
10. ucs2 is not supported within the perl/ruby extension for either
input or output, because of a data structure limitation. Specifying
ucs2 will generate an error. This is a limitation of the language
itself, rather than a limitation of skf.
Notes
1. Extended options are changed extensively since skf-1.9. Some archaic
options (eg. -B, -@ and -r) have been deleted from this version.
2. skf is a project derived from nkf, but doesn't contain nkf code.
The copyright notice is retained out of respect.
3. From version 1.9, default Japanese character set assumed by skf has
changed to JIS X-0208:1990 with Microsoft Japanese Windows gaiji (i.e.
CP932).
4. Code autodetection is not perfect by design. If it has failed to
detect input code properly, please give input code information explic-
itly.
5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are converted
using JIS X-0124 and other convention. During this conversion, its
byte length is not preserved.
6. skf is intended to pass ANSI compatible terminal control codes
transparently, but this is not guaranteed.
7. nkf's -i and -o options still work in 1.94, but are obsolete and
valid only for iso-2022-jp, without considering output codeset
specifications. Using these options is strongly discouraged.
8. For characters that cannot be converted, skf uses geta, or the
replacement character when the --use-replace-char option is given. If
the output codeset doesn't contain the geta code, skf prefers the
'black square character', and otherwise uses '.'.
9. There are some undocumented options. These options should be consid-
ered as highly experimental.
10. In lineend_thru mode with folding, skf remembers the order in which
cr and lf appear in the stream, and uses that order. Because of this
design, if skf needs to output a line-end character before any line-end
character has appeared in the input stream, the input order may not be
preserved.
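The substitute selection described in note 8 (geta first, then a black
square, then '.') can be sketched as a simple preference list. This is
an illustration, not skf's code: the function substitute_for is
hypothetical, U+25A0 is assumed to be the 'black square character', and
Python codec names stand in for skf output codesets.

```python
# Preference order taken from note 8; U+25A0 for the black square is an assumption.
FALLBACKS = ("\u3013", "\u25a0", ".")


def substitute_for(codec):
    """Return the first fallback character that `codec` can encode, or None."""
    for ch in FALLBACKS:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            continue
        return ch
    return None


if __name__ == "__main__":
    for codec in ("euc_jp", "cp437", "latin-1"):
        print(codec, "->", repr(substitute_for(codec)))
```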
Notice
Unicode(TM) is a trademark of Unicode, Inc. Microsoft and Windows are
registered trademarks of Microsoft corporation. Macintosh is a regis-
tered trademark of Apple Computer Inc. Vodafone is a trademark of Voda-
fone K.K. Other names and terms may be trademarks or registered trade-
marks of their respective owner. Trademark symbol (TM) may be omitted
in this manual page.
30/JUL/2006 SKF(1)