.de XA
.RS
.PP
\\$1
.RE
.PP
..
.TH "enca" "1" "Dec 2005" "enca 1.9" " "
.SH "NAME"
.PP
enca \-\- detect and convert encoding of text files
.
.
.SH "SYNOPSIS"
.PP
\fBenca\fR [\fB\-L\fR \fILANGUAGE\fR] [\fIOPTION\fR]... [\fIFILE\fR]...
.br
\fBenconv\fR [\fB\-L\fR \fILANGUAGE\fR] [\fIOPTION\fR]... [\fIFILE\fR]...
.
.SH "INTRODUCTION AND EXAMPLES"
.PP
If you are lucky enough, the only two things you will ever need to know are: command
.XA "enca \fIFILE\fR"
will tell you which encoding file \fIFILE\fR uses (without changing it), and
.XA "enconv \fIFILE\fR"
will convert file \fIFILE\fR to your locale's native encoding. To convert the file to some other encoding, use the \fB\-x\fR option (see the \fB\-x\fR entry in section \fBOPTIONS\fR and sections \fBCONVERSION\fR and \fBENCODINGS\fR for details).
.PP
Both work with multiple files and standard input (output) too. E.g.
.XA "enca \-x latin2 2,1)
/4321@Byte order reversed in quadruples (1,2,3,4 -> 4,3,2,1)
N.A.@Both little and big endian chunks, concatenated
/qp@Quoted-printable encoded
.TE
.PP
Note that some surfaces have N.A. in place of an identifier\-\-they cannot be specified on the command line, they can only be reported by Enca. This is intentional: they do not represent a real surface, they only inform you why the file cannot be considered surface-consistent.
.PP
Each charset has its natural surface (called `implied' in recode) which is not reported, e.g., for the IBM 852 charset it's `CRLF line terminators'. For UCS encodings, big endian is considered the natural surface; unusual byte orders are constructed from the 21 and 4321 permutations: 2143 is reported simply as 21, while 3412 is reported as a combination of 4321 and 21.
.PP
Doubly-encoded UTF-8 is neither a charset nor a surface, it's just reported.
.PP
.
.SS About charsets, encodings and surfaces
.PP
Charset is a set of character entities while encoding is its representation in terms of bytes and bits.
In Enca, the word \fIencoding\fR means the same as `representation of text', i.e. the relation between the sequence of character entities constituting the text and the sequence of bytes (bits) constituting the file.
.PP
So, encoding is both character set and so-called surface (line terminators, byte order, combining, Base64 transformation, etc.). Nevertheless, it proves convenient to work with some {charset,surface} pairs as with genuine charsets. So, as in \fIrecode\fR(1), all UCS- and UTF- encodings of the Universal character set are called charsets. Please see the recode documentation for more details on this issue.
.PP
The only good thing about surfaces is: when you don't start playing with them, Enca won't start either, and it will try to behave as much as possible like a surface-unaware program, even when talking to recode.
.PP
.
.
.SH "LANGUAGES"
.PP
Enca needs to know the language of input files to work reliably, at least in the case of regular 8bit encodings. Multibyte encodings should be recognised for any Latin, Cyrillic or Greek language.
.PP
You can (or have to) use the \fB\-L\fR option to tell Enca the language. Since people most often work with files in the same language for which they have configured locales, Enca tries to guess the language by examining the value of \fBLC_CTYPE\fR and other locale categories (please see \fIlocale\fR(7)) and uses it as the language when you don't specify any. Of course, the guess may be completely wrong, in which case Enca will give you nonsense answers and damage your files, so please don't forget to use the \fB\-L\fR option. You can also use the \fBENCAOPT\fR environment variable to set a default language (see section \fBENVIRONMENT\fR).
.PP
The following languages are supported by Enca (each language is listed together with its supported 8bit encodings).
.PP
.TS
tab (@);
l l.
Belarussian@CP1251 IBM866 ISO\-8859\-5 KOI8\-UNI maccyr IBM855
Bulgarian@CP1251 ISO\-8859\-5 IBM855 maccyr ECMA\-113
Czech@ISO\-8859\-2 CP1250 IBM852 KEYBCS2 macce KOI\-8_CS_2 CORK
Estonian@ISO\-8859\-4 CP1257 IBM775 ISO\-8859\-13 macce baltic
Croatian@CP1250 ISO\-8859\-2 IBM852 macce CORK
Hungarian@ISO\-8859\-2 CP1250 IBM852 macce CORK
Lithuanian@CP1257 ISO\-8859\-4 IBM775 ISO\-8859\-13 macce baltic
Latvian@CP1257 ISO\-8859\-4 IBM775 ISO\-8859\-13 macce baltic
Polish@ISO\-8859\-2 CP1250 IBM852 macce ISO\-8859\-13 ISO\-8859\-16 baltic CORK
Russian@KOI8\-R CP1251 ISO\-8859\-5 IBM866 maccyr
Slovak@CP1250 ISO\-8859\-2 IBM852 KEYBCS2 macce KOI\-8_CS_2 CORK
Slovene@ISO\-8859\-2 CP1250 IBM852 macce CORK
Ukrainian@CP1251 IBM855 ISO\-8859\-5 CP1125 KOI8\-U maccyr
Chinese@GBK BIG5 HZ
none@
.TE
.PP
The special language \fBnone\fR can be shortened to \fB__\fR; it contains no 8bit encodings, so only multibyte encodings are detected.
.PP
.
.
.SH "FEATURES"
.PP
Several of Enca's features depend on what is available on your system and how Enca was compiled. You can get their list with
.XA "enca \-\-version"
A plus sign before a feature name means it's available, a minus sign means this build lacks the particular feature.
.PP
\fBlibrecode\-interface\fR. Enca has an interface to the GNU recode library charset conversion functions.
.sp
\fBiconv\-interface\fR. Enca has an interface to the UNIX98 iconv charset conversion functions.
.sp
\fBexternal\-convertor\fR. Enca can use external conversion programs (if you have some suitable ones installed).
.sp
\fBlanguage\-detection\fR. Enca tries to guess the language (\fB\-L\fR) from locales. You don't need the \fB\-\-language\fR option, at least in principle.
.sp
\fBlocale\-alias\fR. Enca is able to resolve locale aliases used for language names.
.sp
\fBtarget\-charset\-auto\fR. Enca tries to detect your preferred charset from locales. Option \fB\-\-auto\-convert\fR and calling Enca as \fBenconv\fR work, at least in principle.
.sp
\fBENCAOPT\fR.
Enca is able to correctly parse this environment variable before command line parameters. Simple stuff like \fBENCAOPT="\-L uk"\fR will work even without this feature.
.PP
.
.
.SH "ENVIRONMENT"
.PP
The variable \fBENCAOPT\fR can hold a set of default Enca options. Its content is interpreted before command line arguments. Unfortunately, this doesn't work everywhere (the build must have the +ENCAOPT feature).
.PP
\fBLC_CTYPE\fR, \fBLC_COLLATE\fR, \fBLC_MESSAGES\fR (possibly inherited from \fBLC_ALL\fR or \fBLANG\fR) are used for guessing your language (the build must have the +language-detection feature).
.PP
The variable \fBDEFAULT_CHARSET\fR can be used by \fBenconv\fR as the default target charset.
.PP
.
.
.SH "DIAGNOSTICS"
.PP
Enca returns exit code\~0 when all input files were successfully processed (i.e. all encodings were detected and all files were converted to the required encoding, if conversion was asked for). Exit code\~1 is returned when Enca wasn't able to either guess the encoding or perform the conversion on one or more input files because it's not clever enough. Exit code\~2 is returned in case of serious (e.g. I/O) troubles.
.PP
.
.
.SH "SECURITY"
.PP
It should be possible to let Enca work unattended, that is its goal. However:
.PP
There's no warranty that the detection works 100%. Don't bet on it, you can easily lose valuable data.
.PP
Don't use enca (the program), link to libenca instead if you want anything resembling security. You then have to perform the eventual conversion yourself.
.PP
Don't use external convertors. Ideally, disable them at compile time.
.PP
Be aware of \fBENCAOPT\fR and all the built-in automagic that guesses various things from the environment, namely locales.
.PP
.
.
.SH "SEE ALSO"
.PP
\fIautoconvert\fR(1), \fIcstocs\fR(1), \fIfile\fR(1), \fIiconv\fR(1), \fIiconv\fR(3), \fInl_langinfo\fR(3), \fImap\fR(1), \fIpiconv\fR(1), \fIrecode\fR(1), \fIlocale\fR(5), \fIlocale\fR(7), \fIltt\fR(1), \fIumap\fR(1), \fIunicode\fR(7), \fIutf-8\fR(7), \fIxcode\fR(1)
.PP
.
.
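.SH "EXAMPLE"
.PP
The exit codes described in \fBDIAGNOSTICS\fR lend themselves to scripting. The following Python fragment is only a minimal sketch and not part of Enca: \fBdescribe_exit\fR and \fBdetect\fR are hypothetical helper names, and \fBdetect\fR assumes an \fBenca\fR binary on the search path, invoked as shown in \fBSYNOPSIS\fR.
.PP
.nf

```python
import subprocess

# Meaning of Enca's exit codes, paraphrased from the DIAGNOSTICS section.
EXIT_MEANING = {
    0: "all files processed successfully",
    1: "encoding not recognised or conversion failed for some file",
    2: "serious trouble (e.g. I/O error)",
}


def describe_exit(code):
    """Return a short description of an Enca exit code."""
    return EXIT_MEANING.get(code, "unexpected exit code %d" % code)


def detect(path, language):
    """Run `enca -L LANGUAGE FILE` and return (exit code, reported text).

    Assumes the enca program is installed; passing the language
    explicitly avoids relying on guessing from locales.
    """
    proc = subprocess.run(
        ["enca", "-L", language, path],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout.strip()
```

.fi
.PP
A caller would typically treat any non-zero code from \fBdetect\fR as a reason to leave the file alone, in line with the advice in \fBSECURITY\fR.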
.SH "KNOWN BUGS"
.PP
It has too many \fIunknown\fR bugs.
.PP
The idea of using \fBLC_*\fR values for language is certainly braindead. However, I like it.
.PP
It can't back up files before mangling them.
.PP
In certain situations, it may behave incorrectly on >31bit file systems and/or over NFS (both untested but shouldn't cause problems in practice).
.PP
Built\-in convertor does not convert character `ch' from \fIKOI8-CS2\fR, and possibly some other characters you've probably never heard about anyway.
.PP
EOL type recognition works poorly on Quoted-printable encoded files. This should be fixed someday.
.PP
There are no command line options to tune libenca parameters. This is intentional (Enca should DWIM) but sometimes this is a nuisance.
.PP
The manual page is too long, especially this section. This doesn't matter since nobody reads it.
.PP
Send bugs / questions / wishes / money to , if you believe I will fix / answer / fulfil / don't defraud them. Please include `enca' in subject.
.
.
.SH "TRIVIA"
.PP
Enca is Extremely Naive Charset Analyser. Nevertheless, the `enc' originally comes from `encoding' so the leading\~`e' should be read as in `encoding' not as in `extreme'.
.
.
.SH "AUTHORS"
.PP
David Necas (Yeti)
.sp
Unicode data has been generated from various (free) on\-line resources or using GNU recode. Statistical data has been generated from various texts on the Net, I hope character counting doesn't break anyone's copyright.
.
.
.SH "ACKNOWLEDGEMENTS"
.PP
Please see the file THANKS in distribution.
.
.
.SH "COPYRIGHT"
.PP
Copyright (C) 2000-2003 David Necas (Yeti).
.sp
Enca is free software; you can redistribute it and/or modify it under the terms of version 2 of the GNU General Public License as published by the Free Software Foundation.
.sp
Enca is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the GNU General Public License for more details.
.sp
You should have received a copy of the GNU General Public License along with Enca; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
.