#============================================================================
# Enca v1.9 (2005-12-18)  guess and convert encoding of text files
# Copyright (C) 2000-2003 David Necas (Yeti)
#============================================================================

TO THE NEXT RELEASE:
(this list must be empty at the time of release)

IN FUTURE:
(should be done, but maybe not right now)
* LCUC check for Cyrillic charsets.
* Backups -- like cp, mv, etc.  This will be hard to get right with all
  the silly convertors.
* More tests
* Structured documentation (the manual page is ugly)
  - keep a reasonably brief manual page
  - put all the boring doc stuff somewhere else; the possibilities are:
    info:     searchable, has links, partly portable, has console viewers
    HTML:     poorly searchable, has links, most portable, has console
              viewers
    TeX (ps): not searchable, no links, portable, most pleasant to read,
              no console viewers
    => use SGML (or info itself?) and generate the others

MAYBE SOMEDAY:
(when I am in the mood for it; items are freely moved here and removed
again)
* Detect all-caps texts
  OK.  After several experiments it seems we have to
  - use pair occurrences, at least, with specifically computed
    difference-maximising weights
  - guess in two steps
    - first with uncapitalization and pair weights, and check whether the
      sample looks like natural text (garbageness test, but better)
    - if the first approach fails, do it as we do it now
* design better levels of verbosity/warnings
  (or: remove the --verbose option, keep important messages and remove all
  others?)
    0: only messages followed by exit(EXIT_FAILURE) (or abort()) are
       printed, plus `cannot convert...'
    1: all nonfatal errors/warnings
    2: what convertors are tried, what language gets detected (do not
       duplicate --details)
   >2: debug
* _real_ paranoiac behaviour assuring that nothing gets lost and that
  conversion output is either correctly converted text or untouched
  original (requires major redesign of all the conversion stuff)

NEVER:
(you can do anything GNU GPL v2 allows, but I'll refrain)
* features that nobody needs (mm, well, ... ok, let it be)
* duplicating other tools' functionality more than necessary; use them
  instead
* dependency on anything that is not ISO C and/or POSIX (moreover, do not
  use braindead features of either); important functionality must be
  present everywhere; nevertheless, enca can be smaller, faster or cleverer
  on some (GNU) systems
* localization; please correct my English instead ;->
* convertor calling generalization (would require including the whole
  wordexp thing in enca, and: launching an external convertor is a
  Bad Thing(TM) anyway)
* data in run-time files (needs a parser (could live with) and disallows
  hooks (can't live without))
* loadable module support (it's not very portable)

-------------

KNOWN ISO C CONFLICTS:
(perhaps to be solved someday)

All constants and typedefs.  They start with ENCA_ and Enca, but:

  Names beginning with a capital `E' followed by a digit or uppercase
  letter may be used for additional error code names.          [errno.h]

And additionally inside libenca (i.e. not so serious):
* libenca.h:    #define EPSILON       [errno.h]
* filters.c:    isvbox[]              [ctype.h]
* guess.c:      #define isbinary      [ctype.h]
* guess.c:      #define istext        [ctype.h]
* multibyte.c:  is_valid_utf7()       [ctype.h]
* multibyte.c:  is_valid_utf8()       [ctype.h]
Some probably can't conflict.
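
The conflicts listed above can be checked mechanically.  What follows is
only a rough sketch, not code from enca: the two function names are
hypothetical, and the rules are the reserved-name rules as I read them in
ISO C's future library directions (`E' plus a digit or uppercase letter for
errno.h; `is' or `to' plus a lowercase letter for ctype.h).

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper, not part of enca.
 * Nonzero if NAME starts with `E' followed by a digit or an uppercase
 * letter, i.e. falls in the namespace reserved for errno.h. */
int reserved_for_errno(const char *name)
{
    return name[0] == 'E'
        && (isdigit((unsigned char)name[1])
            || isupper((unsigned char)name[1]));
}

/* Hypothetical helper, not part of enca.
 * Nonzero if NAME starts with `is' or `to' followed by a lowercase
 * letter, i.e. falls in the namespace reserved for ctype.h. */
int reserved_for_ctype(const char *name)
{
    int has_prefix = strncmp(name, "is", 2) == 0
                  || strncmp(name, "to", 2) == 0;

    /* name[2] is safe to read here: the prefix matched, so the string
     * has at least two characters before its terminating NUL. */
    return has_prefix && islower((unsigned char)name[2]);
}
```

Under these rules ENCA_ and EPSILON fall into the errno.h namespace while
Enca does not, and isvbox, isbinary and istext fall into the ctype.h
namespace while is_valid_utf7 and is_valid_utf8 do not, because an
underscore follows `is'.  That would match the remark that some of the
listed names probably can't conflict.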