=== Programs === doit.sh -- Regenerates all */*.base and */*.c files from the source one (given as first parameter in */doit.sh), used by */doit.sh to regenerate stuff in individual directories too. Uses many of following scripts. */doit.sh -- Customized scripts for individual directories. Once a directory contains doit.sh, it's run by the main one. clean.sh -- Removes most auxiliary files from language subdirs. basetoc.c -- [filter] Converts one .base file to .c file, used by doit.sh $ ./basetoc CHARSET.c totals.pl -- Reads generated .c files and computes significancy data, weight sums and other summary data, writes file `totals.c' $ ./totals.pl CHARSET1.c ... CHARSETn.c normalize.pl -- [filter] Does some kind of funny weight normalization, useful for producing CHARSET.base files, since the weights must fit into unsigned short int: $ ./normalize.pl NORMALIZED_COUNTS Given a file on command line, it normalizes input to have exactly(!) the same weight sum: $ ./normalize.pl REFERENCE_COUNTS RENORMALIZED_COUNTS This is not run by doit.sh. extreme.pl -- Given two count files, it finds characters most suitable for hook deciding between these two, i.e. characters with the biggest difference of occurences: $ ./extreme.pl COUNT1 COUNT2 xlt.c -- [filter] Extremely simple charset converter, to become independent on the other broken converters: $ ./xlt SOURCE.map TARGET.map CONVERTED_TEXT mystrings.c -- [filter] Extract text chunks from input (strings(1) doesn't seem to do good job on 8bit files): $ ./mystrings rawcounts.CHARSET countpair.c -- [filter] Count 8bit letter pair frequencies and print a table containing as much pairs as to get 95% of all $ ./countpair CHARSET.letters paircounts.CHARSET findletters.c -- [filter] Find what 8bit characters from a charset map are letters $ ./findletters CHARSET.map CHARSET.letters map2letters.sh -- Run findletters.c for all charsets in maps/. === Data === Letters -- Unicode characters assumed to be letters, excluding 7bits. Also excluding non-European scripts, to keep it small. maps/ -- 8bit charset -> UCS2 maps, notable ones: ibm866-bad.map -- Translates Latin `i' and `I' to Cyrillic 0x0456 and 0x0406, thus approximates them the opposite way when used as TARGET. maccyr.map -- It's Macintosh Cyrillic after Apple unification of Russian and Ukriainian variants and adding Euro symbol there, in Mac OS 9.0 or so (recode uses the old Russian maccyr -- FIXME with iconv it doesn't?). macce.map -- Macintosh Central European encoding, the real one, not the crappy one used by recode. koi8u.map -- KOI8-U (Ukrainian) (recode uses some strange mapping?). koi8uni.map -- KOI8-Unified. koi8ub.map -- KOI8-UB (Ukrainian/Belarussian). cork.map -- T1 Cork encoding (recode uses some strange mapping?). iso885913.map -- ISO-8859-13 map (recode uses some strange mapping?). letters/ -- lists of 8bit charset that are letters (generated) for various charsets, run map2letters.sh to create it