Gobo Eiffel Unicode Class generation tool
=========================================

Geuc is a tool for generating routines from the Unicode Character Database
(see http://www.unicode.org/ucd/).

A note on canonical combining classes
-------------------------------------

Canonical combining classes range from 0 to 240 (at present - version 4.1.0).
These could be nicely represented with a NATURAL_8, but VE and SE 1 did not
support this type when the code was written (2005/11/02).

In order to avoid the space wastage caused by ARRAY [INTEGER_16], and the
danger that VE would run out of memory when compiling, I project these values
1-to-1 onto the range 0-60. The projection was worked out by visual inspection
of the data, to see which combining classes were actually present in the UCD
(Unicode Character Database) - I inserted print statements into geuc to get
the output.

With each new version of Unicode, you must check whether any new values are
in use (hopefully, the changes files will tell you this).

Once all supported compilers support NATURAL_8, we can drop these projections.
N.B. This is now the case, but I don't feel energetic enough to change the
code right now. Anyone who feels energetic - feel free.

A note on Hangul syllables and conjoining jamo behavior
-------------------------------------------------------

The UCD does not contain any decompositions for Hangul syllables and
conjoining Hangul jamo, as these compositions/decompositions are completely
determined algorithmically. Accordingly, one can either calculate the
decompositions and compositions at runtime (saving memory at the cost of
speed), or calculate them at array generation time (spending memory to gain
speed). I have chosen to do the latter, as:

1) Users of Korean will not suffer a speed penalty compared to other users.
2) To save the memory, the corresponding segments of the array would have to
   be Void references. This would destroy the invariant that there are no
   Void entries.
3) All users would suffer a slight performance degradation, as each segment
   would have to be checked as non-Void before it could be accessed.
4) The memory savings, although not insignificant, are very small compared
   with the total Unicode array memory usage for an application.

Usage
-----

The basic procedure for use is:

1) Change directory to $GOBO/library/string/tool/geuc/src and then remove
   any existing generated classes by:

     geant clean (or geant clobber)

2) Compile the geuc program with:

     geant compile

   or some variant of that command.

3) Save copies of the Unicode Character Database files in
   $GOBO/library/string/tool/geuc/src. The following files need to be
   imported:

     ftp://www.unicode.org/Public/UNIDATA/UnicodeData.txt
     ftp://www.unicode.org/Public/UNIDATA/DerivedCoreProperties.txt
     ftp://www.unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
     ftp://www.unicode.org/Public/UNIDATA/SpecialCasing.txt

   Do NOT add these files to the Git repository.

   Note that only the latest version of the files should be obtained. The
   idea is that we generate unversioned Eiffel classes, which always
   represent the latest version available, and also versioned Eiffel classes
   (in case programmers want to code for a specific version of Unicode for
   their application - this might be wanted if a particular public standard
   mandates using a specific version of Unicode).

   Since the format of the Unicode Character Database has some tendency to
   change incompatibly over time, geuc will only support the format for the
   current version. If for some reason you need an older version of the
   Unicode Character Database, then please copy geuc.e to geuc-vxxx.e, and
   make the necessary changes.

4) Run the utility to produce versioned files, e.g.:

     ./geuc --uc_version=410

   The character string after --uc_version= will be used unchanged (apart
   from case) in the generated class names.
   So, for instance, if the unversioned class name is:

     UC_CHARACTER_CLASS_ROUTINES

   then the above command line will generate:

     UC_V410_CHARACTER_CLASS_ROUTINES

5) Move the generated classes to their target libraries by:

     geant deploy

   The target library is $GOBO/library/string/src/unicode for normalization
   routines, and $GOBO/library/kernel/src/unicode for all other routines.

6) Edit the unversioned classes to inherit from the latest versioned class.

   If any of the classes are new (which will be the case if you are
   generating a new version, or if you are adding a new facility from the
   database), then don't forget to add the files to the Git repository
   (contrary to the rest of Gobo, with geuc we store the GENERATED files,
   and not the source files).

   Also create by hand and add to the Git repository a shared access class.
   E.g. ST_UNICODE_CHARACTER_CLASS_ROUTINES has a corresponding shared
   access class ST_IMPORTED_UNICODE_CHARACTER_CLASS_ROUTINES, and
   ST_UNICODE_V410_CHARACTER_CLASS_ROUTINES has
   ST_IMPORTED_UNICODE_V410_CHARACTER_CLASS_ROUTINES.

   Also, create a UC_Vnnn_CTYPE class (use an existing one as a model) and,
   if it is the latest version, change UC_CTYPE to inherit from it.

7) Fetch the file:

     ftp://www.unicode.org/Public/UNIDATA/NormalizationTest.txt

   and save it in $GOBO/library/string/test/unit/data

   The test class ST_TEST_NORMALIZATION_ROUTINES will look for this file,
   and use it to test the normalization code. These tests MUST be run after
   any changes, to ensure the correct working of the normalization code,
   which is highly optimized.

--
Copyright (c) 2005, Colin Adams and others
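
Illustration for the note on Hangul syllables: the sketch below shows what
"completely determined algorithmically" means in practice, using the
arithmetic published in the Unicode Standard (section on conjoining jamo
behavior). It is written in Python purely for illustration and is not taken
from geuc; the function names, including build_decomposition_table, are
hypothetical, and the table builder only stands in for the idea of computing
the values once at array generation time.

```python
# Constants of the Hangul syllable algorithm from the Unicode Standard.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total


def decompose_hangul(code):
    """Return the full decomposition of a precomposed syllable as jamo code points."""
    s_index = code - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    jamo = [L_BASE + s_index // N_COUNT,
            V_BASE + (s_index % N_COUNT) // T_COUNT]
    t_index = s_index % T_COUNT
    if t_index:  # index 0 means "no trailing consonant"
        jamo.append(T_BASE + t_index)
    return jamo


def compose_hangul(jamo):
    """Inverse of decompose_hangul: takes [L, V] or [L, V, T] code points."""
    l_index = jamo[0] - L_BASE
    v_index = jamo[1] - V_BASE
    t_index = jamo[2] - T_BASE if len(jamo) == 3 else 0
    return S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index


def build_decomposition_table():
    """Hypothetical generation-time step: precompute every decomposition once."""
    return [decompose_hangul(S_BASE + i) for i in range(S_COUNT)]
```

Calling decompose_hangul at lookup time corresponds to the runtime option;
indexing into the list returned by build_decomposition_table corresponds to
the generation-time option chosen for geuc, at the cost of holding all
11172 entries in memory.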