> There are no 'levels' of UTF-8. It is an encoding compatible
> with US-ASCII in the 'left-half' of any octet of the FULL
> UCS-2 Unicode set of 2-byte code points. For Western European
> languages, it probably averages 2.5 bytes per character.
> For non-Western languages, it takes up to 4 bytes per character.
Ira, help me out here. (I know just enough about this topic to be
dangerous.) What I assumed for Latin-script languages was that in
typical text, 90-95% of the characters would be ASCII, and you'd only
use the multi-byte sequences for letters with diacritics and a handful
of other non-English characters. So typical expansion would be 10-20%
over what you'd get with a single-byte encoding like Latin-1. What am I
missing?
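
To make the arithmetic concrete, here is a rough sketch that measures the
expansion directly. It relies on the fact that UTF-8 encodes ASCII in one byte
and the rest of the Latin-1 range (U+0080..U+00FF) in two, so the expansion
over Latin-1 should simply equal the fraction of non-ASCII characters. The
function name utf8_expansion and the sample sentence are made up for
illustration, and one sentence is no substitute for a real corpus.

```python
def utf8_expansion(text: str) -> float:
    """Fractional size increase of UTF-8 over Latin-1 for the given text.

    Assumes every character in `text` is representable in Latin-1;
    str.encode raises UnicodeEncodeError otherwise.
    """
    latin1_size = len(text.encode("latin-1"))  # always one byte per character
    utf8_size = len(text.encode("utf-8"))      # 1 byte for ASCII, 2 for U+0080..U+00FF
    if latin1_size == 0:
        return 0.0                             # empty text: nothing to expand
    return (utf8_size - latin1_size) / latin1_size


# Hypothetical sample; substitute any Latin-1 text you care about.
sample = "Les élèves ont déjà étudié la leçon à l'école."
print(f"UTF-8 is {utf8_expansion(sample):.1%} larger than Latin-1 for this sample")
```

If 5-10% of the characters fall outside ASCII, this measure comes out at
5-10%, which is within (indeed below) the 10-20% guessed above.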
David
:: David Kellerman Northlake Software 503-228-3383
::david_kellerman at nls.com Portland, Oregon fax 503-228-5662