hank
Starting Member
5 Posts
Posted - 28 May 2002 : 19:06:58
I've been considering working on a language file for Taiwanese (aka Hoklo, Holo, Amoy, Southern Min). It is actually quite widely used, with more than 15 million speakers in Taiwan, but it is far less commonly written. In the absence of a firmly established written standard, any localization implementation is necessarily experimental. The only reason I even bother to try is that a small user base does exist and, given the new educational policy, is likely to grow.
The approach I've been considering has the following requirements:
- utf-8: to allow mixing Traditional Chinese characters and Taiwanese romanization (the latter used either as a pronunciation key or to stand in for morphemes poorly represented by Chinese characters), i.e. digraphia.
- built-in forum font specification: to force display in a Taiwanese romanization font, e.g. Taiwanese Serif.
- (optional) a MOD to translate romanized input in the form of letter+tone number into a letter with a diacritic: primarily because of the high frequency of letters requiring tone diacritics (easy to read) and the widespread practice of using numbers to represent tone (easy to type). I already have a prototype that can be adapted for Snitz; a rough sketch of the idea follows this list.
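To make the conversion idea concrete, here is a minimal sketch in VBScript/ASP (Snitz's own language). The function name and the simplified handling are my own illustration of the usual POJ tone-number conventions, not the actual prototype:

<%
' Hypothetical sketch only, not the actual prototype MOD.
' Common POJ practice: tone 2 = acute, 3 = grave, 5 = circumflex,
' 7 = macron, 8 = vertical line above; tones 1 and 4 carry no mark.
Function ApplyToneMark(ByVal syllable)
    Dim base, mark
    base = Left(syllable, Len(syllable) - 1)
    Select Case Right(syllable, 1)
        Case "2": mark = ChrW(&H0301)   ' combining acute accent
        Case "3": mark = ChrW(&H0300)   ' combining grave accent
        Case "5": mark = ChrW(&H0302)   ' combining circumflex accent
        Case "7": mark = ChrW(&H0304)   ' combining macron
        Case "8": mark = ChrW(&H030D)   ' combining vertical line above
        Case "1", "4": mark = ""        ' unmarked tones: just drop the digit
        Case Else
            ApplyToneMark = syllable    ' no trailing tone digit: return unchanged
            Exit Function
    End Select
    ' Simplification: the mark is appended after the last letter of the syllable;
    ' the real POJ rules about which vowel carries the mark are more involved.
    ApplyToneMark = base & mark
End Function
%>

Called as, say, ApplyToneMark("a2"), this returns "a" followed by U+0301; whether that pair renders as a single accented letter or as two side-by-side glyphs then depends on the reader's font and browser.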
Note that the language (regionalect, topolect, "dialect") lacks an existing LCID. Would any of this be a problem?
Thanks.
Deleted
deleted
4116 Posts
Posted - 29 May 2002 : 07:05:07
Hi hank,
There are more than 4000 languages in the world; some of them are disappearing along with a whole culture, or being overtaken by others, mostly English. Language is one of the major parts of a culture, so we need to keep them alive. If you can get a Taiwanese board going, more people can use the language in written form and help it recover.
We reserved numbers higher than 90000 for such cases. 90010 is already used for Frisian, so you can use 90020. You would also need the other aspects of a locale specified for v4 (date/time/number formats, etc.). If M$ decides that you provide a good customer base, they will assign their own LCID, and we can easily move it.
I think we will also use UTF-8 for language files, so that will be fine. Please indicate at the beginning of the language file that it is UTF-8 formatted.
I can't speak to the other two points. Font problems were also the biggest problem for Turkish 7 years ago (until M$ saw the opportunity).
Think Pink ==> Start Internationalization Here
n/a
deleted
593 Posts
Posted - 29 May 2002 : 15:03:58
You may have to convert/create Unicode characters for those special Taiwanese characters (i.e. the digraphia) if they don't exist in Unicode...
Basically, Unicode provides a unique number (encoding) for every character, regardless of the platform, program or language used. The main benefit of Unicode is that you can get away with just one character encoding for all languages... but the main drawback is that not all browsers come with built-in support for Unicode, so you may face the additional programming challenge of converting data stored in Unicode to a widely accepted national character encoding like Shift-JIS for Japanese. (IE, of course, has good Unicode support; I am not sure about Netscape. There are also various issues in handling "presentation"-layer matters across browsers and encoding schemes.) Mixing Traditional Chinese characters and Taiwanese romanization is like mixing Japanese kanji (hanzi), katakana (one style of phonetic representation) and hiragana (another style of phonetic representation), so the mixing itself should not be a problem, except, I suspect, the issue of encoding some special characters in Unicode.
I'm not clear about item 3; it sounds more like an input-method issue, as there are different Chinese input methods.
Having said all that, if you already have a prototype built, you can probably use it in localizing/translating Lang90020.asp (from Lang1033.asp) and test it out.
The international version Bozden developed already has a built-in script to handle various HTML encoding types, but you have to do two things to get it shown properly in a web browser: 1. You have to set the CodePage to UTF-8. (This applies to both Lang1033.asp, the default language, and your Lang90020.asp, because enabling Lang1033.asp with UTF-8 basically makes UTF-8 your default language format.) 2. You have to save your Lang90020.asp file itself as UTF-8.
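For what it's worth, here is a rough sketch of what those two steps usually look like in classic ASP. I am going from general ASP practice here, not from Bozden's actual files, so the exact lines in the Snitz sources may differ:

<%@ Language=VBScript CodePage=65001 %>
<%
' 65001 is the code page number for UTF-8.
Response.CodePage = 65001
Response.Charset = "utf-8"
%>
<!-- and tell the browser the same thing in the page head -->
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

Step 2 is then just a matter of saving Lang90020.asp (and Lang1033.asp) as UTF-8 in your editor, so the bytes on disk actually match the declared encoding.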
To check whether your Unicode (UTF-8) formatted language pack is viewable in a browser, open the file in IE with its encoding set to UTF-8 (IE 5.x and above will detect the encoding automatically) and see whether it displays correctly. If some characters show up corrupted or wrong, that indicates they are not properly encoded/stored in the Unicode character set. (Unicode often demands some upfront legwork to get exotic language charsets encoded before you can use them, so don't be surprised if this happens; once that has been done well, Unicode is really good.) Also, Bozden keeps the language packs/files as separate resource files rather than hardcoding them in the main program code, which is great, since many scripting languages are not good at handling double-byte characters, and there are font-size issues, etc.
3. The overall installation/deployment of Snitz with your special Taiwanese locale language will be pretty straightforward after the above. If you want your forum interface and system messages only in Lang90020, but want to allow posts/content in Traditional and/or Simplified Chinese or any other language, you can do that as well by setting Lang90020 as the default; that also works if you do not want a language-selector type approach. Unicode allows this as well.
Good luck, and I wish you success with your project.
Taku i2Asia <>||TestingOut Multilingo Snitz||<>
n/a
deleted
593 Posts
Posted - 29 May 2002 : 18:01:14
Probably, besides looking at the level of Unicode implementation in each browser (and version), this particular case may require checking the Unicode character references themselves, related to the code page and character sets. The link below has good info on this:
http://www.hclrss.demon.co.uk/demos/ent4_frame.html
The question is whether Taiwanese romanization/digraphia can be represented by existing Unicode characters or needs to be defined further using extended charsets, since if any of these characters are not supported in Unicode, they will not render properly regardless of a browser's Unicode implementation... I think.
Option 2 you mentioned, for example, does not seem to be any different from having a separate locale language (90020 in this case) and choosing it as an independent locale, as a variant, like Traditional Chinese Lang1028 (Taiwan) or Simplified Chinese Lang2052 (PRC). In that case it would not need to be in Unicode and could probably use an existing encoding scheme for handling Chinese characters (???)
Again, one way to check this is to have Lang1033.asp translated/localized into 90020 and open it in IE 5.x or above, both with a national/locale encoding scheme (Big5?) and with Unicode (UTF-8), and see whether it displays correctly. Typically a double-byte environment can handle single-byte charsets (Japanese, for example, can contain English as well, since they form an extended charset group), so if the Taiwanese digraphia (romanized) can be expressed using existing ASCII characters, this can be accomplished to an extent. If the Taiwanese digraphia characters need to be double-byte (the way Japanese kanji/hanzi, katakana and hiragana can be expressed in both "full" and "half" width), you still have issues around properly encoded charsets... But given the types of languages supported these days, I suspect you will most likely be able to accommodate your needs, although it will require some work.
Also, I would recommend taking a look at Lang1028.asp, which is Traditional Chinese. One approach is to amend that language file to include your Taiwanese digraphia and see whether it works that way, since you are talking about combining Traditional Chinese characters with Taiwanese digraphia. Formatting it into UTF-8 is relatively simple (again assuming there are no inherent character-code problems with the digraphia representation). There is also a UTF-8 formatted Lang1028, so you can work with that as well if you want to try.
I hope this isn't confusing you; it's just meant to point out some areas you may need to be aware of...
My 2 cents...
Taku i2Asia <>||TestingOut Multilingo Snitz||<>
hank
Starting Member
5 Posts
Posted - 31 May 2002 : 13:20:39
Hi, Bozden,
quote: You would also need the other aspects of a locale specified for v4 (date/time/number formats, etc.).
Thanks for the encouragement. I think the locale variables should be straightforward. My guess is Lang1028 (Traditional Chinese) would be a good reference.
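Just to spell out what "date/time/number formats etc." might boil down to, here is a purely illustrative block; the constant names are placeholders of my own, not the actual Snitz v4 variables, and Lang1028.asp would show the real ones:

<%
' Hypothetical locale block for an experimental Taiwanese locale, LCID 90020.
' The names below are invented for illustration; take the real variable names
' and structure from Lang1028.asp (Traditional Chinese) when building the file.
Const strLangCharSet      = "utf-8"       ' the file is saved and served as UTF-8
Const strLangDateFormat   = "yyyy/mm/dd"  ' assumption: Taiwan-style year/month/day
Const strLangTimeFormat   = "hh:mm:ss"
Const strLangDecimalPoint = "."
Const strLangThousandSep  = ","
%>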
quote:
I can't speak to the other two points. Font problems were also the biggest problem for Turkish 7 years ago (until M$ saw the opportunity).
Indeed, font availability and encoding issues remain the major obstacle for Taiwanese. The fonts I use, for example, are ANSI hacks in the 128-255 range. Netscape's Open Directory Project (http://dmoz.org/world), another great community project, has at least a few languages taking a similar approach. Some of them will no doubt head over here; certainly the Indian languages will eventually.
Regarding this case, I will need to do more homework, such as trying out v4b first (I currently use v3). So we'll see. Meanwhile, of course, feel free to give out 90020 to any other language! The more the better (usually).
quote:
The question is whether Taiwanese romanization/digraphia can be represented by existing Unicode characters or needs to be defined further using extended charsets, since if any of these characters are not supported in Unicode, they will not render properly regardless of a browser's Unicode implementation... I think.
Hi, LeoRat,
Thanks for the very detailed responses. You've pinpointed the essence of the problem. Several romanization characters (with tone diacritics) are not explicitly available in Unicode and would currently need to be represented via combinations. In practice, browser implementation and Unicode font availability make all the difference. The demos I've seen are not entirely pleasing to the eye, though the characters are recognizable. The fonts are generally quite large to download and thus present a barrier to building a user base.
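To illustrate what "represented via combinations" means in practice, here is a rough sketch assuming the standard POJ tone marks; how well each form renders depends entirely on the font:

<%
Dim aTone2, aTone2Alt, aTone8
' Tones 2, 3, 5 and 7 have precomposed letters in Unicode, so either form works:
aTone2    = ChrW(&HE1)              ' U+00E1, precomposed a with acute
aTone2Alt = "a" & ChrW(&H0301)      ' a + U+0301 combining acute accent
' Tone 8 has no precomposed form, so it can only be built as a combining sequence:
aTone8    = "a" & ChrW(&H030D)      ' a + U+030D combining vertical line above
%>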
quote:
I'm not clear about item 3; it sounds more like an input-method issue, as there are different Chinese input methods.
It is an input method to call up roman characters, but the processing is done on the server.
quote:
Also, I would recommend taking a look at Lang1028.asp, which is Traditional Chinese. One approach is to amend that language file to include your Taiwanese digraphia and see whether it works that way, since you are talking about combining Traditional Chinese characters with Taiwanese digraphia. Formatting it into UTF-8 is relatively simple (again assuming there are no inherent character-code problems with the digraphia representation). There is also a UTF-8 formatted Lang1028, so you can work with that as well if you want to try.
This is what I have in mind as well. I think only the UTF-8 version would work, as the non-Unicode Big5 encoding for Traditional Chinese also occupies the range used by the ANSI-hacked roman fonts.
Again, thanks for all the suggestions.
n/a
deleted
593 Posts
Posted - 31 May 2002 : 16:58:03
hank,
If you are interested in taking a look at a UTF-8 formatted language pack in Chinese, along with ISO-8859-1/Big5 ones, please visit the link below (my siggy link) and download them. (You have to register/sign in first to get to this.)
One misc item: you will see some flexibility in handling different character fonts/sets, including login names.
I assume you have a front-end-processor type input method for Taiwanese... Japanese has (in the local machine/OS environment, with both hardware and software keyboard functionality) what is called kana-kanji henkan (conversion). What it basically does is take roman input, map that phonetic input to the corresponding phonetic representation, and then convert it to the appropriate kanji (hanzi) if needed, usually with a character table containing all the phonetically associated character representations. MS takes this as its basic approach in its Global IME for CJK, and a few Chinese (both Traditional and Simplified) input/editor facilities come with it. Is any of this applicable/usable, I mean, for writing/editing your Taiwanese langpack? If yes, then there is a possibility that you can use it to compile the Taiwanese language pack and package it in UTF-8, although again there are some special characters you may have to handle. (As Bozden mentioned, there are special Turkish characters which do not render well as-is in UTF-8 unless you do some hardcoding, meaning a simple codepage conversion of the language file is not enough by itself.)
((Curious: I bet IBM has some way to handle your Taiwanese codepage/charsets, given that they are the grand DaddyO of the Taiwanese PC market and of Asian double-byte codepage development out of IBM Japan's Yamato Lab... Maybe they have some character-conversion tools that you could use? Just a wild guess.))
Anyway, hope everything goes well with your efforts....
Taku i2Asia <>||TestingOut Multilingo Snitz||<>
n/a
deleted
593 Posts
Posted - 31 May 2002 : 17:09:06
hank, forgot to mention:
Look for the Snitz Zone when you get into the i2AsiaForum, and you will immediately see the forum for languages.
Taku i2Asia <>||TestingOut Multilingo Snitz||<>