Jun 20

“Fun” with Loony-Code (Unicode)

Tag: Windows and Linux Development
Bill @ 3:27 pm

Introduction

I have been trying to finish porting our Teleplex CTI (“Computer Telephony Integration”) server from Windows to Linux, which involves about 230,000 lines of C++. The user interface and a number of other things are implemented in a portable way, thanks to wxWidgets, so most of the porting is complete. The last step is to make the client-server APIs work between Windows and Linux. In the process of investigating this, I found out a few things I didn’t know about Unicode.

Audience

Serious software applications are very complicated these days, and it is impossible for anyone to follow every single detail with every compiler and every operating system. I hope there are a few useful tidbits of information for developers and project managers who are developing or porting internationalized applications.

The Problem

Our software uses MBCS (Multi-Byte Character Set) for anything related to the telephone-end of things. Data for telephone calls is always ASCII (telephone numbers, dial-in numbers, Caller-ID, and so on), with the exception of SIP (VoIP) calls, which might contain international characters. And, the user interface in the server software happened to end up being MBCS. The APIs, however, are Unicode-based because there is potentially non-English data flying around the network, such as customer names, agent names, and other data. For years, I have religiously used all of the Microsoft macros such as “_TEXT”, “_T”, and “_stprintf”. These macros allow you to build either ASCII/MBCS or Unicode versions of your software just by changing a build option. The final step in my project ought to be easy, I thought.

To my distress, I quickly discovered that Linux uses UTF-32 (32-bit Unicode), versus UTF-16 (16-bit Unicode, as used in Windows). I had always assumed that UTF-32 was some aberration invented by RISC system vendors because they didn’t want to get a processor fault when you accessed some puny 16-bit piece of memory. So, if UTF-16 covers anything really worth covering in modern times, and even leaves some space for application-specific characters, everyone ought to be using it, right? Wrong! Just consider some of these truly awesome benefits (?) that UTF-32 has to offer:

  • You can express rarely used characters such as Sanskrit Wingdings in a single 32-bit quantity. This would require two 16-bit quantities with UTF-16, so using UTF-32 probably saves a few dozen heat-generating CPU cycles around the globe, thus halting the true cause of global warming.
  • “32-bit” is chic. “16-bit” is for old people (over 28).
  • There are at least nine researchers of defunct languages from around the globe who need a standard way of sorting characters such as “the asp with the golden toothpick” without having to also be concerned with CPU performance. UTF-32 apparently fulfills these requirements.

The slackers at Microsoft were only ambitious enough to try to conquer the Earth, so as mentioned before, Windows implements UTF-16—more than enough for ordinary Earthlings. The Linux community, however, decided to conquer the entire universe. Why? I’ll bet that, for example, Martians love Linux. There is not much to do in outer space, so locating, downloading, and compiling the “latest stable release” of  “packages” (software) that aren’t in your “distro” helps to pass the time between phases of the moons. Not to mention that Martians probably love Linux because it is “free as in ‘speech’, not as in ‘beer’”.  Anyway, I suppose it is often necessary to communicate with Martians efficiently because even with an uncongested, warp 3 Internet connection, it takes a whole minute to send a message (one way).

Consequently, the Linux community decided to use UTF-32 in memory, and UTF-8 in files.  (UTF-8 is just another way of encoding the same Unicode code points that UTF-16 and UTF-32 encode.)  Wow!!  Now we can send instant messages to Britney using her native Martian “emoticons”.  And, we can provide tech support, in both classic and modern Martian characters, to all those Martian grandmothers who don’t know the difference between an RPM and an EKG.

The Solution

Enough ranting and raving.  I need something to do.

Day One

If I switch over to full-Unicode, what will happen? I know I’ll have to make a few conversion routines, but the burning question is: “What is the worst case buffer size required for converting between DBCS, UTF-8, UTF-16, and UTF-32?”.  I found some useful information at unicode.org, especially in http://www.unicode.org/faq/utf_bom.html:

Encoding | Bytes Required (worst case) | Notes
ASCII | 1 | “Hello, world!” occupies 13 bytes (plus the trailing NUL).
DBCS | 2 | DBCS is “Double-Byte Character Set”, and includes Japanese. “Hello, world!” still occupies 13 bytes.
MBCS | 4 | “DBCS” and “MBCS” are synonymous in Japanese, but not necessarily in other languages. “Hello, world!” still occupies 13 bytes.  (The problem with DBCS/MBCS, by the way, is that you can’t figure out from the character code what language the character is from.  In other words, the codes are not portable between countries.)
UTF-8 | 4 | ASCII is always 1 byte, so there is no transmission overhead, except that you have to waste time converting it “to itself”. “Hello, world!” still occupies 13 bytes because it is really ASCII text. Japanese Kanji requires 3 bytes per character, and only special characters (meaning really weird) require 4 bytes.
UTF-16 | 2, if you don’t use special characters | Special characters are encoded as a “surrogate pair” (two 16-bit values, so 4 bytes).  “Hello, world!” occupies 26 bytes, and half of the bytes will be “0”, so you are penalized for English, but the penalty allows you to handle most characters in use around the globe.
UTF-32 | 4 | “Hello, world!” occupies 52 bytes. For English-only, 3/4 of the bytes contain “0”, and for other languages you are generally wasting 1/2 of the space. Surrogate pairs are not allowed. (Oh yes, and the trailing NUL character occupies four bytes.)
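
The worst-case figures above translate fairly directly into buffer-size arithmetic. The following is only a minimal sketch, not code from the Teleplex server, and every function name in it is made up for illustration. It counts the storage one code point needs and gives safe upper bounds for a few of the conversions (terminator not included).

```cpp
#include <cstddef>
#include <cstdint>

// Number of UTF-8 bytes needed for one Unicode code point (U+0000..U+10FFFF).
inline std::size_t Utf8BytesFor(std::uint32_t cp) {
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. accented Latin letters, Greek, Cyrillic
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane, incl. Kanji
    return 4;                    // supplementary planes
}

// Number of 16-bit units needed in UTF-16: one, or two for a surrogate pair.
inline std::size_t Utf16UnitsFor(std::uint32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}

// UTF-16 -> UTF-8: a single unit can grow to 3 bytes; a surrogate pair
// (2 units) becomes 4 bytes, so 3 bytes per unit is a safe bound.
inline std::size_t MaxUtf8BytesFromUtf16(std::size_t units) { return units * 3; }

// UTF-32 -> UTF-8: at most 4 bytes per code point.
inline std::size_t MaxUtf8BytesFromUtf32(std::size_t codePoints) { return codePoints * 4; }

// UTF-8 -> UTF-16: never more than one 16-bit unit per input byte.
inline std::size_t MaxUtf16UnitsFromUtf8(std::size_t bytes) { return bytes; }

// UTF-32 -> UTF-16: at most 2 units (one surrogate pair) per code point.
inline std::size_t MaxUtf16UnitsFromUtf32(std::size_t codePoints) { return codePoints * 2; }
```

For example, the 13 characters of “Hello, world!” held as UTF-16 never need more than MaxUtf8BytesFromUtf16(13) = 39 bytes as UTF-8, although in practice they need only 13.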

Armed with this scary information, I next tried a Unicode build (which I haven’t done for quite a while). The 752 errors that flew by made obvious the need for changes in the following areas:

  • The telephony libraries work only with ASCII, including file names. Modify code to convert file names stored internally as Unicode to ASCII to make the driver happy.
  • There are a few other places where we do file I/O using MBCS. Change a few hardcoded file names (such as the names of fixed audio files) from ASCII strings to Unicode strings.
  • Changing to a Unicode build broke many calls to Xerces-C (the XML library from http://www.apache.org). The Xerces-C people decided that because XML documents require a large amount of memory anyway, they would use only UTF-16, even if the platform is based on UTF-32.
     

    You convert between Xerces string objects and std::wstring objects with something like this:

        StringType1.assign(StringType2.begin(),StringType2.end());

    This will cram one string into the other, but only 16-bits worth.  This means I won’t be able to do Klingon Text-to-Speech.
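
If the characters beyond 16 bits ever matter, the copy has to combine surrogate pairs instead of widening each unit on its own. What follows is a rough sketch under assumptions: std::u16string stands in for the Xerces string type, the function name Utf16ToWide is invented, and a C++11 compiler is assumed.

```cpp
#include <cstddef>
#include <string>

// Convert UTF-16 code units to a std::wstring.
// std::u16string is a stand-in for the Xerces string type (an assumption).
inline std::wstring Utf16ToWide(const std::u16string& in) {
    const bool wideIs32Bit = sizeof(wchar_t) >= 4;
    std::wstring out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        // Combine a high/low surrogate pair into one code point, but only
        // where wchar_t is wide enough to hold the result (e.g. GNU C++ on Linux).
        if (wideIs32Bit && cp >= 0xD800 && cp <= 0xDBFF &&
            i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;  // the low surrogate has been consumed
        }
        out += static_cast<wchar_t>(cp);
    }
    return out;
}
```

With a 16-bit wchar_t (Windows) the units are copied unchanged, which is the same result the assign() call gives there.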

     

  • “Invert” lots of string types. For example, data from the telephony drivers is now stored internally as Unicode, so we have to convert to/from ASCII in a few places. Some AStrings (std::string) become WStrings (std::wstring), and vice versa, and some TStrings (which were typedef’d as AStrings, and which now compile as WStrings) were wrong, and really need to be just AStrings. I also need to modify some places that use the wxConvLocal conversion class from wxWidgets because I was using the wrong variant.
  • Boom!  Access Violation!  Fortunately, the few places that I broke were easy to find (see below).
     

    I should know better, but I forgot this problem again:

        pSomeString = wxConvLocal.cWX2MB(SomeText).data();  // dangling: the temporary buffer dies here

    The temporary buffer returned by wxConvLocal is destroyed at the end of the statement, so pSomeString isn’t valid. You can (1) use the wxConvLocal output directly as a function argument, (2) assign it to another string object (which will copy the data), or (3) keep the buffer alive yourself:

        wxCharBuffer SomeBuffer = wxConvLocal.cWX2MB(SomeText);
        pSomeString = SomeBuffer.data();

    Using pSomeString is ok as long as SomeBuffer still exists.

     

  • There is one place where I am even fiddling with shift-in/shift-out (the old-fashioned way of switching character sets) to handle “half-size katakana” caller name display with 7-bit characters. There are two possible approaches here: convert the input to 8-bit characters at that point, or figure out the Unicode value for the half-size katakana, and convert it later.
  • Unfortunately, there are no Unicode-equivalents of most basic network API functions such as “inet_ntoa”, so more conversion macros are needed in some network-related code.
  • Text-conversion macros are required for some ASCII-only error message providers (i.e., the telephony drivers). Fortunately, these were easy to find, and I ended up with something like this:
     

    if (gc_OpenEx((LINEDEV*)&dwDevHandle,szDeviceName,EV_SYNC,this) != GC_SUCCESS) {
        gc_ErrorInfo(&gcInfo);
        rc = gcInfo.gcValue;
        LogMsg(LOGF_ALWAYS|LOGF_ERROR, ::wxGetTranslation(IDRTE_UNABLE_TO_OPEN_DEVICE),
                pszDriverDevName,
                wxConvLocal.cMB2WX(gcInfo.gcMsg).data(),
                rc);
    }

    This soon became both tedious and ugly, plus I found another case (“ATDV_ERRMSGP”), so much of this was later shortened with macros.

  • Lots of new casts required. Many telephony driver function declarations are ancient, and specify “char*” parameters, whereas many of the string functions I use return “const char*”. The compiler doesn’t like this.
  • If you have already switched to Visual Studio 2005 or higher, you were probably annoyed by all of the warnings like “Your swprintf is no longer ISO-compliant. To disable this warning, add ‘#define _CRT_NON_CONFORMING_SWPRINTFS’”.  (That’s exactly what I did when I first upgraded to Visual Studio 2005.)  The ISO standard added “max length” as the second parameter. At least Microsoft was kind enough to add an overloaded version so you didn’t need to upgrade right away.
  • Bad news: You do have to upgrade your swprintf calls to work with GNU C++, so I made a “STPRINTF” macro that uses the second argument in a Unicode compile, but ignores it in the non-Unicode compile.

    #ifdef UNICODE
    #define STPRINTF(Dest,MaxChars,Format,...) swprintf(Dest,MaxChars,Format,__VA_ARGS__)
    #else
    /* MaxChars is deliberately ignored in the non-Unicode build */
    #define STPRINTF(Dest,MaxChars,Format,...) sprintf(Dest,Format,__VA_ARGS__)
    #endif
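
As an alternative to the macro, a pair of overloaded function templates can hide the same difference. This is only a sketch, assuming a C++11 compiler; the name StPrintf is invented here, and unlike the macro the narrow version honors the size by calling snprintf.

```cpp
#include <cstddef>
#include <cstdio>
#include <cwchar>

// Wide version: forwards to the ISO swprintf, which takes the buffer size
// (in characters) as its second argument.
template <typename... Args>
int StPrintf(wchar_t* dest, std::size_t maxChars, const wchar_t* format, Args... args) {
    return std::swprintf(dest, maxChars, format, args...);
}

// Narrow version: uses snprintf so the size limit applies here as well.
template <typename... Args>
int StPrintf(char* dest, std::size_t maxChars, const char* format, Args... args) {
    return std::snprintf(dest, maxChars, format, args...);
}
```

The overload is chosen by the type of the destination buffer, so a call such as StPrintf(Buffer, 64, L"%d calls", Count) reads the same in either build once the literals go through “_T()”. Because the argument pack may be empty, a call with only a format string compiles too.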

     

  • But, more bad news: Visual C++ allows zero optional arguments to variadic macros (macros with a variable number of arguments, indicated by the ellipsis), but Standard C++ does not. Therefore, I also defined a macro named “STPRINTF0”, which takes only the “Dest” and “Format” arguments. (There are places where I build SQL statements, and most of the time I am filling in things like column and table names, but once in a while I need to append a constant string such as a closing parenthesis. Therefore, I need this macro, too.)
  • vswprintf() has the same kind of problem with the extra size argument.
  • Linux has no wide-character file APIs (file names are byte strings, by convention UTF-8), so on Linux I need to do something different for “_tfopen” or “_wfopen”. There are about twenty places to fix, so the easiest way seems to be by replacing “_tfopen” with a new “FOPEN” macro. On Windows this maps to “_tfopen”, and on Linux I made a simple wrapper function that converts the file name and mode strings to UTF-8, and then calls fopen(). Similar logic is required for “_tremove”, “_trename”, “_topen”, and “_taccess”.
  • I happily stumbled upon <cctype> and <cwctype> from the Standard C++ library. These allow you to get rid of some MS-specific calls such as “_istalpha” by using std::isalpha() and std::iswalpha(), and I additionally made an “ISTALPHA” macro. (I suppose I should actually read all about the Standard C++ library someday.)
  • I lucked out on my user interface and database code. There were just a few lines where I was missing my “_T()” macro or something similar to that.
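
The Linux side of the “FOPEN” idea from the list above could look roughly like this. It is a sketch rather than the actual wrapper: the names WideToUtf8 and FopenWide are invented, and it assumes every wchar_t holds a complete code point, as with the 32-bit wchar_t of GNU C++ on Linux.

```cpp
#include <cstdio>
#include <string>

// Encode a wide string as UTF-8.
// Assumes each wchar_t is one full code point (32-bit wchar_t).
inline std::string WideToUtf8(const std::wstring& in) {
    std::string out;
    for (wchar_t wc : in) {
        const unsigned long cp = static_cast<unsigned long>(wc);
        if (cp < 0x80) {                       // 1 byte: ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical Linux-side wrapper: convert both arguments, then call fopen().
inline std::FILE* FopenWide(const wchar_t* name, const wchar_t* mode) {
    return std::fopen(WideToUtf8(name).c_str(), WideToUtf8(mode).c_str());
}

#ifndef _WIN32
#define FOPEN(Name, Mode) FopenWide(Name, Mode)
#endif
```

Doing the encoding by hand keeps the wrapper independent of the current locale. The same WideToUtf8 helper would serve the wrappers for “_tremove”, “_trename”, and the others.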

Day Two

I was able to rebuild on Windows, and the initial conclusions were:

  • The Windows executable file is 364K larger than before (because string constants are now twice as large), so the Linux executable is going to be 728K larger.
  • The runtime footprint is 10MB larger than before (and will be 20MB larger on Linux).
  • We can potentially generate hundreds of little files containing a backup of the real-time statistical data. These will now be twice as large.  I suppose this isn’t a big problem because we don’t use much disk space in the first place.
  • BUT!  Did my log files just double in size? Better check that. Or, should I leave it as Unicode in case we have to do support for a system on Mars?
  • Is the user interface faster on Windows (since Windows doesn’t have to do MBCS-to-Unicode conversion)? Answer: The difference is imperceptible.

Days Three & Four

I spent a good portion of the third day trying to make the changes compile on Linux. It would help if the compiler wasn’t so slow in the first place.

By the middle of the fourth day, I had fixed all the new compilation problems, and found:

  • I noticed that the output files on my Windows machine are now huge. Microsoft has an easy solution: When you open a file with “fopen” (or “_wfopen”), change the mode string from something like “w” (write) or “a” (append) to “w,ccs=UTF-8”. Then the runtime will write the data out in a more reasonable format. Therefore, I finally added a wrapper function for the Windows version of my new FOPEN that appends this option. Thank goodness, GNU C++ converts the data to UTF-8 automatically.
  • More bad news: Lots of messages are not displayed correctly on the Linux system. In Visual C++ you may have been using “%s” and expecting the runtime library to substitute either Unicode or ASCII, depending on the build mode. This doesn’t work with GNU: you have to change all of your “%s” and “%c” format items to “%ls” and “%lc” (which also works with Visual C++). This is the ANSI/ISO standard way of indicating a wide character string. Unfortunately, there is no standard way of saying “the other kind of string” (“%S”) as in Visual C++. This means that if all the formats are changed, you can no longer switch between an ASCII/MBCS and a UNICODE build without making a lot of “spaghetti” code.
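
To make the ISO rules concrete, here is a small sketch of a wide-character format call that mixes both kinds of string. The function and its message text are invented for illustration, and the behavior described is that of the ISO standard as implemented by GNU; Visual C++ traditionally reads “%s” in a wide format string as a wide string, so only the “%ls” part carries over unchanged.

```cpp
#include <cstddef>
#include <cwchar>
#include <string>

// Format one wide and one narrow string with the ISO conversions:
// %ls always takes a wchar_t string, %s always takes a char string.
inline std::wstring DescribeCaller(const wchar_t* agentName, const char* callerId) {
    wchar_t buffer[128];
    const int n = std::swprintf(buffer, sizeof(buffer) / sizeof(buffer[0]),
                                L"Agent %ls, caller %s", agentName, callerId);
    // swprintf returns a negative value on error, for example when the buffer is too small.
    return n < 0 ? std::wstring() : std::wstring(buffer, static_cast<std::size_t>(n));
}
```

A call such as DescribeCaller(L"Bill", "5551234") shows the pattern from the telephony code: the agent name is already wide, while the Caller-ID arrives from the driver as plain ASCII.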

Summary

The places where you can expect to have some porting problems include:

  • Anything to do with file opens, and anything to do with text file I/O.
  •  Anything to do with printf, sprintf, etc., especially the format strings.  (If you happen to prefer using C++ “streams”, your code might port a bit more easily. I am an old dog, and so I do things the old way.)
  • Network functions that work with IP addresses and/or host names.
  • Your own APIs that use UTF-16.
  • User interface and database code, perhaps.

If you did your work perfectly on Windows, then you won’t have to spend too much time trying to figure out a mysterious message on Linux, which goes something like: “Pango says you have an invalid UTF-8 character”. (There is no hint as to where, and it doesn’t even show you the part that was ok).

Conclusion

I still have to solve the use of UTF-16 in my APIs. Or, maybe I’ll forget that and move to Hawaii where the 12-letter alphabet fits in 5 bits.

Aloha

One Response to ““Fun” with Loony-Code (Unicode)”

  1. Josh Stern says:

    There are no standard interfaces on Linux that require UTF32 strings. Linux uses UTF8 to the extent that it has any Unicode standard. This is unlike Win32 where some platform APIs require UTF16. What I do for cross platform is to store everything internally as UTF8. I’m convinced this gives the best overall performance because of reduced memory bandwidth. On Win32, MultiByteToWideChar can be used for the occasional conversion when calling an API that requires UTF16, and it also makes sense to define MBCS rather than Unicode and convert chars to UTF8 where necessary as they are read in (i.e. check whether a string is all 7 bit chars and just convert it where that isn’t true).

