China was first united in 221 BC, when Shi Huangdi, ruler of Qin, defeated the last rival warlord. During his reign as China's first emperor, he ordered the construction of the Great Wall and built a tomb guarded by thousands of terracotta soldiers. He also left another, no less significant, monument: the first standardisation of Chinese script, which, until then, had used more character sets than there had been warring states in China.
In 1985, researchers at Xerox began a similarly ambitious project: creating a standard for encoding all the characters used by the world's major languages, so that computers anywhere could correctly represent text entered in any language. The 16-bit code set, later dubbed Unicode, includes Arabic, Bengali, Chinese and many other scripts, as well as ASCII. But the scheme ran into trouble: Unicode's 65,536 available codes could not accommodate the 75,000-plus characters used in Chinese, Japanese and Korean. So the standard's designers decided that only about 20,000 of them would be included, a decision that has enraged the East Asian computing community, which perceives it as insensitive.
Chinese and Japanese programmers complain that their scripts have been treated unfairly. Greek, Latin and Cyrillic letters received their own code points, they say, so why compromise the set of Chinese characters? It was as if the architects of a building chose the best rooms for themselves, then let everyone else crowd into the remaining broom cupboard.
In Unicode's defence, its designers point out that the first few thousand Chinese characters it supports cover all those used regularly in East Asian countries. Even words found more often in dictionaries than in conversation, as well as specialised, historical or archaic proper names, were encoded. And about 8,000 codes were set aside for characters considered mistaken, useless or largely forgotten.
Besides, the alternative - a 32-bit standard - contains 4.3 billion possible codes, which is impractical for most computers. The world's scripts would fit best in, say, an 18- or 19-bit standard, says Asmus Freytag, Unicode's vice president for marketing. But, he points out, computers were designed to work with powers of two, so selecting 16 bits - and 20,000 characters - was the best, if not the perfect, solution.
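The trade-off Freytag describes is plain arithmetic: an n-bit code can distinguish 2^n values, so capacity jumps in powers of two. A minimal Python sketch (not part of any Unicode tooling, just a check of the figures quoted above):

```python
# Capacity of an n-bit fixed-width code: 2**n distinct values.
for bits in (16, 18, 19, 32):
    print(f"{bits}-bit code: {2**bits:,} possible characters")

# 16 bits give 65,536 codes - too few for 75,000-plus Chinese characters.
# 18 or 19 bits would fit the world's scripts, but hardware favours
# power-of-two widths, and 32 bits (4,294,967,296 codes) is overkill.
```

The jump from 2^16 to 2^32 explains why no intermediate width was on the table: a character either fits in two bytes or costs four.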
The debate centres on seemingly minor matters - the inclusion or exclusion of variant or little-used characters. But for societies, small differences have always held a larger-than-life importance. Details, after all, are what distinguish groups from one another. That is why societies remain loyal to their beliefs and customs, at times in the face of what others might consider reasonable.
The Chinese characters that Unicode leaves out may be incorrect, outdated or irrelevant. It is difficult, however, to dispel the concern that people's cultural heritage has been truncated or mutilated for the sake of computers' convenience. It raises a provocative question: are we allowing our cultures to be shaped along lines set by our own inventions?
The emotions Unicode's adversaries draw upon are potent. Yet the standard's backers, which include major computer multinationals, are also influential. The code set is appearing in operating systems like Windows NT and Apple's Copland. Version 2.0, scheduled to be released this autumn, combines codes to support a possible additional one million characters. These will be given to less widely used scripts, dead ones or new ideographs. But it is too late for the remaining 40,000 or so rejected Chinese characters.
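The "combined codes" in Version 2.0 work by pairing two reserved 16-bit values - later called surrogates - to address characters beyond the 65,536 limit; 1,024 high values times 1,024 low values yields 1,048,576 extra characters, the "possible additional one million". A sketch of the arithmetic, assuming the surrogate ranges that were eventually standardised (0xD800-0xDBFF and 0xDC00-0xDFFF):

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a character code above 0xFFFF into two 16-bit surrogate codes."""
    assert 0x10000 <= code_point <= 0x10FFFF
    offset = code_point - 0x10000      # 20 bits left to encode
    high = 0xD800 + (offset >> 10)     # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits -> low surrogate
    return high, low

# 1,024 high x 1,024 low surrogates cover roughly one million characters
print(f"extra capacity: {1024 * 1024:,}")
print(to_surrogate_pair(0x10000))  # the first character beyond the 16-bit limit
```

The scheme keeps the code 16 bits wide on the wire while quietly abandoning the one-character-one-code simplicity the standard began with.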
Like Shi Huangdi centuries ago, computer technology today is imposing its own unifying order. People's pride in their differences, though, should be neither disregarded nor underestimated. Standards are important, but people rebel if they feel they're being jammed into the ordered spaces of circuitry. Unicode's Chinese character quandary will hardly be the last time the fuzzy thinking of humans battles the logic of their machines.
José Manuel Tesoro covers technology for Asiaweek.