[jcifs] Character Set discussions

Wed Feb 5 14:11:02 EST 2003


> -----Original Message-----
> From:	Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> Sent:	Tuesday, February 04, 2003 9:37 PM
> To:	Allen, Michael B (RSCH)
> Cc:	jcifs at lists.samba.org
> Subject:	Re: [jcifs] Character Set discussions
> 
> On Tue, Feb 04, 2003 at 08:17:48PM -0500, Allen, Michael B (RSCH) wrote:
> :
> > > > odd character but for regularly occurring Unicode it's just insanity.
> > > 
> > > Why?  The user only ever sees it if they need to escape something, which
> > > (since they are using Unicode) would only happen if the character is a
> > > reserved character within the ASCII set.
> > > 
> > For non-Latin users I think this would happen pretty frequently although I
> > don't know how they're getting around the problem right now. They are
> > probably restricting themselves to just using ASCII. But that's not an
> > option for SMB URLs.
> 
> I would think that non-Latin users would have keyboards and software that 
> allow them to enter the Unicode characters they need.
> 
	But you will still run into situations where the encoding of files or protocol
	transport does not permit Unicode (like right now with web browsers).

> There is *supposed* to be a header declaring the encoding of the file (if
> it's in HTML, for example).  It will, as you suggest, take the Latin world
> a while to get used to this.
> 
	And HTML has the META tag. I think ultimately this is what should happen.
	Everything should just be widened to Unicode where the encoding is left
	undefined or, if it's a protocol transport like HTTP, they pick UTF-8 or negotiate
	an encoding. I'm in favor of this type of solution actually. But it will take a long
	time. And of course there might be some awful problem with it. I am not well
	versed in such things. We really need to check what the status is. I wouldn't be
	surprised to hear there is just little or no interest right now.

> > > Absolutely not.  That's why they need to be able to enter it as Unicode
> > > text, not as escapes.
> > > 
> > Again, same issue of serialization. But if everyone displays Unicode then
> > we'd be ok. So far Linux isn't up to the task. Actually Red Hat 8 uses a
> > UTF-8 locale now by default so I suppose that is changing. The big
> > question is the browsers. It all hinges on what they support.
> 
> Yes, probably.  With China going Linux, though, I think we'll see a lot
> more emphasis on Internationalization.  Also a lot more emphasis on IPv6.
> 
	Well, China is special. They have a BIG character set. It's something like 30,000
	glyphs. I seriously doubt the Linux Unicode fonts are up to the task of
	displaying Chinese. They will be using Big5 and GB character sets and fonts
	for some time still. That's what all the Chinese software seems to support. You
	have to run in that locale for it to work. This is actually a really good argument
	for not trying to escape Unicode in URLs. It would involve converting to Unicode
	from GB or Big5, escaping them, and then unescaping and converting back. I
	guess the converting isn't that big of a deal because there would probably be a
	bit of converting going on anyway but with Chinese *every* character would end
	up being escaped which just stinks. Even if you display it with the regular Chinese
	glyphs developers would just have a fit. There's no point really. They might as
	well just run in "ASCII" mode (I mean with CIFS/jCIFS) if that's even possible with
	GB and Big5 since they're not 8-bit character encodings. Mmm, I wonder how they
	support that now. Yikes. I don't even want to think about that.
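
	To make the "every character gets escaped" point concrete, here is a rough
	sketch in Java of the round trip described above: legacy bytes to Unicode,
	escape, unescape, back to legacy bytes. The class and method names are made
	up for illustration, and it assumes the escapes would carry UTF-8 bytes, which
	nobody has settled on. It also leans on java.net.URLEncoder, which does HTML
	form encoding (a space becomes '+') rather than strict URL path escaping, and
	has nothing to do with how jCIFS handles URLs.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical helper class; not part of jCIFS.
final class UrlEscapeSketch {

    // Percent-escape a Unicode string, assuming the escapes carry UTF-8 bytes.
    static String escape(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is required on every JVM
        }
    }

    // Reverse of escape(): percent-escapes back to a Unicode string.
    static String unescape(String escaped) {
        try {
            return URLDecoder.decode(escaped, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    // Legacy bytes -> Unicode -> escaped -> Unicode -> legacy bytes.
    static byte[] roundTrip(byte[] legacyBytes, Charset legacy) {
        String unicode = new String(legacyBytes, legacy);
        String escaped = escape(unicode);
        return unescape(escaped).getBytes(legacy);
    }

    public static void main(String[] args) {
        String name = "\u4e2d\u6587"; // two Chinese characters
        String escaped = escape(name);
        System.out.println(escaped); // %E4%B8%AD%E6%96%87
        System.out.println(name.length() + " chars -> " + escaped.length() + " chars");

        // GBK is an optional charset, so check before using it.
        if (Charset.isSupported("GBK")) {
            Charset gbk = Charset.forName("GBK");
            byte[] legacy = name.getBytes(gbk);
            System.out.println(Arrays.equals(legacy, roundTrip(legacy, gbk)));
        }
    }
}
```

	Two characters turn into 18 characters of escapes, and a name written
	entirely in Chinese would look like that from end to end.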

	Mike