[jcifs] Character Set discussions

Christopher R. Hertel crh at ubiqx.mn.org
Wed Feb 5 08:17:38 EST 2003

On Tue, Feb 04, 2003 at 03:49:47PM -0500, Michael B. Allen wrote:
> On Tue, 4 Feb 2003 11:44:34 -0600
> "Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:
> > Eric,
> > 
> > That's the missing piece.  Thanks.  I can dig into that now.
> Not so fast, speedy. I don't think the UTF-8 technique is intended to
> address representing Unicode in HTTP URLs. It's fine for the occasional

It's URLs in general, not just HTTP URLs.

> odd character but for regularly occurring Unicode it's just insanity.

Why?  The user only ever sees it if they need to escape something, which
(since they are using Unicode) would only happen if the character is a
reserved character within the ASCII set.

Also, the input encoding doesn't matter as long as the underlying system
knows what it is.  It could be UCS-2LE, for instance, and as long as the
terminal or browser knows that it can convert as necessary.
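A minimal sketch of that conversion, assuming Python 3. The helper name to_utf8 and the sample input are made up for illustration; the point is only that input in a known encoding (UTF-16LE here, standing in for UCS-2LE) can be turned into the UTF-8 bytes that get escaped in the URL:

```python
# "môžem", written with escapes so the example does not depend on the
# encoding of this message.
segment = "m\u00f4\u017eem"

# Hypothetical input as a UCS-2LE/UTF-16LE terminal might deliver it.
ucs2le_input = segment.encode("utf-16-le")


def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw input using its known encoding, then re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


utf8_bytes = to_utf8(ucs2le_input, "utf-16-le")
print(" ".join(f"{b:02x}" for b in utf8_bytes))  # 6d c3 b4 c5 be 65 6d
```

The printed bytes are the same ones that appear for that path segment in the hexdump quoted further down.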

> I would confirm that browsers do or do not support such a thing for HTTP
> URLs or find out what their plans are. After all it would be nice if
> they supported the SMB URL some day.

I imagine that their plans would be to support the spec., but yes it needs 

> Now just because this message would not be complete without a
> pedantic rant, here's one -- let's say you have an SMB URL with Slovak
> characters. I'm not sure how this will look in your e-mail in UTF-8
> (probably garbage because most of us are still fixated on ISO-8859-1
> at the moment) but here it is:
>   smb://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip

Yes, my mail program (or perhaps rxvt) displays that as garblage.

> I've also attached a file with the exact byte sequence and here's a
> hexdump too:
>   00000:  73 6d 62 3a 2f 2f 73 76 72 2f 73 6c 6f 76 61 6b  |smb://svr/slovak|
>   00010:  2f 6d c3 b4 c5 be 65 6d 2f 6a 65 73 c5 a5 2f 73  |/m....em/jes../s|
>   00020:  6b 6c 6f 2f 6e 65 7a 72 61 6e c3 ad 2f 6d 61 2e  |klo/nezran../ma.|
>   00030:  7a 69 70                                         |zip             |
> Regarding UTF-8 encoding, the c5 be sequence is the bit pattern 11000101
> 10111110, which starts with two 1 bits followed by a 0, meaning the
> sequence consists of two bytes, so we keep only the x bits in the pattern
> 110xxxxx 10xxxxxx. That leaves 00101 111110, which is the Unicode
> character U+017E (a little z with an upside-down caret). OK, now that
> I've established my pedantry, the URL escape for this in UTF-8 is %C5%BE.
> This would make the URL look like:
>   smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip
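The bit arithmetic Mike describes can be checked with a short sketch, assuming Python 3. The function name decode_two_byte_utf8 is hypothetical, and it handles only the two-byte case (it does not reject overlong forms, for instance):

```python
def decode_two_byte_utf8(lead: int, cont: int) -> int:
    """Return the code point encoded by a two-byte UTF-8 sequence."""
    # The lead byte must look like 110xxxxx, the continuation like 10xxxxxx.
    if (lead & 0b11100000) != 0b11000000:
        raise ValueError("not a two-byte lead byte")
    if (cont & 0b11000000) != 0b10000000:
        raise ValueError("not a continuation byte")
    # Keep the five x bits of the lead and the six x bits of the continuation.
    return ((lead & 0b00011111) << 6) | (cont & 0b00111111)


code_point = decode_two_byte_utf8(0xC5, 0xBE)
print(f"U+{code_point:04X}")  # U+017E

# Percent-escape each UTF-8 byte of that one character.
escape = "".join(f"%{b:02X}" for b in chr(code_point).encode("utf-8"))
print(escape)  # %C5%BE
```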


> That's pretty ugly. Think people would want to work with URLs like that
> on a regular basis?

Absolutely not.  That's why they need to be able to enter it as Unicode 
text, not as escapes.
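As a rough illustration of keeping the escapes out of the user's sight, here is a sketch assuming Python 3 and its standard urllib.parse module. The helper names to_escaped and to_display are made up, and quoting the whole URL string in one call is a simplification; a real client would escape each component separately:

```python
from urllib.parse import quote, unquote


def to_escaped(url: str) -> str:
    """Form used on the wire: non-ASCII becomes %XX escapes of its UTF-8 bytes."""
    # safe=":/" leaves the scheme and path separators alone.
    return quote(url, safe=":/", encoding="utf-8")


def to_display(url: str) -> str:
    """Form shown to the user: escapes are turned back into Unicode text."""
    return unquote(url, encoding="utf-8")


# What the user types (written with escapes so the mail encoding is irrelevant).
typed = "smb://svr/slovak/m\u00f4\u017eem/jes\u0165/sklo/nezran\u00ed/ma.zip"
escaped = to_escaped(typed)
print(escaped)
# smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip
print(to_display(escaped) == typed)  # True
```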

> Whohoo! Allright! Here we go .....

I'm not sure what to make of that, Mike.  :)

Chris -)-----

Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org
