[jcifs] Character Set discussions

Michael B. Allen miallen at eskimo.com
Wed Feb 5 07:49:47 EST 2003


On Tue, 4 Feb 2003 11:44:34 -0600
"Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:

> Eric,
> 
> That's the missing piece.  Thanks.  I can dig into that now.

Not so fast, speedy. I don't think the UTF-8 technique is intended to
address representing Unicode in HTTP URLs. It's fine for the occasional
odd character, but for regularly occurring Unicode it's just insanity. I
would first confirm whether or not browsers support such a thing for HTTP
URLs, or find out what their plans are. After all, it would be nice if
they supported the SMB URL some day.

Now, just because this message would not be complete without a
pedantic rant, here's one. Let's say you have an SMB URL with Slovak
characters. I'm not sure how this will look in your e-mail in UTF-8
(probably garbage, because most of us are still fixated on ISO-8859-1
at the moment) but here it is:

  smb://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip

I've also attached a file with the exact byte sequence and here's a
hexdump too:

  00000:  73 6d 62 3a 2f 2f 73 76 72 2f 73 6c 6f 76 61 6b  |smb://svr/slovak|
  00010:  2f 6d c3 b4 c5 be 65 6d 2f 6a 65 73 c5 a5 2f 73  |/m....em/jes../s|
  00020:  6b 6c 6f 2f 6e 65 7a 72 61 6e c3 ad 2f 6d 61 2e  |klo/nezran../ma.|
  00030:  7a 69 70                                         |zip             |
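
For anyone who wants to reproduce the dump above without the attachment,
here is a minimal Python sketch. The helper name hexdump_lines is made up
for illustration, and the layout only approximates the dump shown here.

```python
# The Slovak SMB URL from this message, encoded as UTF-8.
url = "smb://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip"
data = url.encode("utf-8")


def hexdump_lines(data, width=16):
    """Return hexdump-style lines (offset, hex bytes, printable ASCII).

    Hypothetical helper, written only to mirror the dump in this message.
    """
    pad = width * 3 - 1  # room for 'xx ' per byte, minus the last space
    lines = []
    for off in range(0, len(data), width):
        chunk = data[off:off + width]
        hexes = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as '.'
        text = "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)
        lines.append(f"{off:05x}:  {hexes:<{pad}}  |{text:<{width}}|")
    return lines


for line in hexdump_lines(data):
    print(line)
```

Each accented letter occupies two bytes in the output, which is why the
ASCII column shows a pair of dots for every one of them.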

Regarding UTF-8 encoding, the c5 be sequence is the bit pattern 11000101
10111110. The first byte starts with two bits on followed by a zero,
meaning the sequence consists of two bytes, so we keep only the x bits in
the pattern 110xxxxx 10xxxxxx. That leaves 00101 111110, which is the
Unicode character U+017E (a little z with an upside-down caret). OK, now
that I've established my pedantry, the URL escape for this in UTF-8 is
%C5%BE. This would make the URL look like:

  smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip
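
The escaped form above can be reproduced with Python's standard
urllib.parse module. This is only a sketch of the "UTF-8 first, then %HH"
approach; treating ':' and '/' as safe is my assumption for this example,
not something the SMB URL draft spells out.

```python
from urllib.parse import quote, unquote

url = "smb://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip"

# Encode to UTF-8 and %HH-escape every octet outside the unreserved set.
# ':' and '/' are kept literal so the scheme and path separators survive
# (an assumption made for this illustration).
escaped = quote(url, safe=":/", encoding="utf-8")
print(escaped)

# Reversing the escape yields the original string again.
restored = unquote(escaped, encoding="utf-8")
```

The escaped string is 67 characters for a 47-character URL, since each
accented letter grows from one character to six.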

That's pretty ugly. Think people would want to work with URLs like that
on a regular basis?
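
For completeness, the c5 be arithmetic from a few paragraphs up can be
checked mechanically. The function name decode_two_byte is hypothetical,
and this sketch handles only the two-byte form 110xxxxx 10xxxxxx; it is
not a full UTF-8 decoder.

```python
def decode_two_byte(lead, trail):
    """Decode one two-byte UTF-8 sequence into a code point.

    Illustrative only: it does not reject overlong encodings
    (lead bytes 0xC0 and 0xC1) and ignores all other sequence lengths.
    """
    # The lead byte must look like 110xxxxx, the trailing byte like 10xxxxxx.
    if (lead & 0b11100000) != 0b11000000:
        raise ValueError("lead byte is not of the form 110xxxxx")
    if (trail & 0b11000000) != 0b10000000:
        raise ValueError("trailing byte is not of the form 10xxxxxx")
    # Keep the 5 payload bits of the lead byte and the 6 of the trailing byte.
    return ((lead & 0b00011111) << 6) | (trail & 0b00111111)


cp = decode_two_byte(0xC5, 0xBE)
print(f"U+{cp:04X}")
```

The same function applied to c3 b4 gives U+00F4, the o with circumflex in
the second path segment.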

> On Tue, Feb 04, 2003 at 05:45:13AM -0500, Glass, Eric wrote:
> > Chris/All,
> > 
> > I think this was discussed previously, but I figured I'd bring it up again
> > in the interests of pedantry ;).  The SMB URL draft currently specifies the

Woohoo! All right! Here we go ...

> > following with regards to encoding non-ASCII characters in paths:
> > 
> >    NetBIOS names, share names, and the directory paths and filenames
> >    offered by an SMB server may all contain characters from outside the
> >    7-bit US-ASCII character set.  Applications MUST support the use of
> >    the URL escape sequence as described in [RFC2396] to accommodate
> >    octet values that represent non-US-ASCII characters.
> > 
> > RFC 2396 doesn't appear to address non-ASCII characters; actually, it
> > states:
> > 
> >    This document does not discuss the issues and recommendation for dealing
> >    with characters outside of the US-ASCII character set [ASCII]; those
> >    recommendations are discussed in a separate document.
> > 
> > The "separate document" appears to be RFC 2718, which states:
> > 
> >    When describing URL schemes in which (some of) the elements of the
> >    URL are actually representations of sequences of characters, care
> >    should be taken not to introduce unnecessary variety in the ways
> >    in which characters are encoded into octets and then into URL
> >    characters.  Unless there is some compelling reason for a
> >    particular scheme to do otherwise, translating character sequences
> >    into UTF-8 (RFC 2279) [3] and then subsequently using the %HH
> >    encoding for unsafe octets is recommended.
> > 
> > 
> 
> -- 
> Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
> jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
> ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
> OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org


-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived. 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: slovak.utf8
Type: application/octet-stream
Size: 52 bytes
Desc: not available
Url : http://lists.samba.org/archive/jcifs/attachments/20030204/7ca18d3e/slovak.obj

