[jcifs] [Unicode] Creating file with hash ('#') in filename

Thu Jan 16 16:06:09 EST 2003

> -----Original Message-----
> From:	Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> 
> >	They will be used where UNC paths are used. We cannot mandate that
> >	spaces, '@', '#' and unicode characters be escaped.
> 
> On the one hand, you are correct.  It's a pain.  On the other hand, that's 
> the nature of URLs.  The current RFC doesn't talk about Unicode, though.  
> I imagine that characters outside of the US ASCII set would not need to be 
> escaped in this situation.  Not if there's a proper Unicode 
> representation.
> 
> You *do* get into a problem when using extended ASCII, however.  The 
> different DOS codepages use different octet values for extended 
> characters.  They don't all map to Latin1 either.  So here's the problem:
> 
>   Someone enters a filename which includes the character 'Ö' (that's 
>   O-umlaut).  In the Latin1 character set (and in Unicode), the value is
>   0xD6, but in DOS Code Page 437 it's 0x99.
> 
>   So the question is: how do you read something like this:
> 
>   smb://server/share/path/Övertone.spew
> 
>   jCIFS would have to know the character set in use at the terminal (maybe 
>   you can do that...if so, it's not a problem) in order to figure out 
>   that the octet value 0x99 maps to Unicode character U+00D6 (which goes
>   on the wire as the bytes D6 00, because Microsoft uses UCS-2LE encoding,
>   which is two bytes, little-endian--so the bytes are reversed).
> 
	This is all totally irrelevant. As far as jCIFS is concerned, character encoding
	is handled internally. The user's terminal uses the encoding set by the
	LC_CTYPE locale variable; Java sets its file.encoding property from that and
	converts anything read in to Unicode. Once it's Unicode you're home free. It
	doesn't matter how the Unicode codes are then encoded, whether it be UCS-2LE
	or UCS-4BE or UTF-8.
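
To make the point above concrete, here is a minimal sketch of the decoding step in plain Java. It is not jCIFS code: the class and method names are made up for illustration, and it assumes the JVM ships the extended "IBM437" charset (full JDKs do, stripped-down runtimes may not). It shows the CP437 octet 0x99 and the Latin1 octet 0xD6 both becoming the same Unicode character, which can then be encoded however the wire format requires.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of jCIFS.
public class TerminalDecoding {

    // Decode raw terminal bytes with the named charset into a Java String.
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The same letter as typed on a CP437 terminal and on a Latin1 terminal.
        String fromCp437 = decode(new byte[] {(byte) 0x99}, "IBM437");
        String fromLatin1 = decode(new byte[] {(byte) 0xD6}, "ISO-8859-1");

        // Both decode to U+00D6, so the original terminal encoding no longer matters.
        System.out.println(fromCp437.equals(fromLatin1));
        System.out.println(Integer.toHexString(fromCp437.charAt(0)));

        // Encoding for the wire is a separate, independent step.
        byte[] wire = fromCp437.getBytes(StandardCharsets.UTF_16LE);
        System.out.println(Integer.toHexString(wire[0] & 0xFF) + " "
                + Integer.toHexString(wire[1] & 0xFF));
    }
}
```

The charset names are passed in explicitly here only to keep the example deterministic; in a real program the platform default would normally be picked up from the locale.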

> So, as I said, it's still a problem if the command shell is not using 
> Unicode as well.  On the other hand, if you always read escapes as 
> Unicode, then
> 
>   smb://server/share/path/%D6vertone.spew
> 
> is not ambiguous.
> 
	Chances are that when (if) the URL specifications introduce Unicode support, the
	characters will be permitted to pass through without being escaped. The
	encoding will probably be UTF-8 for backward compatibility reasons. The
	way XML supports Unicode in URI entity references is to convert the
	extended character or characters into a UTF-8 sequence and encode
	each byte using the normal %HH escapes.
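
The UTF-8 escaping scheme just described can be sketched in a few lines of Java. This is an illustration only, not what jCIFS does: the class and method names are hypothetical, and the helper escapes only non-ASCII characters, leaving reserved ASCII characters such as '#', '%' and space untouched.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of jCIFS.
public class Utf8Escape {

    // Convert the string to UTF-8 and write every byte >= 0x80 as a %HH escape.
    // ASCII bytes are copied through unchanged (reserved characters included).
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            if (v < 0x80) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00D6 is the O-umlaut from the example path above.
        System.out.println("smb://server/share/path/"
                + escapeNonAscii("\u00D6vertone.spew"));
    }
}
```

Under this scheme the O-umlaut becomes the two-byte sequence %C3%96 instead of the single %D6 escape suggested earlier, so a decoder has to know which of the two conventions is in use.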