[jcifs] Character Set discussions

Fri Feb 7 11:44:13 EST 2003


> -----Original Message-----
> From:	Glass, Eric [SMTP:eric.glass at capitalone.com]
> 
> > ASCII. I was sort of hoping they would just expand that to 
> > UTF-8 without
> > requiring escaping.
> >
> 
> This is the basic concept behind IRIs; see
> 
> http://www.w3.org/International/O-URL-and-ident
> 
> and
> 
> http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html
> 
> especially the section on legacy URI handling.  This is currently a draft
> specification, however.
> 
	<snip>

> This is the functionality provided by IRIs -- an interpretation of unescaped
> characters outside the ASCII set.  But at the current time, for existing
> URIs, representation must be made using members of the set of valid URI
> characters.  This means encoding and escaping; preferably, the URL scheme
> specifies a standard encoding, and preferably UTF-8.
> 
	As usual, this is very good information, Eric. I'm glad to hear about the IRI initiative.
	That will be particularly important to the SMB URL.

	However, despite our recent discussion of HTTP and its handling of HTTP URLs, let's
	not forget that SMB is not HTTP and these SMB URLs will not be submitted in GET
	requests (not directly). It's the HTTP transport that is requiring normalization to
	ASCII. If (when) the browsers support the SMB URL they will use the local CIFS
	client directly which again does not require escaping. Therefore I do not believe the
	jCIFS package should be encoding/decoding or escaping/unescaping Unicode
	characters in SMB URLs (if that's something being suggested; don't know).

	Of course, we know there are instances where it is desirable to escape the
	authority and path components of SMB URLs, such as with NetworkExplorer-like
	applications that append these components to the HTTP URL for extraction as the
	PATH_INFO field. In this case, however, I believe the standard URL encoding is what
	should be used.
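
	To make the PATH_INFO case concrete, here is a minimal sketch of escaping the
	authority and path of an SMB URL before appending them to an HTTP URL. The class
	and method names (SmbPathInfo, encode, decode) are hypothetical and not part of
	jCIFS, and the choice of UTF-8 as the byte encoding is an assumption. It uses
	java.net.URLEncoder, which performs form encoding, so the '+' it emits for a
	space is rewritten to %20 to suit a path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical helper, not part of jCIFS.
class SmbPathInfo {

    // Percent-encode each '/'-separated segment of "server/share/dir/file"
    // as UTF-8 (an assumption), leaving the separators themselves intact.
    static String encode(String smbPath) {
        try {
            StringBuilder out = new StringBuilder();
            String[] segments = smbPath.split("/", -1);
            for (int i = 0; i < segments.length; i++) {
                if (i > 0) {
                    out.append('/');
                }
                // URLEncoder writes a space as '+' and a literal '+' as %2B,
                // so every '+' in its output stands for a space.
                out.append(URLEncoder.encode(segments[i], "UTF-8").replace("+", "%20"));
            }
            return out.toString();
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reverse of encode(). Only valid for strings produced by encode(),
    // because URLDecoder would turn a raw '+' into a space.
    static String decode(String pathInfo) {
        try {
            return URLDecoder.decode(pathInfo, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String raw = "server/share/my docs/r\u00e9sum\u00e9.txt";
        String escaped = encode(raw);
        System.out.println(escaped);
        System.out.println(decode(escaped).equals(raw));
    }
}
```

	Note that a servlet container will typically have decoded PATH_INFO already by
	the time the application sees it, so decode() is shown only to demonstrate the
	round trip.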

	Is everyone in agreement on this or am I missing some important use case?

	Mike

	PS: The "JAVA" encoding is actually UTF-16 with a byte order mark. Don't know
	why I said UCS-2. Incidentally, even though it is widely believed that CIFS uses
	UCS-2LE for encoding path names, I do not believe it has ever been conclusively
	determined that they are not UTF-16LE in some places. UTF-16LE is the native character
	encoding on the Windows platform, so it's conceivable that its use may have crept
	into the higher level functions. I recall the latest SNIA document also suggests it is
	UTF-16LE.
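
	The byte-level difference can be seen with a short sketch. It assumes a JDK with
	java.nio.charset.StandardCharsets (Java 7 or later), and assumes the standard
	"UTF-16" charset is the encoding meant above. Utf16Demo and hex are hypothetical
	names. Java's "UTF-16" encoder writes a big-endian byte order mark, while
	"UTF-16LE" writes none. A character outside the Basic Multilingual Plane becomes
	a four-byte surrogate pair in UTF-16LE, which strict UCS-2 cannot represent.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of jCIFS.
class Utf16Demo {

    // Render bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "UTF-16" encodes big-endian and prepends the BOM FE FF.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41

        // "UTF-16LE" encodes little-endian and writes no BOM.
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00

        // U+1D11E (musical G clef) lies outside the BMP, so it is encoded
        // as the surrogate pair D834 DD1E, here in little-endian byte order.
        String clef = new String(Character.toChars(0x1D11E));
        System.out.println(hex(clef.getBytes(StandardCharsets.UTF_16LE))); // 34 D8 1E DD
    }
}
```

	For path names restricted to the Basic Multilingual Plane, UCS-2LE and UTF-16LE
	produce identical bytes, which is one reason the question is hard to settle by
	observation alone.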