[jcifs] Creating file with hash ('#') in filename

Fri Jan 17 16:22:08 EST 2003

> -----Original Message-----
> From:	Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> 
> > But, at least one issue does remain. For all non-Latin1/ISO-8859-1 users
> > normal %HH escaping will result in a URL that is heavily polluted with escape
> > sequences. For non-Latin derived encodings that do not share Latin1
> > characters like Russian the *entire* path will be %HH escapes. Even for
> > Latin1/ISO-8859-1 users CIFS (unlike HTTP) has MANY instances in paths
> > that will need to be encoded and when you listFiles() you have to escape the
> > paths.
> 
> This is the Unicode issue.  URLs must handle full Unicode in order to deal 
> with this stuff.  It is a problem, but it's not specific to the SMB URL.
> 
	Yes, it does seem to be coming down to Unicode. But it is specific to the SMB URL.
	RFC 2396 does not appear to address Unicode. You cannot have Unicode HTTP URLs
	for example. Thus the problem.

> > > Note, however, that doing:
> > > 
> > > URL url = new URL("http://server/directory/file stuff.txt");
> > > URI uri = new URI(url.toExternalForm());
> > > 
> > > throws:
> > > 
> > > java.net.URISyntaxException: Illegal character in path at index 28:
> > > http://server/directory/file stuff.txt
> > > 	at java.net.URI$Parser.fail(URI.java:2701)
> > > 	at java.net.URI$Parser.checkChars(URI.java:2872)
> > > 	at java.net.URI$Parser.parseHierarchical(URI.java:2956)
> > > 	at java.net.URI$Parser.parse(URI.java:2904)
> > > 	at java.net.URI.<init>(URI.java:565)
> > > 
> > This is a good example. So they chose not to escape them and they don't
> > even have the problems we're facing. This suggests they weighed
> > transcribability over syntactic correctness.
> 
> I don't follow your argument there.  What is it you are trying to say?
> Sorry, I'm being dense, I guess.
> 
	What the example shows is that the HTTP URL handler shipped with Java does what
	jCIFS' SMB URL handler does now, which, as you put it, is syntactically incorrect, not
	a URL, and incomplete.

	Notice it's the URI class rather than the URL class that provokes the exception. I don't
	have Java 1.4 so I don't know what toExternalForm would return but I suspect it would
	be escaped.
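
A minimal sketch of the behavior under discussion, using only java.net classes. The class and method names (SpaceInUrlDemo, uriRejects, escapedForm) are made up for illustration. It shows the URL class accepting the space, the single-argument URI constructor rejecting it, and the multi-argument URI constructor quoting it. On the JDKs I am aware of, toExternalForm() returns the string unescaped, which is consistent with the exception message quoted above.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

class SpaceInUrlDemo {
    // True if the single-argument URI constructor rejects the string.
    static boolean uriRejects(String s) {
        try {
            new URI(s);
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    // Builds a URI from components; the multi-argument constructors
    // quote characters that are illegal in the path (a space becomes %20).
    static String escapedForm(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        // The URL class parses the string without complaint.
        URL url = new URL("http://server/directory/file stuff.txt");
        String external = url.toExternalForm();
        System.out.println(external);

        // Feeding the same string to URI(String) fails on the space.
        System.out.println(uriRejects(external));

        // Passing the path as a separate component lets URI do the escaping.
        System.out.println(escapedForm("http", "server", "/directory/file stuff.txt"));
    }
}
```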

> Looking at that example, however, I agree with the exception.  The space
> is an illegal character.  It is also a transcription problem, since spaces
> are difficult to convey properly in handwriting.  They also cause trouble
> with word wrap in e'mail and such.  That's why the RFC warns against them.
> 
> So, I would say that the above example is an argument in favor of both 
> transcribability and correct syntax.
> 
	I am arguing that transcribability is better without the escapes. In fact, if we
	were to require what the URI (not URL) class mandates, it would be rather difficult for
	the average person to construct certain URLs. They would then need application
	support for performing the escaping.

> > Unless someone points out a fatal flaw in not escaping these path components
> > or someone comes up with a very clever solution, I think we should just wait
> > to see how the current code performs. And if we do decide to escape these
> > URLs it's not going to be easy. If non-Latin1 characters are converted to
> > UTF-8 that's not going to be compliant either I'm afraid. RFC 2396 doesn't
> > support Unicode.
> 
> The fatal flaw is that it is not valid URL syntax.  Trying to be 'nice' to
> users is one thing, but we shouldn't be 'breaking' URL syntax to do it.
> 
	It's not "fatal". And jCIFS proves it (knock on wood!).
>   
> Besides, people (especially those who might use jCIFS as the toolkit to
> build their killer application) will expect to be able to hand in a valid
> URL with the smb:// prefix and get expected results.
> 
	I think the results are well defined. We just don't use escapes.

> Some folks (probably not the other early implementors like Apple and
> Thursby, but some folks...) are likely to look at jCIFS as the reference
> implementation.
> 
> ...and here's something else that confuses me.  It seems you're arguing in
> favor of avoiding the escapes for the sake of user convenience (whack me
> on the head if'n I'm misinterpreting that), but at the same time, the
> current code is pedantic, in places, about the trailing slash '/'.  
> People forget the trailing slash all the time.
> 
	Yes, that is a pain. Unfortunately that is mandated by Java's URL parser. So
	any other Java URL implementation will exhibit the same behavior. It's just
	less noticeable because of the way SMB URLs are used (listing directories).
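
One place the trailing slash visibly matters is relative resolution against a context URL. This may or may not be the exact behavior meant above, so treat it as an illustration only. The sketch uses the stock HTTP handler so that it runs without jCIFS installed; TrailingSlashDemo and resolve are made-up names.

```java
import java.net.MalformedURLException;
import java.net.URL;

class TrailingSlashDemo {
    // Resolves 'child' against 'base' with the two-argument URL constructor.
    static String resolve(String base, String child) {
        try {
            return new URL(new URL(base), child).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Without the trailing slash the last path segment is treated as a
        // file name and is replaced by the child.
        System.out.println(resolve("http://server/share/dir", "file.txt"));

        // With the trailing slash the child is appended below the directory.
        System.out.println(resolve("http://server/share/dir/", "file.txt"));
    }
}
```

The first call gives http://server/share/file.txt and the second gives http://server/share/dir/file.txt, which is why forgetting the slash on a directory leads to surprises.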

> > Incidentally here's a bit from the XML 1.0 spec:
> > 
> > 	4.1. URI Reference Encoding and Escaping 
> > 
> > 	The set of characters allowed in xml:base attributes is the same as for
> > 	XML, namely [Unicode]. However, some Unicode characters are
> > 	disallowed from URI references, and thus processors must encode and
> > 	escape these characters to obtain a valid URI reference from the attribute
> > 	value.
> > 
> > 	The disallowed characters include all non-ASCII characters, plus the
> > 	excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the
> > 	crosshatch (#) and percent sign (%) characters and the square
> > 	bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters
> > 	must be escaped as follows:
> > 
> > 	Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or
> > 	more bytes.
> > 
> > 	Any octets corresponding to a disallowed character are escaped with the URI
> > 	escaping mechanism (that is, converted to %HH, where HH is the hexadecimal
> > 	notation of the byte value).
> > 
> > 	The original character is replaced by the resulting character sequence.
> 
> Ah.  So, by extension, a URI string is an encoded UTF-8 string, where all 
> disallowed and non-ASCII characters (including all two-byte characters) 
> are escaped.
> 
	No. This is from the *XML* specification. It explicitly states that RFC 2396 does not
	support Unicode. There may be another RFC that does deal with that. I don't know. But
	currently URLs and URIs do not support Unicode. I would imagine many products *do*
	use UTF-8 conversion and escaping to get around that. They must. But that is not part
	of the standard.
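
For reference, the escaping procedure in the quoted spec text (convert each disallowed character to UTF-8, then write each byte as %HH) can be sketched as below. XmlBaseEscaper, isDisallowed and escape are made-up names, and the EXCLUDED set is my reading of RFC 2396 section 2.4.3 minus the four re-allowed characters, so check it against the RFC before relying on it. This is not what jCIFS does.

```java
import java.nio.charset.StandardCharsets;

class XmlBaseEscaper {
    // Assumed set: ASCII characters excluded by RFC 2396 section 2.4.3,
    // minus '#', '%', '[' and ']' which the quoted text re-allows.
    private static final String EXCLUDED = " <>\"{}|\\^`";

    // Non-ASCII, control characters (including DEL) and the excluded set.
    static boolean isDisallowed(int cp) {
        return cp > 0x7E || cp < 0x20 || EXCLUDED.indexOf(cp) >= 0;
    }

    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (!isDisallowed(cp)) {
                out.appendCodePoint(cp);
                continue;
            }
            // Convert the character to UTF-8 and escape every resulting byte.
            byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) {
                out.append(String.format("%%%02X", b & 0xFF));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("file stuff.txt"));  // space is escaped
        System.out.println(escape("a#b%c"));           // '#' and '%' pass through
        System.out.println(escape("\u042F"));          // a single Cyrillic letter
    }
}
```

A single Cyrillic letter becomes two escaped bytes (six characters), which illustrates the earlier point about Russian paths turning entirely into %HH sequences.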

> Well I certainly agree with Mike that that's a pain-in-the-woodpile.
> 
> I think--going by Eric's guidelines--that the correct way to handle
> non-ASCII characters is to simply accept them on the command line.  There 
> are still the codepage issues to deal with, but if the input can be read 
> as Unicode then I don't see a problem doing it this way.
> 
	There are no codepage issues. Everything in Java is Unicode unless specified
	otherwise using the file.encoding property. jCIFS will convert all Strings to the encoding
	specified by the jcifs.encoding property.
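
To illustrate the point about Strings being Unicode until they are converted, here is a small sketch. EncodingDemo, configuredEncoding and encode are made-up names. Reading jcifs.encoding through System.getProperty with a fallback to file.encoding is an assumption made for this example; it is not a description of how jCIFS reads its configuration internally.

```java
import java.nio.charset.Charset;

class EncodingDemo {
    // Assumption for illustration: look up jcifs.encoding as a system
    // property and fall back to file.encoding, then to UTF-8.
    static String configuredEncoding() {
        String fallback = System.getProperty("file.encoding", "UTF-8");
        return System.getProperty("jcifs.encoding", fallback);
    }

    // The String itself is Unicode; the byte representation depends
    // entirely on the encoding chosen at conversion time.
    static byte[] encode(String s, String encoding) {
        return s.getBytes(Charset.forName(encoding));
    }

    public static void main(String[] args) {
        String name = "caf\u00E9.txt";
        System.out.println(configuredEncoding());
        System.out.println(encode(name, "ISO-8859-1").length);
        System.out.println(encode(name, "UTF-8").length);
    }
}
```

The same String yields a different number of bytes depending on the target encoding, and characters that the target encoding cannot represent are replaced rather than preserved.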

	Mike