[jcifs] Creating file with hash ('#') in filename
Allen, Michael B (RSCH)
Michael_B_Allen at ml.com
Fri Jan 17 11:41:28 EST 2003
> -----Original Message-----
> From: Glass, Eric [SMTP:eric.glass at capitalone.com]
> Sent: Thursday, January 16, 2003 6:45 AM
> To: 'Allen, Michael B (RSCH)'; 'Christopher R. Hertel'
> Cc: jcifs at samba.org
> Subject: RE: [jcifs] Creating file with hash ('#') in filename
>
> From the standpoint of the robustness principle, I would look at it this
> way:
>
> The draft should reflect conformance to existing generic URL specifications.
> jCIFS should accept all SMB URLs which conform to the draft.
> jCIFS MAY accept SMB URLs which do not conform to the draft, but are
> "reasonably" broken.
> In an ideal world, jCIFS would be able to output a conformant URL given an
> accepted but nonconformant input (for compatibility with other agents).
>
I agree. Ideally this would be ...well, ideal. But there is really one major
problem with this. I thought there were two major problems, but after thinking
about it the performance argument isn't really an issue, because internally I
don't think escape processing would be necessary. So there would not really be
a performance impact on performance-sensitive IO operations like copying,
whereas previously I argued there would be.
But at least one issue does remain. For all non-Latin1/ISO-8859-1 users,
normal %HH escaping will result in a URL that is heavily polluted with escape
sequences. For encodings not derived from Latin1 and sharing no Latin1
characters, like Russian, the *entire* path will be %HH escapes. Even for
Latin1/ISO-8859-1 users, CIFS paths (unlike HTTP paths) have MANY characters
that will need to be encoded, and when you listFiles() you have to escape the
paths.
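To make the pollution concrete, here's a quick sketch (my own illustration, not
anything in jCIFS) using java.net.URI's multi-argument constructor, which
percent-encodes non-ASCII path characters as UTF-8 octets:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class EscapeDemo {
    public static void main(String[] args) throws URISyntaxException {
        // A hypothetical Russian filename; every Cyrillic character
        // becomes two %HH escapes once encoded as UTF-8.
        URI uri = new URI("smb", "server", "/share/файл.txt", null);
        System.out.println(uri.toASCIIString());
        // smb://server/share/%D1%84%D0%B0%D0%B9%D0%BB.txt
    }
}
```

A four-character filename turns into twenty-four characters of escapes, which
is the transcribability problem in a nutshell.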
> As an example, all agents should be able to accept this as a valid URL:
>
> smb://server/directory/file%23stuff.txt
>
> Likewise, all agents should be able to PARSE (although not necessarily
> accept as semantically valid):
>
> smb://server/directory/file#stuff.txt
>
> Assuming that the draft does not specify a semantic for fragments,
> interpretation of this may vary. jCIFS may append the fragment onto the
> filename; another agent may indicate that fragments are semantically
>
Incidentally, appending the #ref fragment to the filename is indeed what we do
as of 0.7.1.
> meaningless in the context of an SMB URL and give an error.
>
> jCIFS MAY be able to parse:
>
> smb://server/directory/file stuff.txt
>
> But other agents should not be required to do so. jCIFS (and all agents,
> for that matter) SHOULD be able to represent this as:
>
> smb://server/directory/file%20stuff.txt
>
> In an ideal world, doing:
>
> URL url = new URL("smb://server/directory/file stuff.txt");
> System.out.println(url.toExternalForm());
>
> would output:
>
> smb://server/directory/file%20stuff.txt
>
> to enable jCIFS to output compliant URLs regardless of what particular
> inconsistencies it chooses to accept. Sun's implementation of HTTP URLs
> does not do this, however, so I wouldn't expect jCIFS to be required to do
> so. For example, if I do:
>
> URL url = new URL("http://server/directory/file stuff.txt");
> System.out.println(url.toExternalForm());
>
> I get (the invalid):
>
> http://server/directory/file stuff.txt
>
> Note, however, that doing:
>
> URL url = new URL("http://server/directory/file stuff.txt");
> URI uri = new URI(url.toExternalForm());
>
> throws:
>
> java.net.URISyntaxException: Illegal character in path at index 28:
> http://server/directory/file stuff.txt
> at java.net.URI$Parser.fail(URI.java:2701)
> at java.net.URI$Parser.checkChars(URI.java:2872)
> at java.net.URI$Parser.parseHierarchical(URI.java:2956)
> at java.net.URI$Parser.parse(URI.java:2904)
> at java.net.URI.<init>(URI.java:565)
>
This is a good example. So they chose not to escape them, and they don't
even have the problems we're facing. This suggests they weighed
transcribability over syntactic correctness.
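For what it's worth, the java.net.URI class (new in 1.4) will produce the
compliant form if you hand it the components separately; its multi-argument
constructors quote illegal characters such as the space. A sketch (not a jCIFS
API, just the stock JDK class):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CompliantForm {
    public static void main(String[] args) throws URISyntaxException {
        // The multi-argument constructor quotes characters that are
        // illegal in a URI path, such as the space below.
        URI uri = new URI("http", "server", "/directory/file stuff.txt", null);
        System.out.println(uri.toString());
        // http://server/directory/file%20stuff.txt
    }
}
```

So the escaping machinery exists in the JDK; Sun simply chose not to apply it
in java.net.URL.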
Unless someone points out a fatal flaw in not escaping these path components,
or someone comes up with a very clever solution, I think we should just wait to
see how the current code performs. And if we do decide to escape these URLs,
it's not going to be easy. If non-Latin1 characters are converted to UTF-8,
that's not going to be compliant either, I'm afraid; RFC 2396 doesn't support
Unicode. Incidentally, here's a bit from the XML Base spec:
4.1. URI Reference Encoding and Escaping
The set of characters allowed in xml:base attributes is the same as for
XML, namely [Unicode]. However, some Unicode characters are
disallowed from URI references, and thus processors must encode and
escape these characters to obtain a valid URI reference from the attribute
value.
The disallowed characters include all non-ASCII characters, plus the
excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the
crosshatch (#) and percent sign (%) characters and the square
bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters
must be escaped as follows:
1. Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one
or more bytes.
2. Any octets corresponding to a disallowed character are escaped with the
URI escaping mechanism (that is, converted to %HH, where HH is the
hexadecimal notation of the byte value).
3. The original character is replaced by the resulting character sequence.
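Sketched in Java, that algorithm comes out roughly like this (my own
illustration; it leaves '#' and '%' untouched as the spec text says, and treats
the rest of the RFC 2396 excluded set as disallowed):

```java
import java.nio.charset.StandardCharsets;

public class XmlBaseEscape {
    // ASCII characters RFC 2396 excludes from URIs, minus '#' and '%',
    // which the spec text above re-allows.
    private static final String EXCLUDED = " <>\"{}|\\^`";

    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        // Step 1: convert the whole string to UTF-8 octets.
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xff;
            // Step 2: escape octets of disallowed characters as %HH.
            if (c >= 0x80 || c < 0x20 || EXCLUDED.indexOf(c) >= 0) {
                out.append(String.format("%%%02X", c));
            } else {
                // Step 3: allowed characters pass through unchanged.
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("файл b#c.txt"));
        // %D1%84%D0%B0%D0%B9%D0%BB%20b#c.txt
    }
}
```

Which illustrates the earlier point: a Cyrillic path escaped this way is almost
nothing but %HH sequences.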