[jcifs] Creating file with hash ('#') in filename
Allen, Michael B (RSCH)
Michael_B_Allen at ml.com
Fri Jan 17 11:41:28 EST 2003
> -----Original Message-----
> From: Glass, Eric [SMTP:eric.glass at capitalone.com]
> Sent: Thursday, January 16, 2003 6:45 AM
> To: 'Allen, Michael B (RSCH)'; 'Christopher R. Hertel'
> Cc: jcifs at samba.org
> Subject: RE: [jcifs] Creating file with hash ('#') in filename
>
> From the standpoint of the robustness principle, I would look at it this
> way:
>
> The draft should reflect conformance to existing generic URL specifications.
> jCIFS should accept all SMB URLs which conform to the draft.
> jCIFS MAY accept SMB URLs which do not conform to the draft, but are
> "reasonably" broken.
> In an ideal world, jCIFS would be able to output a conformant URL given an
> accepted but nonconformant input (for compatibility with other agents).
>
I agree. Ideally this would be ...well, ideal. But there is really one major
problem with this. I thought there were two major problems, but after thinking
about it the performance argument isn't really an issue, because internally I
don't think escape processing would be necessary. So there would not really be
a performance impact on performance-sensitive IO operations like copying,
whereas previously I argued there would be.
But at least one issue does remain. For all non-Latin1/ISO-8859-1 users,
normal %HH escaping will result in a URL that is heavily polluted with escape
sequences. For encodings not derived from Latin1 and sharing no Latin1
characters, like Russian, the *entire* path will be %HH escapes. Even for
Latin1/ISO-8859-1 users, CIFS paths (unlike HTTP paths) have MANY characters
that will need to be encoded, and when you listFiles() you have to escape the
paths.
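To make the pollution concrete, here's a quick sketch (my own illustration, not
anything in jCIFS) using java.net.URI's multi-argument constructor, which
percent-encodes non-ASCII path characters as UTF-8 octets:

```java
import java.net.URI;
import java.net.URISyntaxException;

public class EscapeDemo {
    public static void main(String[] args) throws URISyntaxException {
        // A hypothetical Russian filename; every Cyrillic character
        // becomes two %HH escapes once encoded as UTF-8.
        URI uri = new URI("smb", "server", "/share/файл.txt", null);
        System.out.println(uri.toASCIIString());
        // smb://server/share/%D1%84%D0%B0%D0%B9%D0%BB.txt
    }
}
```

A four-character filename turns into twenty-four characters of escapes, which
is the transcribability problem in a nutshell.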
> As an example, all agents should be able to accept this as a valid URL:
>
> smb://server/directory/file%23stuff.txt
>
> Likewise, all agents should be able to PARSE (although not necessarily
> accept as semantically valid):
>
> smb://server/directory/file#stuff.txt
>
> Assuming that the draft does not specify a semantic for fragments,
> interpretation of this may vary. jCIFS may append the fragment onto the
> filename; another agent may indicate that fragments are semantically
>
Incidentally, appending the #ref fragment to the filename is indeed what we do
as of 0.7.1.
> meaningless in the context of an SMB URL and give an error.
>
> jCIFS MAY be able to parse:
>
> smb://server/directory/file stuff.txt
>
> But other agents should not be required to do so. jCIFS (and all agents,
> for that matter) SHOULD be able to represent this as:
>
> smb://server/directory/file%20stuff.txt
>
> In an ideal world, doing:
>
> URL url = new URL("smb://server/directory/file stuff.txt");
> System.out.println(url.toExternalForm());
>
> would output:
>
> smb://server/directory/file%20stuff.txt
>
> to enable jCIFS to output compliant URLs regardless of what particular
> inconsistencies it chooses to accept. Sun's implementation of HTTP URLs
> does not do this, however, so I wouldn't expect jCIFS to be required to do
> so. For example, if I do:
>
> URL url = new URL("http://server/directory/file stuff.txt");
> System.out.println(url.toExternalForm());
>
> I get (the invalid):
>
> http://server/directory/file stuff.txt
>
> Note, however, that doing:
>
> URL url = new URL("http://server/directory/file stuff.txt");
> URI uri = new URI(url.toExternalForm());
>
> throws:
>
> java.net.URISyntaxException: Illegal character in path at index 28:
> http://server/directory/file stuff.txt
> at java.net.URI$Parser.fail(URI.java:2701)
> at java.net.URI$Parser.checkChars(URI.java:2872)
> at java.net.URI$Parser.parseHierarchical(URI.java:2956)
> at java.net.URI$Parser.parse(URI.java:2904)
> at java.net.URI.<init>(URI.java:565)
>
This is a good example. So they chose not to escape them, and they don't
even have the problems we're facing. This suggests they weighed
transcribability over syntactic correctness.
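For what it's worth, the java.net.URI class (new in 1.4) will produce the
compliant form if you hand it the components separately; its multi-argument
constructors quote illegal characters such as the space. A sketch (not a jCIFS
API, just the stock JDK class):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class CompliantForm {
    public static void main(String[] args) throws URISyntaxException {
        // The multi-argument constructor quotes characters that are
        // illegal in a URI path, such as the space below.
        URI uri = new URI("http", "server", "/directory/file stuff.txt", null);
        System.out.println(uri.toString());
        // http://server/directory/file%20stuff.txt
    }
}
```

So the escaping machinery exists in the JDK; Sun simply chose not to apply it
in java.net.URL.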
Unless someone points out a fatal flaw in not escaping these path components,
or someone comes up with a very clever solution, I think we should just wait to
see how the current code performs. And if we do decide to escape these URLs,
it's not going to be easy. If non-Latin1 characters are converted to UTF-8,
that's not going to be compliant either, I'm afraid; RFC 2396 doesn't support
Unicode. Incidentally, here's a bit from the XML Base spec:
4.1. URI Reference Encoding and Escaping
The set of characters allowed in xml:base attributes is the same as for
XML, namely [Unicode]. However, some Unicode characters are
disallowed from URI references, and thus processors must encode and
escape these characters to obtain a valid URI reference from the attribute
value.
The disallowed characters include all non-ASCII characters, plus the
excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the
crosshatch (#) and percent sign (%) characters and the square
bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters
must be escaped as follows:
1. Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one
or more bytes.
2. Any octets corresponding to a disallowed character are escaped with the
URI escaping mechanism (that is, converted to %HH, where HH is the
hexadecimal notation of the byte value).
3. The original character is replaced by the resulting character sequence.
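Sketched in Java, that algorithm comes out roughly like this (my own
illustration; it leaves '#' and '%' untouched as the spec text says, and treats
the rest of the RFC 2396 excluded set as disallowed):

```java
import java.nio.charset.StandardCharsets;

public class XmlBaseEscape {
    // ASCII characters RFC 2396 excludes from URIs, minus '#' and '%',
    // which the spec text above re-allows.
    private static final String EXCLUDED = " <>\"{}|\\^`";

    static String escape(String s) {
        StringBuilder out = new StringBuilder();
        // Step 1: convert the whole string to UTF-8 octets.
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xff;
            // Step 2: escape octets of disallowed characters as %HH.
            if (c >= 0x80 || c < 0x20 || EXCLUDED.indexOf(c) >= 0) {
                out.append(String.format("%%%02X", c));
            } else {
                // Step 3: allowed characters pass through unchanged.
                out.append((char) c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("файл b#c.txt"));
        // %D1%84%D0%B0%D0%B9%D0%BB%20b#c.txt
    }
}
```

Which illustrates the earlier point: a Cyrillic path escaped this way is almost
nothing but %HH sequences.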