[jcifs] Creating file with hash ('#') in filename

Christopher R. Hertel crh at ubiqx.mn.org
Fri Jan 17 15:32:59 EST 2003


On Thu, Jan 16, 2003 at 07:41:28PM -0500, Allen, Michael B (RSCH) wrote:
> 
> 
> > -----Original Message-----
> > From:	Glass, Eric [SMTP:eric.glass at capitalone.com]
> > Sent:	Thursday, January 16, 2003 6:45 AM
> > To:	'Allen, Michael B (RSCH)'; 'Christopher R. Hertel'
> > Cc:	jcifs at samba.org
> > Subject:	RE: [jcifs] Creating file with hash ('#') in filename
> > 
> > From the standpoint of the robustness principle, I would look at it this
> > way:
> > 
> > The draft should reflect conformance to existing generic URL specifications.
> > jCIFS should accept all SMB URLs which conform to the draft.
> > jCIFS MAY accept SMB URLs which do not conform to the draft, but are
> > "reasonably" broken.
> > In an ideal world, jCIFS would be able to output a conformant URL given an
> > accepted but nonconformant input (for compatibility with other agents).
> > 
> I agree. Ideally this would be... well, ideal. But there is really only one
> major problem with this. I thought there were two major problems, but after
> thinking about it the performance argument isn't really an issue, because
> internally I don't think escape processing would be necessary. So there would
> not really be a performance impact on volatile I/O operations like copying,
> whereas previously I argued there would be.

Pheeww...  That saves me trying to come up with a coherent argument.  :)

> But at least one issue does remain. For all non-Latin1/ISO-8859-1 users,
> normal %HH escaping will result in a URL that is heavily polluted with escape
> sequences. For non-Latin-derived encodings that do not share Latin1
> characters, like Russian, the *entire* path will be %HH escapes. Even for
> Latin1/ISO-8859-1 users, CIFS paths (unlike HTTP paths) have MANY characters
> that will need to be encoded, and when you listFiles() you have to escape the
> paths.

This is the Unicode issue.  URLs must handle full Unicode in order to deal 
with this stuff.  It is a problem, but it's not specific to the SMB URL.

I think it is perfectly valid to accept Unicode characters at the command 
line, un-escaped.
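
Just to make the burden concrete, here's a throwaway sketch (not jCIFS
code; java.net.URLEncoder follows the form-encoding rules rather than
RFC 2396 exactly, e.g. it would turn a space into '+') showing what
UTF-8 %HH escaping does to a Russian directory name versus an English one:

    import java.net.URLEncoder;

    public class EscapeBurdenDemo {
        public static void main(String[] args) throws Exception {
            // The Cyrillic component comes back as nothing but escapes...
            System.out.println(URLEncoder.encode("документы", "UTF-8"));
            // -> %D0%B4%D0%BE%D0%BA%D1%83%D0%BC%D0%B5%D0%BD%D1%82%D1%8B

            // ...while the ASCII equivalent is untouched.
            System.out.println(URLEncoder.encode("documents", "UTF-8"));
            // -> documents
        }
    }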

> Incidentally, appending the #ref fragment to the filename is indeed what we do
> as of 0.7.1.

Which I'd call an acceptable kludge.

:
:
> > Note, however, that doing:
> > 
> > URL url = new URL("http://server/directory/file stuff.txt");
> > URI uri = new URI(url.toExternalForm());
> > 
> > throws:
> > 
> > java.net.URISyntaxException: Illegal character in path at index 28:
> > http://server/directory/file stuff.txt
> > 	at java.net.URI$Parser.fail(URI.java:2701)
> > 	at java.net.URI$Parser.checkChars(URI.java:2872)
> > 	at java.net.URI$Parser.parseHierarchical(URI.java:2956)
> > 	at java.net.URI$Parser.parse(URI.java:2904)
> > 	at java.net.URI.<init>(URI.java:565)
> > 
> This is a good example. So they chose not to escape them, and they don't
> even have the problems we're facing. This suggests they weighed
> transcribability more heavily than syntactic correctness.

I don't follow your argument there.  What is it you are trying to say?
Sorry, I'm being dense, I guess.

Looking at that example, however, I agree with the exception.  The space
is an illegal character.  It is also a transcription problem, since spaces
are difficult to convey properly in handwriting.  They also cause trouble
with word wrap in email and such.  That's why the RFC warns against them.

So, I would say that the above example is an argument in favor of both 
transcribability and correct syntax.
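
For what it's worth, the JDK does give you a way to get from the raw form
to an escaped one: the multi-argument URI constructors quote illegal
characters for you.  A quick sketch against J2SE 1.4 (nothing
jCIFS-specific):

    import java.net.URI;

    public class UriQuoteDemo {
        public static void main(String[] args) throws Exception {
            // The single-string constructor throws on the raw space,
            // but this form quotes it for us.
            URI uri = new URI("http", "server",
                              "/directory/file stuff.txt", null);
            System.out.println(uri.toASCIIString());
            // -> http://server/directory/file%20stuff.txt
        }
    }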

> Unless someone points out a fatal flaw in not escaping these path components,
> or someone comes up with a very clever solution, I think we should just wait
> to see how the current code performs. And if we do decide to escape these
> URLs it's not going to be easy. If non-Latin1 characters are converted to
> UTF-8, that's not going to be compliant either, I'm afraid; RFC 2396 doesn't
> support Unicode.

The fatal flaw is that it is not valid URL syntax.  Trying to be 'nice' to
users is one thing, but we shouldn't be 'breaking' URL syntax to do it.  
Besides, people (especially those who might use jCIFS as the toolkit to
build their killer application) will expect to be able to hand in a valid
URL with the smb:// prefix and get expected results.

Some folks (probably not the other early implementors like Apple and
Thursby, but some folks...) are likely to look at jCIFS as the reference
implementation.

...and here's something else that confuses me.  It seems you're arguing in
favor of avoiding the escapes for the sake of user convenience (whack me
on the head if'n I'm misinterpreting that), but at the same time, the
current code is pedantic, in places, about the trailing slash '/'.  
People forget the trailing slash all the time.

Anyway, going back to Eric's statement:

> > The draft should reflect conformance to existing generic URL specifications.

Must do, or it's not a URL.  :)

> > jCIFS should accept all SMB URLs which conform to the draft.

Must do, or it's not a complete implementation.

> > jCIFS MAY accept SMB URLs which do not conform to the draft, but are
> > "reasonably" broken.

...provided that doing so doesn't break anything else.  Lots of 
applications try to be helpful.  Netscape, Mozilla, etc...

> > In an ideal world, jCIFS would be able to output a conformant URL given an
> > accepted but nonconformant input (for compatibility with other agents).

That would be a very nice feature, yes.

I just tested this with Mozilla, by the way, and (in my test case) Mozilla 
*did* replace a space in a file name with %20.

The corollary would be that jCIFS MAY return the character form if the
character doesn't need to be escaped.  E.g., if the user enters
"smb://server/share/foob%61r/" then jCIFS might present it back to the
user as "smb://server/share/foobar/", since the 'a' doesn't need to be
escaped.
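
Something along these lines could do that normalization.  This is just a
made-up sketch (nothing in the current jCIFS tree, and it doesn't guard
against malformed escapes): decode a %HH escape only when the resulting
character is unreserved under RFC 2396.

    public class UnescapeDemo {
        // Hypothetical helper: collapse %HH escapes back to plain
        // characters, but only where the character is unreserved
        // (ASCII alphanumerics and a few marks).
        static String normalize(String url) {
            StringBuffer out = new StringBuffer();
            for (int i = 0; i < url.length(); i++) {
                char c = url.charAt(i);
                if (c == '%' && i + 2 < url.length()) {
                    char d = (char) Integer.parseInt(
                            url.substring(i + 1, i + 3), 16);
                    if ((d < 0x80 && Character.isLetterOrDigit(d))
                            || "-_.!~*'()".indexOf(d) != -1) {
                        out.append(d);   // safe to present unescaped
                        i += 2;
                        continue;
                    }
                }
                out.append(c);
            }
            return out.toString();
        }

        public static void main(String[] args) {
            System.out.println(normalize("smb://server/share/foob%61r/"));
            // -> smb://server/share/foobar/
        }
    }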

> Incidentally, here's a bit from the XML Base spec:
> 
> 	4.1. URI Reference Encoding and Escaping 
> 
> 	The set of characters allowed in xml:base attributes is the same as for
> 	XML, namely [Unicode]. However, some Unicode characters are
> 	disallowed from URI references, and thus processors must encode and
> 	escape these characters to obtain a valid URI reference from the attribute
> 	value.
> 
> 	The disallowed characters include all non-ASCII characters, plus the
> 	excluded characters listed in Section 2.4 of [IETF RFC 2396], except for the
> 	crosshatch (#) and percent sign (%) characters and the square
> 	bracket characters re-allowed in [IETF RFC 2732]. Disallowed characters
> 	must be escaped as follows:
> 
> 	Each disallowed character is converted to UTF-8 [IETF RFC 2279] as one or
> 	more bytes.
> 
> 	Any octets corresponding to a disallowed character are escaped with the URI
> 	escaping mechanism (that is, converted to %HH, where HH is the hexadecimal
> 	notation of the byte value).
> 
> 	The original character is replaced by the resulting character sequence.

Ah.  So, by extension, a URI string is an escaped UTF-8 string, where all
disallowed and non-ASCII characters (including every multi-byte character)
are escaped.
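
Read literally, that procedure comes out something like the following.
This is only a sketch of my reading of it (not jCIFS code): ASCII
characters that RFC 2396 allows -- plus '#', '%', and the RFC 2732
brackets -- pass through, and everything else becomes the %HH form of its
UTF-8 octets.

    import java.io.UnsupportedEncodingException;

    public class XmlBaseEscapeDemo {
        static String escapeURIReference(String s)
                throws UnsupportedEncodingException {
            String allowed = "#%;/?:@&=+$,-_.!~*'()[]";
            StringBuffer out = new StringBuffer();
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                // Allowed ASCII characters are copied through untouched.
                if (c < 0x80 && (Character.isLetterOrDigit(c)
                        || allowed.indexOf(c) != -1)) {
                    out.append(c);
                    continue;
                }
                // Everything else: convert to UTF-8, escape each octet.
                byte[] utf8 = String.valueOf(c).getBytes("UTF-8");
                for (int j = 0; j < utf8.length; j++) {
                    int b = utf8[j] & 0xff;
                    out.append('%');
                    if (b < 0x10) out.append('0');
                    out.append(Integer.toHexString(b).toUpperCase());
                }
            }
            return out.toString();
        }

        public static void main(String[] args) throws Exception {
            System.out.println(escapeURIReference(
                    "smb://server/share/My Docs/отчёт#1.txt"));
            // -> smb://server/share/My%20Docs/%D0%BE%D1%82%D1%87%D1%91%D1%82#1.txt
        }
    }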

Well I certainly agree with Mike that that's a pain-in-the-woodpile.

I think--going by Eric's guidelines--that the correct way to handle
non-ASCII characters is to simply accept them on the command line.  There 
are still the codepage issues to deal with, but if the input can be read 
as Unicode then I don't see a problem doing it this way.

Chris -)-----

-- 
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org


