[jcifs] Creating file with hash ('#') in filename

Thu Jan 16 16:48:39 EST 2003

> -----Original Message-----
> From:	Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> 
> >	That would happen far too much and annoy people to no end. The only
> >	reason for escaping things in the first place is to make them portable:
> > 
> > 	"The space character is excluded because significant spaces may disappear
> > 	and insignificant spaces may be introduced when URI are transcribed or
> > 	typeset or subjected to the treatment of wordprocessing programs."
> 
> That's the reason for escaping spaces.  There are a lot of other 
> characters that get escaped.  The reasons are:
> 
> 1) They have syntactic meaning within the URL.
> 2) They have syntactic meaning when used to identify a URL in another 
>    context.
> 3) They are non-printing.
> 4) They may get munged by an intermediary.
> 
> RFC 2396:
> 2. URI Characters and Escape Sequences
> 
>    URI consist of a restricted set of characters, primarily chosen to
>    aid transcribability and usability both in computer systems and in
>    non-computer communications. Characters used conventionally as
>    delimiters around URI were excluded.  The restricted set of
>    characters consists of digits, letters, and a few graphic symbols
>    were chosen from those common to most of the character encodings and
>    input facilities available to Internet users.
> 
>       uric          = reserved | unreserved | escaped
> 
>    Within a URI, characters are either used as delimiters, or to
>    represent strings of data (octets) within the delimited portions.
>    Octets are either represented directly by a character (using the US-
>    ASCII character for that octet [ASCII]) or by an escape encoding.
>    This representation is elaborated below.
> 
	But I cannot think of any character that, if placed in the path component, would actually *break* the parsing. The '#' did. The '@' does in 0.6 but not in 0.7. The '?' doesn't because it's not
permitted in an SMB filename. So none of the characters you refer to can actually create a syntactic problem. That leaves transcribability and usability. Not escaping the path improves
transcribability, so we are left with usability, and usability suffers only in that unescaped reserved characters make the URL less portable. The only effect is reduced portability.
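
	To make the '#' case concrete, here is a minimal sketch of what a generic parser does with it. It uses java.net.URI as a stand-in for "a generic URL parser"; jCIFS itself goes through java.net.URL and its own Handler, so this only illustrates the generic rule. The class name and the server, share, and file names are made up.

```java
import java.net.URI;

// Illustration only: java.net.URI stands in for a generic URL parser.
// HashInPath and the example names are hypothetical.
class HashInPath {
    // Path component as a generic parser sees it (escapes decoded).
    static String pathOf(String url) {
        return URI.create(url).getPath();
    }

    // Fragment ('#ref') as a generic parser sees it, or null if there is none.
    static String fragmentOf(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String raw = "smb://server/share/notes#1.txt";       // '#' left unescaped
        String escaped = "smb://server/share/notes%231.txt";  // '#' written as %23
        System.out.println(pathOf(raw) + " | " + fragmentOf(raw));
        System.out.println(pathOf(escaped) + " | " + fragmentOf(escaped));
    }
}
```

	With the raw form the filename is cut at the '#' and the remainder is treated as a fragment; with the escaped form the whole name survives in the path.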

> > 	These URLs are not going to be embedded in word processing programs.
> 
> Why not?  I can imagine using Kword to edit a document stored on some 
> other system.  Kword might use the SMB URL string to identify the document 
> in its "most recently accessed" list or somesuch.  Likewise, when I print 
> the document I might include the filename in the footer.  There it is.
> 
> In an office environment, where documents are shared by several people, an 
> SMB URL might be handed around, or even referenced in an internal memo.
> 
> Again, it's dangerous to guess how an SMB URL might be used.
> 
	True. I admit this is a weak argument.

> >       If they are embedded in a web page it will be the path component
> >	appended to an HTTP URL
> 
> If it's a relative URL, then it's an HTTP URL.  If you append it to an
> HTTP URL then it's just a relative portion and has nothing to do with SMB.
> 
	Right this has nothing to do with SMB.

> > 	which the web application needs to escape (i.e. NetworkExplorer).
> >       No one is going	to send an SMB URL to someone in an e-mail. That
> >	kind of stuff just doesn't fit the protocol. That's what HTTP and
> >	FTP are for.
> 
> I disagree with you here, mostly for reasons already stated.  Further, 
> though, I got an email this very day (geez, I write too much) in which 
> someone where I worked was talking about the fact that his users just 
> couldn't get the hang of using FTP or HTTP to share files.
> 
	They don't? People link to websites and anonymous FTP servers almost exclusively. Either that or they just attach a file. I very rarely see a UNC path in an e-mail. I do it occasionally but
I've never really seen anyone else do it.

> > 	Again, we've been here before. If you remember when we un-escaped URLs
> >       in 0.6 we suddenly had to escape them. Then I decided to hang onto
> >	the URL that was passed	in as is and give back what was given in. But
> >	that didn't quite work either. The end result will be that
> >	escaping will creep in.
> 
> I don't think it should creep back in.  It needs to be handled head-on.
> 
	I didn't describe this adequately. I meant that when you get a URL like:

	smb://server/share/mike @ work/

	You can save the original URL given and when asked give back the same thing. You know it's ok because that's what they gave you. But now let's say you call listFiles() and the names that come
back contain these reserved characters. Now you don't know if it's ok:

	smb://server/share/mike @ work/
	smb://server/share/mike @ work/
	smb://server/share/mike @ work/
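
	One way to handle names that come back from a listing is to escape them when the URL string is built. This is a rough sketch and not what jCIFS does today: it leans on the multi-argument java.net.URI constructor, which quotes characters that are illegal in a path. ListedNameEscaper and toEscapedUrl are hypothetical names.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of jCIFS.
class ListedNameEscaper {
    // host, dirPath, and name are raw (unescaped) strings, e.g. as a
    // directory listing might return them. dirPath must start with '/'.
    static String toEscapedUrl(String host, String dirPath, String name) {
        try {
            // This constructor quotes characters such as ' ' and '#' in the
            // path but leaves '/' and '@' alone.
            return new URI("smb", host, dirPath + name, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toEscapedUrl("server", "/share/", "mike @ work"));
        System.out.println(toEscapedUrl("server", "/share/", "a#b.txt"));
    }
}
```

	Parsing the result again with java.net.URI and calling getPath() gives back the raw name, so the escaping only has to happen at the boundary where a string is handed to the caller.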

> > 	More importantly, all that character manipulation is very costly.
> 
> Then it's important to find ways to minimize it by figuring out the locus 
> at which it must occur.
> 
> As usual, I'm on the theory side of the house here, and the job is to 
> figure out how to make this all practical.  Annoying, eh?
> 
> Thing one:  The unescaping doesn't occur until the URL is split into its 
>             component pieces.  That's logical, since the escapes may be 
>             protecting some character that would otherwise be a delimiter.
> 
> Thing two:  Once the URL is decomposed, the pieces can be unescaped, but 
>             the escaped version would be kept as well.  If a change is 
>             made to, say, the path then both versions are updated.
> 
> Thing three: Well, maybe not.  If the escaped version is updated then 
>             there is no need to update the unescaped version until it's 
>             actually used.  A flag would be needed...
> 
> Thing four: I know exactly how I would do this if I were not relying on 
>             java.net.URL.  I don't know what kind of monkey wrench that 
>             throws into things.
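
The four "things" above can be sketched without java.net.URL at all. This is only an illustration under some assumptions: SmbPath is a made-up class, the decoder treats each %XX as a single ASCII/Latin-1 character (Unicode is left open, as it is in the discussion), and malformed escapes are copied through untouched.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical class: keeps the escaped path and decodes it lazily.
class SmbPath {
    private String escaped;        // as written in the URL, e.g. "/share/a%23b.txt"
    private String unescaped;      // cached decoded form, valid only when !stale
    private boolean stale = true;  // the flag from "Thing three"

    SmbPath(String escapedPath) {
        this.escaped = escapedPath;
    }

    // Updating the escaped form only marks the decoded form as out of date.
    void setEscaped(String escapedPath) {
        this.escaped = escapedPath;
        this.stale = true;
    }

    String getEscaped() {
        return escaped;
    }

    // Decoding is deferred until the unescaped form is actually used.
    String getUnescaped() {
        if (stale) {
            unescaped = decode(escaped);
            stale = false;
        }
        return unescaped;
    }

    // "Thing one": split on '/' first, then unescape each piece.
    static List<String> segments(String escapedPath) {
        List<String> result = new ArrayList<>();
        for (String seg : escapedPath.split("/")) {
            if (!seg.isEmpty()) {
                result.add(decode(seg));
            }
        }
        return result;
    }

    // Replaces each well-formed %XX with the character of that value.
    static String decode(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) (hi * 16 + lo));
                    i += 2;
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }
}
```

The character work is then paid once per change to the path rather than once per access, which speaks to the cost concern raised earlier.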
> 
> > > >       Can
> > > > 	someone give me a reason why we *have* to require URL encoding of the
> > > > 	path component? Otherwise I think we should punt the '#ref' and just
> > > >       integrate it into the path. Anything we would use it for can be
> > > >       done with a query_string parameter.
> > > 
> > > The '#' character isn't the only problem.  You could fudge that one.  
> > > There are other characters (eg., spaces) which are not legal URL
> > > characters.  Non-english language characters, for example.
> > > 
> > > The key thing, though, is that a user may type in an SMB URL with a URL
> > > escape sequence included.
> > > 
> > 	This is a very unlikely scenario but in that application (a web browser
> >       maybe) the application will be responsible for un-escaping it.
> 
> When and why?  The unescaping must be done after the parsing but before 
> the calls to jCIFS.  The reverse--escaping strings returned by jCIFS 
> before handing them to java.net.URL-- also fits in between.  Is that 
> (honestly, I don't know...my head's been in my book) something that is 
> do-able with jCIFS as it works currently?
> 
> > > > 	Incidentally, speaking of query_string parameters we got lucky with the '?'
> > > > 	character. That *is* reserved in SMB pathnames. It's a wildcard character.
> > > > 	Otherwise we really *would* have to require escaping path components.
> > > 
> > > We still do.  :)
> > > 
> > 	Seriously, what will *break* if we do not mandate escaping the path
> >	component?
> 
> Any valid Windows filename or directoryname character that is not also a 
> valid URL character.
> 
> You are using java.net.URL which, if I understand correctly, does the 
> parsing for you.  I imagine that other tools exist out there that also do 
> generic URL parsing.  These tools may simply ignore "illegal" URL 
> characters, or they may not.  If java.net.URL simply passes such 
> characters along then jCIFS is okay...except that some of the URLs that 
> work with jCIFS won't work with, say, KDE, MacOS X, or other tools that 
> support the SMB URL.
> 
> > 
> > > > 	Anyway it looks like just tacking the '#ref' back onto the path component in
> > > > 	Handler.java is going to do the trick.
> > > 
> > > ...for that *one* case, and it is still a user convenience at the expense 
> > > of correct syntax.
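
For reference, "tacking the '#ref' back onto the path" could look roughly like the sketch below. It is not the actual Handler.java; RefFoldingHandler and pathOf are made-up names, and openConnection is stubbed out because only the parsing is of interest here. It relies on java.net.URL splitting off the '#ref' before it calls the handler's parseURL.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

// Hypothetical handler for illustration; not the jCIFS Handler.
class RefFoldingHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) throws IOException {
        throw new IOException("not implemented in this sketch");
    }

    @Override
    protected void parseURL(URL u, String spec, int start, int limit) {
        // Let the default parser fill in host, path, query, and so on.
        super.parseURL(u, spec, start, limit);
        String path = u.getPath();
        String ref = u.getRef();
        if (ref != null) {
            // Fold the '#ref' back into the path and clear the ref.
            path = path + '#' + ref;
        }
        setURL(u, u.getProtocol(), u.getHost(), u.getPort(),
               u.getAuthority(), u.getUserInfo(), path, u.getQuery(), null);
    }

    // Convenience for trying it out: the path as seen through this handler.
    static String pathOf(String spec) {
        try {
            return new URL(null, spec, new RefFoldingHandler()).getPath();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

This covers only the '#' case, which is the objection raised above: other characters that are not legal in a URL path are passed through as they are.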
> > > 
> > 	This is the debate right here. The SMB URL cannot be used on the
> >       Internet because its character range is too great. It inherently
> >	stomps on reserved characters. So are you weighing the "user
> >	convenience" side enough?
> 
> Nope.  There is no reason that a file specified by the SMB URL could not 
> also be offered to the Internet via a web server.  The same path, same 
> file, same problem.
> 
> The current RFC specifies the US ASCII set of characters, so Unicode
> simply isn't supported by generic URI.  We fall into that the same way FTP
> and HTTP do.  I will have to ask if/when that will be covered.  A new
> generic URI draft is in the works, by the way.
> 
> Anyway, if "the SMB URL cannot be used on the Internet" then there is no
> point in pursuing the draft, since it is an "Internet Draft".  The point
> is that the SMB URL *can* be used on the Internet (no judgement as to
> whether this is wise or not...except to say that it's no less secure 
> than FTP, which sends passwords in cleartext).
> 
> More likely, the SMB URL will be used on the "in*tra*net" within an office
> or company or suchlike.  That, however, is pure conjecture on my part so
> who knows...
> 
> > > The HTTP URL is just an instance of a URL.  A descendant type.  The rules 
> > > apply to all URLs.
> > > 
> > 	Well in this case I meant the path component really IS an HTTP URL
> >       because
> > 	that is how the requested path is passed to the servlet like:
> > 
> > 	  http://miallen3.com:8080/servlets/NetworkExplorer/miallen1/C$/pub/
> 
> Ah.  Okay.
> 
> > > Sorry.  :(
> > > 
> > 	No reason to be. I am just trying to determine with certainty the answer to the
> > 	question: Will the SMB URL *break* if we do not escape the path component?
> 
> It's not really a question of whether *we* escape the path.  On input, the 
> user "should" do so.  A "user-friendly" application may try to clean the 
> URL itself and correct the user's mistakes (as well as it can).  Doing so, 
> though, runs a bit of a risk by allowing "invalid" URL strings to 
> propagate.
> 
> When returning a URL to the user, I believe that it should have correct
> syntax.
> 
> As far as the SMB URL breaking, my thoughts are these:
> 
> - Escapes should be handled regardless, simply because they *may* be used.
>   They may be cut-and-pasted from other sources, for example.
> 
> - Other general-purpose parsers may or may not handle URLs that are not
>   escaped properly.  jCIFS can be compatible by aiming for (not
>   necessarily being) the least-common-denominator.
> 
> - Within the ASCII set, the list of characters that would need to be 
>   escaped is limited to those that are valid filename characters but are 
>   not valid URL 'pchar' characters:
> 
>     pchar         = unreserved | escaped |
>                     ":" | "@" | "&" | "=" | "+" | "$" | ","
> 
>     unreserved    = alphanum | mark
>           mark    = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
> 
>   That leaves us with only a small number of disallowed ASCII characters.  
>   The '#' and the space are probably the most conspicuous.
> 
> - I have no idea how Unicode and extended ASCII should be handled 
> off-hand.
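
The 'pchar' rule quoted above is small enough to check mechanically. The sketch below is a direct transcription of that grammar for printable ASCII only; PcharCheck and its method names are made up, and it says nothing about which of the characters are legal in a Windows filename.

```java
// Hypothetical helper: which printable ASCII characters are not allowed
// to appear literally as an RFC 2396 'pchar'?
class PcharCheck {
    private static final String MARK = "-_.!~*'()";
    private static final String PCHAR_EXTRA = ":@&=+$,";

    // True if c may appear unescaped in a path segment.
    static boolean isUnescapedPchar(char c) {
        boolean alphanum = (c >= 'a' && c <= 'z')
                || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9');
        return alphanum || MARK.indexOf(c) >= 0 || PCHAR_EXTRA.indexOf(c) >= 0;
    }

    // All printable ASCII characters (0x20..0x7E) that fall outside pchar.
    static String needEscaping() {
        StringBuilder sb = new StringBuilder();
        for (char c = 0x20; c <= 0x7E; c++) {
            if (!isUnescapedPchar(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("[" + needEscaping() + "]");
    }
}
```

The result includes '/', ';', and '?', which have their own roles in the generic syntax, and '%', which is only legal as the start of an escape. The space and the '#' are in the set, as noted above.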
> 
> > 	After we answer that question we can debate whether or not "correct syntax"
> > 	outweighs "user convenience".
> 
> Good 'nough.  :)
> 
> Chris -)-----
> 
> -- 
> Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
> jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
> ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
> OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org