[jcifs] Creating file with hash ('#') in filename

Thu Jan 16 14:07:27 EST 2003

> -----Original Message-----
> From:	Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> Sent:	Wednesday, January 15, 2003 9:25 PM
> To:	Allen, Michael B (RSCH)
> Cc:	jcifs at samba.org
> Subject:	Re: [jcifs] Creating file with hash ('#') in filename
> 
> On Wed, Jan 15, 2003 at 09:03:09PM -0500, Allen, Michael B (RSCH) wrote:
> > 
> > 
> > > -----Original Message-----
> > > From:	Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> > > 
> > > > The java.net.URL class parses the URL *before* the jcifs.smb.Handler
> > > > gets it. So the '#ref' is getting picked out. I was just saying perhaps
> > > > I can append it back on to create an internal path that retains it.
> > > 
> > > Only if you want to bypass the standard syntax for URLs.  :)
> > > 
> > 	You mean HTTP URLs. It's the HTTP URL that uses '#'. We don't have any
> > 	use for it.
> 
> No, I mean that the parsing of URLs is standard, based on the RFC.  The # 
> is defined as part of the syntax of generic URLs.  It's just that the 
> semantics have no meaning for the SMB URL.
> 
	Ok. I thought it was specific to HTTP.
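
	To illustrate that point, a minimal sketch using java.net.URI (not jCIFS) as a
	stand-in for a generic parser. The class name and the sample URLs are made up
	for illustration. The fragment is split off the same way whatever the scheme:

```java
import java.net.URI;

final class GenericFragmentSketch {
    // Fragment ("ref") that the generic parser splits off, or null if there is none.
    static String fragmentOf(String spec) {
        return URI.create(spec).getFragment();
    }

    // Path that is left once the fragment has been removed.
    static String pathOf(String spec) {
        return URI.create(spec).getPath();
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://example.com/docs/index.html#intro",
            "smb://server/share/report#1.txt"
        };
        for (String spec : samples) {
            System.out.println(pathOf(spec) + "  fragment=" + fragmentOf(spec));
        }
    }
}
```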

> > > The # (if unescaped) in that position should be a delimiter and the
> > > pedantic way to handle it is to cough back an error.
> > > 
> > 	For HTTP URLs. For SMB URLs this remains to be seen. We cannot
> > 	conform to the HTTP URL closely without a cost.
> 
> We are trying to conform to the generic specification for URLs.  We have 
> (by necessity) overloaded the general form by adding NBT name syntax, and 
> further defining subfields within existing fields.  Nothing in the SMB URL 
> syntax actually overrides generic URL syntax.
> 
	True. By removing the '#' as the ref delimiter we will be breaking the generic
	URL syntax (albeit not by much and of no consequence to anyone).
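
	A rough sketch of what tacking the ref back on could look like. This is not the
	jCIFS Handler code: the stub handler, the class name and the sample URL are
	assumptions, there only so that java.net.URL will accept the smb scheme without
	jCIFS installed.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

final class RefRejoinSketch {
    // Stand-in handler (hypothetical); it only exists so that "smb:" URLs parse.
    private static final URLStreamHandler STUB = new URLStreamHandler() {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("stub handler: connections are not supported");
        }
    };

    // Parse with java.net.URL; the constructor separates the '#ref' itself.
    // Note: this constructor is deprecated in recent JDKs but still available.
    static URL parse(String spec) {
        try {
            return new URL(null, spec, STUB);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Rebuild an internal path that keeps the '#' and whatever followed it.
    static String internalPath(URL url) {
        String ref = url.getRef();
        return ref == null ? url.getPath() : url.getPath() + '#' + ref;
    }

    public static void main(String[] args) {
        URL url = parse("smb://server/share/report#1.txt");
        System.out.println("path=" + url.getPath() + " ref=" + url.getRef());
        System.out.println("internal=" + internalPath(url));
    }
}
```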

> >       The main problem is that
> > 	SMB path names need to represent just about any character including
> > 	Unicode which we haven't even touched on. I personally do not want to
> > 	decode paths. That is very costly.
> 
> We only need to unescape them.
> 
	When decoded. But what happens when we pass a string back out? You have
	to encode it. Now if you have spaces and @ and # it all gets escaped.
	Unless... see below...

> >       It is very likely that SMB URLs will contain
> > 	reserved characters like space, '@', and '#'. We cannot accept both encoded
> > 	and non encoded URLs because URLs returned by jCIFS will need to be
> > 	encoded.
> 
> If by "encoded" you mean "escaped" (I'm being pedantic).
> 
	Right. Escape and unescape.

> Think of it as a translation.  There is the name in SMB format and the 
> same name in URL format.  In the latter case, characters which are not 
> permitted by URL syntax must be escaped.  So, when translating from URL 
> format to SMB format, you unescape.  When translating from SMB to URL, you 
> gotta escape them again.
> 
> >       Even if you pass back whatever was passed in how do you handle
> > 	URLs derived from a parent during a list() operation.
> 
> Go through the string one character (ASCII or Unicode) at a time and 
> rewrite it.
> 
> >       It gets very messy.
> 
> Two methods:  urlEscape() and urlUnEscape().
> 
	We cannot do that. We've been here before. The SMB URLs are not used like
	normal URLs. People will need to specify them manually. They will be used
	where UNC paths are used. We cannot mandate that spaces, '@', '#' and Unicode
	characters be escaped. That would happen far too much and annoy people to no
	end. The only reason for escaping things in the first place is to make them portable:

		"The space character is excluded because significant spaces may disappear
		and insignificant spaces may be introduced when URI are transcribed or
		typeset or subjected to the treatment of wordprocessing programs."

	These URLs are not going to be embedded in word processing programs. If they are
	embedded in a web page it will be the path component appended to an HTTP URL
	which the web application needs to escape (i.e. NetworkExplorer). No one is going
	to send an SMB URL to someone in an e-mail. That kind of stuff just doesn't fit the
	protocol. That's what HTTP and FTP are for.

	Again, we've been here before. If you remember when we un-escaped URLs in 0.6 we
	suddenly had to escape them. Then I decided to hang onto the URL that was passed
	in as is and give back what was given in. But that didn't quite work either. The end
	result will be that escaping will creep in.

	More importantly, all that character manipulation is very costly.
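
	For reference, the two methods proposed above could be sketched roughly like
	this. These are hypothetical helpers, not jCIFS API. The set of characters left
	unescaped (ASCII letters, digits, '-', '_', '.', '~' and '/') and the use of
	UTF-8 for everything else are assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class SmbUrlEscapeSketch {
    private static final String HEX = "0123456789ABCDEF";

    // Assumption: only these ASCII characters (and the '/' separator) stay literal.
    private static boolean isSafe(int c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                || (c >= '0' && c <= '9')
                || c == '-' || c == '_' || c == '.' || c == '~' || c == '/';
    }

    // SMB form -> URL form: percent-escape every UTF-8 byte that is not safe.
    static String urlEscape(String smbPath) {
        StringBuilder sb = new StringBuilder();
        for (byte b : smbPath.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xff;
            if (isSafe(c)) {
                sb.append((char) c);
            } else {
                sb.append('%').append(HEX.charAt(c >> 4)).append(HEX.charAt(c & 0x0f));
            }
        }
        return sb.toString();
    }

    // URL form -> SMB form: turn each %XX back into a byte, then decode as UTF-8.
    static String urlUnescape(String urlPath) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < urlPath.length()) {
            if (urlPath.charAt(i) == '%') {
                if (i + 2 >= urlPath.length()) {
                    throw new IllegalArgumentException("truncated escape at index " + i);
                }
                int hi = Character.digit(urlPath.charAt(i + 1), 16);
                int lo = Character.digit(urlPath.charAt(i + 2), 16);
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad escape at index " + i);
                }
                out.write((hi << 4) | lo);
                i += 3;
            } else {
                // Copy the literal run up to the next '%' unchanged.
                int next = urlPath.indexOf('%', i);
                if (next < 0) {
                    next = urlPath.length();
                }
                byte[] raw = urlPath.substring(i, next).getBytes(StandardCharsets.UTF_8);
                out.write(raw, 0, raw.length);
                i = next;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String escaped = urlEscape("/share/my file#1@home.txt");
        System.out.println(escaped);
        System.out.println(urlUnescape(escaped));
    }
}
```

	Every call walks and copies the whole string, which is the cost being objected
	to here.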

> >       Can
> > 	someone give me a reason why we *have* to require URL encoding of the
> > 	path component? Otherwise I think we should punt the '#ref' and just
> >       integrate it into the path. Anything we would use it for can be
> >       done with a query_string parameter.
> 
> The '#' character isn't the only problem.  You could fudge that one.  
> There are other characters (eg., spaces) which are not legal URL
> characters.  Non-english language characters, for example.
> 
> The key thing, though, is that a user may type in an SMB URL with a URL
> escape sequence included.
> 
	This is a very unlikely scenario but in that application (a web browser maybe) the
	application will be responsible for un-escaping it.

> > 	Incidentally, speaking of query_string parameters, we got lucky with the '?'
> > 	character. That *is* reserved in SMB pathnames. It's a wildcard character.
> > 	Otherwise we really *would* have to require escaping path components.
> 
> We still do.  :)
> 
	Seriously, what will *break* if we do not mandate escaping the path component?
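
	One data point, as a sketch only: a strict generic parser such as java.net.URI
	refuses an unescaped space outright. The class and method names are made up for
	illustration. Whether this matters for jCIFS, which goes through java.net.URL
	rather than java.net.URI, is a separate question.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class StrictParseSketch {
    // True if the strict generic parser accepts the string as written.
    static boolean isValidGenericUri(String spec) {
        try {
            new URI(spec);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Path as seen by the parser, with any %XX escapes decoded again.
    static String decodedPath(String spec) {
        return URI.create(spec).getPath();
    }

    public static void main(String[] args) {
        System.out.println(isValidGenericUri("smb://server/share/my file.txt"));
        System.out.println(isValidGenericUri("smb://server/share/my%20file.txt"));
        System.out.println(decodedPath("smb://server/share/my%20file.txt"));
    }
}
```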

> > 	Anyway it looks like just tacking the '#ref' back onto the path component in
> > 	Handler.java is going to do the trick.
> 
> ...for that *one* case, and it is still a user convenience at the expense 
> of correct syntax.
> 
	This is the debate right here. The SMB URL cannot be used on the Internet
	because its character range is too great. It inherently stomps on reserved
	characters. So are you weighing the "user convenience" side enough?

> >	NetworkExplorer doesn't like it but that's
> > 	because the SMB URLs are going through the browser as part of the path. In
> > 	this case they *are* HTTP URLs and as such need to be escaped. I'll leave
> > 	the NetworkExplorer fixes till later I think.
> 
> The HTTP URL is just an instance of a URL.  A descendant type.  The rules 
> apply to all URLs.
> 
	Well in this case I meant the path component really IS an HTTP URL because
	that is how the requested path is passed to the servlet like:

	  http://miallen3.com:8080/servlets/NetworkExplorer/miallen1/C$/pub/
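
	A minimal sketch of the kind of escaping the web application would have to do
	when it builds such a link. It leans on the multi-argument java.net.URI
	constructor, which quotes characters that are illegal in a path. The class
	name, the method name and the sample file name are hypothetical, and this is
	not how NetworkExplorer is actually written.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class ExplorerLinkSketch {
    // Append an SMB-style path to the servlet path and let URI quote what it must.
    static String httpLink(String host, int port, String servletPath, String smbPath) {
        try {
            URI uri = new URI("http", null, host, port, servletPath + smbPath, null, null);
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(httpLink("miallen3.com", 8080,
                "/servlets/NetworkExplorer", "/miallen1/C$/pub/my file#1.txt"));
    }
}
```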

> Sorry.  :(
> 
	No reason to be. I am just trying to determine with certainty the answer to the
	question: Will the SMB URL *break* if we do not escape the path component?

	After we answer that question we can debate whether or not "correct syntax"
	outweighs "user convenience".