[jcifs] [URL Potability] Creating file with hash ('#') in filename

Fri Jan 17 10:33:02 EST 2003

Lookout won't let me send this!

> -----Original Message-----
> From:	Allen, Michael B (RSCH) 
> Sent:	Thursday, January 16, 2003 1:09 AM
> To:	'Christopher R. Hertel'
> Cc:	'jcifs at samba.org'
> Subject:	RE: [jcifs] [URL Potability] Creating file with hash ('#') in filename
> 
> 
> 
> > -----Original Message-----
> > From:	Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> > 
> > >	That would happen far too much and annoy people to no end. The only
> > >	reason for escaping things in the first place is to make them portable:
> > > 
> > > 	"The space character is excluded because significant spaces may disappear
> > > 	and insignificant spaces may be introduced when URI are transcribed or
> > > 	typeset or subjected to the treatment of wordprocessing programs."
> > 
> > That's the reason for escaping spaces.  There are a lot of other 
> > characters that get escaped.  The reasons are:
> > 
> > 1) They have syntactic meaning within the URL.
> > 2) They have syntactic meaning when used to identify a URL in another 
> >    context.
> > 3) They are non-printing.
> > 4) They may get munged by an intermediary.
> > 
> > RFC 2396:
> > 2. URI Characters and Escape Sequences
> > 
> >    URI consist of a restricted set of characters, primarily chosen to
> >    aid transcribability and usability both in computer systems and in
> >    non-computer communications. Characters used conventionally as
> >    delimiters around URI were excluded.  The restricted set of
> >    characters consists of digits, letters, and a few graphic symbols
> >    were chosen from those common to most of the character encodings and
> >    input facilities available to Internet users.
> > 
> >       uric          = reserved | unreserved | escaped
> > 
> >    Within a URI, characters are either used as delimiters, or to
> >    represent strings of data (octets) within the delimited portions.
> >    Octets are either represented directly by a character (using the US-
> >    ASCII character for that octet [ASCII]) or by an escape encoding.
> >    This representation is elaborated below.
> > 
> But I cannot think of any characters that if placed in the path component would
> actually *break* the parsing. The '#' did. The '@' does in 0.6 but not in 0.7. '?'
> doesn't because it's not permitted in an SMB filename. So none of the characters
> you refer to can actually create a syntactic problem. Therefore the remaining
> issue is transcribability and usability. And since not escaping the path improves
> transcribability we are left with usability. And usability suffers by not escaping
> reserved characters because it diminishes portability. The only effect is reduced
> portability.
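
To make the '#' breakage concrete, here is a minimal sketch. It uses java.net.URI as a stand-in parser (an assumption: jCIFS itself goes through java.net.URL with its own handler, which is not reproduced here), and the class and method names are made up for illustration. An unescaped '#' ends the path and starts the fragment; the escaped form keeps the whole name in the path.

```java
import java.net.URI;

// Hypothetical demo class; not part of jCIFS.
class HashInPathDemo {

    // Returns { decoded path, fragment } as seen by a generic URI parser.
    static String[] split(String url) {
        URI u = URI.create(url);
        return new String[] { u.getPath(), u.getFragment() };
    }

    public static void main(String[] args) {
        // Raw '#': the parser cuts the filename in two.
        String[] raw = split("smb://server/share/file9#9009.bin");
        System.out.println(raw[0] + " | " + raw[1]); // /share/file9 | 9009.bin

        // Escaped '#': the filename survives intact in the path.
        String[] esc = split("smb://server/share/file9%239009.bin");
        System.out.println(esc[0] + " | " + esc[1]); // /share/file9#9009.bin | null
    }
}
```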
> 
> > > 	These URLs are not going to be embedded in word processing programs.
> > 
> > Why not?  I can imagine using Kword to edit a document stored on some 
> > other system.  Kword might use the SMB URL string to identify the document 
> > in its "most recently accessed" list or somesuch.  Likewise, when I print 
> > the document I might include the filename in the footer.  There it is.
> > 
> > In an office environment, where documents are shared by several people, an 
> > SMB URL might be handed around, or even referenced in an internal memo.
> > 
> > Again, it's dangerous to guess how an SMB URL might be used.
> > 
> 	True. I admit this is a weak argument.
> 
> > >       If they are embedded in a web page it will be the path component
> > >	appended to an HTTP URL
> > 
> > If it's a relative URL, then it's an HTTP URL.  If you append it to an
> > HTTP URL then it's just a relative portion and has nothing to do with SMB.
> > 
> Right this has nothing to do with SMB.
> 
> > > 	which the web application needs to escape (i.e. NetworkExplorer).
> > >       No one is going	to send an SMB URL to someone in an e-mail. That
> > >	kind of stuff just doesn't fit the protocol. That's what HTTP and
> > >	FTP are for.
> > 
> > I disagree with you here, mostly for reasons already stated.  Further, 
> > though, I got an email this very day (geez, I write too much) in which 
> > someone where I worked was talking about the fact that his users just 
> > couldn't get the hang of using FTP or HTTP to share files.
> > 
> 
> They don't? People link to websites and anonymous FTP servers almost
> exclusively. Either that or they just attach a file. I very rarely see a UNC path in
> an e-mail. I do it occasionally but I've never really seen anyone else do it.
> 
> > > 	Again, we've been here before. If you remember when we un-escaped URLs
> > >       in 0.6 we suddenly had to escape them. Then I decided to hang onto
> > >	the URL that was passed	in as is and give back what was given in. But
> > >	that didn't quite work either. The end result will be that
> > >	escaping will creep in.
> > 
> > I don't think it should creep back in.  It needs to be handled head-on.
> > 
> 
> I didn't describe this adequately. The escaping will "creep" in as soon as you call
> listFiles(). When you listFiles() you have to escape the returned paths. This is why:
> 
> When you get a URL like:
> 
> 	smb://server/share/mike @ work/
> 
> You can save the original URL given and when asked give back the same thing.
> You know it's ok because that's what they gave you. But now let's say you
> listFiles() and the names that come back contain reserved characters. Now you
> don't know if it's ok so you have to escape them:
> 
> 	smb://server/share/mike @ work/Some Stuff/
> 	smb://server/share/mike @ work/file892%44.bin
> 	smb://server/share/mike @ work/file9#9009.bin
> 
> As you drill down you get more and more escaped stuff creeping in. But this is
> theory. In practice you can't even do it this way. You cannot selectively escape
> part of the path. Once you listFiles() you have to completely escape the path. I
> know you know that so I won't try to prove why.
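
A rough sketch of what escaping each name returned by listFiles() could look like. Everything here is an assumption for illustration: the class name, the choice to escape every octet outside the RFC 2396 "unreserved" set (which is stricter than the RFC requires for a path segment, so '@' gets escaped too), and the use of UTF-8 for the octets. The first loop is the per-character test that has to run even when nothing needs escaping.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper; not part of jCIFS.
class PathEscaper {
    private static final String MARKS = "-_.!~*'()"; // RFC 2396 "mark" characters
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // True for RFC 2396 unreserved characters (alphanumerics and marks).
    static boolean isSafe(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9') || MARKS.indexOf(c) >= 0;
    }

    // Escapes one path segment (a single file or directory name).
    static String escapeSegment(String name) {
        int i = 0;
        while (i < name.length() && isSafe(name.charAt(i))) {
            i++; // every character is tested, even in a clean name
        }
        if (i == name.length()) {
            return name; // nothing to escape: return the same String
        }
        StringBuilder sb = new StringBuilder(name.length() + 8);
        sb.append(name, 0, i);
        // Assumption: escaped octets are the UTF-8 encoding of the name.
        for (byte b : name.substring(i).getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xff;
            if (v < 0x80 && isSafe((char) v)) {
                sb.append((char) v);
            } else {
                sb.append('%').append(HEX[v >> 4]).append(HEX[v & 0x0f]);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] names = { "Some Stuff", "file892%44.bin", "file9#9009.bin" };
        for (String n : names) {
            System.out.println(escapeSegment(n));
        }
        // Prints:
        // Some%20Stuff
        // file892%2544.bin
        // file9%239009.bin
    }
}
```

Note that the literal '%' in file892%44.bin has to become %25, otherwise the name would unescape to a different file. That is one reason a path cannot be escaped selectively.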
> 
> > > 	More importantly, all that character manipulation is very costly.
> > 
> > Then it's important to find ways to minimize it by figuring out the locus 
> > at which it must occur.
> 
> See above. The whole path must be escaped.
> 
> > 
> > As usual, I'm on the theory side of the house here, and the job is to 
> > figure out how to make this all practical.  Annoying, eh?
> > 
> > Thing one:  The unescaping doesn't occur until the URL is split into its 
> >             component pieces.  That's logical, since the escapes may be 
> >             protecting some character that would otherwise be a delimiter.
> > 
> > Thing two:  Once the URL is decomposed, the pieces can be unescaped, but 
> >             the escaped version would be kept as well.  If a change is 
> >             made to, say, the path then both versions are updated.
> > 
> > Thing three: Well, maybe not.  If the escaped version is updated then 
> >             there is no need to update the unescaped version until it's 
> >             actually used.  A flag would be needed...
> > 
> > Thing four: I know exactly how I would do this if I were not relying on 
> >             java.net.URL.  I don't know what kind of monkey wrench that 
> >             throws into things.
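
One possible reading of "Thing two" and "Thing three" above, as a sketch only: keep the escaped form as the one that gets updated, and unescape lazily behind a flag. The class, its fields, and the decodeCount counter are hypothetical names invented for this illustration, and UTF-8 is assumed for the escaped octets. It does not touch the java.net.URL question in "Thing four".

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical holder for one URL path; not part of jCIFS.
class LazyPath {
    private String escaped;       // always current
    private String unescaped;     // cached, may be out of date
    private boolean stale = true; // the flag from "Thing three"
    int decodeCount = 0;          // only here to show how often we decode

    LazyPath(String escaped) { this.escaped = escaped; }

    String getEscaped() { return escaped; }

    // Updating the escaped form only marks the cache stale.
    void setEscaped(String newEscaped) {
        escaped = newEscaped;
        stale = true;
    }

    // The unescaped form is rebuilt only when it is actually used.
    String getUnescaped() {
        if (stale) {
            unescaped = unescape(escaped);
            stale = false;
            decodeCount++;
        }
        return unescaped;
    }

    // Decodes %XX sequences; a malformed '%' is passed through as-is.
    static String unescape(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()
                    && Character.digit(s.charAt(i + 1), 16) >= 0
                    && Character.digit(s.charAt(i + 2), 16) >= 0) {
                out.write(Character.digit(s.charAt(i + 1), 16) * 16
                        + Character.digit(s.charAt(i + 2), 16));
                i += 2;
            } else {
                int cp = s.codePointAt(i);
                byte[] b = new String(Character.toChars(cp))
                        .getBytes(StandardCharsets.UTF_8);
                out.write(b, 0, b.length);
                i += Character.charCount(cp) - 1;
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        LazyPath p = new LazyPath("/share/mike%20@%20work/file9%239009.bin");
        System.out.println(p.getUnescaped()); // /share/mike @ work/file9#9009.bin
        System.out.println(p.getUnescaped()); // served from the cache
        System.out.println(p.decodeCount);    // 1
    }
}
```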
> 
> 
> You know jCIFS is actually quite fast? In almost every instance trivial tests
> have shown it's actually faster than all the other clients (of course it uses much
> more resources). In some cases it's a *lot* faster. It's great at copying large
> bushy trees of directories around. The crawlers like ThreadedSmbCrawler will
> fly through entire machines in a matter of seconds.
> 
> Now add URL path name escaping. Even if there are no offensive characters
> you still have to parse and test each character to see that it falls within the
> prescribed set. This would slow things down noticeably.
> 
> So we give up that speed and completely decimate transcribability, for what?
> 
> Portability.
>