[jcifs] [URL Potability] Creating file with hash ('#') in filename
Allen, Michael B (RSCH)
Michael_B_Allen at ml.com
Fri Jan 17 10:33:02 EST 2003
Lookout won't let me send this!
> -----Original Message-----
> From: Allen, Michael B (RSCH)
> Sent: Thursday, January 16, 2003 1:09 AM
> To: 'Christopher R. Hertel'
> Cc: 'jcifs at samba.org'
> Subject: RE: [jcifs] [URL Potability] Creating file with hash ('#') in filename
>
>
>
> > -----Original Message-----
> > From: Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> >
> > > That would happen far too much and annoy people to no end. The only
> > > reason for escaping things in the first place is to make them portable:
> > >
> > > "The space character is excluded because significant spaces may disappear
> > > and insignificant spaces may be introduced when URI are transcribed or
> > > typeset or subjected to the treatment of wordprocessing programs."
> >
> > That's the reason for escaping spaces. There are a lot of other
> > characters that get escaped. The reasons are:
> >
> > 1) They have syntactic meaning within the URL.
> > 2) They have syntactic meaning when used to identify a URL in another
> > context.
> > 3) They are non-printing.
> > 4) They may get munged by an intermediary.
> >
> > RFC 2396:
> > 2. URI Characters and Escape Sequences
> >
> > URI consist of a restricted set of characters, primarily chosen to
> > aid transcribability and usability both in computer systems and in
> > non-computer communications. Characters used conventionally as
> > delimiters around URI were excluded. The restricted set of
> > characters consists of digits, letters, and a few graphic symbols
> > were chosen from those common to most of the character encodings and
> > input facilities available to Internet users.
> >
> > uric = reserved | unreserved | escaped
> >
> > Within a URI, characters are either used as delimiters, or to
> > represent strings of data (octets) within the delimited portions.
> > Octets are either represented directly by a character (using the US-
> > ASCII character for that octet [ASCII]) or by an escape encoding.
> > This representation is elaborated below.
> >
> But I cannot think of any characters that if placed in the path component would
> actually *break* the parsing. The '#' did. The '@' does in 0.6 but not in 0.7. '?'
> doesn't because it's not permitted in an SMB filename. So none of the characters
> you refer to can actually create a syntactic problem. Therefore the remaining
> issue is transcribability and usability. And since not escaping the path improves
> transcribability we are left with usability. And usability suffers by not escaping
> reserved characters because it diminishes portability. The only effect is reduced
> portability.
>
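A minimal sketch of why the '#' breaks parsing, for illustration only. It uses java.net.URI as a stand-in for the parser, because java.net.URL needs a registered handler for the smb scheme; the class name HashInPathDemo and its helper methods are hypothetical and not part of jCIFS.

```java
import java.net.URI;

final class HashInPathDemo {
    // Path component as seen by a generic RFC 2396 parser (escapes decoded).
    static String parsedPath(String url) {
        return URI.create(url).getPath();
    }

    // Fragment component, or null when the URL has none.
    static String parsedFragment(String url) {
        return URI.create(url).getFragment();
    }

    public static void main(String[] args) {
        String raw = "smb://server/share/file9#9009.bin";       // '#' left as-is
        String escaped = "smb://server/share/file9%239009.bin"; // '#' written as %23

        // Unescaped: the parser cuts the filename at '#'.
        System.out.println(parsedPath(raw));       // /share/file9
        System.out.println(parsedFragment(raw));   // 9009.bin

        // Escaped: the whole filename stays in the path component.
        System.out.println(parsedPath(escaped));   // /share/file9#9009.bin
    }
}
```

With the raw form, everything after the '#' lands in the fragment and the path is truncated, which matches the breakage described above.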
> > > These URLs are not going to be embedded in word processing programs.
> >
> > Why not? I can imagine using Kword to edit a document stored on some
> > other system. Kword might use the SMB URL string to identify the document
> > in its "most recently accessed" list or somesuch. Likewise, when I print
> > the document I might include the filename in the footer. There it is.
> >
> > In an office environment, where documents are shared by several people, an
> > SMB URL might be handed around, or even referenced in an internal memo.
> >
> > Again, it's dangerous to guess how an SMB URL might be used.
> >
> True. I admit this is a weak argument.
>
> > > If they are embedded in a web page it will be the path component
> > > appended to an HTTP URL
> >
> > If it's a relative URL, then it's an HTTP URL. If you append it to an
> > HTTP URL then it's just a relative portion and has nothing to do with SMB.
> >
> Right, this has nothing to do with SMB.
>
> > > which the web application needs to escape (i.e. NetworkExplorer).
> > > No one is going to send an SMB URL to someone in an e-mail. That
> > > kind of stuff just doesn't fit the protocol. That's what HTTP and
> > > FTP are for.
> >
> > I disagree with you here, mostly for reasons already stated. Further,
> > though, I got an email this very day (geez, I write too much) in which
> > someone where I worked was talking about the fact that his users just
> > couldn't get the hang of using FTP or HTTP to share files.
> >
>
> They don't? People link to websites and anonymous FTP servers almost
> exclusively. Either that or they just attach a file. I very rarely see a UNC path in
> an e-mail. I do it occasionally but I've never really seen anyone else do it.
>
> > > Again, we've been here before. If you remember when we un-escaped URLs
> > > in 0.6 we suddenly had to escape them. Then I decided to hang onto
> > > the URL that was passed in as is and give back what was given in. But
> > > that didn't quite work either. The end result will be that
> > > escaping will creep in.
> >
> > I don't think it should creep back in. It needs to be handled head-on.
> >
>
> I didn't describe this adequately. The escaping will "creep" in when you basically
> listFiles(). When you listFiles() you have to escape their paths. This is why:
>
> When you get a URL like:
>
> smb://server/share/mike @ work/
>
> You can save the original URL given and when asked give back the same thing.
> You know it's ok because that's what they gave you. But now let's say you
> listFiles() and they have these reserved characters. Now you don't know if it's ok
> so you have to escape them:
>
> smb://server/share/mike @ work/Some Stuff/
> smb://server/share/mike @ work/file892%44.bin
> smb://server/share/mike @ work/file9#9009.bin
>
> As you drill down you get more and more escaped stuff creeping in. But this is
> theory. In practice you can't even do it this way. You cannot selectively escape
> part of the path. Once you listFiles() you have to completely escape the path. I
> know you know that so I won't try to prove why.
>
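The escaping step for listFiles() results could be sketched as below. This is an illustration under stated assumptions, not jCIFS code: the class PathEscaper and its methods are hypothetical, only RFC 2396 "unreserved" characters are left alone, and non-ASCII names are assumed to be encoded as UTF-8 before escaping (RFC 2396 does not mandate a charset).

```java
import java.nio.charset.StandardCharsets;

final class PathEscaper {
    // RFC 2396 "mark" characters; with letters and digits they form "unreserved".
    private static final String MARKS = "-_.!~*'()";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Percent-escapes every byte of a single file or directory name
    // that is not an unreserved character.
    static String escapeSegment(String name) {
        StringBuilder out = new StringBuilder(name.length() + 8);
        // Assumption: names are converted to UTF-8 bytes before escaping.
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            int c = b & 0xFF;
            boolean alnum = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
                    || (c >= '0' && c <= '9');
            if (alnum || MARKS.indexOf(c) >= 0) {
                out.append((char) c);
            } else {
                out.append('%').append(HEX[c >> 4]).append(HEX[c & 0x0F]);
            }
        }
        return out.toString();
    }

    // Builds a child URL from an already-escaped parent URL (ending in '/')
    // and a raw name as it would come back from a directory listing.
    static String childUrl(String escapedParentUrl, String rawName) {
        return escapedParentUrl + escapeSegment(rawName);
    }

    public static void main(String[] args) {
        String parent = "smb://server/share/mike%20%40%20work/";
        for (String name : new String[] {"Some Stuff", "file892%44.bin", "file9#9009.bin"}) {
            System.out.println(childUrl(parent, name));
        }
    }
}
```

Note that the literal '%' in file892%44.bin has to become %25, otherwise a later unescape would turn %44 into 'D'. That is one reason the path cannot be escaped selectively.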
> > > More importantly, all that character manipulation is very costly.
> >
> > Then it's important to find ways to minimize it by figuring out the locus
> > at which it must occur.
>
> See above. The whole path must be escaped.
>
> >
> > As usual, I'm on the theory side of the house here, and the job is to
> > figure out how to make this all practical. Annoying, eh?
> >
> > Thing one: The unescaping doesn't occur until the URL is split into its
> > component pieces. That's logical, since the escapes may be
> > protecting some character that would otherwise be a delimiter.
> >
> > Thing two: Once the URL is decomposed, the pieces can be unescaped, but
> > the escaped version would be kept as well. If a change is
> > made to, say, the path then both versions are updated.
> >
> > Thing three: Well, maybe not. If the escaped version is updated then
> > there is no need to update the unescaped version until it's
> > actually used. A flag would be needed...
> >
> > Thing four: I know exactly how I would do this if I were not relying on
> > java.net.URL. I don't know what kind of monkey wrench that
> > throws into things.
>
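Things one through three above can be sketched roughly as follows, independent of java.net.URL. Everything here is hypothetical (the class LazyPath, its fields and methods); it assumes escapes decode to UTF-8 bytes and leaves malformed escapes such as a lone '%' untouched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class LazyPath {
    private String escaped;        // escaped form, always current
    private List<String> decoded;  // cached unescaped segments
    private boolean stale = true;  // the flag from "thing three"

    LazyPath(String escapedPath) {
        this.escaped = escapedPath;
    }

    // Updating the escaped form only marks the cache stale.
    void setEscaped(String escapedPath) {
        this.escaped = escapedPath;
        this.stale = true;
    }

    String escaped() {
        return escaped;
    }

    // Split on '/' first, then unescape each piece ("thing one"),
    // and only when the unescaped form is actually requested.
    List<String> segments() {
        if (stale) {
            List<String> parts = new ArrayList<>();
            for (String raw : escaped.split("/")) {
                if (!raw.isEmpty()) {
                    parts.add(unescape(raw));
                }
            }
            decoded = parts;
            stale = false;
        }
        return decoded;
    }

    // Decodes %XX escapes; assumption: the escaped bytes are UTF-8.
    static String unescape(String s) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    bytes.write(hi * 16 + lo);
                    i += 2;
                    continue;
                }
            }
            // Literal character: copy it through as UTF-8.
            int cp = s.codePointAt(i);
            byte[] lit = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
            bytes.write(lit, 0, lit.length);
            i += Character.charCount(cp) - 1;
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For example, the path /share/a%2Fb/c yields three segments (share, a/b, c). Unescaping before splitting would have produced four, which is the point of thing one.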
>
> You know jCIFS is actually quite fast? In almost every instance trivial tests
> have shown it's actually faster than all the other clients (of course it uses much
> more resources). In some cases it's a *lot* faster. It's great at copying large
> bushy trees of directories around. The crawlers like ThreadedSmbCrawler will
> fly through entire machines in a matter of seconds.
>
> Now add URL path name escaping. Even if there are no offensive characters
> you still have to parse and test each character to see that it falls within the
> prescribed set. This would slow things down noticeably.
>
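To make the cost argument concrete, here is a rough sketch of the per-character test that even a "clean" path has to go through. The class EscapeScan and the choice of safe set (RFC 2396 unreserved characters plus '/') are assumptions for illustration; it makes no claim about how large the slowdown would be in practice.

```java
final class EscapeScan {
    // Lookup table for ASCII characters that never need escaping in a path.
    private static final boolean[] SAFE = new boolean[128];

    static {
        for (char c = 'a'; c <= 'z'; c++) SAFE[c] = true;
        for (char c = 'A'; c <= 'Z'; c++) SAFE[c] = true;
        for (char c = '0'; c <= '9'; c++) SAFE[c] = true;
        // RFC 2396 marks, plus '/' as the path separator (an assumption here).
        for (char c : "-_.!~*'()/".toCharArray()) SAFE[c] = true;
    }

    // Returns the index of the first character that would need escaping,
    // or -1 if the path is clean. Every character is still examined
    // in the clean case.
    static int firstUnsafe(String path) {
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (c >= 128 || !SAFE[c]) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnsafe("/share/docs/readme.txt"));
        System.out.println(firstUnsafe("/share/file9#9009.bin"));
    }
}
```

A table lookup keeps the test cheap per character, but the scan is still linear in the path length for every name returned by a listing.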
> So speed and transcribability are both sacrificed, and for what?
>
> Portability.
>