[jcifs] SMB URL parsing buglet?

Thu Sep 5 15:54:16 EST 2002

"Allen, Michael B (RSCH)" wrote:
> 
> > -----Original Message-----
> > From: Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> > Sent: Thursday, September 05, 2002 12:04 AM
> > To:   jcifs at samba.org
> > Subject:      [jcifs] SMB URL parsing buglet?
> >
> > Mike,
> >
> > java List smb://192.168.101.16/c/My%20Download%20Files
> >
> > dumps an exception, but
> >
> > java List "smb://192.168.101.16/c/My Download Files"
> >
> > works fine.  The % escapes should work, though.
> >
>         True. I believe we determined it was only necessary to URL
>         encode the authority component before the '@' (e.g. another '@').
>         So if it's not necessary to URL encode path names I don't bother
>         to try and decode them for the sake of performance. But with
>         applications like NetworkExplorer it would be a great idea to
>         URL decode these.

I don't think I would have determined that.

The string form of the URL, when displayed, should always be presented as
fully encoded.  True, some users will type the URLs in with "illegal"
characters and it is nice if the code can cope.  It's also true that
different parts of the URL string will require that different characters are
encoded.  The '@', for instance, needs to be encoded if it is *not* used as
a delimiter in the authority component.  It probably doesn't need
to be encoded in the path (though I haven't checked the syntax to be sure).

Each part of the string has a different set of reserved characters, which is
why it is so hard for a user to get it right, and why browsers are so
forgiving.  Of course, when some pedantic fool like me *does* get it right,
then it should work.  :)
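
To illustrate how the reserved set differs by component, here is a minimal
Java sketch that leans on the standard java.net.URI multi-argument
constructor, which quotes each component by its own rules.  The class name
ComponentEscapes and the sample user, host, and path values are made up for
illustration; this is not how jCIFS builds its URLs.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of jCIFS.
final class ComponentEscapes {
    // Build an SMB URL string from raw (unescaped) components and let
    // java.net.URI apply the quoting rules of each component.
    static String build(String userInfo, String host, String path) {
        try {
            return new URI("smb", userInfo, host, -1, path, null, null)
                    .toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // The '@' inside the user info is escaped because it is not the
        // delimiter; the '@' in the path is left alone; the space is escaped.
        System.out.println(build("user@dom", "192.168.101.16", "/c/a@b c"));
        // prints smb://user%40dom@192.168.101.16/c/a@b%20c
    }
}
```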

Decoding a URL string from the command line only happens when the user hits
return, so I don't see it as a performance issue.  I imagine that it would
be best (and I haven't looked at the code recently either, so forgive me if
I'm talking through my elbow) if the URL object kept both the presentation
form and the decoded form of each field handy, or kept NULL for the
presentation form until it was requested (that is, cache it).
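
A rough sketch of the caching idea, for one field only.  The class name
PathField and its methods are hypothetical and do not reflect the actual
jCIFS classes; the escaping is delegated to java.net.URI simply to keep the
example short.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical holder for the path field of a URL; not jCIFS code.
final class PathField {
    private final String decoded;   // the form handed to the protocol layer
    private String presentation;    // escaped form; null until requested

    PathField(String decoded) {
        this.decoded = decoded;
    }

    String decoded() {
        return decoded;
    }

    // Compute the escaped form on first request, then reuse it.
    String presentation() {
        if (presentation == null) {
            try {
                // A path-only URI; getRawPath() returns the escaped path.
                presentation = new URI(null, null, decoded, null).getRawPath();
            } catch (URISyntaxException e) {
                throw new IllegalArgumentException(e);
            }
        }
        return presentation;
    }

    public static void main(String[] args) {
        PathField p = new PathField("/c/My Download Files");
        System.out.println(p.decoded());
        System.out.println(p.presentation());
    }
}
```

The escaping cost is paid at most once per field, and only if something
actually asks for the string form.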

If I type in something like 

  "smb://192.168.101.16/c/My Download Files"

I would expect that a method that returns the URL string resulting from
the parsing of that input would give me:

  smb://192.168.101.16/c/My%20Download%20Files

...which is the correct form of the URL.  Notice that no semantic
translations are done (i.e., it didn't replace the IP address with the
NetBIOS or DNS name).  Only syntactic corrections are made.
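
A minimal sketch of that purely syntactic round trip: decode whatever
escapes are present in the path, then re-encode it, so that both ways of
typing the URL yield the same string.  The class SmbUrlNormalizer is
hypothetical, the set of characters left unescaped is a conservative
assumption rather than the exact set the URL syntax allows, and the
authority component is passed through untouched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jCIFS.
final class SmbUrlNormalizer {
    private static final String HEX = "0123456789ABCDEF";

    // Replace each valid %XX escape with the byte it stands for.
    static String decode(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit((char) (in[i + 1] & 0xFF), 16);
                int lo = Character.digit((char) (in[i + 2] & 0xFF), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo);
                    i += 2;
                    continue;
                }
            }
            out.write(in[i]);   // ordinary byte, or a stray '%'
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // Escape every byte except letters, digits, and "-._~/" (assumed set).
    static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte raw : s.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF;
            boolean safe = (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
                    || (b >= '0' && b <= '9') || "-._~/".indexOf(b) >= 0;
            if (safe) {
                sb.append((char) b);
            } else {
                sb.append('%').append(HEX.charAt(b >> 4))
                  .append(HEX.charAt(b & 0x0F));
            }
        }
        return sb.toString();
    }

    // Rewrite only the path; scheme and authority are copied as typed.
    static String normalize(String url) {
        int authStart = url.indexOf("://");
        if (authStart < 0) return url;
        int pathStart = url.indexOf('/', authStart + 3);
        if (pathStart < 0) return url;
        return url.substring(0, pathStart)
                + encode(decode(url.substring(pathStart)));
    }

    public static void main(String[] args) {
        System.out.println(normalize("smb://192.168.101.16/c/My Download Files"));
        System.out.println(normalize("smb://192.168.101.16/c/My%20Download%20Files"));
        // both print smb://192.168.101.16/c/My%20Download%20Files
    }
}
```

One known limitation of this sketch: an escaped slash (%2F) in the input is
decoded and then left as a literal '/', which changes the path structure, so
a real parser would split the path into segments before decoding.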

> > I seem to recall that there were some URL issues that you hadn't had
> > time to look at yet.
> >
>         The big issues were resolved. There are some pathological
>         canonicalization issues but I don't think they would be
>         encountered in the wild really.

I think I hit one.  :)

>         I would rather put effort into other things at the moment.
>         Later, I will write a little state machine to parse these
>         URLs. That should be better, smaller, and more extensible.
>         Same principle as the one I have at the bottom of my homepage
>         here:
> 
>           http://www.eskimo.com/~miallen/
> 
>         for parsing CSV lines.

I don't know how many state machines I've written...  Lost count.  I used to
hand-code lexical analyzers for fun.  ;)

I do have an SMB URL parser written in C, but it is over a year old, based
on older ideas, and doesn't do all I want it to do.  When I'm not writing my
book, I'll go back to writing code.  I hope...

Chris -)-----

-- 
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org