[jcifs] Creating file with hash ('#') in filename

Thu Jan 16 15:35:39 EST 2003

On Wed, Jan 15, 2003 at 10:07:27PM -0500, Allen, Michael B (RSCH) wrote:
:
> > No, I mean that the parsing of URLs is standard, based on the RFC.  The # 
> > is defined as part of the syntax of generic URLs.  It's just that the 
> > semantics have no meaning for the SMB URL.
> > 
> 	Ok. I thought it was specific to HTTP.

...it explains why the java.net.URL class parses on the # sign.  It will 
probably do that on all URLs that match the particular syntax (there are 
variations within the URL syntax itself).
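
A minimal sketch of that behaviour, for anyone who wants to see it. The class
name HashSplitDemo, the host "server" and the file name are made up for
illustration. java.net.URL needs a registered protocol handler, so the URL
half uses an http URL. The smb half uses java.net.URI, which applies the
generic syntax to any scheme.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

public class HashSplitDemo {
    // Returns {path, ref} exactly as java.net.URL splits the string.
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs but still present
    static String[] splitWithUrl(String spec) {
        try {
            URL url = new URL(spec);
            return new String[] { url.getPath(), url.getRef() };
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns {path, fragment} as the generic parser in java.net.URI splits the string.
    static String[] splitWithUri(String spec) {
        URI uri = URI.create(spec);
        return new String[] { uri.getPath(), uri.getFragment() };
    }

    public static void main(String[] args) {
        String[] a = splitWithUrl("http://server/share/report#1.txt");
        System.out.println("URL path=" + a[0] + " ref=" + a[1]);
        String[] b = splitWithUri("smb://server/share/report#1.txt");
        System.out.println("URI path=" + b[0] + " fragment=" + b[1]);
    }
}
```

In both cases the file name "report#1.txt" is cut at the '#', so the path
comes back as "/share/report" and the remainder lands in the ref/fragment.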

> 	True. By removing the '#' as the ref delimiter we will be breaking the
> 	generic URL syntax (albeit not by much and of no consequence to anyone).

...and that is where a developer gets to make a decision balancing 
user-friendly vs. correct syntax.  Personally, I would favor correct 
syntax.  Let the jCIFS code throw a polite exception, and any application 
that uses jCIFS can do the work of "cleaning up" the initial URL.
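
A rough sketch of what the "polite exception" option could look like. This is
not jCIFS code: the class StrictSmbUrl and the method parseStrict are
hypothetical, and the strict check is simply delegated to java.net.URI, which
rejects characters such as an unescaped space. An unchecked
IllegalArgumentException is used here only to keep the sketch short.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class StrictSmbUrl {
    // Hypothetical strict entry point: accept only syntactically valid URLs
    // and tell the caller what to fix instead of guessing.
    static URI parseStrict(String spec) {
        URI uri;
        try {
            uri = new URI(spec);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(
                "Not a valid URL, please escape it first: " + e.getMessage(), e);
        }
        // A '#' that survived parsing was read as the fragment delimiter,
        // which has no meaning for an SMB URL.
        if (uri.getRawFragment() != null) {
            throw new IllegalArgumentException(
                "Unescaped '#' in " + spec + "; write it as %23 if it is part of a name");
        }
        return uri;
    }

    public static void main(String[] args) {
        // Properly escaped input is accepted and unescaped by getPath().
        System.out.println(parseStrict("smb://server/share/my%20report%231.txt").getPath());
        try {
            parseStrict("smb://server/share/my report#1.txt");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

An application that wants to be friendlier can catch the exception, clean up
the string, and call parseStrict again.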

> > >       The main problem is that
> > > 	SMB path names need to represent just about any character including
> > > 	Unicode which we haven't even touched on. I personally do not want to
> > > 	decode paths. That is very costly.
> > 
> > We only need to unescape them.
> > 
> 	When decoded. But what happens when we pass a string back out? You have
> 	to encode it. Now if you have spaces and @ and # it all gets escaped.

Right.  That's what I think should be done.

> > Two methods:  urlEscape() and urlUnEscape().
> > 
> 	We cannot do that. We've been here before. The SMB URLs are not used like
> 	normal URLs. People will need to specify them manually.

I don't think it's a good idea to predict how URL strings will be used.  
I'm rather amazed at what people already do with them.

>	They will be used where UNC paths are used. We cannot mandate that
>	spaces, '@', '#' and unicode characters be escaped.

On the one hand, you are correct.  It's a pain.  On the other hand, that's 
the nature of URLs.  The current RFC doesn't talk about Unicode, though.  
I imagine that characters outside of the US ASCII set would not need to be 
escaped in this situation.  Not if there's a proper Unicode 
representation.

You *do* get into a problem when using extended ASCII, however.  The 
different DOS codepages use different octet values for extended 
characters.  They don't all map to Latin1 either.  So here's the problem:

  Someone enters a filename which includes the character 'Ö' (that's 
  O-umlaut).  In the Latin1 character set (and in Unicode, which shares
  its first 256 code points with Latin1), the value is 0xD6, but in DOS
  Code Page 437 it's 0x99.

  So the question is: how do you read something like this:

  smb://server/share/path/Övertone.spew

  jCIFS would have to know the character set in use at the terminal (maybe 
  you can do that...if so, it's not a problem) in order to figure out 
  that the octet value 0x99 maps to Unicode character U+00D6, which goes
  on the wire as the bytes 0xD6 0x00 (Microsoft uses UCS2LE encoding,
  which is two bytes, little-endian--so the bytes are reversed).

So, as I said, it's still a problem if the command shell is not using 
Unicode as well.  On the other hand, if you always read escapes as 
Unicode, then

  smb://server/share/path/%D6vertone.spew

is not ambiguous.
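
The octet values above can be checked with a short sketch. The class and
method names are hypothetical. Reading "%D6" as Latin1 is an assumption that
stands in for "read escapes as Unicode" (it works because Latin1 coincides
with the first 256 Unicode code points). The IBM437 charset is an optional
one, so the sketch tests for it before using it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CodepageDemo {
    // Interprets a single octet in the given character set.
    static String decodeOctet(int octet, Charset cs) {
        return new String(new byte[] { (byte) octet }, cs);
    }

    // Formats bytes as space-separated upper-case hex, e.g. "D6 00".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Assumption: every %XX escape is one Latin1 octet. Note that
    // URLDecoder also turns '+' into a space, which a real SMB URL
    // parser would not want.
    static String readEscapesAsLatin1(String escaped) {
        try {
            return URLDecoder.decode(escaped, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String oUmlaut = "\u00D6";
        System.out.println(oUmlaut.equals(decodeOctet(0xD6, StandardCharsets.ISO_8859_1)));
        if (Charset.isSupported("IBM437")) {
            System.out.println(oUmlaut.equals(decodeOctet(0x99, Charset.forName("IBM437"))));
        }
        // Bytes as they appear in UCS2LE / UTF-16LE: prints "D6 00"
        System.out.println(hex(oUmlaut.getBytes(StandardCharsets.UTF_16LE)));
        System.out.println(readEscapesAsLatin1("%D6vertone.spew"));
    }
}
```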

>	That would happen far too much and annoy people to no end. The only
>	reason for escaping things in the first place is to make them portable:
> 
> 	"The space character is excluded because significant spaces may disappear
> 	and insignificant spaces may be introduced when URI are transcribed or
> 	typeset or subjected to the treatment of wordprocessing programs."

That's the reason for escaping spaces.  There are a lot of other 
characters that get escaped.  The reasons are:

1) They have syntactic meaning within the URL.
2) They have syntactic meaning when used to identify a URL in another 
   context.
3) They are non-printing.
4) They may get munged by an intermediary.

RFC 2396:
2. URI Characters and Escape Sequences

   URI consist of a restricted set of characters, primarily chosen to
   aid transcribability and usability both in computer systems and in
   non-computer communications. Characters used conventionally as
   delimiters around URI were excluded.  The restricted set of
   characters consists of digits, letters, and a few graphic symbols
   were chosen from those common to most of the character encodings and
   input facilities available to Internet users.

      uric          = reserved | unreserved | escaped

   Within a URI, characters are either used as delimiters, or to
   represent strings of data (octets) within the delimited portions.
   Octets are either represented directly by a character (using the US-
   ASCII character for that octet [ASCII]) or by an escape encoding.
   This representation is elaborated below.

> 	These URLs are not going to be embedded in word processing programs.

Why not?  I can imagine using Kword to edit a document stored on some 
other system.  Kword might use the SMB URL string to identify the document 
in its "most recently accessed" list or somesuch.  Likewise, when I print 
the document I might include the filename in the footer.  There it is.

In an office environment, where documents are shared by several people, an 
SMB URL might be handed around, or even referenced in an internal memo.

Again, it's dangerous to guess how an SMB URL might be used.

>       If they are embedded in a web page it will be the path component
>	appended to an HTTP URL

If it's a relative URL, then it's an HTTP URL.  If you append it to an
HTTP URL then it's just a relative portion and has nothing to do with SMB.
If it is an SMB URL it will be absolute.  That's the only way to change
protocols.  E.g., if you reference a document available via FTP in a web
page then you would use the absolute form: ftp://ftp.server.net/path/file
Likewise with an SMB URL.
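
To illustrate with java.net.URI (the base URL on www.example.com and the
class name ResolveDemo are made up for this sketch): a relative reference
inherits the scheme of the page it appears in, and only an absolute
reference can switch protocols.

```java
import java.net.URI;

public class ResolveDemo {
    // Resolves a reference found in a document against that document's URL.
    static String resolve(String base, String reference) {
        return URI.create(base).resolve(reference).toString();
    }

    public static void main(String[] args) {
        String page = "http://www.example.com/docs/index.html";
        // Relative: stays an HTTP URL.
        System.out.println(resolve(page, "files/report.txt"));
        // Absolute: the scheme of the reference wins.
        System.out.println(resolve(page, "ftp://ftp.server.net/path/file"));
        System.out.println(resolve(page, "smb://server/share/report.txt"));
    }
}
```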

> 	which the web application needs to escape (i.e. NetworkExplorer).
>       No one is going	to send an SMB URL to someone in an e-mail. That
>	kind of stuff just doesn't fit the protocol. That's what HTTP and
>	FTP are for.

I disagree with you here, mostly for reasons already stated.  Further, 
though, I got an email this very day (geez, I write too much) in which 
someone where I worked was talking about the fact that his users just 
couldn't get the hang of using FTP or HTTP to share files.

> 	Again, we've been here before. If you remember when we un-escaped URLs
>       in 0.6 we suddenly had to escape them. Then I decided to hang onto
>	the URL that was passed	in as is and give back what was given in. But
>	that didn't quite work either. The end result will be that
>	escaping will creep in.

I don't think it should creep back in.  It needs to be handled head-on.

> 	More importantly, all that character manipulation is very costly.

Then it's important to find ways to minimize it by figuring out the locus 
at which it must occur.

As usual, I'm on the theory side of the house here, and the job is to 
figure out how to make this all practical.  Annoying, eh?

Thing one:  The unescaping doesn't occur until the URL is split into its 
            component pieces.  That's logical, since the escapes may be 
            protecting some character that would otherwise be a delimiter.

Thing two:  Once the URL is decomposed, the pieces can be unescaped, but 
            the escaped version would be kept as well.  If a change is 
            made to, say, the path then both versions are updated.

Thing three: Well, maybe not.  If the escaped version is updated then 
            there is no need to update the unescaped version until it's 
            actually used.  A flag would be needed...

Thing four: I know exactly how I would do this if I were not relying on 
            java.net.URL.  I don't know what kind of monkey wrench that 
            throws into things.
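
Things one through three could be sketched roughly like this, leaving
java.net.URL out of the picture entirely. Everything here is hypothetical
(the LazyPath class, its method names, the sample paths), and reading each
%XX escape as a single Latin1/Unicode value is an assumption, not a settled
answer to the character set question.

```java
import java.util.ArrayList;
import java.util.List;

public class LazyPath {
    private String escapedPath;      // always kept, always current
    private List<String> segments;   // unescaped form, rebuilt on demand
    private boolean stale = true;    // the flag from "Thing three"

    public LazyPath(String escapedPath) {
        this.escapedPath = escapedPath;
    }

    // Updating the escaped form only marks the unescaped form as stale.
    public void setEscapedPath(String escapedPath) {
        this.escapedPath = escapedPath;
        this.stale = true;
    }

    public String getEscapedPath() {
        return escapedPath;
    }

    // Split on '/' first, then unescape each piece, so that an escaped
    // delimiter such as %2F cannot be mistaken for a real one.
    public List<String> getSegments() {
        if (stale) {
            List<String> fresh = new ArrayList<>();
            for (String raw : escapedPath.split("/")) {
                if (!raw.isEmpty()) fresh.add(unescape(raw));
            }
            segments = fresh;
            stale = false;
        }
        return segments;
    }

    // Replaces each %XX with the character whose value is XX.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '%') { out.append(c); continue; }
            if (i + 2 >= s.length()) throw new IllegalArgumentException("truncated escape in " + s);
            int hi = Character.digit(s.charAt(i + 1), 16);
            int lo = Character.digit(s.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) throw new IllegalArgumentException("bad escape in " + s);
            out.append((char) (hi * 16 + lo));
            i += 2;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        LazyPath p = new LazyPath("/share/my%20docs/file%231.txt");
        System.out.println(p.getSegments());
        p.setEscapedPath("/share/a%2Fb");   // nothing is unescaped yet
        System.out.println(p.getSegments());
    }
}
```

The cost of unescaping is paid at most once per change, and only if someone
actually asks for the unescaped form.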

> > >       Can
> > > 	someone give me a reason why we *have* to require URL encoding of the
> > > 	path component? Otherwise I think we should punt the '#ref' and just
> > >       integrate it into the path. Anything we would use it for can be
> > >       done with a query_string parameter.
> > 
> > The '#' character isn't the only problem.  You could fudge that one.  
> > There are other characters (eg., spaces) which are not legal URL
> > characters.  Non-english language characters, for example.
> > 
> > The key thing, though, is that a user may type in an SMB URL with a URL
> > escape sequence included.
> > 
> 	This is a very unlikely scenario but in that application (a web browser
>       maybe) the application will be responsible for un-escaping it.

When and why?  The unescaping must be done after the parsing but before 
the calls to jCIFS.  The reverse--escaping strings returned by jCIFS 
before handing them to java.net.URL-- also fits in between.  Is that 
(honestly, I don't know...my head's been in my book) something that is 
do-able with jCIFS as it works currently?

> > > 	Incidentally, speaking of query_string parameters, we got lucky with the '?'
> > > 	character. That *is* reserved in SMB pathnames. It's a wildcard character.
> > > 	Otherwise we really *would* have to require escaping path components.
> > 
> > We still do.  :)
> > 
> 	Seriously, what will *break* if we do not mandate escaping the path
>	component?

Any valid Windows filename or directoryname character that is not also a 
valid URL character.

You are using java.net.URL which, if I understand correctly, does the 
parsing for you.  I imagine that other tools exist out there that also do 
generic URL parsing.  These tools may simply ignore "illegal" URL 
characters, or they may not.  If java.net.URL simply passes such 
characters along then jCIFS is okay...except that some of the URLs that 
work with jCIFS won't work with, say, KDE, MacOS X, or other tools that 
support the SMB URL.

> 
> > > 	Anyway it looks like just tacking the '#ref' back onto the path component in
> > > 	Handler.java is going to do the trick.
> > 
> > ...for that *one* case, and it is still a user convenience at the expense 
> > of correct syntax.
> > 
> 	This is the debate right here. The SMB URL cannot be used on the
>       Internet because its character range is too great. It inherently
>	stomps on reserved characters. So are you weighing the "user
>	convenience" side enough?

Nope.  There is no reason that a file specified by the SMB URL could not 
also be offered to the Internet via a web server.  The same path, same 
file, same problem.

The current RFC specifies the US ASCII set of characters, so Unicode
simply isn't supported by generic URI.  We fall into that the same way FTP
and HTTP do.  I will have to ask if/when that will be covered.  A new
generic URI draft is in the works, by the way.

Anyway, if "the SMB URL cannot be used on the Internet" then there is no
point in pursuing the draft, since it is an "Internet Draft".  The point
is that the SMB URL *can* be used on the Internet (no judgement as to
whether this is wise or not...except to say that it's no less secure 
than FTP, which sends passwords in cleartext).

More likely, the SMB URL will be used on the "in*tra*net" within an office
or company or suchlike.  That, however, is pure conjecture on my part so
who knows...

> > The HTTP URL is just an instance of a URL.  A descendant type.  The rules 
> > apply to all URLs.
> > 
> 	Well in this case I meant the path component really IS an HTTP URL
>       because
> 	that is how the requested path is passed to the servlet like:
> 
> 	  http://miallen3.com:8080/servlets/NetworkExplorer/miallen1/C$/pub/

Ah.  Okay.

> > Sorry.  :(
> > 
> 	No reason to be. I am just trying to determine with certainty the answer to the
> 	question: Will the SMB URL *break* if we do not escape the path component?

It's not really a question of whether *we* escape the path.  On input, the 
user "should" do so.  A "user-friendly" application may try to clean the 
URL itself and correct the user's mistakes (as well as it can).  Doing so, 
though, runs a bit of a risk by allowing "invalid" URL strings to 
propagate.

When returning a URL to the user, I believe that it should have correct
syntax.

As far as the SMB URL breaking, my thoughts are these:

- Escapes should be handled regardless, simply because they *may* be used.
  They may be cut-and-pasted from other sources, for example.

- Other general-purpose parsers may or may not handle URLs that are not
  escaped properly.  jCIFS can be compatible by aiming for (not
  necessarily being) the least-common-denominator.

- Within the ASCII set, the list of characters that would need to be 
  escaped is limited to those that are valid filename characters but are 
  not valid URL 'pchar' characters:

    pchar         = unreserved | escaped |
                    ":" | "@" | "&" | "=" | "+" | "$" | ","

    unreserved    = alphanum | mark
          mark    = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"

  That leaves us with only a small number of disallowed ASCII characters.  
  The '#' and the space are probably the most conspicuous.

- I have no idea how Unicode and extended ASCII should be handled 
  off-hand.
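
The ASCII part of that list can be turned into a small escaping routine. This
is a sketch under assumptions, not jCIFS code: PcharEscaper and its methods
are hypothetical names, it works on one path segment at a time, and it passes
anything outside US ASCII through untouched because the Unicode and extended
ASCII question is still open.

```java
public class PcharEscaper {
    // From the RFC 2396 grammar quoted above.
    private static final String MARK = "-_.!~*'()";
    private static final String PCHAR_EXTRA = ":@&=+$,";

    // True if c may appear unescaped in a path segment (unreserved or pchar extras).
    static boolean isPchar(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || MARK.indexOf(c) >= 0
            || PCHAR_EXTRA.indexOf(c) >= 0;
    }

    // Escapes every ASCII character that is not a valid pchar, including
    // '%' itself, '#', space and '/'. Non-ASCII characters are left alone
    // (an assumption, see the lead-in).
    static String escapeSegment(String segment) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (c > 0x7F || isPchar(c)) {
                out.append(c);
            } else {
                out.append(String.format("%%%02X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeSegment("my file#1.txt")); // my%20file%231.txt
        System.out.println(escapeSegment("C$"));            // C$
    }
}
```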

> 	After we answer that question we can debate whether or not "correct syntax"
> 	outweighs "user convenience".

Good 'nough.  :)

Chris -)-----

-- 
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org