[jcifs] Creating file with hash ('#') in filename

Sat Jan 18 07:40:09 EST 2003

On Fri, Jan 17, 2003 at 04:10:14AM -0500, Michael B. Allen wrote:
> On Fri, 17 Jan 2003 00:53:16 -0600
> "Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:
> 
> > On Fri, Jan 17, 2003 at 12:22:08AM -0500, Allen, Michael B (RSCH) wrote:
> > :
> > > > This is the Unicode issue.  URLs must handle full Unicode in order to deal 
> > > > with this stuff.  It is a problem, but it's not specific to the SMB URL.
> > > > 
> > > Yes, it does seem to be coming down to Unicode. But it is specific to the SMB URL.
> > > RFC 2396 does not appear to address Unicode. You cannot have Unicode HTTP URLs
> > > for example. Thus the problem.
> > 
> > HTTP URLs are just one example.  Still, "You cannot have Unicode HTTP
> > URLs"...  That's why it's *not* specific to the SMB URL.  People may post
> > all sorts of files on a web server.  If the filename is in Cyrillic
> > then...well, same problem even if it's accessed via HTTP.  It's a general
> > URL issue.  URLs *should* support Unicode... but they don't.
> 
> No. You cannot have a Cyrillic filename on a web server. You can have a
> Cyrillic *link* displayed in the page but the filenames and all parts of
> the URL are ASCII. There might be extensions to this. I don't know. But
> URLs are 100% good ol' ASCII.

Then you cannot have a Cyrillic filename in the SMB URL.  It's the *same
problem*.  ...but there is a solution.

The URL format itself disallows (unescaped) non-ASCII characters.  If a
web server enforces that rule then the client is SOL even if it does allow
the user to enter Cyrillic (or other) characters.

SMB servers, however, are not the same as web servers.  What I am saying 
is that jCIFS should allow the use of non-ASCII characters in the SMB URL.
These should be handled as Unicode.

> Java ships with an HTTP URL handler. jCIFS provides an SMB URL
> handler. You're arguing that the SMB handler should be compliant. I'm just
> pointing out that the HTTP URL handler shipped with Java isn't compliant
> either and Eric's example illustrated that. It took the spaces. And it
> passed them back unescaped. So someone punted on the escapes just like me,
> and probably for transcribability reasons. Don't really know.

Ah.  Yes, that makes sense.  Sorry it took me so long to figure out what 
you were saying.

I can see that the handler would accept the spaces (for user-friendliness,
of course).  I disagree, however, about the transcribability of URLs with
spaces. I think spaces are very ambiguous in a URL.  For hand-written URLs
that contain spaces, I would likely go back to the ancient IBM practice of
using a 'b' with a slash through it.  For typed URLs (sent in e'mail,
etc.) I would use %20, but I'm unusual (in more ways than one).

Generally speaking, the problem with spaces is that you don't know how 
many there are, you don't know if they are a space or (gasp) a tab.  
Hand-written URLs might have them added just so the URL fits on the napkin 
you've scrawled it on, and URLs sent by e'mail are subject to the fiddling 
that editors and some mail agents sometimes apply.  Ick.

> >       space: ' '
> >        hash: '#'
> >     percent: '%'
> >   back-tick: '`'
> >       caret: '^'
> >      lbrace: '{'
> >      rbrace: '}'
> 
> Interesting. I was wondering what the precise list was.

I *think* that's the list.  The SNIA doc is not reliable on this point
(yes, I know...).  Still, if the above list is wrong it's not wrong by
much.  :)

> > The rest of the characters belong to the "unwise" set, listed in RFC2396.  
> > These *should* be escaped by the user, but you could fudge those too.  
> > Again, it's not that jCIFS requires that they be escaped, it's that a 
> > smart user (do they exist? I think there are some) will escape them.
> 
> And being smart they will quickly find that it won't work.

The point is, it *should* work.  Escapes *should* be handled.

> > So, except for the percent sign and the space, you could argue that all of
> > the above are fudge-able.  That still doesn't get you out of having to
> > translate escape sequences if the user hands them to you.
> 
> I don't understand. We don't have to do anything. We just pass them
> right through. Only the cross hatch '#' was syntactically important. And
> we handled that. This is good!

If someone hands you "Foob%61r" then you should not pass that straight
through.  The %61 should be translated to 'a'.  That's the behavior of
URLs.
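
To make the "Foob%61r" case concrete, here is a minimal sketch of the
translation step in Java.  The class and method names (PercentDecode,
decode) are made up for illustration and are not part of jCIFS.  It also
assumes that escaped octets are to be read as UTF-8, which is my
assumption and not something the SMB URL draft is quoted as saying here.
Malformed escapes (a '%' not followed by two hex digits) are passed
through untouched.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class PercentDecode {
    /**
     * Replaces each %XX escape with the octet it names, then reads the
     * resulting octets as UTF-8 (an assumption made for this sketch).
     */
    public static String decode(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length);
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length) {
                int hi = Character.digit((char) (in[i + 1] & 0xFF), 16);
                int lo = Character.digit((char) (in[i + 2] & 0xFF), 16);
                if (hi >= 0 && lo >= 0) {
                    out.write((hi << 4) | lo); // the octet named by %XX
                    i += 2;                    // skip the two hex digits
                    continue;
                }
            }
            out.write(in[i]); // ordinary octet, or a '%' that is not an escape
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(decode("Foob%61r")); // prints "Foobar"
    }
}
```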

> But we just keep butting into the same problem: non-ASCII characters.

Absolutely.

The non-ASCII issue is a problem and my solution--which is probably the
solution you'd want (I'm guessing)--also fits with Eric's guidelines.

The solution is that the SMB URL draft accepts the restriction and stays 
within the rules, but that jCIFS (and, I hope, other implementations) will 
accept non-ASCII characters on the command line.  Escaped or not.  The SMB
protocol itself allows non-ASCII characters, so there really is no other 
choice.  Client tools *must* accept non-ASCII characters.

Note that non-ASCII characters are nicely outside of the "unwise" and
"reserved" sets used by URLs.  That is, from the perspective of RFC2396,
the whole non-ASCII range is its own set.  That's good.  It makes it easy,
on the command line, to determine which characters belong to which sets.

That being the case, I have no problem recommending that a user agent 
accept non-ASCII characters.
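
As a rough illustration of how easy the sorting is, the sketch below
classifies a single character.  The "reserved" and "unwise" strings are
the sets as RFC 2396 lists them; the class name, method name, and the
returned labels are invented for this example.

```java
public class UrlCharClass {
    // RFC 2396 "reserved" characters.
    static final String RESERVED = ";/?:@&=+$,";
    // RFC 2396 "unwise" characters.
    static final String UNWISE = "{}|\\^[]`";

    /** Returns a label for the set that the character falls into. */
    public static String classify(char c) {
        if (c > 0x7F) {
            return "non-ASCII"; // entirely outside the RFC 2396 sets
        }
        if (RESERVED.indexOf(c) >= 0) {
            return "reserved";
        }
        if (UNWISE.indexOf(c) >= 0) {
            return "unwise";
        }
        return "other ASCII";
    }

    public static void main(String[] args) {
        char[] samples = { '/', '{', 'a', '\u0444' };
        for (char c : samples) {
            System.out.println((int) c + " -> " + classify(c));
        }
    }
}
```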

> Well browsers do not understand the SMB URL yet. And when they do they
> will have to deal with this problem too. If an application expects to
> get escaped SMB URLs then it will have to deal with that but I don't
> see how it can.

Um...  Konqueror, Galeon...  They support the SMB URL.

I don't know what problem there is in accepting escaped SMB URLs.  You say 
"I don't see how it can", but I don't see any problem in doing so.

> > There's got to be some way 'round it...  Hmmm...
> 
> The *real* problem is that we don't know if the resource refers to a
> directory or file and when you combine them with the 2 argument parameters
> the URL class cuts off the name if it doesn't have a trailing slash:
> 
>   smb://server/share/dir/ + file.txt => smb://server/share/dir/file.txt
>   smb://server/share/dir + file.txt => smb://server/share/file.txt
> 
> The 'dir' gets dropped. I don't think we're going to fix that. I think
> this is built into the URL specs somewhere. Maybe you can come up with a
> solution.

The solution is semantic, not syntactic, I'm afraid.  You'd have to 
contact the server and ask it "what is this thing".  If it's a directory, 
then add the slash.
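
The trailing-slash behavior can be reproduced without jCIFS.  The sketch
below uses java.net.URI (not the java.net.URL class with the jCIFS
handler, which is what the quoted example was about) because URI needs no
protocol handler for "smb" and follows the same relative-resolution
rules.  The isDirectory flag is a stand-in for whatever the server would
answer; the class and method names are hypothetical.

```java
import java.net.URI;

public class ResolveDemo {
    /** Resolves a relative name against a base, per the generic URI rules. */
    public static String join(String base, String name) {
        return URI.create(base).resolve(name).toString();
    }

    /**
     * Adds the trailing slash when the caller already knows (for example,
     * by asking the server) that the base names a directory.
     */
    public static String joinKnowingType(String base, boolean isDirectory, String name) {
        if (isDirectory && !base.endsWith("/")) {
            base = base + "/";
        }
        return join(base, name);
    }

    public static void main(String[] args) {
        // prints smb://server/share/dir/file.txt
        System.out.println(join("smb://server/share/dir/", "file.txt"));
        // prints smb://server/share/file.txt ('dir' is dropped)
        System.out.println(join("smb://server/share/dir", "file.txt"));
        // prints smb://server/share/dir/file.txt
        System.out.println(joinKnowingType("smb://server/share/dir", true, "file.txt"));
    }
}
```

The syntax alone cannot tell a directory from a file, which is why the
flag has to come from somewhere outside the URL.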

> > I should say that jCIFS should accept non-ASCII characters as Unicode so 
> > that users do not have to escape non-ASCII characters.
> 
> Well all strings are Unicode in Java. And we don't escape them so this
> is what jCIFS does and has always done (although until we introduced
> the jcifs.encoding property I don't think it converted things properly
> internally).

Going back to Eric's guidelines, I don't think that jCIFS would need (or 
want) to return non-ASCII as escaped.  Only that set I listed above.
(Space, hash, percent...)
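
A sketch of what "return only that set escaped" could look like, taking
the seven characters from the list quoted earlier in this message (which,
as noted, may not be exactly right).  Non-ASCII characters pass through
unchanged.  The class and method names are invented for illustration and
this is not how jCIFS is known to do it.

```java
public class SmbUrlEscape {
    // space, hash, percent, back-tick, caret, lbrace, rbrace
    private static final String MUST_ESCAPE = " #%`^{}";

    /** Escapes only the characters in MUST_ESCAPE; everything else is kept. */
    public static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (MUST_ESCAPE.indexOf(c) >= 0) {
                sb.append(String.format("%%%02X", (int) c)); // e.g. ' ' becomes %20
            } else {
                sb.append(c); // includes all non-ASCII characters
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("my file#1.txt")); // prints my%20file%231.txt
    }
}
```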

> Ok. There are no codepage issues with *the SMB URL* (because everything
> coming in gets converted to Unicode).

After which it's a protocol issue.  Yes.
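
To show why it becomes a protocol issue, the sketch below encodes one
Java string two ways.  ISO-8859-1 is used only as a stand-in for a
single-byte codepage (it is not a DOS OEM codepage), and UTF-16LE stands
in for a Unicode wire format; which encoding a given server actually
expects is exactly the open question.  The class and method names are
hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class WireBytes {
    /** Formats a byte array as space-separated uppercase hex pairs. */
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String name = "caf\u00e9"; // one Unicode string inside Java
        // prints 63 61 66 E9
        System.out.println(hex(name.getBytes(StandardCharsets.ISO_8859_1)));
        // prints 63 00 61 00 66 00 E9 00
        System.out.println(hex(name.getBytes(StandardCharsets.UTF_16LE)));
    }
}
```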

> > There is nothing that can be done to fix it, however, unless the
> > jcifs.encoding property can be told to use the (proprietary) DOS OEM
> > codepages.  I think it would be ugly to add a ?CODEPAGE= option to the SMB 
> > URL, and doing so would require that all implementations come with a 
> > complete set of DOS codepages.
> 
> I like it. Throw it in. Make it optional what codepages are supported.

Eeewww...  but I'll think about it.  It makes some sort of twisted sense.  
:)

I think I like using the jCIFS/Java properties better, but that won't help 
much if there are multiple servers (with different codepages...ouch).

> > I think we should just stick to Unicode and punt on the codepages.
> 
> There are NO CODEPAGE ISSUES in jCIFS. The jcifs.encoding property

I'm not saying that jCIFS has a problem with codepages.  I have tried to
be very, very clear that this is an SMB protocol problem.  A major goal of
this discussion is to find good ways to handle the SMB protocol problem in
jCIFS (and other implementations).

> DOES specify a codepage. JCIFS supports ALL codepages listed here:
> 
>   http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html

Nice list...

> and fully Unicode. It undoubtedly has the most flexible character
> encoding support of ANY client. It can read in Korean encoded SMB URLs
> and talk EBCDIC with OS/390.
> 
> Really. It's built into the language. I didn't even have to do anything.

Not meant to be a challenge.

Chris -)-----
...better a good, heated debate than no discussion at all.  This is good.  
:)

-- 
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org