[jcifs] Creating file with hash ('#') in filename

Sat Jan 18 09:33:55 EST 2003

On Fri, 17 Jan 2003 14:40:09 -0600
"Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:

> > > > > This is the Unicode issue.  URLs must handle full Unicode in order to deal 
> > > > > with this stuff.  It is a problem, but it's not specific to the SMB URL.
> > > > > 
> > > > Yes, it does seem to be coming down to Unicode. But it is specific to the SMB URL.
> > > > RFC 2396 does not appear to address Unicode. You cannot have Unicode HTTP URLs
> > > > for example. Thus the problem.
> > > 
> > > HTTP URLs are just one example.  Still, "You cannot have Unicode HTTP
> > > URLs"...  That's why it's *not* specific to the SMB URL.  People may post
> > > all sorts of files on a web server.  If the filename is in cyrillic
> > > then...well, same problem even if it's accessed via HTTP.  It's a general
> > > URL issue.  URLs *should* support Unicode... but they don't.
> > 
> > No. You cannot have a Cyrillic filename on a web server. You can have a
> > Cyrillic *link* displayed in the page but the filenames and all parts of
> > the URL are ASCII. There might be extensions to this. I don't know. But
> > URLs are 100% good ol' ASCII.
> 
> Then you cannot have a cyrillic filename in the SMB URL.  It's the *same
> problem*.  ...but there is a solution.

Sure you can. Cyrillic, like Unicode, is a character set. It's not an
encoding. KOI8-R is a Cyrillic encoding:

  http://czyborra.com/charsets/cyrillic.html

But this is nomenclature. You cannot have an HTTP URL in any encoding
other than ASCII. But you can have an SMB URL encoded in any encoding
because we are accepting Unicode and it is the superset of all character
sets.

> The URL format itself disallows (unescaped) non-ASCII characters.  If a
> web server enforces that rule then the client is SOL even if it does allow
> the user to enter cyrillic (or other) characters.
> 
> SMB servers, however, are not the same as web servers.  What I am saying 
> is that jCIFS should allow the use of non-ASCII characters in the SMB URL.
> These should be handled as Unicode.

Ok. Well that's what we do now. I'm not convinced it is free of problems
but there really is no way to escape them reasonably. If you recall the
section from the XML spec I cited, they convert them to UTF-8 and escape
each byte. But we cannot do that because it would result in the most
ridiculous URLs the world has ever seen. Cyrillic characters would need
2 %HH escapes per character, and many other scripts 3.
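
To make the size of the problem concrete, here is a minimal sketch of the
"convert to UTF-8 and escape each byte" approach. The class and method
names (Utf8Escape, escapeNonAscii) are made up for illustration and are
not part of jCIFS; the sketch only assumes a standard JDK.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not jCIFS code: every non-ASCII character is
// replaced by the %HH form of each of its UTF-8 bytes.
final class Utf8Escape {
    static String escapeNonAscii(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp < 0x80) {
                out.append((char) cp);   // ASCII passes through untouched
                continue;
            }
            byte[] utf8 = new String(Character.toChars(cp))
                    .getBytes(StandardCharsets.UTF_8);
            for (byte b : utf8) {
                out.append('%').append(String.format("%02X", b & 0xFF));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Four Cyrillic letters followed by ".txt"
        String url = "smb://server/share/\u0444\u0430\u0439\u043b.txt";
        System.out.println(escapeNonAscii(url));
        // prints smb://server/share/%D1%84%D0%B0%D0%B9%D0%BB.txt
    }
}
```

Each Cyrillic letter in this name becomes two %HH escapes; characters
above U+07FF would take three or four.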

> > >       space: ' '
> > >        hash: '#'
> > >     percent: '%'
> > >   back-tick: '`'
> > >       caret: '^'
> > >      lbrace: '{'
> > >      rbrace: '}'
> > 
> > Interesting. I was wondering what the precise list was.
> 
> I *think* that's the list.  The SNIA doc is not reliable on this point
> (yes, I know...).  Still, if the above list is wrong it's not wrong by
> much.  :)

Ok, well I think it is important that we establish the precise list of
characters that would need to be escaped. I trust the leach v1-spec-02
doc. Provided we factor out the '.' because we are not concerned with
8.3 filename constraints and subtract any reserved based on your list
from RFC 2396 the characters that would need to be escaped are:

  ' ' | '#' | '%' | '^' | '`' | '{' | '}'

So yes, it's exactly the same list.
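
As a rough sketch of what escaping that list could look like, the
following replaces each of the seven characters with its %HH form and
leaves everything else alone. SmbPathEscape and escape are hypothetical
names for illustration, not jCIFS API.

```java
// Hypothetical helper, not jCIFS code: %HH-escapes only the seven ASCII
// characters from the list above.
final class SmbPathEscape {
    private static final String NEEDS_ESCAPE = " #%^`{}";

    static String escape(String path) {
        StringBuilder out = new StringBuilder(path.length());
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (NEEDS_ESCAPE.indexOf(c) >= 0) {
                // All seven are ASCII, so one %HH per character is enough.
                out.append('%').append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("my file #1.txt"));
        // prints my%20file%20%231.txt
    }
}
```

Note that '%' itself is in the list, so a name that already contains a
literal percent sign is escaped to %25 rather than being mistaken for an
escape sequence.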

> > But we just keep butting into the same problem: non-ASCII characters.
> 
> Absolutely.
> 
> The non-ascii issue is a problem and my solution--which is probably the
> solution you'd want (I'm guessing)--also fits with Eric's guidelines.

Eric's guidelines didn't suggest anything about how to handle non-ASCII
characters.

> The solution is that the SMB URL draft accepts the restriction and stays 
> within the rules, but that jCIFS (and, I hope, other implementations) will 
> accept non-ASCII characters on the command line.  Escaped or not.  The SMB

Whoa. Hold on. If you claim that non-ASCII characters can be escaped,
how are they escaped exactly? They're out of the %HH range. How would
you represent the Klingon character U-000123E9?
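
For reference, one possible answer is the XML-spec approach mentioned
earlier: treat a run of %HH escapes as UTF-8 bytes. The sketch below
assumes that convention (nothing in the SMB URL draft is claimed to
mandate it) and uses the JDK's java.net.URLDecoder, which also turns '+'
into a space, so it is only an illustration. Utf8Unescape is a made-up
name.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper: interprets consecutive %HH escapes as UTF-8 bytes.
final class Utf8Unescape {
    static String unescape(String s) throws UnsupportedEncodingException {
        return URLDecoder.decode(s, "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        // F0 92 8F A9 is the four-byte UTF-8 form of code point 0x123E9
        String name = unescape("%F0%92%8F%A9");
        System.out.println(Integer.toHexString(name.codePointAt(0)));
        // prints 123e9
    }
}
```

Under that assumption a single character outside the 16 bit range costs
four %HH escapes.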

> protocol itself allows non-ASCII characters, so there really is no other 
> choice.  Client tools *must* accept non-ASCII characters.

Well they don't *have* to but an ASCII client would be very limited.

> 
> Note that non-ASCII characters are nicely outside of the "unwise", and
> "reserved" sets used by URLs.  That is, from the perspective of RFC2396,
> the whole non-ASCII range is its own set.  That's good.  It makes it easy,
> on the command line, to determine which characters belong to which sets.

I would be careful with the term "command line" because internal to
a program there is no concept of what encoding the text was entered
in. In Java the file.encoding property is used to read in data but you
get Unicode. In C the LC_CTYPE variable defines the locale and the
encoding that input is read in as, and in most cases it is probably
stored internally in the locale encoding or maybe as wchar_t, which
*may* be Unicode. So just to separate your peas and carrots a little,
do not even mention the "command line". If a client can negotiate
Unicode then there is little to discuss about what character sets can
be represented in an SMB URL. If the client is stuck in a particular
8 bit encoding like KOI8-R then it's just "an 8 bit encoding". So
there are no "codepage issues".
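
A small Java sketch of the point about input encodings: the same name
arriving as KOI8-R bytes or as UTF-8 bytes ends up as the identical
Unicode String once it has been decoded. DecodeOnInput is a made-up
class name, and KOI8-R is not one of the charsets every JVM is required
to ship, so the sketch checks for it first.

```java
import java.nio.charset.Charset;

// Hypothetical demo: once input is decoded, the program only sees Unicode.
final class DecodeOnInput {
    static String decode(byte[] raw, String charsetName) {
        return new String(raw, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // The same four Cyrillic letters in two different encodings
        byte[] utf8 = {(byte) 0xD1, (byte) 0x84, (byte) 0xD0, (byte) 0xB0,
                       (byte) 0xD0, (byte) 0xB9, (byte) 0xD0, (byte) 0xBB};
        byte[] koi8 = {(byte) 0xC6, (byte) 0xC1, (byte) 0xCA, (byte) 0xCC};

        String fromUtf8 = decode(utf8, "UTF-8");
        if (Charset.isSupported("KOI8-R")) {
            String fromKoi8 = decode(koi8, "KOI8-R");
            // Different bytes on the way in, same String afterwards
            System.out.println(fromUtf8.equals(fromKoi8));
        } else {
            System.out.println("KOI8-R is not available on this JVM");
        }
    }
}
```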

> 
> That being the case, I have no problem recommending that a user agent 
> accept non-ASCII characters.
> 
> > Well browsers do not understand the SMB URL yet. And when they do they
> > will have to deal with this problem too. If an application expects to
> > get escaped SMB URLs then it will have to deal with that but I don't
> > see how it can.
> 
> Um...  Konqueror, Galeon...  They support SMB URL.
> 
> I don't know what problem there is in accepting escaped SMB URLs.  You say 
> "I don't see how it can", but I don't see any problem in doing so.

Actually they probably can. But it demands a little coordination. If
a web browser is going to handle SMB URLs like any other URL, such as
within a link in a web page, and all URLs are ASCII only, do they make
an exception for the SMB URL and support Unicode URLs? Presumably they
could do something foolish like mandate all such URLs are UTF-8. That
*would* require escaping if the HTML page wasn't also UTF-8. They should
just inherit the client encoding or the encoding specified in the web
page or HTTP header. Technically their scenario is no different from any
other C client as I described previously, but I have to wonder if they
thought this through as far as we have.

> > > There's got to be some way 'round it...  Hmmm...
> > 
> > The *real* problem is that we don't know if the resource refers to a
> > directory or file and when you combine them with the two-argument
> > constructors the URL class cuts off the name if it doesn't have a
> > trailing slash:
> > 
> >   smb://server/share/dir/ + file.txt => smb://server/share/dir/file.txt
> >   smb://server/share/dir + file.txt => smb://server/share/file.txt
> > 
> > The 'dir' gets dropped. I don't think we're going to fix that. I think
> > this is built into the URL specs somewhere. Maybe you can come up with a
> > solution.
> 
> The solution is semantic, not syntactic, I'm afraid.  You'd have to 
> contact the server and ask it "what is this thing".  If it's a directory, 
> then add the slash.

Which means we would have to actually connect to the server and query
the resource just to parse it. Ha. No.
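
The trailing-slash behaviour quoted above can be reproduced with nothing
but the JDK. This sketch uses java.net.URI rather than java.net.URL,
because a plain JDK has no handler registered for the smb scheme;
URI.resolve follows the RFC 2396 relative-resolution rules, which is
where the dropped 'dir' comes from. ResolveDemo is a made-up name.

```java
import java.net.URI;

// Hypothetical demo of RFC 2396 relative resolution with java.net.URI.
final class ResolveDemo {
    static String resolve(String base, String name) {
        return URI.create(base).resolve(name).toString();
    }

    public static void main(String[] args) {
        System.out.println(resolve("smb://server/share/dir/", "file.txt"));
        // prints smb://server/share/dir/file.txt
        System.out.println(resolve("smb://server/share/dir", "file.txt"));
        // prints smb://server/share/file.txt
    }
}
```

Without the trailing slash the last path segment is treated as a file
name and replaced, which is purely syntactic; the parser never knows
whether 'dir' is really a directory.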

> > > I think we should just stick to Unicode and punt on the codepages.
> > 
> > There are NO CODEPAGE ISSUES in jCIFS. The jcifs.encoding property
> 
> I'm not saying that jCIFS has a problem with codepages.  I have tried to
> be very, very clear that this is an SMB protocol problem.  A major goal of
> this discussion is to find good ways to handle the SMB protocol problem in
> jCIFS (and other implementations).

What "SMB protocol problem"? I don't understand. I thought we were
talking about SMB URLs and how to handle non-ASCII characters?

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived.