[jcifs] Creating file with hash ('#') in filename

Christopher R. Hertel crh at ubiqx.mn.org
Sat Jan 18 10:19:56 EST 2003


On Fri, Jan 17, 2003 at 05:33:55PM -0500, Michael B. Allen wrote:
> > > No. You cannot have a Cyrillic filename on a web server. You can have a
> > > Cyrillic *link* displayed in the page but the filenames and all parts of
> > > the URL are ASCII. There might be extensions to this. I don't know. But
> > > URLs are 100% good ol' ASCII.
> > 
> > Then you cannot have a cyrillic filename in the SMB URL.  It's the *same
> > problem*.  ...but there is a solution.
> 
> Sure you can. Cyrillic, like Unicode, is a character set. It's not an
> encoding. KOI8-R is a Cyrillic encoding:
> 
>   http://czyborra.com/charsets/cyrillic.html
> 
> But this is nomenclature. You cannot have an HTTP URL in any encoding
> other than ASCII. But you can have an SMB URL encoded in any encoding
> because we are accepting Unicode and it is the superset of all character
> sets.

Urg.  No.  That's not the point.

The HTTP URL and SMB URL are both subject to the same rules.  The general
form of the URL (as defined in RFC2396) does not allow non-ASCII
characters.  The fact that web servers and clients enforce this behavior 
is not at all my point.  At this level, I'm talking about the requirements 
(per specification) of URLs.

If you accept that "You cannot have an HTTP URL in any encoding other than
ASCII" then you are stating the same for all URL types because the rules 
that establish the restriction are rules for *all* URL types.

I am suggesting that an implementation, such as jCIFS, may safely break 
this rule.
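
Just to be concrete about the rule being broken: per RFC 2396, every
character in a URL must be ASCII, and anything outside the allowed sets
must arrive as a %HH escape.  A rough sketch in Java (my own throwaway
code, not anything in jCIFS) of just the ASCII boundary:

  static boolean isLegalUrlText(String url) {
      for (int i = 0; i < url.length(); i++) {
          char c = url.charAt(i);
          if (c > 0x7F)
              return false;            // non-ASCII is never legal unescaped
          if (c == '%') {              // an escape is '%' plus two hex digits
              if (i + 2 >= url.length()
                      || Character.digit(url.charAt(i + 1), 16) < 0
                      || Character.digit(url.charAt(i + 2), 16) < 0)
                  return false;
              i += 2;
          }
      }
      return true;
  }

(That ignores the "unwise" and "reserved" sets entirely; it checks only
the ASCII restriction that is at issue here.)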

> > The URL format itself disallows (unescaped) non-ASCII characters.  If a
> > web server enforces that rule then the client is SOL even if it does allow
> > the user to enter Cyrillic (or other) characters.
> > 
> > SMB servers, however, are not the same as web servers.  What I am saying 
> > is that jCIFS should allow the use of non-ASCII characters in the SMB URL.
> > These should be handled as Unicode.
> 
> Ok. Well that's what we do now. I'm not convinced it is free of problems
> but there really is no way to escape them reasonably. If you recall the
> section from the XML spec I cited they convert them to UTF-8 and escape
> each byte. But we cannot do that because it would result in the most
> ridiculous URLs the world has ever seen. Cyrillic characters might take
> two or three %HH escapes per character.

Right.  ...but if the terminal or shell that the user is using is set for 
Cyrillic then they can just enter the characters.  Hopefully the JVM will 
interpret them correctly (which, from what you said, is easy).
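
For what it's worth, the XML-spec style of escaping you cited would look
roughly like this in Java (a sketch, not proposed jCIFS code): convert
to UTF-8 and emit %HH for every byte outside the safe range.

  static String utf8Escape(String s) throws java.io.UnsupportedEncodingException {
      byte[] bytes = s.getBytes("UTF-8");
      StringBuffer out = new StringBuffer();
      for (int i = 0; i < bytes.length; i++) {
          int b = bytes[i] & 0xFF;
          if (b >= 0x21 && b <= 0x7E && " #%^`{}".indexOf((char) b) < 0) {
              out.append((char) b);      // safe ASCII passes through
          } else {                       // everything else: one %HH per byte
              String hex = Integer.toHexString(b).toUpperCase();
              out.append('%');
              if (hex.length() == 1) out.append('0');
              out.append(hex);
          }
      }
      return out.toString();
  }

So utf8Escape("\u0434\u0430") -- "da" in Cyrillic -- comes out as
"%D0%B4%D0%B0": four escapes for two characters, which is exactly the
ugliness you're describing.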

> > > >       space: ' '
> > > >        hash: '#'
> > > >     percent: '%'
> > > >   back-tick: '`'
> > > >       caret: '^'
> > > >      lbrace: '{'
> > > >      rbrace: '}'
> > > 
> > > Interesting. I was wondering what the precise list was.
> > 
> > I *think* that's the list.  The SNIA doc is not reliable on this point
> > (yes, I know...).  Still, if the above list is wrong it's not wrong by
> > much.  :)
> 
> Ok, well I think it is important that we establish the precise list of
> characters that would need to be escaped. I trust the leach v1-spec-02
> doc. Provided we factor out the '.' because we are not concerned with
> 8.3 filename constraints and subtract any reserved based on your list
> from RFC 2396 the characters that would need to be escaped are:
> 
>   ' ' | '#' | '%' | '^' | '`' | '{' | '}'
> 
> So yes, it's exactly the same list.

I think that the list in the Leach/Naik draft is the same as the SNIA
list, but I haven't checked.  The SNIA list included the space (in the
text of the document, though it wasn't in the actual given list), but for
the wrong reasons.
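
If that list holds, the escaping itself is trivial.  Another sketch
(again, not actual jCIFS code); all seven characters are below 0x80, so
each is exactly one %HH:

  static String escapeSmbPath(String path) {
      final String unsafe = " #%^`{}";  // the seven characters above
      StringBuffer out = new StringBuffer();
      for (int i = 0; i < path.length(); i++) {
          char c = path.charAt(i);
          if (unsafe.indexOf(c) >= 0)
              out.append('%').append(Integer.toHexString(c).toUpperCase());
          else
              out.append(c);
      }
      return out.toString();
  }

...which would turn the filename from the subject line, say "my#1.txt",
into "my%231.txt".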

> > > But we just keep butting into the same problem: non-ASCII characters.
> > 
> > Absolutely.
> > 
> > The non-ASCII issue is a problem and my solution--which is probably the
> > solution you'd want (I'm guessing)--also fits with Eric's guidelines.
> 
> Eric's guidelines didn't suggest anything about how to handle non-ASCII
> characters.

Not specifically, but the principles apply.

> > The solution is that the SMB URL draft accepts the restriction and stays 
> > within the rules, but that jCIFS (and, I hope, other implementations) will 
> > accept non-ASCII characters on the command line.  Escaped or not.  The SMB
> 
> Whooh. Hold on. If you claim that non-ASCII characters can be escaped,
> how are they escaped exactly? They're out of the %HH range. How would
> you represent the Klingon character U-000123E9?

I don't know, because I don't know Unicode well enough.  The problem, of 
course, is that it depends upon the encoding scheme (it's all Unicode, but 
UTF-8 vs. UCS2LE vs. ... ).  The escape sequences are not *necessarily* 
going to be in the same encoding that the terminal input is in.

Let's say I'm here in the US with my terminal set to use Latin1.  I can 
input a variety of EU characters and such.  My terminal, however, can't 
represent kanji in its current settings.  This new problem (which you 
correctly bring up) is that I now need to enter escapes in order to 
connect to a server offering files with kanji names.  Ouch.  Which 
encoding do I use?  UTF-8?  UCS2LE?

I don't have an answer, but it's a good question.
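
Just to show the scale of the problem, here is what that hypothetical
U-000123E9 would look like under a UTF-8 escaping convention.  A Java
sketch (Java strings are UTF-16, so the code point becomes a surrogate
pair first; I'm assuming the JVM's UTF-8 converter handles the pair
properly).  Note that plain UCS-2 couldn't represent it at all:

  public static void main(String[] args) throws Exception {
      int cp = 0x123E9;                  // outside the BMP
      char hi = (char) (0xD800 + ((cp - 0x10000) >> 10));
      char lo = (char) (0xDC00 + ((cp - 0x10000) & 0x3FF));
      String s = new String(new char[] { hi, lo });
      byte[] utf8 = s.getBytes("UTF-8"); // four bytes: F0 92 8F A9
      StringBuffer esc = new StringBuffer();
      for (int i = 0; i < utf8.length; i++)
          esc.append('%').append(Integer.toHexString(utf8[i] & 0xFF).toUpperCase());
      System.out.println(esc);           // prints %F0%92%8F%A9
  }

Four escapes for one character, and a *different* four if the convention
were UTF-16 instead.  Hence the question.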

> > protocol itself allows non-ASCII characters, so there really is no other 
> > choice.  Client tools *must* accept non-ASCII characters.
> 
> Well they don't *have* to but an ASCII client would be very limited.

Yes.  That's right...

> > Note that non-ASCII characters are nicely outside of the "unwise", and
> > "reserved" sets used by URLs.  That is, from the perspective of RFC2396,
> > the whole non-ASCII range is its own set.  That's good.  It makes it easy,
> > on the command line, to determine which characters belong to which sets.
>
> I would be careful with the term "command line" because internal to
> a program there is no concept of what encoding the text was entered
> as. In Java the file.encoding property is used to read in data but you
> get Unicode.

So Java converts natively to a common standard.  That's good (for Java).

> In C the LC_CTYPE variable defines the locale and the encoding that
> input is read in as, and in most cases it is probably stored internally
> in the locale encoding, or maybe as wchar_t, which *may* be Unicode. So,
> just to separate your peas and carrots a little, do not even mention the
> "command line". If a client can negotiate Unicode then there is little to
> discuss about what character sets can be represented in an SMB URL. If
> the client is stuck in a particular 8-bit encoding like KOI8-R then it's
> just "an 8-bit encoding". So there are no "codepage issues".

Well, again, I see it as a "codepage issue".  The application gets a 
string of octet values.  If the source "codepage" is known then the 
application has half a chance to do what Java does and convert to a common 
standard.

Unicode, though much more comprehensive than the others, is also a mapping
between integer values and character representations.  Knowing that the 
values are in Unicode (and knowing which encoding scheme is being used) 
lets you convert to other "codepages", assuming that you know the 
relationship between them.
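
In Java terms the conversion is one line *if* you know the source
encoding.  A sketch (assuming the JVM ships a KOI8-R converter):

  public static void main(String[] args) throws Exception {
      // Octets as they might arrive from a KOI8-R terminal: "da" in Cyrillic.
      byte[] raw = { (byte) 0xC4, (byte) 0xC1 };
      String unicode = new String(raw, "KOI8-R"); // decode; now Unicode inside
      byte[] utf8 = unicode.getBytes("UTF-8");    // re-encode any way you like
      System.out.println(unicode.length() + " chars, " + utf8.length + " UTF-8 bytes");
  }

Without that "KOI8-R" the raw octets are just numbers, which is the
whole codepage problem in miniature.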

> > I don't know what problem there is in accepting escaped SMB URLs.  You say 
> > "I don't see how it can", but I don't see any problem in doing so.
> 
> Actually they probably can. But it demands a little coordination. If
> a web browser is going to handle SMB URLs like any other URL, such as
> in a link within a web page, and all URLs are ASCII-only, do they make
> an exception for the SMB URL and support Unicode URLs?

Right.  The problem is universal.  "How should Unicode be handled in 
URLs?".  A very broad question...

> Presumably they
> could do something foolish like mandate that all such URLs are UTF-8. That
> *would* require escaping if the HTML page wasn't also UTF-8. They should
> just inherit the client encoding, or the encoding specified in the web page
> or HTTP header. Technically their scenario is no different from that of any
> other C client, as I described previously, but I have to wonder if they
> thought this through as far as we have.

The problem with inheriting the client or web-page encoding is the same 
one you brought up above: escaping.  The escape sequences are different if 
the encoding is UTF-8 vs. UCS2LE vs. whatever.
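
To put numbers on it, here is the same character escaped under the two
schemes (a sketch; "UTF-16LE" is the Java 1.4 charset name):

  static String escapeAll(String s, String enc) throws Exception {
      byte[] b = s.getBytes(enc);
      StringBuffer out = new StringBuffer();
      for (int i = 0; i < b.length; i++) {
          String hex = Integer.toHexString(b[i] & 0xFF).toUpperCase();
          out.append('%');
          if (hex.length() == 1) out.append('0');
          out.append(hex);
      }
      return out.toString();
  }

  // escapeAll("\u0414", "UTF-8")    returns "%D0%94"
  // escapeAll("\u0414", "UTF-16LE") returns "%14%04"

Same character, completely different escapes; whoever un-escapes had
better know which convention was used.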

I imagine that things like the X Windows clipboard could handle conversion 
of Unicode encoding schemes.  If my browser is using one encoding, and I 
cut a URL from a web page and move it to a shell window that's using a 
different encoding, the clipboard would need to know the source and 
destination encoding and perform the magic.



[Change of topic.]

> > The solution is semantic, not syntactic, I'm afraid.  You'd have to 
> > contact the server and ask it "what is this thing".  If it's a directory, 
> > then add the slash.
> 
> Which means we would have to actually connect to the server and query
> the resource just to parse it. Ha. No.

We do that anyway, in other areas.  We contact servers to discover the 
semantic differences between an NBT name and a DNS name.  We contact 
servers to discover NBT vs. naked TCP (port 445) transport.  We do that a 
lot already.  That's a problem with SMB.

In this case, however, I strongly suspect that the semantic differences 
between a directory and a file won't matter *until after* the server is 
contacted anyway.  Something like:

  smb://server/share/something

can probably remain ambiguous until you've talked to the server about it.
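
I believe that is roughly what jCIFS already does.  Something like this
sketch (using the current SmbFile class; parsing the URL makes no
connection, only the query does):

  import jcifs.smb.SmbFile;

  public class WhatIsIt {
      public static void main(String[] args) throws Exception {
          SmbFile f = new SmbFile("smb://server/share/something");
          // Only this call actually talks to the server and settles it.
          System.out.println(f.isDirectory() ? "directory" : "file");
      }
  }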


[Another topic change.]

> > > > I think we should just stick to Unicode and punt on the codepages.
> > > 
> > > There are NO CODEPAGE ISSUES in jCIFS. The jcifs.encoding property
> > 
> > I'm not saying that jCIFS has a problem with codepages.  I have tried to
> > be very, very clear that this is an SMB protocol problem.  A major goal of
> > this discussion is to find good ways to handle the SMB protocol problem in
> > jCIFS (and other implementations).
> 
> What "SMB protocol problem"? I don't understand. I thought we were
> talking about SMB URLs and how to handle non-ASCII characters?

No, we've both been overloading the discussion.

In this case, I was talking about the basic problem with DOS OEM
codepages, which is that there is *nothing* in the protocol that lets you
negotiate which one is in use.  The protocol *does* let you negotiate
Unicode, and if you're using Unicode then you're okay.  If you are not
using Unicode, then there is no way--via the protocol--for server and
client to know which DOS OEM codepage is in use on the other.  That's an
SMB protocol problem.
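
For reference, the one encoding fact the protocol *does* hand you is the
CAP_UNICODE capability bit (0x0004) in the Negotiate Protocol response.
A sketch of the client logic that results (the codepage-850 fallback is
a pure guess, which is exactly the problem):

  static final int CAP_UNICODE = 0x00000004;

  static String pickEncoding(int serverCapabilities) {
      if ((serverCapabilities & CAP_UNICODE) != 0)
          return "UTF-16LE";  // negotiated: strings go over the wire as UCS-2LE
      return "Cp850";         // no negotiation; just a guess at the OEM codepage
  }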

Chris -)-----

-- 
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org


