[jcifs] Creating file with hash ('#') in filename

Mon Jan 20 09:14:25 EST 2003

> -----Original Message-----
> From:	Christopher R. Hertel [SMTP:crh at ubiqx.mn.org]
> Sent:	Friday, January 17, 2003 6:20 PM
> To:	Michael B. Allen
> Cc:	jcifs at samba.org
> Subject:	Re: [jcifs] Creating file with hash ('#') in filename
> 
> On Fri, Jan 17, 2003 at 05:33:55PM -0500, Michael B. Allen wrote:
> :
> > > > No. You cannot have a Cyrillic filename on a web server. You can have a
> > > > Cyrillic *link* displayed in the page but the filenames and all parts of
> > > > the URL are ASCII. There might be extensions to this. I don't know. But
> > > > URLs are 100% good ol' ASCII.
> > > 
> > > Then you cannot have a cyrillic filename in the SMB URL.  It's the *same
> > > problem*.  ...but there is a solution.
> > 
> > Sure you can. Cyrillic, like Unicode, is a character set. It's not an
> > encoding. KOI8-R is a Cyrillic encoding:
> > 
> >   http://czyborra.com/charsets/cyrillic.html
> > 
> > But this is nomenclature. You cannot have an HTTP URL in any encoding
> > other than ASCII. But you can have an SMB URL encoded in any encoding
> > because we are accepting Unicode and it is the superset of all character
> > sets.
> 
> Urg.  No.  That's not the point.
> 
	Ok, well I'm not really sure what you're trying to get at. But I think we're both
	being a little pedantic in this thread.

	Let's review what we agree on. The characters that are required to be escaped
	in the SMB URL for RFC2396 conformance are:

	  ' |#%^`{}'

	and non-ASCII characters. However, because SMB path names support
	Unicode, it is not clear how these characters would be escaped. If each
	character were converted to a UTF-8 multibyte sequence and each byte in turn
	were escaped, the number and appearance of the escapes would make such URLs
	unreasonable to type or read, and for many scripts (e.g. Cyrillic) they would
	be pathologically unusable.

	That's the problem. Right?
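
	To make the blow-up concrete, here is a minimal Python sketch of the
	"UTF-8, then escape each byte" approach described above. The name
	escape_component is hypothetical (it is not part of jCIFS), and the sketch
	only covers the characters listed in this message plus non-ASCII, not
	everything RFC 2396 excludes.

```python
# Characters this message lists as requiring escapes (assumption: nothing else,
# e.g. ASCII control characters, is handled in this sketch).
MUST_ESCAPE = set(" |#%^`{}")


def escape_component(name: str) -> str:
    """Escape one path component: listed ASCII characters and all non-ASCII
    characters become %HH escapes of their UTF-8 bytes."""
    out = []
    for ch in name:
        if ch in MUST_ESCAPE or ord(ch) > 0x7F:
            # One %HH triplet per UTF-8 byte of the character.
            out.extend("%{:02X}".format(b) for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


# A hash in the name costs one escape.
print(escape_component("a#b.txt"))  # a%23b.txt

# A four-letter Cyrillic name (U+0444 U+0430 U+0439 U+043B) plus ".txt".
cyrillic = "\u0444\u0430\u0439\u043b.txt"
escaped = escape_component(cyrillic)
print(escaped)                      # %D1%84%D0%B0%D0%B9%D0%BB.txt
print(len(cyrillic), len(escaped))  # 8 28
```

	Each Cyrillic letter takes two UTF-8 bytes, so every letter turns into six
	ASCII characters that nobody can read.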

> I am suggesting that an implementation, such as jCIFS, may safely break 
> this rule.
> 
	At the moment we do not have much choice. CIFS is a Unicode protocol. We
	MUST provide a way to escape the escaping. However, you do realise that URLs
	with Unicode characters cannot be embedded in web pages and similar places
	one might find them, because software there may assume that all characters
	in any URL are ASCII? I think we need to investigate the state of escaping
	Unicode in URLs. Certainly it has been discussed and implemented in one form
	or another. Is there a standard for it?

> represent kanji in its current settings.  This new problem (which you 
> correctly bring up) is that I now need to enter escapes in order to 
> connect to a server offering files with kanji names.  Ouch.  Which 
> encoding do I use?  UTF-8?  UCS2LE?
> 
> I don't have an answer, but it's a good question.
> 
	Well, we're not really concerned with an "encoding" because we know what the
	encoding is going to be: ASCII. The question is more like, how do you represent
	a value that can be between 0x80 and 0x10FFFF in a sequence of ASCII
	characters? But it's not like we can just make something up.
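
	To put numbers on that range, here is a sketch of one candidate mapping:
	UTF-8 followed by %HH escapes. This is only an illustration of what such a
	representation would look like, not something settled in this thread. It
	uses Python's urllib.parse.quote and unquote; to_ascii and from_ascii are
	hypothetical names.

```python
from urllib.parse import quote, unquote


def to_ascii(cp: int) -> str:
    """Represent one code point as ASCII: UTF-8 bytes, each written as %HH."""
    return quote(chr(cp), safe="")


def from_ascii(text: str) -> int:
    """Inverse of to_ascii for a single escaped code point."""
    return ord(unquote(text))


# Boundary values of the 2-, 3- and 4-byte UTF-8 ranges
# (surrogates U+D800..U+DFFF are not valid code points to encode and are skipped).
for cp in (0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    escaped = to_ascii(cp)
    assert from_ascii(escaped) == cp  # the mapping is reversible
    print("U+{:04X} -> {} ({} ASCII chars)".format(cp, escaped, len(escaped)))
```

	So a single character costs 6, 9 or 12 ASCII characters depending on where
	it falls in the range.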