Japanese SJIS regularization problem

Hiroshi MIURA miura at blue.gr.jp
Thu Feb 10 08:00:50 GMT 2000


Hello,

I ran into some problems and made a patch.
We, the technical team of the Samba user community in Japan,
are now testing this patch for portability.

The patch is attached. It is for samba-2.0.5.

The problems are described below.

=One problem

For historical reasons, we have a problem with SJIS code regularization.
What is it? Long ago, each computer maker in Japan defined
its own extensions of the KANJI codes, in addition to JIS (Japanese
Industrial Standard). Shift JIS (MS-Kanji) is no exception.

The important extensions are
'NEC kanji', 'NEC selected IBM extension kanji code', and
'IBM extension kanji code'.

Because MSKK (Microsoft Japan) adopted these three extensions in
Windows 3.0J, they are all part of the Windows code set.
But they contain duplicated codes: the same glyph with different codes.

MS NT4 (Japanese) unifies these codes one way, but
Windows 98 and newer MS OSes unify them another way :-(

For example, 0x8754 in SJIS is the Roman numeral 'one', which looks like 'I'.
NT4 uses this code, 0x8754, but Windows 98 uses 0xfa4a.
These two codes have the same appearance, 'I'.

e.g.

1) I create the file 'I' (code 0x8754) on the Samba file server
   using an NT4 workstation.
2) I want to open the file 'I' on Windows 98.
3) Windows 98 unifies it to code 0xfa4a.
4) Windows 98 asks Samba to open a file named '0xfa4a'.
5) Samba doesn't have it. Samba has 'I' as 0x8754.
6) As a result, it fails.
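
The failure in steps 4) and 5) can be sketched in C as follows. This is
only an illustration, not part of the patch: unify_roman(), same_raw()
and same_unified() are hypothetical names, and only the capital Roman
numeral range from the example is handled.

```c
#include <string.h>

/* Hypothetical helper: move the capital Roman numerals 0x8754-0x875d
   onto 0xfa4a-0xfa53. Every other code passes through unchanged. */
unsigned int unify_roman(unsigned int code)
{
    if (code >= 0x8754 && code <= 0x875d)
        return code - 0x8754 + 0xfa4a;
    return code;
}

/* Byte-wise comparison of one double-byte character, which is what a
   plain file name lookup amounts to. */
int same_raw(const unsigned char *a, const unsigned char *b)
{
    return memcmp(a, b, 2) == 0;
}

/* Comparison after both sides have been unified to the same code. */
int same_unified(const unsigned char *a, const unsigned char *b)
{
    unsigned int ca = unify_roman(((unsigned int) a[0] << 8) | a[1]);
    unsigned int cb = unify_roman(((unsigned int) b[0] << 8) | b[1]);
    return ca == cb;
}
```

With the NT4 name bytes 0x87 0x54 and the Windows 98 name bytes
0xfa 0x4a, same_raw() reports a mismatch, while same_unified() reports
a match.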


=Another problem 

There are SJIS codes that we cannot map to EUC.
These codes are among the extensions described above. This patch
maps them into the defined code area.
But this rule is different from MS's rule.
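
To see why such codes have no place in EUC: EUC carries the 94 rows of
JIS X 0208, and the row number follows from the SJIS lead byte. A
minimal sketch, assuming the usual SJIS layout (sjis_row() and
fits_in_euc() are hypothetical helper names, not Samba functions):

```c
/* Row (ku) of a double-byte SJIS code, computed as if the JIS X 0208
   row numbering simply continued past row 94.
   Assumes hi is a valid lead byte (0x81-0x9f or 0xe0-0xfc). */
int sjis_row(int hi, int lo)
{
    int base = (hi <= 0x9f) ? (hi - 0x81) * 2 + 1
                            : (hi - 0xc1) * 2 + 1;
    /* Each lead byte covers two rows; trail bytes from 0x9f upward
       belong to the second one. */
    return (lo >= 0x9f) ? base + 1 : base;
}

/* EUC has room for rows 1 to 94 only. */
int fits_in_euc(int hi, int lo)
{
    return sjis_row(hi, lo) <= 94;
}
```

Under these assumptions 0x8754 falls in row 13 and fits, while 0xfa4a
would fall in row 115, far outside the 94 rows, so it has to be moved
before it can be converted.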

=Solution 
  
When coding system = CAP, HEX or SJIS, we unify these codes in the way
MS recommends.

When coding system = EUC or JIS, we unify some of the codes in our
own way.
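
The two directions can be sketched with one range table each. This is a
simplified illustration using only a few ranges copied from the tables
in the attached patch. The names range_t, remap(), unify_ms() and
unify_euc() are hypothetical and do not appear in the patch.

```c
#include <stddef.h>

/* One contiguous range: codes start..end are moved so that
   start lands on shift. */
typedef struct {
    int start;
    int end;
    int shift;
} range_t;

/* Excerpt only: toward the 0xfaXX codes (for CAP, HEX and SJIS). */
static const range_t to_ms[] = {
    {0x8754, 0x875d, 0xfa4a},   /* Roman numerals (capital) */
    {0xeeef, 0xeef8, 0xfa40},   /* Roman numerals (small) */
};

/* Excerpt only: away from the 0xfaXX codes (for EUC and JIS). */
static const range_t to_euc_safe[] = {
    {0xfa40, 0xfa49, 0xeeef},
    {0xfa4a, 0xfa53, 0x8754},
};

/* Look the code up in a table; codes outside every range are
   returned unchanged. */
int remap(const range_t *table, size_t n, int code)
{
    size_t i;

    for (i = 0; i < n; i++) {
        if (code >= table[i].start && code <= table[i].end)
            return code - table[i].start + table[i].shift;
    }
    return code;
}

int unify_ms(int code)
{
    return remap(to_ms, sizeof(to_ms) / sizeof(to_ms[0]), code);
}

int unify_euc(int code)
{
    return remap(to_euc_safe,
                 sizeof(to_euc_safe) / sizeof(to_euc_safe[0]), code);
}
```

For the ranges shown, the two tables are inverses of each other, so a
code that goes through unify_ms() and then unify_euc() comes back
unchanged.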

Thanks,

MIURA, Hiroshi

-------------- next part --------------
--- samba-2.0.5a-JP2/source/include/kanji.h.orig	Fri Nov 13 04:40:33 1998
+++ samba-2.0.5a-JP2/source/include/kanji.h	Thu Feb 10 13:43:02 2000
@@ -30,7 +30,8 @@
 /* FOR SHIFT JIS CODE */
 #define is_shift_jis(c) \
     ((0x81 <= ((unsigned char) (c)) && ((unsigned char) (c)) <= 0x9f) \
-     || (0xe0 <= ((unsigned char) (c)) && ((unsigned char) (c)) <= 0xef))
+     || (0xe0 <= ((unsigned char) (c)) && ((unsigned char) (c)) <= 0xef) \
+     || (0xfa <= ((unsigned char) (c)) && ((unsigned char) (c)) <=0xfc))
 #define is_shift_jis2(c) \
     (0x40 <= ((unsigned char) (c)) && ((unsigned char) (c)) <= 0xfc \
     && ((unsigned char) (c)) != 0x7f)
--- samba-2.0.5a-JP2/source/lib/kanji.c.orig	Wed Nov 18 05:13:48 1998
+++ samba-2.0.5a-JP2/source/lib/kanji.c	Thu Feb 10 14:41:30 2000
@@ -22,6 +22,8 @@
      and extend coding system to EUC/SJIS/JIS/HEX at 1994.10.11
      and add all jis codes sequence type at 1995.8.16
      Notes: Hexadecimal code by <ohki at gssm.otuka.tsukuba.ac.jp>
+
+   Added Japanese Shift JIS regularization by <miura at blue.gr.jp> 2000.2.2
 */
 
 #define _KANJI_C_
@@ -379,6 +381,191 @@
 /* convesion buffer */
 static char cvtbuf[1024];
 
+
+/*******************************************************************
+   SJIS <-> regular-SJIS
+
+  on ShiftJIS
+	Before regularization   After regularization   note
+  --------------------------------------------------------------------
+ 	0x8470 - 0x8491	-> 0x8440 - 0x8460	Russian char(*1)(*2)
+	0x8754 - 0x875d	-> 0xfa4a - 0xfa53      Roman numerals (capital)
+	0x8782		-> 0xfa59		No.
+	0x8784		-> 0xfa5a		TEL
+	0x878a		-> 0xfa58		Symbol meaning 'Co.'
+	0x8790		-> 0x81e0		Approximately-equal sign
+	0x8791		-> 0x81df		Symbol meaning 'identical to'
+	0x8792		-> 0x81e7		Integration symbol 
+	0x8795		-> 0x81e3		Square root symbol
+	0x8796		-> 0x81db		Perpendicular symbol
+	0x8797		-> 0x81da		Angle symbol
+	0x879a		-> 0x81e6		Reason symbol
+	0x879b		-> 0x81bf		Intersection (product set) symbol
+	0x879c		-> 0x81be		Union (sum set) symbol
+	0xed40 - 0xeeec	-> 0xfa5c - 0xfc4b	Extended KANJI(*1)
+	0xeeef - 0xeef8	-> 0xfa40 - 0xfa49	Roman numerals (small)
+	0xeef9		-> 0x81ca		Boolean negation symbol
+	0xeefa		-> 0xfa55		Pipe symbol
+	0xeefb		-> 0xfa56		Single quote
+	0xeefc		-> 0xfa57		Double quote
+	0xfa54		-> 0x81ca		Boolean negation symbol
+	0xfa5b		-> 0x81e6		Reason symbol
+
+	*1 Note that the Shift JIS code area is not contiguous
+           (trail bytes below 0x40, the byte 0x7f, etc. are not used).
+	*2 Only the SFN is regularized. The LFN is the Unicode
+           corresponding to the Shift JIS code before regularization.
+---
+
+  The glyph is the same before and after regularization. Most of these
+are so-called "machine-dependent characters". Please check the actual
+glyphs in a Microsoft Windows 95/98 environment.
+
+  If you build a table in a program, pay attention to *1.
+There are gaps not only before regularization but also after it.
+For example, the Russian characters split into 2 ranges, and
+the Extended KANJI codes split into 8 ranges.
+Needless to say, these ranges are contiguous in JIS X 0208 KUTEN codes.
+
+  The purpose seems to be to unify the redundancy between the IBM
+extension and NEC extension character sets. But there are three
+"Reason symbol" codes, which is mysterious...
+
+  There is no rule behind the unification.
+Thinking about the reasons would only confuse us...
+Let's accept it without understanding...
+
+                                    Shirai <shirai at nintendo.co.jp> 
+                     translated by MIURA,Hiroshi<miura at blue.gr.jp>
+********************************************************************/
+/* This table comes from 'FDclone', written by <shirai at nintendo.co.jp>.
+ * It maps SJIS so that it can be converted to JIS and EUC (codes over 0xfa00).
+ */
+
+typedef struct _sjis_regur_t {
+	int start;
+	int end;
+	int shift;
+} sjis_regur_t;
+
+#define CONVSJIS(w)    	((0x8470 <= ((int) (w)) && ((int) (w)) <= 0x879c)||\
+			 (0xed40 <= ((int) (w)) && ((int) (w)) <= 0xeefc)||\
+			 (((int) (w)) == 0xfa54) || (((int) (w)) == 0xfa5b))
+#define CONVSJISC(c)	((0x84 <= (unsigned char) (c) && ((unsigned char) (c)) <= 0x87)||\
+			 (0xed <= ((unsigned char) (c)) && ((unsigned char) (c)) <= 0xee)||\
+			 ((unsigned char) (c) == 0xfa))
+
+static sjis_regur_t rsjistable[] = {
+	{0x8470, 0x847e, 0x8440},	/* strange Russian char */
+	{0x8480, 0x8491, 0x844f},	/* Why were they converted? */
+	{0x8754, 0x875d, 0xfa4a},
+	{0x8782, 0x8782, 0xfa59},
+	{0x8784, 0x8784, 0xfa5a},
+	{0x878a, 0x878a, 0xfa58},
+	{0x8790, 0x8790, 0x81e0},
+	{0x8791, 0x8791, 0x81df},
+	{0x8792, 0x8792, 0x81e7},
+	{0x8795, 0x8795, 0x81e3},
+	{0x8796, 0x8796, 0x81db},
+	{0x8797, 0x8797, 0x81da},
+	{0x879a, 0x879a, 0x81e6},
+	{0x879b, 0x879b, 0x81bf},
+	{0x879c, 0x879c, 0x81be},
+	{0xed40, 0xed62, 0xfa5c},
+	{0xed63, 0xed7e, 0xfa80},
+	{0xed80, 0xede0, 0xfa9c},
+	{0xede1, 0xedfc, 0xfb40},
+	{0xee40, 0xee62, 0xfb5c},
+	{0xee63, 0xee7e, 0xfb80},
+	{0xee80, 0xeee0, 0xfb9c},
+	{0xeee1, 0xeeec, 0xfc40},
+	{0xeeef, 0xeef8, 0xfa40},
+	{0xeef9, 0xeef9, 0x81ca},
+	{0xeefa, 0xeefc, 0xfa55},
+	{0xfa54, 0xfa54, 0x81ca},
+	{0xfa5b, 0xfa5b, 0x81e6}
+};
+
+#define	RJISTBLSIZ	(sizeof(rsjistable) / sizeof(sjis_regur_t))
+
+static int sjis2rjis (int hi, int lo)
+{
+  int i, w;
+
+  w = (int)((hi << 8) | lo);
+  if (CONVSJIS(w)) {
+      for (i=0; i < RJISTBLSIZ; i++) {
+	  if (w >= rsjistable[i].start &&  w <= rsjistable[i].end) {
+	      w -= rsjistable[i].start;
+	      w += rsjistable[i].shift;
+	      break;
+	  }
+      }
+  }
+  return w;
+}
+
+/********************************************************************
+ * SJIS -> SJIS (convertible to JIS and EUC; ad hoc and dirty)
+ ********************************************************************/
+#define EXTSJISC(c)   (0xf0 <= ((unsigned char)(c))) 
+
+/*  When converting to EUC and JIS, there is no room for
+ *  codes whose hi byte is 0xf0 or larger.
+ *  
+ *  So we must drop them or convert them to harmless codes.
+ *  This is not a standard way, but it is ad hoc and practical.
+ */
+
+static sjis_regur_t sjisconvtable[] = {
+{0xfa40, 0xfa49, 0xeeef},
+{0xfa4a, 0xfa53, 0x8754},
+{0xfa54, 0xfa54, 0x81ca},
+{0xfa55, 0xfa57, 0xeefa},
+{0xfa58, 0xfa58, 0x878a},
+{0xfa59, 0xfa59, 0x8782},
+{0xfa5a, 0xfa5a, 0x8784},
+{0xfa5b, 0xfa5b, 0x81e6},
+{0xfa5c, 0xfa7e, 0xed40},
+{0xfa80, 0xfa9b, 0xed63},
+{0xfa9c, 0xfafc, 0xed80},
+{0xfb40, 0xfb5b, 0xede1},
+{0xfb5c, 0xfb7e, 0xee40},
+{0xfb80, 0xfb9b, 0xee63},
+{0xfb9c, 0xfbfc, 0xee80},
+{0xfc40, 0xfc4b, 0xeee1}
+};
+#define	SJISCONVTBLSIZ	(sizeof(sjisconvtable) / sizeof(sjis_regur_t))
+
+/*******************************************************************
+ sj to sj-regularize (Only in Charset 932, ad hoc)
+********************************************************************/
+static char *sj_to_sj(char *from, BOOL overwrite)
+{
+  char *out;
+  char *save;
+  int  code;
+
+  save = (char *) from;
+  for (out = cvtbuf; *from; ) {
+      if (is_shift_jis (*from) && is_shift_jis2 (from[1])) {
+	   code = sjis2rjis ((int) from[0] & 0xff, (int) from[1] & 0xff);
+	   *out++ = (code >> 8) & 0xff;
+	   *out++ = code;
+	   from += 2;
+    } else {
+      *out++ = *from++;
+    }
+  }
+  *out = 0;
+  if (overwrite) {
+    pstrcpy((char *) save, (char *) cvtbuf);
+    return (char *) save;
+  } else {
+    return cvtbuf;
+  }
+}
+
 /*******************************************************************
   EUC <-> SJIS
 ********************************************************************/
@@ -394,6 +581,21 @@
 
 static int sjis2euc (int hi, int lo)
 {
+  int i,w;
+
+  w = (int)((hi << 8) | lo);
+  if (EXTSJISC(hi)) {
+      for (i=0; i < SJISCONVTBLSIZ; i++) {
+	  if (w >= sjisconvtable[i].start &&  w <= sjisconvtable[i].end) {
+	      w -= sjisconvtable[i].start;
+	      w += sjisconvtable[i].shift;
+	      break;
+	  }
+      }
+      hi = (int) ((w >> 8) & 0xff);
+      lo = (int) (w & 0xff);
+  }
+
   if (lo >= 0x9f)
     return ((hi * 2 - (hi >= 0xe0 ? 0xe0 : 0x60)) << 8) | (lo + 2);
   else
@@ -473,6 +675,21 @@
 
 static int sjis2jis(int hi, int lo)
 {
+  int i,w;
+
+  w = (int)((hi << 8) | lo);
+  if (((u_short) hi) >= 0xfa ) {
+      for (i=0; i < SJISCONVTBLSIZ; i++) {
+	  if (w >= sjisconvtable[i].start &&  w <= sjisconvtable[i].end) {
+	      w -= sjisconvtable[i].start;
+	      w += sjisconvtable[i].shift;
+	      break;
+	  }
+      }
+      hi = (int) ((w >> 8) & 0xff);
+      lo = (int) (w & 0xff);
+  }
+  
   if (lo >= 0x9f)
     return ((hi * 2 - (hi >= 0xe0 ? 0x160 : 0xe0)) << 8) | (lo - 0x7e);
   else
@@ -803,6 +1020,7 @@
     for (out = cvtbuf; *from; ) {
 	if (is_shift_jis (*from)) {
 	    int code;
+
 	    switch (shifted) {
 	    case _KJ_KANA:
 	    case _KJ_ROMAN:		/* to KANJI */
@@ -858,7 +1076,7 @@
 }
 
 /*******************************************************************
-  HEX <-> SJIS
+  HEX <-> SJIS with SJIS regularization
 ********************************************************************/
 /* ":xx" -> a byte */
 static char *hex_to_sj(char *from, BOOL overwrite)
@@ -884,12 +1102,13 @@
 }
  
 /*******************************************************************
-  kanji/kana -> ":xx" 
+  kanji/kana -> ":xx"  
 ********************************************************************/
 static char *sj_to_hex(char *from, BOOL overwrite)
 {
     unsigned char *sp, *dp;
-    
+    int code;
+
     sp = (unsigned char*) from;
     dp = (unsigned char*) cvtbuf;
     while (*sp) {
@@ -899,14 +1118,26 @@
 	    *dp++ = bin2hex ((*sp)&0x0f);
 	    sp++;
 	} else if (is_shift_jis (*sp) && is_shift_jis2 (sp[1])) {
-	    *dp++ = hex_tag;
-	    *dp++ = bin2hex (((*sp)>>4)&0x0f);
-	    *dp++ = bin2hex ((*sp)&0x0f);
-	    sp++;
-	    *dp++ = hex_tag;
-	    *dp++ = bin2hex (((*sp)>>4)&0x0f);
-	    *dp++ = bin2hex ((*sp)&0x0f);
-	    sp++;
+	    if (CONVSJISC(*sp)) {
+		code = sjis2rjis (((int) (sp[0] & 0xff)), ((int) (sp[1] & 0xff)));
+		*dp++ = hex_tag;
+		*dp++ = bin2hex ((code >>12) & 0x0f);
+		*dp++ = bin2hex ((code >>8 ) & 0x0f);
+		sp++;
+		*dp++ = hex_tag;
+		*dp++ = bin2hex ((code >>4 ) & 0x0f);
+		*dp++ = bin2hex ((code) & 0x0f);
+		sp++;
+	    } else {
+	      *dp++ = hex_tag;
+	      *dp++ = bin2hex (((*sp)>>4)&0x0f);
+	      *dp++ = bin2hex ((*sp)&0x0f);
+	      sp++;
+	      *dp++ = hex_tag;
+	      *dp++ = bin2hex (((*sp)>>4)&0x0f);
+	      *dp++ = bin2hex ((*sp)&0x0f);
+	      sp++; 
+	    }
 	} else
 	    *dp++ = *sp++;
     }
@@ -920,7 +1151,7 @@
 }
 
 /*******************************************************************
-  CAP <-> SJIS
+  CAP <-> SJIS with SJIS regularization
 ********************************************************************/
 /* ":xx" CAP -> a byte */
 static char *cap_to_sj(char *from, BOOL overwrite)
@@ -957,19 +1188,36 @@
 static char *sj_to_cap(char *from, BOOL overwrite)
 {
     unsigned char *sp, *dp;
+    int code;
 
     sp = (unsigned char*) from;
     dp = (unsigned char*) cvtbuf;
     while (*sp) {
 	if (*sp >= 0x80) {
-	    *dp++ = hex_tag;
-	    *dp++ = bin2hex (((*sp)>>4)&0x0f);
-	    *dp++ = bin2hex ((*sp)&0x0f);
-	    sp++;
+	    if (CONVSJISC(*sp)) {
+		code = sjis2rjis(((int) (sp[0]) & 0xff), ((int) (sp[1] & 0xff)));
+		*dp++ = hex_tag;
+		*dp++ = bin2hex ((code >>12) & 0x0f);
+		*dp++ = bin2hex ((code >>8 ) & 0x0f);
+		sp++;
+		if (((unsigned char) (code & 0xff)) >= 0x80) {
+		    *dp++ = hex_tag;
+		    *dp++ = bin2hex ((code >>4 ) & 0x0f);
+		    *dp++ = bin2hex ((code) & 0x0f);
+		    sp++;
+		} else {
+		    *dp++ = code & 0xff; sp++;	/* emit the converted low byte, not the original */
+		}
+            } else {
+		 *dp++ = hex_tag;
+		 *dp++ = bin2hex (((*sp)>>4)&0x0f);
+		 *dp++ = bin2hex ((*sp)&0x0f);
+		 sp++;
+	    }
 	} else {
 	    *dp++ = *sp++;
 	}
-    }
+    }	
     *dp = '\0';
     if (overwrite) {
 	pstrcpy ((char *) from, (char *) cvtbuf);
@@ -980,17 +1228,20 @@
 }
 
 /*******************************************************************
- sj to sj
+ sj to sj (obsolete)
 ********************************************************************/
-static char *sj_to_sj(char *from, BOOL overwrite)
-{
-    if (!overwrite) {
-	pstrcpy (cvtbuf, (char *) from);
-	return cvtbuf;
-    } else {
-	return (char *) from;
-    }
-}
+/*
+ * 
+ * static char *sj_to_sj(char *from, BOOL overwrite)
+ * {
+ *   if (!overwrite) {
+ *	pstrcpy (cvtbuf, (char *) from);
+ *	return cvtbuf;
+ *   } else {
+ *	return (char *) from;
+ *   }
+ * }
+ */
 
 /************************************************************************
  conversion:
-------------- next part --------------
Hiroshi MIURA 	miura at blue.gr.jp
staff of:	Hokkaid Guide Editor
projects:	LKH-JP=Linux Kernel Hack Japan Project from 1998 ;-)
		BLUE=Business Linux Users Encouragement
		Linux kernel CASSIOPEIA FIVA APM hack



