What is BWSORT

BWSORT is a program for sorting Bengali text files written according to the BWFU font encoding. It reads multiple text files and sorts the concatenated list line by line.

BWSORT can be run from the shell prompt as

   bwsort [-s <sortscheme>] file1 [file2 [file3 ...]]
where the sorting scheme is either DEFAULT or OLD. The two schemes differ only slightly. While the DEFAULT scheme looks more appropriate to me, many Bengali dictionaries tend to use the OLD scheme. If the environment variable BWSORTSTYLE is set (to DEFAULT or OLD), its value is used as the default sorting style; otherwise the default is DEFAULT. In either case, the -s option overrides this default.
   bwsort -h
prints a help message and quits, whereas
   bwsort -v
prints the version info and quits.
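
The precedence among the -s option, the BWSORTSTYLE variable and the built-in default can be sketched in Python as below. This is only an illustration, not BWSORT's own code: the function name resolve_scheme is hypothetical, and both the exact-case matching and the error raised for an unknown value are assumptions, since the text above does not say how bwsort treats such input.

```python
import os

VALID_SCHEMES = ("DEFAULT", "OLD")


def resolve_scheme(cli_scheme=None, environ=None):
    """Pick the sorting scheme: -s value, else BWSORTSTYLE, else DEFAULT."""
    if environ is None:
        environ = os.environ
    if cli_scheme is not None:
        # An explicit -s value always wins.
        chosen = cli_scheme
    else:
        # Otherwise use BWSORTSTYLE if it is set, and DEFAULT if it is not.
        chosen = environ.get("BWSORTSTYLE", "DEFAULT")
    if chosen not in VALID_SCHEMES:
        # Assumption: unknown values are rejected rather than silently ignored.
        raise ValueError("unknown sorting scheme: %r" % (chosen,))
    return chosen
```

For instance, resolve_scheme("DEFAULT", {"BWSORTSTYLE": "OLD"}) returns "DEFAULT", because the command-line value takes priority over the environment.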

When BWSORT is run without any file names on the command line, it starts in interactive mode. The prompt displayed is

bwsort>
In the interactive mode, the following commands are interpreted:
     add file1 [file2 [file3 ...]]
     comp word1 word2
     exit / quit
     help [topic]
     load file1 [file2 [file3 ...]]
     save file
     show
     sortstyle [style]
     version
     !shell command
Type
   help 
at the BWSORT prompt to get online help on the supported commands.


The BWSORT Algorithm

The Bengali character set consists of the following:
Vowels         a, ae (= a + jafala + aa-kaar), aa, i (hraswa-i), I (dirgha-i),
               u (hraswa-u), U (dirgha-u), Ri,
               e, ea (= e + jafala + aa-kaar), E (oi), o, O (ou).

Vowel forms    Similar to the vowels

Consonants     k, K (=kh), g, G (=gh), ^n (=una),
               c (=ch), C (=chh), j (=j), J (=jh), ^N (=ina),
               T, Z (=Th), D, X (=Dh), N,
               t, z (=th), d, x (=dh), n,
               p, f (=ph), b, v (=bh), m,
               Y (=antashtha-ja), r, l, b, S (=sh = talabya-sha),
               S (=ss = murdhanya-sa), s (=dantya-sa), h, rr (=Da-e shunya ra),
                                                     rh (=Dha-e shunya ra),
               y (=antashtha-a), ^t (=khanda-ta), M (=anuswar), H (=bisarga),
                                                     ^ (=chandrabindu).

Conjunct consonants
               Certain allowed combinations of two or more consonants

Digits         0 1 2 3 4 5 6 7 8 9

Punctuation symbols
               period (=dnari), comma, quote, space etc.
However, the basic primitives are the vowels and the pure consonants (i.e. consonants without any vowel sound, e.g. ka-e hasanta), plus the punctuation symbols and digits. Any Bengali string can be broken down into a concatenation of these primitives. For example,
  prakhara daaruNa ati dirgha dagdha din.
can be broken as
  p_ + r_ + a + kh_ + a + r_ + a + space + d_ + aa + r_ + u + N_ + a + space +
  a + t_ + i + space + d_ + I + r_ + gh_ + a + space + d_ + a + g_ + dh_ + a +
  space + d_ + i + n_ + a + .
Here the underscore (_) stands for the pure consonant forms (i.e. consonants without vowel sounds, or with hasanta). Any Bengali sorting scheme (be it a computer program or a press standard) sorts Bengali strings based on this decomposition. As regards the positions of these primitives in the Bengali alphabet, we have the following ordering:
  a   < ae   < aa  < i    < I   <
  u   < U    < Ri  <
  e   < ea   < oi  < o    < ou  <
  k_  < kh_  < g_  < gh_  < ^n_ <
  ch_ < chh_ < j_  < jh_  < ^N_ <
  T_  < Th_  < D_  < Dh_  < N_  <
  t_  < th_  < d_  < dh_  < n_  <
  p_  < ph_  < b_  < bh_  < m_  <
  Y_  < r_   < l_  < sh_  <
  ss_ < s_   < h_  < rr_  < rh_ <
  y_  < ^t   < M_  < H_   < ^_
The DEFAULT sorting scheme of BWSORT respects this order.

Note that there are a total of 52 alphabetic primitives. These have been given the ASCII values A - Z, a - z in that order. Punctuation symbols and digits are given the same ASCII values as in roman. This induces an ordering on all finite-length Bengali strings. BWSORT sorts Bengali strings based on this converted decomposition (using `strcmp').
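
The conversion just described can be sketched in Python as follows. This is an illustration under stated assumptions, not BWSORT's actual source: the input is assumed to be already decomposed into primitives written with the names used in the ordering above (with the word space standing for a blank), and the names DEFAULT_ORDER, PRIMITIVE_TO_CODE and sort_key are hypothetical. Python compares plain ASCII strings by character code, which is the order `strcmp' gives for such strings.

```python
import string

# The 52 alphabetic primitives in DEFAULT order, named as in the ordering above.
DEFAULT_ORDER = (
    "a ae aa i I "
    "u U Ri "
    "e ea oi o ou "
    "k_ kh_ g_ gh_ ^n_ "
    "ch_ chh_ j_ jh_ ^N_ "
    "T_ Th_ D_ Dh_ N_ "
    "t_ th_ d_ dh_ n_ "
    "p_ ph_ b_ bh_ m_ "
    "Y_ r_ l_ sh_ "
    "ss_ s_ h_ rr_ rh_ "
    "y_ ^t M_ H_ ^_"
).split()
assert len(DEFAULT_ORDER) == 52

# Assign A-Z, then a-z, to the primitives in that order.
CODES = string.ascii_uppercase + string.ascii_lowercase
PRIMITIVE_TO_CODE = dict(zip(DEFAULT_ORDER, CODES))


def sort_key(primitives):
    """Convert a list of primitives into an ASCII string usable as a sort key."""
    out = []
    for p in primitives:
        if p in PRIMITIVE_TO_CODE:
            out.append(PRIMITIVE_TO_CODE[p])
        elif p == "space":
            out.append(" ")
        else:
            # Digits and punctuation keep their own ASCII character.
            out.append(p)
    return "".join(out)


# Three short decomposed strings, deliberately out of order.
words = [["k_", "aa"], ["k_", "a"], ["a", "k_", "a"]]
words.sort(key=sort_key)
```

After the sort, words is [["a", "k_", "a"], ["k_", "a"], ["k_", "aa"]]: the keys are "ANA", "NA" and "NC", so a string beginning with a vowel precedes those beginning with k_, and a precedes aa.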

While this scheme seems quite reasonable, many modern dictionaries in Bengali follow a slight variation of this primitive order, one that mostly conforms to old Sanskrit conventions. The OLD sorting scheme of BWSORT is based on these conventions. We now enumerate the differences between the DEFAULT and OLD schemes:

  1. In the OLD scheme the following pairs are grouped together:
          rr_ <--> D_,   rh_ <--> Dh_,   y_ <--> Y_,   ^t <--> t_
    
    In the dictionary order rr_, rh_ and y_ immediately follow D_, Dh_ and Y_ respectively, though they are not in those positions in the alphabet. See point 4 below for a discussion on ^t.

  2. The ajogabaho barnos (M, H, ^) come before any other consonant primitive in the order.

  3. A consonant with a hasanta is treated the same way as the consonant without the hasanta in the OLD sorting scheme. That is, b_da is broken as
          b_ + a + d_ + a
    
    This is not grammatically correct, but this convention is followed in Bengali dictionaries. BWSORT's OLD scheme respects this convention. The DEFAULT one, on the other hand, does not put the a after the hasanta (b_) and thereby identifies b_da as the conjunct bda (ba-e da-e).

  4. Etymologically ^t (khanda-ta) is nothing but ta with a hasanta. In view of this and the previous point (3), ^t is identified with t in the OLD sorting scheme and is not treated as a separate primitive.

    The primitive ordering for the OLD scheme is, therefore, like the following:

         a   < ae   < aa  < i    < I    <
         u   < U    < Ri  <
         e   < ea   < oi  < o    < ou   <
         M_  < H_   < ^_  <
         k_  < kh_  < g_  < gh_  < ^n_  <
         ch_ < chh_ < j_  < jh_  < ^N_  <
         T_  < Th_  < D_  < rr_  < Dh_  < rh_  < N_  <
         t_  = ^t   < th_ < d_   < dh_  < n_   <
         p_  < ph_  < b_  < bh_  < m_   <
         Y_  < y_   < r_  < l_   < sh_  <
         ss_ < s_   < h_
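
The OLD ordering above can be turned into a rank table in much the same spirit. The sketch below is illustrative only: the names OLD_ORDER, OLD_RANK and old_key are hypothetical, the input is assumed to be already decomposed into alphabetic primitives, and comparing lists of integer ranks is a choice made here for clarity rather than something the text attributes to BWSORT. The hasanta rule of point 3 concerns the decomposition step and is not covered.

```python
# Primitives in OLD order; "t_=^t" marks two primitives sharing one rank.
OLD_ORDER = (
    "a ae aa i I "
    "u U Ri "
    "e ea oi o ou "
    "M_ H_ ^_ "
    "k_ kh_ g_ gh_ ^n_ "
    "ch_ chh_ j_ jh_ ^N_ "
    "T_ Th_ D_ rr_ Dh_ rh_ N_ "
    "t_=^t th_ d_ dh_ n_ "
    "p_ ph_ b_ bh_ m_ "
    "Y_ y_ r_ l_ sh_ "
    "ss_ s_ h_"
).split()

# Map every primitive to its rank; identified primitives get the same rank.
OLD_RANK = {}
for rank, group in enumerate(OLD_ORDER):
    for primitive in group.split("="):
        OLD_RANK[primitive] = rank


def old_key(primitives):
    """Return the list of OLD-scheme ranks for a decomposed string."""
    return [OLD_RANK[p] for p in primitives]


# Three one-primitive strings, deliberately out of order.
words = [["k_"], ["M_"], ["ou"]]
words.sort(key=old_key)
```

After the sort, words is [["ou"], ["M_"], ["k_"]], reflecting point 2: M_ follows the vowels but precedes every other consonant primitive. The table has 51 distinct ranks for the 52 primitives because ^t and t_ share one.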
    

These differences make the OLD sorting scheme slightly different from the DEFAULT scheme. As discussed earlier, bwsort lets you choose the one you like in several ways (the -s option on the command line, the environment variable BWSORTSTYLE, or the sortstyle command in interactive mode).

Before we end, some general remarks about a few BWSORT conventions are in order:

  1. Bargya-ba and antashtha-ba are pronounced and written the same way in Bengali (like the bargya-ba in Sanskrit). We have, therefore, omitted the antashtha-ba from the alphabet. Some Bengali dictionaries still find it necessary to look up the original Sanskrit spelling and sort based on the type of the ba. Neither sorting scheme of BWSORT does that. The BWFU fonts do not encourage that either.

  2. Bargya-ja and antashtha-ja are pronounced the same way in Bengali. They are, however, written differently. So these two are treated as separate characters.

  3. The vowels `ae' and `ea' are not listed as vowels in the classical definition of Bengali grammar. Their vowel form would be
          jafala + aa-kaar
    
    When this sequence comes immediately after a consonant (as in baekaraN, for example), the decomposition goes like this
          baekaraN = b_ + Y_ + aa + k_ + a + r_ + a + N_ + a
    
    On the other hand, when jafala + aa-kaar comes after the vowel `a' or `e', the sequence is not decomposed the same way, that is, not as
          a + Y_ + aa    or    e + Y_ + aa
    
    Instead, it is preferable to treat `ae' and `ea' as separate vowels which do not have any vowel forms (kaar) associated with them. This convention is followed for both the DEFAULT and the OLD sorting schemes.

  4. The BWFU and BWTI fonts (on which BWSORT is based) do not define the Bengali vowel `Li'. I have never heard of anybody who has seen this character in a Bengali word (however old, obsolete or uncommon the word is). So I felt no justification for including this character in the Bengali alphabet.

  5. BWSORT assumes that the files you are sorting are Bengali text files in the sense that the words in them make syntactic sense to a Bengali reader. For example, a word cannot start with an aa-kaar or a hasanta. Within a word, a hasanta and a vowel form cannot both be attached to the same consonant. Similarly, two vowel forms cannot modify the sound of a consonant simultaneously. There is no such thing as a bisarga-e hraswa-i-kaar, and so on. `a' and `e' are the only vowels that take a jafala (+ aa-kaar) after them. If the input file does not conform to these general rules, you may expect peculiar behavior from bwsort.

That's all! If you find some conventions wrong or wrongly implemented, or there is a pre-defined standard which every sorting scheme should follow, please let me know. I can be reached at

   abhij@csa.iisc.ernet.in
Thanks for your interest in bwsort.


Copyright

BWSORT is copyrighted 1998 by the author

Abhijit Das (Barda)
Department of Computer Science and Automation
Indian Institute of Science
Bangalore 560 012

BWSORT is freeware. Permission is hereby granted to use and distribute it free of charge for all sorts of personal and academic purposes. Use of this software for commercial purposes is strictly prohibited.



