hindicomputing: June 2008

Wednesday, 18 June 2008

VKM at MVP world Meet at Redmond

aavaaraa hooM....आवारा हूँ...

Aawara Hun.mp3

Monday, 16 June 2008

हिंदी सीखने-सिखाने के लिए प्राकृतिक भाषा संसाधन (NLP) के अनुप्रयोगों की भूमिका

(29-30 जुलाई,2006 को टोक्यो विश्वविद्यालय में आयोजित अंतर्राष्ट्रीय हिंदी सम्मेलन में प्रस्तुत आलेख)
---विजय कुमार मल्होत्रा पूर्व निदेशक (हिंदी),रेल मंत्रालय, भारत सरकार

सार-संक्षेप

सूचना प्रौद्योगिकी (IT) के अंतर्गत अनेक ऐसी युक्तियाँ अंतर्निहित हैं,जिनके तार्किक सूत्रों की सहायता से प्राकृतिक भाषाओं के शिक्षण का मार्ग भी प्रशस्त किया जा सकता है.इस आलेख में MS ऑफ़िस हिंदी में प्रयुक्त कुछ ऐसी विधियों और प्राकृतिक भाषा संसाधन (NLP) संबंधी अनुप्रयोगों के उपयोग पर प्रकाश डाला गया है,जिनकी सहायता से अनायास ही देवनागरी लिपि की ध्वन्यात्मक और वर्णपरक व्यवस्था को रोमन लिपि के माध्यम से सीखा जा सकता है.भारत में हिंदीभाषी क्षेत्रों के सुदूरवर्ती इलाकों में स्थित पाठशालाओं में आज भी देवनागरी लिपि के अक्षर सिखाने के लिए बारहखड़ी का व्यापक उपयोग किया जाता है. MS ऑफ़िस हिंदी के अंतर्गत इसी बारहखड़ी का उपयोग देवनागरी लिपि के अक्षरों और उनकी मात्राओं के संयोजन के लिए किया गया है.इसके अंतर्गत रोमन लिपि की सहायता से हिंदी न जानने वाले छात्र न केवल देवनागरी लिपि के अक्षर संयोजन को सीख सकते हैं,बल्कि हिंदी पाठ को सहजता और सरलता से टाइप भी कर सकते हैं.इस कुंजीपटल को ध्वन्यात्मक लिप्यंतरण कुंजीपटल (Phonetic Transliteration Keyboard) कहा जाता है.
इसके अलावा, प्राकृतिक भाषा संसाधन (NLP) के अंतर्गत आज पेन्सिल्वानिया विश्वविद्यालय के प्रो.अरविंद जोशी के निर्देशन में विकसित वृक्ष संलग्न प्रणाली (TAG) नामक कलनविधि (Algorithm) की सहायता से विभिन्न भाषाओं में मशीनी अनुवाद प्रणाली जैसे अनुप्रयोगों का विकास किया जा रहा है. प्रस्तुत आलेख के लेखक को प्रो.अरविंद जोशी के निर्देशन में इसी टैग प्रणाली की सहायता से प्रो.सूरजभान सिंह द्वारा प्रस्तावित हिंदी वाक्य संरचना के सार्वभौमिक और भाषा-विशिष्ट लक्षणों को विश्लेषित करने के लिए अपेक्षित हिंदी पार्सर विकसित करने का अवसर मिला था और इसी कार्य के दौरान यह पाया गया कि टैग प्रणाली का उपयोग अन्य अनुप्रयोगों के अलावा शब्दवृत्त विकसित करने और हिंदी शिक्षण के लिए भी किया जा सकता है.इस आलेख में ऐसे अनेक उदाहरण देकर यह सिद्ध करने का प्रयास किया गया है कि युनिकोड के वैश्विक मानक पर आधारित अधुनातन कंप्यूटर प्रणाली हिंदी शिक्षण के लिए भी अत्यंत उपयोगी सिद्ध हो सकती है. ...........................

Role of NLP applications for learning / teaching Hindi
(A Paper for International Hindi Conference at Tokyo University on July 29-30,2006)
By
Vijay K Malhotra
Former Director (Hindi), Ministry of Railways, Govt of India

ABSTRACT

IT has great potential for developing techniques for learning natural languages through its inherent logical features. This paper deals with some features of MS Office Hindi and other NLP applications that can be used for learning Hindi. Although the Devanagari script used for Hindi is syllabic, its alphabetical order is phonetic in nature. Hence BARAHKHADI has long been used for learning and teaching Hindi even in the remotest schools of the Hindi heartland in India. Microsoft adopted this BARAHKHADI concept so that a user who knows only the Roman script can type Devanagari letters along with their maatraas, without knowing the Devanagari script or its keyboard. This keyboard is therefore called the Phonetic Transliteration keyboard.
In addition, Tree Adjoining Grammar (TAG), developed by Prof. Aravind Joshi of the University of Pennsylvania, is now used extensively to develop NLP applications such as MT systems in various languages of the world, including Hindi. The author had the opportunity to assist the NLP group of the University of Pennsylvania in developing a Hindi parser to analyze the universal as well as language-specific features of Hindi. This paper deals with some of the rules of the syntactic grammar of Hindi proposed by Prof. SB Singh, which pave the way for developing a Hindi lexicon on current UNICODE-based computer systems. It concludes that MS Office Hindi and other NLP applications such as TAG can be used extensively for learning and teaching Hindi.

Main Paper
Linguists and computer scientists around the world now accept that a word is like a seed that holds the capacity to flower, leaf out, and grow into a whole tree, the sentence. In fact, every word is a configuration of various features. These features are present in words in both overt and covert form, and they are of two kinds: universal and language-specific. If these features are analyzed properly, one can uncover not only the different meanings and contexts of words but also the subtle rules of sentence construction. The earliest form of traditional grammar, however, was mainly prescriptive. Prescriptive grammar insisted on preserving the purity of a particular language. As Terry Winograd (1983) puts it, the linguist's job in that period would have been that of a judge or a policeman, responsible for regulating the correct use of language as a form of social behaviour.
In the 19th century, linguistics too was touched by Darwin's theory of evolution, and a new dimension, comparative linguistics, was added to the field. On the basis of similar and dissimilar traits, all the world's languages were divided into language families, and the emphasis fell on the comparative study of their tendencies. In time, comparative linguistics gave way to structural grammar and then to transformational generative grammar (TG grammar). Universal grammar was conceived on the basis of Chomsky's (1965) TG grammar. Although Chomsky treated the sentence as a structured string of words, his effort remained confined to the search for language-independent, universal elements rather than the study of the tendencies of particular languages. Dasgupta (1991), Gita (1985), and Jain (1960) have claimed that this attempt at universal grammar has not succeeded for many of the world's languages, Indian languages in particular. The present author also argued, in a paper presented at CPAL-2 organized by IIT Kanpur, India (1992, p. 317), that a language cannot be processed by computer until its language-specific aspects have been studied thoroughly. Consider the following Hindi and English sentences:
(1) राम को बुखार है.
(2) राम श्याम से मिलता है.
(3) Ram has a fever.
(4) Ram meets Shyam.
The use of 'को' in sentence (1) is a language-specific tendency of Hindi. Interestingly, the English sentence (3) has no preposition corresponding to the postposition 'को', yet all Indian languages regularly use a marker equivalent to 'को':
(5) रामला ताप आहे. (Marathi)
(6) रामक्कु ज्वरम् (Tamil)
(7) रामन्नु पनियानु (Malayalam)
(8) रामनिगे ज्वर दिगे (Kannada)
(9) रामेर ताप आछे (Bengali)
This tendency is also found in other languages of South-East Asia. It is because of these shared linguistic tendencies that not only India but the whole of South-East Asia has been described as a single linguistic area (Linguistic Zone).
Similarly, in sentences (2) and (4) the presence of a co-agent, Shyam, is a universal property of the verbs 'मिलना' and 'meet', but the use of 'से' in sentence (2) is language-specific to Hindi. That is why sentence (4) has no marker corresponding to 'से'. Analyzing linguistic tendencies and features of this kind is the work of Natural Language Processing (NLP), a branch of computational linguistics. Its aim is to build broad computational models and designs that enable communication between humans and machines. Early on, people communicated with computers through programming languages such as BASIC, COBOL, and Pascal; now it is also possible through natural languages such as English. The main pillar of NLP is the lexicon. Three approaches are mainly used in building a lexicon in NLP: structural, feature-based, and relational. Within these approaches, words are classified by semantic fields. Whereas an ordinary dictionary arranges words alphabetically, the structural approach divides them into semantic categories, as in a thesaurus. The feature-based approach records grammatical information such as aspect, number, tense, and gender. This information is overt. For example, the suffix 'या' in 'गया' signals past tense, singular, and masculine. In Hindi, a noun in the oblique plural form must be accompanied by a postposition, and this information comes directly from its lexicon entry. For example, 'लड़कों' and 'लड़कियों' cannot be used in any Hindi sentence without a postposition (के, में, से, etc.), as in: लड़कों / लड़कियों को बुलाओ.
Moreover, the ending 'ओं' itself signals the oblique form. If 'ओ' were used instead, the form would be the vocative plural, which does not admit any postposition (के, में, से, etc.), as in: देवियो और सज्जनो.
Semantic and syntactic features are covert, that is, they are inherent in the words themselves. For example, the word 'बालक' is a sentient, animate, human, concrete noun. These are its semantic features. Including such features in the lexicon also makes it possible to block the generation of anomalous sentences. For example, the verb 'गाना' requires an animate subject, so a sentence built on this verb cannot be considered well formed unless its subject is animate. Syntactic features cover constituents without which a sentence seems incomplete or anomalous, as in 'राम श्याम से मिलता है.' If Shyam or some other co-participant is not mentioned, the sentence feels incomplete: '*राम मिलता है.'
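The idea of blocking anomalous sentences through covert features can be illustrated with a minimal sketch. The feature names, the `subject_requires` key, and the extra noun 'पत्थर' (stone) are assumptions made for illustration; they are not taken from Prof. Singh's grammar or from any existing lexicon.

```python
# Hypothetical feature lexicon: every feature name here is illustrative only.
LEXICON = {
    "बालक": {"pos": "noun", "animate": True, "human": True, "concrete": True},
    "पत्थर": {"pos": "noun", "animate": False, "human": False, "concrete": True},
    # The verb entry records what it demands of its subject.
    "गाना": {"pos": "verb", "subject_requires": {"animate": True}},
}


def subject_is_compatible(subject, verb):
    """Return True if the subject noun carries every feature the verb requires."""
    required = LEXICON[verb].get("subject_requires", {})
    features = LEXICON[subject]
    return all(features.get(name) == value for name, value in required.items())
```

A generator or parser could call `subject_is_compatible("पत्थर", "गाना")` before building a clause and reject the combination, while accepting 'बालक' as the subject of 'गाना'.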

The relational approach brings out the relations between two words. The parent-offspring relation, for example, can be stated as:
पिल्ला > young > कुत्ता
बछड़ा > young > गाय
Part-whole relations can be stated in the same way:
पैर <> part-whole <> शरीर
चोंच <> part-whole <> चिड़िया
Synonyms (such as पवन, समीर, वायु, हवा) also fall under this approach. In a lexicon, many such relations can be held together as a network:
ठंडा <> synonym <> शीतल
ठंडा <> antonym <> गरम
चीता <> class <> स्तनपायी <> पशु
खाना > causative > खिलाना
बछड़ा > young > गाय
हाथ > part > शरीर
सोमवार > sequence > मंगलवार
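The relation network above can be sketched as a small labelled graph. This is only one possible representation; the class name `LexicalNetwork`, the English relation labels, and the choice to store symmetric relations in both directions are assumptions made for illustration.

```python
from collections import defaultdict


class LexicalNetwork:
    """A tiny labelled graph: (word, relation) -> set of related words."""

    def __init__(self):
        self._edges = defaultdict(set)

    def add(self, source, relation, target, symmetric=False):
        self._edges[(source, relation)].add(target)
        if symmetric:
            # Synonymy and antonymy hold in both directions.
            self._edges[(target, relation)].add(source)

    def related(self, word, relation):
        # .get avoids creating empty entries for unknown words.
        return sorted(self._edges.get((word, relation), set()))


net = LexicalNetwork()
net.add("ठंडा", "synonym", "शीतल", symmetric=True)
net.add("ठंडा", "antonym", "गरम", symmetric=True)
net.add("बछड़ा", "young_of", "गाय")
net.add("पिल्ला", "young_of", "कुत्ता")
net.add("हाथ", "part_of", "शरीर")
net.add("खाना", "causative", "खिलाना")
net.add("सोमवार", "next", "मंगलवार")
```

With this structure, `net.related("शीतल", "synonym")` finds 'ठंडा' even though only one direction was entered, whereas directed relations such as `young_of` are stored one way only.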

All these features fall under universal grammar. Just as universal grammar neglected language-specific features, the traditional grammars of Hindi have also discussed very few of Hindi's language-specific structures. If any one linguist has made a serious effort to fill this gap, it is Prof. Surajbhan Singh. His "हिंदी का वाक्यात्मक व्याकरण" (A Syntactic Grammar of Hindi) is a milestone: it not only sets out the outline of a comprehensive grammar of Hindi but also paves the way for computer-aided synthesis, analysis, and processing of Hindi within NLP. On the basis of the kernel sentence patterns of Hindi classified by Prof. Singh, the present author made a modest contribution to building a Hindi parser at the University of Pennsylvania in the USA under the leadership of Prof. Aravind Joshi. In analyzing the inherent and structural features of Hindi sentences, Prof. Singh generally uses the method of feature analysis or componential analysis. All constituents of a sentence pattern are, in the abstract, made up of certain features or components, and each sentence pattern is a bundle or configuration of features. A feature configuration of a given type can derive sentences of that type. Sentences can be derived from these sentence patterns and sub-patterns through certain specific processes. Although Prof. Singh posits 14 sentence patterns and 45 sub-patterns for Hindi in all, this paper discusses only two patterns that Hindi teachers have still not noticed: the को-pattern, and the intransitive-verb pattern with a co-participant complement marked by the postposition 'से'.
Even today, Hindi teachers generally think of 'को' only in connection with the object, as in राम श्याम को पीटता है. Sentences in which 'को' marks the subject go unnoticed, although sentences derived from the को-pattern are widely used in Hindi: "राम को बुखार है", "मुझे (मैं + को) बहुत काम है", "राम को फुटबॉल का शौक है", "राम को लड़कियों से नफ़रत है", and so on. Any sentence in which the complement or the lexical verb requires the postposition 'को' on the (logical) subject is called a को-sentence. Two inherent (semantic) features of the को-subject are important: (a) sentience and (b) absence of volition. The noun in the को-subject position is sentient and capable of feeling, because को-sentences mostly express the kinds of sensations, emotions, and abstract activities that are characteristic of living beings. Occasionally, through personification, 'को' is also used with words such as मन and दिल, as in "मेरे मन को बड़ा दुःख हुआ." Likewise, a को-sentence cannot have an active agent as its subject, because volition is absent. Prof. Singh calls all the notions or semantic elements that require a को-sentence for their expression को-भाव (ko-notions). On the basis of certain semantic features he gives a rough classification of them:

1. Physical sensation (बुखार, प्यास, नींद, पसीना, हँसी, छींक, etc.)

शीला को प्यास लगी.
राम को नींद / हँसी / छींक आई.

2. Intellectual experience (knowledge / awareness) (आना, मालूम होना, याद आना)

सुधीर को हिंदी आती है.
शीला को यह मालूम है.
विजय को अक्सर अपने गाँव की याद आती है.

3. Emotion (दुःख, गुस्सा, आशा, नफ़रत, etc.)

राम को यह खबर सुनकर बहुत दुःख हुआ.
पार्वती को अपनी सहेली पर बहुत गुस्सा आया.
पिता को बेटे से यही आशा थी.

4. Liking / habit (पसंद, शौक, दिलचस्पी, आदत, etc.)

राम को फ़ुटबॉल का शौक है.
सुधीर को खीर बहुत पसंद है.

5. Abstract human activity

5.1 Need class (ज़रूरत, आवश्यकता, इंतज़ार, खोज, चाहिए, तलाश, प्रतीक्षा, etc.)

राम को नौकर की ज़रूरत है.
शीला को कार चाहिए.

5.2 Availability class (उपलब्ध होना, नसीब होना, प्राप्त होना, मिलना, सुलभ होना)

राम को सब सुविधाएँ उपलब्ध / प्राप्त / सुलभ हैं.
शीला को सौ रुपए मिले.

5.3 Gain / loss class (अभाव, कमी, क्षति, नुक्सान, बचत, हानि, etc.)

सेठ जी को दस हज़ार रुपए का नुक्सान हुआ.
राम को इस साल अच्छी बचत हुई.

5.4 Work class (अवकाश, काम, खतरा, जल्दी, मतलब, विलंब, फुर्सत, etc.)

मुझे आज बहुत काम है.
पिता जी को जल्दी है.
तुम्हें क्या मतलब?
मुझे फुर्सत नहीं है.

5.5 Acceptability class (मान्य, अमान्य, असह्य, स्वीकार्य, etc.)

मुझे तुम्हारी शर्तें मान्य नहीं हैं.
तुम्हारी बातें मुझे असह्य लगती हैं.

5.6 Congratulation class (अधिकार, नमस्कार, प्रणाम, बधाई, शुक्रिया, छूट, etc.)

जनता को बोलने का अधिकार है.
आपको बधाई हो.
पुलिस को गोली चलाने का आदेश है.

This makes it clear that Hindi regularly uses को-भाव in the form of को-complements and को-verbs, and that this is a language-specific tendency. If the lexicon entries for all these words also record all their overt and covert features, a word grammar of Hindi will readily emerge. According to universal grammar, some intransitive verbs in their basic senses require a co-participant in addition to the subject, for example मिलना, लड़ना, झगड़ना, चिपटना, टकराना, डरना, कतराना, and चिढ़ना. With these verbs the relation between subject and co-participant may be one of association or one of separation. With verbs of association, subject and co-agent stand in a relation of equal rank, which is why they can also be joined by the conjunction 'और', as in 'राम श्याम से मिलता है.' and 'राम और श्याम मिलते हैं.' Adverbials such as 'आपस में', 'परस्पर', and 'एक-दूसरे से' can also be used in these sentences, as in 'राम और श्याम आपस में / परस्पर / एक-दूसरे से मिलते हैं.' With verbs of separation, however, subject and co-participant cannot be joined by 'और'.
For example: 'चोर पुलिस से डरता है.' but '*चोर और पुलिस डरते हैं.'
In traditional Hindi grammar and in universal grammar alike, the use of 'से / from' as the ablative in sentences such as the following is a universal tendency.
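The को- and से-patterns described above can be recorded as case frames in lexicon entries. The following is a minimal sketch under stated assumptions: the `CaseFrame` fields, the predicate keys, and the function names are hypothetical, and the entry for 'मिलना' covers only its 'meet' sense, not the को-subject sense seen in "शीला को सौ रुपए मिले".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CaseFrame:
    subject_marker: Optional[str]         # postposition on the logical subject
    co_participant_marker: Optional[str]  # postposition on the co-participant
    allows_conjoined_subject: bool = False


# Illustrative entries only; a real lexicon would hold many more predicates.
CASE_FRAMES = {
    "बुखार होना": CaseFrame("को", None),
    "शौक होना": CaseFrame("को", None),
    "मिलना": CaseFrame(None, "से", True),   # verb of association
    "डरना": CaseFrame(None, "से", False),   # verb of separation
}


def is_well_formed(predicate, subject_marker=None, co_participant_marker=None):
    """Check the markers actually used against the predicate's case frame."""
    frame = CASE_FRAMES[predicate]
    return (frame.subject_marker == subject_marker
            and frame.co_participant_marker == co_participant_marker)


def allows_conjoined_subject(predicate):
    """True if subject and co-participant may be joined by 'और'."""
    return CASE_FRAMES[predicate].allows_conjoined_subject
```

Under this sketch, `is_well_formed("मिलना")` is false because the co-participant is missing, mirroring '*राम मिलता है', and `allows_conjoined_subject("डरना")` is false, mirroring '*चोर और पुलिस डरते हैं'.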

For example: "पेड़ से पत्ते गिरते हैं." (The leaves fall from the tree.)
"हिमालय से गंगा निकलती है." (The Ganges flows from the Himalayas.)
Similarly, the use of 'से' as the instrumental is a universal tendency, as in डाकू ने यात्री को तलवार से मार दिया / The dacoit killed the passenger with a sword. But Hindi also has a language-specific tendency: with some intransitive verbs, 'से' is inherent as a nuclear element of the verb. Consider the following sentences:
‘राम श्याम से मिलता है.’
‘गोपाल राधा से लड़ता है.’
‘कार स्कूटर से टकरा गई. ’
‘बच्चा माँ से लिपट / चिपट गया. ’
‘चोर पुलिस से डरता है. ’
‘शीला अपने पति से रूठ गई. ’
The use of 'से' is inherent in all these verbs, although it is neither ablative nor instrumental here. This is a regular, language-specific tendency of Hindi, yet with the help of traditional Hindi grammars students are still taught only 'को' with the object and 'से' as the case marker or postposition of the ablative and instrumental. In 1996, at the University of Pennsylvania in the USA, an attempt was made to capture subtle, inherent language-specific rules of this kind, as described by Prof. Singh, in a parser built with the Tree-Adjoining Grammar (TAG) formalism developed by Prof. Aravind Joshi. Such a parser can not only block the generation of anomalous Hindi sentences but also help build a lexicon that incorporates all these features. Although the parser was developed for a computer-aided translation system, its capabilities suggest that a lexicon for learning and teaching Hindi could also be developed.
Microsoft has recently developed 'Windows / Office Hindi', an operating system and office package in which Hindi input is based on the same barahkhadi that has been used for centuries in villages across India to learn and teach Hindi. Devanagari is still regarded as the most scientific script in the world, but typing in Hindi remains a hard nut to crack even for Hindi speakers. Microsoft, with the help of Webdunia, has therefore developed an IME (Input Method Editor) as an alternative keyboard, with which Hindi text can be typed phonetically using the Roman script. This keyboard is most useful for people who already know how to type in the Roman script. A glimpse of it can be seen in the following language bar (image not reproduced here):
With it, all words formed with the letter 'म' can be typed in Hindi from a Roman keyboard without knowing Hindi typing. Even if you want to write मृ, the bar shows that the matra of ऋ is typed with a capital R. Begin by typing the first letter of the word you want in the Roman script. For example, to write 'भारत', as soon as you type b the barahkhadi of 'ब' appears on the screen, and it shows 'bh' for typing 'भ'. Then type aa for the matra of 'आ', so that 'भारत' is typed in the Roman script as 'bhaarat'. The method is thus entirely phonetic: you type in the same order in which you speak. In Hindi, for instance, the matra of इ is written before the consonant but pronounced after it. So to write 'स्त्रियाँ' you type 'striyaa^'.
A characteristic of Hindi is that unaspirated and aspirated consonants come in pairs. With this keyboard, called Phonetic Transliteration, you type 'k' to write 'क' and 'kh' to write 'ख', 'g' for 'ग' and 'gh' for 'घ'. Learners of Hindi as a second or foreign language thus become aware of aspiration through the 'h'. In Hindi this contrast distinguishes meaning. Within India too, the method has proved very helpful for non-Hindi speakers, especially Tamil speakers, in learning Hindi.
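The phonetic typing scheme described above can be approximated with a short greedy transliterator. This is a sketch, not Microsoft's IME: only the mappings k, kh, g, gh, b, bh, aa, R, and ^ come from the text, while the remaining table entries (m, r, t, s, y, a, i) and the rule of inserting a virama between adjacent consonants are assumptions made so that the two example words work.

```python
VIRAMA = "\u094d"  # joins consonants into a conjunct

# Roman key sequence -> Devanagari consonant (partly assumed, see lead-in).
CONSONANTS = {"k": "क", "kh": "ख", "g": "ग", "gh": "घ", "b": "ब", "bh": "भ",
              "m": "म", "r": "र", "t": "त", "s": "स", "y": "य"}
# Roman key sequence -> (independent vowel, matra after a consonant).
VOWELS = {"a": ("अ", ""), "aa": ("आ", "\u093e"),
          "i": ("इ", "\u093f"), "R": ("ऋ", "\u0943")}
SIGNS = {"^": "\u0901"}  # chandrabindu

# Longest keys first, so that 'bh' wins over 'b' and 'aa' over 'a'.
TOKENS = sorted(list(CONSONANTS) + list(VOWELS) + list(SIGNS),
                key=len, reverse=True)


def transliterate(roman):
    """Convert phonetic Roman input to Devanagari, left to right."""
    out = []
    pending = False  # True while the last consonant still awaits its vowel
    i = 0
    while i < len(roman):
        for tok in TOKENS:
            if roman.startswith(tok, i):
                break
        else:
            raise ValueError(f"no mapping for {roman[i]!r}")
        if tok in CONSONANTS:
            if pending:
                out.append(VIRAMA)  # consonant cluster: suppress inherent 'a'
            out.append(CONSONANTS[tok])
            pending = True
        elif tok in VOWELS:
            independent, matra = VOWELS[tok]
            out.append(matra if pending else independent)
            pending = False
        else:
            out.append(SIGNS[tok])
            pending = False
        i += len(tok)
    return "".join(out)
```

Because input is consumed in spoken order, `transliterate("striyaa^")` emits the इ matra after र in memory even though the rendered glyph appears before the cluster, which is the point the text makes about typing in the order of speech.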
In conclusion, if NLP techniques are used to develop a word grammar for Hindi, a suitable lexicon for learning and teaching Hindi can be developed, one that readily brings out the universal and language-specific tendencies inherent in Hindi words.

The Internet and Hindi

Perhaps not even the British Empire, on which the sun never set, or the influence of America as not merely a super but a supreme power spread the dominance of English across the world as far as the Internet has done overnight. According to UNESCO, about 82 percent of the information available on the Internet today is in English or other Roman-script languages. Only 18 percent is in non-Roman-script languages, and the share of Indian languages is less than one percent. What are the difficulties that keep Hindi and other Indian languages from being adequately represented on the Internet? Speaking at a Hindi Day function on 14 September 2006, the former President of India, Shri Abdul Kalam, said that for Hindi to be spoken easily in many parts of the world, Hindi literature would have to be made available on the Internet in Unicode form. What, then, is Unicode? Is it an Aladdin's lamp that will bring information in Hindi and other Indian languages onto the Internet effortlessly?
The Hindi websites, portals, and blogs currently on the Internet can be divided into three categories. First are websites that require a particular font to be downloaded before they can be read. As a result, ordinary users without technical knowledge cannot use them, and downloading a separate font for each website is a nuisance in itself. Moreover, information on such websites cannot be found through search. Most Government of India websites fall into this category. The second category consists of websites that use dynamic fonts. A dynamic font is a newer HTML design technique that lets users view and read a website's content, whatever font it was built in, without downloading that font. Most Hindi newspapers fall into this category, but the content of a website built with dynamic fonts can neither be saved into a document such as Word nor found through search. As far as search is concerned, then, both kinds of website effectively do not exist on the Internet, which defeats the very purpose of building a website.
Before turning to the third category, let us review the state of Hindi in computing. It is ironic that even today most users of Indian languages hesitate to use ordinary computer applications such as e-mail, chat, templates, AutoText, thesaurus, and spell check because of incompatibilities between systems and fonts. This is why Hindi users are still confined to word processing, and even there they take merely typing in Hindi to be Hindi computing. Very few users work in Hindi and other Indian languages in PowerPoint, Excel, or Access, or use facilities such as Hindi search on the Internet. The main reason so far has been that no common standard for Indian languages was in use across systems. The ISCII coding system, approved by the Government of India for computing in Indian languages, was a good start, but in this age of globalization what was needed, given the variety of platforms, fonts, and systems, was a standard coding system in which all the world's languages could coexist. The only solution to these problems is Unicode. Unicode can bring a revolutionary change in the spread of Hindi. Today the worldwide code called Unicode is used for the world's written languages by almost all major computing vendors and platforms, including Microsoft, IBM, Linux, and Oracle. The coding system is independent of font, platform, and browser. All PCs running Windows 2000 or later support Unicode, so using Unicode-based fonts not only puts Hindi on a par with the world's advanced languages but also makes modern facilities such as search readily available on websites built with it.
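Why Unicode text is searchable regardless of font can be seen by looking at the code points themselves. The sketch below uses only Python's standard `unicodedata` module; the sample `page` string and the helper name `describe` are made up for illustration.

```python
import unicodedata

DEVANAGARI_BLOCK = range(0x0900, 0x0980)


def describe(text):
    """List each character's code point and official Unicode name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# The word 'हिंदी' written as explicit code points: ह + ि + ं + द + ी
word = "\u0939\u093f\u0902\u0926\u0940"

# A made-up page fragment; any Unicode-encoded page behaves the same way.
page = "Unicode " + word + " text"

for code_point, name in describe(word):
    print(code_point, name)

# The same code points are stored whatever font displays them,
# so a plain substring search finds the word.
print(word in page)
print(len(word.encode("utf-8")), "bytes in UTF-8")
```

With a font-specific legacy encoding, the stored bytes would correspond to glyph positions in that one font, so the same search would fail on a page built with a different font.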
It is heartening that, for some time now, some Government of India departments, some major Hindi newspapers, and some Hindi portals have come to appreciate the importance of Unicode and have begun using it for their websites. These departments, newspapers, and portals make up the third category. The Ministry of External Affairs, Dainik Jagran, the web magazines Abhivyakti and Anubhuti, and the Webdunia portal lead the way. Interestingly, credit for a pioneering role also goes to small magazines that are always short of resources: Nirantar, Vagarth, Tanmay, Tadbhav, Anyatha, and others. Because they use Unicode, their content can be viewed and read on the Internet without downloading a font, saved, and ultimately found through search engines such as Google and AltaVista. With the steadily growing use of Unicode in Hindi, any future UNESCO report should show a considerably larger share for Hindi and other Indian languages than before.

The World of Hindi Blogs (Chittha)

The idea first appeared under the name 'weblog', coined by Jorn Barger in 1997, about ten years ago. In 1999 Peter Merholz jokingly shortened the word to 'blog', and it has since been used as both noun and verb. Early weblogs were brief, frequently updated versions of popular websites. Early blogs focused on one subject or another: political blogs, travel blogs, fashion blogs, project blogs, dream blogs, and so on. Most blogs were created by individuals on a personal basis, but with their growing popularity they are now also used commercially. A blog is essentially a website whose entries are organized chronologically but usually displayed in reverse chronological order. According to a survey in September 2007 by Technorati, a web search engine, about 106 million blogs currently exist worldwide. Most of them comment on the news, while some take the form of online diaries, which is perhaps why the word चिट्ठा (chittha) has gained currency for them in Hindi. In the early days blogs contained only text, but their form has changed considerably: they can now publish images, audio, and video.
Technically there is no great difference between a blog, a website, and a portal. Shri Balendu Dadhich, editor of Prabhasakshi.com, holds that a blog is just a diary or chittha, a website is a full document, and a portal is a collection of many websites; their construction, however, does differ somewhat. Building a website or portal requires registering a domain name and obtaining an FTP address, user name, and password, whereas creating a blog requires no such formalities. In addition, the providers of FTP server space charge a small fee for hosting a website, whereas blog hosting is available through blog software and services such as Google, WordPress, Movable Type, Blogger, or LiveJournal.
Blogging in Hindi began about four and a half years ago, only after the arrival of Unicode, because Unicode is a coding system that is not only worldwide but also covers all the world's living languages. It was started at a personal level by people such as Ravi Ratlami. According to the popular Hindi blogger Alok Puranik, however, these individually created and published blogs, though very user-friendly, only bore out the proverb 'जंगल में मोर नाचा, किसने देखा' (the peacock danced in the forest, but who saw it?), and only a handful of people saw them. Now blog aggregators such as Narad, Blogvani, and Hindi Blog.com have entered the field. Individual bloggers register with them and get some readers from the very first day. The first Hindi blog aggregator, Chitthavishwa, was built by Debashish Chakravarty, a software engineer from Pune. Thanks to blog aggregators and free blog creation, the number of Hindi blogs has now risen to about 1,100. Bloggers who want an accurate, up-to-date count of their visitors can also add a counter. As for typing in Hindi, tools such as Microsoft's IME and the Google Indic Transliteration tool have made writing in Devanagari much easier than before.
With the growing popularity of Hindi blogs, large corporations are also entering the field. BBC Hindi has recently started its blog service. But given the reach and power of blogs, some terrorist organizations have also begun to exploit them, and some countries are therefore considering restrictions. Recently a court in Egypt sentenced a blogger to four years in prison for criticizing Islam and the President through a blog. An Egyptian human rights organization called the sentence 'very harsh'. A notification issued by the Government of India also states that the government may ban websites that threaten India's sovereignty and integrity, national security, or friendly relations with other countries, or that incite the breaking of law. Those who use blogging to voice frank opinions about their workplace or society, however, say that such restrictions strike at freedom of expression. In this situation bloggers need to follow a model code of conduct of their own. Narad has taken a commendable initiative by showing the foresight to draft such a code. According to the BBC, some government bodies have begun using blogs with their positive side in view. Blogging so impressed the Barmer police in Rajasthan that they set up an official blog, on which the department publishes details of its daily activities.
This makes it clear that the blog is a force whose positive use can bring about great change in society at both the personal and the social level.

Global Integration of Indians through Indic Language Computing

Vijay Kumar Malhotra

Former Director (Official Language), Ministry of Railways, Government of India, New Delhi (India)

Introduction
Language also plays an important role on the political stage. A quick look at world history shows that it is no mere coincidence that developed countries are predominantly monolingual while developing countries are multilingual, multi-ethnic, and multicultural. In the so-called developed world, two languages are seen as a nuisance and many languages as absurd. Managing diversity is in fact a great challenge for humanity today. The conception of the state as a nation built on monolingual thinking has left the democratic world divided and isolated. In Canada, mutual intolerance between French and English brought Quebec to the brink of separation. When intolerant politicians in Pakistan tried to impose Urdu as the sole official language, Bangladesh broke away from Pakistan. Recognizing multiple languages not only allows communication to flow freely but also empowers people at the grassroots. Like Chinese and English, the umbrella name Hindi also covers many dialects.
If you look at the debate in the Constituent Assembly of India on 14 September 1949 on the proposal to make Hindi the official language, you will find that the proposal was moved by the well-known Tamil leader Shri Gopalaswami Ayyangar and was adopted unanimously, with no vote against and no abstention, but with a clause added that the development of the other Indian languages would be ensured along with Hindi. Accordingly, other Indian languages, including Sanskrit, were included as national languages in the Eighth Schedule of the Constitution of India. Their number has now risen to 22. Article 351 of the Constitution envisages that Hindi should be developed so that it can represent the composite culture of India and assimilate the forms, styles, and expressions of the languages listed in the Eighth Schedule.
· Unity in diversity
· India is a single linguistic area.
· A common code and a common keyboard for all Indian scripts
In a multilingual, multi-ethnic, and multicultural society, the principle of unity in diversity is the guiding maxim. More than 1,650 languages and dialects are in use in India, yet the country is a single linguistic area. The principle applies to India's languages and scripts as well. Many of them look so different from one another that the underlying thread of similarity is not easy to find. The Indo-Aryan and Dravidian languages, for example, appear so different that it is hard to believe that the scripts of all Indian languages (except Urdu) derive from a common source, the Brahmi script. From the time of Ashoka onward, Brahmi is found in wide use in both north and south India. Authentic historical knowledge of this came in 1837. Computer scientists at IIT Kanpur experienced this fact in practice in 1983 while developing a common keyboard for all Indian languages on the basis of GIST technology. It was first demonstrated publicly in 1983 at the International Hindi Conference held in New Delhi. Fitting all Indian languages onto a QWERTY keyboard was a complex task in itself, but because Indian scripts are phonetic in nature as a result of their descent from Brahmi, it was accomplished in a highly systematic way. A common phonetic keyboard and a common code were developed for all Indian languages on the basis of GIST technology.
· भारतीय लिपियों के लिए ISCII कोड और इन्स्क्रिप्ट
भारतीय लिपियाँ अपने स्वरूप में अक्षरात्मक हैं,लेकिन उनका स्वरूप ध्वन्यात्मक है औरब्राह्मी लिपि से उद्भव के कारण उनकी विरासत भी एक ही है. कुछ लिपियों में मामूली-सा अंतर होने के कारण कतिपय लिपियों में कुछ अक्षर अतिरिक्त हैं और कुछ कम. 1986-88 में विकसित ISCII (Indian Standard Code for Information Interchange) कोड में परिवर्धित देवनागरी के अंतर्गत इस पक्ष का भी ध्यान रखा गया और इसे भारतीय मानक ब्यूरो ने मानक के रूप में स्वीकार कर लिया, लेकिन जब कंप्यूटर के उपयोग का सवाल आया तो भारतीय भाषाओं में डेटा प्रविष्टि के अनेक विकल्प सामने थे और यही चिंता की बात थी. भारतीय भाषाओं में डेटा प्रविष्टि के लिए डिफ़ॉल्ट विकल्प INSCRIPT(INdian SCRIPT) लेआउट है. इस लेआउट में मानक 101 कुंजीपटल का उपयोग करता है.वर्णों की मैपिंग इस प्रकार से की गई है कि यह सभी भारतीय भाषाओं (बाएँ से दाईं ओर लिखी जाने वाली) के लिए समान कुंजीपटल बन जाता है.इसका प्रमुख कारण यही है कि भारतीय भाषाओं के वर्णों का सेट समान है. हम भारतीय भाषाओं की वर्णमाला के वर्णों को व्यंजन,स्वर,अनुनासिक और संयुक्ताक्षरों में विभाजित कर सकते हैं.प्रत्येक व्यंजन विशिष्ट ध्वनि और स्वर का संयोजन होता है. स्वर शुद्ध ध्वनियों को दर्शाता है. अनुनासिक वे नासिक्य ध्वनियाँ होती हैं,जिनका उच्चारण स्वर के साथ किया जाता है. संयुक्ताक्षर दो या अधिक वर्णों का संयोजन होता है.भारतीय भाषाओं की वर्णमाला की तालिका को स्वर और व्यंजन में विभाजित किया जाता है. स्वर दो प्रकार के होते हैं,दीर्घ और लघु. व्यंजनों को अनेक वर्गों में विभाजित किया जाता है. INSCRIPT लेआउट में यह व्यवस्था प्रतिबिंबित होती है. 
इसीलिए इसकी व्यवस्था बहुत सरल होती है.इन्स्क्रिप्ट ले आउट में सभी स्वरों को कुंजीपटल के बाईं ओर रखा गया है और व्यंजनों को दाईं ओर.यह व्यवस्था इसप्रकार से की गई है कि प्रत्येक वर्ग को दो कुंजियों में विभाजित कर दिया गया है.इस प्रकार इन भाषाओं के समान अकारादि क्रम के कारण ही सभी भारतीय भाषाओं के लिए समान कुंजीपटल और समान कोड विकसित किया जा सका है और सभी भारतीय भाषाओं के लिए समान कोडिंग के कारण ही भारतीय लिपियों में परस्पर लिप्यंतरण की सुविधा भी संपन्न हो पाई है.चूँकि ISCII में रोमन लिपि को भी समाहित किया गया है,इसलिए इंडिक लिपियों से रोमन लिपि में भी लिप्यंतरण किया जा सकता है.

· भारतीय भाषाओं के शब्दपरक,वाक्यपरक और अर्थपरक लक्षणों में समानता
इंडिक लिपियों के बाद आइए अब हम भारतीय भाषाओं के शब्दपरक,वाक्यपरक और अर्थपरक लक्षणों की समानता पर विचार करें.यह सही है कि भारतीय भाषाओं ने संस्कृत से काफ़ी मात्रा में शब्द उधार लिए हैं और इनका प्रयोग समान रूप से सभी भारतीय भाषाओं में कमोबेश किया जाता है. भारतीय भाषाओं में इस अखिल भारतीय स्वरूप के कारण ही संस्कृत इन सभी भारतीय भाषाओं की साझी विरासत है.प्रमुखत:संस्कृत में रचे गए आयुर्वेद,गणित और ज्योतिष शास्त्रों की शब्दावली बहुत हद तक समान रूप से सभी भारतीय भाषाओं में प्रयुक्त होती है. विभिन्न भारतीय भाषाओं के लिए तकनीकी शब्दावली को अंतिम रूप देते हुए भारत सरकार ने ये निर्देश जारी किए थे कि विभिन्न भारतीय भाषाओं के लिए नई पारिभाषिक शब्दावली का निर्माण करते समय यह ध्यान रखा जाए कि ये शब्द प्रमुखत:संस्कृत से ही व्युत्पन्न किए जाएँ और यही कारण है कि भारतीय भाषाओं की पारिभाषिक शब्दावली में बहु हद तक काफ़ी समानता है. भारतीय भाषाओं के लिए विभिन्न भाषिक उपकरण विकसित करते समय कंप्यूटर वैज्ञानिक यह देखकर दंग रह गए कि भारतीय भाषाओं में केवल शाब्दिक स्तर पर ही नहीं,वाक्यविन्यास और अर्थ की संरचना के स्तर पर भी काफ़ी समानता है. वस्तुत: उन्हें तो ऐसा लगा कि 1650 से अधिक भाषाओं और बोलियों के बावजूद भारत एकभाषिक क्षेत्र है.हिंदी और अन्य भारतीय भाषाओं के लिए विभिन्न भाषिक उपकरण विकसित करते समय अभिकलनात्मक भाषाविज्ञान (Computational Linguistics) के क्षेत्र में काम करने वाले कंप्यूटर वैज्ञानिकों को ऐसे अनेक क्षेत्र मिले,जिनमें काफ़ी समानता थी.

भारतीय भाषाओं के भाषावर्ग विशिष्ट लक्षण
यदि आप भाषाओं के मूल की ओर दृष्टि डालें तो पाएँगे कि विश्व-भर की भाषाओं में दो स्पष्ट लक्षण दिखाई पड़ते हैं: सार्वभौमिक लक्षण और भाषा-विशिष्ट लक्षण. सार्वभौमिक लक्षण वे हैं जो हिंदी, तमिल, अंग्रेज़ी, चीनी और अरबी जैसी विभिन्न भाषा-परिवारों से जुड़ी भाषाओं में भी समान रूप से पाए जाते हैं. उदाहरण के लिए, 'खाया' एक सकर्मक क्रिया है, जो खाए जाने के लिए एक कर्म और खाना क्रिया संपन्न करने के लिए एक कर्ता की आकांक्षा करती है और साथ ही यह भी अपेक्षा करती है कि उसका कर्ता सजीव हो. इस क्रिया की वृक्ष संरचना में ये सभी सार्वभौमिक लक्षण दिखाई पड़ते हैं.इसप्रकार इसकी सकर्मकता विश्व की सभी भाषाओं में समान है,लेकिन कर्ता के साथ 'ने' का प्रयोग हिंदी का भाषा-विशिष्ट लक्षण है. कुछ ऐसे भी लक्षण होते हैं जो भाषा-वर्ग विशिष्ट होते हैं. उदाहरण के लिए, एक विशेष वाक्य साँचे में कर्ता के साथ 'को' का प्रयोग सभी भारतीय भाषाओं में समान रूप से पाया जाता है; जैसे, हिंदी में 'राम को बुखार है', मराठी में 'रामला ताप आहे' ,तमिल में, 'रामक्कु ज्वरम्' ,मलयालम में ' रामन्नु पनियानु ' ,कन्नड़ में ‘रामनिगे ज्वर दिगे' , बँगला में 'रामेर ताप आछे' और अंग्रेज़ी में इसका अनुवाद होगा, 'Ram has a fever’. अंग्रेज़ी में आप देखेंगे कि 'को' का वाचक कोई परसर्ग या पूर्वसर्ग नहीं है. इससे स्पष्ट होता है कि भारत एकभाषिक क्षेत्र है. यदि हम इन लक्षणों के विश्लेषण के लिए भाषा प्रौद्योगिकी का उपयोग करें तो हम भारतीय भाषाओं में कंप्यूटर साधित स्वयं भाषा शिक्षक, ऑटो-करेक्ट, ग्रामर चैकर और मशीनी अनुवाद जैसे अत्यंत जटिल भाषिक उपकरणों का विकास भी कर सकते हैं.

· भारतीय भाषाओं के लिए समान भाषिक उपकरण
भारतीय भाषाओं की समानता के कारण ही हिंदी और अन्य भारतीय भाषाओं में मशीनी अनुवाद प्रणाली जैसे भाषिक उपकरणों के विकास का मार्ग प्रशस्त हुआ.अनुसारक नामक मशीनी अनुवाद प्रणाली, पाणिनीय व्याकरण पद्धति पर आधारित है और इसमें 5 भारतीय भाषा-युग्मों के अनुवाद की प्रणाली विकसित की गई है और इसका विकास IIIT,हैदराबाद के निदेशक डॉ.राजीव संगल के मार्गदर्शन में किया गया है. TAG (Tree Adjoining Grammar) अर्थात् वृक्ष संलग्न व्याकरण नाम से दूसरी कलनविधि का विकास पेन्सिल्वेनिया विश्वविद्यालय (अमरीका) के कंप्यूटर विज्ञान विभाग के अध्यक्ष और प्रोफ़ेसर अरविंद जोशी ने किया था. TAG हिंदी और अंग्रेज़ी जैसी भिन्न वाक्य संरचना वाली भाषाओं के पदनिरूपण (पार्सिंग) के लिए काफ़ी उपयुक्त मानी गई है. एक ओर अंग्रेज़ी स्थिर शब्द क्रम की भाषा है, वहीं हिंदी इसके ठीक विपरीत अपेक्षाकृत मुक्त शब्द क्रम की भाषा है .उदाहरण के लिए यदि आप अंग्रेज़ी के इस वाक्य के शब्द क्रम को बदल दें तो अर्थ का अनर्थ हो सकता है अर्थात् अर्थ पूरी तरह से बदल जाएगा. "Ram (कर्ता) killed (क्रिया) Ravan (कर्म)" का क्रम बदलकर इस प्रकार कर दें, "Ravan (कर्ता) killed (क्रिया) Ram (कर्म)" तो अर्थ पूरी तरह से बदल जाता है, लेकिन हिंदी में यदि क्रम बदल भी जाए तो भी अर्थ ज्यों का त्यों ही रहेगा. "राम (कर्ता) ने रावण (कर्म) को मारा (क्रिया)". "रावण (कर्म) को राम (कर्ता) ने मारा (क्रिया)". TAG में अंग्रेज़ी और हिंदी दोनों ही भाषाओं का पदनिरूपण क्रिया के आधार पर किया जाता है और अंग्रेज़ी को SVO के रूप में और हिंदी को SOV के रूप में पदनिरूपित कर देता है. इस परियोजना का प्रयोग-क्षेत्र प्रशासनिक भाषा था. प्रशासनिक भाषा के लक्षण सभी भाषाओं में लगभग समान हैं. उदाहरण के लिए प्रशासनिक भाषा में कर्मवाच्यपरक भूतकालिक कृदंतों का प्रयोग बहुतायत से किया जाता है. "Mr.Verma has been transferred from Delhi to Mumbai with effect from March1, 2005 and posted as Director (Operations)".किंतु हिंदी का भाषाविशिष्ट लक्षण यह है कि इसमें कर्ता के साथ आदरसूचक शब्द श्री या जी लगाने से इसका प्रयोग बहुवचन में किया जाता है और तदनुसार क्रिया भी बहुवचन में बदल जाती है. श्री वर्मा निदेशक हो गए (बहुवचन). 
इन उदाहरणों से मैं यही स्पष्ट करना चाहता था कि पार्सर बनाते समय यदि भाषा-विशिष्ट लक्षणों पर ध्यान नहीं दिया गया तो मशीनी अनुवाद जैसे भाषिक उपकरणों का सफलतापूर्वक विकास नहीं किया जा सकता.यह पार्सर अंग्रेज़ी, फ्रेंच, मराठी, जापानी और चीनी जैसी विभिन्न भाषा परिवारों के भाषा-विशिष्ट लक्षणों के विश्लेषण के लिए बहुत उपयुक्त पाया गया.

यह आवश्यक नहीं है कि समान देवनागरी लिपि होने के कारण हिंदी और मराठी जैसी दोनों भाषाओं के लिपिविशिष्ट लक्षणों में भी पूरी समानता हो..

देवनागरी लिपि की साझी विरासत होने पर भी कुछ भारतीय भाषाओं में समानता के बावजूद ऑटो करेक्ट जैसे भाषिक उपकरणों का विकास करते समय कुछ तथ्यों को ध्यान में रखना बहुत आवश्यक है.हिंदी और मराठी की समान लिपि देवनागरी होने के बावजूद ऑटो करेक्ट के संदर्भ में दोनों भाषाओं के भाषाविशिष्ट लक्षणों में काफ़ी असमानताएँ भी हो सकती है. उदाहरण के लिए, देवनागरी लिपि का प्रयोग हिंदी और मराठी दोनों ही भाषाओं के लिए किया जाता है,लेकिन लिपि एक होने के बावजूद इन दोनों भाषाओं की वर्तनी में काफ़ी फ़र्क है. यहाँ तक कि संस्कृत से लिए गए तत्सम शब्दों की वर्तनी में भी भिन्नता पाई जाती है. उदाहरण के लिए, हिंदी के इकारांत शब्द मराठी में ईकारांत हो जाते हैं. हिंदी का कवि मराठी में कवी हो जाता है. इसलिए दोनों भाषाओं का ऑटो करेक्ट भी अलग- अलग होना चाहिए. इसलिए हिंदी के ऑटो करेक्ट के लिए मैंने हिंदीभाषी क्षेत्रों और अहिंदीभाषी क्षेत्रों से वर्तनी संबंधी त्रुटियों के नमूने इकट्ठे करने शुरू कर दिए और पाया कि मातृभाषा के व्याघात के कारण मराठीभाषी और पंजाबीभाषी की हिंदी संबंधी अशुद्धियों में काफ़ी अंतर है. यदि मराठीभाषी और गुजरातीभाषी हिंदी में छोटी और बड़ी मात्रा की अशुद्धि करते हैं तो दक्षिण भारतीय भाषा भाषी महाप्राण की ध्वनि में अशुद्धि करते हैं. वे भाषा को बाषा और खाना को काना लिखते हैं.यह मातृभाषा व्याघात के कारण होता है. विश्व भर की सभी भाषाओं के बीच असंगतता (नॉन कॉम्पेटिबिलिटी),बहुविध फ़ॉन्ट और अलग-अलग ऑपरेटिंग सिस्टम संबंधी समस्याओं का एकमात्र समाधान है,युनिकोड.
वर्तमान परिदृश्य में, इंडिक भाषाओं के अधिकांश उपयोगकर्ता सिस्टम और फ़ॉन्ट की असंगतता के कारण आज भी अमानक फ़ॉन्ट का उपयोग कर रहे हैं और ई-मेल,गपशप(चैट),टैम्पलेट,ऑटो टेक्स्ट,थिसॉरस,स्पेलचैक जैसे अनुप्रयोगों का इंडिक भाषाओं में उपयोग करने में हिचकिचाते हैं. बहुत ही कम उपयोगकर्ता ऐसे हैं जो हिंदी और अन्य भारतीय भाषाओं में ऐक्सेल और ऐक्सेस का उपयोग करते हैं. इंडिक भाषाओं के उपयोगकर्ता भी बहुत कम हैं. इसका मुख्य कारण यह था कि इंडिक भाषाओं में विभिन्न सिस्टमों के आरपार कोई समान मानक नहीं था.इस दिशा में ISCII एक अच्छी शुरुआत थी,लेकिन विश्वीकरण के इस युग में विविध प्रकार के प्लेटफ़ॉर्म,फ़ॉन्ट और सिस्टम के बावजूद आवश्यकता एक ऐसे मानक की है,जिसके अंतर्गत विश्व की सभी भाषाएँ सह-अस्तित्व की भावना के साथ रह सकें.इन समस्याओं का एकमात्र समाधान है,युनिकोड.इसलिए हमारा प्रयास यह होना चाहिए कि इंडिक भाषाओं के उपयोगकर्ताओं को युनिको़ड में भाषा कंप्यूटिंग के लाभों से अवगत कराया जाए. युनिको़ड में भारतीय भाषाओं को ISCII के आधार पर ही एन्कोड किया गया है.
सभी भारतीय भाषाओं को युनिको़ड के वर्ण चार्ट में साथ-साथ रखा गया है.प्रत्येक भाषा को एक कोडपेज दिया गया है.इंडिक भाषा के कोडपेज में 128 कोड पॉइंट्स के ब्लॉक हैं. युनिकोड में सहेजे गए पाठ का प्रदर्शन ओपन टाइप फ़ॉट्स द्वारा किया जाता है,जिसका विकास ऐडोब और माइक्रोसॉफ़्ट द्वारा संयुक्त रूप में किया गया है. ओपन टाइप फ़ॉन्ट एक खुला मानक है और यह किसी कम्पनी विशेष की मिल्कियत नहीं है.
प्रत्येक भारतीय भाषा का सॉर्टिंग ऑर्डर अलग-अलग है,भले ही कुछ भाषाओं की लिपि एक ही क्यों न हो. ISCII और युनिकोड में यह एक महत्वपूर्ण अंतर है. ISCII के अंतर्गत सभी भारतीय भाषाओं के लिए समान सॉर्टिंग ऑर्डर रखा गया है.वस्तुत: यह सच नहीं है,क्योंकि प्रत्येक इंडिक लिपि में कुछ अक्षरों के स्तर पर कुछ न कुछ अंतर अवश्य है.कहीं कोई अक्षर अधिक है तो कहीं कोई अक्षर कम. युनिकोड सभी भाषाओं और अनुप्रयोगों को अपने तरीके से सॉर्टिंग करने की आज़ादी देता है.

· यदि आप दुनिया के साथ चलना चाहते हैं तो युनिकोड अपनाएँ.

विश्व भर में सूचनाओं के विनिमय के लिए युनिकोड एक मानक बनता जा रहा है,क्योंकि विश्व की सभी प्रमुख IT कंपनियों ने युनिकोड को मानक मानकर उसे अपना समर्थन देने की घोषणा कर दी है.युनिकोड मानक विश्व भर की भाषाओं के सभी वर्णों को एन्कोड करने में सक्षम है. युनिकोड मानक, वर्णों और उनके प्रयोग की जानकारी प्रदान करता है. युनिकोड मानक ,कंप्यूटर के बहुभाषी पाठों के उपयोगकर्ताओं, व्यापारियों, भाषावैज्ञानिकों,
अनुसंधानकर्ताओं,गणितज्ञों और तकनीशियनों के लिए बहुत उपयोगी है..
युनिकोड 16 बिट की एन्कोडिंग का उपयोग करता है,जिसमें 65000 से अधिक वर्ण (65536) होते हैं. युनिकोड मानक प्रत्येक वर्ण को विशिष्ट संख्या और नाम प्रदान करता है.इसके विपरीत ISCII में 8 बिट कोड का उपयोग किया जाता है,जो 7 बिट ASCII कोड का ही विस्तार है.इसमें ब्राह्मी लिपि से उद्भूत भारतीय लिपियों के 10 मूल अकारादि वर्णों का ही समावेश किया गया है. परंपरागत रूप में कंप्यूटर के अनु्प्रयोग केवल एक भाषा के पाठ का ही समावेश करते रहे हैं.बाद में इस बात की आवश्यकता महसूस की गई कि एक साथ अनेक भाषाओं के पाठों पर काम किया जाए और इसी कारण से कोड के संबंध में अतिरिक्त साधन जुटाए गए.अलग-अलग भाषाओं के अक्षर अलग-अलग भाषाओं के संदर्भ में सामान्यत:अपने कोड के आधार पर नहीं पहचाने जा सकते जब तक कि उन्हें अपनी रेंज के कोड के लिए कोई विशिष्ट संख्यात्मक मान न प्रदान किया जाए. इसलिए अंग्रेज़ी में "a" अक्षर को दिया गया कोड वास्तव में वही है,जो ग्रीक अक्षर "alpha" को या क्रिलिक वर्णमाला के समतुल्य अक्षर को प्रदान किया गया है.विभिन्न भाषाओं में लिखित बहुभाषिक पाठ के दस्तावेज़ को तब तक एक नहीं माना जा सकता,जब तक कि कोई ऐसा तंत्र विकसित न कर लिया जाए,जो विशिष्ट भाषा / लिपि के पाठ को विशेष रूप से चिह्नित न कर सके.
युनिकोड का मूल आधार यही है कि अधिक से अधिक रेंज की 0 से लगभग 65000 तक की संख्याओं के कोड निर्धारित किए गए. इस विशाल सेट में न केवल विश्व की सभी अलग-अलग भाषाओं की वर्णमालाओं के अक्षरों को समाविष्ट किया गया,बल्कि उनके विराम चिह्नों, गणित के प्रतीकों के समान विशेष आकारों को और चलमुद्रा के प्रतीकों को भी समाविष्ट किया गया.इस विशाल रेंज में अलग-अलग भाषाओं की प्रत्येक लिपि के लिए 128 क्रमिक संख्याओं के वर्ग निर्धारित किए जाते हैं और उसमें विशेष प्रतीकों के समूह भी समाविष्ट किए जाते हैं.अनेक भाषाओं की वर्णमाला का आकार 50 से भी कम होता है, इसलिए 128 की न्यूनतम रेंज अतिरिक्त प्रतीकों, विरामचिह्नों आदि के लिए काफ़ी पर्याप्त मानी जाती है.
युनिकोड की महत्वपूर्ण संकल्पना तो यही है कि किसी भी भाषा के लिए कोड का निर्धारण उसकी भाषिक आवश्यकताओं के आधार पर ही किया गया है.इसप्रकार अपनी लेखन प्रणाली में अपनी वर्णमाला के अक्षरों का उपयोग करने वाली विश्व की अधिकांश भाषाओं के संदर्भ में यदि विशेष प्रतीकों के साथ-साथ उनके सभी अक्षरों का समावेश हो जाता है तो उनकी मूल भाषिक आवश्यकताओं की पूर्ति हो ही जाती है.इनपुट स्ट्रिंग और प्रदर्शित स्ट्रिंग, जो अधिकांश भाषाओं / लिपियों के लिए समान ही हैं, के लिए निर्धारित युनिकोड मान में अपने अक्षरों की पहचान करते हुए पाठ का प्रदर्शन होता जाएगा.इसप्रकार किसी भी भाषा के युनिकोड फ़ॉन्ट के लिए वर्णमाला के अक्षरों से संबंधित केवल ग्लिफ़ को ही समाहित करने की आवश्यकता होती है और फ़ॉन्ट के ग्लिफ़ की पहचान उनको दर्शाने वाले अक्षरों के लिए प्रयुक्त उसी कोड से हो जाती है. युनिकोड पाठों के प्रदर्शन और भाषिक प्रोसेसिंग के संदर्भ में बहुभाषी जानकारी को बहुत प्रभावी रूप में सँभालता है.यहाँ पर विश्व की विभिन्न भाषाओं का एक ऐसा उदाहरण दिया गया है,जिसे युनिकोड में सहेजकर संबंधित भाषाओं में प्रदर्शित किया गया है.

GLOBAL INTEGRATION OF INDIANS THROUGH INDIC LANGUAGE COMPUTING

Vijay K.Malhotra

Former Director (Official Languages), Ministry of Railways, Govt. of India, New Delhi (India)
E-mail ID: malhotravk@gmail.com

ABSTRACT

It is no coincidence that the developed countries are predominantly monolingual, whereas the developing countries are multilingual. In the so-called developed world, two languages are considered a nuisance and many languages absurd. But the keyword for the multilingual, multi-ethnic and multi-cultural society of India is Unity in Diversity. This is also true of the languages and scripts of India, even though some Indian languages and scripts look so different from each other that it is difficult to locate the underlying thread that connects them. It is almost impossible to believe that all the scripts of Indian languages (except Urdu) originated from the common source of the Brahmi script. The computer scientists of IIT, Kanpur realized this fact in 1971-72 while developing a common keyboard for all Indian languages based on GIST technology. The Pan-Indian character of Indian languages also unites and integrates Indians through this Indian ethos of Unity in Diversity. While developing various language tools for Indian languages, the computer scientists were amazed to see the commonality amongst Indian languages not only in terminology but also in syntax, morphology and semantics, which paved the way for language tools such as a common code, a common keyboard and a common formalism based on the Paninian framework.

INTRODUCTION
Language plays an important role in the gamut of politics. A cursory glance at world history is sufficient to show that it is no coincidence that the developed countries are predominantly monolingual, whereas the developing countries are multilingual, multi-ethnic and multi-cultural. In the so-called developed world, two languages are considered a nuisance and many languages absurd. In fact, the management of diversity is the greatest challenge before humanity today. The democratic world stands divided, its nations isolated from one another, because the nation state has been built around a monolingual mindset. In Canada, the mutual intolerance of French and English speakers has brought Quebec to the verge of secession. Pakistan and Bangladesh parted company when insensitive politicians imposed Urdu as the sole official language. Recognition of many languages not only permits the free flow of communication, but also leads to the empowerment of people at the grassroots. Linguistic diversity is as much in need of protection as bio-diversity. Like Chinese and English, Hindi also subsumes a number of dialects under its generic name.

Discussion/Proposal
Ø All Indic languages are National languages.
If you look at the debate of the Constituent Assembly of India on Sept 14, 1949, you will find that the resolution to declare Hindi the Official Language of the Republic of India was moved by none other than a Tamil leader, Gopalaswami Ayyangar, and that this resolution was passed unanimously without any abstention or opposition, but a clause was also added to ensure the development of other Indian languages along with Hindi. Accordingly, all Indian languages including English and Sanskrit were declared National Languages and were included in the Constitution of India under Schedule 8; the number of such languages has now gone up to 22. Article 351 also envisages developing Hindi as the language representing the Composite Culture of India by assimilating the forms, styles and expressions of the other National languages enumerated in the 8th Schedule of the Constitution of India.
Ø Unity in Diversity
Ø India is a Linguistic Zone
Ø Common Code and Common Keyboard for all Indic scripts
Unity in diversity is the keyword for the multilingual, multi-ethnic and multi-cultural society of India. Although over 1650 languages and dialects are prevalent across the country, India is a Linguistic Zone. This is also true of the languages and scripts of India, even though some Indian languages and scripts look so different from each other that it is difficult to locate the underlying thread that connects them. For example, the scripts of the Aryan and Dravidian languages look so different that it is almost impossible to believe that all Indic scripts (except that of Urdu) originated from the common source of the Brahmi script. By the time of Ashoka, we find Brahmi used extensively in both the North and the South of India. Historically, this was established in 1837, but the computer scientists of IIT, Kanpur realized the fact in practice in 1983 while developing a common keyboard for all Indian languages based on GIST (Graphics and Intelligence based Script Technology). It was first demonstrated at the International Hindi Conference held in 1983 at New Delhi. The stupendous task of accommodating all the Indic scripts on a QWERTY keyboard could be accomplished scientifically because of the phonetic nature of the Indic scripts derived from Brahmi. GIST supported a common phonetic overlay and a common code for all Indian languages.
Ø ISCII code and INscript keyboard for Indian languages.
Indian scripts are syllabic in nature, but their alphabets are phonetic, and because the scripts evolved from the common source of Brahmi, they share a common heritage. There are a few variations: certain scripts have a few additional letters and others a few fewer. This aspect was incorporated into the ISCII (Indian Standard Code for Information Interchange) code evolved in 1986-88, which was accepted as the standard by the Bureau of Indian Standards. When it comes to the use of computers, the many options available for data entry are a major concern. For data entry in Indian languages, the default option is the INSCRIPT (INdian SCRIPT) layout. This layout uses the standard 101-key keyboard. The mapping of the characters is such that it remains common for all the Indian languages (written left to right). This is possible because the basic character set of the Indian languages is common. We can divide the characters of the Indian language alphabets into Consonants, Vowels, Nasals and Conjuncts. Every consonant represents a combination of a particular sound and a vowel. The vowels are representations of pure sounds. The Nasals are characters representing nasal sounds pronounced along with vowels. The conjuncts are combinations of two or more characters. The Indian language alphabet table is divided into Vowels (Swar) and Consonants (Vyanjan). The vowels are divided into long and short vowels, and the consonants are divided into vargs. The INSCRIPT layout takes advantage of these observations, and its organization is therefore simple. In the Inscript keyboard layout, all the vowels are placed on the left side of the keyboard and the consonants on the right side. The placement is such that the characters of one varg are split over two keys. This is how the common keyboard and the common code for all Indian languages evolved out of the common alphabetic order of these languages. The common coding for all Indian languages also made mutual transliteration amongst Indic scripts possible. Since ISCII also includes the Roman script along with the Indic scripts, transliteration is also possible from Indic scripts to Roman script.
Ø Commonality amongst the Lexical, Syntactic and Semantic features of Indian languages
After Indic scripts, let us now talk about the commonality amongst the Lexical, Syntactic and Semantic features of Indian languages. It is true that most of the Indian languages have borrowed extensively from Sanskrit and that these words are commonly used, to a greater or lesser degree, in all Indian languages. This Pan-Indian character unites Indian languages through the common bond and heritage of Sanskrit. This is especially true of the vocabulary of Ayurveda, GaNit and Jyotish shastraas, which were written primarily in Sanskrit, but the lexical features of other domains are also common to a great extent. While finalizing the technical terms in various Indian languages, the Govt of India issued directives to ensure that the technical terminology coined in various Indian languages should be derived primarily from Sanskrit, and this is the reason that this terminology is common to a great extent. While developing various language tools for Indian languages, the computer scientists were amazed to see the commonality amongst Indian languages not only in lexical features but also in syntactic and semantic features. In fact, they found that in spite of India having over 1650 languages and dialects prevalent across the country, it is one Linguistic Zone. While developing various tools in Hindi and other Indian languages, the computer scientists working in the field of Computational Linguistics found many such areas of commonality amongst Indian languages and scripts.
Ø Language group specific features of Indian languages
If you look at the core of the languages, you will find two distinct kinds of features in any language of the world: universal features and language-specific features. Universal features are those that are common even to languages belonging to extremely different language families across the globe, such as Hindi, Tamil, English, Chinese and Arabic. For example, 'khaayaa' (ate) is a transitive verb, which requires an object to be eaten and a subject who eats. It also requires its subject to be animate. The tree structure of this verb and its universal features, such as transitivity, are common to all languages of the world, but the use of 'ne' with the subject of this verb is a language-specific feature of Hindi. There are also certain features which are language group-specific. For example, the use of 'ko' with the subject of a specific sentence pattern is common to all languages in India, e.g. 'Raam Ko Bukhaar Hai' in Hindi, 'Raamalaa Taap Aahe' in Marathi, 'Ramakku Jwaram' in Tamil, 'Raamannu Paniyaa Nu' in Malayalam, 'Ramanige Jvar Ide' in Kannada and 'Ramer Taap Aachhe' in Bengali, but in English it is translated as 'Ram has a fever'. Here you will notice the conspicuous absence of any preposition or postposition corresponding to 'ko'. This shows that India is one linguistic zone. If we make use of computer technology to analyze these features, we can come out with the most sophisticated language tools such as a Language Tutor, Auto Correct, a Grammar Checker and even machine translation systems in Indian languages.

Ø Common Language Tools for Indian languages
The commonality amongst Indian languages also paved the way for developing language tools such as machine translation systems in Hindi and other Indian languages. The Anusaaraka machine translation system for 5 pairs of Indian languages, based on a computational Paninian framework, has been developed under the guidance of Dr. Rajeev Sangal, the Director of the International Institute of Information Technology, Hyderabad. Another algorithm, named TAG (Tree Adjoining Grammar), has been developed by Prof. Aravind Joshi, Head of the Computer Science Department of the University of Pennsylvania, USA. This algorithm is quite suitable for parsing languages with different syntactic structures, such as Hindi and English. English is a positional language, whereas Hindi is a language of relatively free word order. For example, if you change the word order of the English sentence "Ram (Subject) killed (verb) Ravan (Object)" by exchanging Subject and Object, giving "Ravan (Subject) killed (verb) Ram (Object)", the meaning changes completely, but in Hindi the meaning remains the same even after the word order is changed: "Raam (Subject) ne RaavaN (Object) ko maaraa (verb)" and "RaavaN (Object) ko Raam (Subject) ne maaraa (verb)". TAG handles both languages on the basis of the verb and parses English as SVO and Hindi as SOV. The domain selected for this project was Officialese (the language used in administration). The features of Officialese are almost the same across languages. For example, the use of passive past participles is quite common in Officialese: "Mr. Verma has been transferred from Delhi to Mumbai with effect from March 1, 2005 and posted as Director (Operations)". But a typical language-specific feature of Hindi is the honorific. In English, it is enough to use Mr. before the name of the person to show respect, and the verb remains singular. In Hindi, when Shri or ji is used with the subject, the subject is treated as plural and the verb changes into the plural accordingly: Shree Verma nideshak ho gaye (plural).
These examples make it clear that unless language-specific features are addressed while developing the parser, language tools such as machine translation systems cannot be developed successfully. This parser has been found quite useful for analyzing language-specific features in Hindi and in other languages of the world belonging to different language families, such as English, French, Marathi, Japanese and Chinese.
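The contrast between positional English and relatively free word order Hindi can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions, not an implementation of TAG or of the parser described above: the function names `hindi_roles` and `english_roles` are hypothetical, the sentences are the romanized examples from the text, and the Hindi function simply assumes that the word before 'ne' is the subject, the word before 'ko' is the object, and the last word is the verb.

```python
def hindi_roles(sentence):
    """Assign roles from the postpositions 'ne' and 'ko' (toy rule, not TAG)."""
    tokens = sentence.split()
    roles = {"verb": tokens[-1]}          # assumption: the verb comes last (SOV)
    for word, marker in zip(tokens, tokens[1:]):
        if marker == "ne":
            roles["subject"] = word       # the word marked by 'ne' is the subject
        elif marker == "ko":
            roles["object"] = word        # the word marked by 'ko' is the object
    return roles


def english_roles(sentence):
    """Assign roles purely by position in a three-word SVO sentence."""
    subject, verb, obj = sentence.split()
    return {"subject": subject, "verb": verb, "object": obj}


# Reordering the Hindi sentence leaves the roles untouched...
same = hindi_roles("Raam ne RaavaN ko maaraa") == hindi_roles("RaavaN ko Raam ne maaraa")
# ...whereas reordering the English sentence swaps subject and object.
swapped = english_roles("Ram killed Ravan") != english_roles("Ravan killed Ram")
print(same, swapped)
```

The point of the sketch is the design difference: the Hindi rule reads the case markers and ignores position, while the English rule has nothing but position to go on.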

Ø Script-specific features are not necessarily common to two languages, such as Hindi and Marathi, that share the same Devanagari script.

In spite of the commonality amongst the few Indian languages that share the Devanagari script, one has to keep in mind while developing tools like Auto Correct that spelling is not a script-specific feature but a language-specific one. For example, the Devanagari script is used for both Hindi and Marathi, but their spelling conventions are quite different. Even the words commonly derived from Sanskrit are spelt differently in the two languages. Most of the words ending with short "i" in Hindi are spelt with long "ii" in Marathi. For example, kavi is spelt in Hindi with short "i", but in Marathi it is spelt as kavee with long "ii". How can there be a common Auto Correct for both Hindi and Marathi? Besides, one can collect samples from various Hindi-speaking regions as well as non-Hindi-speaking regions to understand the pattern of errors committed by speakers of different languages. The errors committed by Marathi speakers are quite different from those of Punjabi speakers. Since Hindi is used in most parts of India, there is a lot of variation in its pattern of errors. If Marathi and Gujarati speakers confuse long "ii" and short "i", the speakers of South Indian languages make mistakes with aspirated sounds. They write bhaaShaa as baaShaa and khaanaa as kaanaa. This is because of mother tongue interference.
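A minimal sketch of why Auto Correct has to be language-specific rather than script-specific is given below. Everything in it is an illustrative assumption: the tiny word lists `HINDI` and `MARATHI`, the two substitution tables and the function names are hypothetical stand-ins, not the author's actual tool or data. The same correction routine is run against two different lexicons, so the same Devanagari string is accepted in one language and corrected in the other.

```python
# Substitutions modelled on the error patterns described above (illustrative only).
VOWEL_LENGTH = {"ि": "ी", "ी": "ि", "ु": "ू", "ू": "ु"}   # short <-> long matra
ASPIRATE = {"क": "ख", "ग": "घ", "च": "छ", "ज": "झ", "ट": "ठ",
            "ड": "ढ", "त": "थ", "द": "ध", "प": "फ", "ब": "भ"}  # unaspirated -> aspirated


def variants(word):
    """Yield every word that differs from `word` by exactly one substitution."""
    for i, ch in enumerate(word):
        for table in (VOWEL_LENGTH, ASPIRATE):
            if ch in table:
                yield word[:i] + table[ch] + word[i + 1:]


def autocorrect(word, lexicon):
    """Return `word` if the lexicon knows it, else the first known variant."""
    if word in lexicon:
        return word
    for candidate in variants(word):
        if candidate in lexicon:
            return candidate
    return word  # no suggestion: leave the word unchanged


# Hypothetical miniature lexicons; a real tool would use full word lists.
HINDI = {"कवि", "भाषा", "खाना"}
MARATHI = {"कवी", "भाषा"}

hindi_fix = autocorrect("कवी", HINDI)      # long ii is an error in Hindi
marathi_fix = autocorrect("कवि", MARATHI)  # short i is an error in Marathi
```

With these lexicons the Marathi spelling of "poet" is corrected to the Hindi one and vice versa, and the deaspirated spellings mentioned above are restored to the aspirated Hindi forms. A real system would also need frequency data, since a deaspirated form can itself be a valid word.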

Ø UNICODE is the answer to the problems of non-compatibility, multiple fonts and differing Operating Systems across all languages of the world.

In the present scenario, most users of Indic languages are still using non-standard fonts, and because of the non-compatibility across systems and fonts, they do not attempt to use applications such as e-mail, chat, templates, auto text, thesaurus and spell checkers in these languages. Very few users attempt data applications such as Excel and Access in Hindi. PowerPoint is also not commonly used in Indic languages. This is because there was no common standard for Indic languages across systems. ISCII was a good beginning in this direction, but in the world of globalization we need a global standard in which all languages of the world can co-exist irrespective of platforms, fonts and systems. UNICODE is the answer to all these problems. Hence our endeavor should be to make Indic language users aware of the advantages of language computing in Unicode. As far as Indian languages are concerned, ISCII has been taken as the basis for encoding in Unicode. All Indian languages are grouped together in the Unicode character chart. Each language is given a codepage. Indic language codepages contain blocks of 128 code points. Rendering of Unicode text is handled by OpenType fonts, a joint initiative of Adobe and Microsoft. OpenType is an open standard and not the property of any one company. The sort order is different for each Indian language, even for languages that use the same script. This is an important difference from ISCII, which uses one sorting order for the whole of India, an assumption that does not hold in reality: the sorting order differs slightly amongst Indic scripts because of slight variations, that is, the addition or absence of some letters. Unicode gives individual languages and applications the freedom to handle sorting in their own way.
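Because the Unicode blocks for the Indian scripts follow the ISCII arrangement, a letter usually sits at the same offset inside each 128-code-point block, so a rough script-to-script transliteration can be sketched by shifting code points from one block to another. The following Python sketch makes several assumptions that should be stated plainly: the function name `shift_script` and the table `BLOCK_START` are hypothetical, the block start values are the standard Unicode ones, and a slot that is unassigned in the target script is simply left untransliterated. It is an illustration of the common layout, not a complete transliteration system.

```python
import unicodedata

# First code point of each 128-code-point Indic block in Unicode.
BLOCK_START = {
    "devanagari": 0x0900, "bengali": 0x0980, "gurmukhi": 0x0A00,
    "gujarati": 0x0A80, "oriya": 0x0B00, "tamil": 0x0B80,
    "telugu": 0x0C00, "kannada": 0x0C80, "malayalam": 0x0D00,
}
BLOCK_SIZE = 128


def shift_script(text, source, target):
    """Move every character of the `source` block to the same offset in `target`."""
    src, dst = BLOCK_START[source], BLOCK_START[target]
    out = []
    for ch in text:
        cp = ord(ch)
        if src <= cp < src + BLOCK_SIZE:
            candidate = chr(cp - src + dst)
            # Not every slot is filled in every script; keep the original
            # character when the target code point is unassigned.
            if unicodedata.name(candidate, None) is not None:
                ch = candidate
        out.append(ch)
    return "".join(out)


gujarati_ram = shift_script("राम", "devanagari", "gujarati")
print(ascii(gujarati_ram))  # printed as escapes so it is safe on any console
```

Shifting the result back with `shift_script(gujarati_ram, "gujarati", "devanagari")` recovers the Devanagari spelling. Real transliteration needs script-specific rules on top of this, because letters present in one script may be missing in another.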
Ø If you want to be with the rest of the world, then move to Unicode.
Unicode is increasingly being accepted as a standard for Information Interchange worldwide, as most of the major IT companies have declared their support for it. The Unicode Standard provides the capacity to encode all of the characters used for the written languages of the world. It also provides information about the characters and their use. The Unicode Standard is very useful for computer users who deal with multilingual text, business people, linguists, researchers, scientists, mathematicians and technicians.
Unicode was originally designed as a 16-bit encoding that provides code points for more than 65000 characters (65536). The codespace has since been extended beyond 16 bits (up to U+10FFFF), but the blocks for the major Indian scripts, such as Devanagari, Bengali and Tamil, all lie within the original 16-bit range. The Unicode Standard assigns each character a unique numeric value and name. By contrast, ISCII uses an 8-bit code, an extension of the 7-bit ASCII code, containing the basic alphabet required for the 10 Indian scripts which have originated from the Brahmi script. Traditionally, computer applications dealt with text in only one language. Subsequently the need to work with multilingual text was felt, and this brought in additional requirements in respect of codes. In such 8-bit schemes, the letters of different languages cannot normally be distinguished on the basis of their codes, because across different languages the numerical values assigned to the codes fall in the same range. Thus one might find that the code assigned to the letter "a" in English is really the same as the code assigned to the Greek letter "alpha" or to an equivalent letter in the Cyrillic alphabet. A multilingual document with text from different languages cannot really be identified as one, unless a mechanism is available to specifically mark sections of the text as belonging to a specific language/script.
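The ambiguity of 8-bit codes described above can be demonstrated with Python's standard codecs. The sketch below is only an illustration: the byte value 0xE1 and the three legacy code pages (ISO 8859-1, ISO 8859-7 and ISO 8859-5) are examples chosen here, not taken from the paper, and ISCII itself is not used because Python ships no ISCII codec. The same byte is read as three different letters depending on the code page, whereas each of those letters has its own distinct Unicode code point.

```python
import unicodedata


def decode_everywhere(raw, encodings):
    """Decode the same bytes under several legacy 8-bit code pages."""
    return {enc: raw.decode(enc) for enc in encodings}


RAW = bytes([0xE1])  # one and the same byte value
readings = decode_everywhere(RAW, ["latin_1", "iso8859_7", "iso8859_5"])

for enc, ch in readings.items():
    # Print only the code point and the character name, which are plain ASCII.
    print(f"{enc:10} U+{ord(ch):04X} {unicodedata.name(ch)}")

# In Unicode the three letters are distinct, so no external marking of the
# language or script is needed to tell them apart.
distinct_code_points = len({ord(ch) for ch in readings.values()})
```

Under the Western European code page the byte is a Latin letter, under the Greek one it is alpha, and under the Cyrillic one it is a Cyrillic letter, which is exactly why multilingual text in 8-bit codes needs extra markup.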
The basic idea in Unicode was to assign codes over a much larger range of numbers, originally from 0 to 65,535. This large set includes not only the letters of the alphabets of many different languages of the world but also punctuation and special shapes such as mathematical symbols and currency symbols. The range is apportioned to different languages and scripts, typically by assigning chunks of 128 consecutive numbers to each script, which may also include a group of special symbols. The alphabet in many languages has far fewer than 50 letters, so this minimal range of 128 is quite adequate even to cover additional symbols and punctuation. An important concept in Unicode is that codes are assigned to a language on the basis of linguistic requirements. Thus, for most languages of the world that use the letters of their alphabet in the writing system, the linguistic requirement is basically satisfied if all the letters are covered along with special symbols. Display of text proceeds by identifying the letters through their assigned Unicode values both in the input string and in the displayed string, which for most languages and scripts are identical. Thus a Unicode font for a language need incorporate only the glyphs corresponding to the letters of the alphabet, and the glyphs in the font are identified by the same codes used for the letters they represent. Here is the same question, "What is Unicode?", saved as Unicode text in various languages of the world.
ما هي الشفرة الموحدة "يونِكود" ؟ in Arabic
Какво е Unicode ? in Bulgarian
什麽是Unicode(統一碼/標準萬國碼)? in Trad'l Chinese
什么是Unicode(统一码)? in Simplified Chinese
Što je Unicode? in Croatian
Co je Unicode? in Czech
Hvad er Unicode? in Danish
Wat is Unicode? in Dutch
Kio estas Unikodo? in Esperanto
Mikä on Unicode? in Finnish
Qu'est ce qu'Unicode? in French
რა არის უნიკოდი? in Georgian
Was ist Unicode? in German
Τι είναι το Unicode; in Greek (Monotonic)
Τί εἶναι τὸ Unicode; in Greek (Polytonic)
מה זה יוניקוד (Unicode)? in Hebrew
यूनिकोड क्या है? in Hindi
Hvað er Unicode? in Icelandic
Que es Unicode? in Interlingua
Cos'è Unicode? in Italian
ユニコードとは何か？in Japanese
유니코드에 대해? in Korean
Kas tai yra Unikodas? in Lithuanian
Што е Unicode? in Macedonian
X'inhu l-Unicode? in Maltese
يونی‌کُد چيست؟ in Persian
Czym jest Unikod? in Polish
O que é Unicode? in Portuguese
Ce este Unicode? in Romanian
Что такое Unicode? in Russian
Kaj je Unicode? in Slovenian
¿Qué es Unicode? in Spanish
Vad är Unicode? in Swedish
Unicode คืออะไร? in Thai
ዩኒኮድ እንታይ ኢዩ? in Tigrigna
Što je Unicode? in Upper Sorbian
Evrensel Kod Nedir? in Turkish
ﻳﯘﻧﯩﻜﻮﺩ ﺩﯨﮕﻪﻥ ﻧﯩﻤﻪ؟ in Uyghur
Unicode dégen néme? in Uyghur (latin)
Unicode là gì? in Vietnamese
Beth yw Unicode? in Welsh
The main purpose of Unicode is to transport information across computer systems.
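As a small illustration of the 128-code-point script blocks and of transporting text between systems, the sketch below checks the Hindi line from the list against the Devanagari block (U+0900 to U+097F) and round-trips it through UTF-8. The helper name `in_devanagari_block` is hypothetical, and UTF-8 is assumed here only as one common interchange encoding.

```python
# Devanagari block boundaries as published in the Unicode code charts.
DEVANAGARI_FIRST = 0x0900
DEVANAGARI_LAST = 0x097F
BLOCK_SIZE = DEVANAGARI_LAST - DEVANAGARI_FIRST + 1


def in_devanagari_block(ch: str) -> bool:
    """Return True if the character's code point lies in the Devanagari block."""
    return DEVANAGARI_FIRST <= ord(ch) <= DEVANAGARI_LAST


text = "यूनिकोड क्या है?"

# Characters of the sentence that fall outside the Devanagari block.
outside = sorted({ch for ch in text if not in_devanagari_block(ch)})

# Serialize to bytes for transport, then restore on the receiving side.
wire_bytes = text.encode("utf-8")
restored = wire_bytes.decode("utf-8")

if __name__ == "__main__":
    print(BLOCK_SIZE, len(wire_bytes), restored == text)
```

Only the space and the question mark fall outside the block, because Unicode shares such common characters across scripts instead of duplicating them in each block.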

Conclusions

Unity in diversity is the key to the Indian ethos, and it is reflected in the scripts and languages of the Indian subcontinent. This is because all Indic scripts originated from the Brahmi script and share a common phonetic order. Building on this commonality among Indic scripts and languages, computer scientists succeeded in developing a common code, ISCII, and a common keyboard, INSCRIPT, for Indian languages. This also paved the way for various language tools in Hindi and other Indian languages, such as mutual transliteration among Indian scripts. ISCII undoubtedly helped to bring about national integration among Indians speaking many languages across the country, but with the advent of Unicode all the languages of the world have been covered under a single uniform code, paving the way for the global integration of Indians through Indic language computing. In fact, Unicode has transformed the entire world community into a global village in letter and spirit, turning the Sanskrit saying vasudhaiv kuTumbakam (the whole world is like a family) into reality.

Acknowledgements

Mr. Mahendra K. Verma, University of York (UK)
Dr. Aravind Joshi, University of Pennsylvania (USA)
Dr. U.B. Pavanaja, CEO, Vishva Kannada Softech, Bangalore
Dr. Suraj Bhan Singh, Former Chairman, Commission for Scientific & Technical Terminology in Hindi, New Delhi
Dr. Krishna Kumar, Gitanjali Multilingual Literary Circle, Birmingham, UK

References

1. Language and Politics by D.P. Pattanayak, published in GaveshaNaa: 63-64/1994/191-199
2. Hindi kaa vaakyaatmak vyaakaraN by Prof. S.B. Singh
3. www.cdac.in
4. www.unicode.org
5. www.bhashaindia.com

------------------------------------

Meet Mr. Vijay K. Malhotra

Semantics – The Art of Language Computing
Vijay K. Malhotra, a veteran of several years' service with Indian Railways, now pursues his passion for Indian languages with the BhashaIndia team. This interview highlights Mr. Malhotra's work in the field of semantics and his association with the development of Office XP in Hindi.
When did you develop an interest in languages?
V: My interest in languages dates back to my early education days in Gurukul Kangri University, Hardwar with special emphasis on Sanskrit and Hindi, but it was matured in the University of York, UK while teaching Hindi to the students belonging to various ethnic groups across the globe such as the British, Indians and Africans. For the first time I realized the importance of Linguistics and Language Technology for teaching a language of foreign origin to the non-natives.
Which feature of languages and their technology makes you passionate about them?
V: When I look at the core of languages, I find two distinct kinds of features in any language of the world: universal features and language-specific features. Universal features are those that are common even to languages from very different families across the globe, such as Hindi, Tamil, English, Chinese and Arabic. For example, 'khaayaa' (ate) is a transitive verb, which requires an object to be eaten and a subject who eats. It also requires its subject to be animate. The tree structure of this verb and its universal features, such as transitivity, are common to all languages of the world, but the use of 'ne' with the subject of this verb is a language-specific feature of Hindi. Certain features are specific to a group of languages. For example, the use of 'ko' with the subject of a specific sentence pattern is common to all languages in India: 'Raam Ko Bukhaar Hai' in Hindi, 'Malaa Taap Aahe' in Marathi, 'Ramakku Jwaram' in Tamil, 'Raamannu Paniyaa Nu' in Malayalam, 'Ramanige Jvar Ide' in Kannada and 'Ramer Taap Aachhe' in Bengali, whereas in English it is translated as 'Ram has a fever'. Here you will notice the conspicuous absence of any counterpart of 'ko'. This shows that India is one single linguistic zone. If we use technology to analyze these features, we can come up with highly sophisticated language tools such as a Language Tutor, Auto Correct, a Grammar Checker and even machine translation systems in Indian languages. This is how and why I am passionate about languages and their technology.
Did your family background play any role in your interest in linguistics?
V: Although I belong to a publishing family, my interest in linguistics grew over time while meeting various challenges of the implementation of Hindi in Government offices and PSUs. However my Gurukul background of studying Hindi and Sanskrit for over 14 years played a crucial role in developing my interest in languages but my interest in linguistics developed while teaching Hindi to non-natives in UK.
What made you retire voluntarily from Indian Railways and go back to your passion?
V: Indian Railways is one of the biggest organizations in the world, with a workforce of over 16 lakh employees, and it has direct contact with the common man. As Director (Official Languages), I was responsible for introducing Hindi in the day-to-day working of Indian Railways. This gave me an opportunity to find ways and means to implement Hindi at various levels across the country. I was also aware that the atmosphere in the southern states is not very congenial to implementing Hindi, and hence I decided to use both culture and technology to promote Hindi. We used theatre to promote Hindi in non-Hindi-speaking areas, we encouraged the use of regional languages at all points of contact with the public, and finally technology helped us inculcate the habit of using computers for various language applications in the Railways, including reservation charts. In spite of this initial success, I found that language tools such as a Language Tutor, Grammar Checker and Machine Translation System were still not available in the market, and I felt that, given a chance, I would devote the rest of my life to narrowing the divide between the language (Hindi) and technology. Microsoft offered me this opportunity, and now I am back to my initial passion.
What were the initiatives taken by you to spread Hindi computing at the grass root level?
V: Since the employees of Indian Railways at the grassroots level in Region A (Hindi-speaking regions) and Region B (Maharashtra, Punjab and Gujarat) are well versed in Hindi and wanted English to be replaced with Hindi for day-to-day working as soon as possible, we achieved initial success. With the introduction of computerization, however, all applications such as reservation charts, salary bills, correspondence, notifications, tenders and contracts were once again being prepared in English even in Hindi-speaking areas, because the computers did not have proper facilities to work in Hindi. Now there was a fresh challenge: to develop computer applications in Hindi. In consultation with the General Managers of the zonal railways, we decided to prioritize the applications meant for the common man, such as reservation charts, railway tickets, forms, salary bills, circulars, gazette notifications and identity cards for Class 3 and Class 4 employees. Since computer applications at that time were confined to word processing, we had a tough time preparing the reservation charts and computerized railway tickets in Hindi on proprietary systems. Certain applications such as seniority lists and sorting were still not possible in Hindi because of the lack of language tools in Hindi. The incompatibility of the various systems used in Indian Railways was another reason for the limited use of applications in Hindi. The advent of Unicode and XML in Indian languages has raised hopes of developing computer applications in Hindi, especially at the grassroots level.
How has been the response of the people, especially in the rural areas towards the initiatives?
V: The initiative of Reservation Charts in Hindi was welcomed by the common man in rural as well as semi-urban areas. Since this facility was made available all over India, the response of passengers across the country was unprecedented. In non-Hindi-speaking areas, the secret of our success was to use the respective regional languages along with Hindi and English. This was possible only through the extensive use of computer technology.
Do you think that the spread of Hindi computing has been successfully replicated for other Indian languages?
V: Since most of the Indian scripts originated from Brahmi, they have a lot in common in terms of their phonetic structure. The scientists working on Indian languages realized this, and hence INSCRIPT evolved as a common keyboard for all Indian languages. ISCII was a common standard for Indian languages, and now Unicode has incorporated ISCII to cover all these languages. Unity in diversity also holds true of Indian scripts and languages. Hence I am convinced that the spread of Hindi computing must be successfully replicated for other Indian languages.
You have been involved with projects like developing Hindi Parser with University of Pennsylvania and Hindi Thesaurus with University of York. Tell us something about these two projects?
V: In 1984, I was offered the Nuffield Fellowship to teach Hindi at the University of York, UK for one semester. Along with teaching, I was asked to take up a project to develop a prototype of a Hindi Thesaurus. Although it was a computational linguistics project, I was not familiar with the computational side of developing it. I also had no formal training in linguistics, so I decided to prepare the thesaurus manually, but the university authorities arranged initial computer training for me and asked me to use the Oxford Concordance Manual for preparing the concordance and indexing (sorting). The ideal way would have been to pick the basic words from corpora, but in view of the limited time and my lack of computational knowledge and skill, I decided to pick only 2000 basic words and find synonyms and antonyms for them. It was also decided to group them under 14 semantic fields in the form of a concordance. This was how we developed a small prototype of a Hindi Thesaurus. In 1996, the Computer Science Department of the University of Pennsylvania, USA invited me to assist their NLP group in developing the Hindi Parser. I considered this a great honour, as the invitation had come from the university that built ENIAC, one of the world's first electronic computers. TAG (Tree Adjoining Grammar) was my favourite formalism, and it was developed by Prof. Aravind Joshi, Head of the Computer Science Department of this university. I have always found it quite suitable for multiple languages with different syntactic structures, such as Hindi and English. English is a positional language, whereas Hindi has relatively free word order.
For example, if you change the word order of the sentence "Ram (Subject) killed (verb) Ravan (Object)" and swap the subject and the object, giving "Ravan (Subject) killed (verb) Ram (Object)", the meaning changes completely, but in Hindi the meaning remains the same even after the word order is changed: "राम (Subject) ने रावण (Object) को मारा (verb)" and "रावण (Object) को राम (Subject) ने मारा (verb)". TAG handles both languages on the basis of the verb and captures the basic order as SVO in English and SOV in Hindi. The domain selected for the project was Officialese (the language used in administration). The features of Officialese are very similar across languages. For example, the use of past participles is quite common in Officialese: "Mr. Verma has been transferred from Delhi to Mumbai with effect from March 1, 2005 and posted as Director (Operations)". A typical feature of Hindi, however, is the honorific, as in Shri Verma. In English, it is enough to use Mr. before the name of the person to show respect, and the verb remains singular. In Hindi even the verb changes into the plural: श्री वर्मा निदेशक हो गए (plural). With these examples, I wanted to make it clear that unless language-specific features are addressed while developing the parser, language tools such as a machine translation system cannot be developed successfully. This parser was found quite useful for analyzing language-specific features in Hindi.
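The contrast between positional and free word order can be sketched in a few lines of Python. This is not TAG and not the Hindi Parser described in the interview; it is a toy illustration that assumes whitespace-separated tokens, a verb in final position in Hindi, and the markers ने and को directly following the subject and the object. Both function names are hypothetical.

```python
def hindi_roles(sentence: str) -> dict:
    """Assign roles from the postposition that follows each noun."""
    tokens = sentence.split()
    roles = {"verb": tokens[-1]}  # assumption: the verb comes last
    for word, marker in zip(tokens, tokens[1:]):
        if marker == "ने":
            roles["subject"] = word
        elif marker == "को":
            roles["object"] = word
    return roles


def english_roles(sentence: str) -> dict:
    """Assign roles purely by position in a three-word SVO sentence."""
    subject, verb, obj = sentence.split()
    return {"subject": subject, "verb": verb, "object": obj}


# Reordering the Hindi sentence leaves the roles unchanged.
first = hindi_roles("राम ने रावण को मारा")
second = hindi_roles("रावण को राम ने मारा")

# Reordering the English sentence swaps subject and object.
plain = english_roles("Ram killed Ravan")
swapped = english_roles("Ravan killed Ram")

if __name__ == "__main__":
    print(first == second, plain == swapped)
```

In the Hindi sentences the postpositions carry the roles, so the two orders analyze identically; in English the roles are read from position alone, so swapping the names changes who killed whom.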
You have translated Windows XP to Hindi. How was the multilingual support of Windows XP helpful during the translation process, especially while translating the hardcore technical terms?
V: My job was to coordinate and moderate the translations undertaken by various vendors. Before taking up the translation, I drafted some principles and got them approved by Redmond to maintain uniformity throughout the system. Accordingly, the hardcore technical terms were divided into two groups. Terms denoting a concept normally need to be used with multiple derivatives, so they were translated into Hindi. For example, the term "Format > स्वरूप" has derivatives such as "Formatted > स्वरूपित", and "Print > मुद्रण" has derivatives such as "Printed > मुद्रित". English terms already popular in Hindi, such as File, Computer, Window, Office, Bullet and Font, were not translated but transliterated. Certain acronyms were also transliterated because of their popular usage, but proper care had to be taken while writing them in Hindi. For example, ROM has been transliterated as रॉम and not as राम. In Hindi, most people normally omit the "ardhchandra", writing डाक्टर, for example, but we decided to use it to avoid ambiguity. With the inclusion of Hindi in the multilingual support of Windows XP, the translation process for other Indian languages became easier because of the commonality of these languages, but for us it was of no help. We had to start everything afresh.
What led you to develop language tools like Auto Correct for Hindi?
V: While reviewing various features of Office XP in Hindi, I found that there was a lot of scope to modify the existing Auto Correct feature for Hindi. At the outset, I clarified that Auto Correct is not a script-specific feature but a language-specific one. For example, the Devanagari script is used for both Hindi and Marathi, but their spelling conventions are quite different. Even words derived from Sanskrit that the two languages share are spelt differently. Most words ending with short "i" in Hindi are spelt with long "ii" in Marathi. For example, कवि is spelt in Hindi with short "i", but in Marathi it is spelt कवी with long "ii". How can there be a common Auto Correct for both Hindi and Marathi? Besides, I collected samples from various Hindi-speaking and non-Hindi-speaking regions to understand the pattern of errors committed by speakers of different languages. The errors committed by Marathi speakers are quite different from those of Punjabi speakers. Since Hindi is used in most parts of India, there is a lot of variation in its pattern of errors. While Marathi and Gujarati speakers confuse long "ii" and short "i", speakers of South Indian languages make mistakes with aspirated sounds. They write भाषा as बाषा and खाना as काना. This is because of mother tongue interference. Finally, it may be noted that the modified Auto Correct developed by me is still to be uplinked.
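The point that Auto Correct belongs to a language and not to a script can be sketched as one correction table per language. This is a minimal sketch only, not the Auto Correct of Office XP: the table layout, the language codes "hi" and "mr", and the `autocorrect` function are assumptions, and the entries are limited to examples mentioned in the answer above.

```python
# One table per language, even though both languages use Devanagari.
AUTOCORRECT = {
    "hi": {"कवी": "कवि", "बाषा": "भाषा"},  # Hindi: short "i"; aspirated भ
    "mr": {"कवि": "कवी"},                  # Marathi: long "ii"
}


def autocorrect(text: str, lang: str) -> str:
    """Replace each whitespace-separated word found in the table for lang."""
    table = AUTOCORRECT.get(lang, {})
    return " ".join(table.get(word, word) for word in text.split())


hindi_fixed = autocorrect("कवी", "hi")
marathi_fixed = autocorrect("कवि", "mr")

if __name__ == "__main__":
    print(hindi_fixed != marathi_fixed)
```

The same Devanagari spelling is corrected in opposite directions depending on the language, which is why a single table keyed only by script could not serve both.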
What are your future plans for Hindi on the Bhasha India portal?
V: In its present form, the Hindi section of the Bhasha India portal is just a Hindi translation of a portal originally conceived in English, with no direct interaction with the real users of Hindi computing. Our future plan is to make it a forum for those users who think and write in Hindi and use computers to achieve their purpose. I also do not assume that they know English. At present, most Hindi users still rely on non-standard fonts, and because these are incompatible across systems, they seldom attempt applications such as e-mail, chat, templates, auto text, the thesaurus or spell checkers. Very few users attempt data applications such as Excel or Access in Hindi. PowerPoint is also not commonly used in Hindi. The reason is that there was no common standard for Hindi across systems. ISCII was a good beginning in this direction, but in a globalized world we need a global standard under which all the languages of the world can coexist irrespective of platform, font and system. Unicode is the answer to all these problems. Hence our endeavour will be to make users aware of the advantages of Hindi computing in Unicode. We will also discuss all their problems on our forum. To encourage those using new applications, we will launch various incentive schemes. I am sure the Hindi section of bhashaindia.com will become the forum of the real users of Hindi computing.

Meet Vijay Kumar Malhotra

Semantics – The Art of Language Computing
Vijay Kumar Malhotra, after working for years with Indian Railways, has been working with the BhashaIndia team out of his passion for Indian languages. This interview highlights Mr. Malhotra's contribution to the field of semantics and his role in the development of Office XP Hindi.
When did your interest in languages begin?
Malhotra: My interest in languages goes back to the time when I was receiving my early education, with Sanskrit and Hindi as subjects, at Gurukul Kangri University, Haridwar. It matured, however, when I arrived at the University of York, UK, to teach Hindi and was entrusted with teaching Hindi as a foreign language to students from different continents: British, Indian and African. For the first time I realized how important a role linguistics and language technology play in foreign language teaching.

Which aspect of language and language technology made you so passionate about them?
Malhotra: Whenever I look at the core of languages, I see two clear kinds of features in the languages of the world: universal features and language-specific features. Universal features are those found alike even in languages belonging to different language families, such as Hindi, Tamil, English, Chinese and Arabic. For example, 'खाया' (ate) is a transitive verb that expects an object to be eaten and a subject to perform the eating, and it also expects its subject to be animate. All these universal features appear in the tree structure of this verb. Its transitivity is thus the same in all the languages of the world, but the use of 'ने' with the subject is a language-specific feature of Hindi. Some features are specific to a group of languages. For example, the use of 'को' with the subject in a particular sentence pattern is found alike in all Indian languages: 'राम को बुखार है' in Hindi, 'रामला ताप आहे' in Marathi, 'रामक्कु ज्वरम्' in Tamil, 'रामन्नु पनियानु' in Malayalam, 'रामनिगे ज्वर दिगे' in Kannada and 'रामेर ताप आछे' in Bengali, whereas in English it is translated as 'Ram has a fever'. In English, you will see, there is no postposition or preposition corresponding to 'को'. This shows that India is a single linguistic area. If we use language technology to analyze these features, we can develop even highly complex language tools for Indian languages, such as a computer-aided language self-tutor, Auto Correct, a grammar checker and machine translation. This is why language and language technology have made me so passionate.
Did your family background play any role in your attachment to linguistics?
Malhotra: I do come from a family of publishers, but my interest in linguistics grew more when I had to face many challenges while implementing Hindi in government offices and public sector undertakings. Although my inclination towards languages came naturally from 14 years of special education in Hindi and Sanskrit at the Gurukul, it was while teaching Hindi as a foreign language in the UK that I understood the real importance of linguistics, and my interest in it began to grow.
What was the reason that led you to take voluntary retirement from Indian Railways and devote yourself to the work you are passionate about?
Malhotra: Indian Railways is one of the largest organizations in the world, with more than 16 lakh employees who are in direct contact with the common man. As Director (Official Language), it was my responsibility to promote the use of Hindi in the day-to-day work of Indian Railways. This responsibility gave me the opportunity to look for new ways of implementing Hindi at various levels across the country. I was also not unaware that the atmosphere in the southern states is not favourable to Hindi. I therefore decided to make special use of culture and technology to promote Hindi. We used theatre in non-Hindi-speaking areas and began using regional languages at all points of contact with the public. The widespread use of computer technology in all railway applications began to bring considerable success. The public particularly liked the use of Hindi in reservation charts. Despite this initial success, I felt that language tools such as a language self-tutor, Auto Correct, a grammar checker and machine translation were still badly lacking in the market, and that, given the opportunity, I would like to spend the rest of my life narrowing the gap between language (Hindi) and technology as far as possible. As it happened, Microsoft gave me this opportunity, and I took up the work I am passionate about.
What initiatives did you take to bring Hindi computing to the grassroots level?
Malhotra: Most Indian Railways employees working at the grassroots level in Region 'A' (the Hindi-speaking states) and Region 'B' (Maharashtra, Punjab and Gujarat) know Hindi well and also want Hindi to replace English in the day-to-day working of the railways as soon as possible. So initially we did achieve partial success at the grassroots level, but after large-scale computerization of the railways, all the work that had successfully been done in Hindi, such as reservation charts, salary bills, correspondence, notifications, and tenders and contracts, reverted to English even in Hindi-speaking areas. The main reason was the lack of Hindi facilities on computers. The challenge now was how to carry out all these applications in Hindi on the computer. In consultation with the General Managers of the railways, it was decided on a priority basis to take up first those applications that directly concern the public and grassroots-level employees, such as reservation charts, railway tickets, forms, salary bills, circulars, gazette notifications, tenders and contracts, and identity cards for Class 3 and Class 4 employees. Since computer applications at that time were confined to word processing, we faced many difficulties in producing in Hindi the reservation charts, computerized railway tickets and salary bills that were prepared on proprietary systems. Seniority lists and indexes could not be prepared in Hindi at all, because sorting was not available in Hindi. The lack of compatibility among the various systems used in Indian Railways also limited the use of Hindi. With the arrival of Unicode and XML in Indian languages, there is now hope that computers can be used for every kind of application in Hindi, even at the grassroots level.
