The "Bangla Syntactic Treebank Corpus with Processing Pipeline and Distribution Platform" project aims to advance Bangla language processing by creating a syntactically annotated corpus, developing language models such as Word2Vec, GPT-2, BERT, T5, ELECTRA, and XLNet, and building user-friendly tools such as an annotation management system, a web scraper, a corpus analyzer, and a corpus aggregator. The system is intended to serve as a key resource for both research and industry and as a central hub for Bengali text mining and language processing.
Gold Dataset: 2M+
Silver Dataset: 10M+
Dataset Types: 8+
Raw Dataset: 3B+
Romanized Dataset: 200K+
Lexical Dictionary: 200K+
Language Models: 5+
Fine-tuned Models: 10+
Applications: 6+
Publications: 3+
Library: 1+
An NLP corpus (Natural Language Processing corpus) is a substantial and systematically organized collection of textual data, employed for the purposes of training, testing, and evaluating language models and various NLP algorithms. In this project, we developed several specialized corpora, each tailored to address distinct facets of Bangla language processing.
BdNC is a comprehensive collection of raw running text, containing 40 gigabytes, or over 3 billion words. This corpus is an essential resource for training robust Bangla language models and a key resource for Information Retrieval applications.
This corpus, also known as Banglish, contains at least 200,000 words of Bangla text written in Roman characters. It is balanced and representative, reflecting actual usage in everyday communication.
This is a manually constructed, syntactically annotated corpus containing 2 million running words. It serves as a gold standard for evaluating and developing Bangla language processing tools.
Automatically generated using the project’s processing pipeline, bdTreebankSilver consists of 10 million words. It provides a larger dataset for training and refining NLP models.
The task-specific corpora include datasets designed for tasks such as Question Answering, Word Sense Disambiguation, Syntactic Similarity, and Paraphrasing. These corpora are crucial for training supervised and semi-supervised machine learning models.
Language models are powerful tools that understand and generate human language by learning patterns and structures from vast amounts of text data. These models are integral to various NLP applications, enabling machines to comprehend and process language in a way that is useful for tasks such as translation, sentiment analysis, and more. The models trained in this project include:
BERT (Bidirectional Encoder Representations from Transformers) is a powerful pre-trained language model developed by Google, revolutionizing NLP by understanding both the left and right context of words. Built on the Transformer architecture, BERT uses self-attention mechanisms to capture long-range dependencies in text. It is pre-trained on two key tasks: masked language modeling (MLM) and next sentence prediction (NSP). This enables BERT to create context-aware word embeddings, making it highly effective for tasks like text classification, named entity recognition, and question answering.
Data Size: 32 GB
Parameters: 110M
Hugging Face Link: https://huggingface.co/banglagov/banBERT-Base
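The masked language modeling objective described above can be illustrated with a minimal sketch. It assumes the masking recipe from the original BERT paper: roughly 15% of tokens are selected, and of those, 80% are replaced with [MASK], 10% with a random token, and 10% are left unchanged. Whether banBERT-Base was trained with exactly these settings is not stated here. The function name mask_tokens and the whitespace tokenization are hypothetical simplifications; a real pipeline would use the model's subword tokenizer.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (corrupted_tokens, labels) for masked language modeling.

    labels[i] holds the original token where the model must predict it,
    and None at positions that are not part of the training loss.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)                       # model must recover this token
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif r < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(tok)                # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(tok)
    return corrupted, labels


if __name__ == "__main__":
    # Toy sentence, split on whitespace (an assumption for illustration).
    sentence = "আমি বাংলায় গান গাই".split()
    print(mask_tokens(sentence, vocab=sentence, seed=0))
```

Because the model sees the corrupted sequence but is scored on the original tokens, it has to use context on both sides of each selected position, which is where the bidirectional behavior comes from.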
The applications developed in this project are designed to leverage the full potential of the language models and processing pipelines, making advanced Bangla NLP accessible and user-friendly for various end-users. These tools cater to both academic research and commercial use, providing comprehensive solutions for text analysis, data management, and linguistic research.
Key Applications Delivered
Label Hub is a comprehensive text tagging and annotation management tool developed to streamline data preparation for machine learning and to enhance the annotation process across various Natural Language Processing (NLP) tasks. It supports a wide range of labeling types, including multiclass classification, sequence labeling, and relation labeling, along with advanced features such as active learning to auto-tag known entities, multi-label support, and extensive multilingual and Unicode capabilities. The system efficiently manages large numbers of annotators while ensuring gold-standard data annotation, with features for collaboration, task assignment, and detailed reporting.
Crawler is an advanced tool designed for admins to efficiently manage data extraction from various websites. Admins can add new sites, schedule crawlers for specific sites, and control custom extractions without writing any code. The system supports scheduled crawling, range crawling, bulk crawling, and customization for extracting data from YouTube channels. Users can export the crawled datasets in JSON format and monitor the extracted data from specific sites in real time.
The Data Aggregation component is designed to share labeled and raw data with the public. Given the diverse range of annotated data within the project, this system enables the project authority to distribute data to a broader audience. It can store various types of data, including text, images, audio, and video files. Admins can also manage dataset characteristics and implement version control through the system, ensuring organized and accessible data sharing.
Corpus Analyzer is a text analysis and corpus management tool. Its purpose is to allow researchers studying language behavior to search extensive text collections using complex, linguistically informed queries. It offers a wide range of features, including word and phrase frequency analysis, N-grams, concordance, KWIC (Key Word in Context), and collocation. The platform lets users search, filter, sort, arrange, export, import, store, and analyze existing or newly added corpora. It provides lexicologists, historians, and researchers with a versatile platform for comprehensive corpus analysis.
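To make two of the listed features concrete, the following is a minimal sketch of N-gram counting and a KWIC (Key Word in Context) lookup. It is not the Corpus Analyzer's implementation; the function names ngram_counts and kwic are hypothetical, and tokens are assumed to come from simple whitespace splitting.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def kwic(tokens, keyword, window=2):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            lines.append((left, tok, right))
    return lines


if __name__ == "__main__":
    text = "the cat sat on the mat"
    toks = text.split()
    print(ngram_counts(toks, 2).most_common(3))
    for left, kw, right in kwic(toks, "the", window=1):
        # Align the keyword in a fixed column, as concordance views usually do.
        print(f"{left:>10} | {kw} | {right}")
```

A concordance view is essentially the KWIC output sorted by left or right context, and collocation statistics can be built on top of the same windowed counts.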
Information Retrieval System is designed to extract and rank Bangla information based on user searches, drawing from continuous data collected by in-house crawler modules. It supports an unlimited corpus size, handling both real-time and scheduled data crawling, as well as offline manual imports. The system allows users to search through news articles and provides analysis such as named entity recognition and topic analysis, with options to sort by trends. Additionally, it offers detailed information on books, including author details, and compares pricing across different websites.
The ML Models platform allows users to interact with various trained models and view predicted outputs for different processing and downstream tasks. Current support includes Named Entity Recognition (NER), Part-of-Speech (PoS) tagging, Shallow Parsing, Question Answering, Paraphrase Identification, Coreference Resolution, and other unsupervised models based on our language models. Additionally, users can interact with a Bangla embedding visualization to explore and visualize Bangla words.
BDLexicon is a Bengali dictionary offering word meanings, parts of speech, synonyms, examples, and related words, designed for easy and interactive use.
Comprehensive Bengali definitions
Related words and phrases
Usage in context
Grammatical classification
Explore our latest research and findings in the field of Bangla language processing and syntactic analysis.
Sadia Afrin, Md. Shahad Mahmud Chowdhury, Md. Islam, Faisal Khan, Labib Chowdhury, Md. Mahtab, Nazifa Chowdhury, Massud Forkan, Neelima Kundu, Hakim Arif, Mohammad Mamun Or Rashid, Mohammad Amin, Nabeel Mohammed
Published in: Findings of the Association for Computational Linguistics: EMNLP 2023
Lemmatization holds significance in both natural language processing (NLP) and linguistics, as it effectively decreases data density and aids in comprehending contextual meaning. However, due to the highly inflected nature and morphological richness, lemmatization in Bangla text poses a complex challenge. In this study, we propose linguistic rules for lemmatization and utilize a dictionary along with the rules to design a lemmatizer specifically for Bangla. Our system aims to lemmatize words based on their parts of speech class within a given sentence. Unlike previous rule-based approaches, we analyzed the suffix marker occurrence according to the morpho-syntactic values and then utilized sequences of suffix markers instead of entire suffixes. To develop our rules, we analyze a large corpus of Bangla text from various domains, sources, and time periods to observe the word formation of inflected words. The lemmatizer achieves an accuracy of 96.36% when tested against a manually annotated test dataset by trained linguists and demonstrates competitive performance on three previously published Bangla lemmatization datasets. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanLemma in order to contribute to the further advancement of Bangla NLP.
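The idea of removing a sequence of suffix markers, checked against a dictionary, can be sketched as follows. This is only an illustration under stated assumptions and not the BanLemma rule set: the function name strip_markers, the two markers, and the one-word lexicon are toy data chosen for the example, and the part-of-speech conditioning described in the abstract is omitted.

```python
def strip_markers(word, markers, lexicon, max_steps=5):
    """Peel suffix markers off the end of `word`, one marker per step.

    Stops as soon as the remainder is found in `lexicon`. Returns the
    original word unchanged when no lexicon entry can be reached.
    """
    current = word
    for _ in range(max_steps):
        if current in lexicon:
            return current
        # Try longer markers first so a long marker wins over a shorter one.
        for marker in sorted(markers, key=len, reverse=True):
            if current.endswith(marker) and len(current) > len(marker):
                current = current[:-len(marker)]
                break
        else:
            return word  # no marker matched at this step
    return current if current in lexicon else word


if __name__ == "__main__":
    # Toy data (assumption): two common markers and a single dictionary entry.
    markers = ["কে", "দের"]
    lexicon = {"ছেলে"}
    print(strip_markers("ছেলেদেরকে", markers, lexicon))
```

Falling back to the unchanged word when no dictionary entry is reached keeps the sketch conservative: it never returns a stem it cannot verify.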
Md. Motahar Mahtab, Faisal Ahamed Khan, Md. Ekramul Islam, Md. Shahad Mahmud Chowdhury, Labib Imam Chowdhury, Sadia Afrin, Hazrat Ali, Mohammad Mamun Or Rashid, Nabeel Mohammed, Mohammad Ruhul Amin
Published in: Findings of the Association for Computational Linguistics: NAACL 2025
In this study, we introduce BanNERD, the most extensive human-annotated and validated Bangla Named Entity Recognition Dataset to date, comprising over 85,000 sentences. BanNERD is curated from a diverse array of sources, spanning over 29 domains, thereby offering a comprehensive range of generalized contexts. To ensure the dataset’s quality, expert linguists developed a detailed annotation guideline tailored to the Bangla language. All annotations underwent rigorous validation by a team of validators, with final labels being determined via majority voting, thereby ensuring the highest annotation quality and a high IAA score of 0.88. In a cross-dataset evaluation, models trained on BanNERD consistently outperformed those trained on four existing Bangla NER datasets. Additionally, we propose a method named BanNERCEM (Bangla NER context-ensemble Method) which outperforms existing approaches on Bangla NER datasets and performs competitively on English datasets using lightweight Bangla pretrained LLMs. Our approach passes each context separately to the model instead of previous concatenation-based approaches achieving the highest average macro F1 score of 81.85% across 10 NER classes, outperforming previous approaches and ensuring better context utilization. We are making the code and datasets publicly available at https://github.com/eblict-gigatech/BanNERD in order to contribute to the further advancement of Bangla NLP.
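The majority-voting step mentioned in the abstract can be sketched as below. The abstract does not say how ties were resolved, so returning None on a tie is an assumption made for this example, as are the function name majority_label and the sample labels.

```python
from collections import Counter

def majority_label(votes):
    """Return the label chosen by the most validators, or None on a tie.

    `votes` is a list of labels, one per validator, for a single token.
    """
    ranked = Counter(votes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie: left unresolved in this sketch
    return ranked[0][0]


if __name__ == "__main__":
    # Hypothetical labels from three validators for one token.
    print(majority_label(["PER", "PER", "LOC"]))
```

In practice, tied items would need a further adjudication step, for example review by a senior annotator.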