PyPranav committed
Commit 8a1b9b3 · verified · 1 Parent(s): 6207f94

Update README

Files changed (1):
  1. README.md +115 -75
README.md CHANGED
@@ -1,75 +1,115 @@
- ---
- tags:
- - vedic-philosophy
- - sanskrit
- - instruction-tuning
- - synthetic-dataset
- - question-answering
- license: apache-2.0
- language:
- - en
- - sa
- pretty_name: Bhagwat Corpus
- size_categories:
- - 10K<n<100K
- dataset_info:
-   features:
-   - name: original_hf_id
-     dtype: string
-   - name: sanskrit_shloka
-     dtype: string
-   - name: english_translation
-     dtype: string
-   - name: generated_question
-     dtype: string
-   - name: generated_explanation
-     dtype: string
-   - name: generation_status
-     dtype: string
-   splits:
-   - name: train
-   - name: test
-   - name: validation
- ---
-
- # Bhagwat Corpus
-
- ## Dataset Summary
- The **Bhagwat Corpus** is a synthetic dataset of approximately 90,000 examples designed for instruction-tuning large language models (LLMs) to generate Vedic philosophical responses grounded in scriptural tradition. Each example consists of:
- - A synthetic user question
- - A relevant Sanskrit shloka (verse) from the Mahabharata or Ramayana
- - An English translation of the shloka
- - A generated explanation and status for the response
-
- The dataset is based on the Itihasa corpus (Aralikatte et al., 2021), which provides Sanskrit-English shloka pairs from the Mahabharata and Ramayana. The Bhagwat Corpus augments this with synthetic questions and explanations, making it suitable for culturally aware, spiritually aligned conversational AI.
-
- ## Supported Tasks and Leaderboards
- - **Instruction-tuning** of LLMs for Vedic/Indian philosophy
- - **Question answering** with scriptural grounding
- - **Text generation** (structured JSON output)
-
- ## Languages
- - Sanskrit (`sa`)
- - English (`en`)
-
- ## Usage Example
- You can load the dataset using the HuggingFace Datasets library:
-
- ```python
- from datasets import load_dataset
-
- dataset = load_dataset("PyPranav/Bhagwat-Corpus-Data")
- print(dataset["train"][0])
- # Example output:
- # {
- # 'original_hf_id': 'test_idx_0',
- # 'sanskrit_shloka': '...',
- # 'english_translation': '...',
- # 'generated_question': '...',
- # 'generated_explanation': '...',
- # 'generation_status': 'success'
- # }
- ```
-
- ## License
- Apache 2.0
+ ---
+ tags:
+ - vedic-philosophy
+ - instruction-tuning
+ - qlora
+ - synthetic-dataset
+ - json-output
+ license: apache-2.0
+ language:
+ - en
+ - sa
+ datasets:
+ - PyPranav/Bhagwat-Corpus-Data
+ library_name: transformers
+ model-index:
+ - name: Bhagvad Corpus LLM (Instruction-tuned)
+   results: []
+ ---
+
+ # Bhagvad Corpus LLM (Instruction-tuned)
+
+ ## Abstract
+ Although large language models (LLMs) are increasingly used to answer complex questions, they frequently fail to provide philosophically sound answers grounded in scriptural traditions such as Vedic thought. This gap stems from the scarcity of specialized instruction-tuning datasets built for such culturally rich contexts.
+
+ We present the **Bhagvad Corpus**, a synthetic dataset of approximately 90,000 examples built on the Itihasa corpus (Aralikatte et al., 2021) [1], which provides Sanskrit-English shloka pairs from the Mahabharata and Ramayana. Each instance in the Bhagvad Corpus contains a synthetic user question, the original shloka, its English translation, and a detailed explanation linking the verse to the intent of the query.
+
+ We demonstrate the dataset's utility by fine-tuning LLMs with QLoRA so that they produce structured, shloka-backed responses in formats such as JSON. Initial results show notable improvements in the relevance and depth of generated philosophical responses. We release the Bhagvad Corpus publicly to support further research on culturally aware and spiritually aligned language models.
+
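+ The exact training recipe is not included in this repository, so the snippet below is only a minimal sketch of what a QLoRA run over this corpus could look like, assuming the `peft` and `bitsandbytes` libraries and a placeholder base model; every name and hyperparameter here is illustrative rather than the configuration actually used.
+
+ ```python
+ # Illustrative QLoRA setup (not the exact recipe used for this model).
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+
+ BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; the actual base model is not stated here
+
+ # QLoRA: load the frozen base model in 4-bit precision.
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
+ model = AutoModelForCausalLM.from_pretrained(
+     BASE_MODEL, quantization_config=bnb_config, device_map="auto"
+ )
+ model = prepare_model_for_kbit_training(model)
+
+ # Train only low-rank adapters on top of the quantized weights.
+ lora_config = LoraConfig(
+     r=16, lora_alpha=32, lora_dropout=0.05,
+     target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora_config)
+
+ # Each corpus row would then be rendered into the instruction/input/response
+ # template shown under "Prompt Template" and trained with a causal-LM trainer.
+ dataset = load_dataset("PyPranav/Bhagwat-Corpus-Data", split="train")
+ ```
+
+ In a setup like this only the adapter weights are updated, which is what makes instruction-tuning on modest GPU hardware practical.
+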
+ ## Model Usage
+ This model is instruction-tuned to generate Vedic philosophical responses, always outputting a JSON object with the following keys:
+ - `sanskrit_shloka`: The relevant Sanskrit verse
+ - `english_translation`: The English translation of the shloka
+ - `explanation`: A detailed explanation connecting the shloka to the user's query
+
+ ### Prompt Template
+ To use the model, format your prompt as follows:
+
+ ```
+ Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ Provide a Vedic philosophical response based on ancient scriptures. Always provide JSON output with the following keys: 'sanskrit_shloka', 'english_translation', 'explanation'.
+
+ ### Input:
+ <YOUR_QUESTION_HERE>
+
+ ### Response:
+ ```
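+
+ For programmatic use, the template can be filled in with a small helper such as the sketch below; the function name and constant are illustrative and not part of the released code.
+
+ ```python
+ # Hypothetical helper for building prompts in the template above.
+ INSTRUCTION = (
+     "Provide a Vedic philosophical response based on ancient scriptures. "
+     "Always provide JSON output with the following keys: "
+     "'sanskrit_shloka', 'english_translation', 'explanation'."
+ )
+
+ def build_prompt(question: str) -> str:
+     """Wrap a user question in the instruction/input/response template."""
+     return (
+         "Below is an instruction that describes a task, paired with an input "
+         "that provides further context. Write a response that appropriately "
+         "completes the request.\n\n"
+         f"### Instruction:\n{INSTRUCTION}\n\n"
+         f"### Input:\n{question}\n\n"
+         "### Response:\n"
+     )
+ ```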
+
+ **Example:**
+ ```
+ Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ Provide a Vedic philosophical response based on ancient scriptures. Always provide JSON output with the following keys: 'sanskrit_shloka', 'english_translation', 'explanation'.
+
+ ### Input:
+ What is the Vedic perspective on forgiveness?
+
+ ### Response:
+ ```
+
+ The model will generate a JSON object as the response, for example:
+ ```json
+ {
+   "sanskrit_shloka": "क्षिप्रं हि मानुषे लोके सिद्धिर्भवति कर्मजा। ...",
+   "english_translation": "Success is quickly achieved in the human world by actions...",
+   "explanation": "According to the Mahabharata, forgiveness is considered a great virtue..."
+ }
+ ```
+
+ ## How to Run
+ You can load and run this model using any standard HuggingFace-compatible inference script or library. Here is a minimal example using the HuggingFace Transformers library in Python:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Load the tokenizer and model; half precision keeps GPU memory usage reasonable.
+ tokenizer = AutoTokenizer.from_pretrained("<your-model-path-or-hub-name>")
+ model = AutoModelForCausalLM.from_pretrained(
+     "<your-model-path-or-hub-name>", torch_dtype=torch.float16
+ ).to("cuda")
+
+ prompt = '''Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ Provide a Vedic philosophical response based on ancient scriptures. Always provide JSON output with the following keys: 'sanskrit_shloka', 'english_translation', 'explanation'.
+
+ ### Input:
+ What is the Vedic perspective on forgiveness?
+
+ ### Response:
+ '''
+
+ # Tokenize the prompt, generate, and decode the completion.
+ inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
+
+ Replace `<your-model-path-or-hub-name>` with the local path to this model directory or its name on the HuggingFace Hub once uploaded.
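+
+ Because the model was tuned with QLoRA, it can also be loaded in 4-bit to reduce GPU memory; the snippet below is only a sketch that assumes the `bitsandbytes` integration in Transformers and is not an officially tested configuration.
+
+ ```python
+ # Optional: 4-bit loading to cut memory use (requires the bitsandbytes package).
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.float16,
+ )
+ tokenizer = AutoTokenizer.from_pretrained("<your-model-path-or-hub-name>")
+ model = AutoModelForCausalLM.from_pretrained(
+     "<your-model-path-or-hub-name>",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ # Generation then proceeds exactly as in the example above.
+ ```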
+
+ ## Output Format
+ The model is designed to always return a JSON object with the following keys:
+ - `sanskrit_shloka`
+ - `english_translation`
+ - `explanation`
+
+ If the output is not valid JSON, you may need to post-process the string to extract the JSON part.
+
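+ A minimal, illustrative way to do that post-processing (the helper name is hypothetical, not part of the released code):
+
+ ```python
+ # Extract the JSON object between the first '{' and the last '}' in the model output.
+ import json
+ import re
+
+ def extract_response_json(generated: str) -> dict:
+     """Return the model's JSON payload as a dict, or raise ValueError."""
+     match = re.search(r"\{.*\}", generated, flags=re.DOTALL)
+     if match is None:
+         raise ValueError("No JSON object found in model output")
+     parsed = json.loads(match.group(0))
+     # Sanity-check that the three documented keys are present.
+     missing = {"sanskrit_shloka", "english_translation", "explanation"} - parsed.keys()
+     if missing:
+         raise ValueError(f"Missing keys in model output: {missing}")
+     return parsed
+ ```
+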
+ ## Citation
+ If you use this model or dataset, please cite:
+
+ [1] Aralikatte, R., et al. (2021). Itihasa Corpus: A Large-Scale, Synthetically Generated Dataset for Sanskrit-English Machine Translation. *arXiv preprint arXiv:2104.05561*.
+
+ ## License
+ This model and dataset are released under the Apache 2.0 license; see the repository for full details.