Bhagvad Corpus LLM (Instruction-tuned)

Abstract

Large language models (LLMs) are increasingly used to answer complex questions, yet they often fail to give philosophically sound answers grounded in scriptural traditions such as Vedic thought. This gap stems from a lack of specialized instruction-tuning datasets for such culturally specific contexts.

We present the Bhagvad Corpus, a synthetic dataset of approximately 90,000 examples built on the Itihasa corpus (Aralikatte et al., 2021)[1], which contains Sanskrit-English shloka pairs from the Mahabharata and Ramayana. Each instance in the Bhagvad Corpus comprises a synthetic user question, the original shloka, its translation, and a detailed explanation linking the verse to the intent of the query.

We demonstrate the dataset's usefulness by fine-tuning LLMs with QLoRA so that they produce structured, shloka-backed responses in formats such as JSON. Initial results show notable improvements in the relevance and depth of the generated philosophical responses. We release the Bhagvad Corpus publicly to support further research on culturally aware and spiritually aligned language models.

Model Usage

This model is instruction-tuned to generate Vedic philosophical responses as a JSON object with the following keys:

sanskrit_shloka: The relevant Sanskrit verse
english_translation: The English translation of the shloka
explanation: A detailed explanation connecting the shloka to the user's query

Prompt Template

To use the model, format your prompt as follows:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
 Provide a Vedic philosophical response based on ancient scriptures. Always provide JSON output with the following keys: 'sanskrit_shloka', 'english_translation', 'explanation'.

### Input:
 <YOUR_QUESTION_HERE>

### Response:

Example:

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
 Provide a Vedic philosophical response based on ancient scriptures. Always provide JSON output with the following keys: 'sanskrit_shloka', 'english_translation', 'explanation'.

### Input:
 What is the Vedic perspective on forgiveness?

### Response:

The model will generate a JSON object as the response, for example:

{
  "sanskrit_shloka": "क्षिप्रं हि मानुषे लोके सिद्धिर्भवति कर्मजा। ...",
  "english_translation": "Success is quickly achieved in the human world by actions...",
  "explanation": "According to the Mahabharata, forgiveness is considered a great virtue..."
}
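
The prompt template above can also be filled in programmatically. The following is a minimal sketch, not part of the released code: `PROMPT_TEMPLATE` and `build_prompt` are hypothetical names introduced here, and the template string is copied from the Prompt Template section, including the single leading space on the instruction and input lines.

```python
# Template text copied from the "Prompt Template" section. The leading space
# before the instruction and the question mirrors the template as documented.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n"
    " Provide a Vedic philosophical response based on ancient scriptures. "
    "Always provide JSON output with the following keys: "
    "'sanskrit_shloka', 'english_translation', 'explanation'.\n\n"
    "### Input:\n"
    " {question}\n\n"
    "### Response:\n"
)


def build_prompt(question: str) -> str:
    """Fill the template with one user question (hypothetical helper)."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return PROMPT_TEMPLATE.format(question=question)


if __name__ == "__main__":
    print(build_prompt("What is the Vedic perspective on forgiveness?"))
```

Keeping the template in one constant helps ensure that every request uses exactly the wording the model saw during instruction tuning.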

How to Run

You can load and run this model using any standard HuggingFace-compatible inference script or library. Here is a minimal example using the HuggingFace Transformers library in Python:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained("<your-model-path-or-hub-name>")
model = AutoModelForCausalLM.from_pretrained("<your-model-path-or-hub-name>").to("cuda")

prompt = '''Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
 Provide a Vedic philosophical response based on ancient scriptures. Always provide JSON output with the following keys: 'sanskrit_shloka', 'english_translation', 'explanation'.

### Input:
 What is the Vedic perspective on forgiveness?

### Response:
'''

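# Move the inputs to whichever device the model was loaded on.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# generate() returns the prompt followed by the completion; keep only the new tokens.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)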

Replace <your-model-path-or-hub-name> with the path to this model directory or the HuggingFace Hub name if uploaded.

Output Format

The model is trained to return a JSON object with the following keys:

sanskrit_shloka
english_translation
explanation

If the output is not valid JSON, you may need to post-process the string to extract the JSON part.
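
One possible way to do this post-processing is sketched below. It is an illustration under stated assumptions rather than the project's own parser: `extract_response_json` and `REQUIRED_KEYS` are hypothetical names, and the approach simply scans the decoded text for the first JSON object that contains all three expected keys, using only the Python standard library.

```python
import json
from typing import Optional

# Keys the model is expected to emit, as listed in the Output Format section.
REQUIRED_KEYS = ("sanskrit_shloka", "english_translation", "explanation")


def extract_response_json(text: str) -> Optional[dict]:
    """Return the first JSON object in `text` that has all required keys.

    Hypothetical helper: returns None when no such object can be parsed.
    """
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any text that follows it.
            candidate, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            candidate = None
        if isinstance(candidate, dict) and all(k in candidate for k in REQUIRED_KEYS):
            return candidate
        # Try the next opening brace, if any.
        start = text.find("{", start + 1)
    return None
```

Calling `extract_response_json(response)` on the decoded model output returns a dictionary when a complete object is found and `None` otherwise, so callers can decide whether to retry generation or report the failure.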

Citation

If you use this model or dataset, please cite:

[1] Aralikatte, R., de Lhoneux, M., Kunchukuttan, A., and Søgaard, A. (2021). Itihasa: A large-scale corpus for Sanskrit to English translation. In Proceedings of the 8th Workshop on Asian Translation (WAT2021). Association for Computational Linguistics.

License

This model and dataset are released for research purposes. Please see the repository for license details.
