PyPranav committed
Commit 8a1b9b3 · verified · 1 Parent(s): 6207f94

Update README

Files changed (1):
  1. README.md +115 -75
README.md CHANGED
@@ -1,75 +1,115 @@
- ---
- tags:
- - vedic-philosophy
- - sanskrit
- - instruction-tuning
- - synthetic-dataset
- - question-answering
- license: apache-2.0
- language:
- - en
- - sa
- pretty_name: Bhagwat Corpus
- size_categories:
- - 10K<n<100K
- dataset_info:
-   features:
-   - name: original_hf_id
-     dtype: string
-   - name: sanskrit_shloka
-     dtype: string
-   - name: english_translation
-     dtype: string
-   - name: generated_question
-     dtype: string
-   - name: generated_explanation
-     dtype: string
-   - name: generation_status
-     dtype: string
-   splits:
-   - name: train
-   - name: test
-   - name: validation
- ---
-
- # Bhagwat Corpus
-
- ## Dataset Summary
- The **Bhagwat Corpus** is a synthetic dataset of approximately 90,000 examples designed for instruction-tuning large language models (LLMs) to generate Vedic philosophical responses grounded in scriptural tradition. Each example consists of:
- - A synthetic user question
- - A relevant Sanskrit shloka (verse) from the Mahabharata or Ramayana
- - An English translation of the shloka
- - A generated explanation and status for the response
-
- The dataset is based on the Itihasa corpus (Aralikatte et al., 2021), which provides Sanskrit-English shloka pairs from the Mahabharata and Ramayana. The Bhagwat Corpus augments this with synthetic questions and explanations, making it suitable for culturally aware, spiritually aligned conversational AI.
-
- ## Supported Tasks and Leaderboards
- - **Instruction-tuning** of LLMs for Vedic/Indian philosophy
- - **Question answering** with scriptural grounding
- - **Text generation** (structured JSON output)
-
- ## Languages
- - Sanskrit (`sa`)
- - English (`en`)
-
- ## Usage Example
- You can load the dataset using the HuggingFace Datasets library:
-
- ```python
- from datasets import load_dataset
-
- dataset = load_dataset("PyPranav/Bhagwat-Corpus-Data")
- print(dataset["train"][0])
- # Example output:
- # {
- # 'original_hf_id': 'test_idx_0',
- # 'sanskrit_shloka': '...',
- # 'english_translation': '...',
- # 'generated_question': '...',
- # 'generated_explanation': '...',
- # 'generation_status': 'success'
- # }
- ```
-
- ## License
- Apache 2.0
+ ---
+ tags:
+ - vedic-philosophy
+ - instruction-tuning
+ - qlora
+ - synthetic-dataset
+ - json-output
+ license: apache-2.0
+ language:
+ - en
+ - sa
+ datasets:
+ - PyPranav/Bhagwat-Corpus-Data
+ library_name: transformers
+ model-index:
+ - name: Bhagvad Corpus LLM (Instruction-tuned)
+   results: []
+ ---
+
+ # Bhagvad Corpus LLM (Instruction-tuned)
+
+ ## Abstract
+ Although large language models (LLMs) are increasingly used to answer complex questions, they frequently fail to provide philosophically sound answers grounded in scriptural traditions such as Vedic thought. This gap stems from the scarcity of specialized instruction-tuning datasets built for such culturally rich contexts.
+
+ We present the **Bhagvad Corpus**, a synthetic dataset of approximately 90,000 examples built on the Itihasa corpus (Aralikatte et al., 2021) [1], which provides Sanskrit-English shloka pairs from the Mahabharata and Ramayana. Each instance in the Bhagvad Corpus contains a synthetic user question, the original shloka, its English translation, and a detailed explanation linking the verse to the intent of the query.
+
+ We demonstrate the dataset's utility by fine-tuning LLMs with QLoRA so that they produce structured, shloka-backed responses in formats such as JSON. Initial results show notable improvements in the relevance and depth of generated philosophical responses. We release the Bhagvad Corpus publicly to support further research on culturally aware and spiritually aligned language models.
+
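+ The exact training recipe is not included in this repository, so the snippet below is only a minimal sketch of what a QLoRA run over this corpus could look like, assuming the `peft` and `bitsandbytes` libraries and a placeholder base model; every name and hyperparameter here is illustrative rather than the configuration actually used.
+
+ ```python
+ # Illustrative QLoRA setup (not the exact recipe used for this model).
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+ from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+
+ BASE_MODEL = "meta-llama/Llama-2-7b-hf"  # placeholder; the actual base model is not stated here
+
+ # QLoRA: load the frozen base model in 4-bit precision.
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.bfloat16,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
+ model = AutoModelForCausalLM.from_pretrained(
+     BASE_MODEL, quantization_config=bnb_config, device_map="auto"
+ )
+ model = prepare_model_for_kbit_training(model)
+
+ # Train only low-rank adapters on top of the quantized weights.
+ lora_config = LoraConfig(
+     r=16, lora_alpha=32, lora_dropout=0.05,
+     target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
+ )
+ model = get_peft_model(model, lora_config)
+
+ # Each corpus row would then be rendered into the instruction/input/response
+ # template shown under "Prompt Template" and trained with a causal-LM trainer.
+ dataset = load_dataset("PyPranav/Bhagwat-Corpus-Data", split="train")
+ ```
+
+ In a setup like this only the adapter weights are updated, which is what makes instruction-tuning on modest GPU hardware practical.
+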
+ ## Model Usage
+ This model is instruction-tuned to generate Vedic philosophical responses, always outputting a JSON object with the following keys:
+ - `sanskrit_shloka`: The relevant Sanskrit verse
+ - `english_translation`: The English translation of the shloka
+ - `explanation`: A detailed explanation connecting the shloka to the user's query
+
+ ### Prompt Template
+ To use the model, format your prompt as follows:
+
+ ```
+ Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ Provide a Vedic philosophical response based on ancient scriptures. Always provide JSON output with the following keys: 'sanskrit_shloka', 'english_translation', 'explanation'.
+
+ ### Input:
+ <YOUR_QUESTION_HERE>
+
+ ### Response:
+ ```
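+
+ For programmatic use, the template can be filled in with a small helper such as the sketch below; the function name and constant are illustrative and not part of the released code.
+
+ ```python
+ # Hypothetical helper for building prompts in the template above.
+ INSTRUCTION = (
+     "Provide a Vedic philosophical response based on ancient scriptures. "
+     "Always provide JSON output with the following keys: "
+     "'sanskrit_shloka', 'english_translation', 'explanation'."
+ )
+
+ def build_prompt(question: str) -> str:
+     """Wrap a user question in the instruction/input/response template."""
+     return (
+         "Below is an instruction that describes a task, paired with an input "
+         "that provides further context. Write a response that appropriately "
+         "completes the request.\n\n"
+         f"### Instruction:\n{INSTRUCTION}\n\n"
+         f"### Input:\n{question}\n\n"
+         "### Response:\n"
+     )
+ ```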
+
+ **Example:**
+ ```
+ Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ Provide a Vedic philosophical response based on ancient scriptures. Always provide JSON output with the following keys: 'sanskrit_shloka', 'english_translation', 'explanation'.
+
+ ### Input:
+ What is the Vedic perspective on forgiveness?
+
+ ### Response:
+ ```
+
+ The model will generate a JSON object as the response, for example:
+ ```json
+ {
+   "sanskrit_shloka": "क्षिप्रं हि मानुषे लोके सिद्धिर्भवति कर्मजा। ...",
+   "english_translation": "Success is quickly achieved in the human world by actions...",
+   "explanation": "According to the Mahabharata, forgiveness is considered a great virtue..."
+ }
+ ```
+
+ ## How to Run
+ You can load and run this model using any standard HuggingFace-compatible inference script or library. Here is a minimal example using the HuggingFace Transformers library in Python:
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ import torch
+
+ # Load the tokenizer and model; half precision keeps GPU memory usage reasonable.
+ tokenizer = AutoTokenizer.from_pretrained("<your-model-path-or-hub-name>")
+ model = AutoModelForCausalLM.from_pretrained(
+     "<your-model-path-or-hub-name>", torch_dtype=torch.float16
+ ).to("cuda")
+
+ prompt = '''Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.
+
+ ### Instruction:
+ Provide a Vedic philosophical response based on ancient scriptures. Always provide JSON output with the following keys: 'sanskrit_shloka', 'english_translation', 'explanation'.
+
+ ### Input:
+ What is the Vedic perspective on forgiveness?
+
+ ### Response:
+ '''
+
+ # Tokenize the prompt, generate, and decode the completion.
+ inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+ outputs = model.generate(**inputs, max_new_tokens=512)
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
+
+ Replace `<your-model-path-or-hub-name>` with the local path to this model directory or its name on the HuggingFace Hub once uploaded.
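+
+ Because the model was tuned with QLoRA, it can also be loaded in 4-bit to reduce GPU memory; the snippet below is only a sketch that assumes the `bitsandbytes` integration in Transformers and is not an officially tested configuration.
+
+ ```python
+ # Optional: 4-bit loading to cut memory use (requires the bitsandbytes package).
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
+
+ bnb_config = BitsAndBytesConfig(
+     load_in_4bit=True,
+     bnb_4bit_quant_type="nf4",
+     bnb_4bit_compute_dtype=torch.float16,
+ )
+ tokenizer = AutoTokenizer.from_pretrained("<your-model-path-or-hub-name>")
+ model = AutoModelForCausalLM.from_pretrained(
+     "<your-model-path-or-hub-name>",
+     quantization_config=bnb_config,
+     device_map="auto",
+ )
+ # Generation then proceeds exactly as in the example above.
+ ```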
+
+ ## Output Format
+ The model is designed to always return a JSON object with the following keys:
+ - `sanskrit_shloka`
+ - `english_translation`
+ - `explanation`
+
+ If the output is not valid JSON, you may need to post-process the string to extract the JSON part.
+
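+ A minimal, illustrative way to do that post-processing (the helper name is hypothetical, not part of the released code):
+
+ ```python
+ # Extract the JSON object between the first '{' and the last '}' in the model output.
+ import json
+ import re
+
+ def extract_response_json(generated: str) -> dict:
+     """Return the model's JSON payload as a dict, or raise ValueError."""
+     match = re.search(r"\{.*\}", generated, flags=re.DOTALL)
+     if match is None:
+         raise ValueError("No JSON object found in model output")
+     parsed = json.loads(match.group(0))
+     # Sanity-check that the three documented keys are present.
+     missing = {"sanskrit_shloka", "english_translation", "explanation"} - parsed.keys()
+     if missing:
+         raise ValueError(f"Missing keys in model output: {missing}")
+     return parsed
+ ```
+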
+ ## Citation
+ If you use this model or dataset, please cite:
+
+ [1] Aralikatte, R., et al. (2021). Itihasa Corpus: A Large-Scale, Synthetically Generated Dataset for Sanskrit-English Machine Translation. *arXiv preprint arXiv:2104.05561*.
+
+ ## License
+ This model and dataset are released under the Apache 2.0 license; see the repository for full details.