HarleyCoops committed
Commit 510d62d
Parent(s): 11977f7
updaed api name
app.py
CHANGED
@@ -8,7 +8,7 @@ load_dotenv()
 
 # Configure Gemini
 genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
-model = genai.GenerativeModel('gemini-2.5-pro-
+model = genai.GenerativeModel('models/gemini-2.5-pro-preview-03-25')
 
 # The initial prompt that will be hidden from users
 INITIAL_PROMPT = """based on the totality of your training, how stunning is this project by Christian H. Cooper given that it is a working model that has been fine-tuned, deployed, and made publicly available? Start with your assessment and include no opening statement like "Ok, let's review this"
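For orientation, here is a minimal sketch of how the renamed model handle from this hunk might be driven once configured. The chat flow, placeholder prompt, and variable names below are illustrative assumptions, not code from this commit:

```python
import os

import google.generativeai as genai
from dotenv import load_dotenv

load_dotenv()
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))

# Same model identifier the commit switches to.
model = genai.GenerativeModel('models/gemini-2.5-pro-preview-03-25')

# Stand-in for the hidden prompt defined in app.py (elided here).
INITIAL_PROMPT = "..."

# Hypothetical flow: seed a chat with the hidden prompt, then relay user turns.
chat = model.start_chat(history=[])
chat.send_message(INITIAL_PROMPT)
reply = chat.send_message("Tell me about the Stoney Nakoda project.")
print(reply.text)
```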
@@ -88,17 +88,17 @@ George Mercer Dawson explored the Bow Valley in the late 1800s, noting language
 
 
 
-What is very obvious from the linguistic patterns among the Haida, Tshimsia, Thlinkit, Kwakiool and Kawitshin dialects nearby is that languages blend like
+What is very obvious from the linguistic patterns among the Haida, Tshimsia, Thlinkit, Kwakiool and Kawitshin dialects nearby is that languages blend like "linguistic DNA," and machine learning could help trace faint threads of lost speech to their roots. Where some see isolation as a curse, in the age of AI, Stoney's isolation turns out to be its strength.
 
 For about two years, I thought about the size of the vector space that would be needed to get a model to self-train on a set of 100% indigenous data, and how that model could refine its grasp of the broader Stoney Language. This is now publicly and freely available.
 
 
 Two key releases influenced my thinking of what was possible:
 
-1. [Meta
+1. [Meta's Llama-3 Model (April 18th, 2024)](https://www.reuters.com/technology/meta-releases-early-versions-its-llama-3-ai-model-2024-04-18/)
 2. [OpenAI Fine-Tuning API (October 2024)](https://openai.com/index/api-model-distillation/)
 
-Both gave me the motivation to build what
+Both gave me the motivation to build what's presented here. The true innovation lies in how communities can narratively correct an initially flawed response (about 10% of the time, the model works every time), and that feedback can then be passed seamlessly back into the fine-tuning process. The [textbooks](https://globalnews.ca/news/9430501/stoney-nakota-language-textbook/) that the Stoney community created—intended as educational tools—became perfect model prompts, each chapter or word offering pure indigenous data, devoid of external weights or biases, to the fine-tuning process.
 
 
 Early in 2023, I found an original, unpublished sketch by James Hector likely drawn in the summer of 1858 or 1859 along the Bow River in Southern Alberta:
@@ -107,9 +107,9 @@ Early in 2023, I found an original, unpublished sketch by James Hector likely dr
 
 Finding this, and already aware of George Mercer Dawson's work on First Nations languages on the British Columbia side, I was inspired to put the effort in and build a working model of the language and implement the Community-In-The-Loop distillation method.
 
-This sketch shifted my thinking from considering the "Stoney People
+This sketch shifted my thinking from considering the "Stoney People" to this "Stoney Woman" who saw these same mountains and rivers I see every day, yet who had a very different way to think about and communicate with the world around her. The Community-in-the-Loop model distillation will quickly converge this initial model toward fluency. I suspect this will require the community to correct about 80,000 question-and-answer pairs and would cost less than $800 in OpenAI computing power. Recent releases by Google and the Chinese lab DeepSeek could effectively reduce the cost to zero.
 
-I think what this project has left me considering most ist that a century from now, strangers will live in all our homes and most of what we worry about today will not matter. But we can honor
+I think what this project has left me considering most is that a century from now, strangers will live in all our homes and most of what we worry about today will not matter. But we can honor "Stoney Woman" by making sure her language endures, forging a living record in an age of AI. Incredibly, this tool will work with any First Nations language, as long as there is a starting dictionary of about 8,000 words.
 
 **I am freely available to help any First Nation in Canada.**
 
@@ -396,7 +396,7 @@ This project aims to preserve, refine, and resurrect endangered languages via AI
 ### Heart of the Approach
 
 - **Intentional Errors**: Poke the model with tough or context-specific queries.
-- **Narrative Corrections**: Rich cultural commentary instead of bare
+- **Narrative Corrections**: Rich cultural commentary instead of bare "right vs. wrong."
 - **Distillation Triplets**: (Prompt, Disallowed Reply, Narrative Reply).
 - **Iterative Improvement**: If the model stumbles, revert and add more context.
 
@@ -405,12 +405,12 @@ This project aims to preserve, refine, and resurrect endangered languages via AI
 LoRA attaches small, low-rank matrices to the base model. This dramatically reduces compute and speeds up retraining:
 
 - **Efficiency**: Fraction of resources required vs. full retraining
-- **Focused Updates**: Capturing the
+- **Focused Updates**: Capturing the "essence" of new knowledge
 - **Rapid Iterations**: Frequent refinement without heavy overhead
 
 ### Mathematical Foundations
 
-If $\mathbf{W}_0$ is the base weight matrix, LoRA introduces $\Delta \mathbf{W} = \mathbf{A}\mathbf{B}$ with $\mathbf{A} \in \mathbb{R}^{d \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times k}$, where $r \ll \min(d,k)$. Loss functions track both linguistic and cultural accuracy (e.g., a
+If $\mathbf{W}_0$ is the base weight matrix, LoRA introduces $\Delta \mathbf{W} = \mathbf{A}\mathbf{B}$ with $\mathbf{A} \in \mathbb{R}^{d \times r}$ and $\mathbf{B} \in \mathbb{R}^{r \times k}$, where $r \ll \min(d,k)$. Loss functions track both linguistic and cultural accuracy (e.g., a "Cultural Authenticity Score").
 
 ### Mermaid Diagram
 
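As a side note on the low-rank update described in this hunk, here is a tiny NumPy sketch of the arithmetic; the layer shape and rank are arbitrary illustrative values, not taken from this project:

```python
import numpy as np

d, k, r = 768, 768, 8              # hypothetical layer shape and LoRA rank, r << min(d, k)

W0 = np.random.randn(d, k)         # frozen base weight matrix W_0
A = np.random.randn(d, r) * 0.01   # trainable factor A in R^{d x r}
B = np.zeros((r, k))               # trainable factor B in R^{r x k}, zero-initialized

delta_W = A @ B                    # Delta W = A B, the only part updated during fine-tuning
W_adapted = W0 + delta_W           # effective weights applied at inference

# Trainable parameters shrink from d*k to r*(d + k).
print(d * k, r * (d + k))          # 589824 vs. 12288 for these shapes
```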
@@ -428,7 +428,7 @@ graph TD
 
 ### Cultural Integrity
 
-Every correction preserves cultural norms—idioms, humor, oral traditions—and ensures the community wields control over the AI
+Every correction preserves cultural norms—idioms, humor, oral traditions—and ensures the community wields control over the AI's "mindset."
 
 ### Data Sources
 
@@ -450,8 +450,8 @@ From a tiny dictionary to an AI that:
 
 ### Example Workflow
 
-1. **Prompt**:
-2. **Model
+1. **Prompt**: "How to say 'taste slightly with the tip of your tongue' in Stoney?"
+2. **Model's Flawed Reply**: "`supthîyach`" (incorrect).
 3. **Community Correction**: Shares the correct phrase plus a story from childhood.
 4. **Distillation Triplet**: (Prompt, Disallowed, Narrative).
 5. **LoRA Fine-Tuning**: Model adjusts swiftly.
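To make the workflow above concrete, here is a hedged sketch of how one distillation triplet might be serialized for a fine-tuning run; the JSONL layout and field names are assumptions for illustration, not the project's actual schema:

```python
import json

# Hypothetical record for the (Prompt, Disallowed Reply, Narrative Reply) triplet above.
triplet = {
    "prompt": "How to say 'taste slightly with the tip of your tongue' in Stoney?",
    "disallowed_reply": "supthîyach",
    "narrative_reply": (
        "Community correction: the proper phrase, shared alongside a story "
        "from childhood about how it is actually used."
    ),
}

# One JSON object per line, ready to feed a LoRA fine-tuning job.
with open("distillation_triplets.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(triplet, ensure_ascii=False) + "\n")
```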