ReySajju742 committed on
Commit
727afc4
·
verified ·
1 Parent(s): 39e5eb9

Update README.md

  1. README.md +184 -3
README.md CHANGED
@@ -1,3 +1,184 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ license: apache-2.0
3
+ language:
4
+ - ur
5
+ base_model:
6
+ - facebook/fasttext-km-vectors
7
+ tags:
8
+ - art
9
+ ---
10
+ ---
11
+ # Card Metadata (Optional but Recommended)
12
+ # You can fill these out directly in the Hugging Face UI or here.
13
+ # Language: ur (Urdu)
14
+ # Tasks:
15
+ # - word-embeddings
16
+ # Library:
17
+ # - fasttext
18
+ # Datasets:
19
+ # - [Specify your dataset name here, e.g., your-dataset-name-on-hf, or just 'Custom Corpus']
20
+ # Tags:
21
+ # - urdu
22
+ # - word-vectors
23
+ # - embeddings
24
+ # - fasttext
25
+ # - unsupervised
26
+ # - urdu-nlp
27
+ # License: [Specify your license here, e.g., mit, apache-2.0, cc-by-4.0]
28
+ ---
29
+
30
+ # Urdu Word Embeddings (fastText)
31
+
32
+ ## Model Description
33
+
34
+ This is an unsupervised word embedding model for the Urdu language, trained with the fastText library. It maps Urdu words to dense vectors that capture semantic and syntactic relationships learned from the contexts in which the words appear in the training data.
35
+
36
+ Unlike the original Word2Vec, which learns one vector per whole word, this fastText model also learns vectors for character n-grams (`minn=[Your minn]`, `maxn=[Your maxn]`) and represents a word through its n-grams. This is particularly useful for morphologically rich languages like Urdu, because it allows the model to:
37
+ - Learn representations for subword units.
38
+ - Generate meaningful vectors for words it hasn't seen during training (Out-of-Vocabulary or OOV words) by composing vectors from their character n-grams.
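The subword idea in the list above can be illustrated with a minimal sketch. It is a simplified illustration of how a word is split into character n-grams between `<` and `>` boundary markers, not the library's actual implementation. The function name `char_ngrams` is hypothetical, and the default lengths 3 and 6 are fastText's defaults, not necessarily the values used for this model.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the character n-grams of `word`, wrapped in '<' and '>' markers.

    Simplified illustration only: fastText additionally hashes each n-gram
    into a fixed number of buckets and looks up one vector per bucket.
    """
    wrapped = f"<{word}>"
    grams = []
    for n in range(minn, maxn + 1):
        # Slide a window of length n over the wrapped word.
        for start in range(len(wrapped) - n + 1):
            grams.append(wrapped[start:start + n])
    return grams


# An unseen word still shares n-grams with words seen in training,
# which is what lets the model build a vector for it.
print(char_ngrams("where", minn=3, maxn=3))
print(char_ngrams("کتاب", minn=3, maxn=4))
```

Python strings are indexed by code point, so the same function works for Urdu script without any special handling.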
39
+
40
+ The model outputs vectors of dimension `[Your vector_size]`.
41
+
42
+ ## Intended Use
43
+
44
+ This model is intended for use in various Urdu Natural Language Processing (NLP) tasks, including:
45
+ - Measuring semantic similarity between Urdu words.
46
+ - Using word vectors as features for downstream tasks such as text classification, clustering, or named entity recognition.
47
+ - Exploring word relationships and patterns within the vocabulary learned from the training corpus.
48
+ - Obtaining vector representations for potentially unseen words based on their subword components.
49
+
50
+ ## Training Data
51
+
52
+ This model was trained on a custom text corpus of Urdu sentences.
53
+
54
+ - **Dataset Source:** [Specify the source of your training data here. For example: "Collected from the COUNTER (COrpus of Urdu News TExt Reuse) dataset" or "A custom corpus gathered from [mention sources or domain]"].
55
+ - **Data Format:** The training data was processed into a single text file (`train.txt`) where each line represented a sentence or document, and words were separated by spaces.
56
+ - **Preprocessing:** Basic preprocessing was applied, including replacing common punctuation marks with spaces and normalizing whitespace. [Mention any other specific preprocessing steps you performed, e.g., lowercasing (less common for Urdu), handling numbers, removing specific symbols].
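The preprocessing described above (punctuation replaced by spaces, whitespace normalized) might look roughly like the following sketch. The punctuation set and the name `preprocess_line` are assumptions made for illustration; the card does not list the exact characters that were replaced.

```python
import re

# Assumed punctuation set: common ASCII marks plus the Urdu full stop,
# comma, question mark, and semicolon. The original list is not documented.
PUNCTUATION = ".,!?;:\"'()[]{}" + "۔،؟؛"
_PUNCT_RE = re.compile("[" + re.escape(PUNCTUATION) + "]")


def preprocess_line(line):
    """Replace punctuation with spaces, then collapse runs of whitespace."""
    no_punct = _PUNCT_RE.sub(" ", line)
    # split() with no arguments drops leading/trailing and repeated whitespace.
    return " ".join(no_punct.split())


print(preprocess_line("یہ  کتاب ہے۔"))
```

Applying this function to every line of the raw corpus and writing the results to `train.txt` would produce the one-sentence-per-line, space-separated format described above.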
57
+
58
+ [If your training data is publicly available or derived from a public source, provide a link or instructions on how others can access it.]
59
+ [If the data is private, state that the data itself cannot be shared but the resulting model is being released.]
60
+
61
+ ## Training Procedure
62
+
63
+ The model was trained using the unsupervised capabilities of the fastText library.
64
+
65
+ - **Algorithm:** Continuous Bag of Words (CBOW) model (`model=cbow`). [If you used `skipgram`, specify that instead and briefly explain why, e.g., "Skip-gram model (`model=skipgram`), often better for capturing representations of rare words."]
66
+ - **Parameters:** The following parameters were used during training:
67
+ - `dim`: `[Your vector_size]` (Vector dimensionality)
68
+ - `ws`: `[Your window_size]` (Context window size)
69
+ - `minCount`: `[Your min_word_count]` (Minimum word frequency to be included in vocabulary)
70
+ - `epoch`: `[Your epochs]` (Number of training epochs)
71
+ - `neg`: `[Your negative_samples]` (Number of negative samples)
72
+ - `minn`: `[Your minn]` (Minimum character n-gram length)
73
+ - `maxn`: `[Your maxn]` (Maximum character n-gram length)
74
+ - `thread`: 4 (Number of threads used)
75
+ - [List any other significant parameters you modified]
76
+
77
+ - **Training Environment:** The training was performed in a Google Colab environment.
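As a rough sketch of the procedure above, the parameters can be collected in a dictionary and passed to `fasttext.train_unsupervised`. Only `model="cbow"`, `thread=4`, and the `train.txt` file name come from this card. The other values are fastText's library defaults, used here as stand-ins because the card leaves the actual values as placeholders. The helper names `build_training_params` and `train` are hypothetical.

```python
def build_training_params(corpus_path="train.txt"):
    """Collect the keyword arguments for fasttext.train_unsupervised."""
    return {
        "input": corpus_path,  # one sentence or document per line
        "model": "cbow",       # stated in the card
        "dim": 100,            # stand-in: fastText default
        "ws": 5,               # stand-in: fastText default
        "minCount": 5,         # stand-in: fastText default
        "epoch": 5,            # stand-in: fastText default
        "neg": 5,              # stand-in: fastText default
        "minn": 3,             # stand-in: fastText default
        "maxn": 6,             # stand-in: fastText default
        "thread": 4,           # stated in the card
    }


def train(params):
    """Train an unsupervised fastText model from a parameter dictionary."""
    # Imported lazily so the parameters can be inspected without fastText.
    import fasttext

    return fasttext.train_unsupervised(**params)


# Example usage (requires the fasttext package and a train.txt file):
# model = train(build_training_params())
# model.save_model("urdu_fasttext.bin")
```

Keeping the parameters in one dictionary makes it easy to record the exact values in the model card once they are known.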
78
+
79
+ ## How to Use
80
+
81
+ You can load and query this model with the fastText Python library.
82
+
83
+ First, make sure you have fastText installed:
+ ```bash
+ pip install fasttext
+ ```
+
+ Then load the model and query it:
+
+ ```python
+ import fasttext
+ import numpy as np  # Used for the manual cosine similarity below
+
+ # Path to the downloaded .bin model file
+ model_path = "path/to/your/downloaded/urdu_fasttext.bin"
+
+ # Load the fastText model (load_model raises ValueError if the file cannot be opened)
+ try:
+     model = fasttext.load_model(model_path)
+     print("Model loaded successfully!")
+ except ValueError as e:
+     print(f"Error loading model: {e}")
+     print("Ensure the file exists and is a valid fastText binary model.")
+     model = None  # Set model to None if loading fails
+
+ if model is not None:
+     # --- Get Word Vector ---
+     word = "پاکستان"  # Example Urdu word
+     print(f"\nVector for '{word}':")
+     try:
+         vector = model.get_word_vector(word)
+         print(f"Shape: {vector.shape}")
+         print(f"First 10 dimensions: {vector[:10]}")
+     except ValueError as e:
+         print(f"Error getting vector for '{word}': {e}. Word might be too short or have no valid subwords.")
+
+     # --- Find Nearest Neighbors (Similar Words) ---
+     word_for_neighbors = "اردو"  # Example Urdu word
+     print(f"\nWords similar to '{word_for_neighbors}':")
+     try:
+         # Get the top 10 most similar words as (similarity, word) pairs
+         neighbors = model.get_nearest_neighbors(word_for_neighbors, k=10)
+         if neighbors:
+             print(neighbors)
+         else:
+             print(f"No similar words found for '{word_for_neighbors}'.")
+     except ValueError as e:
+         print(f"Error finding similar words for '{word_for_neighbors}': {e}. Word might not be valid.")
+
+     # --- Calculate Similarity Between Two Words (Manual Cosine Similarity) ---
+     word1 = "علم"  # Example word 1
+     word2 = "روشنی"  # Example word 2
+     print(f"\nSimilarity between '{word1}' and '{word2}':")
+     try:
+         vec1 = model.get_word_vector(word1)
+         vec2 = model.get_word_vector(word2)
+
+         # Cosine similarity = dot product divided by the product of the norms
+         norm1 = np.linalg.norm(vec1)
+         norm2 = np.linalg.norm(vec2)
+
+         if norm1 > 0 and norm2 > 0:
+             cosine_similarity = np.dot(vec1, vec2) / (norm1 * norm2)
+             print(f"Cosine similarity: {cosine_similarity}")
+         else:
+             print("Cannot compute similarity: zero vector detected for one or both words.")
+     except ValueError as e:
+         print(f"Error calculating similarity between '{word1}' and '{word2}': {e}. One or both words might not be valid.")
+
+     # --- Using the .vec file (Optional) ---
+     # The .vec file contains just the word vectors for words in the vocabulary.
+     # It can be loaded by other libraries like Gensim or spaCy.
+     # Note: This method *does not* utilize fastText's subword capabilities for OOV words.
+     # For fastText-specific features, use the .bin file.
+     # Example (using gensim - requires gensim installation):
+     # from gensim.models import KeyedVectors
+     # vec_file_path = "path/to/your/downloaded/urdu_fasttext.vec"
+     # try:
+     #     # Load vectors in Word2Vec text format
+     #     word_vectors = KeyedVectors.load_word2vec_format(vec_file_path, binary=False)
+     #     print(f"\nLoaded {len(word_vectors.key_to_index)} vectors from .vec file using Gensim.")
+     #     # Example: Find similar words using Gensim
+     #     # print(word_vectors.most_similar("اردو"))
+     # except Exception as e:
+     #     print(f"Error loading .vec file with Gensim: {e}")
+ else:
+     print("\nModel could not be loaded. Usage examples are skipped.")
+ ```
+
169
+
170
+ **Steps after creating the Model Card content:**
171
+
172
+ 1. **Create a Model Repository on Hugging Face:** Go to huggingface.co, log in, click your profile picture -> "New model".
173
+ 2. **Name your Model:** Choose a descriptive name (e.g., `urdu-fasttext-word-embeddings`).
174
+ 3. **Set Visibility:** Choose Public or Private.
175
+ 4. **Create Model:** This creates an empty repository.
176
+ 5. **Upload Files:** Go to the "Files" tab of your new repository. You can either:
177
+ * Click "Add file" and upload `urdu_fasttext.bin`, `urdu_fasttext.vec`, and your training script file.
178
+ * Or, clone the repository locally and push the files using Git.
179
+ 6. **Edit Model Card:** Go to the "Model card" tab. This is where you paste and format the content prepared above. You can edit it directly in the browser using Markdown.
180
+ 7. **Fill in Placeholders:** Go through the content and replace all `[ ... ]` placeholders with your specific details (vector size, epochs, dataset source, license, your name, etc.).
181
+ 8. **Format with Markdown:** Use the formatting options (headers, bold, code blocks) to make the card readable.
182
+ 9. **Save Model Card:** Save the changes.
183
+
184
+ Your model will then be available on Hugging Face with the documentation you've provided.
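The repository creation and upload steps above can also be scripted instead of done in the browser. The following is a minimal sketch using the `huggingface_hub` client, assuming you are already logged in (for example via `huggingface-cli login`). The helper names `plan_uploads` and `push_to_hub` and the example repository ID are hypothetical; the file names are the ones mentioned in the steps.

```python
from pathlib import Path


def plan_uploads(local_dir, filenames=("urdu_fasttext.bin", "urdu_fasttext.vec")):
    """Return (local path, path in repo) pairs for the files to upload."""
    base = Path(local_dir)
    return [(str(base / name), name) for name in filenames]


def push_to_hub(repo_id, local_dir, private=False):
    """Create the model repository if needed and upload the model files."""
    # Imported lazily so the upload plan can be checked without the library.
    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo(repo_id=repo_id, private=private, exist_ok=True)
    for local_path, path_in_repo in plan_uploads(local_dir):
        api.upload_file(
            path_or_fileobj=local_path,
            path_in_repo=path_in_repo,
            repo_id=repo_id,
        )


# Example usage (hypothetical repository ID; requires network access):
# push_to_hub("your-username/urdu-fasttext-word-embeddings", "models")
```

The model card itself still needs to be edited afterwards so that every `[ ... ]` placeholder is replaced with the real values.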