# Retrieval-based Learning Chatbot
CSC525 - Module 8 Option 2 - Retrieval-based Learning Chatbot - Joseph Armani
## Overview

A Python tool to generate high-quality dialogue variations.

This package automatically downloads the following models during installation (a manual download sketch follows the list):

- Universal Sentence Encoder v4 (TensorFlow Hub)
- ChatGPT Paraphraser T5-base
- Helsinki-NLP translation models (en-de, de-es, es-en)
- GPT-2 (for perplexity scoring)
- spaCy en_core_web_sm
- NLTK wordnet and averaged_perceptron_tagger_eng models
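
The sketch below shows one way these resources could be fetched ahead of time. The model identifiers come from the list above and the references; the package's own setup hooks may download them differently.

```python
# Sketch: pre-fetch the models listed above so later runs can work offline.
# Identifiers are taken from this README and its references; the package's
# actual setup logic may differ.
import nltk
import spacy.cli
import tensorflow_hub as hub
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    MarianMTModel,
    MarianTokenizer,
)

# Universal Sentence Encoder v4 (cached by TensorFlow Hub)
hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# ChatGPT Paraphraser T5-base
AutoTokenizer.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")
AutoModelForSeq2SeqLM.from_pretrained("humarin/chatgpt_paraphraser_on_T5_base")

# Helsinki-NLP translation models used for back-translation (en-de, de-es, es-en)
for pair in ("en-de", "de-es", "es-en"):
    name = f"Helsinki-NLP/opus-mt-{pair}"
    MarianTokenizer.from_pretrained(name)
    MarianMTModel.from_pretrained(name)

# GPT-2 for perplexity scoring
AutoTokenizer.from_pretrained("gpt2")
AutoModelForCausalLM.from_pretrained("gpt2")

# spaCy and NLTK resources
spacy.cli.download("en_core_web_sm")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger_eng")
```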
## Install package
pip install -e .
## Description
This Python script demonstrates a complete dialogue augmentation pipeline, covering validation, optimization, and augmentation.
It creates high-quality augmented versions of dialogues by applying various text augmentation techniques and quality control checks.
Two approaches are used for text augmentation: paraphrasing and back-translation (a minimal sketch of both follows this description). The pipeline also includes quality metrics for evaluating the augmented text.
Special handling is implemented for very short texts such as greetings and farewells, which are predefined and filtered for quality.
The pipeline is designed to process a dataset of dialogues and generate multiple high-quality augmented versions of each dialogue.
The pipeline ensures that duplicate dialogues are not generated and that the output meets quality thresholds for semantic similarity, grammar, fluency, diversity, and content preservation (quality-check sketches also follow below).
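
To make the two approaches concrete, the sketch below pairs the ChatGPT Paraphraser T5-base model with back-translation through the Helsinki-NLP en-de, de-es, and es-en models listed above. The function names and generation settings are illustrative assumptions, not this package's actual API.

```python
# Sketch of the two augmentation approaches: paraphrasing with the humarin
# T5 paraphraser and back-translation through the en -> de -> es -> en chain.
# Function names and generation settings are illustrative only.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, MarianMTModel, MarianTokenizer


def paraphrase(text: str, num_variations: int = 3) -> list[str]:
    """Generate paraphrases with the ChatGPT Paraphraser T5-base model."""
    name = "humarin/chatgpt_paraphraser_on_T5_base"
    tokenizer = AutoTokenizer.from_pretrained(name)  # loaded per call for brevity
    model = AutoModelForSeq2SeqLM.from_pretrained(name)
    inputs = tokenizer(f"paraphrase: {text}", return_tensors="pt")
    outputs = model.generate(
        **inputs,
        num_beams=5,                         # must be >= num_return_sequences
        num_return_sequences=num_variations,
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)


def _translate(text: str, model_name: str) -> str:
    """Translate with one Helsinki-NLP MarianMT model."""
    tokenizer = MarianTokenizer.from_pretrained(model_name)
    model = MarianMTModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]


def back_translate(text: str) -> str:
    """Round-trip English -> German -> Spanish -> English."""
    german = _translate(text, "Helsinki-NLP/opus-mt-en-de")
    spanish = _translate(german, "Helsinki-NLP/opus-mt-de-es")
    return _translate(spanish, "Helsinki-NLP/opus-mt-es-en")


if __name__ == "__main__":
    sentence = "Could you help me reschedule my appointment for next week?"
    print(paraphrase(sentence))
    print(back_translate(sentence))
```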
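
The semantic-similarity check might be sketched as below, assuming Universal Sentence Encoder embeddings compared with scikit-learn cosine similarity; the 0.8 threshold is a placeholder, not this project's tuned value.

```python
# Sketch: semantic-similarity quality check. Texts are embedded with the
# Universal Sentence Encoder v4 and compared with cosine similarity.
# The 0.8 threshold is a placeholder, not this project's tuned value.
import tensorflow_hub as hub
from sklearn.metrics.pairwise import cosine_similarity

encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")


def semantic_similarity(original: str, augmented: str) -> float:
    embeddings = encoder([original, augmented]).numpy()
    return float(cosine_similarity(embeddings[:1], embeddings[1:])[0, 0])


def passes_similarity_check(original: str, augmented: str, threshold: float = 0.8) -> bool:
    return semantic_similarity(original, augmented) >= threshold
```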
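
Likewise, the fluency and grammar checks could be sketched with GPT-2 perplexity and language-tool-python; the thresholds shown are placeholders rather than the package's actual settings.

```python
# Sketch: fluency via GPT-2 perplexity (lower is more fluent) and grammar via
# language-tool-python issue counts. Thresholds are illustrative placeholders.
import math

import language_tool_python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

grammar_tool = language_tool_python.LanguageTool("en-US")


def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2 (exp of the mean token loss)."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, labels=inputs["input_ids"])
    return math.exp(outputs.loss.item())


def grammar_error_count(text: str) -> int:
    """Number of grammar/style issues flagged by LanguageTool."""
    return len(grammar_tool.check(text))


def passes_quality_checks(text: str, max_perplexity: float = 200.0, max_errors: int = 1) -> bool:
    return perplexity(text) <= max_perplexity and grammar_error_count(text) <= max_errors
```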
## References
Accsany, P. (2024). Working with JSON data in Python. Real Python. <https://realpython.com/python-json/>

Explosion AI Team. (n.d.). spaCy · Industrial-strength natural language processing in Python. <https://spacy.io/>

GeeksforGeeks. (2024). Text augmentation techniques in NLP. GeeksforGeeks. <https://www.geeksforgeeks.org/text-augmentation-techniques-in-nlp/>

Helsinki-NLP. (2024). Opus-MT [Computer software]. GitHub. <https://github.com/Helsinki-NLP/Opus-MT>

Hugging Face. (n.d.). Transformers. Hugging Face. <https://huggingface.co/docs/transformers/en/index>

Humarin. (2023). ChatGPT paraphraser on T5-base [Computer software]. Hugging Face. <https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base>

Keita, Z. (2022). Data augmentation in NLP using back-translation with MarianMT. Towards Data Science. <https://towardsdatascience.com/data-augmentation-in-nlp-using-back-translation-with-marianmt-a8939dfea50a>

Memgraph. (2023). Cosine similarity in Python with scikit-learn. Memgraph. <https://memgraph.com/blog/cosine-similarity-python-scikit-learn>

Morris, J. (n.d.). language-tool-python (Version 2.8.1) [Computer software]. PyPI. <https://pypi.org/project/language-tool-python/>

TensorFlow. (n.d.). Universal sentence encoder. TensorFlow Hub. <https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder>

Waheed, A. (2023). How to calculate ROUGE score in Python. Python Code. <https://thepythoncode.com/article/calculate-rouge-score-in-python>
|