# Retrieval-based learning chatbot
CSC525 - Module 8 Option 2 - Retrieval-based Learning Chatbot - Joseph Armani
## TODO
A Python tool for generating high-quality dialogue variations.

This package automatically downloads the following models and data during installation:
- Universal Sentence Encoder v4 (TensorFlow Hub)
- ChatGPT Paraphraser T5-base
- Helsinki-NLP translation models (en-de, de-es, es-en)
- GPT-2 (for perplexity scoring)
- spaCy `en_core_web_sm`
- NLTK `wordnet` and `averaged_perceptron_tagger_eng` resources
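
GPT-2 is listed above for perplexity scoring. As a rough illustration of what that score means, the sketch below computes perplexity from per-token log-probabilities (natural log). The function name and the example values are hypothetical; in practice the log-probabilities would come from the language model rather than being typed in by hand.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp of the mean negative log-probability (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Four tokens, each assigned probability 0.5 by a hypothetical model.
example = [math.log(0.5)] * 4
print(perplexity(example))  # approximately 2.0
```

Lower perplexity means the model finds the text more predictable, which is why it can serve as a fluency signal.
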
## Install package

    pip install -e .

## Description
This Python script demonstrates a complete dialogue augmentation pipeline, covering validation, optimization, and augmentation.
It creates high-quality augmented versions of dialogues by applying text augmentation techniques followed by quality control checks.
Two augmentation approaches are used: paraphrasing and back-translation. Quality metrics are computed to evaluate the augmented text.
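
Back-translation can be sketched as a chain of translators applied in order. The example below is a minimal sketch, not the project's implementation: it assumes the chain runs English to German to Spanish and back to English (inferred from the listed en-de, de-es, es-en models) and that the models are loaded through the Hugging Face `transformers` translation pipeline. `load_translator` and `back_translate` are hypothetical helper names.

```python
from typing import Callable, Iterable

Translator = Callable[[str], str]

# Hugging Face model IDs assumed for the en-de, de-es, es-en chain.
MODEL_CHAIN = (
    "Helsinki-NLP/opus-mt-en-de",
    "Helsinki-NLP/opus-mt-de-es",
    "Helsinki-NLP/opus-mt-es-en",
)


def load_translator(model_name: str) -> Translator:
    """Wrap a transformers translation pipeline as a str -> str callable."""
    # Imported lazily: the first call downloads the model weights.
    from transformers import pipeline

    translate = pipeline("translation", model=model_name)
    return lambda text: translate(text)[0]["translation_text"]


def back_translate(text: str, translators: Iterable[Translator]) -> str:
    """Pass text through each translator in order and return the result."""
    for translate in translators:
        text = translate(text)
    return text


# Usage (requires transformers and a network connection):
#   chain = [load_translator(name) for name in MODEL_CHAIN]
#   print(back_translate("Could you help me book a flight?", chain))
```

Passing the translators in as plain callables keeps `back_translate` independent of any particular model library, so the chain can be swapped or stubbed out.
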
Very short text, such as greetings and farewells, is handled separately using predefined variations that are filtered for quality.
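
One way to picture the short-text handling is a lookup table consulted before any model-based augmentation. Everything in this sketch is an assumption made for illustration: the variation lists, the three-word cutoff, and the function name are hypothetical and may not match the project's actual data or thresholds.

```python
from typing import Dict, List, Optional

# Hypothetical predefined variations; the project's actual lists may differ.
PREDEFINED_VARIATIONS: Dict[str, List[str]] = {
    "hello": ["Hi", "Hello there", "Hey"],
    "goodbye": ["Bye", "See you later", "Take care"],
}

MAX_SHORT_WORDS = 3  # assumed cutoff for "very short" text


def short_text_variations(text: str) -> Optional[List[str]]:
    """Return predefined variations, or None if text is not a known short phrase."""
    normalized = text.strip().lower().rstrip("!.?")
    if len(normalized.split()) > MAX_SHORT_WORDS:
        return None
    return PREDEFINED_VARIATIONS.get(normalized)


print(short_text_variations("Hello!"))
print(short_text_variations("I would like to book a flight"))
```

A `None` result signals that the text should go through the regular paraphrasing and back-translation path instead.
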
The pipeline processes a dataset of dialogues and generates multiple high-quality augmented versions of each one.
It does not emit duplicate dialogues, and every output must meet quality thresholds for semantic similarity, grammar, fluency, diversity, and content preservation.
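
The duplicate check and the semantic similarity threshold can be illustrated with a small filter. This is a sketch under stated assumptions: `bag_of_words` is a toy stand-in for a real sentence encoder such as the Universal Sentence Encoder, the 0.5 threshold is arbitrary, and the other checks (grammar, fluency, diversity, content preservation) are left out.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

Vector = Dict[str, float]


def bag_of_words(text: str) -> Vector:
    """Toy embedding: lowercase token counts. Stand-in for a sentence encoder."""
    return dict(Counter(text.lower().split()))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_candidates(
    original: str,
    candidates: List[str],
    embed: Callable[[str], Vector] = bag_of_words,
    min_similarity: float = 0.5,  # assumed threshold, for illustration only
) -> List[str]:
    """Drop duplicates and candidates below the similarity threshold."""
    reference = embed(original)
    seen = {original.strip().lower()}  # the original itself counts as a duplicate
    kept = []
    for candidate in candidates:
        key = candidate.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        if cosine_similarity(reference, embed(candidate)) >= min_similarity:
            kept.append(candidate)
    return kept
```

Duplicates are detected case-insensitively before scoring, so a candidate that only changes capitalization is never counted as a new variation.
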
## References
Accsany, P. (2024). Working with JSON data in Python. Real Python. <https://realpython.com/python-json/>

Explosion AI Team. (n.d.). spaCy · Industrial-strength natural language processing in Python. <https://spacy.io/>

GeeksforGeeks. (2024). Text augmentation techniques in NLP. GeeksforGeeks. <https://www.geeksforgeeks.org/text-augmentation-techniques-in-nlp/>

Helsinki-NLP. (2024). Opus-MT [Computer software]. GitHub. <https://github.com/Helsinki-NLP/Opus-MT>

Hugging Face. (n.d.). Transformers. Hugging Face. <https://huggingface.co/docs/transformers/en/index>

Humarin. (2023). ChatGPT paraphraser on T5-base [Computer software]. Hugging Face. <https://huggingface.co/humarin/chatgpt_paraphraser_on_T5_base>

Keita, Z. (2022). Data augmentation in NLP using back-translation with MarianMT. Towards Data Science. <https://towardsdatascience.com/data-augmentation-in-nlp-using-back-translation-with-marianmt-a8939dfea50a>

Memgraph. (2023). Cosine similarity in Python with scikit-learn. Memgraph. <https://memgraph.com/blog/cosine-similarity-python-scikit-learn>

Morris, J. (n.d.). language-tool-python (Version 2.8.1) [Computer software]. PyPI. <https://pypi.org/project/language-tool-python/>

TensorFlow. (n.d.). Universal sentence encoder. TensorFlow Hub. <https://www.tensorflow.org/hub/tutorials/semantic_similarity_with_tf_hub_universal_encoder>

Waheed, A. (2023). How to calculate ROUGE score in Python. Python Code. <https://thepythoncode.com/article/calculate-rouge-score-in-python>