# Invoking more Creativity with Pawraphrases based on T5

## This micro-service finds paraphrases for a given text based on T5.



We explain how we fine-tune the T5 architecture with the PAWS dataset (both from Google) to gain the capability of creating paraphrases (or pawraphrases, since we are using the PAWS dataset :smile:). With this, we can create paraphrases for any given textual input.

Find the code for the service in this [GitHub repository](https://github.com/seduerr91/pawraphrase_public). In order to create your own __'pawraphrase tool'__, follow these steps:

### Step 1: Find a Useful Architecture and Datasets

Since Google's [T5](https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html) has been trained on multiple tasks (e.g., text summarization, question answering) and frames everything as a text-to-text problem, it is well suited to extending its task base by fine-tuning it on paraphrases.
Luckily, the [PAWS](https://github.com/google-research-datasets/paws) dataset consists of approximately 50,000 labeled paraphrases that we can use to fine-tune T5.

### Step 2: Prepare the PAWS Dataset for the T5 Architecture

Once identified, the PAWS dataset has to be prepared so that it can be fed into the T5 architecture for fine-tuning. Since PAWS comes with both paraphrases and non-paraphrases, it first needs to be filtered for paraphrases only. Next, the filtered pairs are packed into a Pandas DataFrame and given the table headers the training script expects. Finally, the resulting samples are split into train, test, and validation sets; a sketch of this preparation follows the screenshot below.



### Step 3: Fine-tune T5-base with PAWS

Next, following these [training instructions](https://towardsdatascience.com/paraphrase-any-question-with-t5-text-to-text-transfer-transformer-pretrained-model-and-cbb9e35f1555), which originally used the Quora dataset, we feed the PAWS dataset into T5 instead. Central is the code shown below, which ensures that T5 understands that it has to _paraphrase_. The adapted version can be found [here](https://github.com/seduerr91/pawraphrase_public/blob/master/t5_pawraphrase_training.ipynb).



Additionally, it is helpful to pin the older versions _torch==1.4.0_, _transformers==2.9.0_, and _pytorch_lightning==0.7.5_, since the newer versions break the training script (trust me, I am speaking from experience). When doing such a training, it is advisable to start with the smallest architecture (here, _T5-small_) and a small version of your dataset (e.g., 100 paraphrase examples) to quickly identify where the training may fail or stop.
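
Pinning those versions in one step:

```bash
pip install torch==1.4.0 transformers==2.9.0 pytorch_lightning==0.7.5
```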

### Step 4: Start Inference by Yourself

Next, you can use the fine-tuned T5 architecture to create paraphrases from any input, as seen in the introductory image. The corresponding code can be found [here](https://github.com/seduerr91/pawraphrase_public/blob/master/t5_pawraphrase_inference.ipynb), and a minimal sketch follows below.
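
A minimal sketch of such an inference call, assuming the fine-tuned checkpoint was saved locally (the checkpoint path and generation parameters are placeholders):

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the fine-tuned checkpoint (the path is a placeholder).
model = T5ForConditionalGeneration.from_pretrained("./t5_pawraphrase")
tokenizer = T5Tokenizer.from_pretrained("t5-base")
model.eval()

sentence = "This micro-service finds paraphrases for a given text."
input_ids = tokenizer.encode(
    "paraphrase: " + sentence + " </s>", return_tensors="pt"
)

# Sample several candidate paraphrases.
with torch.no_grad():
    outputs = model.generate(
        input_ids=input_ids,
        max_length=256,
        do_sample=True,
        top_k=120,
        top_p=0.95,
        num_return_sequences=3,
    )

for output in outputs:
    print(tokenizer.decode(output, skip_special_tokens=True))
```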

### Step 5: Use the Fine-Tuned Model through a GUI

Finally, to make the service useful, we can expose it as an API, as done with the infilling model [here](https://seduerr91.github.io/blog/ilm-fastapi), or through this [frontend](https://github.com/renatoviolin/T5-paraphrase-generation) prepared by Renato. Kudos!
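
For illustration, a minimal FastAPI wrapper around the inference code from Step 4 could look like this (the endpoint name and checkpoint path are assumptions, not taken from the linked implementations):

```python
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import T5ForConditionalGeneration, T5Tokenizer

app = FastAPI()

# Load the fine-tuned model once at startup (the path is a placeholder).
model = T5ForConditionalGeneration.from_pretrained("./t5_pawraphrase")
tokenizer = T5Tokenizer.from_pretrained("t5-base")

class ParaphraseRequest(BaseModel):
    text: str

@app.post("/paraphrase")
def paraphrase(request: ParaphraseRequest):
    input_ids = tokenizer.encode(
        "paraphrase: " + request.text + " </s>", return_tensors="pt"
    )
    outputs = model.generate(
        input_ids=input_ids,
        max_length=256,
        do_sample=True,
        top_k=120,
        top_p=0.95,
        num_return_sequences=3,
    )
    return {
        "paraphrases": [
            tokenizer.decode(o, skip_special_tokens=True) for o in outputs
        ]
    }
```

Served with, e.g., `uvicorn main:app`, this exposes the paraphraser to any frontend.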

Thank you for reading this article. I'd be curious to hear your opinion.

#### Who am I?

I am Sebastian, an NLP Deep Learning Research Scientist (M.Sc. in IT and Business). In my former life, I was a manager at Austria's biggest bank. In the future, I want to work remotely and flexibly in the field of NLP.

Drop me a message on [LinkedIn](https://www.linkedin.com/in/sebastianduerr/) if you want to get in touch!