Commit e25c9fb
Parent(s): 8fa24cb

Update README.md

README.md CHANGED
@@ -5,11 +5,11 @@ language:
 ---
 ## Model weights for Parallel Roberta-Large model ##
 
-We provide the [weights](https://huggingface.co/luffycodes/Parallel-Roberta-Large) for the
+We provide the [weights](https://huggingface.co/luffycodes/Parallel-Roberta-Large) for the Parallel Attention and Feedforward design (PAF) for RoBERTa-Large.
 
 To use this model, use the following [paf_modeling_roberta.py](https://github.com/luffycodes/Parallel-Transformers-Pytorch/blob/main/paf_modeling_roberta.py) file.
 
-Here is how to use this model to get the features of a given text in PyTorch
+## Here is how to use this model to get the features of a given text in PyTorch
 
 ```python
 from transformers import RobertaTokenizer
@@ -21,9 +21,24 @@ encoded_input = tokenizer(text, return_tensors='pt')
 output = model(**encoded_input)
 ```
 
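For reference, a complete version of the usage snippet above might look like the sketch below; the `RobertaModel` import from `paf_modeling_roberta.py` and the tokenizer checkpoint are assumptions for illustration, not the README lines elided by the diff.

```python
# Hedged sketch: assumes paf_modeling_roberta.py exposes a drop-in RobertaModel
# with the usual from_pretrained API; the README lines elided by the diff may differ.
from transformers import RobertaTokenizer
from paf_modeling_roberta import RobertaModel  # file linked in the README above

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")  # or the checkpoint's own tokenizer
model = RobertaModel.from_pretrained("luffycodes/parallel-roberta-large")

text = "Replace me with any text you'd like."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)  # output.last_hidden_state holds the token features
```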
+## Efficient GPU implementation
+[gpu_paf_modeling_roberta.py](https://github.com/luffycodes/Parallel-Transformers-Pytorch/blob/main/gpu_paf_modeling_roberta.py) provides an efficient GPU implementation of the PAF design for PyTorch.
+
+It combines the key, query, value, and first feed-forward network sub-layer (intermediate) computations into one:
+```
+self.kqv_ffn1.weight.data = torch.cat((attention.self.key.weight.data, attention.self.query.weight.data,
+                                       attention.self.value.weight.data,
+                                       intermediate.dense.weight.data))
+```
+However, I could not efficiently optimize the second feed-forward network sub-layer computation to run in parallel.
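This fusion works because, in the PAF design, the attention branch and the feed-forward branch read the same layer input. A hedged, self-contained sketch of the idea is below; the tensor names and the RoBERTa-large sizes (hidden 1024, intermediate 4096) are assumptions for illustration, not the repository's actual code.

```python
# Illustration of the fused key/query/value/FFN-1 projection (assumed shapes, not repo code).
import torch
import torch.nn.functional as F

hidden, intermediate = 1024, 4096

# Stand-ins for attention.self.{key,query,value}.weight and intermediate.dense.weight
key_w, query_w, value_w = (torch.randn(hidden, hidden) for _ in range(3))
ffn1_w = torch.randn(intermediate, hidden)

# One fused weight matrix, as in the kqv_ffn1 concatenation above
kqv_ffn1_w = torch.cat((key_w, query_w, value_w, ffn1_w))  # (3*hidden + intermediate, hidden)

x = torch.randn(2, 128, hidden)   # (batch, seq_len, hidden): the shared layer input
fused = F.linear(x, kqv_ffn1_w)   # one matmul instead of four separate ones
k, q, v, ffn1 = torch.split(fused, [hidden, hidden, hidden, intermediate], dim=-1)
```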
+
+## What is Parallel Attention and Feed-Forward Design?
+
+*On the left is the standard Series Attention and Feed-Forward Net Design (SAF) for transformer models. On the right is the Parallel Attention and Feed-Forward Net Design (PAF) used in transformer models like PaLM (Chowdhery et al., 2022) and Mesh-Transformers (Wang, 2021).*
+
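Schematically, the two designs differ in how a transformer layer combines its sub-layers. The stand-in modules below are a simplified sketch (no dropout, and LayerNorm placement differs between post-LN models like RoBERTa and pre-LN models like PaLM), not the repository's implementation.

```python
import torch
import torch.nn as nn

class SAFLayer(nn.Module):
    """Series (standard) design: the feed-forward block consumes the attention output."""
    def __init__(self, attn, ffn, hidden_size):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln1, self.ln2 = nn.LayerNorm(hidden_size), nn.LayerNorm(hidden_size)

    def forward(self, x):
        x = self.ln1(x + self.attn(x))  # attention sub-layer first ...
        x = self.ln2(x + self.ffn(x))   # ... then the feed-forward sub-layer
        return x

class PAFLayer(nn.Module):
    """Parallel design: attention and feed-forward both read the same layer input."""
    def __init__(self, attn, ffn, hidden_size):
        super().__init__()
        self.attn, self.ffn = attn, ffn
        self.ln = nn.LayerNorm(hidden_size)

    def forward(self, x):
        # Both branches depend only on x, so their input projections can be fused
        # (the kqv_ffn1 concatenation above) and the branches computed concurrently.
        return self.ln(x + self.attn(x) + self.ffn(x))

if __name__ == "__main__":
    hidden = 1024
    attn = nn.Linear(hidden, hidden)  # stand-in for a self-attention block
    ffn = nn.Sequential(nn.Linear(hidden, 4096), nn.GELU(), nn.Linear(4096, hidden))
    x = torch.randn(2, 128, hidden)
    print(PAFLayer(attn, ffn, hidden)(x).shape)  # torch.Size([2, 128, 1024])
```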
+## Evaluation results of [PAF-RoBERTa-Large](https://huggingface.co/luffycodes/parallel-roberta-large)
+
 When fine-tuned on downstream tasks, this model achieves the following results: