	Update main.py
    	
main.py CHANGED

@@ -822,8 +822,9 @@ def intro():
                             "Web datasets are inherently noisy and varied. The TxT360 pipeline implements sophisticated filtering and deduplication techniques to clean and remove redundancies while preserving data integrity."
                         ),
                         P(
-                            "Curated datasets are typically structured and consistently formatted. TxT360 filters these sources with selective steps to maintain their integrity while providing seamless integration into the larger dataset. Both data source types are globally deduplicated together resulting in 5.7T tokens of high-quality data."
+                            "Curated datasets are typically structured and consistently formatted. TxT360 filters these sources with selective steps to maintain their integrity while providing seamless integration into the larger dataset. Both data source types are globally deduplicated together resulting in 5.7T tokens of high-quality data. The table below shows the final source distribution of TxT360 tokens."
                         ),
+                        table_div_data,
                         P(
                             "We provide details and context for the choices behind TxT360 in the respective Web Data Processing and Curated Source Processing section. A deep dive in the deduplication [here]. "
                         ),
