	---
	license: other
	license_name: deepseek-license
	license_link: LICENSE
	tags:
	- Axolotl
	- Deepspeed
	datasets:
	- GusPuffy/python-decompiler-37-0.7-train
	---

	<p align="center">
	<img width="120px" alt="Sentient Simulations Plumbob" src="https://www.sentientsimulations.com/transparent-plumbob2.png">
	</p>
	<p align="center"><a href="https://www.sentientsimulations.com/">[🏠Sentient Simulations]</a> | <a href="https://discord.com/invite/JTjbydmUAp">[Discord]</a> | <a href="https://www.patreon.com/SentientSims">[Patreon]</a></p>
	<hr>

	# Sentient Simulations AI Python 3.7 Decompiler - 6.7b - v0.9

	### 1. Introduction

	The Sentient Simulations AI Python Decompiler is a [deepseek-ai/deepseek-coder-6.7b-base](https://huggingface.co/deepseek-ai/deepseek-coder-6.7b-base) finetune for the specific task of decompiling Python 3.7 bytecode back to its original Python source code.

	* Full model - [LINK](https://huggingface.co/GusPuffy/sentient-simulations-pydecompiler-3.7-6.7b-v0.9)
	* GGUF - [LINK](https://huggingface.co/GusPuffy/sentient-simulations-pydecompiler-3.7-6.7b-v0.9-GGUF)

	### 2. Data Preparation

	The training data was built from Python 3.7 source code that was compiled to bytecode. The bytecode serves as the input and the source code as the output, which teaches the model to generate the original source code from Python bytecode. Below are the steps to prepare the data.

	1. Grab a ton of Python code, or use something like [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2), and compile it with the version of Python you want to target
	1. Throw out any code that doesn't compile under that version of Python
	1. Remove all comments from the code
	1. Format all the code using [Python black](https://github.com/psf/black) for consistency
	1. Format the bytecode in a way that reduces tokens and is easier for the AI to read. I tried custom tokens initially but got inconsistent results
	1. Generate input/output pairs for the training data
	1. Use [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl) sample packing to prepare the data with a constant context of 16k tokens
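
The first few steps above (compile check, comment removal, bytecode formatting, pair generation) can be sketched with the standard library alone. This is only an illustration: the function names and the compact `OPNAME arg` text format are assumptions, not the format used for the actual dataset, and the black formatting step is left out.

```python
import dis
import io
import tokenize


def strip_comments(source: str) -> str:
    """Remove comment tokens while leaving all other tokens in place."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    kept = [tok for tok in tokens if tok.type != tokenize.COMMENT]
    return tokenize.untokenize(kept)


def compact_bytecode(code) -> str:
    """Render a code object, and any nested code objects, as 'OPNAME arg' lines.

    Hypothetical format: one header line per code object, one line per instruction.
    """
    lines = [f"{code.co_name}:"]
    for ins in dis.get_instructions(code):
        # Refer to nested code objects by name so no memory address leaks in.
        arg = ins.argval.co_name if hasattr(ins.argval, "co_code") else ins.argrepr
        lines.append(f"{ins.opname} {arg}".rstrip())
    for const in code.co_consts:
        if hasattr(const, "co_code"):
            lines.append(compact_bytecode(const))
    return "\n".join(lines)


def make_pair(source: str):
    """Return {'input': bytecode, 'output': source}, or None if it does not compile."""
    try:
        cleaned = strip_comments(source)
        code = compile(cleaned, "<sample>", "exec")
    except (SyntaxError, ValueError, tokenize.TokenError):
        return None  # throw out code that does not compile on this interpreter
    return {"input": compact_bytecode(code), "output": cleaned}
```

Run this under the Python version you are targeting, since `compile` and `dis` always reflect the interpreter that executes them.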

	### 4. Training
	The model was trained for 4 days on three 3090 GPUs using [DeepSpeed ZeRO-3 Offload](https://www.deepspeed.ai/2021/03/07/zero3-offload.html) at 16k context.

	### 5. Prerequisites
	1. Create a Python 3.7 environment to get the bytecode
	1. Create a Python 3.10 environment to run the decompiler
	1. Download the Python files and the GGUF file
	```
	conda create -n pydecompiler-37 python=3.7 -y
	conda create -n pydecompiler-310 python=3.10 -y
	conda activate pydecompiler-310
	pip install huggingface-hub
	mkdir pydecompiler
	cd pydecompiler
	huggingface-cli download GusPuffy/sentient-simulations-pydecompiler-3.7-6.7b-v0.9 --local-dir . --local-dir-use-symlinks False --include "*.py"
	huggingface-cli download GusPuffy/sentient-simulations-pydecompiler-3.7-6.7b-v0.9-GGUF --local-dir . --local-dir-use-symlinks False --include "sentient-simulations-pydecompiler-3.7-6.7b-v0.9-q8_0.gguf"
	```
	4. [Install llama-cpp-python](https://github.com/abetlen/llama-cpp-python?tab=readme-ov-file#installation). Make sure to use the install prefix that matches the backend you want, whether that is a GPU backend or CPU only

	### 6. Test Example
	1. Convert a Python file to source and bytecode using Python 3.7
	```
	conda activate pydecompiler-37
	python bytecode.py bytecode.py > bytecode-decompiled.pycb
	```
	2. The bytecode has been written to bytecode-decompiled.pycb. Now switch to the Python 3.10 environment to run the decompiler on the test example
	```
	conda activate pydecompiler-310
	python decompile.py bytecode-decompiled.pycb
	```
	3. Compare the contents of the AI decompiled code in 'bytecode-decompiled.py' with the actual contents of 'bytecode.py'
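
For the comparison in step 3, a plain text diff will flag harmless formatting differences. One possible alternative, sketched below, is to compare the parsed syntax trees, which ignores whitespace and comments (including the trailing finish-reason comment). The helper name `same_structure` is made up for this example and is not part of the downloaded scripts.

```python
import ast
from pathlib import Path


def same_structure(original: str, decompiled: str) -> bool:
    """True when both sources parse to the same AST (formatting and comments ignored)."""
    try:
        return ast.dump(ast.parse(original)) == ast.dump(ast.parse(decompiled))
    except SyntaxError:
        # Output that does not parse cannot match the original.
        return False


def same_structure_files(original_path, decompiled_path) -> bool:
    """Convenience wrapper that reads both files from disk."""
    original = Path(original_path).read_text(encoding="utf-8")
    decompiled = Path(decompiled_path).read_text(encoding="utf-8")
    return same_structure(original, decompiled)
```

For the test example this would be called as `same_structure_files("bytecode.py", "bytecode-decompiled.py")`. Docstrings are part of the syntax tree, so a missing docstring counts as a difference.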

	### 7. Decompilation of a Directory of .pyc files
	If you have a bunch of files you want to decompile, you can run the following commands to decompile the entire directory.

	1. Convert the .pyc files to bytecode strings using Python 3.7
	```
	conda activate pydecompiler-37
	python convert_pyc_to_bytecode.py directory_with_files
	```
	2. Decompile the bytecode strings back to Python source code (this will take some time, depending on how many files there are)
	```
	conda activate pydecompiler-310
	python decompile.py directory_with_files
	```
	Note that files over 15k tokens are skipped.
	Files over 10k tokens will most likely be truncated due to the context limit.
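
Since the model targets Python 3.7 bytecode only, it may be worth confirming that the `.pyc` files in a directory were produced by Python 3.7 before converting them. The sketch below checks the four-byte magic number at the start of each file; CPython 3.7 uses 3394. The helper names are hypothetical, and whether `convert_pyc_to_bytecode.py` performs a similar check is not something this README states.

```python
from pathlib import Path

# CPython 3.7 magic number (3394), stored little-endian and followed by b"\r\n".
PY37_MAGIC = (3394).to_bytes(2, "little") + b"\r\n"


def is_py37_pyc(path) -> bool:
    """True when the file starts with the CPython 3.7 .pyc magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(4) == PY37_MAGIC


def find_non_py37(directory):
    """List .pyc files under `directory` that were not compiled by Python 3.7."""
    return sorted(p for p in Path(directory).rglob("*.pyc") if not is_py37_pyc(p))
```

Calling `find_non_py37("directory_with_files")` returns the files that are likely to decompile poorly because they come from a different Python version.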

	At the end of the decompiled source code is a note indicating whether the AI decided it was complete (`stop`) or the context limit was reached (`length`):
	```
	# Finish Reason: stop
	# Finish Reason: length
	```
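
After decompiling a large directory, the finish-reason note makes it possible to list the files that were cut off. The sketch below assumes the note is the last non-empty line of each decompiled `.py` file and uses exactly the format shown above; the function names are made up for this example.

```python
from pathlib import Path

MARKER = "# Finish Reason:"


def finish_reason(path):
    """Return the finish reason recorded at the end of a decompiled file, or None."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line in reversed(lines):
        stripped = line.strip()
        if stripped.startswith(MARKER):
            return stripped[len(MARKER):].strip()
        if stripped:
            break  # the last non-empty line is not a finish-reason note
    return None


def truncated_files(directory):
    """List decompiled .py files whose output stopped at the context limit."""
    return sorted(
        p for p in Path(directory).rglob("*.py") if finish_reason(p) == "length"
    )
```

Files returned by `truncated_files("directory_with_files")` are incomplete and need manual review.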

	### 8. Next Iteration

	The dataset has shown pretty amazing results for decompiling files that fit within the 16k context using only a 6.7b model.

	I would like to train a longer context model or the 33b version of DeepSeek Coder.

	Let me know if you have compute available and you are interested in training a longer context version of this tool!

	### 9. License
	This is a fine-tune of DeepSeek Coder, so check out their license [LICENSE-MODEL](https://github.com/deepseek-ai/deepseek-coder/blob/main/LICENSE-MODEL) for more details.

	### 10. Contact

	If you have any questions, please raise an issue or find me on [Discord](https://discord.com/invite/JTjbydmUAp).