---
license: llama3.1
inference: false
fine-tuning: false
tags:
- llama3.1
base_model: meta-llama/Llama-3.1-70B-Instruct
pipeline_tag: text-generation
library_name: transformers
---

# NoxtuaCompliance

Noxtua-Compliance-70B-V1 is a specialized large language model designed for legal compliance applications. It is fine-tuned from the Llama-3.1-70B-Instruct model on a custom dataset of legal cases so that it understands more complex contexts and delivers precise results when analyzing complex legal issues.

## Model details

- Model Name: Noxtua-Compliance-70B-V1
- Base Model: Llama-3.1-70B-Instruct
- Parameter Count: 70 billion

## Run with vLLM

```bash
docker run --runtime nvidia --gpus=all -v ~/.cache/huggingface:/root/.cache/huggingface -p 8000:8000 --ipc=host vllm/vllm-openai:v0.6.6.post1 --model ACATECH/ncos --tensor-parallel-size=2 --disable-log-requests --max-model-len 120000 --gpu-memory-utilization 0.95
```

## Use with transformers

See the snippet below for usage with Transformers:

```python
import torch
import transformers

model_id = "ACATECH/ncos"

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token_id = tokenizer.eos_token_id

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    max_new_tokens=1024,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are an intelligent AI assistant in the legal domain called Noxtua NCOS from the company Xayn. You will assist the user with care, respect and professionalism. Always answer in the same language as the question. Freely use legal jargon."},
    {"role": "user", "content": "Carry out an entire authority check of the following text."},
]

print(pipeline(messages))
```

Please consider setting `temperature=0` to get consistent outputs.

### Framework versions

- Transformers 4.47.1
- PyTorch 2.5.1+cu121

## Recommended Hardware

Running this model requires two or more 80GB GPUs (e.g., NVIDIA A100) and at least 150GB of free disk space.

### Quantization

If hardware constraints prevent loading the full model, quantization can reduce memory requirements. Note, however, that quantization inherently discards model information.

> **Warning**: We do not recommend quantized usage. Expert-curated domain knowledge embedded in the model may be lost, degrading performance in critical tasks.

If you proceed with quantization, we recommend the GGUF format. It enables out-of-core quantization, which is essential when RAM is a limiting factor.

To convert the model to GGUF, use the `llama.cpp` tools (tested with release `b5233`). Due to the model's custom setup, use the legacy conversion script, which includes the required `--vocab-type` flag.

```
python ./llama.cpp/examples/convert_legacy_llama.py ./ncos_model_directory/ --outfile ncos.gguf --vocab-type bpe
```

Once converted, the model can be quantized without fully loading it into RAM.

> **Info**: To choose the right quantization scheme for your use case, please read up on the different kinds of quantization and the parameters of each option. Some methods can use example data to guide the quantization of the model, which helps avoid losing information that is relevant to your intended application.
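As a minimal sketch of such a calibration-guided run, the commands below first build an importance matrix from example data and then use it during quantization. This assumes the `llama-imatrix` and `llama-quantize` binaries built from the same `llama.cpp` release (`b5233`); `calibration.txt` is a hypothetical plain-text file of representative legal prompts that you supply, and flag names can differ between releases, so check each tool's `--help` for your build.

```
# Sketch of an importance-matrix ("informed") quantization with llama.cpp.
# calibration.txt is a hypothetical file of representative prompts you provide.

# 1) Collect activation statistics from the calibration data.
./llama.cpp/llama-imatrix -m ./ncos.gguf -f ./calibration.txt -o ./ncos-imatrix.dat

# 2) Quantize, using the importance matrix to better preserve the weights
#    that matter most for the calibration data.
./llama.cpp/llama-quantize --imatrix ./ncos-imatrix.dat ./ncos.gguf ./ncos-q4_k_m.gguf q4_k_m
```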
For demonstration, the following command performs a plain, uninformed 4-bit quantization using the `q4_0` method:

```
./llama.cpp/llama-quantize ./ncos.gguf ./ncos-q4_0.gguf q4_0
```

The resulting 4-bit version of the model is roughly 40GB in size and can run on hardware below the recommended specification described above. If you run the model in CPU-only mode, i.e. without a GPU, it even fits consumer setups with around 50GB of RAM. Other quantization options may reduce the size further.

In addition to running the model with Transformers or vLLM, as described above, you can also deploy it on-premise using Ollama (tested with version v0.6.7). After setting up an Ollama Modelfile that matches your use case (the preferred system prompt and some additional settings can be found in the model's configuration files), you can register the model with Ollama like this:

```
ollama create ncos-q4_0 -f ./ncos-gguf/Modelfile
```
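For reference, a minimal Modelfile might look like the sketch below. The system prompt and the `temperature 0` setting are taken from the Transformers example above; the `FROM` path is an assumption and should point at your quantized GGUF file.

```
# Minimal example Modelfile (sketch); adapt the path and prompt to your setup.
FROM ./ncos-q4_0.gguf

# Deterministic outputs, as recommended above.
PARAMETER temperature 0

SYSTEM """You are an intelligent AI assistant in the legal domain called Noxtua NCOS from the company Xayn. You will assist the user with care, respect and professionalism. Always answer in the same language as the question. Freely use legal jargon."""
```

Afterwards, you can start an interactive session with `ollama run ncos-q4_0`.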