Model Card

This model is a fine-tuned version of meta-llama/Llama-3.2-3B-Instruct, specifically trained to evaluate investment offers and make optimal deal decisions in simulated Shark Tank scenarios. It processes company details (background, financials, sales, initial ask) and investor offers, then generates a reasoned decision, including accepting a specific deal with terms or declining all offers, following a structured format.

The model was initially fine-tuned using Supervised Fine-Tuning (SFT) on a custom dataset derived from Shark Tank transcripts. Subsequently, it underwent Reinforcement Learning (RL) training using Group Relative Policy Optimization (GRPO) with the trl library. GRPO optimizes the model's policy based on rewards calculated across multiple generated outputs, guided by custom reward functions designed to enforce specific structural, formatting, and content requirements. The Unsloth library was used for efficient loading and inference.

Uses

This model is intended for simulating negotiation outcomes, analyzing potential deal structures, and exploring decision-making logic in investment scenarios based on the Shark Tank format.

# !pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
# !pip install --no-deps trl peft accelerate bitsandbytes

from unsloth import FastLanguageModel
import torch
from transformers import TextStreamer

max_seq_length = 3000 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# IMPORTANT: Replace with your actual model ID
model_id = "VaidikML0508/Shark-Tank-Offer-Evaluator-llama3.2-3B-Instruct-SFT-DPO-4bits-V1"

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = model_id,
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# Define the system prompt and input template
# (applied below via the tokenizer's chat template)
SYS_PROMPT = """You are the founder of a company, pitching your business on *Shark Tank* to secure the best possible deal from the sharks. Your goal is to make the best decision for your company by evaluating shark offers, negotiating effectively, and choosing the most beneficial deal—or deciding to walk away if necessary.

You must structure your response within `<reasoning>` tags explaining your thought process, followed by `<answer>` tags containing the final decision.

### **Final Decision Format within `<answer>`:**
Once all sharks have made their offers, analyze them and decide.

#### **Accepted Offer:**
If a deal is accepted, the response *must* include the investment amount, the equity offered, and any special conditions, structured using these exact special tokens:
`<|accepted_offer|> <|money|> [AMOUNT] <|end_money|> for <|equity|> [PERCENTAGE] <|end_equity|> % <|shark_pitch|> [SHARK'S PITCH SUMMARY] <|what_makes_shark_to_offer|> [REASON SHARK OFFERED] <|condition|> [ANY CONDITIONS, optional] <|endoftext|>`

##### **Example:**
`<answer><|accepted_offer|> <|money|> 150000 <|end_money|> for <|equity|> 17.5 <|end_equity|> % <|shark_pitch|> Blake offered his expertise and resources... <|what_makes_shark_to_offer|> Blake was impressed by Carson's knowledge... <|condition|> Contingent on securing a line of credit. <|endoftext|></answer>`

#### **No Deal:**
If no deal is made, the response *must* explain the reasoning within `<reasoning>` and the `<answer>` block must *only* contain:
`<|accepted_offer|> <|No_Deal|> <|endoftext|>`

##### **Example:**
`<answer><|accepted_offer|> <|No_Deal|> <|endoftext|></answer>`
"""

# Format the company data as shown below.
company_input_data = """<|company_name|> GreenGrow Planters <|endoftext|>
<|company_background|> GreenGrow Planters is an eco-friendly gardening solution that transforms household food waste into nutrient-rich compost. Their patented self-watering planter system uses a special filtration method that accelerates the composting process while eliminating odors. The company has also developed companion products including GreenGrow Sprouts for seedlings and GreenGrow XL for larger plants, all using their proprietary biodegradable materials. <|endoftext|>
<|sales_details|> The company has generated $340,000 in sales over the past three years, with $180,000 in the last year alone. They project $500,000 in sales for the coming year. Currently, 85% of sales come from their e-commerce platform and 15% from specialty garden stores. The product is available in 1,200 retail locations through partnerships with sustainable living retailers. <|endoftext|>
<|financials|> The standard GreenGrow Planter costs $4.75 to manufacture and ships for $8.50 wholesale, retailing for $19.99. The GreenGrow Sprouts starter kit costs $2.25 to manufacture, wholesales for $5.99, and retails for $12.99. The GreenGrow XL costs $7.50 to manufacture, wholesales for $14.99, and retails for $29.99. <|endoftext|>
<|initial_ask|> <|money|> 250000 <|end_money|> for <|equity|> 15 <|end_equity|> % <|endoftext|>

Shark Offers:
<|shark_offer|> <|money|> 300000 <|end_money|> for <|equity|> 30 <|end_equity|> % <|shark_pitch|> Lori offered her QVC connections and retail expertise to scale the business quickly, promising to make GreenGrow a household name within a year. <|what_makes_shark_to_offer|> Lori loved the sustainability angle and believed the product would resonate strongly with her customer base. <|endoftext|>
<|shark_offer|> <|money|> 250000 <|end_money|> for <|equity|> 20 <|end_equity|> % <|shark_pitch|> Mark proposed a strategic partnership focusing on improving the technology and expanding the product line with smart garden features. <|what_makes_shark_to_offer|> Mark was impressed by the innovation and saw potential to integrate IoT technology into future versions. <|endoftext|>
<|shark_offer|> <|money|> 200000 <|end_money|> for <|equity|> 15 <|end_equity|> % plus $2 royalty until $400,000 is recouped <|shark_pitch|> Kevin offered less equity but added a royalty structure to protect his investment while allowing the founders to maintain more control. <|what_makes_shark_to_offer|> Kevin appreciated the solid margins and wanted to structure a deal that would ensure quick returns while incentivizing growth. <|endoftext|>
<|shark_offer|> <|money|> 250000 <|end_money|> for <|equity|> 25 <|end_equity|> % <|shark_pitch|> Robert offered to leverage his connections in the home improvement sector to get the product into major retailers nationwide. <|what_makes_shark_to_offer|> Robert connected with the founders' passion and saw a clear path to scaling through his existing retail relationships. <|endoftext|>"""

# Apply chat template
messages = [
    {'role': 'system', 'content': SYS_PROMPT},
    {"role" : "user", "content" : f"Details are:\n\n{company_input_data}"},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")

# Generate response
streamer = TextStreamer(tokenizer)
outputs = model.generate(**inputs, streamer=streamer, max_new_tokens=1024, do_sample=True, temperature=0.1, top_p=0.95, pad_token_id=tokenizer.eos_token_id)

# # To get the full output string without streaming:
# outputs = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.1, top_p=0.95, pad_token_id=tokenizer.eos_token_id)
# decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=False) # Keep special tokens to see the structure
# print(decoded_output.split("<|start_header_id|>assistant<|end_header_id|>")[-1]) # Extract the assistant's reply (Llama 3 chat template header)
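To use the model programmatically, the final decision can be pulled out of the generated text. The helper below is a minimal sketch (not part of the original training or inference code): parse_decision is a hypothetical function that extracts the <answer> block with regular expressions and reads the money/equity terms, assuming the model followed the format described in the system prompt.

import re

def parse_decision(generated_text):
    """Hypothetical helper: extract the <answer> block and its deal terms.

    Assumes the model followed the format enforced during training.
    """
    answer_match = re.search(r"<answer>(.*?)</answer>", generated_text, re.DOTALL)
    if answer_match is None:
        return {"status": "unparseable"}
    answer = answer_match.group(1)
    if "<|No_Deal|>" in answer:
        return {"status": "no_deal"}
    money = re.search(r"<\|money\|>\s*([\d.]+)\s*<\|end_money\|>", answer)
    equity = re.search(r"<\|equity\|>\s*([\d.]+)\s*<\|end_equity\|>", answer)
    return {
        "status": "accepted",
        "money": float(money.group(1)) if money else None,
        "equity_pct": float(equity.group(1)) if equity else None,
    }

# Example usage:
# print(parse_decision(tokenizer.decode(outputs[0], skip_special_tokens=False)))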

Example Output (Post-GRPO Training):

<reasoning>
I have thoroughly evaluated all the offers from the sharks, considering their implications for GreenGrow Planters' future, financial health, and strategic goals. Lori's offer of $300,000 for 30% equity seems aggressive, given our current valuation of $250,000. While her QVC connections and retail expertise could help scale the business quickly, I'm concerned that she may be overpaying for our equity. Mark's proposal for a strategic partnership focusing on improving technology and expanding the product line with smart garden features is intriguing, but his offer of $250,000 for 20% equity seems more reasonable. Kevin's offer of $200,000 for 15% equity plus a royalty structure until $400,000 is recouped is more in line with our valuation, and his focus on quick returns and incentivizing growth aligns with our goals. Robert's offer of $250,000 for 25% equity is also reasonable, but his reliance on his existing retail relationships may not be as effective in expanding our reach as Lori's QVC connections. I must weigh the pros and cons of each offer carefully to make an informed decision.
</reasoning>
<answer>
<|accepted_offer|> <|money|> 200000 <|end_money|> for <|equity|> 15 <|end_equity|> % <|shark_pitch|> Kevin offered less equity but added a royalty structure to protect his investment while allowing the founders to maintain more control. <|what_makes_shark_to_offer|> Kevin appreciated the solid margins and wanted to structure a deal that would ensure quick returns while incentivizing growth. <|condition|> I accept Kevin's offer, as it aligns with our valuation and provides a clear path to scaling the business while maintaining control. <|endoftext|>
</answer>

Example Output (Pre-GRPO Training):

<reasoning>
The offers from the sharks are as follows:
- Lori offered $300,000 for 30% equity, which is a valuation of $1 million. This is a significant increase from our initial valuation of $1.67 million ($250,000 / 15% equity).
- Mark offered $250,000 for 20% equity, which is a valuation of $1.25 million. This is also higher than our initial valuation.
- Kevin offered $200,000 for 15% equity, plus a $2 royalty until $400,000 is recouped. This is a lower valuation, but the royalty structure may provide a steady stream of income.
- Robert offered $250,000 for 25% equity, which is a valuation of $1 million. This is also higher than our initial valuation.

Considering the offers, Lori and Mark's valuations are higher than our initial valuation, but they also offer more equity. Kevin's offer is lower, but the royalty structure may provide a steady stream of income. Robert's offer is similar to Lori's, but he has existing connections in the home improvement sector.

We need to weigh the pros and cons of each offer. Lori and Mark's valuations are higher, but they also require more equity. Kevin's offer is lower, but the royalty structure may provide a steady stream of income. Robert's offer is similar to Lori's, but he has existing connections in the home improvement sector.

We should consider the potential for growth and the value of the equity we're giving up. We should also consider the potential for the sharks to provide additional resources and expertise to help us grow the business.

Ultimately, we need to decide which offer is the best for our business. We should consider our options carefully and make a decision that aligns with our goals and values.

<answer>
I will consider all the offers carefully and make a decision based on the pros and cons of each. I will also consider the potential for growth and the value of the equity we're giving up. I will decide whether to accept one of the offers, decline all the offers, or negotiate further.

Training Details

Training Data

The model's training involved datasets constructed from raw Shark Tank episode transcripts. Preprocessing included:

  • Text Cleaning: Removing irrelevant content and transcription noise.
  • Entity Extraction: Identifying company details, financials, offers, and conditions.
  • Data Structuring: Formatting the data into an instruction-style format (e.g., Alpaca) for an initial Supervised Fine-Tuning (SFT) phase. This SFT model served as the base for the subsequent RL training; a hypothetical example record is sketched below.
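For illustration, a single SFT record in an Alpaca-style instruction format could look roughly like the following; the field names and content are hypothetical and do not come from the actual dataset.

# Hypothetical SFT record (illustrative only; the schema is an assumption,
# not the actual dataset format).
sft_example = {
    "instruction": SYS_PROMPT,        # the founder/decision-making system prompt shown above
    "input": company_input_data,      # company details plus shark offers, using the special tokens
    "output": (
        "<reasoning> ...the founder's analysis of each offer... </reasoning>\n"
        "<answer><|accepted_offer|> <|money|> 200000 <|end_money|> for "
        "<|equity|> 15 <|end_equity|> % <|shark_pitch|> ... "
        "<|what_makes_shark_to_offer|> ... <|condition|> ... <|endoftext|></answer>"
    ),
}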

Training Procedure: GRPO

This model was refined using Group Relative Policy Optimization (GRPO), a Reinforcement Learning technique implemented via the trl library. Unlike DPO, which uses pairwise preferences, GRPO optimizes the model by generating multiple candidate responses (num_generations) for a given prompt and evaluating them collectively with a set of reward functions. Gradients are computed from these group rewards to update the model's policy, encouraging generations that score highly across the desired metrics (structure, format, content accuracy, reasoning quality).

LoRA (Low-Rank Adaptation) with a rank of 32 was used during GRPO training for parameter-efficient fine-tuning.
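The snippet below is a minimal sketch of attaching a rank-32 LoRA adapter with Unsloth before GRPO training; the target_modules, lora_alpha, and other settings shown are common defaults and should be read as assumptions rather than the exact values used for this model.

# Minimal sketch: attach a rank-32 LoRA adapter with Unsloth before GRPO training.
# target_modules, lora_alpha, and the remaining settings are typical choices, not confirmed values.
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,                       # LoRA rank used for this model
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    lora_alpha = 32,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
)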

Reward Functions Used in GRPO

Four distinct reward functions were designed to guide the GRPO training process:

  1. Structure Reward (structure_reward_func):

    • Purpose: Ensures the output strictly follows the <reasoning>...</reasoning><answer>...</answer> structure.
    • Logic: Assigns positive points for the presence of <reasoning>, </reasoning>, <answer>, and </answer> tags. It gives a bonus if they appear in the correct sequence (<reasoning> before </reasoning>, which is before <answer>, which is before </answer>). It applies significant penalties if any text appears before <reasoning>, between </reasoning> and <answer>, or after </answer>.
    • Scale: Aims for a score between 0.0 (poor structure) and 1.0 (perfect structure).
  2. Answer Format Reward (answer_format_reward_func):

    • Purpose: Validates the format within the <answer> tags, ensuring it matches either the "Accepted Offer" or "No Deal" template using the specific <|...|> tokens.
    • Logic: Checks if the content inside <answer> perfectly matches the required pattern for an accepted deal (including <|money|>, <|equity|>, <|shark_pitch|>, etc.) or the pattern for a no deal (<|No_Deal|>). If not a perfect match, it grants partial credit for the presence of key required tags like <|accepted_offer|>, <|endoftext|>, <|money|>, <|equity|>, <|No_Deal|>, etc.
    • Scale: Aims for a score between 0.0 (incorrect format) and 1.0 (perfect format), with partial scores in between.
  3. Offer Content Reward (offer_reward_func):

    • Purpose: Evaluates whether the content (specific deal terms or 'No Deal') within a well-formatted <answer> block matches the ground truth or expected answer.
    • Logic: This reward is heavily dependent on the answer_format_reward_func. If the format score is high (e.g., > 0.8), it compares the extracted answer content to the true answer. A perfect match gets a high reward (contributing +1.0). If the format is perfect but the content is wrong, it gets reward only for the format (contributing +0.0 for content). If the format score is low, this reward function gives minimal or zero points for content, focusing only on rewarding based on the partial format score achieved.
    • Scale: Aims for a score roughly between 0.0 and 2.0 (up to 1.0 for format correctness from answer_format_reward_func, plus up to 1.0 for content match if format is good).
  4. Reasoning Quality Reward (reasoning_quality_reward_func):

    • Purpose: Provides a basic heuristic measure of the quality of the text within the <reasoning> tags.
    • Logic: Primarily rewards the presence and length of the reasoning content (up to a certain cap, e.g., 500 characters). Longer, non-empty reasoning gets a higher score. It gives a small penalty if the <reasoning> tags are present but contain no actual text (empty or just whitespace). It may give a small bonus for using first-person pronouns ("I", "my") to encourage the persona.
    • Scale: Aims for a score mostly between 0.0 and 1.0, though a small negative score is possible for empty reasoning blocks before normalization.

These reward functions collectively guide the GRPO algorithm to produce outputs that are structurally sound, correctly formatted, contain accurate information (when applicable), and include plausible reasoning.
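As a concrete illustration, here is a simplified sketch of how the first of these, structure_reward_func, could be written as a trl-compatible reward function. It mirrors the logic described above (points per tag, an ordering bonus, penalties for stray text), but the weights are assumptions and this is not the exact implementation used in training.

def structure_reward_func(completions, **kwargs):
    """Simplified sketch of the structure reward described above.

    Treats each completion as a plain string; the exact weights are assumptions.
    """
    rewards = []
    for text in completions:
        score = 0.0
        tags = ["<reasoning>", "</reasoning>", "<answer>", "</answer>"]
        positions = [text.find(t) for t in tags]
        # Points for the presence of each structural tag.
        score += 0.15 * sum(p != -1 for p in positions)
        # Bonus when all four tags appear in the correct order.
        if -1 not in positions and positions == sorted(positions):
            score += 0.4
        # Penalties for stray text before <reasoning>, between the blocks, or after </answer>.
        if positions[0] > 0 and text[:positions[0]].strip():
            score -= 0.3
        if -1 not in positions and text[positions[1] + len("</reasoning>"):positions[2]].strip():
            score -= 0.3
        if positions[3] != -1 and text[positions[3] + len("</answer>"):].strip():
            score -= 0.3
        rewards.append(max(0.0, min(1.0, score)))
    return rewards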

GRPO Training Configuration

The model was trained using trl's GRPOConfig with the following key settings (a code sketch follows the list):

  • Learning Rate: 5e-6
  • Optimizer: adamw_torch_fused
  • LR Scheduler: cosine with warmup_ratio = 0.1
  • Weight Decay: 0.1
  • Batch Size: 2 (per device)
  • Gradient Accumulation: 1
  • Number of Generations per Prompt: 4 (for group reward calculation)
  • Max Prompt Length: [Value Used, e.g., 2000] (as max_prompt_length)
  • Max Completion Length: [Value Used, e.g., 1000] (calculated as max_seq_length - max_prompt_length)
  • Max Steps: 400
  • Save Steps: 50
  • Max Gradient Norm: 0.1
  • Reporting: wandb
  • LoRA Rank (r): 32 (specified separately, likely via PEFT config)
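The sketch below approximates this configuration with trl's GRPOConfig and GRPOTrainer. The output_dir, the max_prompt_length / max_completion_length values, and the train_dataset variable are placeholders, and the four reward functions are assumed to be defined as described in the previous section.

from trl import GRPOConfig, GRPOTrainer

# Sketch approximating the configuration listed above. Placeholder values
# (output_dir, max_prompt_length, max_completion_length, train_dataset) are assumptions.
training_args = GRPOConfig(
    learning_rate = 5e-6,
    optim = "adamw_torch_fused",
    lr_scheduler_type = "cosine",
    warmup_ratio = 0.1,
    weight_decay = 0.1,
    per_device_train_batch_size = 2,
    gradient_accumulation_steps = 1,
    num_generations = 4,                # candidates per prompt for the group reward
    max_prompt_length = 2000,           # placeholder, see above
    max_completion_length = 1000,       # placeholder, max_seq_length - max_prompt_length
    max_steps = 400,
    save_steps = 50,
    max_grad_norm = 0.1,
    report_to = "wandb",
    output_dir = "outputs",             # placeholder
)

trainer = GRPOTrainer(
    model = model,
    processing_class = tokenizer,
    reward_funcs = [
        structure_reward_func,
        answer_format_reward_func,
        offer_reward_func,
        reasoning_quality_reward_func,
    ],
    args = training_args,
    train_dataset = train_dataset,      # placeholder: the SFT-derived prompt dataset
)
trainer.train()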

Qualitative Results

Training reward plots from the GRPO run:

(image: training reward curves)

Disclaimers

This model is trained on data from the Shark Tank television show and simulated scenarios. It may generate inaccurate, biased, or unrealistic financial assessments or negotiation outcomes. The model's outputs are intended for illustrative or entertainment purposes and should not be used as financial advice or the sole basis for real-world investment decisions. Always conduct independent due diligence and consult with qualified financial professionals. Use this model responsibly.
