---
language: en
tags:
- text-classification
- multilabel-classification
- housing
- climate-change
- sustainability
- solar-energy
license: mit
---

# Solar Energy Classifier (DistilBERT)

This model classifies Reddit content about solar power, collected from climate change subreddits, into seven topic labels.

## Model Details

- Model Type: DistilBERT
- Task: Multilabel text classification
- Sector: Solar Energy
- Base Model: DistilBERT base uncased
- Labels: 7
- Training Data: Sample from 1,000 Reddit posts from climate subreddits (2010-2023), labeled by GPT-4o-mini

## Labels

The model predicts 7 labels simultaneously:

1. **Decommissioning And Waste**: Talks about end-of-life panel/turbine disposal, recycling, or landfill issues.
2. **Foreign Dependence And Trade**: References Chinese panel dominance, tariffs, trade wars, or reshoring supply chains.
3. **Grid Stability And Storage**: Discusses intermittency, batteries, pumped hydro, or grid reliability with high renewables.
4. **Land Use**: Raises land-area or space requirements, farmland loss, or siting footprint of solar/wind.
5. **Local Economy**: Claims solar/wind projects create or harm local jobs, investment, or economic growth.
6. **Subsidy And Tariff Debate**: Argues over feed-in tariffs, net-metering rules, or subsidy fairness.
7. **Utility Bills**: Mentions household or community electricity bills going up or down due to solar/wind.

Note: Label order in predictions matches the order above.

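Because predictions follow the order listed above, a raw score vector can be paired with label names by position. The following is a minimal sketch, assuming the scores arrive as a flat sequence of seven floats; `LABELS` and `scores_to_dict` are illustrative names and are not part of the model repository.

```python
# Label names in the order stated by this model card.
LABELS = [
    "Decommissioning And Waste",
    "Foreign Dependence And Trade",
    "Grid Stability And Storage",
    "Land Use",
    "Local Economy",
    "Subsidy And Tariff Debate",
    "Utility Bills",
]


def scores_to_dict(scores):
    """Pair each score with its label by position (illustrative helper)."""
    if len(scores) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(scores)}")
    return {label: float(score) for label, score in zip(LABELS, scores)}


# Made-up scores, for illustration only (not real model output).
example = scores_to_dict([0.02, 0.01, 0.10, 0.03, 0.20, 0.15, 0.90])
print(max(example, key=example.get))  # prints "Utility Bills"
```

When the checkpoint is loaded as in the Usage section, `checkpoint['label_names']` is the safer source of label names, since it travels with the weights.
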
## Usage

```python
import os
import sys
import tempfile

import torch
from transformers import DistilBertTokenizer
from huggingface_hub import snapshot_download

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


def print_sorted_label_scores(label_scores):
    # Sort label_scores dict by score descending
    sorted_items = sorted(label_scores.items(), key=lambda x: x[1], reverse=True)
    for label, score in sorted_items:
        print(f" {label}: {score:.6f}")


# Model link and examples for this specific model
model_link = 'sanchow/solar_energy-distilbert-classifier'
examples = [
    "Solar panels on rooftops can significantly reduce electricity bills."
]

print(f"\n{'='*60}")
print("MODEL: SOLAR ENERGY SECTOR")
print(f"{'='*60}")

print(f"Downloading model: {model_link}")
with tempfile.TemporaryDirectory() as temp_dir:
    snapshot_download(
        repo_id=model_link,
        local_dir=temp_dir,
        local_dir_use_symlinks=False
    )
    model_class_path = os.path.join(temp_dir, 'model_class.py')
    if not os.path.exists(model_class_path):
        print("model_class.py not found in downloaded files")
        print(f" Available files: {os.listdir(temp_dir)}")
    else:
        # The custom model class ships with the repository, so make it importable.
        sys.path.insert(0, temp_dir)
        from model_class import MultilabelClassifier

        tokenizer = DistilBertTokenizer.from_pretrained(temp_dir)
        # weights_only=False unpickles arbitrary objects; only load checkpoints you trust.
        checkpoint = torch.load(os.path.join(temp_dir, 'model.pt'), map_location='cpu', weights_only=False)
        model = MultilabelClassifier(checkpoint['model_name'], len(checkpoint['label_names']))
        model.load_state_dict(checkpoint['model_state_dict'])
        model.to(device)
        model.eval()
        print("Model loaded successfully")
        print(f" Labels: {checkpoint['label_names']}")
        print("\nSolar Energy classifier results:\n")
        for i, test_text in enumerate(examples):
            inputs = tokenizer(
                test_text,
                return_tensors="pt",
                truncation=True,
                max_length=512,
                padding=True
            ).to(device)
            with torch.no_grad():
                outputs = model(**inputs)
            # If the model returns a tuple or list, the scores are its first element.
            if isinstance(outputs, (tuple, list)):
                outputs = outputs[0]
            predictions = outputs.cpu().numpy()
            label_scores = {label: float(score) for label, score in zip(checkpoint['label_names'], predictions[0])}
            print(f"Example {i+1}: '{test_text}'")
            print("Predictions (all label scores, highest first):")
            print_sorted_label_scores(label_scores)
            print("-" * 40)
```

## Performance

Best model performance:
- Micro Jaccard: 0.4106
- Macro Jaccard: 0.5228
- F1 Score: 0.8590
- Accuracy: 0.8590

Dataset: ~900 GPT-labeled samples per sector (600 train, 150 validation, 150 test)

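The micro and macro Jaccard figures differ because they average in different places: micro pools intersections and unions across all labels before dividing, while macro computes one Jaccard score per label and then averages them. The sketch below illustrates that distinction with NumPy on a toy example. It is not the evaluation code used for this model, and `jaccard_scores` is a hypothetical helper name.

```python
import numpy as np


def jaccard_scores(y_true, y_pred):
    """Return (micro, macro) Jaccard for binary multilabel matrices.

    Rows are samples, columns are labels. Labels with an empty union
    are scored 0 here, which is one possible convention among several.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)
    intersection = (y_true & y_pred).sum(axis=0)  # per-label true positives
    union = (y_true | y_pred).sum(axis=0)         # per-label TP + FP + FN

    total_union = union.sum()
    micro = intersection.sum() / total_union if total_union else 0.0

    per_label = np.divide(
        intersection, union,
        out=np.zeros(len(union), dtype=float),
        where=union > 0,
    )
    macro = per_label.mean()
    return float(micro), float(macro)


# Toy data: 3 samples, 3 labels (made up for illustration).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0]]
micro, macro = jaccard_scores(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")  # micro=0.60 macro=0.50
```

In the toy data the per-label scores are 1.0, 0.5, and 0.0, so the macro average is 0.5, while pooling gives 3 shared positives over a union of 5, or 0.6.
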
## Optimal Thresholds

```python
optimal_thresholds = {'Decommissioning And Waste': 0.37254738295870854, 'Foreign Dependence And Trade': 0.37613221483784043, 'Grid Stability And Storage': 0.43063579501768967, 'Land Use': 0.2008681860202493, 'Local Economy': 0.3853212494245655, 'Subsidy And Tariff Debate': 0.42756546792925043, 'Utility Bills': 0.3370254357621166}

# `checkpoint` and `predictions` come from the Usage section.
label_names = checkpoint['label_names']
for label, score in zip(label_names, predictions[0]):
    threshold = optimal_thresholds.get(label, 0.5)
    if score > threshold:
        print(f"{label}: {score:.3f}")
```

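For more than one text at a time, the same per-label comparison can be vectorized. This is a minimal sketch, assuming scores are arranged as a samples-by-labels array in the label order given in this card; `apply_thresholds` is a hypothetical helper, and the demo uses only two labels with thresholds rounded from the values above.

```python
import numpy as np


def apply_thresholds(scores, label_names, thresholds, default=0.5):
    """Return, for each row of scores, the labels whose score exceeds its threshold."""
    scores = np.atleast_2d(np.asarray(scores, dtype=float))
    # One cutoff per label, falling back to `default` for unknown labels.
    cutoffs = np.array([thresholds.get(name, default) for name in label_names])
    mask = scores > cutoffs  # broadcasts the cutoffs across rows
    return [
        [name for name, keep in zip(label_names, row) if keep]
        for row in mask
    ]


# Two-label demo with rounded thresholds and made-up scores.
demo_labels = ["Land Use", "Utility Bills"]
demo_thresholds = {"Land Use": 0.20, "Utility Bills": 0.34}
demo_scores = [[0.25, 0.30], [0.10, 0.90]]
print(apply_thresholds(demo_scores, demo_labels, demo_thresholds))
# [['Land Use'], ['Utility Bills']]
```

A score of 0.30 for Utility Bills would pass a flat 0.25 cutoff but not the label's own 0.34 threshold, which is the reason to keep thresholds per label.
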
## Training

The model was trained on GPT-labeled Reddit data in five steps:
1. Data collection from climate subreddits
2. Keyword-based filtering for sector-specific content
3. GPT labeling for multilabel classification
4. 80/10/10 train/validation/test split
5. Fine-tuning with threshold optimization

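Steps 4 and 5 can be sketched in a few lines. This card does not say how the split was drawn or which objective the threshold optimization used, so the shuffled split and the per-label grid search that maximizes F1 on validation scores are both assumptions, and `split_80_10_10`, `f1`, and `best_threshold` are hypothetical names.

```python
import numpy as np


def split_80_10_10(n_samples, seed=0):
    """Shuffle sample indices and cut them into 80/10/10 train/val/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = (n_samples * 8) // 10
    n_val = n_samples // 10
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


def f1(y_true, y_pred):
    """Binary F1 from boolean arrays; returns 0.0 when there are no positives at all."""
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def best_threshold(y_true, scores, grid=np.linspace(0.05, 0.95, 19)):
    """Pick the grid threshold that maximizes F1 for a single label (assumed objective)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    return float(max(grid, key=lambda t: f1(y_true, scores > t)))


train_idx, val_idx, test_idx = split_80_10_10(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100

# Made-up validation labels and scores for one label.
threshold = best_threshold([0, 0, 1, 1], [0.10, 0.28, 0.42, 0.80])
print(f"{threshold:.2f}")
```

Running `best_threshold` once per label on validation scores would yield a dictionary shaped like `optimal_thresholds`.
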
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{solar_energy_distilbert_classifier,
  title={Solar Energy Classifier for Climate Change Analysis},
  author={Sandeep Chowdhary},
  year={2025},
  publisher={Hugging Face},
  journal={Hugging Face Hub},
  howpublished={\url{https://huggingface.co/echoboi/solar_energy-distilbert-classifier}},
}
```

## Limitations

- Trained on data from specific climate change subreddits and limited to English content
- Performance depends on the quality of the GPT-generated training labels