	---
	language: en
	tags:
	- text-classification
	- multilabel-classification
	- housing
	- climate-change
	- sustainability
	- solar-energy
	license: mit
	---

	# Solar Energy Classifier (DistilBERT)

	This model assigns topic labels to posts about solar power from climate change subreddits.

	## Model Details

	- Model Type: DistilBERT
	- Task: Multilabel text classification
	- Sector: Solar Energy
	- Base Model: DistilBERT base uncased
	- Labels: 7
	- Training Data: Sample from 1000 GPT-4o-mini-labeled Reddit posts from climate subreddits (2010-2023)

	## Labels

	The model predicts 7 labels simultaneously:

	1. Decommissioning And Waste: Talks about end-of-life panel/turbine disposal, recycling, landfill issues.
	2. Foreign Dependence And Trade: References Chinese panel dominance, tariffs, trade wars, or reshoring supply chains.
	3. Grid Stability And Storage: Discussions of intermittency, batteries, pumped hydro, or grid reliability with high renewables.
	4. Land Use: Raises land-area or space requirements, farmland loss, or siting footprint of solar/wind.
	5. Local Economy: Claims solar/wind projects create or harm local jobs, investment, or economic growth.
	6. Subsidy And Tariff Debate: Argues over feed-in-tariffs, net-metering rules or subsidy fairness.
	7. Utility Bills: Mentions household or community electricity bills going up or down due to solar/wind.


	Note: Label order in predictions matches the order above.
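
Because the label order is fixed, a raw score vector can be mapped back to label names by position. The sketch below assumes the names match the keys used under Optimal Thresholds; `LABELS`, `scores_to_dict`, and the example scores are illustrative and not part of the released code.

```python
# Label names in the order listed above (index 0 = first label).
LABELS = [
    "Decommissioning And Waste",
    "Foreign Dependence And Trade",
    "Grid Stability And Storage",
    "Land Use",
    "Local Economy",
    "Subsidy And Tariff Debate",
    "Utility Bills",
]

def scores_to_dict(scores):
    """Pair each score with its label by position (hypothetical helper)."""
    if len(scores) != len(LABELS):
        raise ValueError(f"expected {len(LABELS)} scores, got {len(scores)}")
    return dict(zip(LABELS, scores))

# Made-up scores, for illustration only.
example = scores_to_dict([0.02, 0.01, 0.10, 0.05, 0.20, 0.15, 0.90])
print(max(example, key=example.get))
```

The length check guards against silently misaligned labels if a score vector from a different sector model is passed in.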

	## Usage

	```python
	import os
	import sys
	import tempfile

	import torch
	from huggingface_hub import snapshot_download
	from transformers import DistilBertTokenizer

	device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

	def print_sorted_label_scores(label_scores):
	    # Print labels with their scores, highest score first
	    sorted_items = sorted(label_scores.items(), key=lambda x: x[1], reverse=True)
	    for label, score in sorted_items:
	        print(f" {label}: {score:.6f}")

	# Model repository and example inputs for this specific model
	model_link = 'sanchow/solar_energy-distilbert-classifier'
	examples = [
	    "Solar panels on rooftops can significantly reduce electricity bills."
	]

	print(f"\n{'='*60}")
	print("MODEL: SOLAR ENERGY SECTOR")
	print(f"{'='*60}")

	print(f"Downloading model: {model_link}")
	with tempfile.TemporaryDirectory() as temp_dir:
	    snapshot_download(
	        repo_id=model_link,
	        local_dir=temp_dir,
	        local_dir_use_symlinks=False
	    )
	    model_class_path = os.path.join(temp_dir, 'model_class.py')
	    if not os.path.exists(model_class_path):
	        print("model_class.py not found in downloaded files")
	        print(f" Available files: {os.listdir(temp_dir)}")
	    else:
	        # The repo ships its own model definition, so import it from the download
	        sys.path.insert(0, temp_dir)
	        from model_class import MultilabelClassifier
	        tokenizer = DistilBertTokenizer.from_pretrained(temp_dir)
	        # weights_only=False unpickles arbitrary objects; only load checkpoints you trust
	        checkpoint = torch.load(os.path.join(temp_dir, 'model.pt'), map_location='cpu', weights_only=False)
	        model = MultilabelClassifier(checkpoint['model_name'], len(checkpoint['label_names']))
	        model.load_state_dict(checkpoint['model_state_dict'])
	        model.to(device)
	        model.eval()
	        print("Model loaded successfully")
	        print(f" Labels: {checkpoint['label_names']}")
	        print("\nSolar Energy classifier results:\n")
	        for i, test_text in enumerate(examples):
	            inputs = tokenizer(
	                test_text,
	                return_tensors="pt",
	                truncation=True,
	                max_length=512,
	                padding=True
	            ).to(device)
	            with torch.no_grad():
	                outputs = model(**inputs)
	            # If the model returns a tuple, the scores are assumed to be its first element
	            scores = outputs[0] if isinstance(outputs, (tuple, list)) else outputs
	            predictions = scores.cpu().numpy()
	            label_scores = {label: float(score) for label, score in zip(checkpoint['label_names'], predictions[0])}
	            print(f"Example {i+1}: '{test_text}'")
	            print("Predictions (all label scores, highest first):")
	            print_sorted_label_scores(label_scores)
	            print("-" * 40)
	```


	## Performance

	Best model performance:
	- Micro Jaccard: 0.4106
	- Macro Jaccard: 0.5228
	- F1 Score: 0.8590
	- Accuracy: 0.8590

	Dataset: ~900 GPT-labeled samples per sector (600 train, 150 validation, 150 test)
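
For reference, micro and macro Jaccard differ in how they aggregate over labels: micro pools intersection and union counts across all labels before dividing, while macro averages the per-label scores. The NumPy sketch below uses made-up data. How the reported figures were actually computed is not stated here, and the handling of labels with no positives is an assumption.

```python
import numpy as np

def jaccard_micro(y_true, y_pred):
    """Pool intersections and unions over all labels, then divide once."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    inter = np.logical_and(y_true, y_pred).sum()
    union = np.logical_or(y_true, y_pred).sum()
    return inter / union if union else 0.0

def jaccard_macro(y_true, y_pred):
    """Compute Jaccard per label (column), then take the unweighted mean."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    inter = np.logical_and(y_true, y_pred).sum(axis=0)
    union = np.logical_or(y_true, y_pred).sum(axis=0)
    # Assumption: a label with no true or predicted positives scores 0.
    per_label = np.divide(inter, union, out=np.zeros(len(union)), where=union > 0)
    return per_label.mean()

# Toy data: rows are samples, columns are labels (3 labels for brevity).
y_true = [[1, 0, 0], [1, 1, 0], [0, 1, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 1]]
print(jaccard_micro(y_true, y_pred), jaccard_macro(y_true, y_pred))
```

In the toy data the third label is predicted once but never true, which pulls the macro average down more than the micro score.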



	## Optimal Thresholds

	Per-label decision thresholds. The snippet reuses `checkpoint` and `predictions` from the Usage example above.

	```python
	optimal_thresholds = {
	    'Decommissioning And Waste': 0.37254738295870854, 'Foreign Dependence And Trade': 0.37613221483784043,
	    'Grid Stability And Storage': 0.43063579501768967, 'Land Use': 0.2008681860202493,
	    'Local Economy': 0.3853212494245655, 'Subsidy And Tariff Debate': 0.42756546792925043,
	    'Utility Bills': 0.3370254357621166,
	}
	for label, score in zip(checkpoint['label_names'], predictions[0]):
	    threshold = optimal_thresholds.get(label, 0.5)
	    if score > threshold:
	        print(f"{label}: {score:.3f}")
	```
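
The thresholds above come from the threshold optimization step of training, but the exact procedure is not documented here. One common approach, sketched below as an assumption, is to sweep candidate thresholds per label on the validation set and keep the one with the highest F1. `f1_binary`, `best_thresholds`, the candidate grid, and the toy data are all hypothetical.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 for one label; returns 0.0 when there are no true positives."""
    tp = np.sum(y_true & y_pred)
    if tp == 0:
        return 0.0
    precision = tp / y_pred.sum()
    recall = tp / y_true.sum()
    return 2 * precision * recall / (precision + recall)

def best_thresholds(y_true, scores, label_names, grid=None):
    """Pick, per label, the grid threshold that maximizes validation F1."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    grid = np.linspace(0.05, 0.95, 19) if grid is None else np.asarray(grid)
    result = {}
    for j, name in enumerate(label_names):
        f1s = [f1_binary(y_true[:, j], scores[:, j] > t) for t in grid]
        result[name] = float(grid[int(np.argmax(f1s))])  # first best on ties
    return result

# Toy validation set with a single label, for illustration only.
val_true = [[1], [1], [0], [0]]
val_scores = [[0.92], [0.41], [0.33], [0.12]]
thresholds = best_thresholds(val_true, val_scores, ["Land Use"])
print(thresholds)
```

A per-label sweep like this would explain why each label ends up with a different cut-off instead of a shared 0.5.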


	## Training

	Trained on GPT-labeled Reddit data:
	1. Data collection from climate subreddits
	2. Keyword-based filtering for sector-specific content
	3. GPT labeling for multilabel classification
	4. 80/10/10 train/validation/test split
	5. Fine-tuning with threshold optimization
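
Step 4 can be sketched as a shuffled index split. This is an illustrative assumption, not the author's code: the function name `split_indices`, the seed, the sample count, and the use of simple random (non-stratified) sampling are all hypothetical.

```python
import random

def split_indices(n, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle indices 0..n-1 and cut them into train/validation/test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to test
    return train, val, test

# n = 1000 is used here purely for illustration.
train, val, test = split_indices(1000)
print(len(train), len(val), len(test))
```

Splitting indices rather than the records themselves lets the same partition be reused for texts, labels, and any cached features.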

	## Citation

	If you use this model in your research, please cite:

	```bibtex
	@misc{solar_energy_distilbert_classifier,
	  title={Solar Energy Classifier for Climate Change Analysis},
	  author={Sandeep Chowdhary},
	  year={2025},
	  publisher={Hugging Face},
	  journal={Hugging Face Hub},
	  howpublished={\url{https://huggingface.co/echoboi/solar_energy-distilbert-classifier}},
	}
	```

	## Limitations

	- Trained on data from specific climate change subreddits and limited to English content
	- Labels were generated by GPT rather than by human annotators, so performance depends on the quality of those labels