	---
	license: mit
	language:
	- en
	---

	# Personalized Text Generation with Fine-Grained Linguistic Control

	This model was obtained by fine-tuning the [Pythia-1b](https://huggingface.co/EleutherAI/) model on three datasets: [Blogs Authorship Corpus](https://u.cs.biu.ac.il/~koppel/BlogCorpus.htm), [IMDb62](https://umlt.infotech.monash.edu/?page_id=266), and [Amazon 5-core Reviews](https://cseweb.ucsd.edu/~jmcauley/datasets/amazon_v2/).
	The model generates text conditioned on multiple fine-grained linguistic attributes that reflect an author's writing style. The fine-tuning procedure and hyperparameters are described in our paper "[Personalized Text Generation with Fine-Grained Linguistic Control](https://arxiv.org/abs/2402.04914)". The fine-tuning code and data are available [here](https://github.com/balhafni/personalized-gen).


	## Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model = AutoModelForCausalLM.from_pretrained('balhafni/personalized-gen')
	tokenizer = AutoTokenizer.from_pretrained('balhafni/personalized-gen')


	ling_atts = {"ADJ": "5-8", "ADP": "10-11", "ADV": "6-8", "AUX": "9-11",
	             "CONJ": "2-4", "DET": "7-10", "FKGL": "5-6", "NOUN": "11-18",
	             "NUM": "2-3", "PART": "4-5", "PRON": "14-17", "PROPN": "8-11",
	             "PUNCT": "22-25", "ROOT": "9-10", "SCONJ": "3-4", "VERB": "16-20",
	             "acl": "0-1", "acomp": "1-2", "advcl": "2-3", "advmod": "7-9",
	             "amod": "3-6", "appos": "0-1", "attr": "1-2", "attribution": "2-3",
	             "aux": "6-7", "auxpass": "0-1", "case": "0-1", "cc": "2-4",
	             "ccomp": "3-4", "compound": "5-6", "conj": "2-4", "contrast": "0-1",
	             "det": "7-10", "dobj": "6-7", "domain": "blog",
	             "elaboration": "10-12", "mark": "2-3", "neg": "2-3", "nmod": "0-1",
	             "npadvmod": "1-2", "nsubj": "13-16", "nsubjpass": "0-1",
	             "num_sents": "9-10", "num_tokens": "118-139", "nummod": "1-2",
	             "pcomp": "0-1", "pobj": "8-10", "poss": "2-3", "prep": "9-10"}

	prompt = ("Today's lunch was a layered entree, consisting of, "
	          "shredded lettuce and popcorn chicken.")

	# Prefix the prompt with the attributes as concatenated 'name:value' pairs
	inputs = [''.join([f'{k}:{v}' for k, v in ling_atts.items()]) + prompt]
	inputs = tokenizer(inputs, return_tensors='pt')

	preds = model.generate(**inputs,
	                       max_length=1024,
	                       pad_token_id=tokenizer.pad_token_id,
	                       no_repeat_ngram_size=2)

	# Decode only the newly generated tokens, skipping the attribute prefix and prompt
	decoded_preds = tokenizer.batch_decode(preds[:, inputs['input_ids'].shape[1]:],
	                                       skip_special_tokens=True)[0]
	output = prompt + ' ' + decoded_preds.strip()
	print(output)
	```
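
The input format used above can be inspected without loading the model. The following is a minimal sketch: the helper names `build_model_input`, `parse_range`, and `within_range` are hypothetical and not part of the released code, and treating each `"low-high"` value as an inclusive integer range is an assumption. It mirrors the concatenation from the usage example and shows one way to check a measured count against a requested range.

```python
def build_model_input(ling_atts, prompt):
    """Join 'name:value' pairs with no separator, then append the prompt.

    This reproduces the string construction from the usage example.
    """
    prefix = ''.join(f'{name}:{value}' for name, value in ling_atts.items())
    return prefix + prompt


def parse_range(value):
    """Split a range string such as '118-139' into integer bounds."""
    low, high = value.split('-')
    return int(low), int(high)


def within_range(count, value):
    """Return True if count lies in the range (assumed inclusive on both ends)."""
    low, high = parse_range(value)
    return low <= count <= high


# A small subset of attributes, for illustration only
example_atts = {"ADJ": "5-8", "num_sents": "9-10", "domain": "blog"}

print(build_model_input(example_atts, "Today's lunch"))
print(within_range(9, example_atts["num_sents"]))
```

Non-numeric attributes such as `domain` are not ranges, so `parse_range` should only be applied to count-valued attributes. How counts are measured on generated text (for example, which tokenizer or parser is used) is described in the paper and is not covered by this sketch.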


	## Citation

	```BibTeX
	@inproceedings{alhafni-etal-2024-personalized,
	    title = "Personalized Text Generation with Fine-Grained Linguistic Control",
	    author = "Alhafni, Bashar and
	      Kulkarni, Vivek and
	      Kumar, Dhruv and
	      Raheja, Vipul",
	    month = march,
	    year = "2024",
	    address = "Malta",
	    publisher = "Association for Computational Linguistics",
	    abstract = "As the text generation capabilities of large language models become increasingly prominent, recent studies have focused on controlling particular aspects of the generated text to make it more personalized. However, most research on controllable text generation focuses on controlling the content or modeling specific high-level/coarse-grained attributes that reflect authors’ writing styles, such as formality, domain, or sentiment. In this paper, we focus on controlling fine-grained attributes spanning multiple linguistic dimensions, such as lexical and syntactic attributes. We introduce a novel benchmark to train generative models and evaluate their ability to generate personalized text based on multiple fine-grained linguistic attributes. We systematically investigate the performance of various large language models on our benchmark and draw insights from the factors that impact their performance. We make our code, data, and pretrained models publicly available.",
	}
	```