Generation
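Each framework has a generate method for text generation, implemented in its respective GenerationMixin class:
- PyTorch generate() is implemented in GenerationMixin.
- TensorFlow generate() is implemented in TFGenerationMixin.
- Flax/JAX generate() is implemented in FlaxGenerationMixin.
Regardless of your framework of choice, you can parameterize the generate method with a GenerationConfig class instance. Refer to this class for the complete list of generation parameters, which control the behavior of the generation method.
To learn how to inspect a model's generation configuration, what the defaults are, how to change parameters ad hoc, and how to create and save a customized generation configuration, refer to the text generation strategies guide. The guide also explains how to use related features, like token streaming.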
GenerationConfig
class transformers.GenerationConfig
( **kwargs )
Parameters that control the length of the output
-  max_length (int, optional, defaults to 20) — The maximum length the generated tokens can have. Corresponds to the length of the input prompt + max_new_tokens. Its effect is overridden by max_new_tokens, if also set.
-  max_new_tokens (int, optional) — The maximum number of tokens to generate, ignoring the number of tokens in the prompt.
-  min_length (int, optional, defaults to 0) — The minimum length of the sequence to be generated. Corresponds to the length of the input prompt + min_new_tokens. Its effect is overridden by min_new_tokens, if also set.
-  min_new_tokens (int, optional) — The minimum number of tokens to generate, ignoring the number of tokens in the prompt.
-  early_stopping (bool or str, optional, defaults to False) — Controls the stopping condition for beam-based methods, like beam-search. It accepts the following values: True, where the generation stops as soon as there are num_beams complete candidates; False, where a heuristic is applied and the generation stops when it is very unlikely to find better candidates; "never", where the beam search procedure only stops when there cannot be better candidates (canonical beam search algorithm).
-  max_time (float, optional) — The maximum amount of time you allow the computation to run for in seconds. Generation will still finish the current pass after the allocated time has passed.
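The precedence between the total-length and new-token parameters can be sketched in a few lines. This is an illustrative helper, not library code: the function name resolve_length_bounds is hypothetical, and it assumes a decoder-only model, where the prompt counts toward the total length.

```python
from typing import Optional, Tuple


def resolve_length_bounds(
    prompt_length: int,
    max_length: int = 20,
    max_new_tokens: Optional[int] = None,
    min_length: int = 0,
    min_new_tokens: Optional[int] = None,
) -> Tuple[int, int]:
    """Return (min_total, max_total), both counted in tokens including the prompt."""
    # max_new_tokens, when set, overrides max_length.
    max_total = prompt_length + max_new_tokens if max_new_tokens is not None else max_length
    # min_new_tokens, when set, overrides min_length.
    min_total = prompt_length + min_new_tokens if min_new_tokens is not None else min_length
    return min_total, max_total


print(resolve_length_bounds(prompt_length=8))  # (0, 20)
print(resolve_length_bounds(prompt_length=8, max_new_tokens=30, min_new_tokens=5))  # (13, 38)
```

Note that with the default max_length of 20, an 8-token prompt leaves room for only 12 new tokens, which is why setting max_new_tokens explicitly is often the less surprising choice.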
Parameters that control the generation strategy used
-  do_sample (bool, optional, defaults to False) — Whether or not to use sampling; use greedy decoding otherwise.
-  num_beams (int, optional, defaults to 1) — Number of beams for beam search. 1 means no beam search.
-  num_beam_groups (int, optional, defaults to 1) — Number of groups to divide num_beams into in order to ensure diversity among different groups of beams. See this paper for more details.
-  penalty_alpha (float, optional) — The value balances the model confidence and the degeneration penalty in contrastive search decoding.
-  use_cache (bool, optional, defaults to True) — Whether or not the model should use the past last key/values attentions (if applicable to the model) to speed up decoding.
Parameters for manipulation of the model output logits
-  temperature (float, optional, defaults to 1.0) — The value used to modulate the next token probabilities.
-  top_k (int, optional, defaults to 50) — The number of highest probability vocabulary tokens to keep for top-k-filtering.
-  top_p (float, optional, defaults to 1.0) — If set to float < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation.
-  typical_p (float, optional, defaults to 1.0) — Local typicality measures how similar the conditional probability of predicting a target token next is to the expected conditional probability of predicting a random token next, given the partial text already generated. If set to float < 1, the smallest set of the most locally typical tokens with probabilities that add up to typical_p or higher are kept for generation. See this paper for more details.
-  epsilon_cutoff (float, optional, defaults to 0.0) — If set to float strictly between 0 and 1, only tokens with a conditional probability greater than epsilon_cutoff will be sampled. In the paper, suggested values range from 3e-4 to 9e-4, depending on the size of the model. See Truncation Sampling as Language Model Desmoothing for more details.
-  eta_cutoff (float, optional, defaults to 0.0) — Eta sampling is a hybrid of locally typical sampling and epsilon sampling. If set to float strictly between 0 and 1, a token is only considered if it is greater than either eta_cutoff or sqrt(eta_cutoff) * exp(-entropy(softmax(next_token_logits))). The latter term is intuitively the expected next token probability, scaled by sqrt(eta_cutoff). In the paper, suggested values range from 3e-4 to 2e-3, depending on the size of the model. See Truncation Sampling as Language Model Desmoothing for more details.
-  diversity_penalty (float, optional, defaults to 0.0) — This value is subtracted from a beam's score if it generates the same token as any beam from another group at a particular time. Note that diversity_penalty is only effective if group beam search is enabled.
-  repetition_penalty (float, optional, defaults to 1.0) — The parameter for repetition penalty. 1.0 means no penalty. See this paper for more details.
-  encoder_repetition_penalty (float, optional, defaults to 1.0) — The parameter for encoder_repetition_penalty. An exponential penalty on sequences that are not in the original input. 1.0 means no penalty.
-  length_penalty (float, optional, defaults to 1.0) — Exponential penalty to the length that is used with beam-based generation. It is applied as an exponent to the sequence length, which in turn is used to divide the score of the sequence. Since the score is the log likelihood of the sequence (i.e. negative), length_penalty > 0.0 promotes longer sequences, while length_penalty < 0.0 encourages shorter sequences.
-  no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size can only occur once.
-  bad_words_ids (List[List[int]], optional) — List of lists of token ids that are not allowed to be generated. Check NoBadWordsLogitsProcessor for further documentation and examples.
-  force_words_ids (List[List[int]] or List[List[List[int]]], optional) — List of token ids that must be generated. If given a List[List[int]], this is treated as a simple list of words that must be included, the opposite to bad_words_ids. If given List[List[List[int]]], this triggers a disjunctive constraint, where one can allow different forms of each word.
-  renormalize_logits (bool, optional, defaults to False) — Whether to renormalize the logits after applying all the logits processors or warpers (including the custom ones). It's highly recommended to set this flag to True as the search algorithms suppose the score logits are normalized but some logit processors or warpers break the normalization.
-  constraints (List[Constraint], optional) — Custom constraints that can be added to the generation to ensure that the output will contain the use of certain tokens as defined by Constraint objects, in the most sensible way possible.
-  forced_bos_token_id (int, optional, defaults to model.config.forced_bos_token_id) — The id of the token to force as the first generated token after the decoder_start_token_id. Useful for multilingual models like mBART where the first generated token needs to be the target language token.
-  forced_eos_token_id (Union[int, List[int]], optional, defaults to model.config.forced_eos_token_id) — The id of the token to force as the last generated token when max_length is reached. Optionally, use a list to set multiple end-of-sequence tokens.
-  remove_invalid_values (bool, optional, defaults to model.config.remove_invalid_values) — Whether to remove possible nan and inf outputs of the model to prevent the generation method from crashing. Note that using remove_invalid_values can slow down generation.
-  exponential_decay_length_penalty (tuple(int, float), optional) — This tuple adds an exponentially increasing length penalty after a certain number of tokens have been generated. The tuple shall consist of (start_index, decay_factor), where start_index indicates where the penalty starts and decay_factor represents the factor of exponential decay.
-  suppress_tokens (List[int], optional) — A list of tokens that will be suppressed at generation. The SuppressTokens logit processor will set their log probs to -inf so that they are not sampled.
-  begin_suppress_tokens (List[int], optional) — A list of tokens that will be suppressed at the beginning of the generation. The SuppressBeginTokens logit processor will set their log probs to -inf so that they are not sampled.
-  forced_decoder_ids (List[List[int]], optional) — A list of pairs of integers which indicates a mapping from generation indices to token indices that will be forced before sampling. For example, [[1, 123]] means the second generated token will always be a token of index 123.
-  sequence_bias (Dict[Tuple[int], float], optional) — Dictionary that maps a sequence of tokens to its bias term. Positive biases increase the odds of the sequence being selected, while negative biases do the opposite. Check SequenceBiasLogitsProcessor for further documentation and examples.
-  guidance_scale (float, optional) — The guidance scale for classifier free guidance (CFG). CFG is enabled by setting guidance_scale > 1. Higher guidance scale encourages the model to generate samples that are more closely linked to the input prompt, usually at the expense of poorer quality.
-  low_memory (bool, optional) — Switch to sequential beam search and sequential topk for contrastive search to reduce peak memory. Used with beam search and contrastive search.
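To make the interaction of temperature, top_k and top_p more concrete, here is a minimal pure-Python sketch over a toy four-token vocabulary. It is not the library's implementation (which operates on tensors through logits processors); the function names and the toy logits are hypothetical, and applying top-k before top-p is an assumption about ordering.

```python
import math
from typing import Dict


def softmax(logits: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}


def top_k_top_p_filter(probs: Dict[str, float], top_k: int = 50, top_p: float = 1.0) -> Dict[str, float]:
    """Keep the top_k most probable tokens, then the smallest set whose mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]  # renormalize after top-k
    kept, mass = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        mass += p
        if mass >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalize the survivors


toy_logits = {"the": 2.0, "a": 1.0, "dog": 0.0, "cat": -1.0}  # hypothetical values
filtered = top_k_top_p_filter(softmax(toy_logits), top_k=3, top_p=0.9)
print(list(filtered))  # ['the', 'a']
```

Sampling would then draw from the renormalized survivors. A temperature below 1.0 sharpens the distribution before filtering, so fewer tokens are needed to reach the top_p mass.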
Parameters that define the output variables of `generate`
-  num_return_sequences (int, optional, defaults to 1) — The number of independently computed returned sequences for each element in the batch.
-  output_attentions (bool, optional, defaults to False) — Whether or not to return the attentions tensors of all attention layers. See attentions under returned tensors for more details.
-  output_hidden_states (bool, optional, defaults to False) — Whether or not to return the hidden states of all layers. See hidden_states under returned tensors for more details.
-  output_scores (bool, optional, defaults to False) — Whether or not to return the prediction scores. See scores under returned tensors for more details.
-  output_logits (bool, optional) — Whether or not to return the unprocessed prediction logit scores. See logits under returned tensors for more details.
-  return_dict_in_generate (bool, optional, defaults to False) — Whether or not to return a ModelOutput instead of a plain tuple.
Special tokens that can be used at generation time
-  pad_token_id (int, optional) — The id of the padding token.
-  bos_token_id (int, optional) — The id of the beginning-of-sequence token.
-  eos_token_id (Union[int, List[int]], optional) — The id of the end-of-sequence token. Optionally, use a list to set multiple end-of-sequence tokens.
Generation parameters exclusive to encoder-decoder models
-  encoder_no_repeat_ngram_size (int, optional, defaults to 0) — If set to int > 0, all ngrams of that size that occur in the encoder_input_ids cannot occur in the decoder_input_ids.
-  decoder_start_token_id (Union[int, List[int]], optional) — If an encoder-decoder model starts decoding with a different token than bos, the id of that token or a list of length batch_size. Indicating a list enables different start ids for each element in the batch (e.g. multilingual models with different target languages in one batch).
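The effect of encoder_no_repeat_ngram_size can be illustrated with a small sketch that computes which next tokens would complete an n-gram already present in the encoder input. This is a conceptual illustration on plain lists, not the library's processor; the function name banned_next_tokens and the token ids are hypothetical.

```python
from typing import Sequence, Set


def banned_next_tokens(encoder_ids: Sequence[int], decoder_ids: Sequence[int], ngram_size: int) -> Set[int]:
    """Tokens that would complete an n-gram already present in encoder_ids."""
    if ngram_size <= 0 or len(decoder_ids) + 1 < ngram_size:
        return set()
    # The last (ngram_size - 1) decoder tokens form the prefix of the next n-gram.
    prefix = tuple(decoder_ids[len(decoder_ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(encoder_ids) - ngram_size + 1):
        ngram = tuple(encoder_ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


encoder_ids = [5, 6, 7, 8, 6, 7, 9]  # hypothetical encoder input
print(banned_next_tokens(encoder_ids, decoder_ids=[1, 6, 7], ngram_size=3))  # {8, 9}
```

In this example the decoder has just produced 6, 7, and the encoder input contains both 6, 7, 8 and 6, 7, 9, so tokens 8 and 9 are ruled out for the next step.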
Generation parameters exclusive to assistant generation
-  num_assistant_tokens (int, optional, defaults to 5) — Defines the number of speculative tokens that shall be generated by the assistant model before being checked by the target model at each iteration. Higher values for num_assistant_tokens make the generation more speculative: if the assistant model is performant, larger speed-ups can be reached; if the assistant model requires lots of corrections, lower speed-ups are reached.
-  num_assistant_tokens_schedule (str, optional, defaults to "heuristic") — Defines the schedule at which max assistant tokens shall be changed during inference.
    - "heuristic": When all speculative tokens are correct, increase num_assistant_tokens by 2, else reduce it by 1. The num_assistant_tokens value is persistent over multiple generation calls with the same assistant model.
    - "heuristic_transient": Same as "heuristic", but num_assistant_tokens is reset to its initial value after each generation call.
    - "constant": num_assistant_tokens stays unchanged during generation.
-  prompt_lookup_num_tokens (int, optional, defaults to None) — The number of tokens to be output as candidate tokens.
-  max_matching_ngram_size (int, optional, defaults to None) — The maximum ngram size to be considered for matching in the prompt. Defaults to 2 if not provided.
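The "heuristic" schedule described above amounts to a small update rule, sketched below. The function name next_num_assistant_tokens is hypothetical, and the lower bound of 1 is an assumption added so the value never drops to zero; it is not stated in the parameter description.

```python
def next_num_assistant_tokens(current: int, all_accepted: bool, schedule: str = "heuristic") -> int:
    """Return the number of speculative tokens to draft in the next iteration."""
    if schedule == "constant":
        return current
    if schedule in ("heuristic", "heuristic_transient"):
        # +2 after a fully accepted draft, -1 otherwise (floor of 1 is an assumption).
        # "heuristic_transient" differs only in resetting the value after each generate call.
        return current + 2 if all_accepted else max(1, current - 1)
    raise ValueError(f"unknown schedule: {schedule!r}")


history = []
num_tokens = 5  # documented default for num_assistant_tokens
for all_accepted in [True, True, False, False, True]:
    num_tokens = next_num_assistant_tokens(num_tokens, all_accepted)
    history.append(num_tokens)
print(history)  # [7, 9, 8, 7, 9]
```

The asymmetric steps let the draft length grow quickly while the assistant keeps agreeing with the target model, and shrink gradually when it does not.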
Parameters specific to the caching mechanism
-  cache_implementation (str, optional, defaults to None) — Cache class that should be used when generating.
Wild card
Class that holds a configuration for a generation task. A generate call supports the following generation methods
for text-decoder, text-to-text, speech-to-text, and vision-to-text models:
- greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False
- contrastive search by calling _contrastive_search() if penalty_alpha>0 and top_k>1
- multinomial sampling by calling _sample() if num_beams=1 and do_sample=True
- beam-search decoding by calling _beam_search() if num_beams>1 and do_sample=False
- beam-search multinomial sampling by calling _beam_sample() if num_beams>1 and do_sample=True
- diverse beam-search decoding by calling _group_beam_search(), if num_beams>1 and num_beam_groups>1
- constrained beam-search decoding by calling _constrained_beam_search(), if constraints!=None or force_words_ids!=None
- assisted decoding by calling _assisted_decoding(), if assistant_model or prompt_lookup_num_tokens is passed to .generate()
You do not need to call any of the above methods directly. Pass custom parameter values to .generate() instead. To learn more about decoding strategies refer to the text generation strategies guide.
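The conditions in the list above overlap (for example, contrastive search and greedy decoding both apply when num_beams=1 and do_sample=False), so some precedence order is needed to pick a single strategy. The sketch below shows one plausible way to resolve it. The function name select_decoding_strategy and the order of the checks are assumptions made for illustration, not a description of the library's internal dispatch.

```python
def select_decoding_strategy(
    num_beams=1,
    do_sample=False,
    num_beam_groups=1,
    penalty_alpha=None,
    top_k=50,
    constraints=None,
    force_words_ids=None,
    assistant_model=None,
    prompt_lookup_num_tokens=None,
) -> str:
    """Map generation flags to the name of a decoding strategy (assumed precedence)."""
    if assistant_model is not None or prompt_lookup_num_tokens is not None:
        return "assisted decoding"
    if constraints is not None or force_words_ids is not None:
        return "constrained beam-search decoding"
    if num_beams > 1:
        if num_beam_groups > 1:
            return "diverse beam-search decoding"
        return "beam-search multinomial sampling" if do_sample else "beam-search decoding"
    if do_sample:
        return "multinomial sampling"
    if penalty_alpha is not None and penalty_alpha > 0 and top_k is not None and top_k > 1:
        return "contrastive search"
    return "greedy decoding"


print(select_decoding_strategy())  # greedy decoding
print(select_decoding_strategy(num_beams=4, do_sample=True))  # beam-search multinomial sampling
print(select_decoding_strategy(penalty_alpha=0.6, top_k=4))  # contrastive search
```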
A large number of these flags control the logits or the stopping criteria of the generation. Make sure you check the generate-related classes for a full description of the possible manipulations, as well as examples of their usage.
from_pretrained
( pretrained_model_name: Union config_file_name: Union = None cache_dir: Union = None force_download: bool = False local_files_only: bool = False token: Union = None revision: str = 'main' **kwargs ) → GenerationConfig
Parameters
-  pretrained_model_name (str or os.PathLike) — This can be either:
    - a string, the model id of a pretrained model configuration hosted inside a model repo on huggingface.co.
    - a path to a directory containing a configuration file saved using the save_pretrained() method, e.g., ./my_model_directory/.
-  config_file_name (str or os.PathLike, optional, defaults to "generation_config.json") — Name of the generation configuration JSON file to be loaded from pretrained_model_name.
-  cache_dir (str or os.PathLike, optional) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
-  force_download (bool, optional, defaults to False) — Whether or not to force to (re-)download the configuration files and override the cached versions if they exist.
-  resume_download (bool, optional, defaults to False) — Whether or not to delete incompletely received files. Attempts to resume the download if such a file exists.
-  proxies (Dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
-  token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, or not specified, will use the token generated when running huggingface-cli login (stored in ~/.huggingface).
-  revision (str, optional, defaults to "main") — The specific model version to use. It can be a branch name, a tag name, or a commit id, since we use a git-based system for storing models and other artifacts on huggingface.co, so revision can be any identifier allowed by git. To test a pull request you made on the Hub, you can pass `revision="refs/pr/ "`.
-  return_unused_kwargs (bool, optional, defaults to False) — If False, then this function returns just the final configuration object. If True, then this function returns a Tuple(config, unused_kwargs) where unused_kwargs is a dictionary consisting of the key/value pairs whose keys are not configuration attributes: i.e., the part of kwargs which has not been used to update config and is otherwise ignored.
-  subfolder (str, optional, defaults to "") — In case the relevant files are located inside a subfolder of the model repo on huggingface.co, you can specify the folder name here.
-  kwargs (Dict[str, Any], optional) — The values in kwargs of any keys which are configuration attributes will be used to override the loaded values. Behavior concerning key/value pairs whose keys are not configuration attributes is controlled by the return_unused_kwargs keyword parameter.
Returns
The configuration object instantiated from this pretrained model.
Instantiate a GenerationConfig from a generation configuration file.
Examples:
>>> from transformers import GenerationConfig
>>> # Download configuration from huggingface.co and cache.
>>> generation_config = GenerationConfig.from_pretrained("openai-community/gpt2")
>>> # E.g. config was saved using *save_pretrained('./test/saved_model/')*
>>> generation_config.save_pretrained("./test/saved_model/")
>>> generation_config = GenerationConfig.from_pretrained("./test/saved_model/")
>>> # You can also specify configuration names to your generation configuration file
>>> generation_config.save_pretrained("./test/saved_model/", config_file_name="my_configuration.json")
>>> generation_config = GenerationConfig.from_pretrained("./test/saved_model/", "my_configuration.json")
>>> # If you'd like to try a minor variation to an existing configuration, you can also pass generation
>>> # arguments to `.from_pretrained()`. Be mindful that typos and unused arguments will be ignored
>>> generation_config, unused_kwargs = GenerationConfig.from_pretrained(
...     "openai-community/gpt2", top_k=1, foo=False, do_sample=True, return_unused_kwargs=True
... )
>>> generation_config.top_k
1
>>> unused_kwargs
{'foo': False}
from_model_config
( model_config: PretrainedConfig ) → GenerationConfig
Instantiates a GenerationConfig from a PretrainedConfig. This function is useful to convert legacy PretrainedConfig objects, which may contain generation parameters, into a stand-alone GenerationConfig.
save_pretrained
( save_directory: Union config_file_name: Union = None push_to_hub: bool = False **kwargs )
Parameters
-  save_directory (str or os.PathLike) — Directory where the configuration JSON file will be saved (will be created if it does not exist).
-  config_file_name (str or os.PathLike, optional, defaults to "generation_config.json") — Name of the generation configuration JSON file to be saved in save_directory.
-  push_to_hub (bool, optional, defaults to False) — Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the repository you want to push to with repo_id (will default to the name of save_directory in your namespace).
-  kwargs (Dict[str, Any], optional) — Additional keyword arguments passed along to the push_to_hub() method.
Save a generation configuration object to the directory save_directory, so that it can be re-loaded using the
from_pretrained() class method.
GenerationMixin
A class containing all functions for auto-regressive text generation, to be used as a mixin in PreTrainedModel.
The class exposes generate(), which can be used for:
- greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False
- contrastive search by calling _contrastive_search() if penalty_alpha>0 and top_k>1
- multinomial sampling by calling _sample() if num_beams=1 and do_sample=True
- beam-search decoding by calling _beam_search() if num_beams>1 and do_sample=False
- beam-search multinomial sampling by calling _beam_sample() if num_beams>1 and do_sample=True
- diverse beam-search decoding by calling _group_beam_search(), if num_beams>1 and num_beam_groups>1
- constrained beam-search decoding by calling _constrained_beam_search(), if constraints!=None or force_words_ids!=None
- assisted decoding by calling _assisted_decoding(), if assistant_model or prompt_lookup_num_tokens is passed to .generate()
You do not need to call any of the above methods directly. Pass custom parameter values to ‘generate’ instead. To learn more about decoding strategies refer to the text generation strategies guide.
generate
( inputs: Optional = None generation_config: Optional = None logits_processor: Optional = None stopping_criteria: Optional = None prefix_allowed_tokens_fn: Optional = None synced_gpus: Optional = None assistant_model: Optional = None streamer: Optional = None negative_prompt_ids: Optional = None negative_prompt_attention_mask: Optional = None **kwargs ) → ModelOutput or torch.LongTensor
Parameters
-  inputs (torch.Tensor of varying shape depending on the modality, optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None, the method initializes it with bos_token_id and a batch size of 1. For decoder-only models, inputs should be in the format of input_ids. For encoder-decoder models, inputs can represent any of input_ids, input_values, input_features, or pixel_values.
-  generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig's default values, whose documentation should be checked to parameterize generation.
-  logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config, an error is thrown. This feature is intended for advanced users.
-  stopping_criteria (StoppingCriteriaList, optional) — Custom stopping criteria that complement the default stopping criteria built from arguments and a generation config. If a stopping criterion is passed that is already created with the arguments or a generation config, an error is thrown. If your stopping criteria depend on the scores input, make sure you pass return_dict_in_generate=True, output_scores=True to generate. This feature is intended for advanced users.
-  prefix_allowed_tokens_fn (Callable[[int, torch.Tensor], List[int]], optional) — If provided, this function constrains the beam search to allowed tokens only at each step. If not provided, no constraint is applied. This function takes 2 arguments: the batch ID batch_id and input_ids. It has to return a list with the allowed tokens for the next generation step conditioned on the batch ID batch_id and the previously generated tokens input_ids. This argument is useful for constrained generation conditioned on the prefix, as described in Autoregressive Entity Retrieval.
-  synced_gpus (bool, optional) — Whether to continue running the while loop until max_length. Unless overridden, this flag will be set to True under DeepSpeed ZeRO Stage 3 multiple GPUs environment to avoid hanging if one GPU finished generating before other GPUs. Otherwise it'll be set to False.
-  assistant_model (PreTrainedModel, optional) — An assistant model that can be used to accelerate generation. The assistant model must have the exact same tokenizer. The acceleration is achieved when forecasting candidate tokens with the assistant model is much faster than running generation with the model you're calling generate from. As such, the assistant model should be much smaller.
-  streamer (BaseStreamer, optional) — Streamer object that will be used to stream the generated sequences. Generated tokens are passed through streamer.put(token_ids) and the streamer is responsible for any further processing.
-  negative_prompt_ids (torch.LongTensor of shape (batch_size, sequence_length), optional) — The negative prompt needed for some processors such as CFG. The batch size must match the input batch size. This is an experimental feature, subject to breaking API changes in future versions.
-  negative_prompt_attention_mask (torch.LongTensor of shape (batch_size, sequence_length), optional) — Attention_mask for negative_prompt_ids.
-  kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generation_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder specific kwargs should not be prefixed and decoder specific kwargs should be prefixed with decoder_.
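As an illustration of the prefix_allowed_tokens_fn contract described above, the sketch below defines a callable with the documented two-argument shape. The token ids and the function name allow_yes_or_no are hypothetical, and the example uses plain lists in place of tensors so it can run on its own; int() is applied to the last token so that the same code would also accept a tensor element.

```python
from typing import List, Sequence

# Hypothetical token ids, chosen only for illustration.
YES_ID, NO_ID, EOS_ID = 101, 102, 0


def allow_yes_or_no(batch_id: int, input_ids: Sequence[int]) -> List[int]:
    """Allow "yes" or "no" as the next token, then force end-of-sequence."""
    last_token = int(input_ids[-1])
    if last_token in (YES_ID, NO_ID):
        return [EOS_ID]
    return [YES_ID, NO_ID]


print(allow_yes_or_no(0, [7, 8]))  # [101, 102]
print(allow_yes_or_no(0, [7, 8, 101]))  # [0]
```

A function of this shape would be passed as prefix_allowed_tokens_fn=allow_yes_or_no in a generate call. This one ignores batch_id, but a per-example constraint could branch on it.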
Returns
ModelOutput or torch.LongTensor
A ModelOutput (if return_dict_in_generate=True
or when config.return_dict_in_generate=True) or a torch.LongTensor.
If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible
ModelOutput types are:
If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible
ModelOutput types are:
Generates sequences of token ids for models with a language modeling head.
Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the
model’s default generation configuration. You can override any generation_config by passing the corresponding
parameters to generate(), e.g. .generate(inputs, num_beams=4, do_sample=True).
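The override behavior described here can be pictured as a layered dictionary merge. The sketch below is a simplified model of that resolution, using plain dicts in place of GenerationConfig objects; the names CLASS_DEFAULTS and resolve_generation_params are hypothetical, and only a handful of the documented defaults are included.

```python
from typing import Any, Dict, Optional, Tuple

# A few of the documented GenerationConfig defaults.
CLASS_DEFAULTS = {"max_length": 20, "num_beams": 1, "do_sample": False, "top_k": 50}


def resolve_generation_params(
    model_defaults: Dict[str, Any],
    generation_config: Optional[Dict[str, Any]] = None,
    **kwargs: Any,
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Return (resolved generation parameters, leftover model-specific kwargs)."""
    # An explicitly passed generation_config replaces the model's default one.
    base = generation_config if generation_config is not None else model_defaults
    # Unspecified parameters inherit the class defaults.
    resolved = {**CLASS_DEFAULTS, **base}
    # kwargs that match generation attributes override them; the rest go to the model.
    model_kwargs = {key: value for key, value in kwargs.items() if key not in resolved}
    resolved.update({key: value for key, value in kwargs.items() if key in resolved})
    return resolved, model_kwargs


params, model_kwargs = resolve_generation_params(
    {"max_length": 50}, num_beams=4, do_sample=True, attention_mask="mask"
)
print(params)  # {'max_length': 50, 'num_beams': 4, 'do_sample': True, 'top_k': 50}
print(model_kwargs)  # {'attention_mask': 'mask'}
```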
For an overview of generation strategies and code examples, check out the following guide.
compute_transition_scores
( sequences: Tensor scores: Tuple beam_indices: Optional = None normalize_logits: bool = False ) → torch.Tensor
Parameters
-  sequences (torch.LongTensor) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
-  scores (tuple(torch.FloatTensor)) — Transition scores for each vocabulary token at each generation step. Beam transition scores consisting of log probabilities of tokens conditioned on log softmax of previously generated tokens in this beam. Tuple of torch.FloatTensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_beams, config.vocab_size).
-  beam_indices (torch.LongTensor, optional) — Beam indices of generated token id at each generation step. torch.LongTensor of shape (batch_size*num_return_sequences, sequence_length). Only required if num_beams>1 at generate-time.
-  normalize_logits (bool, optional, defaults to False) — Whether to normalize the logits (which, for legacy reasons, may be unnormalized).
Returns
torch.Tensor
A torch.Tensor of shape (batch_size*num_return_sequences, sequence_length) containing
the transition scores (logits)
Computes the transition scores of sequences given the generation scores (and beam indices, if beam search was used). This is a convenient method to quickly obtain the scores of the selected tokens at generation time.
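Conceptually, a transition score is the score of the token that was actually selected at each generation step. The pure-Python sketch below illustrates this for a single sequence without beam search, along with the length-penalized sequence score. It is a simplified stand-in for the tensor-based method: the helper names and the toy two-token vocabulary are hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Normalize logits into log probabilities."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(value - peak) for value in logits))
    return [value - log_total for value in logits]


def transition_scores(
    step_scores: Sequence[Sequence[float]], generated: Sequence[int], normalize_logits: bool = False
) -> List[float]:
    """Pick, at every step, the score of the token that was generated."""
    picked = []
    for scores, token in zip(step_scores, generated):
        if normalize_logits:
            scores = log_softmax(scores)
        picked.append(scores[token])
    return picked


def sequence_score(token_scores: Sequence[float], length_penalty: float = 1.0) -> float:
    """Sum of token scores divided by length ** length_penalty."""
    return sum(token_scores) / (len(token_scores) ** length_penalty)


# Two generation steps over a toy vocabulary of two tokens.
step_scores = [[0.0, math.log(3)], [0.0, 0.0]]
scores = transition_scores(step_scores, generated=[1, 0], normalize_logits=True)
print(round(math.exp(sum(scores)), 3))  # 0.375, i.e. 0.75 * 0.5
```

Dividing the summed log probabilities by the length raised to length_penalty is the same reconstruction used for beam search sequence scores.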
Examples:
>>> from transformers import GPT2Tokenizer, AutoModelForCausalLM
>>> import numpy as np
>>> tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
>>> model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tokenizer.pad_token_id = tokenizer.eos_token_id
>>> inputs = tokenizer(["Today is"], return_tensors="pt")
>>> # Example 1: Print the scores for each token generated with Greedy Search
>>> outputs = model.generate(**inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True)
>>> transition_scores = model.compute_transition_scores(
...     outputs.sequences, outputs.scores, normalize_logits=True
... )
>>> # input_length is the length of the input prompt for decoder-only models, like the GPT family, and 1 for
>>> # encoder-decoder models, like BART or T5.
>>> input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
>>> generated_tokens = outputs.sequences[:, input_length:]
>>> for tok, score in zip(generated_tokens[0], transition_scores[0]):
...     # | token | token string | log probability | probability
...     print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.2%}")
|   262 |  the     | -1.414 | 24.33%
|  1110 |  day     | -2.609 | 7.36%
|   618 |  when    | -2.010 | 13.40%
|   356 |  we      | -1.859 | 15.58%
|   460 |  can     | -2.508 | 8.14%
>>> # Example 2: Reconstruct the sequence scores from Beam Search
>>> outputs = model.generate(
...     **inputs,
...     max_new_tokens=5,
...     num_beams=4,
...     num_return_sequences=4,
...     return_dict_in_generate=True,
...     output_scores=True,
... )
>>> transition_scores = model.compute_transition_scores(
...     outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
... )
>>> # If you sum the generated tokens' scores and apply the length penalty, you'll get the sequence scores.
>>> # Tip 1: recomputing the scores is only guaranteed to match with `normalize_logits=False`. Depending on the
>>> # use case, you might want to recompute it with `normalize_logits=True`.
>>> # Tip 2: the output length does NOT include the input length
>>> output_length = np.sum(transition_scores.numpy() < 0, axis=1)
>>> length_penalty = model.generation_config.length_penalty
>>> reconstructed_scores = transition_scores.sum(axis=1) / (output_length**length_penalty)
>>> print(np.allclose(outputs.sequences_scores, reconstructed_scores))
True
TFGenerationMixin
A class containing all of the functions supporting generation, to be used as a mixin in TFPreTrainedModel.
The class exposes generate(), which can be used for:
- greedy decoding by calling greedy_search() if num_beams=1 and do_sample=False
- contrastive search by calling contrastive_search() if penalty_alpha>0 and top_k>1
- multinomial sampling by calling sample() if num_beams=1 and do_sample=True
- beam-search decoding by calling beam_search() if num_beams>1
You do not need to call any of the above methods directly. Pass custom parameter values to ‘generate’ instead. To learn more about decoding strategies refer to the text generation strategies guide.
generate
( inputs: Optional = None generation_config: Optional = None logits_processor: Optional = None seed = None **kwargs ) → ModelOutput or tf.Tensor
Parameters
-  inputs (tf.Tensor of varying shape depending on the modality, optional) — The sequence used as a prompt for the generation or as model inputs to the encoder. If None, the method initializes it with bos_token_id and a batch size of 1. For decoder-only models, inputs should be in the format of input_ids. For encoder-decoder models, inputs can represent any of input_ids, input_values, input_features, or pixel_values.
-  generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation.
-  logits_processor (LogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config, an error is thrown. This feature is intended for advanced users.
-  seed (List[int], optional) — Random seed to control sampling, containing two integers, used when do_sample is True. See the seed argument from stateless functions in tf.random.
-  kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder-specific kwargs should not be prefixed and decoder-specific kwargs should be prefixed with decoder_.
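The decoder_ naming convention for kwargs can be illustrated with a small toy function. `split_encoder_decoder_kwargs` is a hypothetical name used only here; the sketch shows how the prefix tells the two groups apart and makes no claim about how the library routes the arguments internally.

```python
def split_encoder_decoder_kwargs(model_kwargs, decoder_prefix="decoder_"):
    """Partition kwargs by the `decoder_` prefix (illustrative helper only)."""
    # Unprefixed kwargs are treated as encoder-specific.
    encoder_kwargs = {k: v for k, v in model_kwargs.items() if not k.startswith(decoder_prefix)}
    # Prefixed kwargs are treated as decoder-specific; the keys are left unchanged.
    decoder_kwargs = {k: v for k, v in model_kwargs.items() if k.startswith(decoder_prefix)}
    return encoder_kwargs, decoder_kwargs


encoder_kwargs, decoder_kwargs = split_encoder_decoder_kwargs(
    {"attention_mask": [1, 1, 1], "decoder_attention_mask": [1, 1]}
)
print(encoder_kwargs)
print(decoder_kwargs)
```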
Returns
ModelOutput or tf.Tensor
A ModelOutput (if return_dict_in_generate=True or when
config.return_dict_in_generate=True) or a tf.Tensor.
If the model is not an encoder-decoder model (model.config.is_encoder_decoder=False), the possible
ModelOutput types are:
- TFGreedySearchDecoderOnlyOutput,
- TFSampleDecoderOnlyOutput,
- TFBeamSearchDecoderOnlyOutput,
- TFBeamSampleDecoderOnlyOutput
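If the model is an encoder-decoder model (model.config.is_encoder_decoder=True), the possible
ModelOutput types are:
- TFGreedySearchEncoderDecoderOutput,
- TFSampleEncoderDecoderOutput,
- TFBeamSearchEncoderDecoderOutput,
- TFBeamSampleEncoderDecoderOutput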
Generates sequences of token ids for models with a language modeling head.
Most generation-controlling parameters are set in generation_config which, if not passed, will be set to the
model’s default generation configuration. You can override any generation_config by passing the corresponding
parameters to generate, e.g. .generate(inputs, num_beams=4, do_sample=True).
For an overview of generation strategies and code examples, check out the following guide.
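The override behaviour described above can be sketched without loading a model. `ToyGenerationConfig` and `resolve_config` are hypothetical stand-ins for illustration, with only three fields: kwargs that match a config attribute replace the base value, and everything else is kept aside as model-specific kwargs.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class ToyGenerationConfig:
    """Hypothetical stand-in for GenerationConfig, reduced to three fields."""
    max_length: int = 20
    num_beams: int = 1
    do_sample: bool = False


def resolve_config(base, **kwargs):
    """Return (config with overrides applied, leftover model kwargs)."""
    names = {f.name for f in fields(base)}
    # kwargs matching a config attribute override the base value.
    overrides = {k: v for k, v in kwargs.items() if k in names}
    # Anything else is assumed to be meant for the model's forward pass.
    model_kwargs = {k: v for k, v in kwargs.items() if k not in names}
    # `replace` builds a new object, so the base configuration is untouched.
    return replace(base, **overrides), model_kwargs


base = ToyGenerationConfig()
config, model_kwargs = resolve_config(base, num_beams=4, do_sample=True, attention_mask=[1, 1])
print(config)
print(model_kwargs)
```

Working on a copy rather than mutating the base object keeps the model's default configuration stable across calls.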
compute_transition_scores
< source >( sequences: Tensor scores: Tuple beam_indices: Optional = None normalize_logits: bool = False  ) → tf.Tensor
Parameters
-  sequences (tf.Tensor) — The generated sequences. The second dimension (sequence_length) is either equal to max_length or shorter if all batches finished early due to the eos_token_id.
-  scores (tuple(tf.Tensor)) — Transition scores for each vocabulary token at each generation step. Beam transition scores consisting of log probabilities of tokens conditioned on log softmax of previously generated tokens. Tuple of tf.Tensor with up to max_new_tokens elements (one element for each generated token), with each tensor of shape (batch_size*num_beams, config.vocab_size).
-  beam_indices (tf.Tensor, optional) — Beam indices of generated token id at each generation step. tf.Tensor of shape (batch_size*num_return_sequences, sequence_length). Only required if num_beams>1 at generate-time.
-  normalize_logits (bool, optional, defaults to False) — Whether to normalize the logits (which, for legacy reasons, may be unnormalized).
Returns
tf.Tensor
A tf.Tensor of shape (batch_size*num_return_sequences, sequence_length) containing
the transition scores (logits)
Computes the transition scores of sequences given the generation scores (and beam indices, if beam search was used). This is a convenient method to quickly obtain the scores of the selected tokens at generation time.
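As a rough, NumPy-only illustration of the idea (not the library implementation), the sketch below gathers the score of each selected token from per-step score arrays. It assumes greedy search or sampling without beams, so no beam_indices are involved; `toy_transition_scores` and `log_softmax` are hypothetical helpers, and the input values are made up.

```python
import numpy as np


def log_softmax(x):
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = x - np.max(x, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))


def toy_transition_scores(generated_tokens, scores, normalize_logits=False):
    """generated_tokens: (batch_size, steps) ints; scores: tuple of (batch_size, vocab_size) arrays."""
    # Stack the per-step scores into shape (batch_size, steps, vocab_size).
    stacked = np.stack(scores, axis=1)
    if normalize_logits:
        stacked = log_softmax(stacked)
    # At every step, pick the score of the token that was actually generated.
    picked = np.take_along_axis(stacked, generated_tokens[..., None], axis=-1)
    return picked[..., 0]


# Made-up scores: one sequence, two generation steps, a vocabulary of three tokens.
scores = (np.array([[2.0, 0.0, 0.0]]), np.array([[0.0, 1.0, 0.0]]))
generated_tokens = np.array([[0, 1]])

raw = toy_transition_scores(generated_tokens, scores)
normalized = toy_transition_scores(generated_tokens, scores, normalize_logits=True)
print(raw)
print(np.exp(normalized))  # per-token probabilities
```

With normalize_logits=True the result holds log probabilities, so exponentiating gives the probability of each selected token.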
Examples:
>>> from transformers import GPT2Tokenizer, TFAutoModelForCausalLM
>>> import numpy as np
>>> tokenizer = GPT2Tokenizer.from_pretrained("openai-community/gpt2")
>>> model = TFAutoModelForCausalLM.from_pretrained("openai-community/gpt2")
>>> tokenizer.pad_token_id = tokenizer.eos_token_id
>>> inputs = tokenizer(["Today is"], return_tensors="tf")
>>> # Example 1: Print the scores for each token generated with Greedy Search
>>> outputs = model.generate(**inputs, max_new_tokens=5, return_dict_in_generate=True, output_scores=True)
>>> transition_scores = model.compute_transition_scores(
...     outputs.sequences, outputs.scores, normalize_logits=True
... )
>>> # input_length is the length of the input prompt for decoder-only models, like the GPT family, and 1 for
>>> # encoder-decoder models, like BART or T5.
>>> input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
>>> generated_tokens = outputs.sequences[:, input_length:]
>>> for tok, score in zip(generated_tokens[0], transition_scores[0]):
...     # | token | token string | logits | probability
...     print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.2%}")
|   262 |  the     | -1.413 | 24.33%
|  1110 |  day     | -2.609 | 7.36%
|   618 |  when    | -2.009 | 13.41%
|   356 |  we      | -1.859 | 15.58%
|   460 |  can     | -2.508 | 8.14%
>>> # Example 2: Reconstruct the sequence scores from Beam Search
>>> outputs = model.generate(
...     **inputs,
...     max_new_tokens=5,
...     num_beams=4,
...     num_return_sequences=4,
...     return_dict_in_generate=True,
...     output_scores=True,
... )
>>> transition_scores = model.compute_transition_scores(
...     outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
... )
>>> # If you sum the generated tokens' scores and apply the length penalty, you'll get the sequence scores.
>>> # Tip: recomputing the scores is only guaranteed to match with `normalize_logits=False`. Depending on the
>>> # use case, you might want to recompute it with `normalize_logits=True`.
>>> output_length = input_length + np.sum(transition_scores.numpy() < 0, axis=1)
>>> length_penalty = model.generation_config.length_penalty
>>> reconstructed_scores = np.sum(transition_scores, axis=1) / (output_length**length_penalty)
>>> print(np.allclose(outputs.sequences_scores, reconstructed_scores))
True
FlaxGenerationMixin
A class containing all functions for auto-regressive text generation, to be used as a mixin in FlaxPreTrainedModel.
The class exposes generate(), which can be used for:
- greedy decoding by calling _greedy_search() if num_beams=1 and do_sample=False
- multinomial sampling by calling _sample() if num_beams=1 and do_sample=True
- beam-search decoding by calling _beam_search() if num_beams>1 and do_sample=False
You do not need to call any of the above methods directly. Pass custom parameter values to generate() instead. To learn more about decoding strategies, refer to the text generation strategies guide.
generate
< source >( input_ids: Array generation_config: Optional = None prng_key: Optional = None trace: bool = True params: Optional = None logits_processor: Optional = None **kwargs )
Parameters
-  input_ids (jnp.ndarray of shape (batch_size, sequence_length)) — The sequence used as a prompt for the generation.
-  generation_config (~generation.GenerationConfig, optional) — The generation configuration to be used as base parametrization for the generation call. **kwargs passed to generate matching the attributes of generation_config will override them. If generation_config is not provided, the default will be used, which has the following loading priority: 1) from the generation_config.json model file, if it exists; 2) from the model configuration. Please note that unspecified parameters will inherit GenerationConfig’s default values, whose documentation should be checked to parameterize generation.
-  trace (bool, optional, defaults to True) — Whether to trace generation. Setting trace=False should only be used for debugging and will lead to a considerably slower runtime.
-  params (Dict[str, jnp.ndarray], optional) — Optionally the model parameters can be passed. Can be useful for parallelized generation.
-  logits_processor (FlaxLogitsProcessorList, optional) — Custom logits processors that complement the default logits processors built from arguments and generation config. If a logit processor is passed that is already created with the arguments or a generation config, an error is thrown. This feature is intended for advanced users.
-  kwargs (Dict[str, Any], optional) — Ad hoc parametrization of generate_config and/or additional model-specific kwargs that will be forwarded to the forward function of the model. If the model is an encoder-decoder model, encoder-specific kwargs should not be prefixed and decoder-specific kwargs should be prefixed with decoder_.
Generates sequences of token ids for models with a language modeling head.