google/gemma-3n-E4B-it-litert-preview

Jun 2
Gemma 3n is quite fast. On cell phones it works very well, which makes it possible to reach more people.
Maybe instead of buying a GPU I should buy a decent cell phone to use as a server.
kth8
Jun 3
What? 4x4 is not a good question to test a model's capabilities, and 0.4 tokens per second is not fast.
Renu11
Google org Jul 3
We're so glad to hear you're impressed with Gemma 3n's performance on mobile phones. Efficiency and speed on edge devices, along with making AI more accessible, are key areas we've focused on. Thanks for sharing your experience.
jeffzhou2000
Jul 4
edited Jul 4
gemma-3n-E2B on a Snapdragon 8 Elite-based Android phone:
/data/local/tmp/llama-cli  -ngl 99 -t 4 -n 256 --no-warmup  -mg 4 -no-cnv -m /sdcard/gemma-3n-E2B-it-Q8_0.gguf -p "introduce the movie Once Upon a Time in America briefly.\n"
warning: no usable GPU found, --gpu-layers option will be ignored
warning: one possible reason is that llama.cpp was compiled without GPU support
warning: consult docs/build.md for compilation instructions
warning: llama.cpp was compiled without support for GPU offload. Setting the main GPU has no effect.
build: 6017 (b6b6684b8) with Android (12896553, +pgo, +bolt, +lto, +mlgo, based on r530567c) clang version 19.0.0 (https://android.googlesource.com/toolchain/llvm-project 97a699bf4812a18fb657c2779f5296a4ab2694d2) for x86_64-unknown-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 42 key-value pairs and 727 tensors from /sdcard/gemma-3n-E2B-it-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3n
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 4.5B
llama_model_loader: - kv   3:                            general.license str              = gemma
llama_model_loader: - kv   4:                   general.base_model.count u32              = 1
llama_model_loader: - kv   5:                  general.base_model.0.name str              = Gemma 3n E4b It
llama_model_loader: - kv   6:          general.base_model.0.organization str              = Google
llama_model_loader: - kv   7:              general.base_model.0.repo_url str              = https://huggingface.co/google/gemma-3...
llama_model_loader: - kv   8:                               general.tags arr[str,5]       = ["automatic-speech-recognition", "aut...
llama_model_loader: - kv   9:                     gemma3n.context_length u32              = 32768
llama_model_loader: - kv  10:                   gemma3n.embedding_length u32              = 2048
llama_model_loader: - kv  11:                        gemma3n.block_count u32              = 30
llama_model_loader: - kv  12:                gemma3n.feed_forward_length u32              = 8192
llama_model_loader: - kv  13:               gemma3n.attention.head_count u32              = 8
llama_model_loader: - kv  14:   gemma3n.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  15:               gemma3n.attention.key_length u32              = 256
llama_model_loader: - kv  16:             gemma3n.attention.value_length u32              = 256
llama_model_loader: - kv  17:                     gemma3n.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  18:           gemma3n.attention.sliding_window u32              = 512
llama_model_loader: - kv  19:            gemma3n.attention.head_count_kv u32              = 2
llama_model_loader: - kv  20:                   gemma3n.altup.active_idx u32              = 0
llama_model_loader: - kv  21:                   gemma3n.altup.num_inputs u32              = 4
llama_model_loader: - kv  22:   gemma3n.embedding_length_per_layer_input u32              = 256
llama_model_loader: - kv  23:         gemma3n.attention.shared_kv_layers f32              = 10.000000
llama_model_loader: - kv  24:          gemma3n.activation_sparsity_scale arr[f32,30]      = [1.644853, 1.644853, 1.644853, 1.6448...
llama_model_loader: - kv  25:   gemma3n.attention.sliding_window_pattern arr[bool,30]     = [true, true, true, true, false, true,...
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {{ bos_token }}\n{%- if messages[0]['r...
llama_model_loader: - kv  27:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  28:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  29:                      tokenizer.ggml.tokens arr[str,262144]  = ["<pad>", "<eos>", "<bos>", "<unk>", ...
llama_model_loader: - kv  30:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  31:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  32:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  33:                tokenizer.ggml.eos_token_id u32              = 1
llama_model_loader: - kv  34:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  35:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  36:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  37:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  38:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  39:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  40:               general.quantization_version u32              = 2
llama_model_loader: - kv  41:                          general.file_type u32              = 7
llama_model_loader: - type  f32:  362 tensors
llama_model_loader: - type  f16:   93 tensors
llama_model_loader: - type q8_0:  272 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 4.45 GiB (8.59 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3n
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 2048
print_info: n_layer          = 30
print_info: n_head           = 8
print_info: n_head_kv        = 2
print_info: n_rot            = 256
print_info: n_swa            = 512
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 1.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = E2B
print_info: model params     = 4.46 B
print_info: general.name     = n/a
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '<bos>'
print_info: EOS token        = 1 '<eos>'
print_info: EOT token        = 106 '<end_of_turn>'
print_info: UNK token        = 3 '<unk>'
print_info: PAD token        = 0 '<pad>'
print_info: LF token         = 248 '<0x0A>'
print_info: EOG token        = 1 '<eos>'
print_info: EOG token        = 106 '<end_of_turn>'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =  4560.05 MiB
..........................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified:        CPU KV buffer size =    32.00 MiB
llama_kv_cache_unified: size =   32.00 MiB (  4096 cells,   4 layers,  1 seqs), K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 1024 cells
llama_kv_cache_unified:        CPU KV buffer size =    32.00 MiB
llama_kv_cache_unified: size =   32.00 MiB (  1024 cells,  16 layers,  1 seqs), K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_kv_cache_unified: LLAMA_SET_ROWS=0, using old ggml_cpy() method for backwards compatibility
llama_context:        CPU compute buffer size =   520.00 MiB
llama_context: graph nodes  = 2881
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
main: llama threadpool init, n_threads = 4

system_info: n_threads = 4 (n_threads_batch = 4) / 8 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | REPACK = 1 | 

backend 4
introduce the movie Once Upon a Time in America briefly.
sampler seed: 2347874688
sampler params: 
    repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
    dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
    top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
    mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 256, n_keep = 1

Once Upon a Time in America (1984) is a sprawling crime epic directed by Sergio Leone. It follows the intertwined lives of Jewish immigrants who become gangsters in 1920s and 1930s New York and the American Midwest. The film is renowned for its stunning cinematography, operatic score, and exploration of themes like friendship, betrayal, and the corrupting influence of power.

Here's a more detailed look at the movie:

*   **Plot Summary:** The film chronicles the rise and fall of a gang of Jewish gangsters, starting with their childhood friendship and culminating in their violent conflicts and eventual disbandment.
*   **Key Themes:** Friendship, betrayal, the allure of power, the cyclical nature of violence, and the loss of innocence.
*   **Notable Elements:**
    *   **Cinematography:** Robert De Niro’s use of long takes and sweeping vistas creates a sense of epic scope and atmosphere.
    *   **Score:** Ennio Morricone’s iconic score is a central part of the film’s identity.
    *   **Characters:** The film features a memorable ensemble cast, each with complex motivations and flaws.
*   **Critical Reception:** While controversial

llama_perf_sampler_print:    sampling time =      39.76 ms /   269 runs   (    0.15 ms per token,  6764.91 tokens per second)
llama_perf_context_print:        load time =     964.26 ms
llama_perf_context_print: prompt eval time =     359.24 ms /    13 tokens (   27.63 ms per token,    36.19 tokens per second)
llama_perf_context_print:        eval time =   14672.73 ms /   255 runs   (   57.54 ms per token,    17.38 tokens per second)
llama_perf_context_print:       total time =   15712.05 ms /   268 tokens
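
For anyone who wants to sanity-check the numbers in the log above, here is a small sketch that recomputes the throughput and the KV buffer sizes from the printed values. The function names are made up for illustration, and the KV size formula (cells x layers x n_embd_k_gqa x 2 bytes for f16) is an assumption that happens to reproduce the 16.00 MiB figures in the log, not something taken from the llama.cpp source.

```python
# Recompute figures printed in the llama.cpp log above.
# Helper names are hypothetical; the input values are copied from the log.

def tokens_per_second(total_ms: float, n_tokens: int) -> float:
    """Throughput from a total time in milliseconds and a token count."""
    return n_tokens / (total_ms / 1000.0)


def kv_buffer_mib(n_cells: int, n_layers: int, n_embd_gqa: int,
                  bytes_per_elem: int = 2) -> float:
    """Assumed size of one K (or V) buffer in MiB; f16 uses 2 bytes per element."""
    return n_cells * n_layers * n_embd_gqa * bytes_per_elem / (1024 ** 2)


if __name__ == "__main__":
    # "eval time = 14672.73 ms / 255 runs" and "prompt eval time = 359.24 ms / 13 tokens"
    print(f"generation: {tokens_per_second(14672.73, 255):.2f} tokens/s")
    print(f"prompt:     {tokens_per_second(359.24, 13):.2f} tokens/s")

    # non-SWA cache: 4096 cells, 4 layers; SWA cache: 1024 cells, 16 layers.
    # n_embd_k_gqa = 512 per the print_info lines.
    print(f"non-SWA K:  {kv_buffer_mib(4096, 4, 512):.2f} MiB")
    print(f"SWA K:      {kv_buffer_mib(1024, 16, 512):.2f} MiB")
```

With these inputs the generation rate comes out at about 17.38 tokens per second on CPU only (the log shows that no GPU offload was used), which is well above the 0.4 tokens per second mentioned earlier in the thread.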
gemma 3n, It's good