---
license: apache-2.0
language:
- en
pipeline_tag: question-answering
---

# litert-community/Gecko-110m-en

This model provides a few variants of the embedding model published in the [Gecko paper](https://arxiv.org/abs/2403.20327) that are ready for deployment on Android or iOS using the [LiteRT stack](https://ai.google.dev/edge/litert) or the [Google AI Edge RAG SDK](https://ai.google.dev/edge/mediapipe/solutions/genai/rag).

## Use the models

### Android

* Try out the Gecko embedding model in the [Google AI Edge RAG SDK](https://ai.google.dev/edge/mediapipe/solutions/genai/rag). You can find the SDK on [GitHub](https://github.com/google-ai-edge/ai-edge-apis/tree/main/local_agents/rag) or follow our [Android guide](https://ai.google.dev/edge/mediapipe/solutions/genai/rag/android) to install it directly from Maven. We have also published a [sample app](https://github.com/google-ai-edge/ai-edge-apis/tree/main/examples/rag).
* Use the sentencepiece model as the tokenizer for the Gecko embedding model, as in the sketch below.
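For reference, here is a minimal sketch of wiring the published model files into the RAG SDK's Gecko embedder, adapted from the sample app. The Maven coordinates, class and package names, constructor arguments, and on-device paths below are assumptions to verify against the Android guide and the SDK version you install, not a definitive API reference.

```kotlin
// build.gradle dependency (version is an assumption; see the Android guide for the current one):
// implementation("com.google.ai.edge.localagents:localagents-rag:0.1.0")

import com.google.ai.edge.localagents.rag.models.Embedder
import com.google.ai.edge.localagents.rag.models.GeckoEmbeddingModel
import java.util.Optional

// Placeholder on-device paths; push one of the Gecko .tflite variants and the
// sentencepiece tokenizer from this repository to the device first (e.g. with `adb push`).
private const val GECKO_MODEL_PATH = "/data/local/tmp/gecko.tflite"
private const val TOKENIZER_MODEL_PATH = "/data/local/tmp/sentencepiece.model"
private const val USE_GPU_FOR_EMBEDDINGS = true

// The Gecko embedder is built from the embedding model, the sentencepiece tokenizer,
// and a flag selecting the GPU or CPU backend. The resulting Embedder<String> is what
// the rest of the RAG SDK (vector store, retrieval chain) consumes.
val embedder: Embedder<String> = GeckoEmbeddingModel(
    GECKO_MODEL_PATH,
    Optional.of(TOKENIZER_MODEL_PATH),
    USE_GPU_FOR_EMBEDDINGS,
)
```

Passing `false` for the last argument selects the CPU path; the performance of both backends is summarized in the table below.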
## Performance

### Android

Note that all benchmark stats are from a Samsung S23 Ultra.

| Quantization | Backend | Max sequence length | Init time (ms) | Inference time (ms) | Memory (RSS in MB) | Model size (MB) |
| --- | --- | --- | --- | --- | --- | --- |
| dynamic_int8 | GPU | 256 | 1306.06 | 76.2 | 604.5 | 114 |
| dynamic_int8 | GPU | 512 | 1363.38 | 173.2 | 604.6 | 120 |
| dynamic_int8 | GPU | 1024 | 1419.87 | 397 | 871.1 | 145 |
| dynamic_int8 | CPU | 256 | 11.03 | 147.6 | 126.3 | 114 |
| dynamic_int8 | CPU | 512 | 30.04 | 353.1 | 225.6 | 120 |
| dynamic_int8 | CPU | 1024 | 79.17 | 954 | 619.5 | 145 |
* Model size: measured by the size of the .tflite flatbuffer (serialization format for LiteRT models).
* Memory: indicator of peak RAM usage.
* Inference on CPU is accelerated via the LiteRT [XNNPACK](https://github.com/google/XNNPACK) delegate with 4 threads.
* Inference on GPU is accelerated via the LiteRT GPU delegate.
* Benchmarks are run with the XNNPACK cache enabled.
* dynamic_int8: quantized model with int8 weights and float activations.
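As an illustration of the delegate configuration described above, the sketch below builds an interpreter for the .tflite file with the TensorFlow Lite Java API (which LiteRT keeps compatible), using XNNPACK with 4 threads on CPU or the GPU delegate. The artifact versions are assumptions; this is not the benchmarking harness used to produce the numbers above.

```kotlin
// Dependencies (versions are assumptions; LiteRT also ships compatible artifacts):
// implementation("org.tensorflow:tensorflow-lite:2.16.1")
// implementation("org.tensorflow:tensorflow-lite-gpu:2.16.1")

import org.tensorflow.lite.Interpreter
import org.tensorflow.lite.gpu.GpuDelegate
import java.io.File

// Mirror the benchmark setup: CPU path -> XNNPACK with 4 threads; GPU path -> GPU delegate.
fun buildInterpreter(modelFile: File, useGpu: Boolean): Interpreter {
    val options = Interpreter.Options()
    if (useGpu) {
        options.addDelegate(GpuDelegate())
    } else {
        options.setNumThreads(4)     // 4 CPU threads, as in the benchmark
        options.setUseXNNPACK(true)  // XNNPACK-accelerated CPU kernels
    }
    return Interpreter(modelFile, options)
}
```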