|
--- |
|
license: apache-2.0 |
|
datasets: |
|
- HuggingFaceFW/fineweb |
|
- PleIAs/YouTube-Commons |
|
- allenai/WildChat-1M |
|
- Salesforce/xlam-function-calling-60k |
|
- ShareGPT4Video/ShareGPT4Video |
|
- OpenGVLab/ShareGPT-4o |
|
- TempoFunk/webvid-10M |
|
- MBZUAI/VideoInstruct-100K |
|
- Isaak-Carter/j.o.s.i.e.v4.0.1o |
|
- NousResearch/dolma-v1_7-c4 |
|
- NousResearch/dolma-v1_7-cc_en_head |
|
- nyu-visionx/Cambrian-10M |
|
- LargeWorldModel/ultrachat_qa_mix_1M |
|
- LargeWorldModel/ultrachat_qa_mix_512K |
|
- LargeWorldModel/ultrachat_qa_mix_256K |
|
- LargeWorldModel/ultrachat_qa_mix_128K |
|
- nkp37/OpenVid-1M |
|
- HuggingFaceFV/finevideo |
|
language: |
|
- de |
|
- en |
|
tags: |
|
- moe |
|
- multimodal |
|
- any-to-any |
|
- vision |
|
- audio |
|
- end-to-end
|
- j.o.s.i.e. |
|
--- |
|
|
|
Project JOSIE: Just One Super Intelligent Entity |
|
|
|
Overview: |
|
|
|
Project JOSIE aims to create a next-generation, multimodal AI assistant designed to operate in real-time. The ultimate goal of JOSIE is to offer comprehensive support for personal assistance and smart home management, closely resembling the functionality of popular fictional AI assistants like JARVIS. JOSIE’s architecture is designed to handle complex, multi-sensory input, processing diverse data formats such as text, speech, images, and video. The initial implementation focuses on text and speech-to-text capabilities, with future iterations planned to introduce robust visual processing through both image and video inputs. |
|
|
|
The system includes a real-time speech module that handles diverse accents and emotional tones and produces thoughtful responses. This means JOSIE is not only fast but also considerate in her replies, similar to the way GPT-4o manages nuanced interactions or Moshi inserts a thinking pause before responding. JOSIE can adapt her tone, emphasize empathy, and match conversational flow to create a natural, engaging dialogue experience with the user.
|
|
|
Use Case: |
|
|
|
JOSIE’s primary use case is real-time personal assistance, with an emphasis on home automation and management. She is intended to autonomously handle routine smart home tasks, with the capability to initiate conversations or prompt user interaction when necessary (e.g., identifying unrecognized individuals in security footage). While this core capability is geared toward managing an interconnected smart home, the system’s multimodal foundation allows JOSIE to adapt across other applications, potentially extending to health monitoring, environment mapping, and real-time AI-driven decision-making. |
|
|
|
Model Card for JOSIE |
|
|
|
Model Details: |
|
• Model Name: JOSIE-v4o |
|
• Version: 4.0 |
|
• Model Type: Multimodal, real-time assistant |
|
• Primary Use: Smart home management, personal assistance |
|
• Current Modalities: |
|
• Input: Text, Speech |
|
• Output: Text, Speech |
|
• Upcoming Modalities:
|
• Input: Image, Video, Depth, Thermal imaging |
|
• Output: Enhanced audiovisual feedback |
|
• Target Audience: Authorized primary users for full capabilities (smart home management and advanced AI interactions); limited assistance mode available for other authorized users. |
|
|
|
Architecture: |
|
|
|
![Main architecture](Drawing%202024-10-28%2019.59.08.excalidraw.png)
|
|
|
• Core Framework: A central general-purpose LLM (LLaMA/Qwen) processes discrete tokens generated from various sensory inputs. |
|
• Audio Processing: Employs RQ-Transformers with temporal and depth transformers, encoding raw audio into discrete tokens that the LLM processes. The tokens are then decoded back into audio responses, with the RQ-Transformer converting output tokens into Mel spectrograms that a vocoder renders into audio. |
|
• Vision Processing (Planned): Image and video input will be handled by a separate vision transformer, which will produce discrete tokens that are merged with audio and text embeddings for unified interpretation.
|
• Quantization and Tokenization: Implements residual quantization (RQ) with sequential codebooks for efficient tokenization of audio and depth data. Chunked and normalized embeddings are iteratively refined through RQ to produce a compact, final token representation. |
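As a rough illustration of the residual quantization step described above, the sketch below reduces one normalized embedding chunk to a short sequence of discrete codes using sequential codebooks. The codebook count, size, and dimensions are hypothetical, and the randomly initialized codebooks only demonstrate the mechanics; in practice the codebooks would be learned jointly with the RQ-Transformer.

```python
import numpy as np

def residual_quantize(embedding, codebooks):
    """Encode one embedding vector as a list of discrete tokens.

    Each sequential codebook quantizes whatever residual error the previous
    stage left behind, so a few small codebooks give a compact token stack.
    """
    residual = embedding.copy()
    tokens = []
    for codebook in codebooks:                    # one codebook per RQ depth level
        distances = np.linalg.norm(codebook - residual, axis=1)
        index = int(np.argmin(distances))         # nearest code for this stage
        tokens.append(index)
        residual = residual - codebook[index]     # pass the leftover error onward
    return tokens

def residual_dequantize(tokens, codebooks):
    """Approximate the original embedding by summing the selected codes."""
    return sum(cb[t] for cb, t in zip(codebooks, tokens))

# Hypothetical sizes: 4 codebooks of 256 entries over 64-dimensional chunks.
rng = np.random.default_rng(0)
codebooks = [rng.normal(scale=0.1, size=(256, 64)) for _ in range(4)]
chunk = rng.normal(size=64)
chunk /= np.linalg.norm(chunk)                    # normalized embedding chunk

codes = residual_quantize(chunk, codebooks)
approx = residual_dequantize(codes, codebooks)
print(codes, float(np.linalg.norm(chunk - approx)))
```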
|
|
|
Model Use: |
|
|
|
• Input Formats: Currently accepts text and audio. The audio input is converted to discrete tokens via RQ-Transformer, allowing for efficient encoding and storage. |
|
• Output Formats: Generates text and audio responses, with audio decoded back into speech using the RQ-Transformer’s vocoder output pipeline. |
|
• Inference Speed: Real-time processing optimized for low-latency responses; batch-processing techniques speed up generation, enabling seamless interaction in time-sensitive environments such as smart home control.
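A minimal sketch of that round trip, using placeholder components: the class names (AudioRQEncoder, CoreLLM, AudioRQDecoder) and their toy behaviour are invented for illustration and do not correspond to a published JOSIE implementation.

```python
from typing import List

class AudioRQEncoder:
    """Stands in for the RQ-Transformer encoder (raw audio -> discrete tokens)."""
    def encode(self, waveform: List[float]) -> List[int]:
        # A real encoder would chunk, embed, and residual-quantize the audio;
        # here we just derive a few dummy token ids from the samples.
        return [int(abs(x) * 1000) % 256 for x in waveform[:8]]

class CoreLLM:
    """Stands in for the central LLaMA/Qwen model operating on token streams."""
    def generate(self, tokens: List[int]) -> List[int]:
        return list(reversed(tokens))  # dummy "response" tokens

class AudioRQDecoder:
    """Stands in for token -> mel-spectrogram -> vocoder -> waveform decoding."""
    def decode(self, tokens: List[int]) -> List[float]:
        mel_frames = [[t / 255.0] * 4 for t in tokens]      # fake mel spectrogram
        return [v for frame in mel_frames for v in frame]   # fake vocoder output

def respond_to_audio(waveform: List[float]) -> List[float]:
    """Audio in, audio out: the round trip described in the bullets above."""
    encoder, llm, decoder = AudioRQEncoder(), CoreLLM(), AudioRQDecoder()
    input_tokens = encoder.encode(waveform)
    output_tokens = llm.generate(input_tokens)
    return decoder.decode(output_tokens)

print(len(respond_to_audio([0.01 * i for i in range(160)])))  # 8 tokens * 4 samples
```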
|
|
|
Intended Use Cases: |
|
|
|
1. Smart Home Management: Autonomously controls and manages smart home devices, providing alerts and requesting interaction when necessary. |
|
2. Security and Monitoring: Identifies and distinguishes authorized from unauthorized individuals using upcoming vision capabilities. |
|
3. Personal Assistance: Engages in general-purpose conversations, provides reminders, and assists with basic daily tasks through conversational interaction. |
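Given that the training mix includes a function-calling dataset (Salesforce/xlam-function-calling-60k), smart-home actions would most naturally be exposed to the model as callable tools. The schema and dispatcher below are a hypothetical sketch of that pattern; the tool names and parameters are assumptions, not a documented JOSIE interface.

```python
import json

# Hypothetical tool declaration in the style of common function-calling schemas.
SET_LIGHT_TOOL = {
    "name": "set_light",
    "description": "Turn a smart light on or off and optionally set brightness.",
    "parameters": {
        "type": "object",
        "properties": {
            "room": {"type": "string"},
            "on": {"type": "boolean"},
            "brightness": {"type": "integer", "minimum": 0, "maximum": 100},
        },
        "required": ["room", "on"],
    },
}

def set_light(room: str, on: bool, brightness: int = 100) -> str:
    # Placeholder for a real smart-home API call.
    state = f"on at {brightness}%" if on else "off"
    return f"Light in {room} is now {state}."

TOOL_REGISTRY = {"set_light": set_light}

def dispatch_tool_call(raw_call: str) -> str:
    """Execute a model-emitted tool call given as a JSON string."""
    call = json.loads(raw_call)
    return TOOL_REGISTRY[call["name"]](**call["arguments"])

# A call the model might emit after "dim the kitchen lights to 40 percent".
print(dispatch_tool_call(
    '{"name": "set_light", "arguments": {"room": "kitchen", "on": true, "brightness": 40}}'
))
```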
|
|
|
Model Capabilities: |
|
|
|
• Real-Time Processing: Handles continuous data input with second-by-second updates, issuing commands and engaging in dialogue as needed.
|
• Autonomous Behavior: Responds to certain triggers (e.g., security concerns or abnormal events) autonomously, yet requests user input for actions requiring confirmation. |
|
• Proactive Interactivity: Acts as both a responsive assistant and a proactive agent, initiating conversations when a task, anomaly, or user behavior warrants attention. |
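One way to picture the autonomous-versus-confirmation split described above is a small event handler that acts on low-risk triggers on its own and defers high-stakes ones to the user. The event names and the risk policy below are illustrative assumptions, not the shipped logic.

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # e.g. "lights_left_on", "unknown_person", "temperature_high"
    detail: str

# Hypothetical policy: which triggers JOSIE may act on without asking.
AUTONOMOUS_ACTIONS = {
    "temperature_high": "Lowering the thermostat by 2 degrees.",
    "lights_left_on": "Turning off the unused lights.",
}
CONFIRMATION_REQUIRED = {
    "unknown_person": "I see someone I don't recognize. Should I alert you or let them in?",
}

def handle_event(event: Event) -> str:
    """Return JOSIE's reaction: act autonomously, ask the user, or just log."""
    if event.kind in AUTONOMOUS_ACTIONS:
        return f"[autonomous] {AUTONOMOUS_ACTIONS[event.kind]}"
    if event.kind in CONFIRMATION_REQUIRED:
        return f"[asks user] {CONFIRMATION_REQUIRED[event.kind]}"
    return f"[logged] {event.kind}: no action needed."

for e in (Event("lights_left_on", "hallway"), Event("unknown_person", "front door")):
    print(handle_event(e))
```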
|
|
|
Limitations: |
|
|
|
• Current Modalities: Limited to text and speech, with vision functionality forthcoming. |
|
• Authorized Access Only: Full capabilities are limited to the primary user, with a restricted, general-purpose assistant mode for other authorized users. |
|
• Data-Intensive: Real-time processing requires significant data bandwidth and computational resources, particularly for multimodal and high-frequency tasks. |
|
|
|
Future Enhancements: |
|
|
|
• Vision and Depth Modalities: Planned addition of image and video input, enabling JOSIE to analyze visual data for a broader range of use cases. |
|
• Expanded Memory and Interaction Context: Potential for an expanded memory module to increase contextual awareness and allow for longer interactions without losing track of prior exchanges. |
|
• Enhanced Security and Recognition: Deep learning algorithms for security and monitoring applications, especially for facial recognition, gesture detection, and other high-stakes, real-time tasks. |
|
|
|
Ethical Considerations: |
|
|
|
• Data Privacy: Ensures that any visual, auditory, or environmental data collected remains private to the authorized user. |
|
• Bias and Fairness: JOSIE is trained to provide unbiased support. Future model improvements will address potential biases in visual and auditory data processing.
|
|
|
Project JOSIE’s roadmap is set on delivering a true multimodal experience, with the real-time integration of various sensory inputs and outputs at its core. This combination positions JOSIE to transform everyday interactions into seamless, responsive, and secure AI-powered experiences. |