## Setting Up

In [None]:
%%capture
%pip install langchain
%pip install langchain-community 
%pip install langchainhub 
%pip install langchain-chroma 
%pip install langchain-groq
%pip install langchain-huggingface
%pip install "unstructured[docx]"

## Groq Python API

In [2]:
import os
from groq import Groq

groq_api_key = os.environ.get("GROQ_API_KEY")

client = Groq(
    api_key=groq_api_key,
)


chat_streaming = client.chat.completions.create(
    messages=[
        {"role": "system", "content": "You are a professional Data Engineer."},
        {"role": "user", "content": "Can you explain how the data lake works?"},
    ],
    model="meta-llama/llama-4-scout-17b-16e-instruct",
    temperature=0.3,
    max_tokens=1200,
    top_p=1,
    stop=None,
    stream=True,
)

# Print tokens as they arrive; some chunks (such as the last one) may carry no text
for chunk in chat_streaming:
    print(chunk.choices[0].delta.content or "", end="")

As a Data Engineer, I'd be happy to explain how a data lake works.

**What is a Data Lake?**

A data lake is a centralized repository that stores raw, unprocessed data in its native format. It's a scalable and flexible storage solution that allows you to store and process large amounts of structured, semi-structured, and unstructured data. The data lake is often used as a precursor to data warehousing, data analytics, and machine learning.

**Key Components of a Data Lake**

1. **Storage**: The storage layer is the foundation of a data lake. It's typically a distributed file system, such as Hadoop Distributed File System (HDFS), Amazon S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). This layer stores raw data in its native format, without any transformation or processing.
2. **Data Ingestion**: Data ingestion is the process of collecting data from various sources and loading it into the data lake. This can be done through various methods, such as batch processing, st
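
When streaming, some chunks may carry no text (for example, a final chunk whose `delta.content` is `None`), so printing the field directly can emit the literal string `None`. The sketch below shows one way to collect only the text pieces. It uses stand-in objects built with `SimpleNamespace` instead of real Groq chunks, and the helpers `make_chunk` and `collect_stream` are hypothetical names introduced for illustration.

```python
from types import SimpleNamespace


def make_chunk(text):
    """Build a stand-in for a streaming chunk (hypothetical helper, not part of the Groq SDK)."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


def collect_stream(chunks):
    """Join the text of all chunks, skipping chunks whose content is None."""
    parts = []
    for chunk in chunks:
        piece = chunk.choices[0].delta.content
        if piece is not None:
            parts.append(piece)
    return "".join(parts)


# Simulated stream: two text chunks followed by an empty final chunk
fake_stream = [make_chunk("A data "), make_chunk("lake"), make_chunk(None)]
full_text = collect_stream(fake_stream)
print(full_text)
```

Collecting the pieces into a string, rather than only printing them, also keeps the full response available for later cells.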

## Initializing the LLM and Embedding Model

In [3]:
from langchain_groq import ChatGroq

llm = ChatGroq(model="meta-llama/llama-4-scout-17b-16e-instruct", api_key=groq_api_key)

In [11]:
from langchain_huggingface import HuggingFaceEmbeddings
embed_model = HuggingFaceEmbeddings(model_name="mixedbread-ai/mxbai-embed-large-v1")

## Loading and Splitting the Data

In [5]:
from langchain_community.document_loaders import DirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Initialize the text splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separators=["\n\n", "\n"],
)

# Load the .docx files
loader = DirectoryLoader("./", glob="*.docx", use_multithreading=True)
documents = loader.load()

# Split the documents into chunks
chunks = text_splitter.split_documents(documents)

# Print the number of chunks
print(len(chunks))


29
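
To make `chunk_size` and `chunk_overlap` more concrete, here is a simplified sliding-window splitter in plain Python. It is only an illustration under stated assumptions: it cuts purely by character count, whereas `RecursiveCharacterTextSplitter` also tries to break on the configured separators, so its chunks will differ. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
```

The overlap means a sentence that straddles a chunk boundary still appears in one piece in at least one chunk, at the cost of storing some text twice.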


## Creating the Vector Store

In [6]:
from langchain_chroma import Chroma

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embed_model,
    persist_directory="./Vectordb",
)

In [9]:
query = "What is this tutorial about?"
docs = vectorstore.similarity_search(query)
print(docs[0].page_content)

Learn how to Fine-tune Stable Diffusion XL with DreamBooth and LoRA on your personal images. 

Let’s try another prompt:

Prompt:
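
Conceptually, a similarity search embeds the query and returns the stored chunks whose vectors are closest to it. The sketch below ranks a few made-up 3-dimensional vectors by cosine similarity, which is one common choice; the metric the vector store actually uses depends on its configuration, and the names `toy_index` and `query_vector` are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only)
toy_index = {
    "fine-tuning guide": [0.9, 0.1, 0.0],
    "docker setup notes": [0.1, 0.8, 0.3],
    "unrelated recipe": [0.0, 0.2, 0.9],
}
query_vector = [0.8, 0.2, 0.1]

# Sort document names from most to least similar to the query
ranked = sorted(
    toy_index,
    key=lambda name: cosine_similarity(query_vector, toy_index[name]),
    reverse=True,
)
print(ranked)
```

Real embedding vectors have hundreds or thousands of dimensions, but the ranking step works the same way.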


## Creating the RAG pipeline

In [12]:
# Create retriever
retriever = vectorstore.as_retriever()

# Import PromptTemplate
from langchain_core.prompts import PromptTemplate

# Define the prompt template for the RAG chain
template = """You are an expert assistant tasked with answering questions based on the provided documents.
Use only the given context to generate your answer.
If the answer cannot be found in the context, clearly state that you do not know.
Be detailed and precise in your response, but avoid mentioning or referencing the context itself.

Context:
{context}

Question:
{question}

Answer:"""

# Create the PromptTemplate
rag_prompt = PromptTemplate.from_template(template)
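
The `{context}` and `{question}` placeholders are filled in when the prompt is invoked. As a rough illustration, the sketch below substitutes values into a shortened copy of the template with plain `str.format`. This is a stand-in, not `PromptTemplate` itself, and the name `demo_template` and the sample values are made up.

```python
# Shortened copy of the prompt layout, with the same two placeholders
demo_template = """Context:
{context}

Question:
{question}

Answer:"""

rendered = demo_template.format(
    context="Chunk A text.\n\nChunk B text.",
    question="What is this tutorial about?",
)
print(rendered)
```

Printing the rendered prompt like this is a quick way to check what the model will actually receive.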


In [13]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | rag_prompt
    | llm
    | StrOutputParser()
)
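
In the chain above, the retriever's output (a list of documents) goes straight into the `{context}` slot, so the prompt likely receives the list's string representation, metadata included. A common alternative is to join only the text of each document first. The sketch below shows that idea with a stand-in class: `FakeDocument` and `format_docs` are hypothetical names, and only the `page_content` attribute is modelled.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document; only page_content is modelled."""
    page_content: str


def format_docs(docs):
    """Join the text of each document, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)


retrieved = [FakeDocument("First chunk."), FakeDocument("Second chunk.")]
context_text = format_docs(retrieved)
print(context_text)
```

In LangChain's expression language, a function like this can typically be composed as `retriever | format_docs` for the `context` entry; treat that as a suggestion to verify against your installed version.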

In [14]:
from IPython.display import Markdown

response = rag_chain.invoke("What is this tutorial about?")
Markdown(response)

This tutorial is about setting up and using the Janus project, specifically Janus Pro, a multimodal model that can understand images and generate images from text prompts, and building a local solution to use the model privately on a laptop GPU. It covers learning about the Janus Series, setting up the Janus project, building a Docker container to run the model locally, and testing its capabilities with various image and text prompts.