---
license: mit
title: Customer Experience Bot Demo
sdk: gradio
colorFrom: purple
colorTo: green
short_description: CX AI LLM
---

# Internal RAG CX Data Preprocessing Demo

A data preprocessing pipeline for Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG) systems, deployed on Hugging Face as a Model repository (free tier). Built on more than five years of AI experience (since 2020), this demo cleans and prepares call center datasets for enterprise-grade CX applications in SaaS, HealthTech, FinTech, and eCommerce. It uses Pandas to remove nulls, duplicates, short entries, and malformed questions, producing a clean FAQ dataset for downstream RAG/CAG pipelines. The cleaned CSV is compatible with Amazon SageMaker and Azure AI for scalable modeling.

## Technical Architecture

### Data Preprocessing Pipeline

The core of this demo is a comprehensive data preprocessing pipeline designed to clean raw call center datasets:

- **Data Ingestion**:
  - Parses CSVs with `pd.read_csv`, using `io.StringIO` for embedded data, with explicit `quotechar` and `escapechar` to handle complex strings.
  - Handles datasets with columns: `call_id`, `question`, `answer`, `language`.

- **Junk Data Cleanup**:
  - **Null Handling**: Drops rows with missing `question` or `answer` using `df.dropna()`.
  - **Duplicate Removal**: Eliminates redundant FAQs via `df[~df['question'].duplicated()]`.
  - **Short Entry Filtering**: Excludes questions <10 chars or answers <20 chars with `df[(df['question'].str.len() >= 10) & (df['answer'].str.len() >= 20)]`.
  - **Malformed Detection**: Uses regex (`[!?]{2,}|\b(Invalid|N/A)\b`) to filter invalid questions.
  - **Standardization**: Normalizes text (e.g., "mo" to "month") and fills missing `language` with "en".

- **Output**:
  - Generates `cleaned_call_center_faqs.csv` for downstream modeling.
  - Provides cleanup stats: nulls removed, duplicates removed, short entries filtered, malformed entries detected.
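
The steps above can be sketched roughly as follows. This is a minimal illustration rather than the code in `app.py`: the sample rows, the helper names `load_faqs` and `clean_faqs`, and the `\bmo\b` replacement rule are assumptions made for the example. It also restricts `dropna` to the `question` and `answer` columns so that a missing `language` can still be filled with "en", and it uses a non-capturing group in the regex so that `str.contains` does not warn about match groups.

```python
import io

import pandas as pd

# Hypothetical embedded sample; the rows are invented for illustration.
RAW_CSV = '''call_id,question,answer,language
1,"How do I reset my password?","Open Settings, choose Security, then select Reset Password.",en
2,"How do I reset my password?","Open Settings, choose Security, then select Reset Password.",en
3,,"Our support line is open from 9am to 5pm on weekdays.",en
4,"Billing?","Please check the billing page for details.",en
5,"Why was I charged twice??","We refund duplicate charges within five business days.",en
6,"What does the plan cost per mo?","The basic plan costs 10 dollars per mo, billed annually.",
'''

# Same pattern as in the README, with a non-capturing group.
MALFORMED = r"[!?]{2,}|\b(?:Invalid|N/A)\b"


def load_faqs(text):
    """Parse embedded CSV text with explicit quote and escape characters."""
    return pd.read_csv(io.StringIO(text), quotechar='"', escapechar="\\")


def clean_faqs(df):
    """Apply the cleanup steps in order and count the rows each one removes."""
    stats = {}
    remaining = len(df)

    def record(name, frame):
        nonlocal remaining
        stats[name] = remaining - len(frame)
        remaining = len(frame)
        return frame

    df = record("nulls", df.dropna(subset=["question", "answer"]))
    df = record("duplicates", df[~df["question"].duplicated()])
    long_enough = (df["question"].str.len() >= 10) & (df["answer"].str.len() >= 20)
    df = record("short", df[long_enough])
    df = record("malformed", df[~df["question"].str.contains(MALFORMED, regex=True)])

    # Standardization: expand "mo" to "month" and default the language to "en".
    df = df.copy()
    for col in ("question", "answer"):
        df[col] = df[col].str.replace(r"\bmo\b", "month", regex=True)
    df["language"] = df["language"].fillna("en")
    return df.reset_index(drop=True), stats


cleaned, stats = clean_faqs(load_faqs(RAW_CSV))
cleaned.to_csv("cleaned_call_center_faqs.csv", index=False)
print(stats)
```

Counting after each step, rather than once at the end, is what makes the per-category breakdown possible: a row that is both short and malformed is attributed to whichever filter reaches it first.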

### Enterprise-Grade Modeling Compatibility

The cleaned dataset is optimized for:

- **Amazon SageMaker**: Ready for training BERT-based models (e.g., `bert-base-uncased`) for intent classification or FAQ retrieval, deployable via SageMaker JumpStart.
- **Azure AI**: Compatible with Azure Machine Learning pipelines for fine-tuning models like DistilBERT, with the dataset stored in Azure Blob Storage, enabling scalable CX automation.
- **LLM Integration**: Supports fine-tuning LLMs (e.g., `distilgpt2`) for generative tasks, which can then be served through FastAPI for API-driven inference.

## Performance Monitoring and Visualization

The demo includes a performance monitoring suite:

- **Processing Time Tracking**: Measures data ingestion, cleaning, and output times using `time.perf_counter()`, reported in milliseconds.
- **Cleanup Metrics**: Tracks the number of nulls, duplicates, short entries, and malformed entries removed.
- **Visualization**: Uses Matplotlib to plot a bar chart (`cleanup_stats.png`):
  - Bars: Number of entries removed per category (Nulls, Duplicates, Short, Malformed).
  - Palette: Professional muted colors for enterprise aesthetics.
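
A minimal sketch of the timing and chart logic described above, under some assumptions: the helper names `timed_ms` and `plot_cleanup_stats` are hypothetical, the hex colors are placeholder values for a "muted" palette, and the counts are the ones from the example in this README. The `Agg` backend is selected so the chart renders without a display, as on a CPU-only host.

```python
import time

import matplotlib

matplotlib.use("Agg")  # headless backend: render to file, no display needed
import matplotlib.pyplot as plt


def timed_ms(fn, *args):
    """Run fn(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms


def plot_cleanup_stats(stats, path="cleanup_stats.png"):
    """Draw one bar per cleanup category and save the chart to `path`."""
    labels = list(stats.keys())
    counts = list(stats.values())
    # Assumed muted palette; the real app may use different colors.
    colors = ["#6c7a89", "#8e9aaf", "#a3b18a", "#b08968"]
    fig, ax = plt.subplots(figsize=(6, 4))
    ax.bar(labels, counts, color=colors[: len(labels)])
    ax.set_ylabel("Entries removed")
    ax.set_title("Cleanup stats")
    fig.tight_layout()
    fig.savefig(path)
    plt.close(fig)
    return path


# Counts taken from the example in this README.
stats = {"Nulls": 2, "Duplicates": 1, "Short": 1, "Malformed": 0}
result, elapsed = timed_ms(sorted, [3, 1, 2])  # stand-in for a pipeline stage
chart_path = plot_cleanup_stats(stats)
print(f"stage took {elapsed:.3f} ms, chart saved to {chart_path}")
```

`time.perf_counter()` returns seconds as a float, so the difference is multiplied by 1000 to report milliseconds.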

## Gradio Interface for Interactive Demo

The demo is accessible via Gradio, providing an interactive data preprocessing experience:

- **Input**: Upload a sample call center CSV or use the embedded demo dataset.
- **Outputs**:
  - **Cleaned Dataset**: Download `cleaned_call_center_faqs.csv`.
  - **Cleanup Stats**: Detailed breakdown (e.g., “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”).
  - **Performance Plot**: Visual metrics for processing time and cleanup stats.
- **Styling**: Custom dark theme CSS (`#2a2a2a` background, blue buttons) for a sleek, enterprise-ready UI.

## Setup

- Clone this repository to a Hugging Face Model repository (free tier, public).
- Add `requirements.txt` with dependencies (`gradio==4.44.0`, `pandas==2.2.3`, `matplotlib==3.9.2`, etc.).
- Upload `app.py` (includes embedded demo dataset for seamless deployment).
- Configure to run with Python 3.9+, CPU hardware (no GPU).

## Usage

- **Upload CSV**: Provide a call center CSV in the Gradio UI, or use the default demo dataset.
- **Output**:
  - **Cleaned Dataset**: Download the processed `cleaned_call_center_faqs.csv`.
  - **Cleanup Stats**: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
  - **Performance Plot**: Visual metrics for processing time and cleanup stats.

**Example**:
- **Input CSV**: Sample dataset with 10 FAQs, including 2 nulls, 1 duplicate, 1 short entry.
- **Output**:
  - Cleaned Dataset: 6 FAQs in `cleaned_call_center_faqs.csv`.
  - Cleanup Stats: “Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed”.
  - Plot: Processing Time (Ingestion: 50ms, Cleaning: 30ms, Output: 10ms), Cleanup Stats (Nulls: 2, Duplicates: 1, Short: 1, Malformed: 0).
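
The arithmetic behind the stats line can be checked with a short sketch. The function name `format_summary` is hypothetical; the counts are the ones from the example above (10 input rows, 4 removed, 6 kept).

```python
def format_summary(total_rows, stats):
    """Build the cleanup summary line from the input size and per-category counts."""
    removed = sum(stats.values())
    cleaned = total_rows - removed
    parts = ", ".join(f"{count} {name}" for name, count in stats.items())
    return f"Cleaned FAQs: {cleaned}; removed {removed} junk entries: {parts}"


stats = {"nulls": 2, "duplicates": 1, "short": 1, "malformed": 0}
summary = format_summary(10, stats)
print(summary)
# Cleaned FAQs: 6; removed 4 junk entries: 2 nulls, 1 duplicates, 1 short, 0 malformed
```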

## Technical Details

**Stack**:
- **Pandas**: Data wrangling and preprocessing for call center CSVs.
- **Gradio**: Interactive UI for real-time data preprocessing demos.
- **Matplotlib**: Performance visualization with bar charts.
- **FastAPI Compatibility**: Designed with API-driven preprocessing in mind, so the pipeline can be wrapped in FastAPI endpoints for scalable deployments.

**Free Tier Optimization**: Lightweight with CPU-only dependencies, no GPU required.

**Extensibility**: Ready for integration with RAG/CAG pipelines, and cloud deployments on AWS Lambda or Azure Functions.

## Purpose

This demo showcases expertise in data preprocessing for AI-driven CX automation, focusing on call center data quality. Built on over 5 years of experience in AI, data engineering, and enterprise-grade deployments, it demonstrates the power of Pandas-based data cleaning for RAG/CAG pipelines, making it ideal for advanced CX solutions in call center environments.

## Future Enhancements

- **Real-Time Streaming**: Add support for real-time data streaming from Kafka for live preprocessing.
- **FastAPI Deployment**: Expose preprocessing pipeline via FastAPI endpoints for production-grade use.
- **Advanced Validation**: Implement stricter data validation rules using machine learning-based outlier detection.
- **Cloud Integration**: Enhance compatibility with AWS Glue or Azure Data Factory for enterprise data pipelines.

**Website**: https://ghostainews.com/  
**Discord**: https://discord.gg/BfA23aYz