Crowd monitoring query system

Overview

The Crowd monitoring query system is a FastAPI-based application that allows users to query room occupancy records based on specific criteria using an instruction-tuned language model. This system integrates a MongoDB database to store and retrieve occupancy data, while leveraging a pre-trained language model to generate natural language responses based on the queried data.

Key Features

  • Query Occupancy: Submit room occupancy queries using time and room_id as criteria, and receive a natural language response generated by the instruction-tuned language model.
  • Language Model: Uses a locally hosted language model (e.g., GPT-2 or others) to generate responses in a human-friendly way.
  • Database Integration: Connects to a MongoDB collection to fetch and filter occupancy data.
  • Health Check: API health check to ensure the service is up and running.
  • Chat Interface: A chat endpoint to test the language model with custom prompts.

File Structure

  • llms.py: Contains the logic for loading the language model, generating responses, and querying occupancy records.
  • main.py: Implements the FastAPI application, with endpoints for querying occupancy and generating responses from the language model.
  • mongo_connector.py: Provides a connection to MongoDB and functionality to query occupancy records based on provided criteria (see the sketch after this list).
  • .env: Holds environment variables such as MongoDB connection URI, database name, collection name, and the language model path.
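
As a rough sketch (not the actual source), mongo_connector.py might query the collection like this; the query_occupancy_records name and the time/room_id field names are assumptions based on the API described below:

    import os

    from dotenv import load_dotenv
    from pymongo import MongoClient

    load_dotenv()  # read MONGO_URI, DB, and COLLECTION from .env
    client = MongoClient(os.getenv("MONGO_URI"))
    collection = client[os.getenv("DB")][os.getenv("COLLECTION")]

    def query_occupancy_records(time: str, room_id: int | None = None) -> list[dict]:
        # Build the filter criteria; room_id is only applied when provided.
        criteria: dict = {"time": time}
        if room_id is not None:
            criteria["room_id"] = room_id
        return list(collection.find(criteria, {"_id": 0}))  # drop Mongo's ObjectId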

Dependencies

Ensure you have the following installed:

  • Python 3.12+
  • FastAPI for building the API
  • pydantic for data validation
  • torch for PyTorch
  • transformers for using the pre-trained language models
  • pymongo for MongoDB integration
  • python-dotenv for loading environment variables
  • unsloth for loading the models
  • langchain for interfacing with databases

Setup

  1. Create a Conda environment and activate it:

    conda env create -f environment.yml
    conda activate llms
  2. Configure environment variables:

    Create a .env file at the root of the project and add the following variables:

    MONGO_URI=<Your MongoDB URI>
    DB=<Your MongoDB Database Name>
    COLLECTION=<Your MongoDB Collection Name>
    MODEL=<Path or name of the pre-trained language model>
    HF_TOKEN=<Your Hugging Face token; required only for gated models such as unmodified Llama 3.1>
  3. Run the FastAPI server:

    fastapi dev main.py
  4. Important: Ensure you have CUDA installed if you wish to run the model on a GPU.

Minimum System Requirements

To run the Occupancy Query Service API with the language model, the following hardware specifications are recommended:

  • CPU: Intel Core i5 or equivalent
  • RAM: 16 GB
  • GPU: CUDA-compatible GPU (required for running larger models efficiently)
    • For 3.5B parameter models: Minimum 4 GB VRAM
    • For 8B parameter models: Minimum 6 GB VRAM

Note: If you don't have access to a compatible GPU, the API can still run on the CPU, though inference will be significantly slower for larger models. For offloading model layers between CPU and GPU, see https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
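
As a hedged illustration of that guide (not part of this codebase), layers that do not fit in VRAM can be offloaded to the CPU via a quantization config and device_map="auto"; the model name below is only an example:

    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        llm_int8_enable_fp32_cpu_offload=True,  # allow overflow layers on the CPU
    )

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-8B-Instruct",  # example; substitute your MODEL value
        quantization_config=quant_config,
        device_map="auto",  # let Accelerate place layers across GPU and CPU
        torch_dtype=torch.float16,
    )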


API Endpoints

1. /query_occupancy (POST)

Query the occupancy database and receive a natural language response.

  • Request Body:

    {
      "time": "2023-09-21T15:00:00",
      "room_id": 101
    }
  • Response:

    {
      "response": "The occupancy records are [details of the records]."
    }
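
For example, the endpoint can be called from Python with the requests library, assuming the server is running on the default fastapi dev address:

    import requests

    payload = {"time": "2023-09-21T15:00:00", "room_id": 101}
    resp = requests.post("http://127.0.0.1:8000/query_occupancy", json=payload)
    resp.raise_for_status()
    print(resp.json()["response"])  # natural language summary of the records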

2. / (GET)

Returns a welcome message.

  • Response:
    {
      "message": "Welcome to the occupancy query service!"
    }

3. /health (GET)

Returns the health status of the API.

  • Response:
    {
      "status": "healthy"
    }
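
Both informational endpoints are trivial; a minimal sketch of how they might be declared in main.py:

    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/")
    def root():
        return {"message": "Welcome to the occupancy query service!"}

    @app.get("/health")
    def health():
        return {"status": "healthy"}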

4. /chat (GET)

Chat with the language model by passing a custom prompt.

  • Query Parameter: prompt: String prompt to be processed by the model.

  • Response:

    {
      "message": "Response from the language model based on the prompt."
    }
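
An example call, with the prompt passed as a query parameter (URL assumes the default fastapi dev address):

    import requests

    resp = requests.get(
        "http://127.0.0.1:8000/chat",
        params={"prompt": "Summarize the occupancy of room 101."},
    )
    resp.raise_for_status()
    print(resp.json()["message"])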

How it Works

  1. Occupancy Querying: The service accepts a query consisting of a time and an optional room_id. It retrieves matching records from MongoDB and passes them as a prompt to the language model to generate a human-readable response (see the sketch after this list).

  2. Language Model: A pre-trained language model (loaded using FastLanguageModel) is used for inference. It generates a response based on the occupancy records and any additional prompts.

  3. Garbage Collection: To optimize GPU memory usage, garbage collection and GPU memory management are handled using PyTorch’s empty_cache and ipc_collect methods.
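
A condensed sketch of this record-to-response flow, including the memory cleanup from step 3; the function name is illustrative, and the real logic lives in llms.py:

    import gc

    import torch

    def answer_from_records(records: list[dict], model, tokenizer) -> str:
        # Turn the MongoDB records into a prompt and run the language model.
        prompt = f"Describe these room occupancy records:\n{records}"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        output = model.generate(**inputs, max_new_tokens=256)
        answer = tokenizer.decode(output[0], skip_special_tokens=True)

        # Release GPU memory between requests.
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.ipc_collect()
        return answer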

Model Configuration

  • Quantization: The model is loaded with 4-bit quantization (set by load_in_4bit=True) to reduce memory usage.
  • Max Sequence Length: The maximum sequence length for the model is set to 2048, and automatic scaling for RoPE (Rotary Positional Embedding) is enabled.
  • Model Customization: You can replace the current model (e.g., GPT-2) with any compatible transformer model by setting the MODEL environment variable to the desired model name or path; a loading sketch follows this list.
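
A sketch of loading the model with these settings via Unsloth's FastLanguageModel, assuming the environment variables from the .env file:

    import os

    from unsloth import FastLanguageModel

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=os.getenv("MODEL"),  # path or name from .env
        max_seq_length=2048,            # RoPE scaling adjusts automatically
        load_in_4bit=True,              # 4-bit quantization to reduce VRAM
        token=os.getenv("HF_TOKEN"),    # only needed for gated models
    )
    FastLanguageModel.for_inference(model)  # enable Unsloth's fast inference path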

To train your own models, use the notebook at train_llms/Train_llms.ipynb.