# Crowd Monitoring Query System

## Overview

The Crowd Monitoring Query System is a FastAPI-based application that allows users to query room occupancy records based on specific criteria using an instruction-tuned language model. The system integrates a MongoDB database to store and retrieve occupancy data, while leveraging a pre-trained language model to generate natural language responses based on the queried data.
## Key Features

- **Query Occupancy**: Post room occupancy queries using `time` and `room_id` as criteria, and retrieve a natural language response generated by the instruction-tuned language model.
- **Language Model**: Uses a locally hosted language model (e.g., GPT-2 or others) to generate responses in a human-friendly way.
- **Database Integration**: Connects to a MongoDB collection to fetch and filter occupancy data.
- **Health Check**: API health check to ensure the service is up and running.
- **Chat Interface**: A chat endpoint to test the language model with custom prompts.
## File Structure

- `llms.py`: Contains the logic for loading the language model, generating responses, and querying occupancy records.
- `main.py`: Implements the FastAPI application, with endpoints for querying occupancy and generating responses from the language model.
- `mongo_connector.py`: Provides a connection to MongoDB and functionality to query occupancy records based on provided criteria (see the sketch below).
- `.env`: Holds environment variables such as the MongoDB connection URI, database name, collection name, and the language model path.
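The queries described here boil down to a simple filtered `find`. A minimal sketch of what `mongo_connector.py` might look like; the helper name and schema fields are assumptions based on this README, not the repository's actual code:

```python
# mongo_connector.py — illustrative sketch, not the exact implementation
import os

from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()  # read MONGO_URI, DB, COLLECTION from .env

client = MongoClient(os.environ["MONGO_URI"])
collection = client[os.environ["DB"]][os.environ["COLLECTION"]]


def query_occupancy(time: str, room_id: int | None = None) -> list[dict]:
    """Fetch occupancy records matching the given criteria."""
    criteria: dict = {"time": time}
    if room_id is not None:  # room_id is optional
        criteria["room_id"] = room_id
    # Exclude MongoDB's internal _id so records serialize cleanly to JSON.
    return list(collection.find(criteria, {"_id": 0}))
```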
## Dependencies

Ensure you have the following installed:

- Python 3.12+
- `fastapi` for building the API
- `pydantic` for data validation
- `torch` for PyTorch
- `transformers` for using the pre-trained language models
- `pymongo` for MongoDB integration
- `python-dotenv` for loading environment variables
- `unsloth` for loading the models
- `langchain` for interfacing with databases
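The Conda step below expects an `environment.yml` at the project root. If you need to recreate one, here is a minimal sketch based on the dependency list above; the exact pins in the repository's file may differ:

```yaml
# environment.yml — illustrative sketch; the repository's actual file may differ
name: llms
channels:
  - conda-forge
dependencies:
  - python=3.12
  - pip
  - pip:
      - fastapi[standard]   # [standard] extra provides the `fastapi dev` CLI
      - pydantic
      - torch
      - transformers
      - pymongo
      - python-dotenv
      - unsloth
      - langchain
```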
## Setup

1. Create a Conda environment and activate it:

   ```bash
   conda env create -f environment.yml
   conda activate llms
   ```

2. Configure environment variables. Create a `.env` file at the root of the project and add the following variables:

   ```
   MONGO_URI=<Your MongoDB URI>
   DB=<Your MongoDB Database Name>
   COLLECTION=<Your MongoDB Collection Name>
   MODEL=<Path or name of the pre-trained language model>
   HF_TOKEN=<Your Hugging Face token; required only if you want to work with unmodified Llama 3.1 models>
   ```

3. Run the FastAPI server:

   ```bash
   fastapi dev main.py
   ```
**Important**: Ensure you have CUDA installed if you wish to run the model on a GPU.
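If you are unsure whether PyTorch can see your GPU, a quick check using only standard `torch` calls:

```python
import torch

# Reports whether a CUDA-capable GPU is visible to PyTorch.
if torch.cuda.is_available():
    print(f"CUDA available: {torch.cuda.get_device_name(0)}")
else:
    print("CUDA not available; the model will run on CPU.")
```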
## Minimum System Requirements

To run the Occupancy Query Service API with the language model, the following hardware specifications are recommended:

- **CPU**: Intel Core i5 or equivalent
- **RAM**: 16 GB
- **GPU**: CUDA-compatible GPU (required for running larger models efficiently)
  - For 3.5B-parameter models: minimum 4 GB VRAM
  - For 8B-parameter models: minimum 6 GB VRAM

Note: If you don't have access to a compatible GPU, the API can still run on CPU, but with significantly slower inference times for larger models. See the Hugging Face guide on [offloading between CPU and GPU](https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu).
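If VRAM is tight, one common approach (a sketch, not this project's exact loading code) is `device_map="auto"`, a standard `transformers`/`accelerate` feature that places as many layers as fit on the GPU and offloads the rest to CPU memory; the model name below is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder; use your MODEL env var

tokenizer = AutoTokenizer.from_pretrained(MODEL)
# device_map="auto" (requires the `accelerate` package) splits the model
# between GPU and CPU automatically when it doesn't fit entirely in VRAM.
model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    device_map="auto",
    torch_dtype=torch.float16,
)
```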
## API Endpoints

### 1. `/query_occupancy` (POST)

Query the occupancy database and receive a natural language response.
- Request Body:

  ```json
  {
    "time": "2023-09-21T15:00:00",
    "room_id": 101
  }
  ```

- Response:

  ```json
  {
    "response": "The occupancy records are [details of the records]."
  }
  ```
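For example, assuming the server is running locally on FastAPI's default port, you can exercise this endpoint with the `requests` library (not part of the dependency list above):

```python
import requests

# POST an occupancy query and print the model-generated answer.
resp = requests.post(
    "http://127.0.0.1:8000/query_occupancy",
    json={"time": "2023-09-21T15:00:00", "room_id": 101},
)
print(resp.json()["response"])
```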
### 2. `/` (GET)

Returns a welcome message.

- Response:

  ```json
  {
    "message": "Welcome to the occupancy query service!"
  }
  ```

### 3. `/health` (GET)

Returns the health status of the API.

- Response:

  ```json
  {
    "status": "healthy"
  }
  ```
### 4. `/chat` (GET)

Chat with the language model by passing a custom prompt.

- Query Parameter: `prompt`: string prompt to be processed by the model.
- Response:

  ```json
  {
    "message": "Response from the language model based on the prompt."
  }
  ```
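Again assuming a local server and the `requests` library, a quick way to try the chat endpoint:

```python
import requests

# The prompt is passed as a query-string parameter; the example prompt is arbitrary.
resp = requests.get(
    "http://127.0.0.1:8000/chat",
    params={"prompt": "How busy was room 101 this afternoon?"},
)
print(resp.json()["message"])
```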
## How It Works

- **Occupancy Querying**: The service accepts a query in the form of `time` and `room_id` (optional). It retrieves matching records from MongoDB and passes the records as a prompt to the language model to generate a human-readable response.
- **Language Model**: A pre-trained language model (loaded using `FastLanguageModel`) is used for inference. It generates a response based on the occupancy records and any additional prompts.
- **Garbage Collection**: To optimize GPU memory usage, garbage collection and GPU memory management are handled using PyTorch's `empty_cache` and `ipc_collect` methods (see the sketch after this list).
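A minimal sketch of that cleanup step; the helper name is illustrative, but both CUDA calls are standard `torch.cuda` functions:

```python
import gc

import torch


def free_gpu_memory() -> None:
    """Release cached GPU memory after a generation step (illustrative helper)."""
    gc.collect()              # drop unreachable Python objects holding tensors
    torch.cuda.empty_cache()  # return cached blocks to the GPU allocator
    torch.cuda.ipc_collect()  # reclaim memory from dead CUDA IPC handles
```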
## Model Configuration

- **Quantization**: The model is loaded with 4-bit quantization (set by `load_in_4bit=True`) to reduce memory usage.
- **Max Sequence Length**: The maximum sequence length for the model is set to `2048`, and automatic scaling for RoPE (Rotary Positional Embedding) is enabled.
- **Model Customization**: You can replace the current model (e.g., GPT-2) with any compatible transformer model by setting the `MODEL` environment variable to the desired model name or path. A loading sketch follows this list.
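A hedged sketch of how these settings come together when loading the model with Unsloth's `FastLanguageModel`; the exact call in `llms.py` may differ:

```python
import os

from dotenv import load_dotenv
from unsloth import FastLanguageModel

load_dotenv()  # read MODEL (and optionally HF_TOKEN) from .env

# from_pretrained returns both the model and its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=os.environ["MODEL"],
    max_seq_length=2048,   # maximum context length for the model
    load_in_4bit=True,     # 4-bit quantization to reduce VRAM usage
)
FastLanguageModel.for_inference(model)  # switch the model to inference mode
```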