Real-time Object Detection with OpenCV and YOLO v7

Please note that some of the text and code for this blog post was generated by ChatGPT, under my guidance. While ChatGPT was instrumental in the process, I exercised direction and judgment, wrote and adapted code, and carefully considered how to simplify the content for the reader. The collaboration between ChatGPT and myself has hopefully produced a better outcome than either of us could have achieved in isolation.

Welcome to the first article in the Python Quick Wins (PQW) blog series! The goal of this series is to deliver straightforward and digestible tutorials on loading and utilizing a wide range of Python libraries, with a focus on machine learning and computer vision. We aim to bridge the gap for beginners or those unfamiliar with specific libraries by offering clear, concise introductions without overwhelming the reader with complex examples.

In this inaugural tutorial, we'll be focusing on creating a simple notebook that captures video from a webcam, identifies objects in each frame using YOLO v7, and adds labels to the live video. This tutorial serves as a gentle introduction to object detection with YOLO v7 and OpenCV, demonstrating how to integrate these powerful tools for real-time object detection in a live video stream.

To support this tutorial and the entire PQW series, we've created an accompanying GitHub repository containing working Jupyter Notebooks for each tutorial. For this YOLO v7 blog post, you can find the corresponding repository directory and the Jupyter Notebook containing all the code from this blog post.

The code in this blog post is derived from the YOLO v7 detect.py example, which is released under the GPL-3 license. As a result, all code snippets in this blog post and the accompanying notebook are also covered by the GPL-3 license. Now, let's dive in and explore the world of Python Quick Wins!

1. Setting Up the Environment

Before diving into the implementation, we need to set up a suitable environment for the project. To do so, follow these steps:

  1. Clone the YOLO v7 repository: Clone the repository from https://github.com/WongKinYiu/yolov7 to your local machine:

     git clone https://github.com/WongKinYiu/yolov7.git

  2. Download the yolov7.pt file: Download the pretrained weights file yolov7.pt from https://github.com/WongKinYiu/yolov7/releases and place it in the base directory of the cloned YOLO v7 repository.

  3. Create a new environment using the environment.yml file: Use the environment.yml file from https://github.com/edparcell/python-quick-wins/blob/main/yolov7/environment.yml to create a new conda environment. This is my personal environment, and it contains more dependencies than are needed for this project:

     conda env create -f environment.yml

  4. Activate the new environment: Activate the newly created environment with the following command:

     conda activate pqw-20230430

Testing Your Environment

To ensure your environment is set up correctly, perform the following tests:

  1. Run YOLO v7 on sample images: Run the detect.py script from the base directory of the YOLO v7 repository to run YOLO v7 on the bundled sample images:

     python detect.py

  2. Test YOLO v7 with your webcam: Run the detect.py script with the --source flag set to 0. This will display a video capture window with labeled regions from YOLO v7. If you have multiple webcams or virtual webcam devices, you may need to experiment with other numbers for the source; see the probe sketch below:

     python detect.py --source 0
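If you're unsure which index corresponds to your webcam, a small probe loop like the following sketch (using only standard OpenCV calls) can help identify the available devices:

import cv2

# Probe the first few device indices and report which ones open successfully.
for i in range(4):
    cap = cv2.VideoCapture(i)
    if cap.isOpened():
        print(f"Device {i} is available")
    cap.release()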

If the above tests work as expected, you're ready to proceed with the tutorial. In the next section, we'll show you how to create a simple OpenCV capture and display loop.

2. Capturing Video with OpenCV

OpenCV (Open Source Computer Vision Library) is an open-source computer vision and machine learning software library. It provides a wide range of tools for image and video processing, including capturing video from cameras, reading and writing video files, and displaying images.

In this section, we'll guide you through creating a simple OpenCV capture and display loop to obtain video from your webcam.

Creating a Simple OpenCV Capture and Display Loop

Below is sample code to create a basic OpenCV capture and display loop:

import cv2

# Open the default camera for capturing video.
cap = cv2.VideoCapture(0)

# Loop until the camera is closed.
try:
    while cap.isOpened():
        # Read a frame from the camera and ensure it was read successfully.
        ret, img = cap.read()
        assert ret, "Failed to read frame from camera"

        cv2.imshow("YOLO v7 Demo", img)

        # Exit the loop if the user presses the 'q' key.
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
finally:
    # Release the camera and close all windows.
    cap.release()
    cv2.destroyAllWindows()

In this example, we follow a common template for approaching computer vision tasks using OpenCV. We begin by opening a video capture device, then run a loop to capture a frame, process it, and display the result. During the loop, we also check if the user wants to quit the application. Finally, when the loop ends, we clean up resources by releasing the capture device and closing any open windows. This template can be easily adapted for various computer vision tasks by modifying the processing step.
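For instance, replacing the display step with a one-line conversion turns the loop into a live grayscale viewer (a toy adaptation using only standard OpenCV calls):

# Example processing step: convert each frame to grayscale before display.
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
cv2.imshow("Grayscale Demo", gray)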

Now that we can capture and display video using OpenCV, let's move on to integrating YOLO v7 for object detection.

3. Importing YOLO v7 and Loading the Model

YOLO v7 (You Only Look Once version 7) is a state-of-the-art object detection model that can quickly and accurately detect objects in images. In this section, we'll explain how to import YOLO v7 and load a pretrained model.

Adding the YOLO v7 Repo to the Python Path

First, we need to add the YOLO v7 repository to the Python path to make it accessible for import. Replace [Path to yolo v7 repo] with the actual path to the cloned YOLO v7 repository:

import pathlib
import sys

# Path to your local clone of the YOLO v7 repository.
pth_yolov7 = pathlib.Path(r'[Path to yolo v7 repo]')

# Create an __init__.py file in the repo root, if one doesn't already exist,
# so the repo's modules can be imported as a package.
init_file = pth_yolov7 / "__init__.py"
if not init_file.exists():
    init_file.touch()

# Add the repo to the Python path if it isn't there already.
if str(pth_yolov7) not in sys.path:
    sys.path.append(str(pth_yolov7))

Creating an __init__.py file is necessary so we can import modules from the YOLO v7 repo. It also ensures that when we load the model checkpoint, torch.load can unpickle objects whose classes are defined in the YOLO v7 repo.
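With the repo on the Python path, its modules import like any other package. For example, the NMS helper we'll use later in this tutorial comes straight from the repo's utils package:

from utils.general import non_max_suppression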

Loading the YOLO v7 Model

Next, we'll create a CUDA device object and load the YOLO v7 model onto it:

import cv2
import torch
import torch.nn as nn
import torch.backends.cudnn as cudnn

# Enable benchmark mode in cuDNN to speed up inference for fixed-size inputs.
cudnn.benchmark = True

# Use the first CUDA device.
device = torch.device('cuda:0')

# Load the checkpoint and extract the model, fused and in evaluation mode.
model_path = str(pth_yolov7 / 'yolov7.pt')
ckpt = torch.load(model_path, map_location=device)
model = ckpt['model'].float().fuse().eval()

# Patch activation and upsampling layers for compatibility with newer PyTorch versions.
for m in model.modules():
    if type(m) in [nn.Hardswish, nn.LeakyReLU, nn.ReLU, nn.ReLU6, nn.SiLU]:
        m.inplace = True
    elif type(m) is nn.Upsample:
        m.recompute_scale_factor = None

# Convert the model to half precision (FP16) for faster GPU inference.
model.half()

We're using a CUDA device for faster processing, but you can also run on the CPU, albeit with slower performance; see the sketch below. Because we're on a CUDA device, we convert the model to half precision (FP16) with model.half(), which speeds up inference. The loop over the modules, which patches the activation and upsampling layers, ensures the loaded objects remain valid in more recent versions of PyTorch. Setting cudnn.benchmark to True lets cuDNN select the fastest kernels for our fixed input size, enhancing performance.
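If you don't have a CUDA-capable GPU, a minimal CPU variant might look like the following sketch. Note that we skip model.half(), since half-precision inference is poorly supported on CPU; you would also drop the .half() call from the run_model function defined below:

# Sketch of a CPU fallback (assumes no CUDA GPU is available).
device = torch.device('cpu')
ckpt = torch.load(model_path, map_location=device)
model = ckpt['model'].float().fuse().eval()
# No model.half() here: keep the model in float32 on the CPU.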

We also extract the maximum stride of the model, which we'll use to ensure our input images have dimensions that are a multiple of it. For the standard yolov7.pt weights, this evaluates to 32:

stride = int(model.stride.max().item())

Now that we've created a CUDA device and loaded the YOLO v7 model into our project, we'll move on to processing the image data for the model and displaying the results.

4. Preparing the Image for Object Detection

Before feeding an image into the YOLO v7 model for object detection, it's crucial to resize and crop the image to ensure optimal performance and accurate results. Resizing the image to a smaller size reduces computational cost, while cropping ensures the height is a multiple of the model's stride (the resized width of 640 already is).

The following function, letterbox, resizes the input image while maintaining its aspect ratio, and then trims it to ensure its height is a multiple of the model's stride:

def letterbox(im, new_width, stride):
    """Resizes image to new width while maintaining aspect ratio, and trims to ensure height is a multiple of stride."""
    new_width = int(new_width)
    h, w = im.shape[:2]
    r = new_width / w
    scaled_height = int(r * h)
    im = cv2.resize(im, (new_width, scaled_height), interpolation=cv2.INTER_LINEAR)
    trim_rows = scaled_height % stride
    if trim_rows != 0:
        final_height = scaled_height - trim_rows
        offset = trim_rows // 2
        im = im[offset:(offset + final_height)]
    return im

This function first calculates the aspect ratio and resizes the image accordingly. Then, it trims the image to ensure its height is a multiple of the stride, by removing a balanced number of rows from the top and bottom of the image if necessary.
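As a concrete example, here is what letterbox does to a hypothetical 1280x720 frame with the usual stride of 32 (the all-black frame below is a stand-in, for illustration only):

import numpy as np

# A black 1280x720 stand-in for a webcam frame (illustration only).
frame = np.zeros((720, 1280, 3), dtype=np.uint8)

# r = 640 / 1280 = 0.5, so the frame is resized to 640x360.
# 360 % 32 = 8, so 4 rows are trimmed from the top and 4 from the bottom.
out = letterbox(frame, 640, 32)
print(out.shape)  # (352, 640, 3)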

With the letterbox function ready, we can now preprocess each frame before passing it to the YOLO v7 model for object detection. In the capture loop, read the raw frame as im0 (rather than img) and add the following line after it:

img = letterbox(im0, 640, stride)

5. Processing Image Data and Model Outputs

In this section, we will explain how to send the preprocessed image data to the YOLO v7 model for object detection and how to interpret and process the model outputs for display (e.g., object bounding boxes, labels, confidence scores).

Running the Model on Preprocessed Image Data

After preprocessing the image, we can use the following function run_model to run the YOLO v7 model on the input image tensor:

import numpy as np

def run_model(model, img, device):
    """Runs a PyTorch model on the input image tensor after preprocessing it."""
    img = np.expand_dims(img, 0)                    # Add a batch dimension: HWC -> NHWC.
    img = img[:, :, :, ::-1].transpose(0, 3, 1, 2)  # BGR -> RGB, NHWC -> NCHW.
    img = np.ascontiguousarray(img)                 # Make the reordered array contiguous in memory.
    img = torch.from_numpy(img).to(device).half()   # Move to the CUDA device as half precision.
    img /= 255.0                                    # Scale pixel values to [0, 1].

    with torch.no_grad():
        return model(img)[0]

This function adds a batch dimension, converts the image from OpenCV's BGR channel order to RGB, reorders the axes from NHWC to NCHW, moves the data to the CUDA device in half precision, and scales pixel values to the [0, 1] range. We use torch.no_grad() to avoid building the autograd graph during evaluation; gradients are only needed for training, and accumulating them here would steadily leak GPU memory.

Add the following line to the OpenCV capture loop to run the model on the image:

pred = run_model(model, img, device)

Interpreting and Processing Model Outputs

Now, we'll apply Non-Maximum Suppression (NMS) to the model output to remove overlapping boxes and filter out low-confidence detections:

from utils.general import non_max_suppression

pred = non_max_suppression(pred)

Add the non_max_suppression call to the OpenCV capture loop as well; the import belongs at the top of the notebook, outside the loop.

The non_max_suppression function from the YOLO v7 repo filters the predicted boxes based on their confidence scores and suppresses overlapping boxes that have a high Intersection over Union (IoU). The function returns a list of detections, with one (n, 6) tensor per image, where n is the number of remaining detections for that image, and the 6 columns represent the bounding box coordinates in (xmin, ymin, xmax, ymax) format, the detection confidence, and the predicted class index.
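As a sketch of how you might inspect this structure (assuming a single webcam frame, so pred has exactly one entry), with raw_pred standing in for the output of run_model:

# Inspect the detections for the first (and only) image in the batch.
det = pred[0]
for *xyxy, conf, cls in det:
    print(f"box={[int(v) for v in xyxy]}  conf={conf:.2f}  class={int(cls)}")

# To filter more aggressively, the repo's function also accepts thresholds:
#   pred = non_max_suppression(raw_pred, conf_thres=0.5, iou_thres=0.45)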

With the model outputs processed, we can now display the object detection results on the live video stream.

6. Adding Labels to the Frame Image

In this section, we'll show you how to add labels to the frame image for each detected object. To do this, we'll use two functions: plot_one_box and plot_boxes.

Drawing Bounding Boxes and Labels

plot_one_box is a function that takes the bounding box coordinates, the image, a label, and a color. It draws a rectangle on the image and writes the label above it in a filled box of the given color.

def plot_one_box(x, img, label, color):
    """Draws a rectangle on the input image and writes a label above it in a filled box of the given color."""
    c1, c2 = (int(x[0]), int(x[1])), (int(x[2]), int(x[3]))
    cv2.rectangle(img, c1, c2, color, thickness=1, lineType=cv2.LINE_AA)
    # Measure the label text and draw a filled background box sized to fit it.
    t_size = cv2.getTextSize(label, 0, fontScale=1/3, thickness=1)[0]
    c2 = c1[0] + t_size[0], c1[1] - t_size[1] - 3
    cv2.rectangle(img, c1, c2, color, -1, cv2.LINE_AA)
    cv2.putText(img, label, (c1[0], c1[1] - 2), 0, 1/3, [225, 255, 255], thickness=1, lineType=cv2.LINE_AA)

plot_boxes is a function that takes an image, a list of detection predictions, a list of class names, and a list of colors. It draws rectangles and writes labels on the input image for each detection prediction.

def plot_boxes(img, pred, names, colors):
    """Draws rectangles and writes labels on an input image for each detection prediction from a list."""
    for det in pred:  # One (n, 6) tensor of detections per image.
        for *xyxy, conf, cls in reversed(det):
            label = f'{names[int(cls)]} {conf:.2f}'
            plot_one_box(xyxy, img, label, colors[int(cls)])

Integrating the Functions into the OpenCV Loop

Add the following line to the OpenCV capture loop to draw the bounding boxes and labels on the image:

plot_boxes(img, pred, names, colors)

To get the class names from the model and generate random colors for the bounding boxes, use the following code at initialization, outside the OpenCV loop:

import random

names = model.names
colors = [[random.randint(0, 255) for _ in range(3)] for _ in names]
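One optional tweak, not part of the original code: seeding the random generator first makes the box colors stable from run to run:

random.seed(0)  # Optional: reproducible box colors across runs.
colors = [[random.randint(0, 255) for _ in range(3)] for _ in names]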

With these additions to the capture loop, the object detection results, including bounding boxes and labels, will be displayed on the live video stream.

In this tutorial, we have shown you how to integrate YOLO v7 with OpenCV for object detection in a live video stream. The final edited code for initialization and the OpenCV loop with object detection is provided below:

# Set the input image size and enable cuDNN benchmark mode to speed up inference.
imgsz = 640
cudnn.benchmark = True

# Get the class names for the model and generate random colors for drawing boxes on the image.
names = model.names
colors = [[random.randint(0, 255) for _ in range(3)] for _ in names]

# Open the default camera for capturing video.
cap = cv2.VideoCapture(0)

# Loop until the camera is closed.
try:
    while cap.isOpened():
        # Read a frame from the camera and ensure it was read successfully.
        ret, im0 = cap.read()
        assert ret, "Failed to read frame from camera"

        # Resize the frame to the target width while maintaining the aspect ratio,
        # and trim its height to a multiple of the model's stride.
        img = letterbox(im0, imgsz, stride)

        # Run the model on the preprocessed image.
        pred = run_model(model, img, device)

        # Perform non-maximum suppression to remove overlapping boxes.
        pred = non_max_suppression(pred)

        # Draw the boxes on the image and display it.
        plot_boxes(img, pred, names, colors)
        cv2.imshow("YOLO v7 Demo", img)

        # Exit the loop if the user presses the 'q' key.
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
finally:
    # Release the camera and close all windows.
    cap.release()
    cv2.destroyAllWindows()

Conclusion

In this tutorial, we've covered the following steps:

  1. Installing necessary dependencies
  2. Capturing video with OpenCV
  3. Integrating YOLO v7 for object detection
  4. Preparing the image for object detection
  5. Processing image data and model outputs
  6. Adding labels to the frame image

The full notebook with all the code can be found at this link. We encourage you to experiment with the notebook and explore further applications of OpenCV and YOLO v7 for object detection in various domains. By doing so, you'll gain a deeper understanding of how to adapt and expand these techniques to fit your specific use cases.
