Data Processing

Displaying bounding boxes with YOLO is very easy. However, to actually do something with the data, you have to process it further.

My Use Case

My use case is as follows: a first-aid drone equipped with a camera should recognize people and fly directly towards them. The drone should then bring relief supplies as close as possible to the person and set them down with a winch. Of course, it is also important to pay attention to other objects, as they can be obstacles.

YOLO Box Coordinates

The bounding boxes are included in the result tensor. To make them easier to process, I convert them into Python lists:

for box in results[0].boxes.data.tolist():
    print(box)    

The results could look similar to this example:

0: 640x640 1 person, 2 cars, 112.0ms
Speed: 5.0ms preprocess, 112.0ms inference, 2.5ms postprocess per image at shape (1, 3, 640, 640)
[0.5114593505859375, 236.54718017578125, 221.8626251220703, 476.11029052734375, 0.7665080428123474, 2.0]
[538.58203125, 243.6549072265625, 639.53662109375, 474.279541015625, 0.2758972942829132, 2.0]
[345.9886474609375, 202.54571533203125, 392.410400390625, 288.2127685546875, 0.2729227542877197, 0.0]

Each box contains six values: the first four are the corner coordinates of the bounding box (xmin, ymin, xmax, ymax), followed by the confidence of the detection, and the last value is the recognized class, where 2 stands for “car” and 0 for “person”.
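
For further processing, the six values can be unpacked into named variables, and the numeric class id can be translated into a readable label via the model's names dictionary. A minimal sketch, assuming results comes from a model.predict() call as above:

for box in results[0].boxes.data.tolist():
    xmin, ymin, xmax, ymax, conf, cls_id = box
    label = model.names[int(cls_id)]  # e.g. 0 -> "person", 2 -> "car"
    print(f"{label}: {conf:.2f} at ({xmin:.0f}, {ymin:.0f}) - ({xmax:.0f}, {ymax:.0f})")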

Test

The following test is a simulation of the use case described above: a person (among other objects) is to be detected. The relative position of the person's bounding box in the image is then used to determine the direction in which the drone should fly: if the person is far to the left in the image, the drone must fly to the left; if the person is far to the right, the drone must fly to the right.

Of course, the whole thing is just a simulation: instead of actually controlling a drone, the directional information is simply displayed in the image.
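
The directional decision boils down to comparing the position of the bounding box with the horizontal center of the frame. The test program below uses fixed pixel thresholds for this; as a center-based alternative, a small helper could look like the following sketch (the frame width and tolerance values are assumptions, not taken from the program):

def flight_direction(xmin, xmax, frame_width=640, tolerance=80):
    # Compare the horizontal center of the box with the center of the frame
    box_center = (xmin + xmax) / 2
    offset = box_center - frame_width / 2
    if offset < -tolerance:
        return "left"      # person is clearly left of center
    if offset > tolerance:
        return "right"     # person is clearly right of center
    return "centered"      # close enough: keep the current heading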

Test Program

import cv2
from ultralytics import YOLO

# Load the YOLO model
model = YOLO("yolo11n_ncnn_model")

# Open the video stream
cam = cv2.VideoCapture(0)

# Loop through the video frames
while cam.isOpened():
    # Read a frame from the video
    success, frame = cam.read()

    if success:
        # Run YOLO inference on the frame
        results = model.predict(frame)

        # Print the raw box data for debugging
        for box in results[0].boxes.data.tolist():
            print(box)

        # Visualize the results on the frame
        annotated_frame = results[0].plot()

        # Check all boxes
        for box in results[0].boxes.data.tolist():
            xmin, ymin, xmax, ymax, conf, cls_id = box
            # If a person was detected (class 0): display directional information
            if cls_id == 0:
                if xmax < 250:
                    cv2.putText(annotated_frame, "<----left", (400, 50),
                                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
                if xmin > 400:
                    cv2.putText(annotated_frame, "right--->", (40, 50),
                                cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 0, 255), 2)
        # Display the annotated frame
        cv2.imshow("YOLO", annotated_frame)


        # Break the loop if 'q' is pressed
        if cv2.waitKey(1) == ord("q"):
            break
    else:
        # Break the loop if no frame could be read (e.g. the camera was disconnected)
        break

# Release the video capture object and close the display window
cam.release()
cv2.destroyAllWindows()
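
Since only persons matter for the directional logic, the detection could also be restricted to class 0 directly in the predict call. This is a variation of the program above, not what it currently does; the classes and conf parameters of the Ultralytics predict() call filter by class id and minimum confidence:

# Only detect persons (class 0) with at least 50% confidence
results = model.predict(frame, classes=[0], conf=0.5)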

Test Video