Stitching non max suppression (NMS) to YOLOv8n on exported ONNX model

Stephen Cow Chau
7 min read · Mar 30, 2023


Background

Following my previous post on exploring YOLOv8, I have been stuck using the YOLOv8 model outside PyTorch, because the directly exported model gives a result with a dimension like [batch size, 5, 8400], which still encapsulates overlapping bounding boxes together with their confidence scores. Using this result in another format (e.g. TF Lite with the object detection API) would require post-processing it into non-overlapping bounding boxes and the corresponding confidence scores.
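
To see this quickly, here is a minimal sketch that just inspects the exported graph's output shape; "yolov8n.onnx" is an assumed file name, so adjust it to your own export.

import onnx

m = onnx.load("yolov8n.onnx")  # assumed path of the exported model
for out in m.graph.output:
    dims = [d.dim_value or d.dim_param for d in out.type.tensor_type.shape.dim]
    print(out.name, dims)  # e.g. output0 [1, 5, 8400] for a single-class model (4 box coords + 1 score)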

A bit more in YOLO class forward function in PyTorch

As I observed, the YOLO class is initialized with a member "model", which is the core model that produces that [batch size, 5, 8400]-like result.

https://github.com/ultralytics/ultralytics/blob/main/ultralytics/yolo/engine/model.py

And the forward function calls a predict function (which depends on the task, be it object detection, classification or segmentation).

Drilling down into the object detection core predict call, we arrive at this postprocess function, which adds the non max suppression:

https://github.com/ultralytics/ultralytics/blob/main/ultralytics/yolo/v8/detect/predict.py
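
To build intuition about what that postprocess does, here is a rough sketch for a single image. This is not the actual ultralytics code: it assumes a single-class [1, 5, 8400] output and uses torchvision's NMS instead.

import torch
import torchvision

def simple_postprocess(pred, conf_thres=0.25, iou_thres=0.45):
    # pred: [1, 5, 8400] -> [8400, 5], rows of [x_center, y_center, w, h, score]
    pred = pred[0].T
    pred = pred[pred[:, 4] > conf_thres]  # drop low-confidence anchors
    boxes_xywh, scores = pred[:, :4], pred[:, 4]
    # convert center format to corner format, since torchvision.ops.nms expects [x1, y1, x2, y2]
    boxes = torch.cat([boxes_xywh[:, :2] - boxes_xywh[:, 2:] / 2,
                       boxes_xywh[:, :2] + boxes_xywh[:, 2:] / 2], dim=1)
    keep = torchvision.ops.nms(boxes, scores, iou_thres)  # indices of surviving boxes
    return boxes[keep], scores[keep]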

The stitching of Non Max Suppression (NMS) to the exported ONNX model

The code's export paths towards Tensorflow related formats (TF Lite, Tensorflow.js…) go through ONNX -> Tensorflow saved model (via the onnx2tf package) and then to the target format.

So ONNX seems to be a good place to add the necessary NMS operation; a second reason is that the ONNX NMS operation could be better optimized (compared to the operation that torchvision's NMS would be converted into).

Here is the ONNX model output by the export code from ultralytics.

Visualize using netron.app

Let's also look at YOLOv7's output: we can see it comes from outputs of shape 1x25200x4 and 1x1x25200, which look like bounding boxes and confidence scores; the exported graph then adds the non max suppression and breaks the result down into the 4 different outputs expected by the TF Lite object detection API.

So that's what I am trying to do (similarly) on the YOLOv8n model. The very first action is to remove the concat step at the end and stitch a NonMaxSuppression operation in its place. The intermediate target is as follows:

The code is as follows (credit to onnx issue 2216):

import onnx
import onnx.helper
from onnx import TensorProto

# * model is the trained YOLOv8n model (YOLO class)
# * batch = 1 is important here: even though export supports dynamic axes
#   via dynamic=True, sometimes it just fails to export
model.export(format='onnx', simplify=True, imgsz=[640,640], batch=1)

# load the exported model (onnx_model_path points to the file exported above) and manipulate it
onnx_model = onnx.load_model(onnx_model_path)
onnx_fpath = f"{weight_folder}/best_nms.onnx"

graph = onnx_model.graph

# transpose the bboxes before passing them to the NMS node
# (ONNX NonMaxSuppression expects boxes of shape [batch, num_boxes, 4])
transpose_bboxes_node = onnx.helper.make_node(
    "Transpose",
    inputs=["onnx::Concat_458"],
    outputs=["bboxes"],
    perm=(0, 2, 1))
graph.node.append(transpose_bboxes_node)

# make constant tensors for nms
score_threshold = onnx.helper.make_tensor("score_threshold", TensorProto.FLOAT, [1], [0.25])
iou_threshold = onnx.helper.make_tensor("iou_threshold", TensorProto.FLOAT, [1], [0.45])
max_output_boxes_per_class = onnx.helper.make_tensor("max_output_boxes_per_class", TensorProto.INT64, [1], [200])

# create the NMS node
inputs = ['bboxes', 'onnx::Concat_459', 'max_output_boxes_per_class', 'iou_threshold', 'score_threshold']
# inputs = ['onnx::Concat_458', 'onnx::Concat_459', 'max_output_boxes_per_class', 'iou_threshold', 'score_threshold']
outputs = ["selected_indices"]
nms_node = onnx.helper.make_node(
    'NonMaxSuppression',
    inputs,
    outputs,
    # center_point_box=1 is very important: the PyTorch model outputs boxes as
    # [x_center, y_center, width, height], but by default NMS expects
    # [x_min, y_min, x_max, y_max]
    center_point_box=1,
)

# add NMS node to the list of graph nodes
graph.node.append(nms_node)

# register "selected_indices" as a graph output (the bboxes and scores outputs are appended further below)
output_value_info = onnx.helper.make_tensor_value_info("selected_indices", TensorProto.INT64, shape=["num_results", 3])
graph.output.append(output_value_info)

# add to initializers - without this, onnx will not know where these came from, and complain that
# they're neither outputs of other nodes, nor inputs. As initializers, however, they are treated
# as constants needed for the NMS op
graph.initializer.append(score_threshold)
graph.initializer.append(iou_threshold)
graph.initializer.append(max_output_boxes_per_class)

# remove the now-unused concat node (the node name differs between ultralytics versions; see the 2023-05-11 edit below)
last_concat_node = [node for node in onnx_model.graph.node if node.name == "Concat_291"][0]
onnx_model.graph.node.remove(last_concat_node)

# remove the original output0
output0 = [o for o in onnx_model.graph.output if o.name == "output0"][0]
onnx_model.graph.output.remove(output0)

# keep these intermediate tensors as graph outputs for the downstream postprocess
graph.output.append([v for v in onnx_model.graph.value_info if v.name=="onnx::Concat_458"][0])
graph.output.append([v for v in onnx_model.graph.value_info if v.name=="onnx::Concat_459"][0])

# check that it works and re-save
onnx.checker.check_model(onnx_model)
onnx.save(onnx_model, onnx_fpath)
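
As a quick sanity check that the modified graph still loads and runs, one can run it through onnxruntime (a sketch; with a random input most anchors fall below the 0.25 score threshold, so selected_indices may well come back empty):

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(onnx_fpath)
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
results = sess.run(None, {"images": dummy})
for meta, arr in zip(sess.get_outputs(), results):
    print(meta.name, arr.shape)  # selected_indices plus the two kept intermediate outputs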

( [Start] 2023–05–11 EDIT )

Note that after receiving comments about being unable to locate onnx::Concat_458 and onnx::Concat_459, I checked the latest yolov8n model and the exported graph is indeed different.

In the original ONNX graph, onnx::Concat_458 is the output of the Mul node just before the last Concat, while onnx::Concat_459 is the output of the Sigmoid node. Looking at this on the netron.app site:

Testing on Colab using a fresh download of the ultralytics pip package and the yolov8n model, with the updated node name (to test the line referring to the node Concat_291):

instead of using "Concat_291", use the updated node name "/model.22/Concat_5"

I hope this makes it clear.

To make it more flexible (as long as the graph keeps the structure (Mul, Sigmoid) => Concat => "output0"), one can walk back from the graph output "output0" (assuming the output name does not change):

Here you can further use op_type to distinguish which input is the "458" (Mul) and which is the "459" (Sigmoid).
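
A minimal sketch of that idea (it assumes the freshly exported graph, before any surgery, still ends with (Mul, Sigmoid) => Concat => "output0"; graph is onnx_model.graph as before):

# map each tensor name to the node that produces it
producers = {out: node for node in graph.node for out in node.output}

# the last Concat is the node producing the graph output "output0"
last_concat_node = producers["output0"]

# its inputs come from a Mul node (bboxes) and a Sigmoid node (scores)
bboxes_name = scores_name = None
for inp in last_concat_node.input:
    if producers[inp].op_type == "Mul":
        bboxes_name = inp   # the tensor called "onnx::Concat_458" above
    elif producers[inp].op_type == "Sigmoid":
        scores_name = inp   # the tensor called "onnx::Concat_459" above
print(last_concat_node.name, bboxes_name, scores_name)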

( [End] 2023–05–11 EDIT )

The output "selected_indices" has the format [num_selected_indices, 3], where each selected index has the format [batch_index, class_index, box_index] (according to the ONNX documentation).
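
For example (illustrative values only), a few rows of selected_indices for a single-image batch might look like:

import numpy as np

selected_indices = np.array([
    [0, 0,  17],   # batch 0, class 0, box/anchor 17
    [0, 0, 942],   # batch 0, class 0, box/anchor 942
    [0, 1, 105],   # batch 0, class 1, box/anchor 105
])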

This result is still not equal to the results expected by the TF Lite object detection API (four outputs: locations, classes, scores and number of detections):

https://www.tensorflow.org/lite/inference_with_metadata/task_library/object_detector#model_compatibility_requirements

So additional operations are needed to process the "selected_indices" output. Here I take another path: instead of directly manipulating the ONNX model, I use PyTorch to generate the post-processing graph and merge it into the previous ONNX model.

import torch
import torch.nn as nn

class Transform(nn.Module):
    def forward(self, idxTensor, boxes, scores):
        bbox_result = self.gather(boxes, idxTensor)
        score_intermediate_result = self.gather(scores, idxTensor).max(axis=-1)
        score_result = score_intermediate_result.values
        classes_result = score_intermediate_result.indices
        num_dets = torch.tensor(score_result.shape[-1])
        return (bbox_result, score_result, classes_result, num_dets)

    '''
    Input:
        boxes: [bs=1, 4, 8400]
        indices: [N, 3]

    expect output: gathered boxes of shape [1, N, 4]
    '''
    def gather(self, target, idxTensor):
        # pick the box/anchor index (last column of each selected index row)
        pick_indices = idxTensor[:, -1:].repeat(1, target.shape[1]).unsqueeze(0)
        return torch.gather(target.permute(0, 2, 1), 1, pick_indices)
'''
Export the model
'''
torch_boxes = torch.tensor([
    [91.0, 2, 3, 4, 5, 6],
    [11, 12, 13, 14, 15, 16],
    [21, 22, 23, 24, 25, 26],
    [31, 32, 33, 34, 35, 36],
]).unsqueeze(0)

torch_scores = torch.tensor([
    [0.1, 0.82, 0.3, 0.6, 0.55, 0.6],
    [0.9, 0.18, 0.7, 0.4, 0.45, 0.4],
]).unsqueeze(0)

torch_indices = torch.tensor([[0, 0, 0], [0, 0, 2], [0, 0, 1]])

t_model = Transform()

torch.onnx.export(t_model, (torch_indices, torch_boxes, torch_scores), "NMS_after.onnx",
                  input_names=["selected_indices", "boxes", "scores"],
                  output_names=["det_bboxes", "det_scores", "det_classes", "num_dets"],
                  dynamic_axes={
                      "boxes": {0: "batch", 1: "boxes", 2: "num_anchors"},
                      "scores": {0: "batch", 1: "classes", 2: "num_anchors"},
                      "selected_indices": {0: "num_results"},
                      "det_bboxes": {1: "num_results"},
                      "det_scores": {1: "num_results"},
                      "det_classes": {1: "num_results"},
                  })


import onnxsim

nms_postprocess_onnx_model = onnx.load_model("NMS_after.onnx")
nms_postprocess_onnx_model_sim, check = onnxsim.simplify(nms_postprocess_onnx_model)
# save the simplified postprocess model
onnx.save(nms_postprocess_onnx_model_sim, "NMS_after_sim.onnx")
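
Before moving on, the Transform module can be sanity-checked directly on the dummy tensors above; with N = 3 selected indices the shapes come out like this:

bboxes, det_scores, det_classes, num_dets = t_model(torch_indices, torch_boxes, torch_scores)
print(bboxes.shape)       # torch.Size([1, 3, 4])
print(det_scores.shape)   # torch.Size([1, 3])
print(det_classes.shape)  # torch.Size([1, 3])
print(num_dets)           # tensor(3)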

This results in the following model (to be a subgraph in the final model):

Finally, we merge the 2 ONNX models, but before doing so, there might be an error saying the inferred shape cannot be determined around the NMS operation.

So to play safe, we can add code to update the first model's outputs. Note that I tried to use the same dimension names as in the torch.onnx.export(…) dynamic axes [trying to match output "onnx::Concat_458" to the other graph's input "boxes", and similarly for the other corresponding outputs and inputs]:

from onnx.tools import update_model_dims

input_dims = {
    "images": ["batch", 3, 640, 640],
}

output_dims = {
    "selected_indices": ["num_results", 3],
    "onnx::Concat_458": ["batch", "boxes", "num_anchors"],
    "onnx::Concat_459": ["batch", "classes", "num_anchors"],
}

updated_onnx_model = update_model_dims.update_inputs_outputs_dims(onnx_model, input_dims, output_dims)

Finally, the merge operation. There are quite a few hacky things in the following code, because the 2 subgraph models were generated with different IR versions, and the merge operation requires the 2 models being merged to have the same IR version (note that ONNX operations are introduced at particular opset versions).

from onnx.compose import merge_models
from onnx.version_converter import convert_version
combined_onnx_path = f"{weight_folder}/best_nms_extended.onnx"

# note: convert_version targets the opset version (18 here), not the IR version
target_ir_version = 18
core_model = convert_version(updated_onnx_model, target_ir_version)
# this looks weird: it still says IR version 8 even after the convert,
# because convert_version changes the opset version while ir_version stays the same
print(f"core_model version : {core_model.ir_version}")
onnx.checker.check_model(core_model)
# force the IR version so that merge_models' version check passes
core_model.ir_version = 8

# core_model = updated_onnx_model
post_process_model = convert_version(nms_postprocess_onnx_model_sim, target_ir_version)
# same here: it still says IR version 7 because only the opset was converted
print(f"post_process_model version : {post_process_model.ir_version}")
onnx.checker.check_model(post_process_model)
# force the IR version to match core_model so the merge can proceed
post_process_model.ir_version = 8

combined_onnx_model = merge_models(core_model, post_process_model, io_map=[
    ('onnx::Concat_458', 'boxes'),
    ('onnx::Concat_459', 'scores'),
    ('selected_indices', 'selected_indices')
])

onnx.save(combined_onnx_model, combined_onnx_path)
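
With the combined model saved, a quick run through onnxruntime (again a sketch with a random input) should expose the four postprocessed outputs named in the torch.onnx.export(…) call above:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession(combined_onnx_path)
dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)
det_bboxes, det_scores, det_classes, num_dets = sess.run(
    ["det_bboxes", "det_scores", "det_classes", "num_dets"], {"images": dummy})
print(det_bboxes.shape, det_scores.shape, det_classes.shape, num_dets)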

And finally we have:

Conclusion

It was nice working through this exercise of stitching the NMS and postprocess operations at the ONNX level, even though it was a tough one.

WARNING: This approach works with ONNX Runtime, but when tested with TF Lite and Tensorflow.js, it seems the conversion misunderstands the class output (instead of dimension [1, num_result], it is misinterpreted as [1, 1]).

The above is fixed in the following follow-up article:

Also, if you want to use TensorRT, there is another package that already performs the NMS and postprocess: https://github.com/triple-Mu/YOLOv8-TensorRT; I discovered this from ultralytics issue 643.
