Train your own object detector with Faster-RCNN & PyTorch

A guide to object detection with Faster-RCNN and PyTorch

Creating a human head detector

Getting images

Training, validation & test data. Image by author

Annotating

annotator.add_class(label='head', color='red')
annotator.add_class(label='eye', color='blue')
annotator.export(pathlib.Path('.../some_directory'))
annotator = Annotator(image_ids=image_files,
                      annotation_ids=annotation_files)
annotator.export_all(pathlib.Path('.../Heads/target'))
keys:
dict_keys(['labels', 'boxes'])
labels:
array(['head', 'head', 'head', 'head', 'head', 'head'], dtype='<U4')
boxes:
[array([ 14.32894795, 217.18092301, 277.02631195, 531.98354928]),
array([199.95394483, 81.49013583, 396.43420467, 287.74013235]),
array([386.66446799, 2.24671611, 588.57235932, 247.57565934]),
array([306.33552198, 251.91776453, 510.41446591, 521.12828631]),
array([525.61183407, 266.0296064 , 741.63156727, 554.77960153]),
array([723.17762021, 116.22697735, 925.08551155, 432.11512991])]
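The output above can be reproduced by loading one of the exported target files. A quick sketch, assuming the pickled-dict format described in the next section (the file name is a placeholder):

import pathlib
import pickle

target_file = pathlib.Path('.../Heads/target/some_image.pkl')  # placeholder name
with open(target_file, 'rb') as f:
    target = pickle.load(f)

print(target.keys())     # dict_keys(['labels', 'boxes'])
print(target['labels'])  # one class name per box
print(target['boxes'])   # boxes, presumably in [x_min, y_min, x_max, y_max] format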

Dataset building

Builds a dataset with images and their respective targets. A target is expected to be a pickled file of a dict that contains at least a ‘boxes’ and a ‘labels’ key. inputs and targets are expected to be lists of pathlib.Path objects. If your labels are strings, you can pass a mapping (a dict) to int-encode them. Each sample is returned as a dict with the keys ‘x’, ‘x_name’, ‘y’ and ‘y_name’.
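A minimal sketch of such a dataset class could look like the following. The class name ObjectDetectionDataSet, the use of skimage for reading images and the mapping {'head': 1} are illustrative assumptions, not necessarily what the accompanying repository does:

import pathlib
import pickle
from typing import Dict, List

import numpy as np
import torch
from skimage.io import imread
from torch.utils.data import Dataset


class ObjectDetectionDataSet(Dataset):
    # Returns one dict per sample with the keys 'x', 'x_name', 'y', 'y_name'.

    def __init__(self,
                 inputs: List[pathlib.Path],
                 targets: List[pathlib.Path],
                 mapping: Dict[str, int] = None):
        self.inputs = inputs
        self.targets = targets
        self.mapping = mapping  # e.g. {'head': 1} to int-encode string labels

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, index: int):
        input_path, target_path = self.inputs[index], self.targets[index]

        # Read the image and the pickled target dict
        image = imread(input_path)
        with open(target_path, 'rb') as f:
            target = pickle.load(f)

        # Int-encode string labels if a mapping is given
        labels = target['labels']
        if self.mapping is not None:
            labels = [self.mapping[label] for label in labels]

        # Channel-first float tensor in [0, 1] (assuming 8-bit RGB images)
        x = torch.from_numpy(image).permute(2, 0, 1).float() / 255.0
        y = {'boxes': torch.from_numpy(np.stack(target['boxes'])).float(),
             'labels': torch.tensor(labels, dtype=torch.int64)}

        return {'x': x, 'x_name': input_path.name, 'y': y, 'y_name': target_path.name}


dataset = ObjectDetectionDataSet(inputs=image_files,
                                 targets=annotation_files,
                                 mapping={'head': 1})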
sample = dataset[1]
sample['x'].shape
-> torch.Size([3, 710, 1024])
datasetviewer.gui_text_properties(datasetviewer.shape_layer)

Faster R-CNN in PyTorch

The input to the model is expected to be a list of tensors, each of shape [C, H, W], one for each image, and should be in the 0-1 range. Different images can have different sizes.
batch = next(iter(dataloader))
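A batch like this requires a custom collate function, because images of different sizes cannot be stacked into a single tensor; the DataLoader simply keeps the samples in lists. A minimal sketch (the name collate_double and the returned tuple layout are choices made here for illustration):

from torch.utils.data import DataLoader


def collate_double(batch):
    # Keep images and targets in lists instead of stacking them,
    # since the images can have different shapes.
    x = [sample['x'] for sample in batch]
    y = [sample['y'] for sample in batch]
    x_name = [sample['x_name'] for sample in batch]
    y_name = [sample['y_name'] for sample in batch]
    return x, y, x_name, y_name


dataloader = DataLoader(dataset=dataset,
                        batch_size=2,
                        shuffle=True,
                        collate_fn=collate_double)

With this collate function, next(iter(dataloader)) yields a tuple (x, y, x_name, y_name) whose first two entries can be fed directly to the model.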
min_size (int): minimum size of the image to be rescaled before feeding it to the backbone
max_size (int): maximum size of the image to be rescaled before feeding it to the backbone
image_mean (Tuple[float, float, float]): mean values used for input normalization. They are generally the mean values of the dataset on which the backbone has been trained.
image_std (Tuple[float, float, float]): std values used for input normalization. They are generally the std values of the dataset on which the backbone has been trained.
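These parameters belong to the GeneralizedRCNNTransform that torchvision's detectors apply internally, and they can be set when the model is built. A rough sketch with the pre-trained ResNet-50 FPN detector; the concrete values and the choice of backbone are examples, not necessarily the ones used in this project:

from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# min_size/max_size control the internal rescaling, image_mean/image_std the
# normalization (the defaults are the ImageNet statistics of the backbone).
model = fasterrcnn_resnet50_fpn(pretrained=True,
                                min_size=512,
                                max_size=1024,
                                image_mean=[0.485, 0.456, 0.406],
                                image_std=[0.229, 0.224, 0.225])

# Replace the box predictor head: 2 classes = background + 'head'
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes=2)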
dict_keys(['image_height', 'image_width', 'image_mean', 'image_std', 'boxes_height', 'boxes_width', 'boxes_num', 'boxes_area'])
stats['image_height'].max()           -> tensor(1200.)
stats_transform['image_height'].max() -> tensor(1024.)
stats['image_height'].min()           -> tensor(333.)
stats_transform['image_height'].min() -> tensor(576.)
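Statistics like these can be gathered by iterating over the dataset; a rough sketch for the image height (before the transform):

import torch

# Raw image heights over the whole dataset
heights = torch.tensor([dataset[i]['x'].shape[-2] for i in range(len(dataset))],
                       dtype=torch.float)
heights.max(), heights.min()   # e.g. tensor(1200.), tensor(333.)

Sending the images through model.transform first and collecting the resulting shapes gives the values after rescaling.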

Anchor boxes

Anchor positions and anchor boxes with aspect ratio 1.0. Input size: (3, 1024, 1024). Feature map size: (512, 32, 32). Image by author

About the anchor_size and aspect_ratios parameters
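In torchvision these parameters end up in the AnchorGenerator of the region proposal network, which expects one tuple of sizes and one tuple of aspect ratios per feature map level. A brief sketch with example values:

from torchvision.models.detection.rpn import AnchorGenerator

# One inner tuple per feature map level; with a single feature map,
# one tuple of sizes and one of aspect ratios is enough.
anchor_generator = AnchorGenerator(sizes=((128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))

The generator is passed to the detector via the rpn_anchor_generator argument of torchvision's FasterRCNN.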

Training

The loss function

model.train()  # in training mode, the forward pass returns a dict of losses
loss_dict = model(x, y)  # x: list of image tensors, y: list of target dicts
loss = sum(loss for loss in loss_dict.values())  # single scalar for backprop

The optimizer and learning rate scheduling
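A bare-bones training loop with SGD and a step-wise learning-rate schedule could look like this; the hyperparameters and the scheduler type are placeholders rather than the values used in the accompanying code:

import torch

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)

num_epochs = 10
for epoch in range(num_epochs):
    model.train()
    for x, y, x_name, y_name in dataloader:
        loss_dict = model(x, y)
        loss = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    lr_scheduler.step()   # decay the learning rate once per epoch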

Logging and Neptune.ai
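For the metric logging, the standalone neptune client can be used roughly as follows; the project name, token and logged values are placeholders, and the original code may use a different logger integration:

import neptune

run = neptune.init_run(project='my-workspace/heads',   # placeholder
                       api_token='YOUR_API_TOKEN')     # placeholder

run['parameters'] = {'lr': 0.005, 'batch_size': 2, 'num_epochs': 10}
run['train/loss'].append(float(loss))   # call this inside the training loop

run.stop()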

Training script


Inference

d = datasetviewer_prediction
d.gui_score_slider(d.shape_layer)
d = datasetviewer_prediction
d.gui_nms_slider(d.shape_layer)
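The two sliders correspond to filtering the raw predictions by confidence score and to running non-maximum suppression. The same can be done directly in code; a minimal sketch with example thresholds:

import torch
from torchvision.ops import nms

model.eval()
img = dataset[0]['x']               # a [C, H, W] tensor in the 0-1 range
with torch.no_grad():
    pred = model([img])[0]          # dict with 'boxes', 'labels' and 'scores'

# Keep only predictions above a confidence score threshold
keep = pred['scores'] > 0.5
boxes, scores, labels = pred['boxes'][keep], pred['scores'][keep], pred['labels'][keep]

# Non-maximum suppression with an IoU threshold
keep = nms(boxes, scores, iou_threshold=0.5)
boxes, scores, labels = boxes[keep], scores[keep], labels[keep]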

Conclusion

Master's student in Biology with experience in artificial neural networks & deep learning, in particular with 2D/3D image data for object segmentation/detection.
