This chapter provides a detailed introduction to camera installation, image processing techniques and tools, and computer vision algorithms for the autonomous mode of the FTC competition. I will use the autonomous-mode task from the 2021-2022 season as an example to showcase these technologies and tools. Most of the details in this chapter are adapted from my publication: David Liu and Mike Callinan, Few-shot Object Detection for Robotics, International Journal of High School Research, 5(5), 2023.
During the autonomous period of the 2021-2022 FTC robotics competition, alliances earn points by delivering pre-loaded freight to a randomly selected level of the Alliance Shipping Hub (ASH). There is also an autonomous bonus: if a robot uses the Team Shipping Element (TSE) to detect the correct level and delivers the pre-loaded freight to that level of the ASH, it earns twenty points. More specifically, the onboard camera needs to localize the TSE in the scene image and determine in which of the three spots it has been randomly placed. Using this information, the robot can decide its action plan to deliver the pre-loaded freight to the correct level of the ASH. Figure 74(a) provides an overall illustration of this challenging task of TSE localization and ASH level determination.
The FTC object detection API offers a TensorFlow Lite based solution to detect game elements (e.g., ducks and cubes) for the 2021-2022 Freight Frenzy challenge. This FTC solution relies on Google's TensorFlow deep learning technology, which is designed to run on mobile devices such as an Android smartphone or the Rev Robotics control hub. Although the FTC object detection API and its models perform reasonably well in detecting the pre-designed game elements, as shown in Figure 74(b), they cannot be readily applied to detect a user-designed TSE, as shown in Figure 74(c). The main reason is that a user-designed TSE has its own shape pattern and is not included in the FTC object detection model library. Instead, a separate TensorFlow Lite model must be developed for each user-specific TSE. A major challenge in developing such a user-specific TSE detection model is that very few example images are available. Therefore, the traditional approach of training deep learning models for object detection, which requires many training samples, is not applicable in this robotics scenario.
In the computer vision (CV) and artificial intelligence (AI) fields, researchers have developed an innovative and effective few-shot learning approach for object detection. The basic idea of few-shot object detection is that a backbone deep neural network model is leveraged and fine-tuned on a few samples, so that the last few layers of the model can be transferred and continually trained for the specific new category of object. Since its invention, few-shot object detection has been extensively used in the CV and AI fields, and a variety of related open-source software packages are available. In our FTC robotics project, we adopted a few-shot object detector based on the RetinaNet backbone pretrained on a COCO checkpoint to detect the TSE (Figure 74(c)), which in turn determines the ASH level in autonomous mode.
Figure 74. (a) Illustration of the challenging task of TSE localization and ASH level determination. For instance, if the TSE (the cone here) is randomly placed in the middle spot, the pre-loaded freight should be delivered to the second level of the ASH. If the TSE is randomly placed in the left or right spot, the pre-loaded freight should be delivered to the first or third level of the ASH, respectively. (b) An example of an FTC TensorFlow object detection result. The duck object is detected by the TensorFlow Lite model provided by FTC. (c) The TSE (yellow hat) that we made to replace the duck.
Few-shot Object Detection
Few-shot learning, such as few-shot image classification and few-shot object detection, has received increasing attention and witnessed significant advances in recent years. The basic idea of few-shot learning is to classify or detect new objects when only a few annotated training samples are available. The main methodological innovation of few-shot learning is that it typically fine-tunes the last layers of an existing CNN model (such as an image classifier or object detector) on the new dataset, as illustrated in Figure 75. This simple yet effective approach outperforms traditional meta-learning methods by a significant margin on current benchmarks.
Our FTC robot was designed to support both autonomous and manual driving modes, as shown in Figure 76(a). For the autonomous mode, a webcam is installed on the front frame and connected to the Rev Robotics control hub. We employed the few-shot object detector based on the RetinaNet backbone (https://colab.research.google.com/github/keras-team/keras-io/blob/master/examples/vision/ipynb/retinanet.ipynb) pretrained on a COCO checkpoint to detect the TSE and determine the shipping hub level in autonomous mode, as shown in Figure 76(b) and Figure 76(c). Five example images of the TSE were taken from different viewpoints and used as the few-shot training samples (Figure 76(b)), and the trained model was then deployed through the TensorFlow object detection API on the Rev Robotics control hub. The detected TSE, illustrated in Figure 76(c), is measured by the bounding box dimensions provided by the TensorFlow Object Detector (TFOD) API and used to determine the ASH level. The determined ASH level then drives the robot's action plan for delivering the pre-loaded freight to the correct level of the ASH, as demonstrated in Figure 74(a).
Figure 75. Illustration of the few-shot robotics object detection model of FS-Robotics-RetinaNet based on RetinaNet. Here, the RetinaNet is used as the backbone network, and the region proposal network (RPN), region of interest (ROI) pooling, and ROI feature extractors are used to extract features for bounding box classification and regression.
Figure 76. Illustration of few-shot TSE object detection used in our FTC robotics project. (a) Our FTC robot and its mounted Logitech camera for object detection. The hardware configuration screenshot is shown in Figure 77. (b) An example of a training image of TSE (the yellow hat). (c) An example of the detected yellow hat TSE through the onboard robot webcam.
Figure 77. Web camera configuration.
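To make this setup more concrete, the sketch below shows roughly how a custom TSE TensorFlow Lite model can be loaded and activated with the webcam through the TFOD/Vuforia API of the 2021-2022 FTC SDK. The device name "Webcam 1", the Vuforia key placeholder, the model file path, and the confidence threshold are illustrative assumptions rather than our exact values.

```java
// Minimal sketch (not our full OpMode) of initializing webcam-based TFOD
// with a custom few-shot TSE model on the 2021-2022 FTC SDK.
import com.qualcomm.robotcore.eventloop.opmode.Autonomous;
import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;
import org.firstinspires.ftc.robotcore.external.ClassFactory;
import org.firstinspires.ftc.robotcore.external.hardware.camera.WebcamName;
import org.firstinspires.ftc.robotcore.external.navigation.VuforiaLocalizer;
import org.firstinspires.ftc.robotcore.external.tfod.TFObjectDetector;

@Autonomous(name = "TseDetectionSketch")
public class TseDetectionSketch extends LinearOpMode {

    // Assumed names/paths: adjust to your own configuration.
    private static final String VUFORIA_KEY = "YOUR_VUFORIA_KEY";
    private static final String TFOD_MODEL_FILE = "/sdcard/FIRST/tflitemodels/tse_model.tflite";
    private static final String[] LABELS = {"TSE"};

    private VuforiaLocalizer vuforia;
    private TFObjectDetector tfod;

    @Override
    public void runOpMode() {
        // Vuforia supplies camera frames to the TensorFlow object detector.
        VuforiaLocalizer.Parameters vuforiaParams = new VuforiaLocalizer.Parameters();
        vuforiaParams.vuforiaLicenseKey = VUFORIA_KEY;
        vuforiaParams.cameraName = hardwareMap.get(WebcamName.class, "Webcam 1");
        vuforia = ClassFactory.getInstance().createVuforia(vuforiaParams);

        // Create the TFOD engine and load the custom few-shot TSE model.
        int monitorViewId = hardwareMap.appContext.getResources().getIdentifier(
                "tfodMonitorViewId", "id", hardwareMap.appContext.getPackageName());
        TFObjectDetector.Parameters tfodParams = new TFObjectDetector.Parameters(monitorViewId);
        tfodParams.minResultConfidence = 0.7f;   // assumed threshold
        tfodParams.isModelTensorFlow2 = true;    // assuming a TF2-exported model
        tfod = ClassFactory.getInstance().createTFObjectDetector(tfodParams, vuforia);
        tfod.loadModelFromFile(TFOD_MODEL_FILE, LABELS);

        tfod.activate();
        waitForStart();
        // ... detection loop and ASH-level logic would follow here ...
        tfod.shutdown();
    }
}
```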
Implementation
In the Colab environment, all datasets and code are stored and run in the Google cloud environment, including Google Drive, cloud GPUs, and TensorFlow. Conveniently, all of these tools and environments are free to anyone with a Gmail account. We had a very good experience coding with Colab, as everything is interactive and in the Jupyter Notebook style. For us, one of the most prominent features of Colab is that we did not need to worry about the complex and time-consuming steps of configuring an environment for developing and using deep learning models such as those based on TensorFlow. The Colab environment is also naturally compatible with the FTC TensorFlow object detection API, meaning that models developed in Colab can be easily converted to the FTC's TensorFlow Lite environment supported by the Rev Robotics control hub.
We would like to acknowledge that our few-shot object detection Colab was largely based on the RetinaNet Colab referenced above. Our main work in this part of the project was to add our own images of the TSE to the existing Colab and retrain the few-shot object detection model. Here, we describe a few of the important steps. The code that adds our own images is shown in Figure 78.
Figure 78. The Colab code used to add our own TSE images (the five images at the bottom).
After the TSE images are added, the next key step is to manually annotate the target bounding box in each image. This annotation provides the supervision needed to train the TSE-specific TensorFlow object detection model. This step is therefore user-specific: each FTC team can design their own TSE and train their own TensorFlow object detection model, which is deployed once it is uploaded to the Rev Robotics control hub. As an example, Figure 79 shows the code and the annotated bounding boxes of the TSE images.
Figure 79. Illustration of the manually annotated bounding boxes of the TSE target (bottom images).
Figure 80. The Colab code used to fine-tune the general TensorFlow object detection model into a specific TSE model via few-shot learning.
Once the annotated training samples are available, the few-shot learning module fine-tunes the pre-trained backbone model (e.g., RetinaNet for object detection) to transfer the generic detector into a TSE-specific model. Figure 80 shows the code that performs this fine-tuning step for TSE model generation; it is probably one of the most important pieces of code for few-shot object detection in this project's autonomous mode. Once the few-shot TSE model is trained, it is converted into the TensorFlow Lite format and uploaded to the Rev Robotics control hub for deployment. To assess the accuracy and performance of the few-shot TSE object detection model, 30 runs of autonomous-mode operations (half for the red alliance and half for the blue alliance) were conducted on a real FTC competition field.
Robotics Integration
The FTC TFOD library provides excellent support for integrating the webcam, the TensorFlow object detection model, and other related modules. There is an abundance of online resources for learning the TFOD library and its integration and application in FTC robotics, such as the initialization of TFOD and the detection of FTC game elements. Here, we focus on how our specific few-shot TSE detection model was employed and integrated into our robot to accomplish the task of delivering the pre-loaded freight to the correct level of the ASH. The code used to determine the ASH level based on the detected TSE bounding box is shown in Figure 81. It is important to point out that the key parameters used to make this determination depend heavily on the specific robot configuration, the webcam installation, and their geometric relationships. All of the parameters in our robotics project were determined empirically through extensive experiments on the competition field.
Figure 81. The Java code for determining ASH level based on the bounding box of the detected TSE object in the webcam video scene.
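Since Figure 81 shows our code as an image, the simplified sketch below illustrates the underlying idea: take the bounding box of the detected TSE from the TFOD recognitions and map its horizontal center to one of the three spots. The label "TSE" and the two pixel boundaries are illustrative assumptions; the actual thresholds must be tuned on the field as described above.

```java
// Simplified sketch of the ASH-level logic in Figure 81. LEFT_BOUNDARY and
// RIGHT_BOUNDARY are hypothetical pixel thresholds for a particular camera mount.
import java.util.List;
import org.firstinspires.ftc.robotcore.external.tfod.Recognition;

public class AshLevelHelper {

    private static final float LEFT_BOUNDARY = 200f;   // assumed pixel threshold
    private static final float RIGHT_BOUNDARY = 420f;  // assumed pixel threshold

    /** Returns 1, 2, or 3 for the ASH level, or -1 if no TSE is seen. */
    public static int determineAshLevel(List<Recognition> recognitions) {
        if (recognitions == null) return -1;
        for (Recognition r : recognitions) {
            if (!"TSE".equals(r.getLabel())) continue;
            // Horizontal center of the detected bounding box in image coordinates.
            float centerX = (r.getLeft() + r.getRight()) / 2f;
            if (centerX < LEFT_BOUNDARY) return 1;       // TSE in the left spot
            else if (centerX < RIGHT_BOUNDARY) return 2; // TSE in the middle spot
            else return 3;                               // TSE in the right spot
        }
        return -1; // TSE not detected in this frame
    }
}
```

In the OpMode loop, the recognition list would typically come from tfod.getUpdatedRecognitions(), which returns null when no new frame has been processed; the helper above simply skips those cases.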
Once the ASH level is determined from the live webcam video stream, the robot needs to execute an action plan to deliver the pre-loaded freight to the correct level of the ASH. The code in Figure 82 shows how the robot plans its motion to carry out this delivery. This piece of code is also highly dependent on the robot's configuration and, more significantly, on its delivery arm. Additional details on the integration of the webcam, few-shot TSE object detection, ASH level determination, pre-loaded freight delivery, and other related robot operations can be found in my above-mentioned paper. The effectiveness and performance of the robotics integration were quantitatively evaluated using the same 30 runs of autonomous-mode operations mentioned above.
Figure 82. The Java code for our robot's motion plan to deliver the pre-loaded freight to the correct level of the ASH. Here, the first ASH level is used as an example; the Java code for the other ASH levels can be found at our GitHub link.
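Because Figure 82 likewise presents the Java code as an image, the following is only a heavily simplified sketch of an encoder-based delivery motion for the first ASH level. The hardware names ("arm", "claw"), the encoder target, and the servo position are hypothetical and would differ for any real robot.

```java
// Heavily simplified sketch of a level-1 delivery motion: raise the arm to a
// preset encoder position, open the claw to drop the freight, then relax.
import com.qualcomm.robotcore.eventloop.opmode.Autonomous;
import com.qualcomm.robotcore.eventloop.opmode.LinearOpMode;
import com.qualcomm.robotcore.hardware.DcMotor;
import com.qualcomm.robotcore.hardware.Servo;

@Autonomous(name = "DeliverLevel1Sketch")
public class DeliverLevel1Sketch extends LinearOpMode {
    @Override
    public void runOpMode() {
        // Hardware names and all numeric targets below are assumptions.
        DcMotor arm = hardwareMap.get(DcMotor.class, "arm");
        Servo claw = hardwareMap.get(Servo.class, "claw");
        arm.setMode(DcMotor.RunMode.STOP_AND_RESET_ENCODER);

        waitForStart();

        // Raise the arm to a (hypothetical) encoder position for the first ASH level.
        arm.setTargetPosition(450);
        arm.setMode(DcMotor.RunMode.RUN_TO_POSITION);
        arm.setPower(0.5);
        while (opModeIsActive() && arm.isBusy()) {
            idle();                  // wait for the arm to reach the target
        }

        claw.setPosition(0.8);       // hypothetical "open" position to drop the freight
        sleep(500);                  // give the freight time to fall
        arm.setPower(0.0);
    }
}
```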
Results and Discussion
To visualize and quantitatively evaluate the accuracy and performance of the developed few-shot TSE object detection model, FS-Robotics-RetinaNet, we used a large-screen projector to display the results, as shown in Figure 83. We connected the Rev Robotics control hub to the projector via an HDMI cable, so that the content normally viewed on the small Android phone screen could be visualized effectively. Figure 83 shows an example of this visualization. Many experiments were performed and visualized in this way, and our statistics showed a TSE detection accuracy of 100%, demonstrating the effectiveness and accuracy of the FS-Robotics-RetinaNet model. In real robotics practices and competitions, the success rates of ASH level determination and pre-loaded freight delivery were also very high, approximately 95%. The unsuccessful cases were mainly due to variations in the robot's initial location and the ASH's location before the start of the autonomous period. In each robotics match, the locations of both the robot and the ASH are reset and thus differ slightly, and this variation introduces uncertainty into the hard-coded key parameters of the robot's motion plan. To address these uncertainties, other advanced robotics and AI technologies, such as simultaneous localization and mapping (SLAM) and the integration of multimodal sensory data (e.g., camera and LiDAR), will be needed in the future for more accurate robot and environment localization and mapping.
Figure 83. Visualization of few-shot TSE object detection result by a large-screen projector.
Notably, we explored various methods of video object detection during this FTC robotics project, such as using the FTC-compatible EasyOpenCV library (https://github.com/OpenFTC/EasyOpenCV) and training a TSE object detection TensorFlow model from scratch. However, due to various uncertainties and variations, such as the robot-TSE distance, webcam/TSE viewing angles, room lighting, and surrounding objects, it is challenging to develop a robust object detection model that works well in a real-world dynamic environment. Our experience is that the proposed few-shot object detection model, FS-Robotics-RetinaNet, works significantly better than the other methods we explored. We hope this open-sourced FS-Robotics-RetinaNet model can be useful for other FTC teams to explore and evaluate in their specific FTC robotics application scenarios and beyond.
For the FTC 2022-2023 season, we employed two cameras in the autonomous mode for different purposes, as shown in Figure 84. The first camera is used to locate the cone within the gripper's working zone, compute the robot's action plan for approaching the cone, and guide the gripper to complete the grasping task, as shown in Figures 84-85.
Figure 84. Dual camera installation for 2022-2023 season. The top camera looks down to detect and locate the cone.
Figure 85. A top-down view of the grasping scene from the camera’s perspective.
Once the cone is grasped and lifted to the right height, the robot needs to decide how to deliver it to the pole and drop it. For this purpose, it is important to use the camera to locate the pole's top and measure its location relative to the robot. We used the OpenCV library integrated into the FTC software package to detect the pole automatically, as shown in Figure 86. Here, the detected pole top is highlighted by a colored circle, which was detected and drawn using OpenCV's Hough circle transform. Please refer to our released code (shared in the Appendix) for more details.
Figure 86. Automatic delivery of the second cone to the junction.
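As a rough illustration of this approach (not our exact released code), the sketch below wraps a Hough circle detection in an EasyOpenCV pipeline. The class name and all numeric parameters are assumptions and would need tuning for the actual camera, distance, and lighting.

```java
// Minimal EasyOpenCV pipeline sketch for locating the pole top with a Hough
// circle transform, in the spirit of Figure 86.
import org.opencv.core.Mat;
import org.opencv.core.Point;
import org.opencv.core.Scalar;
import org.opencv.core.Size;
import org.opencv.imgproc.Imgproc;
import org.openftc.easyopencv.OpenCvPipeline;

public class PoleTopPipeline extends OpenCvPipeline {

    private final Mat gray = new Mat();
    private final Mat circles = new Mat();
    private volatile Point poleCenter = null; // latest detection, read by the OpMode

    @Override
    public Mat processFrame(Mat input) {
        // Grayscale + blur reduce noise before the circle transform.
        Imgproc.cvtColor(input, gray, Imgproc.COLOR_RGB2GRAY);
        Imgproc.GaussianBlur(gray, gray, new Size(9, 9), 2);

        // Detect circular shapes (the pole top as seen by the camera).
        Imgproc.HoughCircles(gray, circles, Imgproc.HOUGH_GRADIENT,
                1.0, 50, 100, 30, 10, 80);

        if (circles.cols() > 0) {
            double[] c = circles.get(0, 0);        // take the first/strongest circle
            poleCenter = new Point(c[0], c[1]);
            // Highlight the detection with a colored circle, as in Figure 86.
            Imgproc.circle(input, poleCenter, (int) c[2], new Scalar(0, 255, 0), 3);
        } else {
            poleCenter = null;
        }
        return input; // the annotated frame appears on the camera stream preview
    }

    public Point getPoleCenter() {
        return poleCenter;
    }
}
```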
The second camera aims to locate the randomly placed team-supplied signal sleeve, recognize its pattern, and determine the randomized signal parking zone, as illustrated in Figures 87-88. Again, the few-shot object detection methods in Figures 78-83 could be used here for cone detection. In addition, a colored cover can be added to the cone as the signal sleeve, as shown in the top row of Figure 88. My experience is that it is relatively easy to recognize the three colors of red, green, and blue on real FTC competition fields.
Figure 87. The second camera deals with automated recognition of the randomized team-supplied signal sleeve.
Figure 88. Examples of the team-supplied signal sleeves, and automatic recognition of them. Based on these recognitions, the robot determines the randomized signal parking zone.
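As a rough sketch of this color-based recognition (again, not our exact released code), the pipeline below converts each frame to HSV, counts the pixels falling inside assumed red, green, and blue ranges, and reports the dominant color as the parking zone. The HSV bounds are illustrative assumptions and must be tuned on the field.

```java
// Minimal color-classification pipeline sketch for the signal sleeve.
import org.opencv.core.Core;
import org.opencv.core.Mat;
import org.opencv.core.Scalar;
import org.opencv.imgproc.Imgproc;
import org.openftc.easyopencv.OpenCvPipeline;

public class SleeveColorPipeline extends OpenCvPipeline {

    public enum ParkingZone { ZONE1_RED, ZONE2_GREEN, ZONE3_BLUE, UNKNOWN }

    private final Mat hsv = new Mat();
    private final Mat mask = new Mat();
    private volatile ParkingZone zone = ParkingZone.UNKNOWN;

    @Override
    public Mat processFrame(Mat input) {
        // EasyOpenCV supplies RGB(A) frames; convert to HSV for color thresholding.
        Imgproc.cvtColor(input, hsv, Imgproc.COLOR_RGB2HSV);

        // Count pixels in each (assumed) HSV range.
        int red = countInRange(hsv, new Scalar(0, 100, 100), new Scalar(10, 255, 255));
        int green = countInRange(hsv, new Scalar(40, 100, 100), new Scalar(80, 255, 255));
        int blue = countInRange(hsv, new Scalar(100, 100, 100), new Scalar(130, 255, 255));

        // The dominant color determines the randomized parking zone.
        if (red > green && red > blue) zone = ParkingZone.ZONE1_RED;
        else if (green > red && green > blue) zone = ParkingZone.ZONE2_GREEN;
        else if (blue > red && blue > green) zone = ParkingZone.ZONE3_BLUE;
        else zone = ParkingZone.UNKNOWN;
        return input;
    }

    private int countInRange(Mat hsvImage, Scalar lower, Scalar upper) {
        Core.inRange(hsvImage, lower, upper, mask);
        return Core.countNonZero(mask);
    }

    public ParkingZone getZone() {
        return zone;
    }
}
```

In practice, restricting the count to a small region of interest around the expected sleeve position makes this kind of classification considerably more robust to background colors.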