1. Introduction
Automatic license plate recognition (ALPR) is a highly demanded feature of video analytics in smart cities, playing a key role in enhancing security-related operations. A vast range of applications benefits from this technology, such as police surveillance, parking management, access control, tolling, and intelligent transportation systems. Automating the vehicle identification process adds significant value in terms of security and autonomy. Hence, ALPR is a fast-moving research field that constantly proposes innovative systems for effective, real-time processing. Compared to traditional ALPR configurations (fixed specialized cameras and processing hardware), the in-the-wild scenario has recently received growing attention [Reference Chen and Wang1–Reference Zhang, Wang, Li, Li, Shen and Zhang10].
The in-the-wild scenario refers to the use of non-specialized video surveillance systems and settings to detect and transcribe license plates (LPs). This scenario gets its complexity from a number of factors, including non-specialized camera positioning and setup, imperfect lighting, and a wide field of view. Also known as complex natural scenes, unrestricted scenarios, real-world scenarios, or open scenarios, these setups significantly increase the difficulty and challenges associated with ALPR (e.g. defocusing, variations in distance, perspective distortion, and multiple vehicles), making it a notably tough task. For instance, Liu et al. [Reference Liu and Chang5] proposed a hybrid cascade structure for detecting small and vague LPs in large and complex visual surveillance scenes. Wang et al. [Reference Wang, Lu, Zhang, Yuan and Li11] introduced MFLPR-Net for reading LPs captured by driving recorders. Similarly, Laroca et al. [Reference Laroca, Severo, Zanlorensi, Oliveira, Gonçalves, Schwartz and Menotti12] presented a YOLO-based ALPR system to read LPs in images taken from inside a vehicle driving through regular traffic on Brazilian streets. Mokayed et al. [Reference Mokayed, Shivakumara, Woon, Kankanhalli, Lu and Pal13] presented the DCT-PCM method for detecting LP numbers in drone images. Their method combines a phase congruency-based model [Reference Chen, Xue, Zhang, Lu and Xia14], which extracts robust features and proposes LP region candidates, with fully connected layers that sort out false positive candidates. However, to the best of our knowledge, none of these methods has been custom-tailored to work with security mobile robots.
In this context, our work addresses the in-the-wild scenario by proposing an efficient ALPR system implemented on a mobile security robot named PGuard. PGuard, a security patroller shown in Fig. 1, is a rugged platform dedicated to security and safety. Its main mission is to autonomously patrol and secure high-risk sites (e.g. industrial plants, nuclear plants, military areas, airports, and logistic warehouses) through in-site or perimeter patrols with random stakeouts and doubt-clearing interventions. PGuard is equipped with a centimeter-precision navigation system and a security-embedded payload. For recurrent patrols, PGuard maintains a speed of 4 km/h, with the capacity to reach a maximum speed of about 12 km/h if required. Communication with PGuard is established through its secure Wi-Fi Mesh network or via public or private mobile networks. During navigation, PGuard relies on obstacle detection and avoidance capabilities for both static and dynamic objects. Secure remote operation is also possible using joysticks. PGuard’s security payload includes two panoramic cameras that provide 360 $^{\circ }$ immersive vision. In addition, PGuard is equipped with one thermal camera and one optical camera. The latter is an AXIS Q1806-LE camera featuring 32 $\times$ optical zoom and adaptive infrared lighting with up to 90 FPS @ $2880\times 1620$ . With its versatile features, PGuard is well-suited for a wide spectrum of security and surveillance tasks.
Therefore, the proposed ALPR system will enable the security robot to automatically identify and monitor unauthorized vehicles within restricted areas, providing real-time alerts to the security team. Furthermore, the mobility of the robot allows it to cover larger areas and varied terrains, extending the reach of the surveillance system.
Despite the wide range of benefits of in-patrol ALPR systems, several challenges arise. First, environmental conditions such as low light, excessive sunlight, rain, snow, or fog can hamper image capture, LP detection (LPD), and LP recognition (LPR). Second, these systems must effectively deal with a range of angles and distances, as the robot will encounter vehicles from different perspectives while patrolling. The movement of both the robot and vehicles can lead to motion blur. Furthermore, the diversity in LP designs, including colors, sizes, and fonts, can complicate the LPR task. Thus, while integrating ALPR with security robots offers significant advantages, it also requires overcoming various operational and environmental challenges. To address these challenges, we propose in this article an ALPR system, and we detail its deployment on a security robot to achieve efficient inference. The proposed ALPR system consists of a YOLOv7x model to detect LPs, followed by an optimized ViTLPR-based engine to complete the LPR task. In particular, ViTLPR can manage LP images under various conditions and does not require any additional processing on the detected LP images. The conducted experiments demonstrate the generalization ability of the proposed system across several benchmark datasets. As an additional contribution, we collected a Tunisian LP dataset using PGuard. Our dataset contains more realistic scenes compared to existing datasets. The main contributions of this work are as follows:
• An efficient two-stage ALPR system is introduced. It is based on a novel segmentation-free LPR model, named the vision transformer-based LP recognizer (ViTLPR). ViTLPR addresses the LPR task as a sequence labeling problem. It extends vision transformer to predict character sequences [Reference Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit and Houlsby15].
• An extended version of the PGTLP dataset [Reference Ismail, Mehri, Sahbani and Amara16], named PGTLP-v2, is presented. It consists of $5$ k vehicle images acquired using PGuard. The dataset is available to the research community upon request at https://emirismail.github.io/repo.
• A series of thorough experiments was carried out on the proposed dataset and five benchmark LP datasets to assess the effectiveness of the proposed system (YOLOv7x-ViTLPR). The results show that the two modules perform competitively compared to recent baselines.
• An enhancement of the efficiency of ViTLPR is proposed through an optimization strategy that involves profiling and fine-tuning with the TensorRT framework, leading to an optimized engine with lower latency and faster processing speed.
The remainder of this article is organized as follows. Section 2 provides a review of recent state-of-the-art ALPR systems. Section 3 introduces the proposed ALPR system and describes its implementation on the PGuard robot. Section 4 describes the experimental setup, followed by the presentation of the achieved results in Section 5. In Section 6, we illustrate the optimization strategy used to achieve real-time inference in our ALPR system and discuss the impact of data and deblurring on the LPR performance. Finally, Section 7 concludes the article and suggests directions for future work.
2. Related work
2.1. ALPR systems
ALPR systems consist of two primary components: LPD and LPR. In the literature, various studies have proposed ALPR systems that address these two sub-tasks either through a single end-to-end trainable network or a two-stage system. The former typically leverages a multi-task learning approach, using shared features between the LPD and LPR tasks. For instance, Qin et al. [Reference Qin and Liu6] used feature pyramid networks (FPNs) [Reference Lin, Dollár, Girshick, He, Hariharan and Belongie17] to improve the feature extraction process of convolutional neural networks (CNNs). Li et al. [Reference Li, Wang and Shen18] used a region proposal network [Reference Ren, He, Girshick, Sun, Cortes, Lawrence, Lee, Sugiyama and Garnett19] on top of CNN-extracted features to generate region proposals. These two methods are based on region of interest pooling to extract a fixed-size feature map, which is then processed through two branches: fully connected layers for bounding box regression, and recurrent neural networks (RNNs) with connectionist temporal classification (CTC) loss for LPR. Among step-wise or two-module ALPR systems, Kessentini et al. [Reference Kessentini, Besbes, Ammar and Chabbouh20] combined YOLOv2 and bidirectional RNNs. Laroca et al. [Reference Laroca, Severo, Zanlorensi, Oliveira, Gonçalves, Schwartz and Menotti12] also used YOLOv2 to detect LPs and CR-NET [Reference Montazzolli and Jung21] to segment and recognize LP characters. Wang et al. [Reference Wang, Bian, Zhou and Chau22] proposed two cascaded CNNs, VertexNet for LPD and SCR-Net for LPR, with the main focus of achieving high inference speed.
2.2. LPR systems
Since our main contribution in this work is a novel LPR engine, recent approaches related to this sub-task are presented in Sections 2.3 and 2.4.
Over the last decade, most LPR systems have relied on deep learning (DL). Numerous DL-based solutions continually push the boundaries and achieve new state-of-the-art results across many benchmarks [Reference Shashirangana, Padmasiri, Meedeniya and Perera23]. Depending on whether LP characters are treated as objects or as text, we classify recent solutions into two categories: object recognition and sequence labeling.
2.3. Object recognition
Recent works have approached LPR by framing it as an object recognition problem, with each character being processed as a separate object. For instance, Rizvi et al. [Reference Rizvi, Patti, Björklund, Cabodi and Francini24] presented a 12-layer fully convolutional network with two branches (detector and localizer). Using shared convolutional layers, the detector predicts the character class, while the localizer outputs bounding box coordinates. Montazzolli et al. [Reference Montazzolli and Jung21] presented LPS/CR-NET, which is a FAST-YOLO based detector. First, minor changes were introduced to the original architecture to make it suitable for Brazilian LP layouts. They then post-processed the predictions using two heuristic rules to eliminate confusion between letters and digits. Their architecture was later used by Silva et al. [Reference Silva, Jung, Ferrari, Hebert, Sminchisescu and Weiss7] and Laroca et al. [Reference Laroca, Zanlorensi, Gonçalves, Todt, Schwartz and Menotti25] to recognize distorted LPs in unconstrained scenarios. Relying on YOLOv2, Kessentini et al. [Reference Kessentini, Besbes, Ammar and Chabbouh20] adjusted the original architecture to fit the Tunisian LP specifications. Similarly, Henry et al. [Reference Henry, Ahn and Lee26] used an improved version of the YOLOv3 detector that integrates a spatial pyramid pooling (SPP) block [Reference He, Zhang, Ren and Sun27], called YOLOv3-SPP. Their contribution is well suited to double-line plates across many countries. Selmi et al. [Reference Selmi, Halima, Pal and Alimi28] adopted Mask-RCNN to segment and extract character candidates. To improve detection, they applied a set of rules to remove non-character regions.
The key benefit of these solutions is their ability to provide a rapid response time while maintaining a satisfactory rate of prediction. However, to train these models, an enormous amount of training data with costly character-level annotation is required. In addition, these solutions have been proven to be ineffective for distorted and tilted LPs.
2.4. Sequence labeling
Researchers have also addressed LPR as a standard character sequence labeling task. In earlier research works [Reference Kessentini, Besbes, Ammar and Chabbouh20,Reference Cao, Fu and Ma29,Reference Wang, Huang, Qian, Cao and Dai30], a scheme of convolutional layers combined with recurrent ones, followed by a CTC layer [Reference Graves, Liwicki, Fernández, Bertolami, Bunke and Schmidhuber31], was adopted. Specifically, LP sequence features are encoded through CNNs and RNNs, and decoded by CTC. Bidirectional long short-term memory (Bi-LSTM) networks [Reference Hochreiter and Schmidhuber32] were commonly used. For instance, Li et al. [Reference Li, Wang and Shen18] proposed a solution that can handle oblique and normal LPs. However, integrating the CTC layer in their model for sequence decoding yielded weak performance. Wang et al. [Reference Wang, Huang, Qian, Cao and Dai30] proposed a DL architecture similar to the regular CNN-RNN-CTC stack, but before the CNN, they fed the detected LPs to spatial transformer networks [Reference Jaderberg, Simonyan, Zisserman, Kavukcuoglu, Cortes, Lawrence, Lee, Sugiyama and Garnett33] to rectify them (i.e. to obtain aligned characters of uniform heights and widths). Later, the CTC layer was replaced by an attention mechanism. For instance, Xu et al. [Reference Xu, Yang, Meng, Lu, Huang, Ying, Huang, Ferrari, Hebert, Sminchisescu and Weiss34] proposed a model, called RPnet, that uses the attention mechanism in the detection module to direct the recognition module to a specific area from which it should gather features that are useful for predicting the LP number. Most recently, the encoder-decoder scheme has been widely adopted. Specifically, the encoder is a CNN structure, while the decoder is an attention-based RNN. For example, He et al. [Reference He and Hao35] and Zou et al. [63] used a 1D-attention decoder to focus only on useful character features.
Although these methods perform well on single-line LPs, they cannot deal with double-line LPs. To cope with this limitation, 2D attention was applied instead. Both Zhang et al. [Reference Zhang, Wang, Li, Li, Shen and Zhang10] and Xu et al. [Reference Xu, Zhou, Li, Liu, Li and Shi36] integrated a 2D-attention mechanism to perform decoding and retrieve characters on the 2D feature maps. This technique yielded promising results on multi-line LPs and on unaligned LPs as well. Kumar et al. [Reference Kumar, Shivakumara, Chowdhury, Pal and Liu37] replaced the 2D attention with a transformer-based decoder to generate feature maps using an eight-head attention mechanism. They showed that their module had faster training and inference compared to RNN, but it relied on CNN for feature extraction.
3. Proposed ALPR system
The initial phase in the pipeline of the proposed ALPR system deployed on PGuard involves transmitting the video feeds from PGuard’s onboard cameras through the recording server to the GStreamer Sink. The Sink compiles a batch of frames from three different input cameras and ensures that these frames are synchronized and ready for further processing. Once the Sink has formed a batch of frames, it forwards them to the LPD module to detect LPs in the batched frames.
Our ALPR system is composed of two main stages: LPD using YOLOv7x [Reference Wang, Bochkovskiy and Liao38] and LPR using a novel segmentation-free LPR model introduced in this article. The proposed LPR model, called ViTLPR, is specifically designed to recognize the characters in the LP regions identified by the LPD module.
The recognition results of ViTLPR are then packaged together as metadata. This metadata is subsequently consumed by an event manager. The event manager is tasked with controlling alarms and generating recordings based on a set of predefined rules. An example of such a rule is a white list, which contains LPs that are given access to a particular area. Finally, the results, which include the LP region and its transcription, are saved in the database.
These results can be consumed in two ways. They can either be shown as live feeds, providing real-time updates, or be stored as recorded videos. The recorded videos can be accessed and reviewed at a later time, providing a valuable resource for retrospective analysis or investigation.
The metadata is also consumed in the Milestone XProtect video management software through third-party integration. Our plugin, currently in development, provides features such as a live dashboard, weekly summaries, and search capabilities. Figure 2 depicts the pipeline of the proposed ALPR system (YOLOv7x + ViTLPR) deployed on the PGuard robot.
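As an illustration of the white-list rule mentioned above, the following minimal Python sketch checks a recognized plate against a white list. The `PlateEvent` record, the `normalize` helper, the alarm rule, and the sample plate strings are all hypothetical; they are not taken from the PGuard software, whose event manager may apply richer rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlateEvent:
    """Hypothetical metadata record emitted by the recognition stage."""
    camera_id: str
    plate_text: str


def normalize(plate_text: str) -> str:
    """Uppercase and drop separators so 'ab-123 cd' matches 'AB123CD'."""
    return "".join(ch for ch in plate_text.upper() if ch.isalnum())


def is_authorized(event: PlateEvent, white_list: set) -> bool:
    """Return True when the recognized plate is on the white list."""
    return normalize(event.plate_text) in white_list


def should_raise_alarm(event: PlateEvent, white_list: set) -> bool:
    """Assumed rule: any plate absent from the white list triggers an alarm."""
    return not is_authorized(event, white_list)


# Illustrative plates only; the white list stores normalized strings.
white_list = {normalize(p) for p in ["123 TU 4567", "AB-123-CD"]}
known = PlateEvent(camera_id="optical", plate_text="ab-123-cd")
unknown = PlateEvent(camera_id="optical", plate_text="999 TU 0000")
```

Normalizing both the white list and the recognized text keeps the comparison insensitive to spacing and case, which is one plausible way to make such a rule tolerant of formatting differences.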
3.1. YOLOv7x
YOLOv7, one of the official releases in the YOLO series of object detection models [Reference Wang, Bochkovskiy and Liao38], is known for its speed-accuracy trade-off, making it popular for applications where real-time detection is crucial, such as video surveillance and, in particular, LPD. At the time of its publication, YOLOv7 was reported to surpass all known object detectors in both speed and accuracy, reaching up to 160 FPS on a V100 GPU. The major upgrades in YOLOv7 are mainly:
• Extended efficient layer aggregation network (E-ELAN) [Reference Wang, Liao and Yeh39]: E-ELAN combines the features of different groups by shuffling and merging cardinality to enhance the network’s learning. It improves gradient flow and feature reuse, raising performance without increasing complexity.
• Model Scaling: YOLOv7 applies compound model scaling techniques, balancing depth, width, and resolution to improve performance across various computing resources and image sizes.
• Bag-of-freebies: YOLOv7 uses RepConv [Reference Ding, Zhang, Ma, Han, Ding and Sun40], which makes inference more efficient by converting multi-branch convolutions into a single branch at inference time, boosting speed without sacrificing accuracy.
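The re-parameterization idea behind RepConv can be illustrated with a minimal Python sketch. It assumes a single channel, no bias, and no batch normalization, which is far simpler than the layers used in YOLOv7, and all function names are illustrative. Because convolution is linear, a parallel 1x1 branch can be folded into the center of a 3x3 kernel, so the two-branch and single-branch views produce the same output.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    k = len(kernel)
    pad = k // 2
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def fuse_branches(kernel_3x3, weight_1x1):
    """Fold a 1x1 branch into the center tap of a 3x3 kernel."""
    fused = [row[:] for row in kernel_3x3]
    fused[1][1] += weight_1x1
    return fused


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
kernel_3x3 = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
weight_1x1 = 0.5

# Training-time view: two parallel branches whose outputs are summed.
branch_a = conv2d_same(image, kernel_3x3)
branch_b = conv2d_same(image, [[weight_1x1]])
multi_branch = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(branch_a, branch_b)]

# Inference-time view: one convolution with the fused kernel.
single_branch = conv2d_same(image, fuse_branches(kernel_3x3, weight_1x1))
```

The fused kernel costs one convolution instead of two at inference time, which is the source of the speed gain described above.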
In our proposed ALPR system, we use the extended variant, YOLOv7x. This variant uses the compound scaling method to scale up the depth of the computational block by 1.5 times and the width of the transition block by 1.25 times [Reference Wang, Bochkovskiy and Liao38].
3.2. ViTLPR
ViTLPR is a recurrence- and convolution-free model based on the self-attention mechanism. It leverages advancements in using attention mechanisms for various vision tasks. It is an extension of the vision transformer, which was originally introduced for image classification [Reference Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit and Houlsby15]. It approaches the LPR task as a sequence labeling problem for multiple reasons: internal variability of LPs (e.g. font style, orientation, shape, size, color, texture, illumination) and external influences (e.g. camera sensor orientation, location, and imperfections resulting in blur, noise, or distortions). The ViTLPR architecture is depicted in Fig. 3.
The ViTLPR workflow is detailed in what follows. The LP image $x\in \mathbb{R}^{H \times W \times 3}$ ( $H$ and $W$ denote the image height and width, respectively) is partitioned into patches $x_{p}\in \mathbb{R}^{N\times (P \times P\times 3)}$ , where $N=\frac{H \times W}{P \times P}$ denotes the effective input sequence length for the encoder block. $P \times P$ denotes the resolution of each patch. Each patch is unrolled into a vector, then linearly projected to $d_{model}$ dimension. The projection is performed using a matrix $E$ . Positional embeddings are added to patch embeddings. $L$ ViT encoder blocks process the patch embeddings and produce, using a prediction head, the LP transcription. In what follows, we describe the three building blocks of ViTLPR.
1. Linear embedding: First, the LP image is divided into $P \times P \times 3$ patches. Each patch is reshaped into a $(P \times P \times 3)\times 1$ vector. A dense layer is applied to each vector, $z_{i}=Ex_{i}+b$ , where $E$ and $b$ are parameters to be learned during the training phase. Vectors $z_{i\in [1 \dots N]}$ are patch embeddings linearly projected into a new space of dimension $d_{model}$ . To add positional information, learnable 1D position embeddings are added to the patch embeddings.
2. Encoder: Each layer of the encoder block uses multi-head self-attention (MHSA) and a 2-layer perceptron (MLP). Skip connections are used in between to maintain lower-level features. In this work, the embedding sequence $Z \in \mathbb{R}^{(N+1) \times d_{model}}$ is linearly mapped into a new representation subspace using three trainable parameter matrices: $W^{Q}\in \mathbb{R}^{d_{model} \times d_{q}}$ , $W^{K}\in \mathbb{R}^{d_{model} \times d_{k}}$ , and $W^{V}\in \mathbb{R}^{d_{model} \times d_{v}}$ . This projection produces a triplet: query ( $Q$ ), key ( $K$ ), and value ( $V$ ), where $Q=Z \times W^{Q}$ , $K=Z \times W^{K}$ , and $V=Z \times W^{V}$ . The attention operation is defined by Equation 1.
(1) \begin{equation} Z=\text{Attention}(Q, K, V)=\text{Softmax}\!\left(\frac{Q\cdot K^{T}}{\sqrt{d_{k}}}\right)\cdot V \end{equation}
The output of the Softmax function is called an attention filter. To improve the performance of the vanilla self-attention layer, $h$ self-attention operations are performed independently and in parallel [Reference Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin, Guyon, von Luxburg, Bengio, Wallach, Fergus, Vishwanathan and Garnett41]. The single-head self-attentions have the same inputs. However, they do not share parameter matrices. For each head, different $Q$ , $K$ , and $V$ matrices are learned. The MHSA operation is defined by Equation 2.
(2) \begin{equation} \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_{1}, \ldots , \text{head}_{h}) \cdot W^{o} \end{equation}
where:
(3) \begin{align} \text{head}_{i} = \text{Attention}(Q_{i}, K_{i}, V_{i}). \end{align}
The MHSA outputs are fed into a 2-layer perceptron. It has the form $\text{GELU}(ZW_{1} + b_{1})W_{2}+ b_{2}$ , where $W_{1}\in \mathbb{R}^{d_{model} \times d_{latent}}$ and $W_{2}\in \mathbb{R}^{d_{latent} \times d_{model}}$ are two parameter matrices shared across the sequence $z_{i\in [0..N]}$ . The encoder block is repeated $L$ times to output a “feature sequence” of shape ( $N \times d_{model}$ ). Each position ( $1, \ldots , N$ ) is a $d_{model}$ -dimensional vector that carries a rich, context-aware representation of its corresponding LP patch.
3. Prediction head: The “feature sequence” is transformed into class scores in order to produce a sequence of characters. First, we slice the ( $N \times d_{model}$ ) matrix to obtain an ( $M \times d_{model}$ ) matrix, where $M$ is the maximum number of characters in the LP (it varies across datasets). We add two additional rows (embeddings) for the start ( $start$ ) and end ( $e$ ) tokens and learn them during training. Then, we apply a linear layer to transform it into an ( $(M+2) \times C$ ) matrix of class scores, where $C=36$ is the number of alphanumeric characters. This layer maps each of the $d_{model}$ -dimensional embeddings to a $C$ -dimensional vector, from which the character sequence is predicted.
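The shape bookkeeping and the attention operation described above can be sketched in plain Python. This is not the authors' implementation: the $224 \times 224$ input and $16 \times 16$ patch size are assumed values, the scores are scaled by the square root of the key dimension as in Vaswani et al., and the start and end tokens of the prediction head are omitted for brevity. All function names and toy matrices are illustrative.

```python
import math


def num_patches(height, width, patch):
    """N = (H * W) / (P * P); assumes H and W are multiples of P."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Plain matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention(q, k, v):
    """Single-head scaled dot-product attention; returns (output, weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def predict_characters(features, head_weights, alphabet, max_chars):
    """Keep the first M positions and take the arg-max class at each one."""
    logits = matmul(features[:max_chars], head_weights)
    return "".join(
        alphabet[max(range(len(row)), key=row.__getitem__)] for row in logits
    )


n = num_patches(224, 224, 16)  # assumed input and patch sizes

# Toy 2-token example with d_k = 2.
q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
context, weights = attention(q, k, v)

# Toy head: 3 feature rows, M = 2 kept positions, 3-letter alphabet.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
head_weights = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]]
plate = predict_characters(features, head_weights, "ABC", max_chars=2)
```

Each row of `weights` is the attention filter for one token and sums to one, so every output row is a convex combination of the value rows.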
4. Experimental setup
This section describes the experimental analysis conducted to evaluate the performance of the proposed system. First, we outline the experimental protocol, including a brief description of the benchmark datasets used in our experiments and the implementation details of the proposed ALPR system.
4.1. Datasets
We validate the robustness of the proposed system by showing experimental results on the PGTLP-v2 dataset and five benchmark ALPR datasets: LSV-LP [Reference Wang, Lu, Zhang, Yuan and Li11], RodoSol-ALPR [Reference Laroca, Cardoso, Lucio, Estevam, Menotti, Farinella, Radeva and Bouatouch42], UFPR-ALPR [Reference Laroca, Severo, Zanlorensi, Oliveira, Gonçalves, Schwartz and Menotti12], CCPD [Reference Xu, Yang, Meng, Lu, Huang, Ying, Huang, Ferrari, Hebert, Sminchisescu and Weiss34], and AOLP [Reference Hsu, Chen and Chung43]. Image samples from the six datasets and their respective plates, extracted based on the provided ground truth annotations, are shown in Fig. 5. The specifications of the six datasets used in our experiments are listed in Table I.
$^\ast$ 170,999 images do not contain LPs. $^{\ast \ast }$ 47,055 LPs without transcription.
PGTLP-v2 is a Tunisian image dataset that counts $5$ k high-resolution images captured by cameras mounted on PGuard. Images were extracted from videos recorded while patrolling parking lots, entrance gates, roundabouts, and driveways. This dataset provides a large and diverse set of variations in templates, resolutions, and weather conditions. One particular feature of this dataset compared to existing datasets is that it contains several LPs per image. This dataset extends the previous version of the PGTLP dataset, which was initially introduced in [Reference Ismail, Mehri, Sahbani and Amara16] for LPD [Reference Ismail, Mehri, Sahbani and Amara44]. PGTLP-v2 includes three annotation levels: rectangular boxes for the LP, four vertices of the LP region, and labels for the full LP sequence (i.e. transcription). Figure 4 depicts a few samples from the PGTLP-v2 dataset.
LSV-LP is a Chinese large-scale video-based dataset recorded using driving recorders, street cameras, and mobile phones. The shooting locations include highways, streets, parking lots, and other scenes, covering $27$ provinces of mainland China. The LSV-LP dataset contains $1,402$ video clips of $300$ frames resulting in more than $400$ k frames and $364$ k LPs. The annotations include the bounding box around the vehicles, four vertices around the LP, and LP transcription [Reference Wang, Lu, Zhang, Yuan and Li11].
RodoSol-ALPR is a Brazilian image dataset collected by static cameras placed at pay tolls where the distance from the vehicle to the camera varies slightly. It contains $20$ k images that have a resolution of $1,280 \times 720$ pixels and are divided as follows: $5$ k images of cars with Brazilian LPs, $5$ k images of motorcycles with Brazilian LPs, $5$ k images of cars with Mercosur LPs, and $5$ k images of motorcycles with Mercosur LPs. The annotations cover the LP location and its related transcription [Reference Laroca, Cardoso, Lucio, Estevam, Menotti, Farinella, Radeva and Bouatouch42].
UFPR-ALPR includes $4,500$ images of $150$ Brazilian vehicles. It is divided into three subsets (training, validation, and test) containing $60$ , $30$ , and $60$ tracks, respectively. Each track has $30$ frames capturing the same vehicle. The UFPR-ALPR images have the following annotations: LP positions, LP numbers, and positions of their characters [Reference Laroca, Severo, Zanlorensi, Oliveira, Gonçalves, Schwartz and Menotti12].
CCPD (v2019) is a large-scale Chinese LP dataset that was collected from roadside parking using handheld devices. Hence, it exhibits strong variations in vehicle distance, shooting angle, lighting conditions, and image background. Images in CCPD have at most one annotated LP in the foreground. We used the latest available version, released in 2019. The entire dataset counts $355,003$ images. Annotations in CCPD mainly comprise LP numbers, LP bounding boxes, the locations of the four vertices, and tilt degrees [Reference Xu, Yang, Meng, Lu, Huang, Ying, Huang, Ferrari, Hebert, Sminchisescu and Weiss34].
AOLP is a widely used benchmark for evaluating LPR approaches. This dataset contains $2,049$ images of Taiwanese LPs. It is divided into three subsets: access control (AC), law enforcement (LE), and road patrol (RP). The subsets contain $681$ , $757$ , and $611$ images, respectively. The RP subset is considered the most challenging since it has many samples with oblique LPs. Ground-truth annotations contain coordinates of LP bounding boxes and their sequence labels [Reference Hsu, Chen and Chung43].
4.2. Evaluation protocol
All the experiments were conducted on a desktop with an Intel® Xeon® CPU E5-2660 v3 @ 2.60 GHz (20 cores), 32 GB DDR4 RAM, and one Nvidia® TITAN® RTX™ GPU with 24 GB GDDR6 RAM. Table II presents the number of images and LPs used for training, testing, and validation in each dataset used in our experiments.
To evaluate the performance of the LP detection stage, we computed precision and recall metrics. A detected LP is considered correct if the intersection over union (IoU) between the detection and the ground truth is greater than $0.5$ ( $0.7$ for the CCPD dataset [Reference Xu, Yang, Meng, Lu, Huang, Ying, Huang, Ferrari, Hebert, Sminchisescu and Weiss34]). To evaluate the performance of the LP recognition stage, we calculate the LPR rate (LP-RR) [Reference Kessentini, Besbes, Ammar and Chabbouh20]. LP-RR represents the proportion of correctly recognized LPs over all LPs in the dataset (see Equation 4). A prediction is considered correct if and only if all LP characters are correctly recognized.
(4) \begin{equation} \text{LP-RR}=\frac{\text{number of correctly recognized LPs}}{\text{total number of LPs}} \end{equation}
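The two criteria above can be expressed as a short Python sketch. Boxes are assumed to be axis-aligned `(x1, y1, x2, y2)` tuples, and the function names are illustrative rather than taken from the authors' evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_correct_detection(pred_box, gt_box, threshold=0.5):
    """A detection counts as correct when its IoU exceeds the threshold."""
    return iou(pred_box, gt_box) > threshold


def lp_recognition_rate(predictions, ground_truths):
    """Share of plates whose full character sequence matches exactly."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground truths must align one to one")
    if not ground_truths:
        return 0.0
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)


# Two boxes overlapping on half their width: IoU = 50 / 150.
half_overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
# One of the two toy plates differs in its last character.
rate = lp_recognition_rate(["AB123", "CD456"], ["AB123", "CD450"])
```

Passing `threshold=0.7` to `is_correct_detection` reproduces the stricter criterion stated for CCPD.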
To evaluate the performance of the LP recognition stage on the LSV-LP dataset, we computed two metrics: Accuracy_6C and Accuracy_7C, as defined in the original protocol by Wang et al. [Reference Wang, Lu, Zhang, Yuan and Li11]. Accuracy_6C measures the capability of the model to correctly recognize the last six characters (i.e. we exclude the first character, which represents the Chinese region code). Accuracy_7C measures the model’s ability to correctly recognize all seven characters of an LP (i.e. we include the first character, which represents the Chinese region code).
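A minimal Python sketch of these two metrics is given below. It assumes that each plate is a seven-character string and that a prediction whose length differs from the ground truth counts as wrong under both metrics; this handling of length mismatches, the function names, and the sample plates are assumptions, not details from the LSV-LP protocol.

```python
def _match_rate(predictions, ground_truths, keep_last=None):
    """Exact-match rate, optionally restricted to the last `keep_last` characters."""
    if len(predictions) != len(ground_truths):
        raise ValueError("predictions and ground truths must align one to one")
    if not ground_truths:
        return 0.0

    def tail(text):
        return text if keep_last is None else text[-keep_last:]

    hits = sum(
        len(p) == len(g) and tail(p) == tail(g)
        for p, g in zip(predictions, ground_truths)
    )
    return hits / len(ground_truths)


def accuracy_7c(predictions, ground_truths):
    """All seven characters, including the leading region character, must match."""
    return _match_rate(predictions, ground_truths)


def accuracy_6c(predictions, ground_truths):
    """Only the last six characters are compared; the region character is ignored."""
    return _match_rate(predictions, ground_truths, keep_last=6)


# Illustrative plates: the second prediction gets only the region character wrong.
predicted = ["皖A12345", "京B67890"]
expected = ["皖A12345", "沪B67890"]
acc7 = accuracy_7c(predicted, expected)
acc6 = accuracy_6c(predicted, expected)
```

The gap between the two values isolates errors made on the region character alone.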
In our ALPR system, we use ViTLPR, which has similar properties to the base version of the vision transformer (ViT-Base [Reference Dosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit and Houlsby15]). Table III lists the architectural parameters of ViTLPR. Parameter $M$ is selected based on the number of LP characters in each dataset.
The training setup of the two stages of the proposed ALPR system is presented in Table IV. To train the LPD model YOLOv7x, we used the following setup: epochs = 40, batch size = 16, optimizer = SGD, and initial learning rate = 0.01. Since ViT-based models require large data volumes to perform well, we initialized the ViTLPR encoder with the weights provided by Touvron et al. [Reference Touvron, Cord, Douze, Massa, Sablayrolles, Jegou, Meila and Zhang45] and then fine-tuned it on the downstream datasets. We applied a few data augmentation techniques [Reference Cubuk, Zoph, Shlens and Le46] to raw images. The selected augmentation functions are projective distortion, blurring, noise, and rotation. To train ViTLPR, we only used LP regions as input and their character sequences as labels.
5. Results
In this section, we discuss the results achieved by the selected baselines and our ALPR modules (YOLOv7x-ViTLPR) on each dataset individually.
5.1. PGTLP-v2
For the PGTLP-v2 dataset, we partitioned our dataset into training and testing sets (80/20): 4k/1k images for LPD, resulting accordingly in $5,796$ / $1,448$ LPs for LPR. The images in the testing set were carefully selected while preserving the original ratios of the image distribution. In Table V, the precision/recall and recognition rates are presented in terms of LPD and LPR, respectively.
In terms of LPD, we note that YOLOv7x achieves the highest performance. In particular, YOLOv7x outperforms the WPOD-NET model (ranked second) by $2.56\%$ and $2.09\%$ on precision and recall, respectively. The precision and recall rates achieved by YOLOv7x are $94.60\%$ and $91.20\%$ , respectively, indicating its ability to accurately detect LPs with few false positives.
In terms of LPR, ViTLPR is able to recognize $82.32\%$ of $1,448$ LPs, outperforming the baselines with a margin of $8.68\%$ over the second-best model (DAN). Although CRNN with its ResNet-18 backbone and AttentionOCR are the smallest models, they reached only $47.31\%$ and $66.16\%$ of LP-RR, respectively. This result supports the robustness of ViTLPR in handling LPR in the wild. The recognition results on PGTLP-v2 are shown in Fig. 6(a).
5.2. LSV-LP
For the LSV-LP dataset, we followed the experimental protocol provided by Wang et al. [Reference Wang, Lu, Zhang, Yuan and Li11]. However, we observed that there are frames without LPs ($170,999$ frames) and LPs annotated with the placeholder “##-#####” ($43,613$ LPs). After removing them, the new splits are presented in Table II. We recall that since the LSV-LP dataset is divided into three subsets (S2M, M2S, and M2M), each subset was processed separately and has its own training, validation, and test splits. In Table VI, the precision/recall and recognition rates are presented in terms of LPD and LPR, respectively.
In terms of LPD, we note that YOLOv7x outperforms all the baselines in both precision and recall rates. Compared with WPOD-NET, ranked second, YOLOv7x achieves precision and recall gains of $9.38\%$ and $3.97\%$, respectively. In particular, YOLOv7x reports the best performance ($85.00\%$ precision, $87.40\%$ recall) on the Move vs. Move subset (M2M), which indicates its strength in detecting LPs when both vehicles and cameras are moving. Through visual inspection, we found that images in the test set contain LP instances that are missing from the ground truth. These are mainly counted as false positives, which lowers the precision rates.
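The cleaning step described above can be sketched as follows. This is only an illustration under stated assumptions: annotations are represented as a hypothetical mapping from a frame identifier to the list of LP strings in that frame (the actual LSV-LP annotation format is richer), and a frame left with no valid LP after dropping placeholders is also discarded.

```python
PLACEHOLDER = "##-#####"  # placeholder label mentioned in the text


def clean_annotations(frames: dict[str, list[str]]) -> dict[str, list[str]]:
    """Drop placeholder LPs, then drop frames left without any LP."""
    cleaned: dict[str, list[str]] = {}
    for frame_id, plates in frames.items():
        kept = [plate for plate in plates if plate != PLACEHOLDER]
        if kept:  # skip frames with no remaining LP
            cleaned[frame_id] = kept
    return cleaned


# Hypothetical frames: one empty, one placeholder-only, one mixed.
frames = {
    "frame_001": [],
    "frame_002": ["##-#####"],
    "frame_003": ["A-12345", "##-#####"],
}
print(clean_annotations(frames))
```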
In terms of LPR, ViTLPR outperforms three models (AttentionOCR [Reference Wojna, Gorban, Lee, Murphy, Yu, Li and Ibarz49], CRNN [Reference Shi, Bai and Yao47], and DAN [Reference Wang, Zhu, Jin, Luo, Chen, Wu, Wang and Cai51]) mainly designed for scene text recognition, as well as the initial LSV-LP baseline, MFLPR-Net [Reference Wang, Lu, Zhang, Yuan and Li11]. It is worth mentioning that MFLPR-Net is similar to DAN and applies LP orientation correction through affine transformation. Although the move vs. static subset (M2S) is considered the most challenging since it contains distorted LPs, ViTLPR is able to correctly recognize $14,120$ LPs out of $19,103$ LPs ( $73.92\%$ ). Furthermore, ViTLPR shows robustness against Chinese characters compared to the considered models, with only a minor decrease ( $1.06\%$ ) in accuracy when recognizing entire LPs.
5.3. RodoSol-ALPR
For the RodoSol-ALPR dataset, we used the standard split defined by Laroca et al. [Reference Laroca, Cardoso, Lucio, Estevam, Menotti, Farinella, Radeva and Bouatouch42]. We note that images in RodoSol-ALPR feature a single LP instance. In Table VII, the precision/recall and recognition rates are presented in terms of LPD and LPR, respectively.
In terms of LPD, YOLOv7x reports the highest precision and recall rates, $99.90\%$ and $99.90\%$, respectively. The high precision and recall values reflect the fact that YOLOv7x is able to correctly detect the single LP in each image with almost no false positives or false negatives.
In terms of LPR, ViTLPR achieves the highest LP-RR ($75.00\%$), while AttentionOCR places second. Out of $8$k LPs, the ViTLPR model is able to correctly recognize $6,014$ LPs. Meanwhile, AttentionOCR and DAN recognized $5,682$ ($71.03\%$) and $5,458$ ($68.23\%$) LPs, respectively. The CRNN-18 model has the lowest LP-RR ($52.46\%$), meaning that only around half of the LPs are correctly recognized. It is worth mentioning that in the testing set, $4$k LPs are two-line, which is challenging for CRNN-based models.
5.4. UFPR-ALPR
For the UFPR-ALPR dataset, we followed the standard split defined by Laroca et al. [Reference Laroca, Severo, Zanlorensi, Oliveira, Gonçalves, Schwartz and Menotti12]. We observe that each image has one LP instance. In Table VIII, the precision/recall and recognition rates are presented in terms of LPD and LPR, respectively.
In terms of LPD, the results show that YOLOv7x has the best precision rate compared to the baselines, reaching a precision of $89.60\%$. Compared to the second-ranked model (WPOD-NET), there is a gain of $3.08\%$. Through a visual inspection of the detection results, we note that YOLOv7x detects some background objects as LPs (they are quite similar to LPs), which leads to false detections (i.e. a limited precision rate). Regarding recall, YOLOv7x is ranked second, losing to WPOD-NET by a margin of $0.13\%$. The relatively low recall ($82.80\%$) confirms that YOLOv7x fails to detect some LPs, resulting in false negatives.
In terms of LPR, ViTLPR correctly recognizes $94.00\%$ of LPs, outperforming Sighthound and OpenALPR by $31.00\%$ and $11.80\%$ , respectively. Additionally, ViTLPR achieves competitive results compared to the two baselines proposed by Laroca et al. [Reference Laroca, Severo, Zanlorensi, Oliveira, Gonçalves, Schwartz and Menotti12,Reference Laroca, Zanlorensi, Gonçalves, Todt, Schwartz and Menotti55], with respective margins of $29.11\%$ and $4.00\%$ . Notably, although the UFPR-ALPR dataset contains motorcycle images (two-line LPs), ViTLPR remains robust. The recognition results on UFPR-ALPR are shown in Fig. 6(b).
5.5. CCPD
For the CCPD dataset, we used the standard split available with the last version (2019) to guarantee fair comparison. To the best of our knowledge, no one has reported results on the currently available CCPD version. In Table IX, the precision and recognition rates are presented in terms of LPD and LPR, respectively.
In terms of LPD, YOLOv7x achieves an average precision of $98.96\%$. YOLOv7x is able to outperform the baselines for all the subsets in the CCPD dataset. In particular, YOLOv7x shows high precision results ($100.00\%$) on the Base-Test and Weather subsets with zero false positives. We notice that the DB subset seems to be challenging for all baselines, including YOLOv7x, which reaches a precision of only $96.80\%$. This might be explained by the uneven illumination conditions in this subset.
In terms of LPR, we compared the proposed ViTLPR with three widely adopted models in scene text recognition: CRNN [Reference Shi, Bai and Yao47], DAN [Reference Wang, Zhu, Jin, Luo, Chen, Wu, Wang and Cai51], and AttentionOCR [Reference Wojna, Gorban, Lee, Murphy, Yu, Li and Ibarz49]. The CRNN-18 and CRNN-50 models employ ResNet-18 and ResNet-50 as their respective CNN backbones for feature extraction [Reference He, Zhang, Ren and Sun56]. The recognition rates are listed in Table IX. ViTLPR shows top-tier LP-RR results across all subsets of CCPD with an average LP-RR of $78.06\%$. In particular, it is able to handle challenging conditions, such as extreme angles ($78.40\%$) and blur ($51.71\%$). On the CCPD-DB subset, where LPs have dark or extremely bright illumination, all models show poor performance, with ViTLPR being relatively the best ($60.25\%$). The same holds for the CCPD-Blur subset, where CRNN-18 is able to recognize only $333$ out of $20,611$ LPs and ViTLPR reaches an LP-RR of $51.71\%$. This suggests room for improvement under difficult illumination and blur conditions. The CCPD-Weather subset, where rain, snow, or fog is present, does not seem to be challenging for any of the models. The reported results indicate that ViTLPR is able to handle various weather conditions. Qualitative recognition instances from the CCPD-Challenge subset are shown in Fig. 6(c).
5.6. AOLP
For the AOLP dataset, we trained the models on any two subsets and tested them on the remaining subset. In Table X, the precision and recognition rates are presented in terms of LPD and LPR, respectively.
In terms of LPD, compared to the baselines, YOLOv7x detector achieves the best precision on LE and RP subsets. On the AC subset, WPOD-NET [Reference Silva, Jung, Ferrari, Hebert, Sminchisescu and Weiss7] has the highest precision with a gap of $0.47\%$ to YOLOv7x (ranked second).
In terms of LPR, we compared the performance of ViTLPR with three baselines: Bi-LSTM + CTC [Reference Li, Wang and Shen18], CRNN [Reference Kessentini, Besbes, Ammar and Chabbouh20], and Bi-LSTM + 1D-Attention [Reference Zou, Zhang, Yan, Jiang, Huang, Fan and Cui57]. These baselines share the same key building block: an LSTM module followed by either a CTC layer or a 1D-attention module. As shown in Table X, ViTLPR has the best LPR performance ( $97.19\%$ ) compared to all baselines. The overall recognition accuracy is increased by $1.53\%$ compared to the second-best results [Reference Zou, Zhang, Yan, Jiang, Huang, Fan and Cui57]. In particular, given that the RP subset is the most challenging (most LPs are distorted), ViTLPR achieves the highest performance gain ( $2.35\%$ ). This confirms the robustness of ViTLPR in successfully identifying irregular LPs. A recognition result achieved on a sample of the AOLP dataset is shown in Fig. 6(d).
6. Discussion
6.1. Inference optimization process
To make the proposed ALPR system suitable for efficient inference and to deploy it on PGuard, we propose an optimization strategy. For this purpose, we used TensorRT, in particular, Torch-TensorRT, a compiler that allows us to export TensorRT-accelerated engines. The proposed optimization strategy consists of the following steps.
Architectural constraints: Due to its architecture, ViTLPR cannot be directly supported by TensorRT. Therefore, several modifications were applied to the model design, following the guidelines proposed in [Reference Xia, Li, Wu, Wang, Wang, Xiao, Zheng and Wang58], to create a TensorRT-compatible engine. The modifications adopted are as follows: (1) reduce the number of heads in the MHSA layer from $12$ to $3$, (2) reduce the number of encoder blocks from $16$ to $8$, and (3) add a ResNet bottleneck block after the encoder block to form a new encoder block. We note that while reducing the number of heads and encoder blocks results in a decrease in performance, adding a bottleneck block increases efficiency [Reference Xia, Li, Wu, Wang, Wang, Xiao, Zheng and Wang58].
Implementation constraints: We retrained ViTLPR using quantization-aware training [Reference Nagel, Fournarakis, Amjad, Bondarenko, Van Baalen and Blankevoort59] so that it can be quantized at the inference stage. After training, ViTLPR was converted into an engine using TensorRT. Three different precision levels (FP32, FP16, and INT8) were examined for the weights. The final outcome is an optimized engine file ready for runtime execution. After making the aforementioned adjustments, the performance of ViTLPR is re-assessed in terms of accuracy and latency, and the results are presented in Table XI.
Observations: In what follows, we analyze the results of the proposed ALPR system (YOLOv7x+ViTLPR) on the PGTLP-v2 dataset under different optimization settings. As expected, a decrease in accuracy for both the LPD (YOLOv7x) and LPR (ViTLPR) models is observed when the precision level is reduced. For YOLOv7x, the FP32 and FP16 precision levels bring significant improvements in speed ($59$/$162$ FPS) but shift the precision/recall balance: the precision drops to $71.41\%$, while the recall increases to $96.56\%$. This suggests that while YOLOv7x finds more of the LPs, it also produces more false detections. The INT8 optimization provides a slight precision improvement ($71.90\%$) over FP32/FP16 but at the cost of a reduced ability to detect all LPs (more missed LPs). As with YOLOv7x, optimization of ViTLPR enhances speed at the cost of accuracy. Each successive precision level (FP32, FP16, and INT8) increases the speed significantly ($26$, $54$, and $107$ FPS, respectively), but this comes with a gradual decline in LP-RR ($80.32\%$, $79.44\%$, and $77.64\%$). In particular, with INT8 optimization, despite its drop in accuracy ($77.64\%$), ViTLPR still performs better than the baselines (see Table V) at an improved speed of $107$ FPS.
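To give an intuition for what INT8 precision implies for the weights, the sketch below applies symmetric per-tensor quantization followed by dequantization, the "fake quantization" round trip that quantization-aware training typically simulates during the forward pass. It is a generic illustration in plain Python, not the scheme used by TensorRT or by our training code; the function names and sample values are assumptions.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization to the signed range [-127, 127]."""
    max_abs = max((abs(v) for v in values), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0  # real value of one integer step
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Map integer codes back to approximate real values."""
    return [code * scale for code in codes]


def fake_quantize(values: list[float]) -> list[float]:
    """Quantize then dequantize, exposing the rounding error to the model."""
    codes, scale = quantize_int8(values)
    return dequantize(codes, scale)


# Hypothetical weight values, for illustration only.
weights = [-1.27, 0.0, 0.5, 1.27]
codes, scale = quantize_int8(weights)
print(codes, scale, fake_quantize(weights))
```

The rounding error of each weight is bounded by half a quantization step, which is one way to see why INT8 engines trade a small amount of accuracy for speed.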
To reflect the real-world deployment of the two deep models combined in sequence in the proposed ALPR system, we use only the LPs detected by YOLOv7x as input to ViTLPR. To measure the accuracy of the proposed ALPR system, we compute the ACC metric, defined as the number of correctly recognized LPs divided by the number of annotated LPs. When the models are used in their native forms, ViTLPR accurately recognizes $1,091$ out of the $1,326$ LPs that are correctly detected by YOLOv7x, resulting in a system accuracy of $75.07\%$. Interestingly, with FP32 optimization, the system achieves the best accuracy ($77.56\%$), even though ViTLPR does not have the best LP-RR. This can be explained by the higher recall of YOLOv7x ($96.56\%$).
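The way detection misses propagate into the end-to-end ACC metric can be sketched as follows. In this illustration, each annotated LP is paired with the text produced by the recognizer, or with `None` when the detector missed that LP; this pairing, the function name, and the sample data are assumptions made for the example.

```python
from typing import Optional


def system_accuracy(annotated: list[str],
                    recognized: list[Optional[str]]) -> float:
    """Correctly recognized LPs divided by annotated LPs, in percent.

    recognized[i] is None when the detector missed annotated LP i.
    """
    if len(annotated) != len(recognized):
        raise ValueError("one recognizer output (or None) is expected per annotated LP")
    if not annotated:
        return 0.0
    correct = sum(
        1 for truth, pred in zip(annotated, recognized)
        if pred is not None and pred == truth
    )
    # Missed detections stay in the denominator, so they lower ACC.
    return 100.0 * correct / len(annotated)


# Hypothetical outcome: one missed detection and one misread LP out of four.
annotated = ["AB123", "CD456", "EF789", "GH012"]
recognized = ["AB123", None, "EF788", "GH012"]
print(f"ACC = {system_accuracy(annotated, recognized):.2f}%")
```

Since undetected LPs count against the system, a detector with higher recall can raise ACC even when the recognizer itself is slightly less accurate.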
6.2. Cross-dataset validation protocol
To evaluate the robustness and generalizability of the proposed system across different datasets, additional experiments were conducted using the Try-One-Dataset-Out validation protocol, where one dataset was held out at a time as unseen data for testing. The testing subsets of the held-out datasets are used to evaluate performance, providing insight into the models’ ability to handle data from various sources. In Table XII, the recall and recognition rates are presented in terms of LPD and LPR, respectively.
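The Try-One-Dataset-Out protocol can be sketched as a simple generator of training/testing folds. The dataset list below assumes that all six datasets discussed in this article take part in the protocol, and the function name is hypothetical.

```python
from typing import Iterator

# Assumed pool of datasets (names as used in this article).
DATASETS = ["PGTLP-v2", "LSV-LP", "RodoSol-ALPR", "UFPR-ALPR", "CCPD", "AOLP"]


def try_one_dataset_out(names: list[str]) -> Iterator[tuple[list[str], str]]:
    """Yield (training datasets, held-out dataset) for each dataset in turn."""
    for held_out in names:
        training = [name for name in names if name != held_out]
        yield training, held_out


for training, held_out in try_one_dataset_out(DATASETS):
    # Models would be trained on `training` and evaluated on the testing
    # subset of `held_out`, which they have never seen.
    print(f"train on {training} -> test on {held_out}")
```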
Regarding the LPD stage, the results show no significant impact on the recall rates when using the Try-One-Dataset-Out validation protocol compared to the traditional split. For the PGTLP-v2 dataset, a slight decrease of 0.98% is noted. This may be attributed to the nature of Tunisian LPs, as they are not represented in the other datasets. In contrast, the LSV-LP and UFPR-ALPR datasets show an increase in recall of 0.25% and 0.95%, respectively. This may be due to highly similar instances of LPs in the CCPD and RodoSol-ALPR datasets. Overall, these results suggest that deep models trained for LPD on various datasets are able to perform reliably on images from datasets not previously encountered.
Regarding the LPR stage, we observe a decrease in recognition rates across all datasets when using the Try-One-Dataset-Out validation protocol compared to the traditional split. For instance, the LP-RR drops from 82.32% to 54.14% (−28.18%) on the PGTLP-v2 testing set. Furthermore, AOLP is the most affected dataset, with a significant performance drop of 44.05%. This decline suggests that ViTLPR’s ability to generalize across different datasets is limited. When trained on all but one dataset, the model does not perform as well on the unseen dataset as when it is trained and tested on the same dataset.
6.3. Deblurring step
To address the issue of motion blur caused by the movement of both the robot and vehicles, a thorough set of experiments was conducted on three different datasets by adding a deblurring module as a pre-processing step in the proposed system. The goal is to investigate the impact of a deblurring step on both the recognition accuracy of the proposed system and its inference time.
In this work, we used two recent deep models: LaKDNet [Reference Ruan, Bemana, Seidel, Myszkowski and Chen60] and NAFNet [Reference Chen, Chu, Zhang, Sun, Avidan, Brostow, Cissé, Farinella and Hassner61], both designed for image deblurring. LaKDNet is a UNet-like lightweight CNN model with a special block, LaKD, that uses large depth-wise convolutions, providing a larger effective receptive field. On the other hand, NAFNet is a non-linear activation-free model that was built by investigating state-of-the-art models through empirical evaluation and integrating them. Pretrained weights were used for both models since they require paired images for training.
Figure 7 illustrates the qualitative results of the detected LPs without a deblurring step (w/o) and with a deblurring module applied using the LaKDNet model [Reference Ruan, Bemana, Seidel, Myszkowski and Chen60] (ViTLPR w/ LaKDNet) and the NAFNet model [Reference Chen, Chu, Zhang, Sun, Avidan, Brostow, Cissé, Farinella and Hassner61] (ViTLPR w/ NAFNet). We note that the two models demonstrate significant improvement in deblurring in the context of LPR.
We applied these two deep models to images from PGTLP-v2 and two subsets from LSV-LP and CCPD, which are predominantly blurry. Table XIII presents a comparison of the results obtained with and without the deblurring step. Both models have a positive impact on the recognition rates. In particular, NAFNet increases the LP-RR by 7.64% and 1.93% on the CCPD-Blur subset and the PGTLP-v2 testing set, respectively. Both models have only a minor impact on the LSV-LP (M2M) subset. For instance, LaKDNet improves the LP-RR by only 0.06%. Although both deblurring modules improve recognition accuracy, an increase in inference time is observed, with LaKDNet adding 0.6239 s and NAFNet 0.0793 s. NAFNet has lower latency due to its lightweight design, achieving a better accuracy/latency balance. To address the increased inference time due to the deblurring step, an interesting alternative could be integrating hardware-accelerated deblurring techniques into the PGuard security payload, which could improve recognition accuracy while keeping inference fast.
7. Conclusion and further work
In this article, we tackle the problem of ALPR in the wild. We present a dual-stage ALPR system designed for a mobile security robot named PGuard. The proposed system provides 7/7 monitoring, quick response, and accurate identification of vehicles, thus improving security measures. It integrates the off-the-shelf YOLOv7x model for LPD with a novel LPR model, called ViTLPR. ViTLPR predicts LP numbers using a self-attention mechanism. Extensive experiments show that our system achieves competitive performance on different benchmark datasets using a simple encoder model without any pre-/post-processing steps. In particular, ViTLPR consistently outperforms its CNN/RNN-based counterparts by a clear margin. The second major contribution is an optimization strategy for ViTLPR using TensorRT. ViTLPR is optimized with different precision levels (FP32, FP16, and INT8) to reduce latency and improve throughput, resulting in a faster inference speed. As a supplement, we introduce an updated version of the PGTLP dataset, PGTLP-v2, which contains $5$k annotated images of Tunisian LPs collected using PGuard and is made available to researchers upon request.
For future studies, there are essentially two main research directions. (i) A major bottleneck in ALPR systems is their low performance at night, largely due to the lack of nighttime data. Thermal-to-visible domain adaptation for ALPR is a potential solution that leverages the robot’s thermal camera [Reference Marnissi, Fradi, Sahbani and Amara62,Reference Marnissi, Fradi, Sahbani and Amara63]. (ii) ALPR systems, like our system, commonly contain two stages (LPD and LPR). A unified one-stage module that simultaneously handles LPD and LPR using an attention mechanism may reduce computational costs.
Author contributions
Conceptualization and software, A.I.; methodology, data curation, investigation, formal analysis, visualization, and writing – original draft preparation, A.I. and M.M.; formal analysis and writing – review and editing, A.S. and N.EBA.; supervision, project administration, and funding acquisition, N.EBA. All authors have read and agreed to the published version of the manuscript.
Financial support
This work has been supported by the MOBIDOC scheme, funded by the EU through the EMORI program and managed by the ANPR, which is gratefully acknowledged. It is also carried out under the research results valorization program (VRR) of the Tunisian Ministry of Higher Education and Scientific Research, which is gratefully acknowledged.
Competing interests
The authors declare that no conflicts of interest exist.
Ethical approval
Not applicable.
Research data policy and data availability
The LSV-LP, RodoSol-ALPR, UFPR-ALPR, CCPD, and AOLP datasets used in this article are publicly available (some upon request).
LSV-LP: https://github.com/Forest-art/LSV-LP
RodoSol-ALPR: https://github.com/raysonlaroca/rodosol-alpr-dataset
UFPR-ALPR: https://github.com/raysonlaroca/ufpr-alpr-dataset
CCPD: https://github.com/detectRecog/CCPD
AOLP: https://github.com/AvLab-CV/AOLP
The PGTLP-v2 dataset used in this article is available on request for research purposes and scientific use.