Trends in Computer Science and Information Technology
1Department of Electronics and Information Technologies, Faculty of Architecture and Engineering, Nakhchivan State University, AZ 7012, Nakhchivan, Azerbaijan
2Department of Computer Engineering, Faculty of Engineering, Igdir University, 76000, Igdir, Turkey
3Department of Computer Engineering, Faculty of Engineering and Architecture, Fenerbahçe University, Istanbul, Turkey
Cite this as
Zeynalov J, Cakmak Y, Pacal I, Aliyev M. Deep Learning for Tomato Leaf Disease Classification: Comparative Benchmarking of CNN and Vision Transformer Architectures. Trends Comput Sci Inf Technol. 2026;11(1):027-034. Available from: 10.17352/tcsit.000106
Copyright License
© 2026 Zeynalov J, et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Tomato production is highly vulnerable to foliar diseases that can reduce yield, increase management costs, and complicate timely intervention. Automated image-based diagnosis has therefore become an important research direction for precision agriculture. In this study, we present a comparative evaluation of four representative deep learning backbones for tomato leaf disease classification on the PlantVillage dataset: EfficientNetV2-S, ConvNeXt-Base, DeiT3-Base, and Swin-Base. The dataset comprised 18,160 images from ten classes, including nine disease categories and one healthy class, and was divided into training, validation, and test sets using a 70:15:15 split. All models were trained under a standardized transfer learning pipeline with identical preprocessing, augmentation, and optimization settings to enable a fair comparison across architectures. Performance was assessed using accuracy, precision, recall, F1-score, parameter count, and GFLOPs. All evaluated models achieved very high classification performance, with test accuracies of at least 0.9985. Among them, Swin-Base yielded the best overall predictive performance, reaching an accuracy of 0.9989 and an F1-score of 0.9987. In contrast, EfficientNetV2-S provided the most favorable efficiency profile, achieving 0.9985 accuracy with only 20.19 million parameters and 5.4193 GFLOPs. These findings indicate that both convolutional and transformer-based models can deliver highly reliable tomato leaf disease classification under controlled benchmark conditions, while the final model choice should be guided by the application scenario.
Swin-Base is preferable when maximum predictive performance is prioritized, whereas EfficientNetV2-S offers a more practical option for computationally constrained deployments.
As a vital component of global food security, the tomato (Solanum lycopersicum) ranks as one of the most widely cultivated and consumed horticultural crops, with global yields exceeding 186 million tons in 2020 [1,2]. Beyond their immense economic footprint, tomatoes deliver essential vitamins, antioxidants, and nutrients that are crucial to human diets worldwide. Nevertheless, the cultivation of this critical crop faces constant threats from a variety of bacterial, fungal, and viral pathogens, which frequently culminate in substantial financial deficits [3-5]. Viral infections alone, for instance, are responsible for an estimated annual yield reduction of 2% to 10%, while severe outbreaks can destroy entire harvests [6,7].
Historically, identifying these phytopathological threats has relied heavily on visual inspection by farmers or agricultural experts [8,9]. This traditional approach, however, is inherently flawed; it is not only labor-intensive but also highly subjective and susceptible to human error, particularly when disease symptoms are faint or closely resemble nutritional disorders and other localized infections [10-12]. Because visual assessments are intrinsically tied to the varying expertise levels of the observer, they frequently lead to incorrect or delayed conclusions. Such diagnostic failures cascade into severe consequences, including the over-application of agrochemicals, escalating financial burdens for growers, and significant ecological harm, thereby highlighting an urgent necessity for rapid, precise, and fully automated disease detection systems [13-15].
Deep learning has emerged as one of the most influential paradigms in artificial intelligence [16], demonstrating remarkable success across a wide range of domains, including medical image analysis, object recognition, scene understanding, industrial inspection, and decision-support systems [17-19]. Its ability to automatically learn discriminative and hierarchical feature representations directly from raw data has substantially reduced the need for handcrafted feature engineering and has enabled major advances in complex pattern recognition tasks [20-24]. Recently, the integration of deep learning methodologies [25,26] into automated plant disease detection has gained considerable momentum, offering a robust and viable solution for disease identification and categorization [27-29]. Within this domain, Convolutional Neural Networks (CNNs) [30,31] have proven exceptionally effective due to their inherent capacity to autonomously derive and process hierarchical features straight from intricate visual data [32,33]. Concurrently, the emergence of Vision Transformers (ViTs), driven by sophisticated self-attention mechanisms, has redefined the benchmarks of state-of-the-art performance across broader computer vision tasks, and these architectures are now being heavily adapted to enhance diagnostic precision within smart agriculture [34,35].
Motivated by these advancements, this study proposes and rigorously evaluates a holistic deep learning framework engineered to automatically classify ten distinct tomato leaf diseases utilizing the widely accessible Plant Village dataset [36]. We conduct a comprehensive performance benchmarking across four cutting-edge neural network architectures: the efficiency-optimized EfficientNetV2-S, the advanced pure transformer DeiT3-Base, the modernized convolutional network ConvNeXt-Base, and the hierarchical vision transformer Swin-Base. The diagnostic efficacy of each model is systematically assessed utilizing standard classification metrics, namely accuracy, precision, recall, and F1-score. Ultimately, by facilitating the creation of accessible, highly accurate, and automated disease management instruments, the insights generated from this research strive to bolster global food security and promote sustainable agricultural practices.
To provide agricultural workers with an accessible diagnostic utility, Kebir, et al. [37] engineered a highly efficient, bespoke 20-layer Convolutional Neural Network (CNN). By training and validating this model on the Plant Village repository, the researchers successfully classified ten distinct foliage-based tomato pathologies. Ultimately, this streamlined architectural approach delivered a commendable diagnostic accuracy of 97.5%, establishing itself as a highly reliable and practical instrument for targeted disease intervention in farming communities.
Shifting the focus to a previously under-researched pathogen within the deep learning literature, Vásconez, et al. [38] targeted the visual manifestations of bacterial wilt (Ralstonia solanacearum). The authors conducted an extensive comparative assessment of fourteen distinct CNN architectures utilizing a proprietary dataset annotated for varying levels of disease severity. Their comprehensive analysis revealed that MobileNet-v2 and Xception yielded the most optimal results, each attaining an accuracy of 97.7% while maintaining an excellent equilibrium between computational speed and diagnostic precision.
To simultaneously capture intricate local details and broader global contexts within foliar images, Chen, et al. [39] devised an innovative hybrid framework that synergizes Convolutional Neural Networks (CNNs) with Transformer-based architectures. Furthermore, the researchers tackled inherent dataset limitations, such as severe class imbalances and data scarcity, by implementing a modified cycle-consistent generative adversarial network (CyTrGAN) to augment their training data. This dual-faceted strategy produced a highly compact yet exceptionally accurate model, registering a 99.45% success rate on the Plant Village dataset.
Similarly exploring architectural fusion, Tiwari, et al. [40] proposed a sophisticated hybrid system integrating a Vision Transformer (ViT) with a traditional Deep Neural Network (DNN) to bolster both the transparency and precision of tomato disease detection. The fundamental innovation in their research lies in an upgraded multi-head self-attention module integrated with an L1-norm attention mechanism, which empowers the network to isolate and prioritize essential morphological traits with greater efficacy. Consequently, this advanced configuration outpaced numerous contemporary methodologies, securing an outstanding accuracy of 99.74% across an extensive experimental dataset.
Addressing the frequent drop in Transformer model efficacy when deployed in uncontrolled, real-world environments, Shehu, et al. [41] formulated three distinct transfer learning paradigms utilizing Vision Transformers (ViTs). To rigorously evaluate the adaptability of their approach across fluctuating environmental parameters, the models were benchmarked against both the standardized Plant Village repository and a novel, field-acquired dataset designated as Tomato Ebola. Empirical outcomes confirmed that their ViT-Base variant not only achieved a 99.17% accuracy on the baseline dataset but also exhibited vastly superior resilience and generalization on the field-collected images when contrasted with alternative deep learning baselines.
Finally, pushing the boundaries of data efficiency, Sun, et al. [42] introduced a pioneering identification framework dubbed EMA-DeiT, which amalgamates a Data-efficient Image Transformer (DeiT) with self-distillation protocols and an exponential moving average (EMA) mechanism. By harnessing these advanced deep learning strategies, the proposed methodology substantially augmented both the stability and the diagnostic accuracy of the network. Ultimately, their architecture yielded exemplary results, realizing a 99.6% accuracy rate on the Plant Village benchmark while concurrently demonstrating remarkable adaptability across an array of heterogeneous datasets.
This empirical investigation leveraged the widely recognized, open-access PlantVillage repository, a comprehensive database featuring an extensive collection of both pathological and healthy foliage imagery [43]. For the purposes of this study, our analysis was strictly delimited to the tomato leaf subset, which encompasses ten distinct categories: nine specific pathogenic afflictions and a single healthy control group. To visually contextualize the morphological variations and symptomatic expressions inherent to these classifications, a curated selection of representative images is provided in Figure 1. Furthermore, to ensure a robust and unbiased evaluation of the proposed deep learning architectures, the entire image corpus was systematically partitioned into training, validation, and testing subsets utilizing a rigorous 70:15:15 split. A granular breakdown detailing the precise numerical distribution of images across all ten categories and their respective data partitions is systematically cataloged in Table 1.
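The 70:15:15 partitioning described above can be sketched as a per-class (stratified) split, so that every category retains the same proportions in each subset. The function below is an illustrative sketch, not the exact code used in the study:

```python
import random

def stratified_split(items_by_class, train=0.70, val=0.15, seed=42):
    """Split each class's items into train/val/test with fixed ratios,
    preserving the per-class 70:15:15 proportions."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label, items in items_by_class.items():
        items = list(items)
        rng.shuffle(items)
        n = len(items)
        n_train = round(n * train)
        n_val = round(n * val)
        splits["train"] += [(x, label) for x in items[:n_train]]
        splits["val"] += [(x, label) for x in items[n_train:n_train + n_val]]
        splits["test"] += [(x, label) for x in items[n_train + n_val:]]
    return splits
```

For a class of 100 images this assigns 70 to training and 15 each to validation and testing, mirroring Table 1's per-class breakdown.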
The preprocessing of raw data constitutes a fundamental prerequisite for engineering robust and highly effective deep learning architectures [44,45]. To guarantee seamless integration with the specific input requirements of the selected pre-trained networks, the entire image corpus was uniformly standardized to a resolution of 224 × 224 pixels. Preceding the computational training phase, pixel intensity values were systematically normalized to a continuous range of [0, 1], a critical calibration designed to expedite algorithmic convergence. Moreover, to proactively suppress overfitting tendencies and substantially elevate the models' predictive generalization on unseen data, a rigorous suite of data augmentation strategies was deployed. This augmentation pipeline enriched the training environment by subjecting the original imagery to a variety of stochastic morphological transformations, specifically incorporating random rotational shifts, horizontal inversions, and dynamic zooming operations.
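The [0, 1] intensity normalization amounts to dividing 8-bit pixel values by 255. A minimal, framework-agnostic illustration (operating on nested lists rather than tensors, purely for exposition):

```python
def normalize_pixels(image):
    """Scale 8-bit pixel intensities (0-255) to the continuous range [0.0, 1.0].

    `image` is an H x W x C nested list of integer channel values."""
    return [[[channel / 255.0 for channel in pixel] for pixel in row]
            for row in image]
```

In practice this step is performed by the framework's tensor pipeline, but the arithmetic is identical.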
Convolutional Neural Networks (CNNs): Convolutional Neural Networks (CNNs) have long served as the foundational pillar of modern computer vision, primarily due to their exceptional capacity to autonomously extract and process hierarchical feature representations directly from complex visual data. By employing a series of localized convolutional filters and pooling mechanisms, these networks mimic the human visual cortex, progressively identifying simple geometric edges and textures in their initial layers before synthesizing them into complex, high-level semantic patterns deeper within the architecture. This inherent spatial invariance makes them highly adept at recognizing subtle morphological anomalies in agricultural imagery, such as localized lesions or irregular discoloration on plant foliage. Recent evolutionary steps in this domain have yielded highly optimized, next-generation architectures; for instance, EfficientNetV2-S [46] maximizes diagnostic accuracy while minimizing parameter bloat through progressive learning techniques, whereas ConvNeXt-Base [47] modernizes the standard convolutional framework by integrating advanced structural principles derived from transformers, thereby ensuring robust, state-of-the-art feature extraction while maintaining computational tractability.
Vision Transformers (ViTs): In contrast to the localized, grid-based processing of traditional convolutions, Vision Transformers (ViTs) [48] process images by dividing them into discrete sequences of flattened patches, relying fundamentally on sophisticated self-attention mechanisms to capture complex, global dependencies across the entire visual field. This paradigm shift has redefined the benchmarks of state-of-the-art performance across broader computer vision tasks, and these architectures are now being heavily adapted to enhance diagnostic precision within smart agriculture. By simultaneously weighing the relative importance of all image patches regardless of their spatial distance, ViTs can contextually link disparate disease symptoms across a leaf that might otherwise be overlooked by spatially constrained convolutional filters. Within the scope of this research, we harness advanced transformer variants such as DeiT3-Base [49], a pure transformer model engineered for exceptional data-efficient training, alongside the hierarchical Swin-Base [50], which intelligently reintroduces localized inductive biases through shifted window mechanisms to efficiently and accurately manage high-resolution phytopathological imagery.
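The patch-sequence view that ViTs operate on can be made concrete with a small helper that splits an image into flattened, non-overlapping patches; for a 224 × 224 × 3 input and 16 × 16 patches this yields 196 tokens of dimension 768. The function below is a didactic sketch, not a substitute for a real patch-embedding layer (which additionally applies a learned linear projection):

```python
def image_to_patches(image, patch=16):
    """Split an H x W x C image (nested lists) into flattened,
    non-overlapping `patch` x `patch` patches -- the token
    sequence a Vision Transformer consumes."""
    height, width = len(image), len(image[0])
    patches = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            flat = []
            for i in range(r, r + patch):
                for j in range(c, c + patch):
                    flat.extend(image[i][j])  # append all channels
            patches.append(flat)
    return patches
```

Self-attention is then computed over this sequence, which is what lets the model relate lesions on opposite sides of a leaf in a single layer.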
The computational training regimen was executed over a maximum span of 100 epochs utilizing the Adaptive Moment Estimation (Adam) optimizer, strategically chosen for its efficient handling of sparse gradients and adaptive learning rate capabilities. The optimizer was configured with a base learning rate of 1 × 10⁻⁴ and a weight decay of 2.0 × 10⁻⁵ to rigorously penalize complex weights and suppress overfitting. To ensure optimal algorithmic convergence and prevent the models from converging on suboptimal local minima, a cosine annealing learning rate scheduler was integrated into the pipeline. This scheduling included a 5-epoch warm-up phase, initiating at a reduced learning rate of 1.0 × 10⁻⁵, alongside a label smoothing factor of 0.1 to mitigate overconfidence in the network's predictive outputs. Initially, the core computational blocks of the pre-trained convolutional and transformer models were strictly frozen, isolating the backpropagation process to train only the newly appended classification layers. Subsequently, these freezing constraints were lifted, allowing the entire network architecture to undergo a meticulous fine-tuning process; this critical phase aligned the pre-learned, generalized feature extraction mechanisms with the subtle morphological intricacies specific to tomato leaf pathologies.
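The warm-up plus cosine-annealing schedule described above can be written as a closed-form function of the epoch index. The sketch below assumes a linear warm-up from 1.0 × 10⁻⁵ to the base rate of 1 × 10⁻⁴ over the first 5 epochs and a decay to zero by epoch 100; the exact warm-up shape used in practice may differ:

```python
import math

def lr_at_epoch(epoch, total_epochs=100, base_lr=1e-4,
                warmup_epochs=5, warmup_start_lr=1e-5, min_lr=0.0):
    """Learning rate for a given epoch: linear warm-up to `base_lr`,
    then cosine annealing down to `min_lr`."""
    if epoch < warmup_epochs:
        frac = epoch / warmup_epochs
        return warmup_start_lr + frac * (base_lr - warmup_start_lr)
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Equivalent built-in schedulers exist in the major frameworks; the closed form is shown only to make the hyperparameter choices explicit.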
In synergy with the transfer learning framework, a sophisticated data augmentation pipeline was deployed to artificially amplify the volume and heterogeneity of the training corpus, thereby fortifying the models' diagnostic generalizability against unseen imagery. Moving beyond basic spatial adjustments, the augmentation protocol was governed by the experimental hyperparameters: stochastic horizontal flipping was applied with a 50% probability, while vertical flipping was deliberately restricted. To simulate the highly variable lighting conditions encountered in real-world agricultural settings, the images were subjected to dynamic color jittering (with a factor of 0.4) to modify brightness, contrast, and saturation. Furthermore, random resized cropping was instituted, dynamically scaling images between 8% and 100% of their original dimensions with fluctuating aspect ratios (ranging from 0.75 to 1.33), supplemented by randomly selected interpolation methods. These computationally generated permutations effectively precluded the network from memorizing superficial spatial orientations or localized background artifacts, compelling the model to strictly assimilate the fundamental, definitive visual biomarkers of the targeted phytopathological conditions.
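The random resized cropping step can be illustrated by the standard rejection-sampling scheme used in common augmentation libraries (e.g., torchvision's RandomResizedCrop): an area fraction is drawn from the stated 0.08–1.0 range and an aspect ratio from 0.75–1.33, and the attempt is retried if the resulting crop does not fit the image. This is a sketch of the sampling logic only:

```python
import math
import random

def sample_crop_size(height, width, scale=(0.08, 1.0),
                     ratio=(0.75, 1.33), rng=random, attempts=10):
    """Sample a (crop_h, crop_w) pair whose area fraction and aspect
    ratio fall in the given ranges; fall back to the full image if no
    valid crop is found within `attempts` tries."""
    area = height * width
    log_ratio = (math.log(ratio[0]), math.log(ratio[1]))
    for _ in range(attempts):
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(*log_ratio))  # log-uniform sampling
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            return h, w
    return height, width  # fallback: keep the whole image
```

The sampled region is subsequently resized back to 224 × 224, which produces the "dynamic zooming" effect referred to above.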
To ensure a rigorous and equitable comparative analysis, the experimental framework was meticulously standardized across all selected deep learning architectures. The entire computational pipeline, encompassing model training, validation, and evaluation, was executed utilizing the Python programming environment integrated with the TensorFlow backend. To accelerate the computationally intensive training phases and manage complex matrix operations efficiently, all empirical simulations were deployed on a high-performance workstation equipped with a state-of-the-art NVIDIA GeForce RTX 5090 graphics processing unit (GPU).
Maintaining strict methodological consistency, an identical training protocol was uniformly applied to all evaluated networks. Optimization was governed by the Adaptive Moment Estimation (Adam) algorithm, initialized with a learning rate of 1 × 10⁻⁴, to facilitate the comprehensive fine-tuning of the entire architectural pipeline. A batch size of 16 was systematically chosen to establish an optimal equilibrium between memory utilization and algorithmic accuracy over a maximum training duration of 100 epochs. Furthermore, to preemptively counteract overfitting and isolate the most robust iteration of each model, an early stopping mechanism was integrated. This functional safeguard continuously monitored validation loss, terminating the training sequence automatically if no statistical improvement was observed for 10 consecutive epochs. Ultimately, the network weights that yielded the absolute minimum validation loss were preserved and subsequently deployed for the final, unbiased performance appraisal on the isolated testing dataset.
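The early-stopping safeguard described above can be sketched as a small bookkeeping class that tracks the best validation loss and retains the corresponding weights; this illustrates the described behavior (patience of 10 epochs on validation loss) rather than the study's actual implementation:

```python
class EarlyStopping:
    """Signal a stop when validation loss has not improved for
    `patience` consecutive epochs; keep the best-epoch weights."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_weights = None
        self.bad_epochs = 0

    def step(self, val_loss, weights):
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_weights = weights  # snapshot of best model state
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

After training, `best_weights` holds the state associated with the minimum validation loss, which is then used for the final test-set appraisal.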
To comprehensively quantify both the diagnostic efficacy and the operational viability of the proposed deep learning architectures, a dual-faceted evaluation strategy was implemented. The primary analytical framework relied upon four fundamental statistical indicators, accuracy, precision, recall, and the F1-score, to provide a holistic assessment of the models' predictive performance. In this context, overall accuracy represents the total proportion of correctly classified instances across all tomato leaf categories, whereas precision specifically gauges the system's reliability by measuring the exactness of its positive disease identifications. Concurrently, recall, or sensitivity, ascertains the network's effectiveness in capturing the entirety of actual positive cases. To harmonize the inherent trade-off between precision and recall, a critical necessity for evaluating datasets with potential internal variance, the F1-score is utilized as their harmonic mean. The mathematical formulations defining these standard classification criteria are given in Equations 1 through 4, where TP, TN, FP, and FN denote True Positives, True Negatives, False Positives, and False Negatives, respectively:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (1)

Precision = TP / (TP + FP) (2)

Recall = TP / (TP + FN) (3)

F1-score = 2 × (Precision × Recall) / (Precision + Recall) (4)

Furthermore, because the practical deployment of these diagnostic tools in real-world agricultural settings heavily depends on resource consumption, the models were also strictly evaluated on their computational footprint. Consequently, the architectures were benchmarked on their total number of trainable parameters (Params) and the giga floating-point operations (GFLOPs) required per forward pass. These specific metrics serve as vital markers of computational efficiency, allowing for a highly nuanced comparison of the inherent trade-offs between a network's structural complexity, its processing cost, and its overall feasibility for deployment.
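The four indicators defined above follow directly from the TP/TN/FP/FN counts; a minimal reference implementation of the standard formulas:

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall, and F1-score from the
    confusion-matrix counts of one class (zero-division guarded)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, precision, recall, f1
```

For a multi-class problem such as this ten-class task, these quantities are computed per class and then averaged (macro-averaging) to obtain the reported scores.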
The comprehensive evaluation of the selected deep learning architectures reveals exceptional diagnostic capabilities across all tested models, as systematically detailed in Table 2. The empirical results demonstrate a highly competitive baseline, with EfficientNetV2-S, DeiT3-Base, and ConvNeXt-Base all achieving an identical, outstanding accuracy of 0.9985.
Despite this parity in overall accuracy, subtle variations emerge when examining more granular performance metrics. EfficientNetV2-S recorded a precision of 0.9984, a recall of 0.9983, and an F1-score of 0.9983, showcasing a robust balance in its predictive reliability. In comparison, the pure transformer architecture, DeiT3-Base, yielded a precision of 0.9981, a recall of 0.9972, and an F1-score of 0.9976. Meanwhile, the modernized convolutional network, ConvNeXt-Base, demonstrated slightly superior harmonization with a precision of 0.9984, a recall of 0.9985, and an F1-score of 0.9984. These remarkably high scores across distinct architectural paradigms highlight the profound efficacy of leveraging transfer learning combined with rigorous data augmentation protocols to extract critical phytopathological features.
While diagnostic accuracy is paramount, the practical deployment of these models in resource-constrained agricultural environments necessitates a critical evaluation of their computational footprint. As outlined in Table 2, EfficientNetV2-S emerges as the undisputed leader in operational efficiency, requiring a mere 20.19 million parameters and 5.4193 GFLOPs to achieve its near-perfect classification metrics. This extraordinary equilibrium between low computational cost and high predictive power makes it an ideal candidate for integration into mobile or edge-computing diagnostic devices. Conversely, the heavier architectures incur significantly larger computational expenses. DeiT3-Base operates with 85.82 million parameters and 33.6955 GFLOPs, while ConvNeXt-Base utilizes 87.58 million parameters and 30.7075 GFLOPs. Although these advanced models deliver exceptional accuracy, their substantial parameter counts and higher computational processing requirements pose potential integration challenges for real-time, low-power field applications.
Amidst the highly competitive models evaluated, the hierarchical Vision Transformer, Swin-Base, distinguished itself by achieving the absolute highest performance metrics across the board. The model registered an unparalleled accuracy of 0.9989, complemented by a precision of 0.9988, a recall of 0.9987, and an F1-score of 0.9987. This superior performance requires 86.75 million trainable parameters and operates at 30.3375 GFLOPs. The exceptional diagnostic precision of the Swin-Base framework can be directly attributed to its unique shifted-window mechanism, which brilliantly synthesizes the global contextual awareness inherent to transformers with the localized feature extraction efficiency typical of convolutional networks. To provide a more granular, class-specific understanding of this optimal model's predictive behavior, its performance on the held-out test set is visually represented in Figure 2.
The confusion matrix corresponding to the Swin-Base architecture's predictions (Figure 2) vividly illustrates the model's formidable capacity to discriminate among complex and visually similar tomato leaf pathologies. An analysis of the matrix reveals virtually flawless classification across the vast majority of the ten categories. Specifically, the model achieved perfect recognition for the Healthy class (240 instances), Leaf Mold (144 instances), Septoria Leaf Spot (267 instances), Two-spotted Spider Mite (252 instances), Tomato Mosaic Virus (57 instances), and Tomato Yellow Leaf Curl Virus (805 instances). The negligible error margin is confined to a mere three misclassifications across the entire test set: one instance of Bacterial Spot (out of 320) was erroneously identified as Late Blight, one Early Blight sample (out of 150) was misclassified as Target Spot, and one Late Blight instance (out of 287) was incorrectly predicted as a Healthy leaf. These exceptionally rare misdiagnoses likely stem from highly ambiguous or overlapping visual symptoms at specific stages of disease progression. Ultimately, these results underscore the remarkable robustness and clinical viability of the Swin-Base architecture for automated agricultural diagnostics.
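Per-class recall values of the kind discussed above can be read directly off the confusion-matrix rows. The helper below illustrates this; the test reuses counts reported in the text (e.g., 319 of 320 Bacterial Spot images correctly classified) on a reduced three-class matrix for brevity:

```python
def per_class_recall(confusion):
    """Given a square confusion matrix where row i holds the counts of
    true class i, return recall_i = diagonal / row sum for each class."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]
```

Rows that are fully diagonal (e.g., the Healthy and Tomato Yellow Leaf Curl Virus classes) yield a recall of exactly 1.0, matching the perfect recognition reported in Figure 2.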
This study presented a standardized comparative analysis of four contemporary deep learning architectures for tomato leaf disease classification using the PlantVillage dataset. Across all experiments, the evaluated models achieved consistently high performance, confirming that both modern CNN-based and transformer-based backbones are highly effective for this task under controlled imaging conditions. Among the tested models, Swin-Base delivered the strongest predictive results, achieving the highest accuracy and F1-score, while EfficientNetV2-S provided the most attractive balance between classification performance and computational efficiency. These findings suggest that the selection of a model for automated tomato disease diagnosis should not rely on accuracy alone, but should also consider deployment constraints such as parameter count, computational cost, and target hardware.
From a practical perspective, the results support two complementary conclusions. First, hierarchical transformer models such as Swin-Base can provide marginal but meaningful gains in predictive performance when maximum classification accuracy is required. Second, lightweight and efficient CNN-based models such as EfficientNetV2-S remain highly competitive and may be more suitable for edge-oriented or resource-constrained agricultural applications. Accordingly, the present work provides a useful benchmark for selecting appropriate architectures according to different operational needs.
Despite these promising results, the study should be interpreted within the scope of its experimental setting. The models were evaluated on the PlantVillage dataset, which is widely used and suitable for controlled benchmarking, but it does not fully reflect the variability of real field environments, including illumination changes, background complexity, occlusion, and symptom overlap under natural conditions. Future work should therefore focus on external validation using field-acquired datasets, robustness analysis under domain shift, explainability-based assessment of model decisions, and deployment-oriented testing on mobile or embedded hardware. Such extensions would strengthen the translational value of the proposed benchmarking framework and further support the development of reliable AI-assisted tools for practical disease management in agriculture.
Conflict of interest: The authors declare that there are no conflicts of interest regarding the publication of this research.
Declaration of AI and AI-assisted technologies in the writing process: AI-assisted tools were used only for language translation, grammar correction, and improvement of readability during the preparation of this manuscript. No AI tools were used in the study design, methodology, data analysis, model development, result generation, or interpretation of the findings. The authors take full responsibility for the scientific content of the manuscript.
Data availability: The empirical data utilized in this investigation were sourced from the open-access PlantVillage repository.
