Background: Accurate and timely segmentation of brain tumors in MRI images is essential for treatment planning. Convolutional neural networks (CNNs) have been widely successful in medical image segmentation, but they capture long-range spatial dependencies poorly and often need substantial computational resources to reach reasonable accuracy. Vision Transformers (ViTs), which use global self-attention, offer a promising alternative but are computationally expensive for high-resolution 3D medical images. In this study, we propose SegViTBT, a lightweight hybrid architecture that combines a Vision Transformer encoder with a convolutional decoder for efficient brain tumor segmentation. The model uses sparse attention to reduce computational load and learnable 2D positional embeddings to enhance spatial representation, delivering high accuracy with reduced resource demands.
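The abstract does not say which sparse-attention pattern SegViTBT uses. As an illustration only, the sketch below assumes a local-window pattern over a 2D grid of image patches, in which each patch attends only to patches within a fixed number of rows and columns of itself. The names `local_attention_mask`, `attention_density`, and `radius` are hypothetical and are not taken from the paper.

```python
def local_attention_mask(grid_h, grid_w, radius):
    """Return mask[i][j] = True if patch i may attend to patch j.

    Patches are indexed row-major on a grid_h x grid_w grid. This local-window
    pattern is an assumed example of sparse attention, not the paper's method.
    """
    n = grid_h * grid_w
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        row_i, col_i = divmod(i, grid_w)
        for j in range(n):
            row_j, col_j = divmod(j, grid_w)
            # Keep the pair only if both the row and column offsets are small.
            mask[i][j] = abs(row_i - row_j) <= radius and abs(col_i - col_j) <= radius
    return mask


def attention_density(mask):
    """Fraction of query-key pairs kept, relative to dense global attention."""
    n = len(mask)
    kept = sum(sum(row) for row in mask)
    return kept / (n * n)


mask = local_attention_mask(grid_h=4, grid_w=4, radius=1)
density = attention_density(mask)
print(f"fraction of attention pairs computed: {density:.3f}")
```

Under this assumed pattern, the number of attention pairs grows linearly with the number of patches for a fixed `radius`, whereas dense self-attention grows quadratically. That is the general mechanism by which sparse attention reduces computational load.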
Methods: The model is trained on MRI images from the BraTS benchmark dataset. The Dice coefficient, accuracy, and loss are tracked over 25 epochs on both the training and validation sets, and the model is compared against conventional CNN and ViT models.
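For reference, the Dice coefficient between a predicted mask P and a ground-truth mask T is 2|P and T| / (|P| + |T|). The sketch below is a minimal version for binary 2D masks given as nested lists of 0s and 1s. The function name `dice_coefficient` and the convention of returning 1.0 for two empty masks are assumptions made for this example and may differ from the evaluation code used in the study.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two binary masks given as nested lists of 0/1."""
    flat_pred = [p for row in pred for p in row]
    flat_target = [t for row in target for t in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must contain the same number of pixels")
    # Pixels labeled as tumor in both masks.
    intersection = sum(p * t for p, t in zip(flat_pred, flat_target))
    total = sum(flat_pred) + sum(flat_target)
    # Assumed convention: two empty masks agree perfectly, so avoid 0/0.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


pred = [[0, 1, 1],
        [0, 1, 0]]
target = [[0, 1, 0],
          [0, 1, 1]]
score = dice_coefficient(pred, target)
print(f"Dice: {score:.4f}")
```

A score of 1.0 means the two masks overlap exactly, and 0.0 means they share no tumor pixels.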
Results: SegViTBT shows a stable learning curve with rapid convergence. It achieves a Dice score of 78.06% on the BraTS dataset, outperforming baseline CNNs and standard ViT implementations while using less than 60% of the computational resources. Visual results confirm the model’s ability to delineate tumor boundaries with high precision, even for irregularly shaped lesions.
Conclusion: SegViTBT successfully closes the performance gap between CNNs and ViTs in medical imaging by introducing a computationally efficient, pixel-accurate architecture. The model is suitable for deployment in low-resource clinical settings, enabling real-time, practical diagnostic support for brain tumor assessment.