Distracted Driver Detection:
Can computer vision spot distracted drivers?

In today's fast-paced world, it is not uncommon to witness drivers engaging in multiple activities behind the wheel. Whether it's talking on the phone, texting, munching on a quick snack, or even applying makeup, the allure of multitasking while driving can prove irresistible. However, what may seem like harmless multitasking can have severe consequences, compromising the safety of not just the driver but also passengers, bystanders, and countless others sharing the road.

According to the United States Department of Transportation, a staggering one in five car accidents is attributed to distracted driving. This means that every year, distracted driving causes injuries to around 425,000 people and tragically claims the lives of 3,000 others. These alarming statistics underscore the urgent need to address the issue of distracted driving and find effective solutions to mitigate its impact.

Recognizing the importance of driver safety, State Farm, a renowned nationwide insurance company, has taken a proactive stance. In an effort to better serve and protect their customers, they released a comprehensive dataset on Kaggle, challenging data scientists and computer vision experts to develop innovative methods using dashboard cameras. The aim? To automatically detect and identify drivers engaging in distracted behaviors, and ultimately improve insurance practices to enhance road safety.

In this blog post, we delve into the world of distracted driving, examining the most prevalent activities that divert driver attention from the road. We explore the consequences of these distractions and shed light on the critical role that computer vision and advanced technologies can play in detecting and preventing such behaviors. Join us as we unravel the complexities of distracted driving and uncover the potential solutions that could save lives and create safer roads for everyone.

In today's age of advanced technology, the ability to understand and classify driver behavior has become increasingly crucial for ensuring road safety. By analyzing snapshots from in-car cameras, we can gain valuable insights into whether drivers are focused on the road, wearing their seatbelts, or even indulging in capturing selfies with friends in the backseat. This fascinating dataset opens up a realm of possibilities for leveraging deep learning techniques to classify driver behavior accurately.

However, one significant challenge posed by this dataset lies in its unique composition. No subject appears in both the training and test sets, which makes generalization considerably harder for a deep learning model. The images were extracted from videos, each capturing the driving behaviors of a different subject, further adding to the diversity of the dataset.

The training dataset comprises approximately 22.4k images belonging to 26 different subjects. These images are roughly evenly distributed across ten distinct classes representing various driver behaviors. From safe driving to texting or talking on the phone with either hand, operating the radio, drinking, reaching behind, doing hair and makeup, or conversing with passengers - the dataset encompasses a wide range of real-world scenarios.

The 10 classes to predict are:

  • c0: safe driving

  • c1: texting - right

  • c2: talking on the phone - right

  • c3: texting - left

  • c4: talking on the phone - left

  • c5: operating the radio

  • c6: drinking

  • c7: reaching behind

  • c8: hair and makeup

  • c9: talking to passenger

The testing dataset, on the other hand, consists of a staggering 79.7k images belonging to a separate subset of drivers, ensuring the evaluation of models on unseen data for accurate performance assessment.

The dataset can be downloaded from the State Farm Distracted Driver Detection competition page on Kaggle.

Also included in the dataset is a driver_imgs_list.csv file, which maps each training image to its subject (driver) ID and class label; this mapping is used throughout the exploratory analysis below.

The image displayed showcases two plots, revealing valuable insights about the dataset.

Plot 1: Class Label Distribution. The first plot shows the count of images per class label. Among the ten class labels, class c0, which signifies safe driving, is the most frequently occurring. Classes c3, c4, c6, c2, c5, and c1 have similar counts, suggesting a relatively balanced distribution among these classes. Class c8 is the least occurring, indicating that instances of drivers engaging in hair and makeup are relatively rare in the dataset.

Plot 2: Subject Distribution. The second plot depicts the count of images per subject. Subjects p021, p022, and p024 are the most frequently represented in the dataset, while subject p072 has the least representation, implying that fewer instances of this particular subject's driving behavior were captured.

These two plots provide valuable insights into the class label distribution and subject occurrence within the dataset, offering a deeper understanding of the dataset's composition and potential implications for further analysis and modeling.
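
For readers who want to reproduce these counts, a minimal sketch is shown below. It assumes the training images are described by the driver_imgs_list.csv file bundled with the dataset, with a subject column holding the driver ID and a classname column holding the behavior label; adjust the path and column names to match your copy.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Mapping of each training image to its driver and class label
# (file name and column names are assumptions; adjust to your local copy).
df = pd.read_csv("driver_imgs_list.csv")

fig, axes = plt.subplots(1, 2, figsize=(14, 4))

# Plot 1: number of images per class label (c0-c9)
df["classname"].value_counts().sort_index().plot(kind="bar", ax=axes[0])
axes[0].set_title("Class label distribution")
axes[0].set_xlabel("Class")
axes[0].set_ylabel("Image count")

# Plot 2: number of images per subject (driver)
df["subject"].value_counts().plot(kind="bar", ax=axes[1])
axes[1].set_title("Subject distribution")
axes[1].set_xlabel("Subject")
axes[1].set_ylabel("Image count")

plt.tight_layout()
plt.show()
```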

The image displayed presents the distribution of different class images within each subject throughout the dataset.

Upon analysis, it is observed that the distribution of class images appears to be relatively even for most subjects. However, subject p072 stands out with an uneven distribution of classes. Notably, classes c5 (operating the radio) and c7 (reaching behind) are significantly underrepresented or barely present within the images associated with subject p072.

This observation suggests that subject p072 may exhibit distinct driving behavior patterns, focusing less on operating the radio and reaching behind compared to other subjects in the dataset. Such disparities in class distribution within a subject can provide valuable insights into the variations and nuances of driver behavior captured by the dataset.

By considering the subject-level distribution of class images, researchers and data analysts can better understand the dataset's characteristics and potentially address any bias or imbalance that may arise during model training and evaluation.
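
The per-subject breakdown can be produced with a simple cross-tabulation; a minimal sketch, using the same assumed driver_imgs_list.csv file as above, follows.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("driver_imgs_list.csv")  # assumed columns: subject, classname, img

# Count how many images of each class exist for every subject
subject_class_counts = pd.crosstab(df["subject"], df["classname"])

# One stacked bar per subject, segmented by class label
subject_class_counts.plot(kind="bar", stacked=True, figsize=(14, 5))
plt.title("Class distribution within each subject")
plt.xlabel("Subject")
plt.ylabel("Image count")
plt.legend(title="Class", bbox_to_anchor=(1.02, 1), loc="upper left")
plt.tight_layout()
plt.show()

# Inspect a single subject, e.g. the unevenly distributed p072 (assumes it exists in your copy)
print(subject_class_counts.loc["p072"])
```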

The image displayed showcases all 26 subjects from the training dataset, with each subject captured while talking on the phone. This image serves to highlight the significant variations and diversity among the subjects in terms of their appearances, expressions, and driving contexts.

By presenting a visual representation of the subjects engaged in the specific behavior of talking on the phone, it becomes evident that each individual brings their unique characteristics and nuances to the dataset. The varying demographics, facial expressions, and driving environments depicted by the subjects emphasize the complexity and richness of real-world driver behavior.

This image underscores the importance of considering the individuality of subjects and their behaviors when analyzing and modeling driver actions.

Based on the exploratory data analysis (EDA), several key observations can be made:

  1. Similarity within Class Images: Since the images are extracted from video clips, it is noticed that multiple images belonging to the same class often exhibit strong similarities. This characteristic presents an opportunity to leverage techniques such as data augmentation and regularization to enhance the model's performance and generalization capabilities.

  2. Imbalanced Train-Test Image Distribution: The number of test images is roughly four times the number of training images. With a comparatively small training set and a much larger unseen test set, the risk of overfitting to the training data is high. To address this, careful consideration must be given to strategies like data augmentation, balancing class representation, and regularization during model training.

  3. Distinct Driver Composition: It is observed that the drivers present in the test images are entirely different from those in the train images. This divergence in driver composition necessitates the need for robust training methodologies that account for driver-specific behaviors. Transfer learning techniques, model regularization, or incorporating driver-specific features could potentially help address this challenge and ensure improved model performance across unseen drivers in the test set.

Modeling Methodology:

  1. Training and Validation Sets:

    • The dataset was split into training and validation sets to assess the model's performance during training.

    • Initially, a random train-test split was used, but it resulted in significant overfitting, where the model performed well on the training data but poorly on unseen data.

    • To mitigate overfitting, GroupKFold was employed, where the dataset was grouped based on the unique driver ID.

    • This ensured that each driver was exclusively present in either the training or validation set, preventing data leakage and improving the model's generalization (a minimal splitting sketch appears at the end of this section).

  2. Custom CNN Model:

    • Convolutional Neural Networks (CNNs) were selected as the primary architecture due to their effectiveness in image classification tasks.

    • A custom CNN model was designed, tailored specifically for the driver behavior classification problem.

    • The model consisted of two CNN layers, each comprising convolutional operations followed by max-pooling to extract relevant features.

    • The number of filters and kernel size in each CNN layer were determined to capture essential characteristics from the input images effectively.

    • Subsequently, two fully connected (Dense) layers were added to learn high-level representations of the extracted features.

    • The first dense layer contained 128 neurons, followed by a second dense layer with 64 neurons, both utilizing the Rectified Linear Unit (ReLU) activation function.

    • The output layer comprised 10 neurons representing the 10 driver behavior classes, employing the softmax activation function for multi-class classification (a Keras sketch of this architecture appears at the end of this section).

  3. Pre-trained Models:

    • In addition to the custom CNN model, pre-trained models were incorporated into the experimentation for transfer learning purposes.

    • VGG-16, VGG-19, and EfficientNetB0 were chosen as pre-trained models known for their strong performance in image classification tasks.

    • Transfer learning involves utilizing the learned features from these pre-trained models as a starting point for our specific problem.

    • For VGG-16 and VGG-19, the pre-trained models' convolutional layers were frozen, and additional fully connected layers were appended, followed by a softmax activation layer for classification.

    • EfficientNetB0 was employed with a GlobalAveragePooling2D layer, BatchNormalization, Dropout, and a softmax activation layer.

    • Leveraging pre-trained models allowed them to benefit from the pre-existing knowledge and feature extraction capabilities acquired from large-scale datasets like ImageNet (a transfer-learning sketch appears at the end of this section).

  4. Image Augmentation:

    • Image augmentation techniques were applied to increase the diversity and size of the training dataset, aiding in improved model generalization.

    • Augmentation involved applying various transformations, such as rotation, shear, horizontal and vertical shifts, to the original images.

    • By generating multiple augmented copies of each image, the training dataset's variability was enhanced, reducing the risk of overfitting.

    • Image augmentation was selectively applied only to the training set using the ImageDataGenerator class in the Keras library.

    • Augmentation was not performed on the validation set, ensuring an unbiased evaluation of model performance on unseen data (an ImageDataGenerator sketch appears at the end of this section).

By incorporating these modeling methodologies, such as careful train-validation splitting, customized CNN architecture design, utilization of pre-trained models for transfer learning, and image augmentation techniques, the objective was to develop robust models capable of accurately classifying driver behavior based on the given dataset. These strategies were implemented to address challenges like overfitting, limited training data, and the need for enhanced generalization in the models, ultimately improving their effectiveness in real-world applications.
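
The sketches below illustrate the main building blocks described above; they are minimal, assumption-laden examples rather than the exact training code. First, the driver-aware split from step 1, using scikit-learn's GroupKFold and the assumed driver_imgs_list.csv mapping:

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Each image is tied to a driver (subject); grouping on the driver ID keeps
# all of a driver's images on one side of the split and prevents leakage.
df = pd.read_csv("driver_imgs_list.csv")  # assumed columns: subject, classname, img

gkf = GroupKFold(n_splits=5)
for fold, (train_idx, val_idx) in enumerate(
        gkf.split(df["img"], df["classname"], groups=df["subject"])):
    train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
    # Sanity check: no driver appears in both the training and validation folds
    assert set(train_df["subject"]).isdisjoint(set(val_df["subject"]))
    print(f"Fold {fold}: {len(train_df)} train images, {len(val_df)} validation images")
```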
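
Next, a Keras sketch of the custom CNN from step 2. The filter counts, kernel sizes, and input resolution are illustrative assumptions; the post only fixes two convolutional blocks, dense layers of 128 and 64 units, and a 10-way softmax output.

```python
from tensorflow.keras import layers, models

def build_custom_cnn(input_shape=(224, 224, 3), num_classes=10):
    model = models.Sequential([
        layers.Input(shape=input_shape),
        # Block 1: convolution followed by max-pooling
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        # Block 2: convolution followed by max-pooling
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        # Fully connected layers of 128 and 64 units with ReLU activations
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        # One output neuron per driver-behavior class, with softmax
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_custom_cnn()
model.summary()
```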
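
The transfer-learning setup from step 3 can be sketched as follows for VGG16; the 1024-unit head and 0.5 dropout rate are assumptions inferred from the model names in the results table, and VGG19 or EfficientNetB0 can be slotted in similarly (EfficientNetB0 with GlobalAveragePooling2D, BatchNormalization, and Dropout in place of Flatten).

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# Frozen VGG16 convolutional base pre-trained on ImageNet
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # keep the ImageNet features fixed during training

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(1024, activation="relu"),  # assumed head size
    layers.Dropout(0.5),                    # assumed dropout rate
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```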
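
Finally, a sketch of the augmentation pipeline from step 4 using Keras's ImageDataGenerator. The transformation magnitudes and directory layout are assumptions; it presumes the training and validation images have been copied into class-labelled folders after the driver-aware split.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augmentation is applied to the training data only; the validation
# generator just rescales pixels so evaluation stays unbiased.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,
    rotation_range=15,        # illustrative magnitudes; the post does not specify them
    shear_range=0.1,
    width_shift_range=0.1,
    height_shift_range=0.1,
)
val_datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = train_datagen.flow_from_directory(
    "train_split/", target_size=(224, 224), batch_size=32, class_mode="categorical")
val_gen = val_datagen.flow_from_directory(
    "val_split/", target_size=(224, 224), batch_size=32, class_mode="categorical")

# The generators can then be passed to model.fit(train_gen, validation_data=val_gen, ...)
```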

Loss and Accuracy:

The results of the different models are as follows:

  1. Base Model with Random Train-Test Split:

    • This model achieved perfect accuracy of 1.0 on both the training and validation sets.

    • However, the log loss on the testing dataset was very high at 26.643.

    • This indicates that the model is severely overfit and does not generalize well to unseen data.

  2. Base Model with GroupKFold Train-Test Split:

    • The accuracy of the model dropped when using GroupKFold train-test split.

    • The training accuracy was 0.855, and the validation accuracy was 0.315.

    • The log loss on the testing dataset was marginally lower at 23.763 compared to the random train-test split.

    • Although the model shows slightly better generalization, the accuracy scores are still not satisfactory.

  3. VGG16 Model with Single Dense Layer and Dropout:

    • This model achieved the lowest log loss of 0.73 on the testing dataset.

    • The training accuracy was 93%, indicating good performance on the training data.

    • However, the validation accuracy was lower at 69%, suggesting that the model is overfitting.

    • The model may not generalize well to unseen data due to the large gap between training and validation accuracy.

  4. VGG19 Model:

    • The VGG19 model showed slightly improved training and validation accuracy scores compared to VGG16.

    • However, the log loss on the testing dataset was 0.806, higher than the VGG16 model.

    • This indicates that although the model performs well on the training and validation sets, it struggles to generalize to unseen data.

  5. EfficientNetB0 Model:

    • The EfficientNetB0 model achieved the lowest training loss among the transfer-learning models.

    • However, it had the highest log loss on the testing dataset among the pre-trained models.

    • This suggests that the model is overfitting and does not generalize well to unseen data.

    • Despite this, the EfficientNetB0 model runs significantly faster due to its lower number of trainable parameters.

Overall, the VGG16 model with a single dense layer and dropout achieved the lowest log loss on the testing dataset. However, it exhibited signs of overfitting. The other models, including VGG19 and EfficientNetB0, had higher log losses, indicating poorer performance on the testing dataset. It is important to note that accuracy alone is not a sufficient measure of model performance, as it can be misleading in the presence of class imbalance or when models are overfitting. Therefore, considering log loss as an evaluation metric provides a more comprehensive assessment of model performance.

Summary of model results:

Model                                  Training Loss  Training Accuracy  Validation Loss  Validation Accuracy  Kaggle Testing Loss  Model Type
base_model_kfold                       0.000          1.000              0.005            0.999                26.643               Base-Kfold
base_model_gkfold                      0.746          0.855              3.525            0.315                23.763               Base-Gkfold
best_model_vgg16_dropout_1024          0.193          0.937              0.929            0.690                0.731                VGG16
best_model_vgg16                       0.181          0.944              0.891            0.718                0.752                VGG16
best_model_vgg19_l2_4096_1024_adam_lr  0.512          0.943              1.181            0.719                0.806                VGG19
best_model_vgg16_l2_1024               0.433          0.926              1.324            0.645                0.837                VGG16
best_model_vgg16_l2_4096_1024_adam_lr  0.574          0.944              1.280            0.720                0.850                VGG16
best_model_EfficientNetB0_v2_base      0.090          0.984              0.949            0.686                0.857                EfficientNetB0
best_model_EfficientNetB0_v2_augment   0.205          0.952              0.983            0.693                0.878                EfficientNetB0
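
For reference, the multi-class log loss reported in the Kaggle testing loss column above can be computed locally with scikit-learn. A toy sketch with made-up probabilities illustrates why confidently wrong predictions are penalized so heavily:

```python
import numpy as np
from sklearn.metrics import log_loss

# Toy example: 3 validation images, 10 classes.
# y_true holds integer class labels; y_prob holds softmax outputs from the model.
y_true = [0, 5, 9]
y_prob = np.array([
    [0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02],  # confident and correct
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.50, 0.05, 0.05, 0.05, 0.10],  # hesitant but correct
    [0.60, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.02, 0.03],  # confidently wrong: heavily penalized
])
print(log_loss(y_true, y_prob, labels=list(range(10))))
```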

Classification Report for VGG16 Model:

Based on the provided classification report, the F1-scores for each class label are as follows:

  • Class c8 has the lowest F1-score of 0.36, indicating poor performance in terms of precision, recall, and overall classification accuracy.

  • Class c9 also has a relatively low F1-score of 0.49, indicating subpar performance compared to other classes.

  • On the other hand, classes c3 and c7 have the highest F1-scores of 0.83 and 0.86, respectively, indicating good performance in terms of precision, recall, and overall classification accuracy.

  • Classes c0, c1, c4, c5, and c6 have F1-scores ranging from 0.69 to 0.78, indicating moderate performance.

  • Class c2 has an F1-score of 0.54, suggesting relatively lower performance compared to other classes.

Overall, there is variation in the model's performance across different class labels. Class c8 and c9 seem to be the most challenging to classify, while classes c3 and c7 are relatively easier to classify based on their higher F1-scores. It would be beneficial to further investigate and potentially address the issues related to the poor performance of classes c8 and c9 in order to improve the overall model performance.

Class         Precision  Recall  F1-Score  Support
c0            0.69       0.84    0.75      481
c1            0.97       0.60    0.74      452
c2            0.52       0.57    0.54      469
c3            0.98       0.72    0.83      476
c4            0.72       0.83    0.77      476
c5            0.78       0.77    0.78      456
c6            0.60       0.79    0.69      453
c7            0.85       0.87    0.86      413
c8            0.99       0.22    0.36      398
c9            0.40       0.64    0.49      400
accuracy                         0.69      4474
macro avg     0.75       0.68    0.68      4474
weighted avg  0.75       0.69    0.69      4474
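
A report like the one above can be generated with scikit-learn's classification_report. The sketch below uses random stand-ins so it runs on its own; in practice y_val comes from the validation split and y_prob from model.predict on the validation generator.

```python
import numpy as np
from sklearn.metrics import classification_report

# Random stand-ins so the snippet is self-contained; replace with real
# validation labels and model.predict(...) probabilities.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 10, size=200)
y_prob = rng.dirichlet(np.ones(10), size=200)

y_pred = y_prob.argmax(axis=1)
print(classification_report(y_val, y_pred,
                            labels=list(range(10)),
                            target_names=[f"c{i}" for i in range(10)],
                            zero_division=0))
```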

Confusion Matrix for VGG16 model:

  1. Class c0 has 402 true positives, but a notable number of its images are misclassified as class c9 (65 images).

  2. Class c1 has 269 true positives, but its images are often misclassified as class c0 (56 images) and class c6 (94 images).

  3. Class c2 has 265 true positives, but its images are frequently misclassified as class c6 (104 images) and occasionally as class c9 (15 images).

  4. Class c3 has 345 true positives and relatively few misclassifications, although some of its images are still predicted as class c4 (63 images).

  5. Class c4 has 395 true positives and performs well, with only occasional misclassifications into classes such as c2 and c8.

  6. Class c5 has 350 true positives, but its images are frequently misclassified as class c9 (101 images) and occasionally as class c0 (3 images).

  7. Class c6 has 360 true positives, but its images are frequently misclassified as class c2 (54 images) and occasionally as class c4 (27 images).

  8. Class c7 has 359 true positives and demonstrates good performance with few misclassifications.

  9. Class c8 is misclassified into multiple classes, such as c0, c2, c4, and c6, indicating difficulty in distinguishing it from these classes.

  10. Class c9 has 256 true positives, but its images are frequently misclassified as class c0 (63 images) and class c2 (50 images).
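
The confusion matrix itself can be computed and visualised with scikit-learn; a minimal sketch (again with random stand-ins in place of the real validation labels and predictions) follows.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Random stand-ins so the snippet runs on its own; in practice y_val and
# y_pred come from the validation split and the model's predictions.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 10, size=200)
y_pred = rng.integers(0, 10, size=200)

ConfusionMatrixDisplay.from_predictions(
    y_val, y_pred,
    labels=list(range(10)),
    display_labels=[f"c{i}" for i in range(10)],
    xticks_rotation=45,
    cmap="Blues",
)
plt.title("Validation confusion matrix")
plt.tight_layout()
plt.show()
```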


Mislabeled Prediction Analysis:

The following figure shows a collection of sample images for which the model's predictions were wrong. The title of each image gives the original label, the predicted label, and the model's confidence in that prediction.
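
A figure like this can be assembled with matplotlib; the sketch below uses random stand-ins for the validation images, labels, and predicted probabilities so it runs on its own. In practice these would come from the validation split and model.predict.

```python
import numpy as np
import matplotlib.pyplot as plt

# Random stand-ins for X_val (images), y_val (true labels), and y_prob
# (predicted probabilities); replace with real validation data.
rng = np.random.default_rng(0)
X_val = rng.random((50, 224, 224, 3))
y_val = rng.integers(0, 10, size=50)
y_prob = rng.dirichlet(np.ones(10), size=50)

y_pred = y_prob.argmax(axis=1)
confidence = y_prob.max(axis=1)
wrong = np.where(y_pred != y_val)[0]

# Grid of misclassified images titled with true label, prediction, and confidence
fig, axes = plt.subplots(3, 4, figsize=(14, 9))
for ax, idx in zip(axes.ravel(), wrong[:12]):
    ax.imshow(X_val[idx])
    ax.set_title(f"true: c{y_val[idx]}  pred: c{y_pred[idx]}  conf: {confidence[idx]:.2f}",
                 fontsize=9)
    ax.axis("off")
plt.tight_layout()
plt.show()
```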

Key Takeaways:

  1. Training and Validation Sets: The use of GroupKFold for train-test splitting, where each driver only appears in either the training or validation set, helped reduce model overfitting and led to a more robust model compared to a random train-test split.

  2. Model Architecture: Different CNN model architectures were explored, including a custom CNN model with two CNN layers and dense layers, as well as pre-trained models like VGG16, VGG19, and EfficientNetB0. The VGG16 model with a single dense layer and a dropout layer achieved the lowest log loss on the testing dataset, indicating its effectiveness in classification.

  3. Pre-trained Models: Transfer learning using pre-trained models allowed leveraging the knowledge and features learned from large datasets like ImageNet. The VGG16 model showed promising performance, while the EfficientNetB0 model had a faster runtime but comparatively higher log loss on the testing dataset.

  4. Image Augmentation: The use of image augmentation techniques, such as rotation, shear, and offsets, expanded the dataset and improved model generalization on unseen data. Augmentation was applied only to the training set, not the validation set.

  5. Classification Report and Confusion Matrix Analysis: The classification report and confusion matrix provided insights into the model's per-class performance. Classes c3 and c7 had higher F1-scores, indicating better precision and recall, while classes c8 and c9 had lower F1-scores, reflecting frequent misclassification.

Overall, the model's performance varied across different classes, with some classes showing high accuracy and low log loss while others suffered from misclassification issues. Further analysis and model fine-tuning, such as adjusting hyperparameters or exploring different architectures, may be necessary to improve the model's overall performance and address the challenges faced in specific classes.