Image Captioning with Vision-Language Models (Under review)
Published:
Image captioning is an active area of research in the multi-modal artificial intelligence (AI) community, as it connects vision and language understanding, especially in settings where a model must understand the content shown in an image and generate semantically and grammatically correct descriptions. In this project, we followed a standard deep learning-based image captioning approach using an inject-style encoder-decoder architecture, where the encoder extracts image features and the decoder generates a sequence of words that describes the image content. We investigated five image encoders: ResNet101, InceptionResNetV2, EfficientNetB7, EfficientNetV2M and CLIP. For caption generation, we used a long short-term memory (LSTM) network. The CLIP-LSTM model demonstrated superior performance compared to the CNN-based encoder-decoder models, achieving a BLEU-1 score of 0.904 and a BLEU-4 score of 0.640. Among the CNN-LSTM models, EfficientNetV2M-LSTM performed best, with a BLEU-1 score of 0.896 and a BLEU-4 score of 0.586 while using a single-layer LSTM.
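To make the encoder-decoder pipeline concrete, the sketch below shows one way the conditioning described above could be wired up: precomputed image features (e.g., from CLIP or a CNN backbone) initialize the hidden state of a single-layer LSTM, which then generates the caption token by token. This is a minimal illustrative PyTorch sketch, not the project's actual code; the feature dimension, vocabulary size, and the choice of injecting image features via the initial LSTM state are assumptions.

```python
# Minimal sketch (assumed design, not the authors' implementation): an
# LSTM caption decoder conditioned on precomputed image features.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, feature_dim, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.init_h = nn.Linear(feature_dim, hidden_dim)  # image features -> initial hidden state
        self.init_c = nn.Linear(feature_dim, hidden_dim)  # image features -> initial cell state
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=1, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)       # hidden state -> vocabulary logits

    def forward(self, image_features, captions):
        # image_features: (batch, feature_dim); captions: (batch, seq_len) token ids
        h0 = self.init_h(image_features).unsqueeze(0)     # (1, batch, hidden_dim)
        c0 = self.init_c(image_features).unsqueeze(0)
        embedded = self.embed(captions)                   # (batch, seq_len, embed_dim)
        outputs, _ = self.lstm(embedded, (h0, c0))
        return self.fc(outputs)                           # (batch, seq_len, vocab_size)

# Usage with hypothetical shapes: 512-d image embeddings, 10k-word vocabulary.
decoder = CaptionDecoder(feature_dim=512, vocab_size=10000)
features = torch.randn(4, 512)                # stand-in for encoder output (CLIP or CNN)
tokens = torch.randint(0, 10000, (4, 20))     # teacher-forced caption tokens
logits = decoder(features, tokens)            # (4, 20, 10000)
```

In this kind of setup, swapping the encoder (ResNet101, EfficientNetV2M, CLIP, etc.) only changes `feature_dim` and how `features` are produced; the decoder and training loop stay the same, which is what makes the encoder comparison in the abstract straightforward.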