Training a massively multimodal transformer on YouTube data: pre-training and parameter efficient fine-tuning on HPC infrastructure