Towards Multi-Task Multi-Modal Models: A Video Generative Perspective