Visual grounding: building cross-modal visual-text alignment