Thinking with Video: The Next Leap in Multimodal AI Reasoning
Author(s): Kaushik Rajan

Originally published on Towards AI.

How video generation models like Sora-2 are bridging the gap between static images and dynamic understanding

I still remember the first time I saw a Vision Language Model (VLM) describe a complex image. It felt like magic. But then I asked it to predict what would happen next in a chaotic street scene, and the magic faded. It struggled. It could see the “now,” but it was blind to the […]