GeoVideo: Introducing Geometric Regularization into Video Generation Model

Anonymous Submission
Anonymous Authors

Abstract

Recent advances in video generation have enabled the synthesis of high-quality, visually realistic clips using diffusion transformer models. However, most existing approaches operate purely in 2D pixel space and lack explicit mechanisms for modeling 3D structure, which often results in temporally inconsistent geometry, implausible motion, and structural artifacts. In this work, we introduce geometric regularization losses into video generation by augmenting latent diffusion models with per-frame depth prediction. We adopt depth as the geometric representation because depth prediction has progressed substantially and depth maps are compatible with image-based latent encoders. To enforce structural consistency over time, we propose a multi-view geometric loss that aligns the predicted depth maps across frames within a shared 3D coordinate system. Our method bridges the gap between appearance generation and 3D structure modeling, leading to improved spatio-temporal coherence, shape consistency, and physical plausibility. Experiments across multiple datasets show that our approach produces significantly more stable and geometrically consistent results than existing baselines.
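To make the idea of aligning depth maps across frames in a shared 3D coordinate system concrete, the following is a minimal NumPy sketch of one possible cross-frame depth consistency term. It is not the loss used in this work, whose exact form is not given here. The function names `unproject` and `cross_frame_depth_loss`, the pinhole intrinsics `K`, and the known relative pose `T_i_to_j` are all assumptions made for illustration. The sketch lifts frame i's depth to 3D points, moves them into frame j's camera, projects them back to pixels, and compares the resulting depths with frame j's own depth map.

```python
import numpy as np


def unproject(depth, K):
    """Lift an (H, W) depth map to (H*W, 3) camera-space points (assumed pinhole model)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates, each of shape (H, W)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T  # back-projected rays with z = 1
    return rays * depth.reshape(-1, 1)


def cross_frame_depth_loss(depth_i, depth_j, K, T_i_to_j):
    """Mean absolute gap between frame i's depth warped into frame j and frame j's depth.

    T_i_to_j is a hypothetical 4x4 rigid transform from camera i to camera j.
    """
    h, w = depth_j.shape
    pts_i = unproject(depth_i, K)
    pts_j = pts_i @ T_i_to_j[:3, :3].T + T_i_to_j[:3, 3]  # same points, frame j coordinates
    z = pts_j[:, 2]
    front = z > 1e-6  # keep points in front of camera j
    proj = pts_j[front] @ K.T
    u = np.round(proj[:, 0] / proj[:, 2]).astype(int)
    v = np.round(proj[:, 1] / proj[:, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)  # drop points leaving the image
    if not inside.any():
        return 0.0
    return float(np.mean(np.abs(depth_j[v[inside], u[inside]] - z[front][inside])))


if __name__ == "__main__":
    K = np.array([[50.0, 0.0, 16.0], [0.0, 50.0, 12.0], [0.0, 0.0, 1.0]])
    flat = np.full((24, 32), 2.0)  # a fronto-parallel plane 2 units away
    print(cross_frame_depth_loss(flat, flat, K, np.eye(4)))
```

This sketch uses nearest-pixel lookup and ignores occlusion, so it only illustrates the geometry. A training loss would need a differentiable implementation in an autodiff framework and some handling of occluded or dynamic regions.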

 


Static Scene (T2V)

The video shows a cluttered office space with wooden desks and chairs. The room is filled with natural light from large windows.

CogVideoX CogVideoX-Tuned GeoVideo (Ours) Depth (Ours)

The video features an indoor Nike sportswear store with a modern and stylish design. Initially, two mannequins in casual athletic wear are displayed on a grey platform, with a backdrop of sneakers and sportswear. As time passes, the mannequins change outfits, including a purple hoodie and black pants, and a white hoodie with black pants. The store's ambiance is highlighted by track lighting, a clean, monochromatic color scheme, and a large 'swoosh' logo. The layout includes a variety of sneakers on wooden shelves and racks, with the brand's iconic logo and slogans like 'Just Do It' and 'Elevate' visible. The store's design is contemporary, with a focus on casual athletic fashion.

CogVideoX CogVideoX-Tuned GeoVideo (Ours) Depth (Ours)

The video takes us through an Apple Store, showcasing a variety of electronic devices and accessories. Initially, we see iPhones, tablets, and accessories on white shelves. As we move through the store, the focus shifts to different products. The store's modern design is highlighted by bright lighting, a clean layout, and a color scheme of whites and grays, with occasional pops of color from product packaging.

CogVideoX CogVideoX-Tuned GeoVideo (Ours) Depth (Ours)

The video features a serene park with a central tree. The tree is surrounded by a grassy field with white flowers and a winding path, leading to a grove of trees with rich foliage. The clear blue sky enhances the tranquility of the scene, which is devoid of people and wildlife. The lighting varies, implying it could be early morning or late afternoon, and the overall mood is one of peaceful solitude.

CogVideoX CogVideoX-Tuned GeoVideo (Ours) Depth (Ours)

The video features a serene residential area with a traditional Chinese architectural entrance gate, marked by a golden signboard and red columns, surrounded by a circular pattern of grey tiles and pebbles. A central stone monument with a conical cap is a focal point, with manicured greenery and a solitary tree enhancing the tranquil atmosphere. As time passes, the scene includes a manicured lawn, a decorative tree, and a stone stump, with the backdrop of high-rise buildings suggesting an urban setting. The area is bathed in sunlight, highlighting the harmonious blend of traditional and modern elements.

CogVideoX CogVideoX-Tuned GeoVideo (Ours) Depth (Ours)

The video features a series of shots of a monument with golden Chinese characters in a tranquil, well-maintained plaza. The monument is surrounded by a multi-story building with a white facade and red brick accents, and a traditional street lamp. The setting is serene, with a clear blue sky and mature trees, suggesting a historical or educational institution. As the frames advance, the monument's inscription changes, and the surrounding buildings vary in architectural style, including a cylindrical stone monument and a classical building with a red brick facade. The plaza is consistently empty, emphasizing the monument's prominence and the peaceful atmosphere.

LucidDreamer Director3D SplatFlow GeoVideo (Ours)

The video features a serene park with two green topiary swans forming a heart shape with their necks, set against a backdrop of a large red flower-shaped ground design and a blue playground area. The swans are surrounded by a tranquil setting with leafless trees, a distant water body, and a traditional Chinese pavilion. As time passes, the scene remains largely unchanged, maintaining a calm and romantic atmosphere. The park is quiet, with no people or animals present, and the soft lighting suggests it might be early morning or late afternoon. The video concludes with a unique green topiary sculpture resembling a smiling face, standing on a vibrant flower-shaped pathway.

LucidDreamer Director3D SplatFlow GeoVideo (Ours)

The video features a statue of Sun Yat-sen in various serene park settings, with a backdrop of modern high-rise buildings and a clear sky. The statue, depicted in traditional Chinese attire, is surrounded by a circular bed of purple and brown shrubs, a red archway with golden Chinese characters, and a white informational plaque. As time passes, the scene includes a red and gold '20' sign, suggesting a commemorative event, and a pathway inviting visitors to explore. The park's atmosphere is tranquil, with a mix of historical reverence and contemporary urban development.

LucidDreamer Director3D SplatFlow GeoVideo (Ours)

 


Dynamic Video (T2V)

A wasp with yellow and black markings is seen pollinating a dandelion flower amidst a lush meadow. The scene unfolds under the bright midday sun, highlighting the vivid colors and intricate details of the wasp and flower. As time passes, the wasp continues its pollination efforts, with the background of green grass and other dandelions softly blurred, emphasizing the serene and dynamic interaction between the insect and its floral host. The video captures the essence of nature's daily ballet, showcasing the wasp's delicate movements and the tranquil beauty of the meadow.

CogVideoX GeoVideo (Ours) Depth (Ours)

A solitary swan with pristine white plumage and an orange beak glides across a tranquil lake, its reflection mirroring on the water's surface. The surrounding landscape is serene, featuring dense, dark foliage and a forest of slender trees with exposed roots, suggesting a secluded natural habitat. As the swan continues its graceful journey, the environment remains undisturbed by human presence, highlighting the peaceful solitude of the scene. The swan's elegant posture and the stillness of the water convey a mood of tranquility and natural beauty, with the overcast lighting enhancing the serene atmosphere.

CogVideoX GeoVideo (Ours) Depth (Ours)

Two seagulls are observed on a tranquil beach, engaging in various activities such as standing, walking, and foraging in the shallow waters. The scene is serene, with the seagulls' white and grey feathers contrasting against the muted beige sands and soft blue-green hues of the sea. The overcast sky enhances the calm atmosphere, contributing to the absence of human presence. Gentle waves wash over the shore, and the seagulls' relaxed postures suggest a peaceful coexistence within their natural coastal habitat. The lighting indicates it might be early morning or late afternoon, adding to the quietude of the setting.

CogVideoX GeoVideo (Ours) Depth (Ours)

A juvenile swan with grey and white plumage is seen engaging in playful behavior in a tranquil pond, repeatedly dipping its beak and feet into the water. The swan's actions are highlighted by the gentle ripples on the water's surface and the soft lighting, which enhances the serene atmosphere. As time passes, another swan, possibly an adult, appears in the background, observing the young swan's antics. Eventually, a group of juvenile swans is shown, with one engaging in playful behavior while another watches, against a backdrop of lush greenery, suggesting a peaceful, natural habitat.

CogVideoX GeoVideo (Ours) Depth (Ours)

A tiger stands in stark contrast against the darkness, its silhouette and striped fur highlighted by a soft light that casts deep shadows, creating an atmosphere of mystery and solitude. The tiger's gaze is consistently directed off-camera, suggesting a silent vigilance and a serene yet powerful presence. The scene remains devoid of any other elements, focusing solely on the tiger's majestic form and the dramatic interplay of light and shadow. The tiger's poised stance and the subtle interplay of light and shadow accentuate its solitary figure, evoking a sense of quiet anticipation and the wild's enigmatic allure.

CogVideoX GeoVideo (Ours) Depth (Ours)

 


Dynamic Video (I2V)

The video shows a dog walking directly toward the camera with a calm and steady gait. Its eyes are fixed forward, and its tail moves slightly, suggesting curiosity or familiarity. The setting is softly lit, with the dog’s approach gradually filling the frame and creating a sense of intimacy and engagement.

Input
CogVideoX GeoVideo (Ours) Depth (Ours)

The video features a cow gently shifting its head from side to side while standing in a quiet rural field. The animal’s movements are slow and deliberate, with subtle changes in gaze and posture suggesting attentiveness to its surroundings. The soft ambient lighting and stillness of the background create a calm and pastoral atmosphere.

Input
CogVideoX GeoVideo (Ours) Depth (Ours)

Input Prompt.

The video captures a white horse in mid-run, its hooves kicking up dust as it moves across a sun-drenched field. The flowing mane and tail add to the sense of movement, while the soft lighting and open landscape highlight the horse's form and motion.
Input
CogVideoX GeoVideo (Ours) Depth (Ours)

The video captures a calm scene where a cow slowly tilts and turns its head, highlighting its quiet presence in a rural setting. The soft ambient sounds and static background contrast with the subtle motion, emphasizing a sense of quiet observation and stillness.

Input
CogVideoX GeoVideo (Ours) Depth (Ours)

 


Ablation Study

w/o Lgeo (geometric loss).

Input
w/o Lgeo GeoVideo (Ours) Depth (Ours)

The video takes us through a corridor in a school, showcasing a blend of educational and creative elements. Initially, the corridor is adorned with green plants, red lanterns, and a whiteboard with colorful drawings. As we move forward, the corridor features a variety of potted plants, dried botanicals, and framed pictures, with a whiteboard displaying educational notes. The corridor is well-lit, with natural light enhancing the warm atmosphere. Decorative items like woven baskets, wooden sculptures, and colorful beads are visible, alongside a whiteboard with Chinese characters. The scene transitions to a garden center with artificial plants, dried flowers, and a rustic wooden bench, concluding with a view of a bare tree outside, suggesting a cold season.

w/o Lgeo GeoVideo (Ours) Depth (Ours)

w/o MP.

An elderly couple, dressed in winter attire, explores a bustling indoor market, focusing on a deli counter filled with cured meats and cheeses. They examine the products, including hanging sausages, cured meats, and cheeses, under warm lighting that highlights the textures and colors of the food. The scene is set against a backdrop of wooden shelves stocked with various goods, contributing to the market's traditional and authentic atmosphere. The couple's interaction with the stall and each other suggests a shared interest in the culinary offerings, amidst a lively market environment.

w/o MP GeoVideo (Ours) Depth (Ours)

Ducks swim in a tranquil lake, their reflections mirroring on the calm surface. The surrounding landscape is a serene blend of greenery and earthy tones, with the water's edge showing signs of algae or aquatic plants. As time passes, the ducks continue their peaceful swim, undisturbed by any other creatures or human presence. The scene is consistently tranquil, with the ducks' dark feathers contrasting against the muted colors of the environment. The lighting suggests it could be early morning or late afternoon, casting a soft glow over the natural setting.

w/o MP GeoVideo (Ours) Depth (Ours)

 


Reconstruction Results

Reconstruction of the generated videos.