PhyGenBench

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Fanqing Meng^*,¹^,² , Jiaqi Liao^*,², Xinyu Tan, Wenqi Shao²^,^†, Quanfeng Lu², Kaipeng Zhang², Yu Cheng⁴^, Dianqi Li, Yu Qiao¹, Ping Luo³^,²^,^†

¹Shanghai Jiao Tong University, ²OpenGVLab, Shanghai AI Laboratory
³The University of Hong Kong, ⁴The Chinese University of Hong Kong,

Because the number of videos uploaded is large, the loading may be slow. If this happens, please be patient and wait for a while.

Introduction

Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which could comprehensively assesses models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic physical phenomenons). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications.

Samples of videos generated by Kling or Gen-3 in PhyGenBench with 4 different aspects. The results show that current T2V models struggle to generate videos that align with physical commonsense (e.g., the lack of a plane's reflection in water in the first video of the second row).

Benchmark Construction

An illustration of our data construction pipeline which divides into 4 stages.First, We select key physical laws and manually craft initial prompts that reflect the corresponding physical phenomena in Prompt Engineering.Then GPT-4o adds details in Prompt Augmentation and enhances diversity by varying objects in Diversity Enhancement. Finally, After manual review in Quality Control, we obtain 160 T2V prompts.

Overview

An overview of the proposed PhyGenEval. PhyGenEval is divided into three parts: Key Physical Phenomena Detection, Physics Order Verification, and Overall Naturalness Evaluation. Each part uses an appropriate VLM in combination with physical-based customized questions generated by GPT-4o. The final score is the combined result of the three parts. For the example in the figure, the three-stage scores are 0, 1 (only q1 is correct), and 0. The final score is calculated as 0

Comparison Results

The comparison between PhyGenEval and existing T2V evaluation metrics including Videophy,VideoScore and DEVIL in PhyGenBench.The score generated by the metric reflects the degree to which the video conforms to physics.The higher the score, the more the video conforms to physics.

Therefore,as shown in image, except for the proposed PhyGenEval, the current methods cannot reasonably assess the correctness of physical commosense in videos from PhyGenBench.In other words, PhyGenEval is much more closely aligned with human judgement.

Comparison Results Video Display

(a) A delicate, fragile egg is hurled with significant force towards a rugged, solid rock surface, where it collides upon impact

Expected Phenomenon: the egg should break upon impact while the stone remains intact because the hardness of the egg is much lower than that of stone

(b) A cup of oil is slowly poured out in the space station, releasing the liquid into the surrounding area

Expected Phenomenon: the oil in the cup should float in the air due to loss of gravity in the space station

(c) A stone is gently placed on the surface of a pool filled with water.

Expected Phenomenon: the stone should sink to the bottom of the pool due to high density

(d) A timelapse captures the reaction a s concentrated sulfuric acid is poured onto a piece of bread.

Expected Phenomenon: the bread should shrink, turn black and become carbonized due to the dehydrating effect of concentrated sulfuric acid

(e) A piece of copper is ignited, emitting a vivid and unique flame as it burns steadily.

Expected Phenomenon: because the flame color reaction of copper is green, the color of the flame should be green

(f) A kite is soaring above a smooth and tranquil pond

Expected Phenomenon: with the smooth and tranquil pond acting as a mirror,there should be the reflections of both the kite and the green mountains

Quantitative Evaluation

Model	Size	Mechanics(↑)	Optics(↑)	Thermal(↑)	Material(↑)	Average(↑)	Human(↑)
CogVideoX	2B	0.38	0.43	0.34	0.39	0.39	0.31
CogVideoX	5B	0.39	0.55	0.40	0.42	0.45	0.37
Open-Sora V1.2	1.1B	0.37	0.44	0.37	0.37	0.44	0.35
Lavie	860M	0.30	0.44	0.38	0.32	0.36	0.30
Vcthietz2.0	2B	0.41	0.56	0.44	0.37	0.45	0.36
Pika	-	0.35	0.46	0.39	0.39	0.39	0.36
Gen-3	-	0.45	0.57	0.49	0.51	0.51	0.48
Kling	-	0.45	0.58	0.50	0.40	0.49	0.44

We conduct extensive experiments on a wide range of popular video generation models. As illustrated in the Table above, even the best-performing model, Gen-3, only attains a PCA score of 0.51 on PhyGenBench. This indicates that even for prompts containing obvious physical commonsense, current T2V models struggle to generate videos that comply with intuitive physics.It indirectly reflects that these models are still far from achieving the world simulator.

Furthermore, we identify the following key observations:

1): Across various categories of physical commonsense, all models consistently demonstrate superior performance in the domain of optics compared to other areas. Notably, Vchitect2.0 and CogVideoX-5b achieve a PCA score in the optics domain comparable to that of closed-source models. We posit that this superior performance in the optics domain can be attributed to the abundant and explicit representation of optical knowledge in pre-training datasets, thereby enhancing the model's comprehension in this area.

2): Kling and Gen-3 exhibit significantly higher performance compared to other models. Specifically, Gen-3 demonstrates a robust understanding of material properties, achieving a score of 0.51, which substantially surpasses other models. Kling performs particularly well in thermodynamics, attaining the highest score of 0.50 in this domain.

3): Among open-source models, Vchitect2.0 and CogVideoX 5b perform comparatively well, both exceeding the performance level of Pika. In contrast, Lavie consistently exhibits lower physical correctness across all categories.

Qualitative Evaluation

The different video cases for 4 physical commonsense categories are illustrated in figure above. Our main observations are as follows:

In mechanics, the models struggle to generate simple physically accurate phenomenons. As shown in figure above, all models fail to depict the glass ball sinking in water. As for (b), instead showing it floating on the surface, OpenSora and Gen-3 even produce videos where the ball is suspended. Additionally, the models do not capture special physical phenomenons, such as the state of water in zero gravity, as seen in (a).

In optics, the models perform relatively better. (c) and (d) show the models handling reflections of balloons in water and colorful bubbles, though OpenSora and CogVideoX still produce reflections with noticeable distortions in (d).

In thermal, the models fail to generate accurate videos of phase transitions. For the melting phenomenon in (e), most models show incorrect results, with CogVideoX even producing a video where the ice cream increases in size. Similar errors appear in the sublimation process in (f), with only Gen-3 showing partial understanding.

Regarding material properties, (g) shows all models failing to recognize that an egg should break when hitting a rock, with Kling displaying the egg bouncing like a rubber ball. For simple chemical reactions, such as the black bread experiment in (h), none of the models demonstrate an accurate understanding of the expected reaction.

In conclusion, current models perform relatively well in generating optical phenomenons but are weaker in mechanics, thermal, and material properties.

Qualitative Evaluation Video Display

(a) A cup of oil is slowly poured out in the space station, releasing the liquid into the surrounding area

Expected Phenomenon: the oil in the cup should float in the air due to loss of gravity in the space station

(b) A solid glass ball is gently placed on the surface of a bathtub filled with water.

Expected Phenomenon: the glass ball should sink in to the bottom of the bathtub due to higher density than that of water

Expected Phenomenon: there should be a reflection of the ballon in the ocean

(d) A large number of soap bubbles are floating in the air under the sunlight.

Expected Phenomenon: the ice cream should melt due to inferenece & diffraction of sunlight

(e) A timelapse captures the gradual transformation of ice cream as the temperature rises significantly above 100 degree Celsius .

Expected Phenomenon: the ice cream should melt due to high temperature

(f) A timelapse captures the transformation of dry ice as it is exposed to a significantly increasing temperature at room temperature

Expected Phenomenon: the dry ice should sublime due to high temperature with white fog surrounding

(g) A delicate, fragile egg is hurled with significant force towards a rugged, solid rock surface, where it collides upon impact

Expected Phenomenon: the egg should break upon impact while the stone remains intact because stone is much more harder than egg

(h) A timelapse captures the reaction as concentrated sulfuric acid is poured onto a piece of bread.

Expected Phenomenon: the bread should shrink, turn black and become carbonized due to dehydration of concentrated sulfuric acid

PhyGenBench

Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation

Overview

🔔News

Introduction

PhyGenBench

Benchmark Construction

PhyGenEval

Overview

Experiment Results

Comparison Results

Comparison Results Video Display

Quantitative Evaluation

Qualitative Evaluation

Qualitative Evaluation Video Display

Discussion