[1] Zheng Q, Liu D, Wang C, et al. ESceme: Vision-and-language navigation with episodic scene memory[J]. International Journal of Computer Vision (IJCV), 2024: 1-21.
[2] Li K, Yu B, Zheng Q, et al. MuEP: A multimodal benchmark for embodied planning with foundation models[C]//Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI). 2024.
[3] Zhang H, Liu D, Zheng Q, et al. Modeling video as stochastic processes for fine-grained video representation learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2023: 2225-2234.
[4] Zheng Q, Wang C, Tao D. Syntax-aware action targeting for video captioning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2020: 13096-13105.
[5] Yu C, Zhao X, Zheng Q, et al. Hierarchical bilinear pooling for fine-grained visual recognition[C]//Proceedings of the European Conference on Computer Vision (ECCV). 2018: 574-589.