Automated Model Discovery via Multi-modal & Multi-step Pipeline
Lee Jung-Mok, Nam Hyeon-Woo, Moon Ye-Bin, Junhyun Nam, Tae-Hyun Oh
NeurIPS 2025, IPIU 2025 (Best Poster Award!!)
[project page]
[arxiv]
[code]
We present a multi-modal & multi-step pipeline for effective automated model discovery, driven by multimodal LLM agents. We frame the interpretation of time-series data as Gaussian Process kernel discovery.
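For a rough sense of the underlying search, here is a minimal compositional kernel-search sketch built on scikit-learn's Gaussian Process tools; the base kernels, greedy one-step expansion, and log-marginal-likelihood scoring are illustrative assumptions, not our pipeline's actual agent-guided implementation.

```python
# Minimal compositional GP kernel search (illustrative sketch, not the paper's pipeline).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ExpSineSquared, RationalQuadratic, WhiteKernel

# Toy time series: periodic signal plus a slow trend and noise.
X = np.linspace(0, 10, 200)[:, None]
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * X[:, 0] + 0.05 * np.random.randn(200)

base_kernels = [RBF(), ExpSineSquared(), RationalQuadratic()]

def score(kernel):
    """Fit a GP and return the log marginal likelihood as a model-selection score."""
    gp = GaussianProcessRegressor(kernel=kernel + WhiteKernel(), normalize_y=True)
    gp.fit(X, y)
    return gp.log_marginal_likelihood(gp.kernel_.theta)

# Pick the best single base kernel, then try one greedy sum/product expansion,
# standing in for the multi-step, agent-guided search described above.
best_kernel, best_score = None, -np.inf
for k in base_kernels:
    s = score(k)
    if s > best_score:
        best_kernel, best_score = k, s

current = best_kernel
for k in base_kernels:
    for combined in (current + k, current * k):
        s = score(combined)
        if s > best_score:
            best_kernel, best_score = combined, s

print("Selected kernel:", best_kernel, "| log marginal likelihood:", best_score)
```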
Training-free Multimodal Embedding for Structure-Aware Retrieval of Scalable Vector Graphics and Images
Kyeongseon Kim, Baek Seong-Eun, Lee Jung-Mok, Tae-Hyun Oh
WACV 2026 (Round 1 Acceptance!!)
[project page]
[arxiv]
[code]
We propose the first training-free multimodal embedding method that uses a Multimodal Large Language Model (MLLM) to project text, images, and SVG code into an aligned space.
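Once every modality is embedded, retrieval reduces to nearest-neighbour search in the shared space. Below is a minimal cosine-similarity sketch in which a small text LM's mean-pooled hidden states stand in for the MLLM-based embedding; the model name, pooling choice, and toy SVG corpus are assumptions for illustration only.

```python
# Retrieval in a shared embedding space (illustrative stand-in, not the paper's method).
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # placeholder small LM; the paper uses a multimodal LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.eval()

@torch.no_grad()
def embed(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states into a single embedding vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    hidden = model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

# Rank SVG documents against a text query by cosine similarity.
query = "a red circle inside a blue square"
svg_corpus = [
    '<svg><rect width="80" height="80" fill="blue"/><circle cx="40" cy="40" r="20" fill="red"/></svg>',
    '<svg><line x1="0" y1="0" x2="100" y2="100" stroke="green"/></svg>',
]
q = embed(query)
scores = [torch.cosine_similarity(q, embed(svg), dim=0).item() for svg in svg_corpus]
print(sorted(zip(scores, svg_corpus), reverse=True)[0])
```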
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
Kim Sung-Bin*, Oh Hyun-Bin*, Lee Jung-Mok, Arda Senocak, Joon Son Chung, Tae-Hyun Oh
ICLR 2025
[project page]
[arxiv]
[code]
We introduce a comprehensive audio-visual hallucination benchmark specifically designed to evaluate the perception and comprehension capabilities of audio-visual LLMs.
Ongoing Projects
These include coursework, side projects, and unpublished research.
Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization
Baek Seong-Eun, Lee Jung-Mok, Kim Sung-Bin, Tae-Hyun Oh
ICLR submission
2025-09-20
We present an approach that integrates LLMs with Bayesian Optimization to efficiently tune LoRA hyperparameters. This method enables adaptive search over the hyperparameter space, improving fine-tuning performance and computational efficiency.
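A minimal sketch of the Bayesian Optimization loop over LoRA hyperparameters using scikit-optimize is shown below; `finetune_and_evaluate` is a hypothetical placeholder (replaced by a synthetic surrogate so the sketch runs), and the language-aided prior is only indicated in comments.

```python
# Bayesian Optimization over LoRA hyperparameters (illustrative sketch).
import math
from skopt import gp_minimize
from skopt.space import Integer, Real

# In the language-aided variant, an LLM would propose or narrow these ranges
# (and warm-start points) from a textual description of the task and model.
search_space = [
    Integer(4, 64, name="lora_rank"),
    Real(8.0, 64.0, name="lora_alpha"),
    Real(1e-5, 1e-3, prior="log-uniform", name="learning_rate"),
]

def finetune_and_evaluate(rank, alpha, lr):
    """Hypothetical placeholder for a real LoRA fine-tuning + validation run.
    A smooth synthetic surrogate keeps this sketch runnable end to end."""
    return (math.log(rank) - math.log(16)) ** 2 + (alpha / rank - 2.0) ** 2 + (math.log10(lr) + 4) ** 2

def objective(params):
    lora_rank, lora_alpha, learning_rate = params
    return finetune_and_evaluate(lora_rank, lora_alpha, learning_rate)

result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
print("Best hyperparameters:", result.x, "| best validation loss:", result.fun)
```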
Does Visual Arrow Improve Motion Comprehension in Multimodal Large Language Models?
Baek Seong-Eun, Nam Hyeon-Woo, Lee Jung-Mok, Tae-Hyun Oh
ongoing
2025-03-20
In ViMP, we hand-draw a visual arrow on a sampled frame of a video. ViMP conveys motion information through visual context, reducing the need for multiple frames.
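For context, overlaying a motion arrow on a frame is straightforward with OpenCV; the frame and coordinates below are synthetic examples, and in ViMP the arrows are drawn by hand rather than generated.

```python
# Overlay a motion arrow on a single video frame (illustrative only).
import cv2
import numpy as np

frame = np.zeros((360, 640, 3), dtype=np.uint8)  # stand-in for a sampled video frame
start, end = (100, 180), (400, 120)              # arrow from the object's old to new position

cv2.arrowedLine(frame, start, end, color=(0, 0, 255), thickness=4, tipLength=0.1)
cv2.imwrite("frame_with_arrow.png", frame)
```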
SMILE-NEXT: An Enhanced Multimodal Dataset and Model for Laughter Understanding
Lee Jung-Mok, Kim Sung-Bin, Lee Hyun, Tae-Hyun Oh
ongoing
2024-10-15
[project page]
[code]
We introduce SMILE-NEXT, a comprehensive corpus combining audio, visual, and textual cues for laughter understanding across diverse contexts. We also propose the Laugh Expert MoE, a lightweight yet expressive architecture that efficiently models laughter perception.
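A minimal mixture-of-experts layer in PyTorch, illustrating the gating-plus-experts pattern behind the Laugh Expert MoE; the dimensions, expert count, and soft softmax routing are assumptions for this sketch, not the actual architecture.

```python
# Minimal mixture-of-experts layer (illustrative; not the actual Laugh Expert MoE).
import torch
import torch.nn as nn

class MoELayer(nn.Module):
    def __init__(self, dim: int = 256, num_experts: int = 4, hidden: int = 512):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)  # router scoring each expert per input
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Soft routing: weight every expert by the gate's softmax.
        weights = torch.softmax(self.gate(x), dim=-1)                  # (batch, num_experts)
        expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, num_experts, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)         # (batch, dim)

# Fused audio-visual-text features would be fed through such a layer.
features = torch.randn(8, 256)
print(MoELayer()(features).shape)  # torch.Size([8, 256])
```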