Efficient Prompt Engineering for Large Foundation Models

Xu, Siyu

Access status:

Open Access

Type

Thesis

Thesis type

Masters by Research

Author/s

Xu, Siyu

Abstract

Rapid advances in large foundation models, particularly GPT-4V, have unlocked significant potential in applications such as visual recognition. However, the high computational and financial costs of GPT-4V’s inference remain substantial barriers to its widespread use. In response, this thesis introduces "Collage Prompting", a novel and budget-friendly prompt engineering technique that concatenates multiple images into a single visual input. Because GPT-4V processes several images in one inference call, the approach reduces inference costs and opens new avenues for more efficient use of large-scale models in real-world scenarios. The thesis further investigates how the arrangement of images within the collage prompt influences recognition accuracy. We present a framework that uses a graph-based predictor to optimize the placement of images, improving the model’s performance by selecting the most favorable configuration. To facilitate future research in this area, we introduce CollagePrompt, a comprehensive benchmark designed to evaluate the cost-effectiveness and recognition performance of collage prompts. The benchmark provides a platform for testing various image arrangements and includes a baseline optimization technique derived from genetic algorithms. Through extensive experiments across diverse datasets, we demonstrate that collage prompts with optimized image layouts significantly outperform randomly arranged ones in both accuracy and cost-efficiency. Moreover, two key metrics are proposed to measure the effectiveness of different collage configurations. This research contributes to the emerging field of prompt engineering by offering a practical solution that enhances the economic viability of large foundation models like GPT-4V without compromising their visual recognition capabilities.
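The idea of packing several images into one visual input can be illustrated with a minimal sketch. It is not the thesis's implementation: the function names, the 224-pixel tile size, and the flat per-request cost model are all assumptions chosen for illustration. The sketch computes where each image would sit in a grid collage, how the cost of one request is shared among the images, and how many distinct arrangements an optimizer would have to choose from.

```python
from itertools import permutations


def collage_layout(n_images, cols, tile_w, tile_h):
    """Return (canvas_size, offsets) for a row-major grid collage.

    offsets[i] is the top-left pixel position of image i on the canvas.
    Hypothetical helper; tile sizes are illustrative assumptions.
    """
    if n_images <= 0 or cols <= 0:
        raise ValueError("n_images and cols must be positive")
    rows = -(-n_images // cols)  # ceiling division
    offsets = [((i % cols) * tile_w, (i // cols) * tile_h)
               for i in range(n_images)]
    return (cols * tile_w, rows * tile_h), offsets


def cost_per_image(cost_per_request, n_images):
    """Amortized cost, assuming one request costs the same for any collage.

    This flat cost model is an assumption; real pricing may depend on
    image resolution.
    """
    return cost_per_request / n_images


# A 2x2 collage of four 224x224 tiles.
canvas, offsets = collage_layout(4, cols=2, tile_w=224, tile_h=224)

# Each ordering of the four images over the four slots is one arrangement.
n_arrangements = sum(1 for _ in permutations(range(4)))
```

Under these assumptions a 2x2 collage divides the per-request cost by four, and even four images admit 24 orderings, which suggests why the arrangement is treated as something to optimize rather than enumerate by hand.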

Date

2025

Rights statement

The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.

Faculty/School

Faculty of Engineering, School of Computer Science

Awarding institution

The University of Sydney

Subjects

Large Language Models
Large Multimodal Models
GPT-4V
Efficient Prompt Engineering