Efficient Prompt Engineering for Large Foundation Models
Access status:
Open Access
Type
Thesis
Thesis type
Masters by Research
Author/s
Xu, Siyu
Abstract
The rapid advancements in large foundation models, particularly GPT-4V, have unlocked significant potential in various applications, including visual recognition tasks. However, the high computational and financial costs associated with GPT-4V’s inference remain substantial barriers to its widespread use. In response to these challenges, this thesis introduces "Collage Prompting", a novel and budget-friendly prompt engineering technique that concatenates multiple images into a single visual input. By allowing GPT-4V to process several images simultaneously, this approach not only reduces inference costs but also opens new avenues for more efficient utilization of large-scale models in real-world scenarios. The thesis further investigates the influence of image arrangement within the collage prompt on recognition accuracy. We present a framework that uses a graph-based predictor to optimize the placement of images, improving the model’s performance by ensuring the most favorable configuration. To facilitate future research in this area, we introduce CollagePrompt, a comprehensive benchmark designed to evaluate the cost-effectiveness and recognition performance of collage prompts. This benchmark provides a platform for testing various image arrangements and includes a baseline optimization technique derived from genetic algorithms. Through extensive experimentation across diverse datasets, we demonstrate that collage prompts with optimized image layouts significantly outperform randomly arranged ones in terms of both accuracy and cost-efficiency. Moreover, two key metrics are proposed to measure the effectiveness of different collage configurations. This research contributes to the emerging field of prompt engineering by offering a practical solution that enhances the economic viability of large foundation models like GPT-4V, without compromising their visual recognition capabilities.
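The concatenation step described in the abstract can be illustrated with a minimal sketch. It assumes equally sized images tiled row by row on a square grid, and it represents each image as a toy list of pixel rows so that no imaging library is needed. The names `make_collage`, `grid`, and `order` are hypothetical and are not taken from the thesis; the graph-based predictor and the genetic-algorithm baseline are not reproduced here.

```python
from typing import List, Sequence

Image = List[List[int]]  # toy grayscale image: a list of pixel rows


def make_collage(images: Sequence[Image], grid: int, order: Sequence[int]) -> Image:
    """Tile grid*grid equally sized images into one image.

    `order` is a permutation of image indices; position k of the grid
    (row-major) receives images[order[k]]. This is an illustrative
    assumption about how an arrangement could be encoded.
    """
    n = grid * grid
    if len(images) != n or sorted(order) != list(range(n)):
        raise ValueError("need grid*grid images and a permutation of their indices")
    height = len(images[0])
    collage: Image = []
    for block_row in range(grid):
        # Images that share this row of the grid, left to right.
        cells = [images[order[block_row * grid + col]] for col in range(grid)]
        for y in range(height):
            row: List[int] = []
            for cell in cells:
                row.extend(cell[y])  # append this image's y-th pixel row
            collage.append(row)
    return collage


# Four hypothetical 2x2 "images", each filled with its own index.
images = [[[i, i], [i, i]] for i in range(4)]
collage = make_collage(images, grid=2, order=[2, 0, 3, 1])
```

Changing `order` changes where each image sits in the collage while the set of images stays the same, which is the degree of freedom the abstract says an arrangement optimizer would search over.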
Date
2025
Rights statement
The author retains copyright of this thesis. It may only be used for the purposes of research and study. It must not be used for any other purposes and may not be transmitted or shared with others without prior permission.
Faculty/School
Faculty of Engineering, School of Computer Science
Awarding institution
The University of Sydney