
SoM : Set-of-Mark Prompting
Unleashes Extraordinary Visual Grounding in GPT-4V

Jianwei Yang*⚑, Hao Zhang*, Feng Li*, Xueyan Zou*, Chunyuan Li, Jianfeng Gao

Microsoft Research, Redmond; HKUST; University of Wisconsin-Madison

*Core Contribution, ⚑Project Lead

[arXiv]    [Code]




Introduction

We present Set-of-Mark (SoM) prompting, which simply overlays a number of spatial and speakable marks on an image to unleash the visual grounding abilities of large multimodal models (LMMs) such as GPT-4V.
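As a rough illustration of how marks might be assigned to image regions, the following Python sketch labels each region with a number and places the label at the region's centroid. It assumes the regions are already available as binary masks (how they are produced is not specified here). The names `mask_centroid` and `assign_marks`, and the largest-first numbering, are hypothetical choices for this sketch, not the ToolBox's actual implementation.

```python
from typing import List, Tuple

# A binary mask as nested lists: 1 inside the region, 0 outside (assumed format).
Mask = List[List[int]]


def mask_centroid(mask: Mask) -> Tuple[int, int]:
    """Return the (row, col) centroid of a binary mask as integer pixel coordinates."""
    points = [(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v]
    if not points:
        raise ValueError("mask is empty")
    row = round(sum(p[0] for p in points) / len(points))
    col = round(sum(p[1] for p in points) / len(points))
    return row, col


def assign_marks(masks: List[Mask]) -> List[Tuple[str, Tuple[int, int]]]:
    """Give each region a speakable numeric label ("1", "2", ...) and a position.

    Regions are numbered largest-first; this ordering is an illustrative
    assumption, not something the source prescribes.
    """
    ordered = sorted(masks, key=lambda m: sum(map(sum, m)), reverse=True)
    return [(str(i), mask_centroid(m)) for i, m in enumerate(ordered, start=1)]
```

Each returned pair gives the label text and the pixel at which a drawing library could render it. For non-convex regions the centroid can fall outside the mask, so a real tool would likely need a more careful placement rule.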


Image


SoM ToolBox

We provide a ToolBox [Instruction] to help you generate marked image prompts. Try SoM prompts with GPT-4V in ChatGPT!

# Text prompts to GPT-4V for the SoM examples:

- Example 1: What is on the left side of the right laptop?
- Example 2: Can you count how many fruits and what are the categories in the basket?


Examples

GPT-4V Mobile (left) vs. GPT-4V + SoM (right):
Image

User:
Can you list the categories of each mark with spatial identifier?

GPT-4V:

Of course! Here's a list of the marked items based on the image you provided, identified by the numbers and letters you've drawn:
(1) - A potted plant.
(2) - Grapes on the grill.
(3) - Red chili or pepper.
(4) - Another potted plant.
(5) - An item which looks like a pitcher or jug.
(6) - A piece of raw meat (possibly steak) on a stick.
(7) - Cooked meat.
(a) - A bowl filled with grapes.
(b) - Grill mesh/cover.
Let me know if you need more details about any specific item!

Image

User:
Can you list the categories of each mark with spatial identifier?

GPT-4V (SoM):

Certainly! Here's a breakdown of the objects in the image based on the spatial identifiers:
1. Table surface (with some liquid spill)
2. Purple bowl (with two different sections numbered as 2)
3. Pink container with a dipping sauce and a wooden stick
4. Grilled items (appear to be seafood, possibly eel or a similar fish)
5. Potted plant
6. Gray cup or mug
7. Grill with hot coals
10. Cluster of grapes on the grill
11. Packaged food item (possibly meat)
12. Lavender or similar herb in a purple pot
13. Bowl of green grapes
14. Outdoor setting with trees and railing
15. Single red grape on the grill
I hope this helps! If you have any other questions, feel free to ask.

GPT-4V (left) vs. GPT-4V + SoM (right):
Image

User:
What is in this dish?

GPT-4V:

The dish appears to feature:
· A grilled or pan-seared fish fillet, which looks like it could be a type of white fish, possibly sea bass, cod, or halibut.
· Asparagus spears.
· Potatoes, possibly roasted or boiled.
· A wedge of lemon for squeezing over the dish.
· Garnished with fresh parsley or cilantro leaves.
· There might be a light sauce or seasoning drizzled over the fish and vegetables, but it's hard to specify its ingredients from the image alone.
This looks like a delicious and healthy meal!

Image

User:
What is in this dish?

GPT-4V (SoM):

Here's a description of the items in the dish:
1. Fish Fillet: The main component of the dish appears to be a fillet of fish, likely grilled or pan-seared.
2. Lemon Wedge: A common accompaniment with fish dishes to enhance flavor when squeezed over the top.
3. Parsley or Cilantro: This green herb is used as a garnish and can add a fresh flavor to dishes.
4. Asparagus: These green stalks are often steamed, salted, or grilled and served as a side.
5. Sauce or Broth: There is a yellowish sauce or broth on the plate.
6. Potato: It looks like a roasted or boiled potato.
7. Garnish: These could be green onions or chives scattered on the plate to add color and flavor.
8. Sauce or Oil Drizzle: There's a darker drizzle around the potato, which could be a type of sauce or perhaps a balsamic reduction.
The dish seems to be a well-balanced meal with a protein (fish), vegetables (asparagus and potato), and various flavor enhancers (sauces, lemon, and herbs).

GPT-4V (left) vs. GPT-4V + SoM (right):
Image

User:
I am on a diet, what are the top things I should avoid eating?

GPT-4V:

If you're on a diet, it would depend on the specific type of diet you're following. However, based on the image you provided, here are some common items that many diets might limit or avoid:
· Burgers: Especially if made with high-fat meats or served with cheese and mayonnaise.
· Fries: Deep-fried foods are typically high in calories and unhealthy fats.
· Chips: Can be high in salt and fats, especially if they're commercially produced.
· Onion rings: Another deep-fried item.
· Sandwiches with white bread: Many diets recommend whole grains over refined grains.
· Processed meats: Like the ham in the sandwich, as they can be high in sodium and preservatives.
· Sauces: Some can be high in sugar, sodium, or unhealthy fats.
· Soft drinks or sweetened beverages: High in sugars and empty calories.
Always consult with a nutritionist or dietician to get guidance tailored to your specific needs and health goals. They can provide recommendations based on your diet and personal health objectives.

Image

User:
I am on a diet, what are the top things I should avoid eating?

GPT-4V (SoM):

Based on the typical diet considerations, here are the top items from the provided image that you should most likely avoid or limit:
1. Onion rings (1 and 11) - Fried foods like these are often high in unhealthy fats and calories.
2. Fries (8) - Another fried item that can be calorie-dense and contain unhealthy fats.
3. Burgers (3 and 5) - Depending on their composition, they can be high in calories, unhealthy fats, and may contain processed meats.
4. Processed meats (2 and 15) - These can be high in sodium, unhealthy fats, and other additives.
5. Chips (16) - Typically high in unhealthy fats and can be calorie-dense.
While indulging occasionally is okay, it's best to consume these items in moderation if you're trying to maintain a healthy diet.
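Because the SoM answer refers to items by their mark numbers, those references can be tied back to the marked regions. The Python sketch below is one possible way to extract them from an answer formatted like the one above. The function name `extract_mark_refs` and the assumed line format ("N. Name (ids) - ...") are illustrative assumptions; real model output may not follow this pattern consistently.

```python
import re
from typing import Dict, List

# Matches a numbered item such as "1. Onion rings (1 and 11) - ..." and captures
# the item name and the text inside the first pair of parentheses.
ITEM_RE = re.compile(r"^\s*\d+\.\s*(?P<name>[^(]+?)\s*\((?P<marks>[^)]*)\)")


def extract_mark_refs(answer: str) -> Dict[str, List[int]]:
    """Map each listed item name to the mark numbers cited in parentheses."""
    refs: Dict[str, List[int]] = {}
    for line in answer.splitlines():
        match = ITEM_RE.match(line)
        if not match:
            continue  # skip prose lines and items without parentheses
        ids = [int(n) for n in re.findall(r"\d+", match.group("marks"))]
        if ids:
            refs[match.group("name")] = ids
    return refs


answer = (
    "1. Onion rings (1 and 11) - Fried foods like these are often high in fats.\n"
    "2. Fries (8) - Another fried item.\n"
    "While indulging occasionally is okay, moderation is best."
)
print(extract_mark_refs(answer))
```

The resulting mapping could then be joined with the mark-to-region table used when the image was annotated, so that each answer item points to concrete pixels.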

Results

Image Image

Citation

@misc{yang2023setofmark,
   title = {Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V},
   url = {https://arxiv.org/abs/2310.11441},
   author = {Jianwei Yang and Hao Zhang and Feng Li and Xueyan Zou and Chunyuan Li and Jianfeng Gao},
   eprint = {2310.11441},
   year = {2023},
   archivePrefix = {arXiv},
   primaryClass = {cs.CV}
}