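Spring AI with Multimodality and Images

This article shows you how to use Spring AI's multimodal features to build a Spring Boot application that can process both images and text. Multimodality is the ability to understand and process information from different sources at the same time, including text, images, audio, and other data formats. We will run a few simple experiments with multimodality and images.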

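Source Code

If you would like to try it yourself, you can take a look at my source code. To do that, clone my sample GitHub repository and then simply follow my instructions.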

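Motivation for a Multimodal Spring AI Application

Multimodal large language models (LLMs) can process and generate text alongside other modalities, including images, audio, and video. This capability suits scenarios where we want an LLM to detect specific elements in an image or describe its content. Suppose we have a list of input images and want to find the image in that list that matches a description. The description can, for example, ask the model to find the image that contains a specified element. The Spring AI Message API provides all the elements needed to support multimodal LLMs. The diagram below illustrates our scenario.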

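Core Implementation

1) Sending images via Media

The controller loads the images in its constructor and builds a list of Media objects. Each Media holds an image ID, a MIME type, and the resource data.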

this.images = List.of(
    Media.builder().id("fruits").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/fruits.png")).build(),
    Media.builder().id("fruits-2").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/fruits-2.png")).build(),
    Media.builder().id("fruits-3").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/fruits-3.png")).build(),
    Media.builder().id("fruits-4").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/fruits-4.png")).build(),
    Media.builder().id("fruits-5").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/fruits-5.png")).build(),
    Media.builder().id("animals").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/animals.png")).build(),
    Media.builder().id("animals-2").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/animals-2.png")).build(),
    Media.builder().id("animals-3").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/animals-3.png")).build(),
    Media.builder().id("animals-4").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/animals-4.png")).build(),
    Media.builder().id("animals-5").mimeType(MimeTypeUtils.IMAGE_PNG).data(new ClassPathResource("images/animals-5.png")).build()
);

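2) Finding the image that contains a given object

/api/image/find/{object} sends the image list to the model together with a text prompt and asks the model to return only the position number. The server then returns the binary content of the image at that position to the client.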

@GetMapping(value = "/find/{object}", produces = MediaType.IMAGE_PNG_VALUE)
@ResponseBody
byte[] analyze(@PathVariable String object) {
    String msg = """
            Which picture contains %s.
            Return only a single picture.
            Return only the number that indicates its position in the media list.
            """.formatted(object);
    UserMessage um = UserMessage.builder().text(msg).media(images).build();
    String content = this.chatClient.prompt(new Prompt(um)).call().content();
    // The model answers with a 1-based position; trim stray whitespace before parsing.
    return images.get(Integer.parseInt(content.trim()) - 1).getDataAsByteArray();
}
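
The model's reply is free text, so converting it straight to a list index can fail when the answer carries extra words or points outside the list. Below is a minimal sketch of a more defensive conversion. PositionParser is a hypothetical helper introduced here for illustration; it is not part of Spring AI or of the sample repository.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (an assumption for illustration, not a Spring AI class).
final class PositionParser {

    private static final Pattern FIRST_INTEGER = Pattern.compile("\\d+");

    private PositionParser() {
    }

    /**
     * Extracts the first run of digits from the model's reply and converts the
     * 1-based position into a 0-based list index. Returns an empty result when
     * the reply contains no number or the number falls outside the list.
     */
    static OptionalInt toIndex(String reply, int listSize) {
        if (reply == null) {
            return OptionalInt.empty();
        }
        Matcher matcher = FIRST_INTEGER.matcher(reply);
        if (!matcher.find()) {
            return OptionalInt.empty();
        }
        int position;
        try {
            position = Integer.parseInt(matcher.group());
        } catch (NumberFormatException e) {
            // The digit run does not fit into an int.
            return OptionalInt.empty();
        }
        if (position < 1 || position > listSize) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(position - 1);
    }
}
```

In the controller, the result of PositionParser.toIndex(content, images.size()) could be used to pick the image, with an empty result mapped to an error response such as 404 instead of an unhandled exception.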

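3) Image description and structured output

/api/image/describe takes all images as input, and the model returns a String[].
/api/image/describe/{image} produces a structured description of a single image and returns a List<Item>, where each Item has the fields name and category.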

@GetMapping("/describe")
String[] describe() {
    UserMessage um = UserMessage.builder().text("""
            Explain what do you see on each image in the input list.
            Return data in RFC8259 compliant JSON format.
            """).media(List.copyOf(Stream.concat(images.stream(), dynamicImages.stream()).toList())).build();
    return this.chatClient.prompt(new Prompt(um)).call().entity(String[].class);
}
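
The Item type returned by /api/image/describe/{image} is only described in the text by its two fields. The following is a minimal sketch of what it might look like, assuming a plain Java record; the ItemSummary helper is hypothetical and only shows one way to work with the structured result once it has been mapped to Java objects.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Assumed shape: the article only names the two fields, name and category.
record Item(String name, String category) {
}

// Hypothetical helper, not part of the sample repository.
final class ItemSummary {

    private ItemSummary() {
    }

    // Groups item names by category; a TreeMap keeps the categories sorted.
    static Map<String, List<String>> namesByCategory(List<Item> items) {
        return items.stream().collect(Collectors.groupingBy(
                Item::category,
                TreeMap::new,
                Collectors.mapping(Item::name, Collectors.toList())));
    }
}
```

A record fits here because the structured output is a plain data carrier: the field names double as the JSON property names the model is asked to produce.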

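4) Generating and matching images (optional)

If an ImageModel is configured, you can call /api/image/generate/{object} to generate a new image and add it to the dynamicImages list.
If a vector store is configured as well, /api/image/generate-and-match/{object} generates an image → generates a description of that image → searches the vector store for documents similar to the description.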
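
The three-step flow can be sketched independently of any model or store. In the sketch below, the three function parameters are stand-ins for the ImageModel call, the multimodal chat call, and the vector store search; GenerateAndMatch and its parameter names are assumptions for illustration and do not reflect the actual controller code.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of the generate -> describe -> match flow.
final class GenerateAndMatch {

    private GenerateAndMatch() {
    }

    static List<String> run(String object,
                            Function<String, byte[]> generateImage,
                            Function<byte[], String> describeImage,
                            Function<String, List<String>> similaritySearch) {
        // Step 1: generate an image for the requested object.
        byte[] image = generateImage.apply(object);
        // Step 2: ask a multimodal model to describe the generated image.
        String description = describeImage.apply(image);
        // Step 3: look up stored documents that are similar to that description.
        return similaritySearch.apply(description);
    }
}
```

Passing the steps in as functions keeps the flow easy to test with simple lambdas before the real ImageModel and VectorStore are wired in.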

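Endpoint List

GET /api/image/find/{object}: finds the image that contains the given object and returns a PNG.
GET /api/image/describe: describes the list of images and returns a String[].
GET /api/image/describe/{image}: describes a single image and returns a List<Item>.
GET /api/image/generate/{object}: generates an image and returns a PNG (requires an ImageModel).
GET /api/image/load: writes the image descriptions to the vector store (requires a VectorStore).
GET /api/image/generate-and-match/{object}: generate → describe → similarity search (requires an ImageModel and a VectorStore).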

Running and Calling the Application

Run the application

cd chat-rag
mvn spring-boot:run

Example requests

curl -o result.png "http://localhost:8080/api/image/find/banana"
curl -s "http://localhost:8080/api/image/describe"
curl -s "http://localhost:8080/api/image/describe/animals-3"

If an ImageModel and a vector store are configured, you can also try:

curl -o generated.png "http://localhost:8080/api/image/generate/cat"
curl -s "http://localhost:8080/api/image/load"
curl -s "http://localhost:8080/api/image/generate-and-match/cat"

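Summary

The key to multimodality is wrapping each image in a Media object and sending it together with the text through UserMessage.media(...).
As long as the model accepts image input, it can handle the "find an image" and "describe an image" tasks; adding an ImageModel and a VectorStore extends this to "generate and match".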