AI批量生成的图片质量大部分过得去,但也有一小部分不能符合要求,比较经过检查才行,而人工检查成千上万的图片非常耗时,最近我们使用AI来进行自动化质量检查,虽然最后的结论是目前还无法用自动检查替代人工检查,但中间也有不少收获,下面记录一下过程和要点,也可以供其他有类似需求的朋友参考。🤝
一、问题提出
为网页绘制配图一直是我想做的事情,AI绘图以前需要人工来绘制插图,工作量和成本不可想象,现在AI绘图出现后就有了为大量网页绘制插图实现的可能。
我们从去年上半年就摸索Stable Diffusion生成图片,后来购买Midjourney账号使用,再后来ChatGPT+DALL-E 3来绘图,先尝试手工操作处理,再后来尝试批量生成图片。
生成成千上万图片后需要人工检查,费时费力,成了阻碍我们继续使用文生图模型来提升网站内容质量的主要因素。
二、解决思路
批量生成的图片就用批量的办法来做初步的质量检查、筛选,然后还要再人工检查,只希望能够降低人工检查的工作量。
- 先使用文生图模型从提示词生成图片;
- 需要使用图生文模型API对已经生成的图片写出描述;
- 再使用大语言模型API对写出的描述和原始绘图提示词进行对比打分;
- 不合格的重绘,合格的交给人工审核,希望机器审核合格后人工驳回得少。
流程图如下:
三、模型选择
生成图片有两种方式:ChatGPT+DALL-E的浏览器自动化、批量调用Stable Diffusion的API,下面是第二种方式用到的模型:
- 文生文:可以使用GPT-3.5或者Llama-3-70B来生成绘图提示词prompt和图片描述explanation;
- 文生图:使用公司内带GPU的机器上安装的Stable Diffusion模型的API或者网上在线SD的API;
- 图生文:尝试了Cloudflare中提供的llava-1.5-7b-hf和uform-gen2-qwen-500m两种生成description;
- 对比打分:用GPT-3.5或者Llama-3-70B来对绘图前后的描述进行对比,或者对其它指标进行对比。
四、提示词调试
各阶段提示词举例:
图生文llava-1.5-7b-hf:
export interface Env {
AI: Ai;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const res: any = await fetch("https://cataas.com/cat");
const blob = await res.arrayBuffer();
const input = {
image: [...new Uint8Array(blob)],
prompt: "Generate a detailed description for this image",
max_tokens: 512,
};
const response = await env.AI.run(
"@cf/llava-hf/llava-1.5-7b-hf",
input
);
return new Response(JSON.stringify(response));
},
} satisfies ExportedHandler<Env>;
图生文uform-gen2-qwen-500m:
export interface Env {
AI: Ai;
}
export default {
async fetch(request: Request, env: Env): Promise<Response> {
const res: any = await fetch("https://cataas.com/cat");
const blob = await res.arrayBuffer();
const input = {
image: [...new Uint8Array(blob)],
prompt: "Generate a detailed description for this image",
max_tokens: 512,
};
const response = await env.AI.run(
"@cf/unum/uform-gen2-qwen-500m",
input
);
return new Response(JSON.stringify(response));
},
} satisfies ExportedHandler<Env>;
在本机批量调用的时候,图片可以放置在本地,fetch的url可以用http://localhost开头
文档示范中的prompt: "Generate a caption for this image"生成的文字只是一个图片的标题,过于简单,caption改为description后就是图片的描述,再加上detailed又可以更详细一些,但无论我再怎么改提示词,都没有办法生成更多细节描述了。
LLM评估explanation和description打分的system prompt:
There are two paragraphs below. The "explanation" field is the description of a picture when using text-to-image, and the "description" field is the picture description generated when using image-to-text. We are now going to evaluate the image quality of the text-to-image by comparing the contents of these two fields. Please carefully compare the protagonist, background, various details, style, feeling and other aspects of the two descriptions, and give a score of 0-5 points for each aspect (the higher the similarity, the higher the score, and the lower the similarity, the lower the score), and then give a comprehensive score of 0-5 points.
You should only output the json data like:
{
"final score": "3"
}
do not output other informaion.
输入例子user prompt:
{
"explanation":"這幅插圖展現了一個人獨自坐在樹下,臉上帶著憂傷的表情,眼中含淚,象徵著悲傷和憂愁。背景色調暗淡,以傳達悲哀和悲痛的感覺。畫風仿傚傳統中國水墨畫,通過簡單而慎重的細節捕捉了“哀”字的本質。重點在於描繪悲傷和哀悼的情感,營造一種寧靜和內省的氛圍。",
"description":"The image portrays a somber scene set in a cemetery. A woman, dressed in a gray robe, sits on a stone bench under a large tree. The tree, with its twisted trunk and gnarled branches, is situated in the foreground, with its roots extending into the background. The woman, who is dressed in a long-sleeved shirt, is seated with her hands clasped together, her head bowed in a state of deep contemplation. The cemetery is filled with numerous gravestones, scattered across the landscape, and a fence can be seen in the distance. The foggy sky overhead adds to the atmosphere of the scene, creating a sense of solitude and quiet reflection."
}
LLM评估word和description打分的system prompt:
There are two paragraphs below. The "word" field is a word in the Chinese dictionary. We use text-to-image technology to generate a picture for it. The "description" field is the picture description generated when using image-to-text. We are now going to evaluate the picture quality of the text-to-image by comparing the contents of these two fields. Please carefully evaluate the protagonist, background, details, style, feeling and other aspects of the description, and then give a comprehensive score of 0-5 points (the more suitable the picture is for the word, the higher the score, and the less suitable the picture is for the word, the lower the score).
You should only output the json data like:
{
"final score": "3"
}
do not output other informaion.
输入例子user prompt:
{
"word":"哀",
"description":"The image portrays a somber scene set in a cemetery. A woman, dressed in a gray robe, sits on a stone bench under a large tree. The tree, with its twisted trunk and gnarled branches, is situated in the foreground, with its roots extending into the background. The woman, who is dressed in a long-sleeved shirt, is seated with her hands clasped together, her head bowed in a state of deep contemplation. The cemetery is filled with numerous gravestones, scattered across the landscape, and a fence can be seen in the distance. The foggy sky overhead adds to the atmosphere of the scene, creating a sense of solitude and quiet reflection."
}
还把word加释义一起来与description也进行对比打分,分数也差不太多。
下面是各种模型生成内容的对比表格,内容太多了,就放一个示意图。
五、实践情况
我们用800个图片作为例子,批量运用图生文API、批量LLM打分对比,然后再人工评估,将人工评估与自动打分进行对比。
人工评估的情况:
- 合格:644个
- 不合格:156个
机器评估的情况:
- 合格:600个,其中506个人工评估合格,94个人工评估不合格,机器错判率:15.66%
- 不合格:200个,其中138个人工评估合格,62个人工评估不合格,机器错判率:69%
从以上数据来看,机器评估与人工评估相差过大,准确率不足,难以实现我们预期的甄别效果。
六、结论及后续
目前还不能依靠自动化的办法来进行大批量的生成图片的质量检查。主要原因在于文生图和图生文这两个过程中信息失真过大,比较图片生成前后的文字就难以得到准确的判断。
后续可以进行的工作:
- 各阶段提示词的优化
- 自动化循环进行(目前用程序,以后是否用智能体)
- 更合适模型的选择(国内免费模型)
- Lora训练提升图片合格率(这个是重点)
现阶段无论文生图、图生文都还不能让人满意,而文生文基本上还是过关的,也许再过一年半载,文生图、图生文的质量大幅提升后,还是可以用批量方式实现质量检查的。
评论