Research & Analysis
- Text-image alignment: Difficulty in accurately depicting all attributes described in user prompts.
- Body problem: Output images sometimes show distorted, incomplete, duplicated, or otherwise abnormal human body parts (especially hands).
- Aesthetic dissatisfaction: Output images deviating from the user's aesthetic preferences.
Since our mobile app uses Stable Diffusion, I wanted to see whether a better prompting strategy could improve the model's output. I analyzed several SD models myself and carried out prompting experiments with users, encouraging them to work with ChatGPT to discover prompting methods with different potential. I generated images from the same prompts with different phrasings, then had users comparatively evaluate these sets of images.
The main audience of our app consists of users who have little experience with generative AI art but enjoy exploring art through this light, fast form of interaction. They focus on the quality of the resulting images and are mostly satisfied when the images align with their conception of "beauty". To determine what they consider "pretty", or are more likely to download, I studied their ratings of the images from the comparative evaluation.
Users were also encouraged to interact with the model with the assistance of ChatGPT to prompt the system for better images. Through this participatory design process, I also learned how users evaluate the quality of their own prompts and of the images generated.
Strategy & Design
The Stable Diffusion web UI includes a complete set of functionalities for image generation. Transferring that same power to a mobile interface without losing the core of AI image generation, however, is something I have been exploring. I therefore collaborated with our product manager to study several existing mobile AI art generators and compare them with our app's functionality. Most mobile apps do a good job of making the interface intuitive and the image generation process as straightforward as possible.
The initial app interface has all of the main features: a prompting section where users enter their desired text prompt, image style and size selection, image output, and an inspiration gallery. From there, I studied each feature's impact on user engagement.
I collaborated with the engineers to explore alternative ways for users to prompt the model more effectively. The user study showed that users tend to prefer images generated from prompts composed of short segmented phrases rather than single words. To ensure consistent model performance, I worked with the engineers to set limits on user selection within the system. This involved refining the option tree so that only the options with the most consistent performance appear as tags. Users can still type selections into the prompt manually, but this approach streamlines choices and improves the overall efficiency of the model.
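The tag-curation idea above can be sketched in a few lines. This is a hypothetical illustration, not the app's actual implementation: the tag names, the 0-1 "consistency" scores, and the cutoff value are all invented for the example.

```python
# Hypothetical consistency scores for candidate style tags,
# e.g. the fraction of test generations judged acceptable.
CANDIDATE_TAGS = {
    "watercolor": 0.92,
    "cyberpunk": 0.88,
    "double exposure": 0.54,  # inconsistent across test runs
    "tilt-shift": 0.47,
}

CONSISTENCY_THRESHOLD = 0.8  # assumed cutoff from internal evaluation

def visible_tags(tags, threshold=CONSISTENCY_THRESHOLD):
    """Return only the tags whose measured consistency meets the cutoff,
    so the interface surfaces reliable options while free-text input
    remains available for everything else."""
    return sorted(t for t, score in tags.items() if score >= threshold)

print(visible_tags(CANDIDATE_TAGS))  # ['cyberpunk', 'watercolor']
```

The key design choice is that filtering happens in the interface layer only: the model itself is unchanged, and users who type a low-consistency term manually still get a result.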
I also refined the app's prompting flow to make users more efficient. Previously, users had to type their text prompts manually, which could be time-consuming and confusing. Now they can begin by selecting predefined labels that serve as a starting point for their input. A set of relevant labels helps users quickly narrow down their choices, saving time and reducing cognitive load, and makes the app more intuitive and user-friendly.
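The label-first flow can be sketched as follows. This is a minimal illustration under my own assumptions; the function name and the comma-joining convention are hypothetical, chosen to match the short-segmented-phrase style the user study favored.

```python
def build_prompt(selected_labels, free_text=""):
    """Compose a prompt from predefined labels, optionally followed by
    any text the user typed manually. Labels come first so the guided
    selections form the base of the prompt."""
    parts = list(selected_labels)
    if free_text.strip():
        parts.append(free_text.strip())
    return ", ".join(parts)

print(build_prompt(["portrait", "soft lighting"], "wearing a red scarf"))
# portrait, soft lighting, wearing a red scarf
```

In this sketch the free-text field is optional, so a user can generate a base image from labels alone and only refine with typed text if they want to.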
Conclusion
Throughout the project, I came to understand how human-AI interaction is changing the way we design software interfaces. To make AI tools more accessible and usable, designers can start by learning how users interpret AI systems, studying their expectations of those systems, and then working to bridge the perspective gap.
Thus, from redesigning and reevaluating the interface, I came up with a few tips for engineers and users:
Tip 1: At the beginning of prompting, a structure consisting of one small/local-scope keyword and one large/global-scope keyword is sufficient for a good base image with fast generation.
Tip 2: "The more detailed and the longer the prompt, the better the image quality" is not necessarily true. Keeping prompts between 5 and 10 short phrases (100 to 500 iterations), separated by "," or ";", is sufficient. The number of iterations and the length of optimization did not correlate significantly with image quality.
Tip 3: When choosing the subject for generation, consider the relationship between subject and image style: pick subjects that harmonize with the chosen style's level of abstractness or concreteness, or subjects that are easily understandable and closely related to the style.
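Tips 1 and 2 can be expressed as a simple prompt check. This is an illustrative sketch only: the function name and the 5-10 phrase range come straight from Tip 2, but the splitting convention (treating "," and ";" as equivalent separators) is my assumption.

```python
def check_prompt(prompt):
    """Split a prompt on ',' or ';' and report whether the number of
    short phrases falls in the 5-10 range suggested by the user study."""
    phrases = [p.strip() for p in prompt.replace(";", ",").split(",") if p.strip()]
    return 5 <= len(phrases) <= 10, len(phrases)

ok, n = check_prompt(
    "lone lighthouse, stormy sea, oil painting, muted palette, wide shot"
)
print(ok, n)  # True 5
```

A check like this could run client-side before generation, nudging users toward the phrase count that performed best without blocking longer prompts outright.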