Objects Perceive Me

What gazes back?

Computational Representation or the Statistical Gaze


“I think the style would be a bit whimsical and abstract and weird, and it tends to blend things in ways you might not ask, in ways that are surprising and beautiful. It tends to use a lot of blues and oranges. It has some favorite colors and some favorite faces. If you give it a really vague instruction, it has to go to its favorites. So, we don’t know why it happens, but there’s a particular woman’s face it likes to draw — we don’t know where it comes from, from one of our 12 training datasets — but people just call it ‘Miss Journey.’ And there’s one dude’s face, which is kind of square and imposing, and he also shows up some time, but he doesn’t have a name yet. But it’s like an artist who has their own faces and colors.” 


David Holz, Midjourney founder, interview with The Verge (2022)


For us humans, the computation involved in generative AI is catalyzing a significant change in the processual truth of what an image is. The web of computational operations in this new production process forces us to re-think what the art history and photography canons call representation. Representation was once conceived through optical concepts and materials invented to accommodate the human eye, such as the vanishing point, the photographic plate, or the camera mirror; computation instead requires images to be processed as digitized data, or numerical information.

In this visual investigation, I wanted to see what happens when a generative AI model is set off on a recursive loop in which its own outputs are iteratively fed back to it as inputs. My hunch, or hypothesis, was that the statistical operations of Midjourney would prompt the model to converge on the most probable averages of its dataset when left unattended by human intervention. In this experiment, I wanted to make experienceable, in an exaggerated way, what image production could become if it is increasingly automated to produce what is most probable on the Internet.

The increasing presence of AI-generated images is a phenomenon that extends to mobile photography, the metaverse, and scientific observation. Content on the Internet will soon become a majority-AI artifact, which means future datasets used to train newer AI models will rely on synthetic data, creating a closed feedback system that can intensify initial conditions and biases. Researchers have already observed this process in experiments with AI-generated natural language; they call the effect model collapse.[1] The published paper includes statistical evidence that recursion with AI-generated data homogenizes outputs, which increasingly forget the tails of the distribution curve. In other words, outliers in the training data are lost as the model reinforces what was originally overrepresented, leading to growing convergence and more errors. This recursive effect is a slippery slope, and it poses one of the more troubling aspects of automation that I wanted to explore.
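The dynamic the researchers describe can be sketched in miniature. The toy simulation below is my own construction, not code from the cited paper: a Gaussian is fitted to a small sample, new "synthetic" data is drawn from the fit, and the cycle repeats. Because each generation trains only on the previous generation's output, the estimated spread steadily narrows and the distribution's tails are forgotten.

```python
import numpy as np

def collapse_demo(n_samples=20, generations=500, seed=0):
    """Fit a Gaussian to samples, resample from the fit, repeat.

    Each generation is 'trained' only on the previous generation's
    synthetic output, mimicking recursive training on generated data."""
    rng = np.random.default_rng(seed)
    data = rng.normal(loc=0.0, scale=1.0, size=n_samples)  # "real" data
    variances = [data.var()]
    for _ in range(generations):
        mu, sigma = data.mean(), data.std()           # maximum-likelihood fit
        data = rng.normal(mu, sigma, size=n_samples)  # synthetic generation
        variances.append(data.var())
    return variances

variances = collapse_demo()
print(f"variance at gen 0: {variances[0]:.3f}")
print(f"variance at gen 500: {variances[-1]:.2e}")
```

The collapse here is driven by ordinary estimation error compounding across generations; the paper's models are vastly more complex, but the homogenizing direction of the feedback is the same.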
           
I designed a test for this by manually setting up a recursive process on Midjourney. I fed the model’s visual outputs back in as its inputs over a series of iterations to gauge how the initial image might change and converge formally when left to reproduce without my textual prompting. Midjourney offers the ability to prompt the model with a pair of images rather than words, or with a combination of images and words. In the former case, the company states on its website and Discord channel that the model “looks at the concepts and aesthetics of each image and merges them into a novel new image.” Just how the model defines concepts or aesthetics cannot really be known, although learning how to steer it towards desired outcomes has created a market for what the industry calls prompt engineering.
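The recursive setup itself can be sketched abstractly. Midjourney exposes no programmatic interface for this loop, so in the sketch below the "image" is stood in for by a feature vector, and `blend_step` is a hypothetical stand-in that, per my convergence hypothesis, nudges any input toward a fixed dataset average plus a little generative noise. Nothing here models the real system's internals; it only illustrates the output-as-next-input structure of the experiment.

```python
import numpy as np

def blend_step(image_vec, dataset_mean, pull=0.2, noise=0.02, rng=None):
    """Hypothetical stand-in for one image-prompted generation.

    Pulls the input part of the way toward the model's statistical
    'favorites' (dataset_mean) and adds small generative noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    drift = pull * (dataset_mean - image_vec)
    return image_vec + drift + rng.normal(0.0, noise, size=image_vec.shape)

rng = np.random.default_rng(42)
dataset_mean = np.zeros(8)               # the model's most probable output
image = rng.normal(0.0, 1.0, size=8)     # initial reference "image"

distances = []
for _ in range(40):                      # feed each output back in as input
    image = blend_step(image, dataset_mean, rng=rng)
    distances.append(np.linalg.norm(image - dataset_mean))
```

Under these assumed dynamics the distance to the dataset average shrinks geometrically until only the noise floor remains, which is the exaggerated convergence the experiment set out to make visible.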


[1] Ilia Shumailov et al., “The Curse of Recursion: Training on Generated Data Makes Models Forget” (2023), https://arxiv.org/abs/2305.17493


meta-diffusion 1, 2023. Initial reference image: Heydar Aliyev Centre by Zaha Hadid. Initial text prompt: “an architectural structure in the shape of a tesseract in the middle of a contemporary Middle Eastern city.” All images produced with equal weights, default settings, v 5.1, medium stylized.



meta-diffusion 2, 2023. No initial reference image. Initial input prompt: “an architectural structure in the shape of a tesseract in the middle of a contemporary Middle Eastern city.” All images produced with equal weights, default settings, v 5.1, medium stylized.



meta-diffusion 3, 2023. Initial input prompt: “a beautiful woman in a headscarf posing for a photograph.” I did not use grids; I selected the face I believed to be “darker” out of the first, mostly white, outputs. All images produced with equal weights, default settings, v 5.1, medium stylized.