We speak with Fabian Offert about generative models, synthetic data, the role of latent spaces, and the future of general computation from artificial intelligence.
Fabian is an Assistant Professor in History and Theory of Digital Humanities at the University of California, Santa Barbara.
His research and teaching focus on the digital and computational humanities, with a special interest in the epistemology and aesthetics of computer vision and artificial intelligence.
At UCSB, he is affiliated with the Germanic and Slavic Studies department, the Media Arts and Technology program, the Comparative Literature program, and the Center for Responsible Machine Learning. He is also a principal investigator of the international research project “AI Forensics” (2022-25), funded by the Volkswagen Foundation.
Before joining the faculty at UCSB, he was affiliated with the DFG SPP “The Digital Image”, and the Critical Artificial Intelligence Group (KIM) at Karlsruhe University of Arts and Design. Previously, he worked for a number of German cultural institutions like ZKM Karlsruhe, Ruhrtriennale Festival of the Arts, and Goethe-Institut New York.
Dustin Breitling: Can you reflect on your intellectual journey, particularly on some formative moments, thinkers, or encounters that paved the way for your current interests? You weave together a focus on quantum simulation, AI art, machine learning, and computer science, accompanied by a reevaluation of what the field of digital humanities means today.
Fabian Offert: Yeah, of course, it’s always difficult to build a narrative about yourself. But I believe one important factor that has led me to these disparate interests is the industry dimension. I studied performance art, which means that I kind of hung out in weird theater spaces to build things. Just to give you an example, at one point, my partner and I realized an experimental music theater piece in collaboration with IRCAM, the French electronic music research center, and built a life-sized electronic circuit on stage. We would buy really huge amounts of pure copper, flat pieces of pure copper, and arrange them in complex structures before electrifying them. Then the musicians in the project would have copper pieces under their shoes, and they would open and close connections while walking across the stage, generating pulses. These were fed into an algorithm that a composer friend of ours designed. That is one project. In another one, the stage design was made entirely of cardboard, and we eventually fed the complete stage into a shredder until nothing was left.
I’m mentioning this because I really work in a very hands-on manner with things and do iterative work that requires lots of failures to work, both functionally and aesthetically. I think that’s why I’m bringing up these projects that have nothing to do with what I’m doing now. The materialist, or let’s say strongly materialist, tendency in my theoretical work really goes back to this understanding that when it comes to objects, not concepts (and I would argue, and we can talk about this, that AI is very much an object, or a set of extremely heterogeneous objects), you have to have some sort of tangible relation to them to critique them properly. This trajectory is not very straightforward. However, I believe that the materialist element, or materialism, in my work is combined with theoretical elements in many of the things that I do.
DB: We can plunge further into the nature of this materiality in the context of what you and Peter Bell have written about digital humanities. You’re looking into a generative approach that can be harnessed to guide the further development of the computational humanities, as you describe in your Generative Digital Humanities paper. Could you untangle whether you see a material dimension intersecting with your focus here? There’s also, with Generative Adversarial Networks, a potentially epistemological quality that plays a role in the realm of science, and you argue that there’s a humanist process of discovery as well. So, would this be a way to combine a material approach with an epistemological approach?
FO: I think so. It’s really funny that you mention the paper that Peter and I wrote, because it resonated with a lot of people. At the same time, in a way, it’s completely outdated now because GANs have been completely replaced by Transformer-based systems. Nonetheless, for both GANs and this newer generation of multimodal generative models, like Stable Diffusion and DALL-E, I would still claim that the epistemic implications that you mention are valid. I’ll try to give you an example that also goes back to my paper on the use of Generative Adversarial Networks in the sciences. First of all, and this is kind of a trivial observation on my part, generative systems, as I write in the paper, are more or less historically fixed: they can only produce variations of what they have seen. This is starting to change a bit with what the engineers call retrieval models, which have access to current sources and not only to what’s in their training data. But generally, this is still true for most generative systems. This is also, of course, why models trained on biased data sets can’t suddenly produce unbiased outputs. There’s a neat little experiment from way back in the early days of the current generation of AI systems that shows this quite convincingly on a toy level, and you can replicate it yourself: you train a neural network to translate binary numbers into decimal numbers. The training set for this experiment is potentially infinite because it costs us nothing to generate billions of data points. The one constraint is that in the binary numbers that you feed to the network, the least significant bit, the bit on the right, is always zero.
That means the network only sees even numbers, because if the least significant bit is zero, the decimal number that is the binary number’s translation is also even. After training on this for a while, the network will have accuracy really close to 100% for even binary numbers, but zero percent accuracy for odd binary numbers. It is incapable of dealing with odd numbers. I like this example because the network has, in a sense, seen odd numbers on both sides of the equation: in the decimal numbers, some of the digits are odd even though the number itself is not. However, just because of this one constraint, that the least significant bit can only be zero, our model is basically complete trash. The interesting takeaway, I believe, is that generalization does not always result in a world model. The representation of just some parts of the structure does not approximate the representation of the full structure. This matters for generative image models, which is, of course, what you asked about.
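The toy experiment described above can be sketched in a few lines. This is a minimal illustration under simplifying assumptions, not the original experiment: a single linear layer fitted by least squares stands in for the neural network, the word size `N_BITS` is arbitrary, and the model predicts the number's value, which would then be written out in decimal digits. All names are hypothetical.

```python
import numpy as np

N_BITS = 8  # assumed word size for this toy setup


def to_bits(n, n_bits=N_BITS):
    """Return n as a list of bits, most significant bit first."""
    return [(n >> i) & 1 for i in reversed(range(n_bits))]


numbers = np.arange(2 ** N_BITS)
X = np.array([to_bits(n) for n in numbers], dtype=float)

even = numbers % 2 == 0
odd = ~even

# "Training": fit the linear layer on even numbers only, so the least
# significant bit (the last column of X) is always zero in the training set.
weights, *_ = np.linalg.lstsq(X[even], numbers[even].astype(float), rcond=None)


def predict(inputs):
    """Round the linear output to the nearest integer value."""
    return np.rint(inputs @ weights).astype(int)


acc_even = float(np.mean(predict(X[even]) == numbers[even]))
acc_odd = float(np.mean(predict(X[odd]) == numbers[odd]))
print(f"accuracy on even inputs: {acc_even:.2f}")
print(f"accuracy on odd inputs:  {acc_odd:.2f}")
print(f"learned weight for the least significant bit: {weights[-1]:.3f}")
```

Because the last bit never varies during fitting, the model has no evidence about what it means, and in this sketch its weight stays at zero. Every odd input is therefore treated like the even number just below it, no matter how many even examples are added.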
This means that generative image models are significantly constrained in what they can do. In the generative digital humanities paper, we argue that despite these constraints, generative image models can be useful (you could call them engines of interpolation), but only if we already have a good idea of what our data set contains. This comes back to the question of representation: we can use a generative system, for instance, to represent a discrete collection of images as a continuous space, which is what generative systems do, because they give you a latent space, a continuous space. In this space, we can find intermediate images, images in between the data points that reflect the images we put into the model. In the paper, we argue that these intermediate images provide us with a kind of hypothetical, let’s say, rate of change, similar to an image calculus: the rate of change between two images that are in the data set.
Of course, these intermediate images are entirely fictional and ahistorical, because the network knows nothing about the actual, say, art historical transitions that could exist between two images. But they can, I think, still serve as a kind of inspiration for thinking about what constitutes an image in the data set, because these generative systems dissolve and then reconstitute an image data set. I believe this is the kind of productive grip we have on this. The main point of the paper, just to complete this thought real quick, is to show that seeing this hypothetical image space as something useful is not as strange, outrageous, and unscientific as people like to claim, because this is actually what digital humanists have been doing for years with topic modeling. The one difference is that for text, as opposed to images, this continuous intermediate space is just less visible, because at the end of the day you have to collapse it back into the discrete tokens that texts consist of. But that doesn’t mean that this continuous space, which gives you the possibility of interpolation, isn’t there.
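The idea of intermediate images and a "rate of change" between two images can be illustrated with a small sketch. It assumes a trained generator is available; here a fixed random map called `decode` is a hypothetical stand-in, and the latent and image sizes are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, IMAGE_PIXELS = 16, 64  # assumed sizes, chosen for illustration

# Hypothetical stand-in for a trained generator/decoder: a fixed random map.
W = rng.normal(size=(LATENT_DIM, IMAGE_PIXELS))


def decode(z):
    """Map latent vector(s) to flattened toy 'images' with values in (-1, 1)."""
    return np.tanh(z @ W)


def interpolate(z_a, z_b, steps):
    """Return `steps` latent vectors on the straight line from z_a to z_b."""
    t = np.linspace(0.0, 1.0, steps)[:, None]
    return (1.0 - t) * z_a + t * z_b


# Two latent points standing in for two images from the data set.
z_a = rng.normal(size=LATENT_DIM)
z_b = rng.normal(size=LATENT_DIM)

path = interpolate(z_a, z_b, steps=9)
frames = decode(path)                     # 9 frames, 7 of them intermediate
rate_of_change = np.diff(frames, axis=0)  # finite differences between neighbours
```

The intermediate frames exist only because the space is continuous. Nothing in the sketch knows about any historical transition between the two endpoints, which is the sense in which such images are fictional.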
DB: I think there was a paper written about 10 years ago that asked whether you could craft an algorithm around a logical fitness function that has an aesthetic sense. I think this relates to what you’ve mentioned, particularly how the generative models I’ve seen are trying to capture a certain era of, let’s say, Renaissance painting, all the way up to, let’s say, Modernist painting. This seems to be where generative models are becoming more and more utilized, and as you mentioned, it’s not just limited to GANs. Could you sketch out the generative model a bit more, because I believe this is a really important point in your paper? The paper concretizes the distinction between what we mean by a discriminative approach and a generative approach. As you write, the discriminative approach learns a conditional probability distribution, where you’re looking at the probability of x given y, whereas the generative approach learns the joint probability distribution. Could you unpack that for people who aren’t familiar with these terms?
FO: Yeah, let me try that. In a discriminative approach, you basically learn to distinguish between things: you classify data points. Let’s say we have two classes of things, Renaissance paintings and Medieval paintings. These are terrible classes, they don’t make any sense in machine learning, but just for the sake of example. We try to build a model that can look at a previously unseen artwork, previously unseen data, and properly categorize it. Whereas in a generative approach, you basically try to model what the paintings in your data set have in common. You try to create, and I think this is a quite important concept, a kind of semantic compression of the data set, where all the traits that you can find in the data still exist, but in a compressed space. This is the infamous latent space that everyone loves and that so much has been written about: a compressed representation of the input data set. Because it’s a compression, your input data isn’t fully represented, so you’ll never find one of the paintings from the input data set represented with 100% accuracy in the latent space, because that’s not how it works.
You can find approximations for specific data points, but you’ll never find the actual thing in there. This, by the way, also has implications for the current generation of prompt-based models and their inability to recreate certain things; maybe we can talk about that as well. And because it’s a continuous space, and this goes back to my previous comment, between these approximate versions of images that you can find, you have intermediate points, and you can interpolate between points in that space. In the early days of AI, this is all that people did. They created these latent space interpolation videos, where you would start in one corner of that space and then move to another, and you have all these morphing effects. Very cinematic and very interesting, but also very boring after a time. So basically, the difference, again, is that you either classify things or you try to find their, let’s say, implicit characteristics in a compressed, technically compressed way.
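The distinction drawn in this answer can be made concrete with a deliberately tiny example. The sketch below is an illustration under invented assumptions, not the setup of the paper: each "painting" is reduced to a single made-up feature value, the discriminative model is a learned decision threshold, and the generative model is one Gaussian per class.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "feature" for two classes (labels 0 and 1); the numbers are invented.
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(4.0, 1.0, 500)])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(500, dtype=int)])

# Discriminative: learn only where the boundary between the classes lies.
candidates = np.sort(x)
errors = [np.mean((x > c).astype(int) != y) for c in candidates]
threshold = candidates[int(np.argmin(errors))]
disc_acc = float(np.mean((x > threshold).astype(int) == y))

# Generative: model how each class produces its data,
# p(feature, label) = p(label) * p(feature | label).
priors = np.array([np.mean(y == k) for k in (0, 1)])
means = np.array([x[y == k].mean() for k in (0, 1)])
stds = np.array([x[y == k].std() for k in (0, 1)])


def log_joint(points):
    """Log of p(feature, label=k) per point and class, Gaussian class models."""
    points = np.asarray(points, dtype=float)[:, None]
    log_lik = -0.5 * ((points - means) / stds) ** 2 - np.log(stds * np.sqrt(2 * np.pi))
    return np.log(priors) + log_lik


gen_acc = float(np.mean(np.argmax(log_joint(x), axis=1) == y))

# Only the generative model can also produce new, unseen points for a class.
new_points = rng.normal(means[1], stds[1], size=5)
```

Both models can classify, but only the second one holds a compressed description of each class (here just a mean and a spread) from which new samples can be drawn.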
DB: I’d like to follow up on this, because in your paper you were essentially conducting a generative modeling approach yourselves. You employed over 20,000 adoration paintings or scenes, because you were looking at an iconographic corpus in this case. Could you explain some of your findings? I think this is a really important thread that could be tied in a bit more with the semantic and syntactic dimensions that were also discussed when we were thinking about latent spaces. This is brought into relief especially with multimodal language models, where a key element is to think about semantics tied to images. Why did you choose to use this corpus?
FO: The corpus arose from a research project I was working on at the time that attempted to leverage machine learning in general, rather than just generative models, for iconographic analysis: to try to find iconographic similarities between artistic artifacts, paintings, basically, in this case. In terms of the findings of that more or less empirical part of the paper, we didn’t find anything that wasn’t obvious from the start. But I think the one result that is interesting, both logically and maybe also epistemologically, is that this method kind of works for seeing the boundaries of iconographic concepts. If you have these intermediate points, and if you can generate these intermediate images, these interpolations, then there’s a very precise threshold where one thing ceases to be that thing and becomes another thing. In that sense, we propose that we can use these latent spaces, that we can use generative models, to think more about, not really object boundaries, because it’s not only about objects, but maybe conceptual boundaries in images, which are notoriously hard to determine and define.
Of course, you could go back to the pixel level and say that this pixel here, I don’t know, (350, 417) or something, is exactly where one image object ends and another begins. In practice, however, this is not so simple. It’s actually pretty difficult. I believe that simply enriching a corpus with these imaginary images allows you to look closer at these boundaries and ask the question of what is in an image. That question, I believe, is the actual result of this empirical part of the paper.
DB: A recurring term in our conversation has been latent space. This is actually something you wrote about two years ago in “Latent Deep Space,” which teases out this tension around, as you’ve mentioned, GANs, especially considering that GANs are being replaced. We see diffusion models and Transformer models being harnessed as these speculative engines as well. They are, however, still embedded in this technically defined space. What I found very fascinating about your paper was the use of GANs to reconstruct images of galaxies that have been, as you point out, perturbed by various sources of random and systematic noise from the sky background, the optical system of the telescope, or the detector used to record the data. There is also, for example, how these GANs are used for cancer recognition. Could you dissect their function? You articulated that latent spaces serve as compressed representations of inputs, and latent space is a really ubiquitous dimension that underpins a lot of these models.
FO: Yeah, sure. Latent spaces are a strange phenomenon. I think it’s one of the first technical terms that many people encounter when they start looking into artificial intelligence, particularly visual artificial intelligence. It is also the type of medium that people use in art: they build latent spaces on occasion, but most of the time, they simply explore existing ones. In the Latent Deep Space paper that you mentioned, I basically argue that in the sciences, latent spaces constitute a kind of optical medium. If certain AI-based image reconstruction techniques are used on scientific images, the latent space basically replaces the lens of the optical instrument used to take these scientific images in the first place. In the paper, I argue that this is, of course, not a new development, which is why the paper starts with Peter Galison’s reflections on the relation of images and data, on data gathered into images and images scattered into data. It also starts with the infamous black hole picture, which arguably follows the same logic as GANs or other generative systems that enhance images. This picture, which supposedly shows a black hole, was generated by using data from several interconnected telescopes over time. It’s not like you could just point your telescope at the sky and take a photo of the black hole. So it’s more a constructed, generated image than an actual depiction of what’s out there. In the paper, I cite a slew of case studies from computer science in which people use generative systems, particularly GANs, to improve scientific images. One example, the cancer example, is one of the few cases where the researchers themselves discovered something that they found extremely troubling, and they chose to write about this problematic aspect rather than just use the method for, you know, their day-to-day work. In this example, they train a model to translate between two kinds of MRI images.
There are basically two data formats that you can have, and they built a model to translate from one to the other. What they found is that if the training data for this very simple image-to-image model was unbalanced, for instance, if you had more images of cancerous cells in one part of the data set, so in one image format, than in the other, then the fully trained network would, in the translation process, basically produce cancer, or visual signs of cancer, even if you put in an image without cancerous growth. So because of the unbalanced nature of the data set, and because, again, these systems can’t handle things that they have never seen, the simple act of image translation would produce a significant semantic shift. And because we’re talking about applications in medicine, this would have significant real-life consequences if such models were ever used. This goes both ways: false positives and false negatives are both terrible if you think of the practice that’s connected to this.
In the other example that appears in the paper, a group of researchers used GANs to enhance photos of galaxies. They make a really interesting claim in that paper: that their model can transcend the deconvolution limit. What is the deconvolution limit? You’ve probably seen The X-Files, CSI, or something similar, where they zoom into a photo infinitely and the photo always stays bright and crisp. You intuitively understand that this is not possible, and the deconvolution limit is just that: a limit that determines how much information you can extract from an image with a fixed resolution. What they argue is that their method, using AI, transcends that limit, which is actually physically impossible. So I argue in the paper that they’re basically just replacing information that’s in the original image they’re trying to enhance with information that they bring in from the outside, through the generative system they’re using, because they’re using a pre-trained GAN to enhance those images. It’s really interesting because these generative systems are so ubiquitous in science right now, and that’s one aspect of AI that people like to ignore and that hasn’t been looked at all that much at this point.
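The argument about a fixed-resolution image can be illustrated with a small sketch. It does not model the deconvolution limit itself or any telescope; it only shows the underlying many-to-one problem, assuming a crude "sensor" that averages blocks of pixels. The function and variable names are hypothetical.

```python
import numpy as np


def downsample(img, factor=2):
    """Average non-overlapping factor x factor blocks (a crude low-res sensor)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))


rng = np.random.default_rng(0)
scene_a = rng.random((8, 8))

# Build a second, different scene by adding fine detail that averages to
# zero inside every 2 x 2 block, so the sensor cannot register it.
detail = 0.1 * np.tile(np.array([[1.0, -1.0], [-1.0, 1.0]]), (4, 4))
scene_b = scene_a + detail

low_a = downsample(scene_a)
low_b = downsample(scene_b)

print("scenes differ:          ", not np.allclose(scene_a, scene_b))
print("recorded images identical:", np.allclose(low_a, low_b))
```

Two different scenes produce the same recorded image, so no method can tell from the recording alone which one was there. Any "enhancement" that picks one of them has to bring that choice in from somewhere else, for instance from the data a generative model was trained on.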
DB: I also believe that for protein folding, or in general, the prediction of the structure of nearly all proteins, DeepMind and Meta have been at the forefront. I believe they’re employing synthetic data, if I’m not mistaken.
FO: Yeah, the protein case is interesting. I’m actually writing a paper about this at the moment, because they’re using large language models for the protein folding thing. So this is not an image model; it is language models, which brings up all these questions about whether this is a language or, if not, why the language model works. In the paper that I’m writing, in collaboration with two PhD students here at UCSB, we basically argue that language models are not language models at all. They are models, but they’re not language models. Protein folding is one such example, because proteins are not a language. They are nothing like languages. But again, this is a work in progress, so I can’t tell you much more about this or about the synthetic data production.
DB: One statistic I’ve seen is that about 60% of data is estimated to be synthetic at some point in the future. Do you see any other major applications of synthetic data that maybe we’re not readily aware of but that are used on a regular basis? From what I’ve been researching, and as I was telling you before, this is quite significant in fields like remote sensing. It’s interesting because one of the techniques they use, I gather, is domain randomization. I think this ties in with your point about synthetic data and latent spaces: where do we draw these limits of imagination? It also incorporates the human, because if I’m using a game engine, of course there are all these virtual cameras that I can use, but I also have to think about where I should actually place objects if I’m moving them around, and what the possible cascade effects are. The images and shots that I generate in a virtual environment will eventually impact a model via the data set, and whether it can recognize or identify objects.
FO: It’s really funny because I had this exact conversation about the feedback loop of synthetic data, or generated images, at a conference a couple of weeks ago. This might be a little bit controversial, but I think, on paper, this really is a big problem, because we can easily imagine how our visual culture and canon become increasingly narrow. If a model is trained on, let’s say, the real visual canon, the real visual culture, and its outputs then add to what’s out there, you will eventually end up with this self-reinforcing loop, this feedback loop, where models trained on synthetic data produce more synthetic data. So I can totally see that. But on the other hand, at least for the moment, the process of building a model at the scale of Stable Diffusion or DALL-E requires so many resources and is so slow that I don’t think we are at the stage yet where this is an actual problem in practice. I know people are thinking about it. I know some researchers, for instance, are looking into watermarking the outputs of generative models, to make them identifiable to the next generation of generative models so that this exact thing won’t happen. But I think we’re not at the point yet where that is the main problem.
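The narrowing effect of such a feedback loop can be shown with a toy simulation. This is a sketch under strong assumptions and says nothing about real image models: the "model" is a single Gaussian, each generation is refitted only on samples drawn from the previous generation, and the sample size is kept deliberately tiny so the effect shows up quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
SAMPLES_PER_GENERATION = 10  # deliberately tiny to make the effect visible
GENERATIONS = 200

mean, std = 0.0, 1.0  # generation 0: the "real" distribution
history = [std]
for _ in range(GENERATIONS):
    # Sample synthetic data from the current model ...
    data = rng.normal(mean, std, SAMPLES_PER_GENERATION)
    # ... and refit the next model on that synthetic data only.
    mean, std = data.mean(), data.std()
    history.append(float(std))

print(f"spread at generation 0:   {history[0]:.3f}")
print(f"spread at generation {GENERATIONS}: {history[-1]:.3g}")
```

In this toy setting the spread of the distribution shrinks over the generations, a numerical analogue of a canon becoming narrower when models are trained on their own outputs. How strongly this carries over to large models is exactly the open question raised in the answer above.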
DB: This also ties in with a recurring thinker in your work, William John Thomas Mitchell, whose concept of the ‘metapicture’ informs how we can comprehend generative models. Could you expand on his work, and in particular on the concept of the metapicture? Another thread that you have homed in on is the nature of perceptual bias, which does not simply recapitulate the work undertaken by the likes of, let’s say, Kate Crawford or Trevor Paglen and the numerous dataset papers that attempt to unmask the inherent social, racial, and cultural biases that are perpetuated. Could you explain how your work overlaps with and fundamentally differs from that, and what you mean by ways of seeing, for example, when you refer to the myriad techniques that artificial intelligence models harness, such as linear filters, stochastic clipping, and Laplacian pyramids, and how these reinforce their ways of statistically assessing and reproducing a version of a world?
FO: Yeah, I mean, I am by no means an expert on Mitchell’s brand of visual studies. But I like the concept of the ‘metapicture’ because it shows how representation can become extremely complicated. The interesting aspect of that, to me at least, is that contemporary models are able to match this sophistication of scaffolded representation that is meant by the concept of the ‘metapicture’. I’m going to give you an example. In collaboration with Peter Bell, we built a system called imgs.ai, which is basically an image retrieval engine. It operates on some of the publicly available image collections of big museums. We have indexed the Metropolitan Museum collection, which is about 400,000 images, and some others. What this system does is generate image embeddings for each image in these collections based on the CLIP model. CLIP is the part of generative models like DALL-E 2 that brings together text and image. It is a model trained on text-image pairs from the web that basically tells you what kind of visual elements belong to a label, which is why we can prompt these newer generations of models, and we use that for retrieval purposes. So you can just ask the system to, you know, give me all the photographs by August Sander in the collection of the Museum of Modern Art in New York, and stuff like that. An interesting experiment that we ran on this was to ask the MoMA collection for Las Meninas, which is, of course, this very famous painting that Foucault, among others, spends a lot of time writing about in The Order of Things. It’s generally regarded, as you know, as an art historian’s painting, just because of the play on representation that it embodies. There’s a painter in the painting; there’s a window; there’s a mirror; all these interesting aspects: people looking at people, people looking at the spectators. All kinds of gaze relationships exist.
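The retrieval step described above can be sketched as a nearest-neighbour search in a shared embedding space. The sketch does not call CLIP and does not reproduce the system discussed here: the embeddings are random placeholders, and `image_ids`, `EMBED_DIM`, and `search` are hypothetical names. It only shows the ranking by cosine similarity that such a system could use once text and images live in the same space.

```python
import numpy as np

rng = np.random.default_rng(0)
EMBED_DIM = 32  # assumed; real joint text-image embeddings are larger

# Hypothetical precomputed image embeddings for a small "collection".
image_ids = [f"image_{i:03d}" for i in range(100)]
image_embeddings = rng.normal(size=(len(image_ids), EMBED_DIM))


def normalize(v):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def search(query_embedding, k=5):
    """Return the k (image id, cosine similarity) pairs closest to the query."""
    sims = normalize(image_embeddings) @ normalize(query_embedding)
    top = np.argsort(-sims)[:k]
    return [(image_ids[i], float(sims[i])) for i in top]


# Stand-in for the embedding of a text prompt: a slightly perturbed copy of
# one image embedding, so that a "matching" image exists in the collection.
query = image_embeddings[42] + 0.1 * rng.normal(size=EMBED_DIM)
results = search(query)
for image_id, similarity in results:
    print(image_id, round(similarity, 3))
```

The search never checks whether the queried work is actually in the collection. It always returns whatever lies closest in the embedding space, which is why a query for a painting a museum does not own can still surface images that share some of its structure.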
As you know, a million texts have been written about this particular painting. Now, if we look for this painting in the MoMA collection, the MoMA of course doesn’t have it; you can find it in Madrid, at the Prado.
The MoMA doesn’t have much from the Spanish Golden Age at all; it’s a contemporary art museum. But if you look for that painting, you’ll find a slew of photographs, which I wish I could show you but can’t because we’re talking, that really capture how important this play on representation is in Las Meninas. For instance, you get a photo where a person is looking through a shop window at a painting of a person, while another person is looking at that person who’s looking through the shop window. So there are the same kind of gaze relations going on in that photograph, which has nothing to do, historically or otherwise, with Las Meninas. But it seems that in this case, at least, the CLIP model has learned something about the concept of representation that is somewhat reproducible when we run it on these image collections. In that sense, the concept of the metapicture is relevant, because I believe that with modern models we’ve moved beyond the level of the merely syntactic: we’re no longer dealing with just syntactic aspects of images. There’s a semantic aspect that comes into play that we can, at least to a degree, address and work with using these models.
The concept of perceptual bias is inspired by computer science research on the interpretability of machine learning systems, which we essentially translated into a humanities inquiry framework. Just to give you the short version of the concept: it’s long been known that machines see the world differently than we do. In some ways, it’s a superfluous insight.
Of course they see the world in a different way, because they are machines rather than humans. But if you look at the details of these machine-learning ways of seeing, it turns out that we can’t even properly investigate these differences, the differences between what a machine sees and what a human sees. As we show in our paper on perceptual bias, the more legible we try to make a neural network, the further we move away from accurately representing what is going on under the hood. Let me talk a little bit more about that. The one specific method that we’re looking at in the paper is called feature visualization. What feature visualization gives you is a technique to visualize what specific neurons in a neural network respond to. The working hypothesis is that a kind of hierarchical model of vision emerges, in particular in convolutional neural networks, with earlier layers detecting basic patterns, such as curves or lines or specific textures, and later layers, closer to the classification output, detecting more complex, almost semantic concepts. For instance, if you have a network that distinguishes between cats and dogs, the last layer before the output layer presumably has neurons that detect dog ears, cat tails, and so on. You get the idea. So that’s the hierarchical model of vision there. But it turns out that for this visualization technique to actually work, so as to produce images that show dog ears and cat tails and all these other things, it has to apply a bunch of filters in an iterative manner.
So this feature visualization technique is basically an optimization technique where you start with a noisy image that you feed into the network. Then basically, you measure how much specific neurons, the ones that you’re looking at—are firing, like in this image. So how much do they activate, looking at this image? Then, once you have that information, using what the engineers call backpropagation, you go back and change this noise image a tiny bit so that the neuron in question likes it even better. And you do this a couple of thousand times. Eventually, you end up with an image that supposedly shows you what the network sees—at this point, what the specific neuron sees as what works best—what activates it the most. But to actually get to the point where you have a legible image, it takes a couple of iterations. Also, you have to filter out high frequencies in the image. Because if you don’t do that, then what you end up with is an image that’s completely illegible and looks like noise. But that still activates that neuron, and you end up with what the engineers called an adversarial example. So an image that shows nothing, but that’s super relevant to the network, means that you get through an image that actually shows you something that is an image of something.
At several points in the process, you just have to throw information away to get a better image: the more information you throw away, the more high frequencies you get rid of. This means that through this process of continuous filtering, you’re not moving closer and closer to the actual thing that the neuron sees, but further and further away. So even the visualization techniques that people have come up with don’t really bring you closer to the technical object under investigation, and this is what we show in the paper. This is also what we mean by perceptual bias: there’s a specific way of looking at the world that’s determined by the technical object. It’s really hard to find out what that specific way of seeing is. And on the other hand, this particular way of seeing, of course, influences what the model does and how it treats input data and so on. To return briefly to your question about Crawford’s work with Trevor Paglen and the work of many others, there is one point I would like to make.
The point in the paper, and I tried to make it as clear as I could, is that focusing on dataset bias is only half the picture. Of course, we have biased datasets. Of course, they produce models that are totally biased. And of course, these biases have really awful consequences. But there’s also another part, where these biases are amplified or new biases are added. And those are based on the actual architecture of the model, the actual, technical way that the model looks at the world, and this is what we call perceptual bias. So basically, there is another layer of problems that you get when you use these systems.
DB: You write in On The Emergence of General Computation from Artificial Intelligence that we will soon witness the reemergence of general computation from artificial intelligence, marshaling the examples of Jonas Degrave asking GPT to hallucinate a virtual Linux machine and AlphaTensor’s trailblazing feat of optimizing matrix multiplication. You go on to argue that the stack is dead and that we have entered the age of humanist hacking. Could you sketch the contours of this argument and its implications further?
FO: Yeah, so I usually try to avoid making highly speculative arguments. But, as you’ll see in the blog post, for this one I made an exception, because I believe we’re witnessing a major paradigm shift. I would say it’s a paradigm shift from symbolic to post-symbolic computation. Let me explain what I mean by that. It’s still unclear whether large language models, in particular, operate on a world model, that is, basically a functioning model of the world in a scientific sense, or if it’s all just surface statistics, right? Despite that, they’re getting so good at emulating the fundamental building blocks of computation that I think we can easily imagine a future in which computation itself becomes a software problem, a problem of neural networks. So, if you want, it’s the inverse of Kittler’s hypothesis that there is no software: there will be nothing but software. Imagine computers that are designed exclusively to run neural networks. We’re already there; people are already building these specialized machines. Consider the use of FPGAs, which are programmable chips that change the hardware depending on what you run on them. Everything else will be built around these neural networks.
Now, there’s one counterargument to that, which is that there’s always a need for software that is provably deterministic, right? It’s the software that airplanes run on, which is often formally verified, so there’s a mathematical description of the complete set of possible behaviors for that software. But this is not what I mean. I mean that our everyday interactions with computation will almost certainly be mediated almost entirely by neural networks. Again, this is speculative. I’m not talking about chatbots or other uniquely AI forms of interaction; those are an HCI problem, not an epistemic problem. I’m referring to a scenario in which something as simple as adding two numbers correctly is mediated by a neural network, simply because all of your software runs in the browser, and your browser is nothing more than an interface to an industrial park full of servers running neural networks on highly specialized hardware.
All these epistemic problems are suddenly your problems as well, and they become more pressing and more significant. So what I take away from this is that, at the end of the day, it’s probably a good idea to look closer at all of the quirks and idiosyncrasies of the current generation of models, in order to be prepared for a future in which all of these quirks and idiosyncrasies will inevitably structure your reality, or will structure your reality more than they do now. Again, this is speculative; it might not happen, and stuff might change. However, I believe that the epistemic implications of neural networks are not general, but rather the epistemic implications of specific model architectures. Knowing these epistemic implications will be really important.
DB: Could you tell us what you are working on right now?
FO: Yeah, there are two projects that I’m working on. One is almost finished, and one is just beginning. The project that is just getting started is on protein folding. This is in the framework of a grant project: I’m part of a group of five institutions that are working together, with funding from the Volkswagen Foundation. It’s called AI Forensics. It’s a three-year project. So, within this framework, we’re working on protein folding, which is arguably the one scientific winner of artificial intelligence, the one application that isn’t purely decorative, like DALL-E, or extremely dubious, like these chatbots. It’s going to be really interesting, I think, to see what we find when we look closer at these particular systems that people use to solve problems, because, presumably, this is a research problem that has been solved by artificial intelligence. The other project, which is going to come out soon in the form of a paper, is on the notion of history in large image models, because I think it’s really interesting to think about the question of history, or the notion of history, when it comes to image generation. I can just briefly mention one finding from that paper, which is that, particularly for models like DALL-E 2, the big generative image model released by OpenAI in 2022, there’s a really strange double bind in terms of historical images. On the one hand, there’s this really strong tendency to make the medium historically correct.
For example, say you’re trying to generate a photo of a fascist parade. You couldn’t before, and you still can’t, because DALL-E famously limits what you can generate. There used to be a workaround for that: you could just misspell fascism, and then it would give you what you wanted. But if you do that, and I’ll explain why in a second, you’ll consistently get images that look like they were shot on early Kodachrome film, or black-and-white images. So if it’s color images, it’s early Kodachrome; if it’s black and white, it’s just regular black and white. Basically, it’s like looking at a specific period of the 1930s or 1940s through a mediated lens, and you can’t get rid of this specific form of mediation. You really need to throw in a copious amount of highly specific additional keywords and negative prompts to steer the model away from this particular medium. So there’s this strong bias toward historically accurate (in quotes) mediation that connects historical periods and historical media. Polemically, for DALL-E, the recent past is literally black and white.
The more distant past is literally made of marble. And of course, this is not surprising, because these models process an already mediated past. That makes sense. But then there’s this other aspect, which is that while syntactic speculation, generating historical images with inaccurate media, is really hard, semantic speculation is super easy. And you see that everywhere, because, you know, even the example image for DALL-E 2 that OpenAI gives you is the famous astronaut riding a horse on the moon. So you can create all sorts of crazy scenarios. This is essentially the other side of the coin. But there’s also the fact that these models are what I call “contingency machines” in the paper. They have to give you a result, even if it’s the most outrageous result; they can’t not give you an image, it’s just not a possibility. They always have to generate something. And then DALL-E 2 comes with these extra safeguards that are supposed to make it culturally agnostic, right, which means they purposefully left out culturally specific information from the training data.
That is, if you ask a model for a specific historical image, for example, “tank man 1989,” which obviously refers to the iconic photograph, right, from the Chinese Tiananmen protests, you will get a photo of a soldier, an American soldier, proudly looking at a tank, rather than that famous scene of radical civil disobedience. As a result, there is a double bind of syntactic invariability in the case of generally historical prompts and semantic arbitrariness in the case of specific historical prompts, and this is a highly politicized concept of history that is inherent in these models. To return to the previous example of generating a photo of a fascist parade, one of the many consequences is that, for instance, you can’t have fascism return, because it is, at the same time, censored, so you can’t talk about it; remediated, safely confined to, you know, a black-and-white media prison; and erased from the historical record, because you can’t make DALL-E produce actually existing historical photos. So it’s really a highly politicized concept of history. In the paper, I look at Walter Benjamin, who has a decidedly anti-fascist media theory, which is why I use these examples. So this is another project that I think is really fun to work on, because it has a practical component, but it’s also a more theoretical, conceptual explanation of the concept of history in these large models.
DB: A paper was published that investigated, for example, how a model is able to generate images based on random words, or in this case, practically gibberish. They went through a whole chain of random words, and, coincidentally, some of the resulting images would resemble birds, or different species of birds. I think that ties in with what you’re talking about, the contingency: at some level, the model is compelled or forced to spit out something, regardless of whether there is any social or cultural context. Face-to-face, of course, there wouldn’t be any reply or any socially conventional response if someone were just speaking gibberish.
FO: Yeah, absolutely. There’s just no equivalent to the famous blue screen of death with neural networks, right? You’ll never get an actual failure that’s visible and legible as a failure. You always get something that looks like success. But as in the case that I mentioned, right, sometimes it’s the opposite of that.