Generative AI has made major leaps since we last explored its use in game QA, and this episode dives into how that progress is reshaping the field. Host Devin Becker is joined again by Christoffer Holmgård and Julian Togelius, co-founders of modl.ai, to unpack how recent advances in computer vision and agent behavior are enabling fully no-code QA testing workflows. We discuss the shift from traditional code-integrated systems to screen-seeing, input-driving AI agents, and the technical breakthroughs that finally made this approach viable. The conversation also explores the types of bugs and edge cases this new method catches, and the surprising ways it differs from prior tools.
The conversation also goes deeper into what this shift means for studios. Julian and Christoffer highlight how QA roles are evolving when testers can direct powerful AI agents without needing engineering resources. They also examine the line between automation and augmentation, arguing for the enduring value of human testers while outlining where AI can dramatically improve speed, coverage, and reporting. From auto-generating reproduction steps to fitting into broader ecosystems of AI coworkers, this episode offers a grounded, forward-looking take on how AI is transforming QA from the inside out.
Previous episode with Modl.ai: https://naavik.co/podcast/ai-powered-quality-assurance/

We’d like to thank Heroic Labs for making this episode possible! Thousands of studios have trusted Heroic Labs to help them focus on their games and not worry about gametech or scaling for success. To learn more and reach out, visit https://heroiclabs.com/?utm_source=Naavik&utm_medium=CPC&utm_campaign=Podcast

We’d also like to thank Neon – a merchant of record with customizable webshops optimized for conversion – for making this episode possible! Neon is trusted by some of the biggest names in gaming and can help you sell direct without the typical overhead. To learn more, visit https://www.neonpay.com/?utm_source=naavik
This transcript is machine-generated, and we apologize for any errors.
Devin: Hello everyone. I'm your host, Devin Becker, and today I'm delighted to be joined by Christoffer Holmgård, CTO and co-founder of modl.ai, and Julian Togelius, professor of Computer Science and Engineering at NYU and co-founder of modl.ai, a company that uses AI to help improve the QA process for games. Quality Assurance, if you're not familiar with the acronym. Today we're gonna discuss the realities of the QA and AI market.
For those who didn't hear the previous episode, which we recorded last year, could you give us a quick overview of what modl.ai does and your background, Christoffer?
Christoffer: Oh, yeah, sure. Absolutely. So, modl.ai is entirely focused on helping game developers automate and augment their QA processes and their QA teams.
So what the company does is that we apply AI to play games. We apply AI to look at what happens when these AI agents are playing the games, and then identify, you know, issues or areas of concern that you would wanna lift up to your QA team so that somebody can do something about it.
And once those QA specialists and managers have had a look at it, they decide what goes on to the rest of the development team, you know, ultimately helping people do more in terms of improving the quality of the game to the benefit of their players. And we've been doing that for quite a while.
We got started in this space in 2020, and then in recent years we've started taking a slightly different approach to how we do it, because, as people might have noticed, how AI works and the meaning of AI has changed a lot over the last five years.
Devin: Cool. And your background and involvement in modl.ai, Julian?
Julian: So, I came to this from academia, and I'm still largely in academia. I've been doing research on AI for video games since way before it was cool. Really, I mean, starting more than 20 years ago, and back then, you know, people did not take it seriously at all. You're doing what? You're doing neural networks for, like, car racing and platformers and stuff like this? Why would anyone want that? That's not real science. I used to get asked, like, oh, that's fun, but what's your real research? And then I'm like, this is my real research. And they look at me like, oh man, this guy is never gonna have a career.
Well, jokes on them. So, you know, later on I was part of the founding team of modl.ai together with Christoffer and several others. We came from a disparate set of backgrounds, including game design and AI and a couple of other things, and we had a lot of experience in this.
We wanted to bring modern AI to video games. I mean, we were looking at: what are the problems we can solve? And we pretty quickly zeroed in on quality assurance, broadly because it's a problem people want solved. People are generally fine with speeding up quality assurance, and there's a lot of low-value testing tasks that people don't necessarily want to do. That's as opposed to, for example, character art and so on, where it's much more ethically tricky. So, we decided this is what we're gonna go for. And yeah, as we'll give a little bit of an overview, we tried a bunch of things to make this work, and we've learned a lot along the way.
And I think we have something really quite cool now, finally, after many learnings along the way.
Devin: Cool. Well, it's been a little over a year since you guys were last on, I think in spring of last year. And we'll link to that appearance in the show notes as well, just to make sure people can catch up on a little bit more detail on the QA process you talked about.
But I wanted to kind of catch up: since the last time you were on, Christoffer, what are some of the biggest ways that AI, and generative AI especially, has changed? And how has it affected the product at modl.ai? Just to catch us up on what's going on, especially since it's been kind of a big last year, I think, for AI.
Christoffer: Yeah, definitely. So, obviously, you know, you could see a lot of things ramping up to it, but definitely in the last 12 or 18 months, we think it's become really visible that we could use generative AI to achieve something that we actually talked about all the way back when we started the company, but that we never attempted, because we didn't think it would be technically feasible.
And what that is, is what we would call black box testing, or, like, completely non-integrated testing of video games. And so what that means is, you know, the major thing that we've seen enabled by generative AI is that it has enabled us to take images from games,
and then use those images to make decisions about what to do in the game. But, you know, if you went back a couple of years, our products and everything we did were based on integrating very tightly into the game engine, and you get a bunch of benefits out of it. You get, kinda like, direct X-ray vision into what's happening inside of the game. Or, you know, if we think about it from sort of the self-driving car metaphor perspective, it's kinda like having a super advanced lidar system, and you can measure everything that's happening.
That makes it really easy, or at least easier, to be an AI that has to play a game. But it also means that you have to invest a lot of time in, you know, putting this software integration into the game, getting these values out. You have to think a lot about all these different parts of it.
Now, what we started doing in the last year was that we decided to take the route where we said, well, what if you didn't have to do any of that engineering and integration work at all? What if we could just play the game from the screen the way that a human player does? And what if we could just send our input to the game using the same inputs that a player does, or simulated versions thereof?
And we'd been wanting to do that since probably at least 2019. Probably earlier, Julian, in the research domain, right?
Julian: I recently came across a webpage from my days as a postdoc, back in 2008 or so, where I was like, hey, I'm working on this thing, trying to play first-person shooter games from pixels and so on.
And it is true that I was working on it. It is not true that I made any progress.
Christoffer: That's right. Well, I mean, some progress was made. But around last year, 2024, I just think it became very evident that suddenly, you know, the LLMs and the VLMs were big enough, they were precise enough, and there is enough capability in there that you could extract information out of these systems in a way that we just didn't consider realistic before.
And, you know, the other thing that generative AI has enabled us to do with the product that we have now is that it enables our users to, kinda like, talk directly to the AI agents in terms of what they should be doing when we're testing the game. Or, in our case, writing to them.
To be honest, you know, we could add a voice feature at some point, I don't know. But what it means is, like, previously we would have interfaces and custom things that we had built to allow either a developer or the QA person to tell the AI agent what to do. The way that our solution works now, you write it in plain English. Until we support other languages, I suppose, it's English for now. So you write, in plain English, what you want to happen in the game. And that's kinda like also a major change in how we've been approaching this product for the last, you know—
Julian: Twelve to sixteen months. It's like, you know, when we started this, 2019-2020, the stuff we're doing now couldn't have been done then.
Back then there wasn't the technology to build on. And in fact, it probably wouldn't have been possible even two years ago. Maybe one year ago, you know? So right now, the way we're building the product, we are riding on the frontier of foundation models. That sounded really good, didn't it?
You know, we should put that on an ad somewhere. But the good thing about this is that the product will only get better. And it's also clear that, you know, it could not have been done before. We should have asked the investors for, you know, an eight-year runway originally, right?
Christoffer: That would've been perfect timing. But, you know, better late than never. Now, we talked a little bit about the technical parts of things and motivations and all that, but I also think, from a product perspective, one of the things that you gain when you have a black box testing solution, something that can play the game just by looking at it and take instructions just in English, is that you broaden the range of people who can contribute to that,
you know, to that process of automation in game development. Suddenly this is not necessarily just the remit of, you know, game programmers and engineers and people who have direct access to the code base. Suddenly you have a much wider range of people who can start automating what happens in the game production process.
And they all bring different expertise into that. They can contribute in different ways. And so, you know, that's something that's really exciting to us as well, that we can bring it to more people.
Devin: Well, it's cool that you guys are able to make that shift, because obviously, when it's a QA tester testing the game, or a player playing the game, they're not integrated into the code, and they don't interact through the code.
They watch the game and input the controls and things like that. So it sounds like you're doing a better job of actually simulating the end user, and that, I would imagine, can result in a better QA process. Obviously, you know, not exactly the same, but definitely something that better mirrors either a QA tester or a player. But, you know, speaking of those changes, looking back, Julian, you were talking about some of the things that made progress and some that didn't. What assumptions from the sort of previous era of AI, before the generative LLM stuff got really good, turned out to be wrong, and which turned out to be correct? Especially the stuff that you've benefited from, or the hurdles you kind of had to overcome to get here.
Julian: So, basically, within modl.ai we've tried like a hundred ways of doing this. And outside of modl.ai, we as an academic community have tried a thousand ways to play games in general. And I often wanna sum this up, you know, we're on our second edition of the Artificial Intelligence and Games book, by me and Georgios Yannakakis, who is also involved in the company.
We basically catalog the different ways you could use AI to play a game. And it turns out that it differs completely depending on the characteristics of the game, and the characteristics of what you have access to. What is the information you have access to? Like, do you have full state information?
How is it represented? Is it pixels? Is it objects? Is it simulated sensors? Do you have a forward model so you can simulate things? And all these things. There are much easier and much harder ways of doing this thing. If you have a simulator, like a forward model of the game, then things are way easier.
So chess really is the paradigmatic, simplest possible case. You have full information. You have a discrete game state that is easily processable, just a few bytes. A fast forward model, you can simulate millions of moves, all of these things. Modern games are not like that.
You don't have that. You can't simulate the engine that fast. The core loop, kinda like, you know, intermingles with UI elements and all kinds of stuff. And it's not even really well-defined what the game state is. There's a whole list of things that make it hard.
So, for a long time we tried various things, looking at, you know, how could you make an integration into the engine? Even supposing that you don't have a forward simulator, just get structured information about the game, in a way that is useful for a model or a search algorithm trying to play the game.
And this turned out to be really, really hard, because one learning we've had, again and again and again, is that all games are different. And not only are they different, they're different in different ways. Sometimes subtle things with the mechanics, and even how the UI plays into the game loop and so on, really throw you off. So, in a sense, what we're doing here is the Gordian Knot approach: basically, nope, skip all of that and just work from pixels. So, I think a lot of the learnings we have about what kind of information we need still apply. You know, you need certain information in a first-person shooter that you don't need for a matching tile game, for example. We are just not building those sensors, because it's too expensive to do for each game. We really learned the hard way that you can't spend tens of thousands of euros of integration on each game.
It doesn't work that way. But we do have a lot of learnings that we brought with us, including how different game genres have very different kinds of affordances and things that are important to look at. And these things scale. Like, you know, matching tile games do behave similarly in terms of what information you need and at what kind of timescale you need to operate, and so on.
And that differs a lot from first-person shooters, for example. And sometimes you see people coming new to this field, basically saying, oh, we can have one model that works for both of these. And I'm like: sure. Try that. Call me when you're done. Tell me how it went.
Devin: So, something that I'm really curious about: you mentioned, you know, different genres and different affordances, things like that. With the customers not having to directly integrate into the game's code, are they still having to explain to the AI how the game works? Like, what kind of prep work do they have to do? And also, how do you handle those genres that are maybe brand new, or mashups of genres, where the AI may not have any preconceived notions of how to play it?
Like, you know, I imagine the AI basically understands the first-person shooter and how that works, and, you know, genres that have been around for some time are probably in the model, at least as a conceptual understanding. How do you handle it when it's new stuff, where they're gonna have to explain to the AI how this game works so it can test it?
Christoffer: Yeah. I mean, concretely, maybe the place to start is that at the moment we're trying to be pretty specific about what kinds of games we support. Not really based on, like, how much information is available on the internet about them. It turns out that, you know, for most game genres, unless it's super novel, there's usually some information out there that will have been crawled by the frontier models, so they can talk about it at least at a high level. But you gotta think about all components of the system, and kinda like the speed of how quickly some of these models can make decisions inside of games in a cost-effective manner.
And so, for that reason, we've been spending this year mostly with mobile casual games. And they're super interesting to work on, because if you don't play them, they're more complex than you'd often think. They require some thinking ahead and a good amount of reasoning, and a lot of them don't require you to have, you know, twitch-level capabilities when you're playing the game. You can take a little bit of time to make a decision, or even if there's some real-time aspect to it, you might be able to plan a number of steps ahead and still buy yourself some time for the whole system to think.
So that's kind of like one way that we've been addressing this, and now we're scaling it to other games and moving in other directions. But that's where we've started. I would also say that the other way we handle, kinda like, differences in complexity is that we provide the AI models, both the big ones that we can only get from, you know, the large providers, but also some that we set up ourselves.
We provide them with, you know, specific context about the game in question and descriptions of the game. So you provide a little description of, you know, what is this game, what is this about. We might help the customer do that as well, if they want us to, but it really is all text, at least for now.
We may be adding more modalities later, but we add that in there. So you say, like, oh, this is a game about this and that, and you wanna make sure that you keep an eye on these rules, and if this happens, you definitely wanna let us know, because that's not, you know, according to the rules.
And then the other big part where we do some game-specific tweaking is for what is called, I guess, the perceptual part of the problem. So that's just being able to see, you know, what the heck is happening on the screen, where it's happening, and when it's happening, and being able to do it fast enough.
So we still train our own models, but where we used to train, you know, back in the old paradigm, it was much more on the behavioral side. A lot of our training now goes into just being able to see and act accurately inside of these games.
Julian: Yeah, it's interesting to point out that there really is a role for multiple different models at different sizes, right? I mean, there are the large models that need to do all this reasoning, and then you have the models that you can tune for very specific circumstances. It's kind of an intricate game of latency, cost, and so on.
Devin: So, talk about how you guys are integrating it now. What technical breakthroughs, outside of just the general LLM stuff that's been going on in the last year or two, made this sort of vision-based approach doable for you guys? Obviously, Julian, you mentioned that maybe within the last year, maybe within the last two years, was this even a doable approach.
What is it specifically that made this doable? Is it the computer vision stuff? Was it the LLMs? What kind of tech made this doable?
Julian: I think it's them coming together. So, the vision language models, or multimodal models. The large models from the leading providers, including of course OpenAI, and Google, so Gemini, and also a bunch of open-source
models like Qwen. They are natively multimodal. They can take images in as well as text and combine them, and that gives us reasoning. They have simply seen so much, essentially, so they can recognize so much, and you can do some kind of visual reasoning by combining the text and the image input. Now, it has downsides as well. One of them is that they're slow, right? If you gotta send your query over the network to one of these large models, it's gonna take time. You're not gonna be able to act within a second.
You're most likely gonna take several seconds to act, which is why you may want to intersperse these with local models, fast models you can run locally that respond in much less than a second. And also use models that are pure vision models to identify and locate various objects.
So, this is a complex dance, a complex orchestration. Another issue is that these large models (and this is December 2025, I'm saying this and it's true now, it might be different later) are, in general, very good at recognizing things, and even understanding relationships between things, but they are weirdly bad at spatial reasoning. Things like, you know, oh, here is a carrot and here is a flashlight, and the flashlight is under a table, and how do you need to move the carrot to put it on top of the table? It often just fails. It knows about the objects, knows what they are and so on, but can't figure out how things relate in space.
And then you basically get around that by orchestrating smaller models to locate things, and maybe having a separate little pathfinding routine, and so on. So this is where the devil is in the details.
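As a rough illustration of the orchestration Julian describes, here is a minimal sketch: a large multimodal model proposes a high-level goal (stubbed out here), a fast local vision model locates objects on screen (also stubbed), and a plain pathfinding routine handles the spatial reasoning the big models are weak at. All function names and data shapes are hypothetical, not modl.ai's actual API.

```python
from collections import deque

def plan_with_vlm(frame):
    # Stand-in for a slow, networked VLM call that picks a goal object.
    return "key"

def locate_objects(frame):
    # Stand-in for a fast local vision model: object name -> (row, col).
    return frame["objects"]

def find_path(grid, start, goal):
    # BFS over walkable cells (0 = free, 1 = blocked); returns the
    # shortest list of cells from start to goal, or None if unreachable.
    rows, cols = len(grid), len(grid[0])
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == goal:
            return path
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(path + [(nr, nc)])
    return None

def step(frame, grid, agent_pos):
    goal_name = plan_with_vlm(frame)    # slow call, made occasionally
    positions = locate_objects(frame)   # fast call, made every frame
    return find_path(grid, agent_pos, positions[goal_name])

frame = {"objects": {"key": (2, 2)}}
grid = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
path = step(frame, grid, (0, 0))
```

The point of the split is that the expensive model only names *what* to do, while cheap, deterministic code works out *where* things are and *how* to get there.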
Christoffer: Really. I mean, following on what you were saying, Julian: nine out of ten times, or even more often, the high-level plan that you're gonna get from some of these models, if you orchestrate them correctly and give them the right information from the game, is just gonna be super reasonable. That part works really well if you put it together well.
But then, like you're saying, making a timely decision and understanding where things are, that's usually where you have to do a little bit of extra work.
Julian: Well, it's not at all obvious to a person who's not working a lot with these models what they're good at and what they're not good at. That's the kind of jagged frontier thing, right?
Christoffer: Yeah. And I think another thing that's worthwhile pointing out is where you can gain a lot, and this has really come through this year: if you can orchestrate the different models, not just in terms of, like, oh, I gotta call that one or I gotta call that one, that's important, right?
But the other thing that you can do is that you can crystallize artifacts out of the models that you can then use really quickly later. And what do I mean by that? I mean that instead of having to use an LLM every time you have to make a decision, you can use an LLM to generate a little piece of code that can make that decision every time, from the outside of the game.
So instead of querying a model, you're just running something that you know is gonna work, right? Or, if you need to be able to recognize something inside of the game, rather than asking a super big VLM every time, maybe you ask it, you know, 20,000 times ahead of time and then you store the answers in something else.
And then you use, kinda like, these cached responses to improve the performance when you actually need to perform at runtime. That's also the way that we build our models and datasets for this stuff: it's based on both before we're playing the game, when we're playing the game,
and after we're playing the game. We're sending all this information around, you know, to different visual models and purely textual models, and we pull it back in and we train something. And that's also something that's only really come together this year, I would say, at least for us anyway.
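A minimal sketch of the caching side of that idea: query the expensive model once per distinct input ahead of time, store the answer keyed by a hash of the screen region, and serve cached answers at runtime. `expensive_vlm_classify` is a hypothetical stand-in for a multi-second VLM call, not a real API.

```python
import hashlib

call_count = 0

def expensive_vlm_classify(image_bytes):
    # Pretend this is a slow VLM call that labels a cropped UI element.
    global call_count
    call_count += 1
    return "play_button"

cache = {}

def classify_cached(image_bytes):
    # Hash the raw bytes so identical screen regions share one answer.
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in cache:
        cache[key] = expensive_vlm_classify(image_bytes)
    return cache[key]

# The same screen region triggers only one model call; later lookups
# come straight from the cache.
a = classify_cached(b"frame-0-crop")
b = classify_cached(b"frame-0-crop")
```

The same pattern extends to the "generate a little piece of code" variant: the artifact you crystallize is just a reusable function instead of a stored label.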
Devin: Do you guys think, then, based off of what you're saying, Julian, about spatial reasoning and things like that, and just the ability for the models to learn off the games, that world models could be the next big leap in terms of this tech? Or is it something else, you think?
Julian: Great question, and I wish I had the simple answer. World models simply mean very different things to different people. I mean, there's the tech like Genie 3, where you're basically simulating a 3D environment. It produces walking simulators in fanciful environments for you; that's one way of putting it. There are several issues with it. One is that it's slow, another is that it's unreliable, and so on. But even if you assume that you can get around all this, that you get them coherent, you get them, you know, affordable and controllable and fast and all this kind of thing,
then you have the issue that they are making very strong assumptions about the world. Then we get back to what kinds of games we are playing. Nothing you can produce in Genie 3 has anything like the game mechanics you get in Candy Crush. So then you need a different world model. Then you can imagine: can we get a world model of things that are vaguely like matching tile games?
You know, oh, in this world we get Bejeweled and Tetris and Candy Crush and Dr. Mario, whatever. But that's a completely different problem. And now I feel I'm putting on my researcher hat here, as in: I want to do this, I wanna build this. This motivates me.
But I think that's pretty far from being useful for what we're building at modl.ai. We need something simpler. We need models to understand that if an object moves to the left, then it'll be to the left of where it was before, and things like this. But yeah, I mean, that's the level.
And again, if you don't play around with a lot of these models, it's really hard to see the relative capabilities. Like, you know, they can be amazing at some things: you can get a lyrical description of the background scenery you're in, an intricate, detailed description of the objects and what they could be used for and so on.
But then it still can't understand that, you know, it needs to move left to go left, or something.
Devin: Yeah. Well then, going back to how it actually is used in practice, and the current products and things like that: how do the actual game studios integrate it? I mean, we touched on it lightly, but they're not integrating it at the code level, right? So what do they do, assuming it's a genre that you guys are working with?
Christoffer: Yeah. I mean, currently, the way that we work, and have worked with early customers, is that we ask people to point us to their app store build, if the game is already live, and then we show first results with that in the first meeting.
And that's kinda like a neat thing about this black box approach. You can show up and say, okay, this is what we were able to do with your game, kinda like out of the box. And really, all that integration looks like is uploading your binary, and then we take it and we run it in our infrastructure with the agents and we get results out of it.
And most of our results are derived from, you know, the video of the game playing and whatever logs we can pull from the outside. So you don't have to do anything special. I mean, for some games you maybe wanna set up a particular user account, or if you have, you know, some hard and soft currency systems going on, maybe you wanna have some tests where it's great to have a million billion,
you know, jewels that you can just spend as you're testing the game. But it aligns much closer to the practices that QA teams are used to when they're setting stuff up for testing. You also have to produce some text that tells the agent that plays the game what it should do, and in that case you can probably start from the test cases that you have already, or you can come up with some that align with them.
And we've come up with our own version of it. We call them tasks. But at the end of the day, it works like this: you can have a task that's a high-level description of something that you want tested. And that could be, like, oh, make sure all my menus work,
but don't go to the store. Or vice versa: you know, make sure the store works, but don't spend time on the settings menu. Or it can be very specific lists of sequences of stuff that you want to happen, and kinda like anything in between.
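To make that concrete, here is a hypothetical sketch of what such a task might look like: a free-text instruction for the agent plus a simple avoid-constraint that the test loop checks before visiting a screen. The field names are illustrative only, not modl.ai's actual task format.

```python
# A plain-English instruction plus machine-checkable constraints the
# agent loop can enforce without calling a model at all.
task = {
    "instruction": "Make sure all my menus work, but don't go to the store.",
    "avoid": ["store"],
}

def should_visit(screen_name, task):
    # Skip any screen whose name matches an avoid keyword.
    return not any(word in screen_name.lower() for word in task["avoid"])

ok_settings = should_visit("settings_menu", task)
ok_store = should_visit("Store_Front", task)
```

The instruction text would go to the planning model, while the `avoid` list acts as a cheap guardrail applied on every step.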
Devin: Well then, you know, having this easier integration, obviously, like it's great that you guys can do stuff ahead of time.
And that enables a lot of accessibility. I'm kinda curious, then, on the output side and on the capability side: what does this newer vision-based approach enable, versus the code integration approach? Where is it maybe stronger on the QA side than the code integration approach, which I imagine also has some advantages as well?
Like, what does this enable, what does this let people do that they just couldn't do before?
Christoffer: Yeah. Well, I can start at least, but feel free to jump in, Julian. I guess we could start with the strengths, and I think this is, at least from our perspective, something that we see.
You know, studios and people working on games find it very appealing that there's ecological validity to it. The results you're going to get out of testing your game with this technology are going to be very close to what you'd get if you just handed a phone or the device to a human tester and asked them to play the game. So we're sort of — yeah, go for it, Julian.
Julian: Sorry, no, I'm just saying — the wider view is that it's a common problem that when you use various game-playing algorithms, you might get good game-playing behavior that is not human-like. I mean, it's common from, like, chess and Go and so on.
We've seen this in top-level play, and in the ways these systems do weird things. And we've also seen this a lot in the discussion about reward hacking in learning to play video games, where agents learn really strange strategies through reinforcement learning. But here — because the autoregressive training of large language models isn't exactly, well, you can't really call it supervised learning, and you can't really call it imitation learning — the way the LLMs understand the game is more like how humans would do it, and the kind of action suggestions you get are more like what humans would come up with, because they're trained on human data.
So that is largely why what Christoffer says is true — it's more human-like behavior. Although now that I think about it, I'm not entirely sure that completely explains it. It's empirically true, though, that it's more human-like than it would be with, for example, a search algorithm.
Christoffer: Well, actually, I wasn't talking about the agent behavior — I was talking about the game itself. You're absolutely right, but I was thinking of the fact that the thing that actually gets tested, the artifact, has not been modified to support the agent in any way.
Julian: Right.
Christoffer: That's true.
Julian: Nice. But what I said was also true, right?
Christoffer: Yes. Yeah. I reached—
Julian: We've got to get back to this after the podcast today and try to sort it out.
Devin: I'm curious though, does it ever actually watch players play to, to learn anything about how to play as well?
Christoffer: We don't do it at the moment for anything that we're selling — for production stuff.
Julian: Yeah, we have a whole pipeline for it and so on, but right now we're trying to do away with it — or to not use it when we don't need to — because that's essentially integration time.
Devin: Right. I'm just curious if it'd be useful for generating more human-like behavior in certain situations where maybe the LLM or AI doesn't really do that naturally?
Christoffer: Yeah, no, it, it'd definitely be useful for, yeah, for that. Sorry, go ahead.
Julian: Although right now, good behavior is more important than human-like behavior for most games. And actually, those are three different things: playing the game well, playing human-like, and playing in a way that's useful for testing.
Because for testing, sure, you need to be able to play the game, you need to get through it, but you also want coverage, and that is slightly different. But yeah, it can definitely help to train on human behavior for a particular game. If we can do without it, though, that's preferable.
Devin: Since you say you can kind of communicate with it, can you tell the AI to purposely try and break the game — to do things that a bad actor would do? You know, don't try to play the game well; try and screw it up, try to do things you shouldn't, try to break the game as much as possible.
Christoffer: Oh yeah, definitely. A lot of the tasks that end up getting entered into our system go in that direction, because you want to be testing for some of those edge cases, or you just want to be sure that everything works. We also give the agents full OS-level access.
So if it's on an Android phone, it can use the keyboard, it could go to a different app, go to the Play Store, or whatever. We've also had it rename itself, try to change the account password, and stuff like that. And obviously you could put some guardrails around it, right? But all of that is available kind of by design, because you want to see what happens when the agent, say, turns off the internet to the game mid-session, right?
Devin: Nice. Well, yeah, definitely helpful. You mentioned earlier this being something more accessible for non-technical people as well, and maybe even non-QA people.
Do you see this expanding the QA role, or the ability to do QA work, within companies? Say, for example, when you're developing the game and you're one of the coders on it, it's very helpful to be able to run and test the code you just built and make sure it works before sending it to QA.
Is this something that increases the accessibility of QA processes to other people, and what direction do you see that going in terms of helpfulness for companies?
Christoffer: We'd love to be enabling that across different disciplines. I will say that what we're putting out there has the QA folks in mind.
So the typical user for us — kind of our North Star — is a QA specialist or a QA manager. But there's no reason other disciplines can't pick this up and use it. As a designer, you could be interested in how often something occurs, perhaps, or, like you were saying, if you're an engineer on the game, you want to make sure a lot of time gets spent in this particular part of the game.
And anyone with an account on our platform could do that. You're not limited to being on the QA team, necessarily.
Devin: Well, Julian, here's a question somewhere between practical and ethical: how do you draw the line between augmenting what humans are doing — quote-unquote giving them superpowers — and replacing humans? I think that's a big concern, especially in the QA department, with unionization and things like that going on.
So I imagine that's on the minds of people listening to this. Where do you see that landing, especially with your product, but also just the general concept of using AI in QA?
Julian: Honestly, it's something that we think a lot about, and not just in QA.
So when we went for QA originally, one of the reasons was that we thought this was one of the fields where people were less worried about being essentially replaced, because fewer people see it as a domain of their own creative expression. But then again, it's a job.
And trying to take away people's jobs is not something to be done without serious thought applied to it. So what we want to do is give people leverage to be more effective testers and more effective at QA. Now, you could argue that, okay, if the machine does all the QA, what would the human do?
Well, the human sits on top of it. And here's the crucial thing: almost all games are nowhere near tested enough. I mean, we need way more QA; we just don't have the resources. So what we see ourselves doing is giving testers superpowers. Instead of playing some repetitive sequence a hundred times, you're telling a hundred agents to go do it and report back to you afterwards. And when you need to do this every time someone changes something in the game, well, you just click a button and there you go. Instead, you spend your time doing high-level analysis of the results, and thinking about what should be tested.
We're seeing this in other AI-augmented fields: the cognitive work moves more into deciding what should be done. Like when you're using AI tools to program for you — it becomes more about deciding what should be implemented, and so on.
Christoffer: And if I can add to that, Julian — from my perspective, you do need a lot of creativity to do good QA, and hopefully this can help you unlock more of it. Yes, you have to do a bunch of repetitive work: you have to test that everything works, you have to go to all the parts of the level, and all that, right?
But underlying that, you also need a lot of lateral thinking and a kind of creative, holistic systems understanding to figure out: where do I go to break this system? What have I seen and what haven't I seen? And that is the thing I imagine the human operator would be spending more time on.
Right. And then also diagnosing things. Rather than having to collect all the evidence yourself by playing the game, you get handed, like, a detective mystery case, and you start putting the different clues together right out of the gate.
And that's, that's part of our vision with it anyway.
Devin: So how far can this go in terms of acting like a real QA tester on the reporting side of things? You talk about the QA person deploying these like a bunch of workers that bring information back.
But how far does that information go? Can it actually identify reproduction steps — here's how you would reproduce it, here's a video of it, here's the root cause we think? Or is that left more to the human side of things: hey, I found bugs, you figure out how this all works off a recording or something like that.
Christoffer: I would say some of it we pull together for the user already. If you give us a list of steps, the system will check for all of those things happening, and if something different happens, it will tell you what happened. We're also increasingly working on pulling together multimodal bits of information.
So we take the video, the instructions, all the thinking the agent had, and then also the game log, if we can access it — anything that happened on the device — and we make all of that available in the reporting. We try to use it to pinpoint the different moments that an expert human tester would want to know about, and then summarize.
So it works on two levels. We have all the individual indicators — here we saw something happen at this minute, go to this video, you can correlate it with the log, and we had a performance spike — and then we start looking for patterns across all of them and pull them together into high-level reports for the user. From there, deciding what happens next is where we currently say the human expertise comes in: you decide what flies and what doesn't, what gets sent on and what doesn't. Of course, it'd be super interesting, and I think eventually feasible, to take some of these higher-level summaries and say, oh, actually, in nine out of ten cases this is the typical fix for an issue like this, and here's a post that goes along with it. Maybe you want to send that to the dev team, or off to some other AI agent that could make a pull request to your repo.
And then somebody could decide if that's what we want to do.
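(The two-level reporting idea — per-moment indicators from video, logs, and agent reasoning, correlated in time and then rolled up into a summary — can be sketched roughly like this. The data shapes, sources, and time window here are invented for illustration.)

```python
# Illustrative sketch of two-level QA reporting: timestamped indicators
# from several sources, grouped by time, then rolled up into findings.
# All field names and thresholds are hypothetical.

def correlate(indicators, window=2.0):
    """Group indicators from different sources that occur within `window`
    seconds of each other, so a log error can be matched to a video moment."""
    groups = []
    for ind in sorted(indicators, key=lambda i: i["t"]):
        if groups and ind["t"] - groups[-1][-1]["t"] <= window:
            groups[-1].append(ind)
        else:
            groups.append([ind])
    return groups

def summarize(groups):
    """Roll grouped indicators up into one-line findings for the report."""
    findings = []
    for g in groups:
        sources = sorted({i["source"] for i in g})
        findings.append(
            f"t={g[0]['t']:.0f}s: {len(g)} indicator(s) from {', '.join(sources)}"
        )
    return findings

indicators = [
    {"t": 61.0, "source": "log", "msg": "NullReference in Shop.Open"},
    {"t": 62.5, "source": "video", "msg": "frame-time spike"},
    {"t": 120.0, "source": "agent", "msg": "could not find Continue button"},
]
```

In this toy run the log error and the video spike land in the same group, while the agent's later confusion becomes a separate finding — the human then decides which ones become tickets.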
Devin: So how do you see this fitting in with other AI tools? For example, on the report-writing side — maybe there's a human involved in identifying and collecting a bunch of information, but then AI writes the report or deals with the tickets.
How do you see this fitting into the broader ecosystem as people look for different places and ways to integrate AI? It's almost like the idea of AI coworkers, where different AI departments handle different aspects of the work. Where does this all fit in?
Christoffer: I think this is a piece of the process that was kind of missing before. On one side of it you have the CI practices producing the artifact that gets tested in the first place, and that has to go together with the tools people are already using — and their testing expertise — to decide what actually gets tested on that artifact.
And you've got our system in the middle, right? But the output is also information that then needs to go to Jira, where people might review it, or to some documentation system, or straight to the code repository. I think it's key for a solution like this to be super interoperable with all the other systems that people use and might want to connect to it. Once you've run a test with us — you've executed your game, the agent did things, we collected data and drew conclusions from it — a product like ours should just make that available in all the other places where it would be helpful to plug it in. Because at the end of the day, it's about maximizing the leverage that this automation gives you, and then letting people do as much as they can with it.
Julian: I can imagine this going really far, in different ways. Maybe one day we'll have a command line interface — you know, there's Claude Code, and then, like, modl test, sort of, right?
Devin: So it sort of integrates into that broader AI environment, so to speak.
Christoffer: Yeah, yeah. We'd love for this to — imagine if this generates an ongoing knowledge base about the history of your game and where your game is now.
And you can actually query right down into that. You can ask, oh, is level six broken? And the system comes back and tells you, these are the reasons I think level six is broken. And now you might become more interested, and you say, okay, how long does it take to complete level six on average?
And then you start using that data in interesting ways, right? If you have something playing your game all the time, you're collecting all sorts of information about it. So you could imagine just asking questions straight down into that and getting information back that you can use.
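(Underneath, a question like "is level six broken?" or "how long does it take on average?" is an aggregation over stored playthrough records. A toy version, with an invented record structure:)

```python
# Toy sketch of querying an accumulated playthrough knowledge base.
# The record structure here is invented for illustration.
playthroughs = [
    {"level": 6, "completed": True,  "seconds": 95},
    {"level": 6, "completed": False, "seconds": 40},   # agent got stuck
    {"level": 6, "completed": True,  "seconds": 105},
    {"level": 5, "completed": True,  "seconds": 70},
]

def average_completion_time(records, level):
    """Answer 'how long does it take to complete level N on average?'"""
    times = [r["seconds"] for r in records
             if r["level"] == level and r["completed"]]
    return sum(times) / len(times) if times else None

def failure_rate(records, level):
    """Fraction of runs on `level` that did not complete -- a quick
    'is level N broken?' signal."""
    runs = [r for r in records if r["level"] == level]
    return sum(not r["completed"] for r in runs) / len(runs) if runs else None
```

The natural-language layer on top would translate the question into queries like these and phrase the numbers back.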
Julian: And it can look at the tests you've done in the past and say, hey, I saw you did these tests on level six before — do you want me to run those again, with slight variations and so on? And while you're off doing something else, have it running.
Since the very beginning of modl, we've had this vision that as you're developing, there should be agents playing your game all the time — keeping the service warm and feeding data to you. This also comes out of academic prototypes we did way back, basically for these real-time playing agents.
Devin: I'm curious where you see the management side of this going. Right now you're talking about this being deployed by maybe the QA team leads, or some of the more experienced QA testers who know what kind of testing to direct. But do you see this evolving into a QA agent manager that's AI,
managing the other QA agents that are AI — to, like you said, run 24/7? Where it's like, hey, we're off for the weekend, but this AI is just sitting there over the weekend managing itself, directing different tasks, knowing which coverage needs to be done, or figuring out additional coverage to check and test.
Obviously there's a lot of potential for orchestration to be pretty huge in this. Do you see the potential to grow this product out to that, or to integrate with other products that help manage that, as we touched on a little earlier?
Christoffer: Yeah. I think there's a lot there, and maybe that isn't even the end game, but it's a place you could end up with something like this, right?
Something that's continuously generating new knowledge about your game. And again, we're not limited to just finding bugs with it, or concrete issues that need to be addressed. You could also use this for balancing information, or personalization information, or assessing what would be good for different player groups.
Suddenly the possibility space expands a lot beyond finding issues — not just whether it works, but what works well and what aligns with the vision you have. And you start giving people design-relevant information, up and down that axis.
Devin: Cool. A lot of fun possibilities, I imagine. Julian, then, given how much has changed in the last year or two — really all the time, but especially the last couple of years — what do you see happening in, say, the next two years from the technology standpoint that affects all this? Whether that's specifically your product or QA testing in general.
It might even be outside of games — just general QA testing that becomes possible as the tech hits its next level. It doesn't sound like world models would be super useful within the next two years, but what else might be?
Julian: You were saying before, "the end game" — and I was just thinking, there is no end game. There was a Marvel movie called Endgame, you know, and then they just kept putting out movies afterwards, and they just kept getting more and more confusing.
Christoffer: You've lost both of us already here. I would like for this to not go the direction of the MCU — let's hope it's more positive than that.
Julian: I mean, the spatial reasoning thing, obviously, is a really big thing — setting that aside, there's something as simple as speed. Speed is a real issue for us right now. We're constantly navigating the trade-off between analysis quality, reasoning quality, and speed.
And the more that solves itself, the better the results are going to be. At some point there's going to be a cost issue as well, but that goes hand in hand with speed. Wow, this sounds really boring — you're asking for the visionary stuff and I'm talking speed and cost. But honestly,
these are major issues for us, and there's a lot of stuff we could do in theory that just doesn't fit in a production workflow because of speed and cost.

We're also really excited about what we could learn across games. As we said before, all games are different, but within genres in particular there are things that could generalize — questions like: if you learn something about testing strategies for match-three games, will it carry over to merge-type games, or falling-blocks games, and so on? And can we use that to suggest interesting actions?

Then there's another thing — again, looking at a very rich literature of ideas that have been tried in academia but are far from applicable in production so far. Can we look at player styles? Can we have these levels tested by being played in lots of different playing styles, different player archetypes? Which kinds of issues would be faced by a pro player versus an amateur? And of course, going beyond "does it work or not" to giving complete experience reports of the game: hey, I played this piece of content you're proposing here, and I think it's going to be interesting for novice players but, frankly, a bit boring for pro players.
Or: hey, I played this, and I think there are too many options — few people are going to explore these things. There's so much you can go into here: being able to predict even what kinds of behaviors would happen, what kinds of players would be attracted to this, for example. And at some point — and here I'm taking off my academic hat, because it might get dirty — you can connect this to ad monetization. Can you use this playthrough data to predict which ads should be shown, or what kinds of players you want to attract, through understanding the game via automatic playthrough analysis? Almost certainly you can. And if you're the company doing all this testing, you'll be sitting on a lot of data about playthrough profiles and which player archetypes will and won't work.
And then you could use that to make interesting recommendations about, for example, ad placements and ad buying. Maybe I'm saying too much now. Christoffer, tell me if I should stop.
Christoffer: I think, I think that’s right.
Devin: Well, yeah, it's really interesting to think about that idea of an AI focus group — QA testing for subjective instead of objective problems.
Julian: Yes — and obviously, "does it work, does it have obvious bugs" is really just scratching the surface of everything you can get through playthrough analysis.
And again, people who want to second-guess us and figure out our next moves can dig into the many dozens of papers we've written about these topics, sitting on our academic shelves.
Christoffer: Plenty more where those came from. Like—
Julian: There's a whole heap of ideas there that we just barely got working for one simple prototype game — currently far from working in a generalizable way in any kind of production environment — but we know that in principle it can be done.
Well, hopefully it'll be fine, no problem. And of course there's the more direct integration into content pipelines. Like, okay, I see that level six — since we've been talking a lot about level six today—
Christoffer: Best level also.
Julian: —level six doesn't have good content for exploration-focused players. Do you want me to suggest this kind of labyrinth puzzle here, and so on?
Devin: It's gonna be like ChatGPT at the end, always suggesting: do you want me to make a table of that? Want me to make some slides for you?
Christoffer: Yeah, yeah.
Julian: So I'm thinking more visually — it just pops up, like, hey, how about this puzzle here? Do you want to put this here?
Devin: Okay. So we like to finish things off here with a few quick questions, just to get some quick bites for fun — so feel free to be a little controversial if you'd like. We'll go through three questions real quick. First one: can you give an example of an unexpected success, or an emergent behavior, from using AI in QA that made you go, whoa, what just happened?
Christoffer: Well, the overall unexpected success is that it works as well as it does — we wouldn't have thought that a while back. But one thing we saw recently was an agent that entirely failed to follow our instructions, and by doing so it exposed a bug in our system and, at the same time, triggered a bug in the customer's game as well.
So in many ways their bug showed us our bug, and our bug showed them their bug. We weren't really counting on that. The question is: who's really testing whom?
Devin: Cool, good answer. What areas — and I know you mentioned the jaggedness earlier — do you still see AI unexpectedly struggling with, outside of just the genre thing?
Christoffer: Sure. I'm always surprised by what they can remember and what they can't, especially for long-running or long-horizon tasks. They'll remember really arcane and unexpected things, and then they'll forget whether red means on or off for a button. And you kind of have to work around both of those things.
So, you know, smart — but maybe not necessarily smart in that way.
Devin: Some coherency issues seem to be a constant thing with these — the more complicated they get, the shorter the memory seems to be. So that makes sense. And feel free to get a little controversial on this one as well.
Some shots-fired moments, if you like: any recent games that you've seen that really could have benefited from AI QA coverage?
Christoffer: Oh yeah. You know, I don't want to call anyone out. What I actually have seen are some games that have been out for a decade but could still benefit from some QA coverage. So I think nobody goes free here.
You can take a title that's been in live ops for half a decade, and you'll still find stuff in there that people either know about or don't. So I think we're all in this picture, and we have to decide if we like it.
Devin: Maybe you'll have to get some people to use it for remasters, right? So when they go to remaster the game, make sure they get some good coverage there and have the remaster be a better version.
Christoffer: No, that's right. We'll test both, and then if you want, we can compare which one had fewer — or more — bugs in them. We'll provide one for free.
Devin: Cool. Well, hopefully it leads to better games, right? That's what this is all about. Hopefully, at the end of the day, as your product grows, as other products gather steam, as the whole AI ecosystem grows, it leads to better games — not worse ones. I think there's definitely a positive future here, and it's great to hear what you guys have cooking. I imagine there's a lot more cooking under the hood that we'll see over time.
So, for anyone who wants to check it out: it's modl.ai — no "e" in there, so just m-o-d-l dot ai. It'll be linked in the show notes, of course, but for those of you just listening, I want to make sure you know where to find them. Lots of cool stuff, and great to see the technology being unlocked by these new advances in vision.
I want to thank you guys for joining us — for the second time for you, Christoffer, and the first time for you, Julian. Really appreciate the conversation; a lot of fun stuff in here. I'm sure we could keep going forever as this tech continues to emerge, but I want to thank everyone listening as well.
Hopefully you had a good time. It sounds like Julian is suggesting there's a lot of reading material out there for those of you who want to go further down this rabbit hole — or have AI help read it for you.
Julian: Yeah, but then you'll have to check the AI's summary to see what it missed.
Devin: Comprehension tests. We'll see. Cool. Well, thanks guys. And we'll catch everyone on the next episode. In the meantime, have a good one.
If you enjoyed today's episode, whether on YouTube or your favorite podcast app, make sure to like, subscribe, comment, or give a five-star review. And if you wanna reach out or provide feedback, shoot us a note at [email protected] or find us on Twitter and LinkedIn. Plus, if you wanna learn more about what Naavik has to offer, make sure to check out our website www.naavik.co there. You can sign up for the number one games industry newsletter, Naavik Digest, or contact us to learn about our wide-ranging consulting and advisory services.
Again, that is www.naavik.co. Thanks for listening and we'll catch you in the next episode.