Once again, Angelo is joined by colleagues Andy Lee & John Ryan as they continue their discussion of simulating biological systems and how they use computer science to simulate drug compounds.
This episode touches on computer simulations, machine learning, and GPUs. How do these aspects of computer science relate and differ? Andy and John dive deep into how they push the boundaries of what is possible and practical in modern medicine by simulating biological systems.
Our Team:
Host: Angelo Kastroulis
Executive Producer: Náture Kastroulis
Producer: Albert Perrotta
Communications Strategist: Albert Perrotta
Audio Engineer: Ryan Thompson
Music: All Things Grow by Oliver Worth
Angelo: On today's podcast, we're going to continue our discussion of simulation, using computer science to simulate drug compounds. But we're also going to talk about how that compares to machine learning and other techniques, GPUs and things like that. And I'm sure there are a few other things along the way that might be interesting. I'm joined again by Andy Lee and John Ryan. I'm your host, Angelo Kastroulis, and this is Counting Sand.

Let's jump in by first talking a little bit about GPUs. We talked last episode about using simulation, and of course, when you think about simulating the way a game simulates, you've got GPUs involved. So, what's interesting with these kinds of problems is that hardware makes a big difference. There's no question, right? A new generation of GPUs comes out with some new capability and it makes an immediate impact. But algorithmically, one of the things you said last episode is that you made your own engine. It sounds like there's a war story there that we might find interesting. Do you have any stories you'd like to share about the work you've done along the way building your own engine, improvements you've made, maybe incremental improvements that seemed to add up?

John: I'll start by saying that the war is ongoing. Neither side has won yet. And by that I mean we're still finding ways to improve. We're still finding the pain points and finding ways to get over them. I think the biggest, easiest win was that initial switch from the game dev engine to the initial implementation. We spent about a year on the initial implementation in the game dev engine before we switched over to writing the C/C++ engine. And Pascal had just come out, the Pascal generation of Nvidia cards. As you mentioned, a new generation of hardware comes out and it enables things to happen. I think that was a big turning point for us. Just by moving to those new GPUs we were able to get immense performance gains. That was a huge jump for Nvidia at the time, and we jumped right onto it. We hired a new programmer out of a local university here to tackle that project, and he took the code I was working on in the game engine and more or less ported it over to the new C++ engine running with CUDA. It took about another year to get that to the state where we had it in the game engine, and during that time I was able to work on refining the user interfaces and moving things over.

But one thing we've learned, I think, is that CUDA kernels run in C. CUDA kernels are C kernels, and we adopted a pretty strong object-oriented C++ approach to our system early on. With that, I think some unnecessary complexities were designed into the system, which we discovered later when we saw performance issues related to those complexities and the object-oriented design. By taking that away and focusing on a simpler, more C-like approach, and this is something we're still working on, we're going back and saying these things could be simplified. We should be looking at things in a more functional way rather than trying to define all these classes of objects, because in a system like this everything is really just an entity, and these entities have different properties that interact with each other in different ways.
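To make that design shift concrete, here is a minimal sketch of the kind of data-oriented layout John describes: a flat "struct of arrays" of entity properties that a C-style CUDA kernel streams through, instead of a class hierarchy. The field names and the update rule are invented for illustration; this is not their actual engine code.

```cuda
#include <cuda_runtime.h>

// One device array per property ("struct of arrays"), so threads read
// coalesced memory instead of chasing per-object pointers. What an
// entity *is* (protein, vesicle, organelle) is just data: an ID.
struct EntityBuffers {
    float3* position;   // device pointer, one slot per entity
    float3* velocity;
    int*    typeId;     // behavior selected by ID, not by subclass
    int     count;
};

__global__ void stepEntities(EntityBuffers e, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= e.count) return;
    // Every entity gets the same simple update; interaction rules
    // consult e.typeId[i] rather than a virtual method table.
    e.position[i].x += e.velocity[i].x * dt;
    e.position[i].y += e.velocity[i].y * dt;
    e.position[i].z += e.velocity[i].z * dt;
}
```

The point of the layout is that the kernel body is plain C over contiguous arrays, which is what the GPU handles best.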
So while I can't think of a specific story where we got a big gain from solving a problem in a month, a lot of the problems we've been approaching have been long-term stories, things that have developed over months. That's been my big takeaway, and it's where we are in the system right now: finding ways to really simplify the design of the system. Because as complex as biology is, and as complex as the problems we're solving are, when you look at them from a systems-design standpoint it really is deceptively simple. You look at everything as being an entity, and these entities are going to interact with each other. We can simplify the system, and a simpler system can be a more performant system.

Angelo: Yeah, exactly. Code is a liability, really, in a lot of ways. You want less of it because that's less you have to maintain and less you have to test, and it makes sense that less code is actually faster. So I really appreciated that point. It rang true for me because that's a finding we've also made many, many times, and I have some war stories there too. But there are so many things you said there that I want to hone in on and maybe reiterate a little. You mentioned building your own engine; you came across a problem. A lot of times in computer science we have this fear of building it ourselves. There's the whole "not invented here" movement and all of that. But the reality is that a lot of times things are complicated and we have a slant we're more interested in than maybe the technologies out there serve. I want to talk about the courage it takes to build your own system. People might think of it as this mountain: oh no, we can't do that, that'd be a huge investment, and all of that, but obviously it can be done; the games industry does it. What have you learned from that? You may inherently not have that fear, but did you ever feel it? How did you surmount it and make that hard decision of saying, okay, we're going to get rid of this engine and write our own?

John: I'd start by saying I think there's always a fear. I think some people are better than others at hiding it. For a quick analogy, I mentioned I was in recording engineering. I'm a musician as well. I've been playing on stage in front of people for years; I started performing in middle school. And I don't think there's a gig I've ever played where I didn't have fear or nervousness before going up there, even though I've played those songs a hundred times and done it before. So I think there always is a bit of fear, or something like it, in approaching a project like this. But I think that's also what kicks the adrenaline off and gets you excited and going. So yeah, there are definitely moments at 1:00 AM, in front of the computer, looking at the code in front of me and thinking, are we really going to do this? Is this going to work out? But so far it has, every time we've gotten to that point. Technology is a fast-paced world, biotech is a fast-paced world; any kind of science industry like this is fast paced.
So you have to find the moments to catch your breath and reflect, and we try to do that as a company, too. We try to take some time to look at the past year or the past project and do some post-mortem analysis. It's really useful, especially when you're feeling like there's a lot in front of you that's tough to overcome, to look back and say, wow, we actually accomplished a lot this year. You miss that when you're moving too fast.

Andy: And I think we've done a lot of looking at what's out there: if it exists, we want to use it. We don't want to build things that already exist. So when you come in with this idea at the beginning, can we simulate or emulate biology in a useful way, we looked at other things. There's a whole field of systems biology and tools out there around that, and we experimented with some of those before we ever built the simulator. And we really found them lacking as far as getting into the mechanistic nature of the system. Some of those, and even some of the machine learning approaches that can be applied to omics profiles, might tell you what happens, what the outcomes are, based on averages and whatnot, but they're not necessarily going to tell you how it happened. And if you're trying to find new druggable targets, it's really important to know that these different expression levels that are input to the system lead to these outcomes through these specific mechanisms, and we might intervene at that one. That was one of the reasons we went to this in the first place. Then we get to other areas where we say, okay, now we need to find compounds that are going to dock into a protein. Well, there are some really good tools that work well there, so rather than building our own, we've implemented other open-source tools that we can build on and improve, working within what's out there. So I think that's really important to think about. Like you say, don't build it if you don't have to, but once you do, dive in and make it happen.

John: Yeah, for that reason I think we haven't built anything we didn't have to. We're very conscious of trying to find what's out there before we do that. But in this space, with the simulation, while there are some things that are kind of like it, and we found some things showing entity interactions in a cell environment, there's nothing doing it like this, producing the kind of results we're producing. That's why we had to build it: there's a need for it.

Angelo: And that's the key. I appreciate the pragmatic approach of saying we're not building it just to build it. Because I think those of us who are hungry engineers would play with anything new, because that's what we love to do. But there's a lot to think about in terms of: are you trying to take your use case and shoehorn it into what's out there, or are you really trying to solve a particular problem? And then maybe there isn't a perfect tool that fits. You have to have the courage to be able to say, I don't see anything that fits; I need to build it. And one of the things we like to say is we're not putting a man on the moon, right? We're going to do research. We're going to read papers. We're going to read about things and techniques that are being done.
Innovation, though, is taking those ideas and putting them into something they haven't been applied to yet. And I will say this: if you're not afraid, you're probably not doing it right, because there needs to be some fear in what we do. Let me ask you a little bit about GPUs. How do you model this problem? When I was at Harvard in the data systems lab, we were exposed to GPUs too. We came at it from the sense of: can we get this computation a database is trying to run onto a GPU? There are things to think about. The cost of getting data onto a GPU is high compared to a CPU, and then you have to be able to package the problem. If it's too big, requires too much data, or whatever, it doesn't fit. How are you able to put this problem onto a GPU? Maybe it lends itself that way, but any thoughts?

John: One thing goes back to what I was saying before about developing a simpler system, looking at the complexities we had designed into the first iteration and taking steps to simplify it. That's one of the first steps, I think, to making this fit on the GPU: like you say, trying to represent all of the information in the system in the smallest possible way. We spent a lot of time planning how to represent these entities and data. This is one of those transformational layers: we have all these verbose representations of data in our database and in our interfaces, what we're getting from the biologists and showing to them. But when we kick off a simulation, we pull that data from the database for that model and basically turn everything into IDs. In that system, it's as stripped down as it possibly can be, using references to things as much as possible to keep the data as small as possible. At a high level, it's squeezing the data into the smallest possible representation, offloading it onto the device to do the work, and keeping it on the device for as long as possible, running those cycles of simulation and only pulling the data off when we absolutely have to.

And there's a struggle with that, because as Andy mentioned, we want to offer the user the ability to view the state of their simulation as it runs. To do that, we have to pull it off the device and maybe push a database call so the current snapshot of the simulation can get updated. So we've had to do some tweaking around how often we pull it off the device to update the state and the database for logging purposes. When these simulations run, we want a time-series view of what's happening for every entity, every organelle, every possible thing in our system, to track as much as we can. That's really a big pain point as far as performance goes: pulling that data on and off. It's been a lot of trial and error going from, do we log every single interaction at every single time point, or do we let it run for 10 or 50 or a hundred iterations before we take a snapshot? So it's a balance: the more resolution you get, the less performant it's going to be.
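As a rough illustration of that tradeoff, here is a minimal host-side sketch: entity state stays resident on the GPU across many simulation steps, and a snapshot is copied back only every N iterations for logging. The kernel body, names, and interval are placeholders rather than their actual engine; the one real constraint it reflects is that device-to-host copies are expensive, so you snapshot sparsely.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder kernel: advances every entity by one time step.
__global__ void stepEntities(float3* pos, int count, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) pos[i].x += dt;   // stand-in for the real update
}

void runSimulation(float3* d_pos, int count, int totalSteps) {
    const int SNAPSHOT_INTERVAL = 50;  // tune: resolution vs. speed
    std::vector<float3> h_snapshot(count);

    int threads = 256;
    int blocks  = (count + threads - 1) / threads;

    for (int step = 0; step < totalSteps; ++step) {
        // Data stays on the device between steps; no copies here.
        stepEntities<<<blocks, threads>>>(d_pos, count, 0.001f);

        // Only every Nth step pays the device-to-host transfer cost.
        if (step % SNAPSHOT_INTERVAL == 0) {
            cudaMemcpy(h_snapshot.data(), d_pos, count * sizeof(float3),
                       cudaMemcpyDeviceToHost);
            // ...append h_snapshot to the time-series log / database...
        }
    }
    cudaDeviceSynchronize();
}
```

Raising SNAPSHOT_INTERVAL makes the run faster but the playback coarser, which is exactly the balance John describes.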
Angelo: Yeah, I really appreciate your thought on that. Moving data onto these GPUs is expensive. In fact, moving data in general is expensive, and understanding that concept helps us build better data systems. At least for us, that's when we decide to build our own system: when we're trying to minimize the movement of data and the current process doesn't do it. You get orders of magnitude. I remember a famous quote to the effect that to get double-digit improvements, something quantitative in nature, you can micro-optimize your way there; if you want orders of magnitude, it has to be qualitative. You have to rethink the problem, and GPUs are like that. GPUs require you to come back and reframe the problem to fit that solution, a little like the way a lot of things work. Quantum, for example: you have to rethink the problem and reframe it to fit. Now, when I first thought about simulation, I thought of it as a competing technique to ML: there's an ML approach and there's a simulation approach. But I don't think that's the case; they probably complement each other. And it sounds to me that when you build this data set to extract these expert rules from it, that's exactly what you do with ML too. The two major schools of thought in ML are either to build some kind of cohort where you're clumping things together, or to build something that gives you predictive power over features you don't know. How do you see ML, and how have you used what you have to augment your own ML?

Andy: Yeah, there are a few places within the simulation approach. Of course there are natural language processing approaches to try to extract rules out of publications, and some applications there. Ultimately we're trying to find things that correlate and interact together. So there are approaches where you can extract interactions from co-expression data, and that can be done with machine learning: if we take transcript profiles of patients and see that two different proteins change expression together, then there's some interaction between them. You may not know exactly how or what, but you can extract interaction rules from that; there are machine learning algorithms that do that. And then, taking the output of our simulation, there are some things we've talked about as a possibility. Now that we're generating hundreds of thousands of simulations of a particular set, we have pretty large data sets of the way these things change together, and we may be able to apply machine learning to the logs of our simulations to start to extract rules and insights that you might not pick up if you just look at them manually or one at a time.
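The co-expression idea Andy mentions can be sketched very simply: score every pair of proteins by how strongly their expression levels move together across samples, and flag high-scoring pairs as candidate interactions. Below is a minimal Pearson-correlation version; the data layout and threshold are invented for illustration, and real rule-extraction methods are considerably more sophisticated.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// expr[p][s] = expression of protein p in sample s (made-up layout).
double pearson(const std::vector<double>& a, const std::vector<double>& b) {
    size_t n = a.size();
    double ma = 0, mb = 0;
    for (size_t i = 0; i < n; ++i) { ma += a[i]; mb += b[i]; }
    ma /= n; mb /= n;
    double cov = 0, va = 0, vb = 0;
    for (size_t i = 0; i < n; ++i) {
        cov += (a[i] - ma) * (b[i] - mb);
        va  += (a[i] - ma) * (a[i] - ma);
        vb  += (b[i] - mb) * (b[i] - mb);
    }
    return cov / std::sqrt(va * vb);
}

void candidateInteractions(const std::vector<std::vector<double>>& expr,
                           double threshold /* e.g. 0.8 */) {
    for (size_t i = 0; i < expr.size(); ++i)
        for (size_t j = i + 1; j < expr.size(); ++j)
            if (std::fabs(pearson(expr[i], expr[j])) >= threshold)
                // Candidate rule: proteins i and j interact somehow.
                // As Andy notes, the mechanism still has to come from
                // elsewhere; correlation only flags the pair.
                std::printf("candidate interaction: %zu <-> %zu\n", i, j);
}
```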
Angelo: Makes sense.

John: Yeah, one thing I've thought about in comparing the two is that machine learning, in a lot of applications, is about making a prediction, right? It's about trying to train some model to predict some outcome, some classification or some regression. And that's not exactly what we're doing with the simulation. We're not trying to make some categorical prediction, but rather to learn from a system. Now, there are quantitative data points that come out of this that you can analyze and graph and run stats on; you can do all the rigorous science to get an answer. But one thing I think is a real benefit of simulation that you don't get from ML or other AI approaches is the ability to actually observe the system. One thing we're working on, alongside the system itself, is another phase of iteration: visualizing the system. I mentioned earlier deciding the resolution at which we want to store these snapshots of the simulation. One reason to think about that resolution is not just for looking at the final result for analysis, but to be able to play the system back as well. So we store the actual positional data in time-series form, and we have some prototypes of a new system for visualizing these simulations. I think there's a benefit to that not only for research and trying to get answers out of the system, but there's an educational aspect too. One thing we haven't touched on much is using the simulation for education, and that's not just academic education but research education, learning more about how these systems work. Being able to actually see the simulation work, by drawing those entities and those containers, those organelles, in a 3D space and playing it over time, gives our human intuition a way to see these systems work. Sometimes you don't need to look at the data. For example, if we're looking at something like endocytosis or exocytosis, you don't need to look at the data if you have a 3D temporal rendering of endocytosis or exocytosis happening. You can just look at it and say, okay, when the model is set to these parameters, you get more vesicles forming in endocytosis, or more entities being exocytosed. A picture is worth a thousand words, and a video playback is worth that much more, especially when you can control it, because the simulation has already been run and you're just working with the playback data from those logs. So that is, I think, a huge thing our simulation offers that you just don't get out of any kind of ML approach.
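A minimal sketch of what such a playback layer might look like: snapshots of positional state stored as a time series and replayed without re-running the simulation. The types and field names are invented; the sketch only illustrates that once a run is logged, scrubbing back and forth is just iterating stored frames.

```cpp
#include <vector>

// One logged frame of the simulation (hypothetical layout).
struct Snapshot {
    double time;                 // simulation time of this frame
    std::vector<int>   typeId;   // what each entity is
    std::vector<float> x, y, z;  // where each entity is
};

// Playback iterates stored frames; the renderer draws each one.
// Because this is log data, pausing, scrubbing, or reversing is
// just choosing which frame to hand to the renderer next.
template <typename DrawFrameFn>
void playback(const std::vector<Snapshot>& log, DrawFrameFn drawFrame) {
    for (const Snapshot& frame : log) {
        drawFrame(frame);        // e.g. hand to a 3D renderer
    }
}
```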
Angelo: That's fascinating. That's a great point. One of the biggest criticisms of ML is that we can't look inside the black box; we can't explain it. So the explainability of a model is really important. Here, you can explain it, because you can see why it came to a certain conclusion: you can play it back. That's a really good point, and I think it's worth bringing out. The other thing I find interesting here is that a simulation is a data point too, one that can feed other models later, right? Maybe models can predict how a simulation will go based on millions of simulations you've already done. Eventually this data will become a self-feeding machine that creates new ways to think about the problem that we haven't tried yet. Because not everything is about feature extraction from existing data; there are interesting new approaches you can discover just because simulation enables them. So that's really neat. Go ahead, you were going to say something, Andy.

Andy: Oh, I was just going to say, we've been talking about simulation a lot, and it brings to mind an interesting debate I had with somebody a while back who claimed that this system is not a simulation at all; it's actually an emulation. And there's an important distinction there: because we have this mechanistic detail and we're moving things around, we're really emulating a cell, with all of this movement that's similar to what you have in the real-life system, whereas a simulation is more of a predictive-type approximation. I think that distinction is an interesting one to think through a little more. This mechanistic description and understanding of what's happening inside comes about because we're really trying to emulate the cell in silico, and it's just a different approach.

Angelo: Think about how you've been saying the rules continue to grow: you're improving the rules, you're adding more rules. I would think of it as a graph of simulation versus emulation. At some point they're going to cross, where you have enough rules encoded in the system that it does a really good job of emulating what the cell does. Now, nobody knows all of the things a cell does; it's way too complicated. But for the purposes of what you're simulating, you may have hit the major points. And that's a really interesting point, which brings me to the question I wanted to ask: what impact does all of this have, practically speaking, on improving lives and finding drugs?

Andy: Yeah. The core goal we set out with, which has really come to fruition, is understanding the core mechanisms that are wrong within a disease, and then using that to initiate new drug discovery programs. For instance, our work around Parkinson's disease led to validation that sporadic Parkinson's disease patients have deficits in a process called mitophagy. We simulated this and showed it three or four years ago now, which led us to say, okay, let's see if we can make drugs that increase mitophagy, which could be therapeutic. We now have therapeutics that are maybe a year, a year and a half away from the clinic that do just that. Interestingly, in the last year or two we've seen quite a few other groups now saying, hey, mitophagy is really important for Parkinson's, and starting down that path. This tool gave us a head start; we had the confidence to embark on that earlier than the rest of the field because of the simulation. So I think that's one of the meaningful things. And we're continuing to get new insights beyond that program. We have a project just wrapping up, a collaboration with researchers at the University of Oxford, where we're taking individual patient-derived cells, running simulations of those, and then seeing what's different. One of the big challenges in the laboratory is that you can only measure one or two or a handful of proteins at any given time. You've got to tag them with antibodies and measure them, so you can't measure everything. Or, with proteomics or mass spectrometry approaches, you can measure quantities of the proteins, but you can't get into the post-translational modification states of those things, and you can't fractionate them to the level of telling how many of a protein are on the plasma membrane or in the Golgi or in the lysosomes.
Whereas we can simulate that level of resolution for all of the entities across a continuous time series. So, using the patient-derived cells as an input and running simulations, we were able to make predictions of what points should be measured, the ones likely to be different downstream of a mutation. If you take cells from patients with the mutation and compare them against healthy humans, we saw differences there, which they then were able to validate. So we now have some new insights into some of the specific genetic forms of Parkinson's that point us toward biomarkers of level of progression, and also toward some new therapeutic approaches we can pursue. We're also looking at what's different between these patient-derived cells and the brain cells in those simulations. We now know there are meaningful differences in the profile of those patient-derived cells that make them not a great platform for testing therapeutics; some of the key machinery is at a very different level in the patient-derived cells than in the brain, so they behave very differently. You have to think about those things. You can extrapolate if you know that, but a lot of people look at those cells and say, well, let's just screen a bunch of drugs and see what happens, when we can now say: if you see this in the cell, you can expect to see this in the brain, because we've simulated both of them. That's the sort of power you get from this platform; there just isn't any other way to do it.

Angelo: Yeah. Simulation is interesting because you'd think it's something that has been done for decades. It isn't. The way we normally derive drugs is the hard way, in a lab. Like you said, tagging some things so you can follow a few things through the process. You can try to determine quantities, but you can't simulate everything and understand everything. It would take many, many iterations of that, which are expensive and long, so it takes a long time to bring drugs to market. What's interesting, I think, is this: we computer scientists like to gravitate to things that change the world, but sometimes we think of that only in terms of big discoveries. Research isn't like that. We don't always know whether we'll find anything, or what we might find. But my thesis director told me years ago that even finding that something doesn't work contributes to the general knowledge of the industry. You've changed the world, because now people will build upon your paper that definitively proved something doesn't have an effect, and that furthers research. So what's nice about things like this is that, really, no matter the outcome: finding a compound that can help many people changes the world. Finding a compound that helps one person changes the world. Finding new insights into compounds changes the world, and even finding that things don't work changes the world. No matter what, you have a machine that's able to produce new knowledge really, really quickly, which I think is fascinating. The things you're doing are really fascinating, and I want to thank you both, John and Andy, for taking the time to be with us today and chat about what you do. Thank you so much.

Andy: Thanks for having us.

John: Yeah, thank you.

Angelo: I'm your host, Angelo Kastroulis, and this has been Counting Sand.
Thank you for joining us today. Before you go, please take a moment to like, review, and subscribe to the podcast. Also, you can follow us on social media. You can find me on LinkedIn @angelok1, feel free to reach out to me there. You can also follow me on Twitter, my handle is @angelokastr. Or you can follow my company Ballista Group. You can also find the show on countingsandshow.com or your favorite podcast platform.