Counting Sand

Cutting-Edge Data Systems: Machine Learning

Episode Summary

On the cutting edge of data systems lies the exciting work being done at the Harvard Data Systems Lab. This critical research is funded by the National Science Foundation, Amazon, Meta, Tableau, Google, Microsoft, and others. Angelo is thrilled to be joined by the head of the lab, Stratos Idreos. During his time at Harvard, Stratos was Angelo’s thesis director and mentor.

Episode Notes

Over the last couple of years, the Harvard Data Systems Lab has focused on cutting-edge research and applications of complex data systems, in areas such as artificial intelligence and machine learning pipelines. In this episode of Counting Sand, Angelo and Stratos dive deep into what they have learned and what’s next in these fields.

 

In a time crunch? Check out the time stamps below:

[01:00] - What’s new at Harvard Data Systems Lab?

[08:20] - What are examples of general data structure applications?

[14:13] - How do we decrease the time spent from research to application?

[20:23] - What are the benefits of machine learning?

[22:15] - What are some helpful tips when writing a thesis?

[25:00] - How important is the creative process when writing a research paper?

 

Helpful links:

Harvard Data Systems Lab: http://daslab.seas.harvard.edu/

Harvard Data Systems Lab Twitter: https://twitter.com/HarvardDASlab

 

Our Team:

Host: Angelo Kastroulis

Executive Producer: Náture Kastroulis

Producer: Albert Perrotta

Communications Strategist: Albert Perrotta

Audio Engineer: Ryan Thompson

Music: All Things Grow by Oliver Worth

Episode Transcription

Angelo: On the cutting edge of data systems lies the work being done at the Harvard Data Systems Lab. The work is funded by the likes of the National Science Foundation, Amazon, Facebook, Tableau, Google, Microsoft, and so on. I love the work being done there. I'm also excited to be joined by the head of the lab, my thesis director, my mentor, and my friend, Stratos Idreos. I’m your host, Angelo Kastroulis, and this is Counting Sand. We haven't spoken in many years. I think since I left the lab, and you had just had a child, and now our children are growing up. How long has it been?

Stratos: Two kids by now. So life is crazy. Everything is unsettled, but in a good way.

Angelo: Yes. Well, I mean, those are the most fun parts of our lives, for sure. And they grow up fast. And you're not only a professor at the university; you do research, you write many papers, and you're an entrepreneur. Anything interesting going on at the lab?

Stratos: Well, the lab has been doing, as you know, lots of research in the overall systems area. Over the last couple of years, in addition to our research on data systems, we are doing new things in the area of artificial intelligence, in particular machine learning pipelines. But also, lately we've been expanding into blockchain, this new emerging space that, like some of the other things happening in the field in general, is partly based on old technology but also partly on lots of exciting new concepts that are emerging. And so we are looking, from a systems perspective, at what we can bring to this space, especially for problems that have to do with how fast you can do transactions, in the case of blockchain, for example, and what the security aspects involved are, and how you can actually have both speed and security.

Angelo: Hmm. Okay. That's interesting. I had done a little bit of work in healthcare, and blockchain is one of those things that's really interesting if you're in a space where nobody trusts each other. The problem is the amount of data we could put in there. I know MIT did something with John Halamka, I can't remember how many years ago, but the project never really happened. It didn't materialize. They tried to do something, but it was just too big. They couldn't figure out where to put the patient records. You can't stuff them into the blockchain. So that's a unique problem. There's definitely lots of room for research in these areas. What are some of your favorite things that you're researching over there? Anything you can talk about?

Stratos: My research is really only one thing. There's one concept that I'm just fundamentally very excited about, and all the research that we do is, in a way, trying to attack this one problem from that perspective and to see if it's possible to actually reason that way. So let me explain what this means, because I know that what I said is very vague. If you look at lots of different scientific fields and how they evolved, over decades for computer science, or hundreds of years for other scientific fields, there's one very fundamental concept that is similar, and to me, very exciting. And that is the concept of reasoning from first principles, which actually dates all the way back to Ancient Greece and Egypt, and then lots of wonderful work, mostly in Germany, in the 1600s and 1700s.
So if you go back to the most simple philosophical examples, it's the idea that if you know the fundamental, basic principles of the world, then everything around us can be created, can be synthesized, from these first principles. You know, in Ancient Greece they would say it's water and fire and air, the things that Plato later called the elements. And if somehow you know these first principles, and you know how they combine, what you can synthesize with those principles, what the rules of synthesis are, and if you can understand what the impact is, whatever impact may mean depending on what you're trying to synthesize, well, if all of that is true, now you can design algorithms in computer science that actually traverse this massive space of possible syntheses you can create, and then reason about that. Now, very specifically, what could this mean? Let's go to a specific problem. Let's talk about data systems, which are one of the most fundamental things that we have in our life today, because everything we do goes through data systems. Now, one very, very interesting theme in our work, and observation, is that there is simply no perfect data system. Why this is important for society in general is because our life is so dependent on data by now. There are so many wonderful things that we can do with data. But then, in order to be able to do all those things with data, we need to be able to move data, store data, process data. And because all of those things are very, very expensive, data systems are very critical, because the data system is the thing that actually does all those steps. And the thing is that there are so many different ways that we can design data systems, and there are so many different contexts where data systems have to run, the hardware, the different data, the different cloud budgets that people can afford. If we were able to create custom data systems for every single problem that we have, then we could always have the perfect solution that is actually possible. It makes it fast. It makes it viable in terms of cloud costs. It makes it viable in terms of response time, and so on and so forth. Right now it's not possible to do that, because creating a data system is equivalent to saying, I'm going to create this massive project. I'm going to build this amazing bridge that connects, let's say, Boston to Cambridge here. And of course, this is not going to happen tomorrow. It's going to take years. And data systems are very much like that. Today, it takes seven or eight years to build a custom data system for a particular problem, for particular hardware, and so on and so forth. So while this may be possible in theory, it's actually not practical. People cannot afford to build data systems. Only the biggest companies can afford to build data systems, and even then they're taking a massive risk, because you don't know if it's going to work out, and it's such a long-term investment. Now, imagine if you had a process that understands what the basic building blocks, the first principles, of system design are, and could automatically traverse the massive space of possible data systems and say, hey, this is your data, this is your hardware, this is your cloud budget, here's the perfect data system just for you. And by the way, here's the code, the actual code that gives you that data system.
Now, everybody could have the perfect data system without having to wait eight years at a time, or having to spend millions of dollars to implement and build a data system. Which opens the door to all sorts of exciting possibilities that go back to making things possible in the first place, having equity in who has access to technology, and all sorts of wonderful things. So this is the kind of effect that this kind of research can potentially have: being able to create custom solutions, and not only custom solutions, but also solutions that have not been seen before. Because by being able to synthesize from first principles, you can even create designs that you haven't seen before, that you don't know existed, even though, you know, in theory they're possible. A human being hasn't seen those designs, so we just don't know them. We expect researchers would potentially find them in the future. So for example, in our research we've shown that there are more than 10 to the 100 possible data structures. One of the most fundamental things in computer science is how you store data: data structures. And we could actually calculate that number because we created the design space from these first principles and could see how many syntheses are actually possible. At the same time, we've calculated that since the beginning of computer science, which we can probably date back to the 1950s if we have to put some kind of threshold, the whole research community has published about 5,000 data structures. And to put that in perspective, 5,000 is really nothing compared to 10 to the 100. 10 to the 100 is just a massive number. The estimated number of stars in the universe is 10 to the 24. So that goes to show you that even though lots of these new data structures may well be useless, it's like asking whether there's life in the universe other than on Earth. I mean, maybe, if there's such a huge number of stars. Same thing here: there must be useful data structures that we are not seeing just because of the size of the space.
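[Editor's note: to make the idea of synthesizing designs from first principles a bit more concrete, here is a minimal, purely illustrative Python sketch. It is not the lab's actual system; the design primitives, the workload description, and the cost numbers are all made up for illustration. The point is only the shape of the approach: enumerate combinations of low-level design choices and let a cost model pick the best candidate for a given workload.]

    from itertools import product

    # Toy, hypothetical design space: every name and number here is made up
    # purely to illustrate "synthesize a design from primitives and pick the
    # cheapest one for a given workload".
    NODE_LAYOUTS = ["sorted_array", "hash_buckets", "unsorted_log"]
    PARTITIONING = ["none", "range", "lsm_levels"]
    COMPRESSION  = ["none", "prefix", "dictionary"]

    def toy_cost(layout, partitioning, compression, workload):
        """Completely made-up cost model; lower is better."""
        read_cost  = {"sorted_array": 1.0, "hash_buckets": 0.5, "unsorted_log": 5.0}[layout]
        write_cost = {"none": 1.0, "range": 1.5, "lsm_levels": 0.4}[partitioning]
        space_cost = {"none": 1.0, "prefix": 0.8, "dictionary": 0.6}[compression]
        return (workload["reads"] * read_cost
                + workload["writes"] * write_cost
                + workload["data_gb"] * space_cost)

    def best_design(workload):
        """Enumerate every combination of primitives and keep the cheapest."""
        return min(product(NODE_LAYOUTS, PARTITIONING, COMPRESSION),
                   key=lambda design: toy_cost(*design, workload))

    # Read-heavy workload: the search favors a point-lookup-friendly layout.
    print(best_design({"reads": 1_000_000, "writes": 10_000, "data_gb": 50}))

[In the real problem the space has on the order of 10 to the 100 candidates, so exhaustive enumeration like this is exactly what does not scale; that is where smarter search and learned models come in.]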
Angelo: That makes a lot of sense. I was working on a project, you know, and since I'd worked in the lab, it really fundamentally changed the way I approach things, because when you understand what's going on under the covers you start thinking differently. There was a project I was working on and I proposed we create a data system, and they said, are you crazy? It was a healthcare-based project. Healthcare has a specific data structure. It has specific query patterns. And the problem was that a generalizable data system, you know, in order to generalize and make it available for any unknown workload, has to trade things off. But if you don't have that constraint and you say, I know what the workload's always going to look like, something like this, you can optimize for that. And it can be many, many times faster, and it wasn't as hard to build, because you're not building it to be used by other people. But still, you know, it wasn't like we were inventing something brand new. We were taking LSM trees that already existed, and we said, okay, we'll implement that, or we'll find a library that does it. There are query parsers out there, so we can parse queries, and we don't have to write our own optimizer, let's just take Volcano. And then you take some different things that are already done and you assemble them, but you do it in a unique, different way, and it's really, really fast. So what you're saying makes a lot of sense. As I researched, one of the things I liked in the lab was that I always thought ML is one of those ways, because the search space is so complex. 10 to the 100, I never thought about that, but I thought ML is good at figuring out a pattern anyway, and then maybe you can find some useful pattern. But this isn't ML; you're not necessarily talking about using machine learning to find the pattern, you're talking about generating new kinds of data systems from it.

Stratos: Yeah. Yeah. ML is an absolutely amazing tool to use on many occasions, but not when you already know what to do. Many times in our research we use ML so that it can point our intuition in a direction that makes more sense than what we're looking at. And oftentimes we find that, oh, because an ML algorithm is converging in this direction, we might actually be able to understand the problem better, and then we'll move on, and many times we are able to create an actual mathematical solution out of this direction. And other times we are unable to do that, and this is when we rely on ML. So we use ML very heavily in our lab, both to lead our research in directions where maybe we're not seeing the patterns, and as inspiration. And then, if we can find analytical solutions, we go with those, because that is always the ideal case. And if we can't, we rely on ML. You know, we have kids; it's very similar to what happens in real life, right? You see a kid, the way they grow up, they try out things, but once they know something, they don't repeat the error. Right? So when you see a one-year-old trying to stand up under the table, they hit their head three times. The fourth time, they know they have to move away. They won't stand up below the table. It just doesn't make sense anymore. So we try to use ML in the same way, where there's exploration, there's learning, but then once we know, we don't have to keep learning for a particular context.

Angelo: Yeah, that makes a lot of sense. So, what's interesting about this is it isn't just, I remember we reviewed the H2O paper years and years ago, where it was just kind of changing the storage style, and then there was research. I mean, mine was about scans versus using an index. You know, it's just making a choice between two different ways to represent the data. And those are just two ways; B-trees and scans are just two ways. There are hundreds of ways to represent and access data, and all of that, especially if you know what the data looks like and what the data is going to be used for, which you don't normally have a priori knowledge of when you're building data systems. But so that's interesting. So is that what you mean by the building blocks? It's putting those kinds of ideas together, and it can put them in any combination that might make sense for your particular need?

Stratos: The concept of bringing together blocks of system functionality, core system functionality, and then deciding to use it or not to use it, is still a very hard problem. What we are talking about is even lower than that. So let me give you a specific example, which would be a bit more technical, for people that might have a background, let's say, in data structures. One choice could be: does my system contain a B-tree or a hash table? And you could say the B-tree and hash table are components of my system or not, and I make a choice on the fly for that. And that's great for adaptability. What we're describing is lower than that.
Imagine that the first principles we're describing are: how do you actually create the node of a tree? How do you decide what kind of compression you're using inside the node of a tree? How do you decide how many levels the tree has with respect to the data, or with respect to the hardware? So it's a series of design decisions that make this not just a tree-or-node discussion; you could actually have trillions of trees that the system designs on the fly. Or it can actually create a completely new data structure that, you know, starts with a tree-like shape, then looks like an LSM tree, then looks like a hash table. And it does all of that because it adjusts completely to the data and the hardware at hand. So it doesn't synthesize a system out of big components that humans have created. It synthesizes the system completely from first principles. Humans have decided that these are the first principles, but they haven't synthesized anything, they haven't designed anything.

Angelo: Okay. So that's an enormously complicated problem, because we're not even, yeah, we're talking about kind of putting the blocks together, and some of the blocks might be put together in ways that are not advantageous at all. So we'd have to, I mean, not everything is going to be valuable, so you have to... That's interesting. It's an interesting problem. So normally, you know, what you were talking about, how long it takes, and this is one of the things that I've always been interested in: how do we decrease the amount of time, compress the time that it takes to go from research to application? It takes decades. How do we compress that? Because there's always something written that's very applicable and interesting. I will say, being on the application side for most of my career, there is fear. People are absolutely afraid of reading a paper and saying, let's take it and implement it. It's a lot of work, you know, I don't understand the space. So it takes a really long time to get some of this, so I could definitely see a system that could be a generator of other systems producing something like that. It's really interesting. You mentioned intuition, you know, where ML doesn't work. Expert systems are a great example, but in a system where you have knowledge... Like, I remember you wrote this paper too, Cracking. This was a long time ago. I don't remember how long, but it was before Harvard, so I don't know. But the thing I still remember about that paper is that there was an intuition in there that said, and you can correct me, obviously you wrote it, but something like: database systems that have to maintain structures when you write are always going to be hamstrung by that obligation, and it's going to affect their ability to write. What if we didn't have to do that, and we could do it later, while we're doing something else like querying, and we could do only the parts we care about? And for the parts of the data that we're never going to see, why waste computation? Something like that. That makes a lot of sense to me, that we can rethink problems. And you don't need ML to tell you that. The hard part is, well, how do I solve it then?

Stratos: Yeah, there's a limit. So, by the way, you translated very nicely the theme from that series of papers. One of the models was that the system itself should treat any kind of request from the outside world, which is, you know, store this data, process this query, as advice on how the data should be stored in the system.
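[Editor's note: as a rough illustration of the cracking idea, treating each query as advice on how the data should be stored, here is a toy Python sketch. It is a simplification for illustration only, not the algorithm from the original paper: each range query partitions only the piece of the column it touches, so the column becomes more organized as a side effect of querying.]

    import bisect

    class CrackedColumn:
        """Toy sketch of cracking: each range query partitions only the piece
        of the column it touches, so the data gets more organized as a side
        effect of querying instead of being fully indexed up front."""

        def __init__(self, values):
            self.values = list(values)
            self.cracks = []  # sorted (bound, position): entries < bound live before position

        def _crack(self, bound):
            """Partition the uncracked piece containing `bound` (one lazy
            quicksort-style partition step) and remember the boundary."""
            i = bisect.bisect_left(self.cracks, (bound,))
            if i < len(self.cracks) and self.cracks[i][0] == bound:
                return self.cracks[i][1]
            lo = self.cracks[i - 1][1] if i > 0 else 0
            hi = self.cracks[i][1] if i < len(self.cracks) else len(self.values)
            piece = self.values[lo:hi]
            left = [v for v in piece if v < bound]
            right = [v for v in piece if v >= bound]
            self.values[lo:hi] = left + right
            self.cracks.insert(i, (bound, lo + len(left)))
            return lo + len(left)

        def range_query(self, low, high):
            """Return values in [low, high); the query itself acts as advice."""
            start = self._crack(low)
            end = self._crack(high)
            return self.values[start:end]

    # Example: the column is reorganized a little more with every query.
    col = CrackedColumn([7, 3, 9, 1, 5, 8, 2])
    print(sorted(col.range_query(3, 8)))   # [3, 5, 7]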
So it should have no fixed way to store the data. Everything should look like advice to the system: oh, because I got this data, because I got these queries, here's how I should store my data. And that should be an always-on discussion. I think sometimes people confuse ML as the ultimate tool that will make all decisions, and this is definitely wrong. But I think these are kind of the early days of ML being more broadly used, and so it takes time for most people, for all of us, to process what exactly we can do with it. But there are definitely many areas where we found ML to give us amazing results. And especially, one of the things that ML does very well is that when you have patterns that humans cannot see, or it's hard for humans to see those patterns, for ML that part is actually easy. It maybe takes processing time or whatever, but ML can spot patterns. And then, if you have enough data, you can trust the patterns most of the time. If you don't have enough data, you have a problem. So if you use it right, and what using it right can mean in this case is: oh, I see this pattern, let me follow it and see if there's an algorithmic reason behind this pattern. And if there is, now I can use that to create something else that is more like a fixed rule, something that has a mathematical representation and, in the code, an actual fixed code path that does this. Another case where this is useful is in our work, where we do a lot of modeling, creating mathematical models that estimate what the performance impact of specific design decisions would be. This is very important in our world, because we want to make choices, and you want to make informed choices. So this is another huge topic that people have been studying since the early days of computer science: how do you create those mathematical models, how much can you trust them, how do you verify them, and all that stuff? It's a very challenging and very impactful problem. By now, I think I can fairly confidently say that we can actually model everything. And by we, I mean computer scientists. The hardest problems can be modeled. The problem is that there's a degree of complexity in how long it takes to create these models and maintain them. And we always have to maintain them over time, because the context changes, and a big part of the context is hardware, which is changing. When we're modeling an algorithm, a part of a system that operates on data coming from disk or from a faraway machine on the network, what happens there is that the most expensive operation is the data movement. And so we can create models that, while still quite complex, really model only one thing: data movement. When the data comes into memory, and your model is trying to represent the cost of an algorithm that operates in memory, a very advanced computer scientist can do that if you give them some time, but it is another level of complexity as a problem. As an example, in some of the work that we've done in our lab, which your thesis was actually a spinoff of, where we modeled in-memory indexes and scans, it actually took us, I would say, a full year to go through the process of very carefully modeling everything, down to the last detail of the hardware and all of that. And we have only done that at this level for, you know, effectively two data structures. And that took a ton of expertise and trial and error and all of that.
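[Editor's note: for the disk and network case Stratos describes, where data movement dominates, the classic back-of-the-envelope model really does reduce to counting how much data moves. The hypothetical sketch below (page size, row size, fanout, and selectivity are all assumptions for illustration) compares a full scan against an unclustered index lookup purely in pages transferred.]

    import math

    PAGE_SIZE = 4096    # bytes per page (assumed)
    ROW_SIZE = 100      # bytes per row (assumed)
    ROWS = 10_000_000
    ROWS_PER_PAGE = PAGE_SIZE // ROW_SIZE

    def scan_cost_pages(rows=ROWS):
        """Full table scan: every page is read exactly once."""
        return math.ceil(rows / ROWS_PER_PAGE)

    def index_lookup_cost_pages(rows=ROWS, matching_rows=1, fanout=200):
        """Traverse a B-tree-like index (a few pages), then, in the worst case
        of an unclustered index, fetch one page per matching row."""
        height = math.ceil(math.log(rows, fanout))
        return height + matching_rows

    # With data movement as the only cost, the scan/index crossover is easy to estimate.
    print("scan:", scan_cost_pages(), "pages")
    print("index, 1,000 matches:", index_lookup_cost_pages(matching_rows=1_000), "pages")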
And with ML, in memory, by running a few basic simulations and machine learning models that represent the behavior of these algorithms, you can do this, as you know, because you've done it, way, way, way faster. So this is a situation where we can do it analytically, but the amount of time and effort needed makes it impractical. So if you go back to our research, where we want to be able to reason about and model 10 to the 100 possible data structures, if you spend one year for every data structure, or every couple of data structures, it just doesn't scale. That's why we use ML to model things in memory. And these are only some of the uses, of course. By now we know so many interesting use cases of machine learning, but also not-so-useful cases of machine learning. But fundamentally speaking, areas where we know we have complex problems, where humans could make errors, or where it would take a ton of time for humans to come up with a solution, a prediction, whatever it is, these are wonderful candidates for careful use of machine learning.

Angelo: Yeah. Well, you had mentioned, and I think that's one of the things that inspired me to work on that particular problem in my thesis, you mentioned that, you know, it took a year to come up with this one mathematical model to explain that particular thing, but by the time you did it, it was already outdated, because the underlying hardware had changed, the capabilities had changed. So it's almost an untenable process. That's what I liked about ML. And I didn't start with ML. I went a different direction first and it didn't work out. But what I liked about ML in that particular scenario was that the space was sufficiently complicated that I think there's a fundamental problem with just the idea of cost analysis. You have to bias your mathematical model by including certain variables. And then when you run them, say in an Amazon cloud, you have no idea what it's running on, or what's underneath it, or how many other factors there are that are going to contribute. And so you can't even include those in the model, whereas ML is about patterns, and it says, well, it'll all come out in the wash. Just run it, and then I'll observe, and I'll tell you. So I liked that. But the one thing you just said that I think is really important, that I hope everybody takes home and thinks about, is that it's okay if your ML models don't always have directly applicable value. You can use them just as a research springboard, to say, oh, it showed me that these patterns are interesting, or it showed me that these features are more important. I never would have thought that. It doesn't make sense. Now I can research that and figure out why, and then I can create a mathematical model from it.

Stratos: Yeah, absolutely. And I think people are slowly coming around, you know, actually being able to see this. I think this is a wonderful use of ML, and it should be in the arsenal of every researcher.

Angelo: Yeah, I agree. It's almost... I used to just think of it as a different branch of computer science, but I almost think it's a new world of computer science. And I do think the future for computer scientists is that we all have to understand it, to different degrees, I think. But there are so many models and so many approaches, and they're all just, you know, mathematical.
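[Editor's note: as a sketch of the "measure and learn a cost model instead of deriving it by hand" idea discussed above, the snippet below times two in-memory access patterns, fits a simple least-squares model to the measurements, and uses the learned models to pick an access path. Everything here, the operations, the linear model, the sizes, is a hypothetical stand-in; real learned cost models use far richer features and model classes.]

    import bisect
    import random
    import time

    def measure(op, sizes, reps=5):
        """Time `op` on sorted random arrays of the given sizes."""
        points = []
        for n in sizes:
            data = sorted(random.random() for _ in range(n))
            start = time.perf_counter()
            for _ in range(reps):
                op(data)
            points.append((n, (time.perf_counter() - start) / reps))
        return points

    def fit_linear(points):
        """Plain least-squares fit of cost ~ a * n + b, standing in for a learned model."""
        xs, ys = zip(*points)
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        a = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x in xs))
        return a, mean_y - a * mean_x

    def scan(data):
        return sum(1 for v in data if v < 0.1)     # full scan

    def probe(data):
        return bisect.bisect_left(data, 0.1)       # index-style lookup

    sizes = [10_000, 50_000, 100_000]
    scan_model = fit_linear(measure(scan, sizes))
    probe_model = fit_linear(measure(probe, sizes))

    def predicted(model, n):
        a, b = model
        return a * n + b

    # Use the learned models to pick an access path for a much larger column.
    n = 1_000_000
    print("scan" if predicted(scan_model, n) < predicted(probe_model, n) else "probe")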
One thing that you said to me years ago, I always think about it when I think of you: I was struggling a lot with just, you know, getting to the writing part of the thesis, and it was just, you have a lot to write, and it's a mountain of work that you're trying to do. And part of it is your research isn't done, because you can't wait until your research is done. I had already been working on it for a year. I knew it would take me some time, and I was a little stressed about it, and I was going to Greece, and I said, I don't know, you know, I'm going to be on this island, I don't know if that's going to work. And I remember you said something like, well, I can tell you that that formula certainly works. I've written many papers on islands, and they're some of my best work. Something like that. That really stuck with me, and so I was just wondering, are you writing anything now that's interesting? Or, I know you're also an entrepreneur, do you have anything cool and interesting going on?

Stratos: So first of all, I think, you know, writing, and anything creative, and writing is creative, because when we write up research, we kind of understand the research better, and we understand what it means, how it came about, how it can be communicated, how it connects with other areas, how we can explain it. So it's part of the process, and there's something magical about writing: you first have to write it for it to be complete. I'm not sure how to explain it. So writing is a very creative part of the work that we do. And in some sense, I think it happens, at least for me, but I think it's true for most people, that you are more creative in areas or places or contexts where you feel calm, you feel happy. So if that helps you with the work, just go ahead and do it. Because also, if you think about the world, what matters is progress, right? I mean, not above everything else, but at the end of the day, either you're making progress or you're not. So if you can work better in some context, go and be in that context. For me, it has been a recipe to be by the sea every summer. And that helps me make progress. I'm spending quite some time these days on a startup; the startup is still in stealth mode, but lots of exciting things are happening. We raised money maybe six months ago or so. And, you know, all the discussion we just had about data systems and machine learning and how you can synthesize the most tailored system for a particular context, these are all concepts that are coming together in the startup. And hopefully soon we'll be able to tell you more about how you can have your perfect data system for all the artificial intelligence that is happening out there.

Angelo: Nice. Nice. Well, I can't wait to hear about it. I've got a little startup too that I'm working on, nothing like a data system, but it is a computational engine for healthcare. You know, things we learned from doing it once and doing it fast. So I'd love to share stories on that one day and see how interesting that'd be. You're right about, you know, the other side of it, thinking there is a creative aspect to writing, and I miss it. Writing a paper, you know, you do take a beating with the papers too. I mean, you know, it's not just that you write it and it's wonderfully accepted. You have a lot of work to make sure that it's going to be a good paper. But there definitely is a creative aspect to it. It reminds me, I was talking with one of my programmers, and he's a composer, and we were talking, actually just today.
And he said that he feels this pull to finish a composition that he started, you know, years ago. And it feels like he doesn't know where it's going to go or what it is. It's like, you just sit down and write it, like you can a paper, but there's a pull: I have to, from deep inside me, finish it. I feel like I cannot leave until I'm done with it. Cannot leave the world. And I feel like in some ways, writing academically is like that too. You just say, I can't finish what I'm doing, I won't rest until this is done. It is kind of like trying to summarize all of your life's work, you know? And then there's the next one that starts after that, because now you have new ideas and things that you weren't able to address.

Stratos: Yeah, yeah. I think it's the same thing. Whatever people are doing that they're really passionate about, you can't stop thinking about it. You have to keep going, you want this to be perfect. You never believe it's perfect. You always want to make it a bit more elegant and a bit nicer. And then when it's done, you're going to do the next thing, because you're looking for the next challenge. And I think at that level, whether you talk with, you know, a researcher in computer science or whatever other science, or an artist, it's the same feelings, and it's actually those wonderful feelings that keep you alive.

Angelo: It is. You know, I think about, further along in my career, I see myself going back, getting the PhD, finishing that work, just to kind of capstone it all, and continuing to research, just because it's a pull. I can't describe it. It's like a pull, that I haven't finished this unfinished thing that needs to be done, so I definitely get it.

Stratos: I would do a second PhD myself in a heartbeat if I could find the time.

Angelo: Yeah, I know, with a family it's really hard. Well, Stratos, thank you for joining us. Thank you for being part of the show. Is there anywhere people can find you on social media if they want to follow you? Anything they can follow, or follow the lab?

Stratos: I'm not on social media myself, but the lab has a website. We have a Twitter account, DASlab [@HarvardDASlab], where we announce our research papers as they're being published. Or if we have a video, we'll post it there. So yeah, by all means, it's all accessible there.

Angelo: Sounds good. We'll put that in the show notes. Thank you for joining us. I'm your host, Angelo Kastroulis, and this has been Counting Sand. Before you go, please take a moment to like, review, and subscribe to the podcast. Also, you can follow us on social media. You can find me on LinkedIn @angelok1, feel free to reach out to me there. You can also follow me on Twitter, my handle is @angelokastr. Or you can follow my company, Ballista Group, on Twitter @BallistaGroup. You can also find the show on countingsandshow.com or your favorite podcast platform.