Tuesday, October 7, 2014

MIT Media Lab computing director details the virtues of cloud for agility and disaster recovery

The next BriefingsDirect innovator case study interview focuses on the MIT Media Lab in Cambridge, Mass., and how they're exploring the use of cloud and hybrid cloud to enjoy such use benefits as IT speed, agility and robust, three-tier disaster recovery (DR).

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

To learn more about how the MIT Media Lab is exploiting cloud computing, we’re joined by Michail Bletsas, research scientist and Director of Computing at the MIT Media Lab. The discussion, at the recent VMworld 2014 Conference in San Francisco, is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Tell us about the MIT Media Lab and how it manages its own compute requirements.

Bletsas: The organization is one of the many independent research labs within MIT. MIT is organized in departments, which do the academic teaching, and research labs, which carry out the research.

http://web.media.mit.edu/~mbletsas/
Bletsas
The Media Lab is a unique place within MIT. We deviate from the normal academic research lab in the sense that a lot of our funding comes from member companies, and it comes in a non-direct fashion. Companies become members of the lab, and then we get the freedom to do whatever we think is best.

We try to explore the future. We try to look at what our digital life will look like 10 years out, or more. We're not an applied research lab in the sense that we're not looking at what's going to happen two or three years from now. We're not looking at short-term future products. We're looking at major changes 15 years out.

I run the group that takes care of the computing infrastructure for the lab and, unlike a normal IT department, we're kind of heavy on computing. We use computers as our medium. The Media Lab is all about human expression, which is the reason for the name and computers are one of the main means of expression right now. We're much heavier than other departments in how many devices you're going to see. We're on a pretty complex network and we run a very dynamic environment.

Major piece

A lot has changed in our environment in recent years. I've been there for almost 20 years. We started with very exotic stuff. These days, you still build exotic stuff, but you're using commodity components. VMware, for us, is a major piece of this strategy because it allows us a more efficient utilization of our resources and allows us to control a little bit the server proliferation that we experienced and that everybody has experienced.

We normally have about 350 people in the lab, distributed among staff, faculty members, graduate students, and undergraduate students, as well as affiliates from the various member companies. There is usually a one-to-five correspondence between virtual machines (VMs), physical computers, and devices, but there are at least 5 to 10 IPs per person on our network. You can imagine that having a platform that allows us to easily deploy resources in a very dynamic and quick fashion is very important to us.

We run a relatively small operation for the size of the scope of our domain. What's very important to us is to have tools that allow us to perform advanced functions with a relatively short learning curve. We don’t like long learning curves, because we just don’t have the resources and we just do too many things.

You are going to see functionality in our group that is usually only present in groups that are 10 times our size. Each person has to do too many things, and we like to focus on technologies that allow us to perform very advanced functions with little learning. I think we've been pretty successful with that.
We really need to interact with our infrastructure on a much shorter cycle than the average operation.

Gardner: How have you created a data center that’s responsive, but also protects your property?

Bletsas: Unlike most people, we tend to have our resources concentrated close to us. We really need to interact with our infrastructure on a much shorter cycle than the average operation. We've been fortunate enough that we have multiple, small data centers concentrated close to where our researchers are. Having something on the other side of the city, the state, or the country doesn’t really work in an environment that’s as dynamic as we are.

We also have to support a much larger community that consists of our alumni or collaborators. If you look at our user database right now, it’s something in the order of 3,500, as opposed to 350. It’s a very dynamic in that it changes month to month. The important attributes of an environment like this is that we can’t have too many restrictions. We don’t have an approved list of equipment like you see in a normal corporate IT environment.

Our modus operandi is that if you bring it to us, we’ll make it work. If you need to use a specific piece of equipment in your research, we’ll try to figure out how to integrate it into your workflow and into what we have in there. We don’t tell people what to use. We just help them use whatever they bring to us.

In that respect, we need a flexible virtualization platform that doesn’t impose too many restrictions on what operating systems you use or what the configuration of the VMs are. That’s why we find that solutions, like general public cloud, for us are only applicable to a small part of our research. Pretty much every VM that we run is different than the one next to it. 

Flexibility is very important to us. Having a robust platform is very, very important, because you have too many parameters changing and very little control of what's going on. Most importantly, we need a very solid, consistent management interface to that. For us, that’s one of the main benefits of the vSphere VMware environment that we’re on.

Public or hybrid

Gardner: What about taking advantage of cloud, public cloud, and hybrid cloud to some degree, perhaps for disaster recovery (DR) or for backup failover. What's the rationale, even in your unique situation, for using a public or hybrid cloud?

Bletsas: We use hybrid cloud right now that’s three-tiered. MIT has a very large campus. It has extensive digital infrastructure running our operations across the board. We also have facilities that are either all the way across campus or across the river in a large co-location facility in downtown Boston and we take advantage of that for first-level DR.

A solution like the vCloud Air allows us to look at a real disaster scenario, where something really catastrophic happens at the campus, and we use it to keep certain critical databases, including all the access tools around them, in a farther-away location.

It’s a second level for us. We have our own VMware infrastructure and then we can migrate loads to our central organization. They're a much larger organization that takes care of all the administrative computing and general infrastructure at MIT at their own data centers across campus. We can also go a few states away to vCloud Air [and migrate our workloads there in an emergency].
We know that remote events are remote, until they happen, and sometimes they do.

So it’s a very seamless transition using the same tools. The important attribute here is that, if you have an operation that small, 10 people having to deal with such a complex set of resources, you can't do that unless you have a consistent user interface that allows you to migrate those workloads using tools that you already know and you're familiar with.

We couldn’t do it with another solution, because the learning curve would be too hard. We know that remote events are remote, until they happen, and sometimes they do. This gives us, with minimum effort, the ability to deal with that eventuality without having to invest too much in learning a whole set of tools, a whole set of new APIs to be able to migrate.

We use public cloud services also. We use spot instances if we need a high compute load and for very specialized projects. But usually we don’t put persistent loads or critical loads on resources over which we don’t have much control. We like to exert as much control as possible.

Gardner: It sounds like you're essentially taking metadata and configuration data, the things that will be important to spin back up an operation should there be some unfortunate occurrence, and putting that into that public cloud, the vCloud Air public cloud. Perhaps it's DR-as-a-service, but only a slice of DR, not the entire data. Is that correct?

Small set of databases

Bletsas: Yes. Not the entire organization. We run our operations out of a small set of databases that tend to drive a lot of our websites. A lot of our internal systems drive our CRM operation. They drive our events management. And there is a lot of knowledge embedded in those databases.

It's lucky for us, because we're not such a big operation. We're relatively small, so you can include everything, including all the methods and the programs that you need to access and manipulate that data within a small set of VMs. You don’t normally use them out of those VMs, but you can keep them packaged in a way that in a DR scenario, you can easily get access to them.

Fortunately, we've been doing that for a very long time because we started having them as complete containers. As the systems scaled out, we tended to migrate certain functions, but we kept the basic functionality together just in case we have to recover from something.
We are fortunate enough to have a very good, intimate knowledge of our environment. We know where each piece lies. That’s the benefit of running a small organization

In the older days, we didn’t have that multi-tiered cloud in place. All we had was backups in remote data centers. If something happened, you had to go in there and find out some unused hardware that was similar to what you had, restore your backup, etc.

Now, because most of MIT's administrative systems run under VMware virtualization, finding that capacity is a very simple proposition in a data center across campus. With vCloud Air, we can find that capacity in a data center across the state or somewhere else.

Gardner: For organizations that are intrigued by this tiered approach to DR, did you decide which part of those tiers would go in which place? Did you do that manually? Is there a part of the management infrastructure in the VMware suite that allowed you to do that? How did you slice and dice the tiers for this proposition of vCloud Air holding a certain part of the data?

Bletsas: We are fortunate enough to have a very good, intimate knowledge of our environment. We know where each piece lies. That’s the benefit of running a small organization. We occasionally use vSphere’s monitoring infrastructure. Sometimes it reveals to us certain usage patterns that we were not aware of. That’s one of the main benefits that we found there.

We realized that certain databases were used more than we thought. Just looking at those access patterns told us, “Look, maybe you should replicate this." It doesn’t cost much to replicate this across campus and then maybe we should look into pushing it even further out.

It is a combination of having a visibility and nice dashboards that reveal patterns of activity that you might not be aware of even in an environment that's not as large as ours.

Gardner: At VMworld 2014, there was quite a bit of news, particularly in the vCloud Air arena. What intrigues you?

Standard building blocks

Bletsas: We like the move toward standardization of building blocks. That’s a good thing overall, because it allows you to scale out relatively quickly with a minor investment in learning a new system. That’s the most important trend out there for us. As I've said, we're a small operation. We need to standardize as much as possible, while at the same time, expanding the spectrum of services. So how do you do that? It’s not a very clear proposition.

The other thing that is of great interest to us is network virtualization. MIT is in a very peculiar situation compared to the rest of the world, in the sense that we have no shortage of IP addresses. Unlike most corporations where they expose a very small sliver of their systems to the outside world and everything happens on the back-end, our systems are mostly exposed out there to the public internet.

We don’t run very extensive firewalls. We're a knowledge dissemination and distribution organization and we don’t have many things to hide. We operate in a different way than most corporations. That shows also with networking. Our network looks like nothing like what you see in the corporate world. The ability to move whole sets of IPs around our domain, which is rather large and we have full control over, is a very important thing for us.

It allows for much faster DR. We can do DR using the same IPs across the town right now because our domain of control is large enough. That is very powerful because you can do very quick and simple DR without having to reprogram IP, DNS Servers, load balancers, and things like that. That is important.
That is very powerful because you can do very quick and simple DR without having to reprogram IP, DNS Servers, load balancers, and things like that.

The other trend that is also important is storage virtualization and storage tiering and you see that with all the vendors down in the exhibit space. Again, it allows you to match the application profile much easier to what resources you have. For a rather small group like ours, which can't afford to have all of its disk storage and very high-end systems, having a little bit of expensive flash storage, and then a lot of cheap storage, is the way for us to go.

The layers that have been recently added to VMware, both on the network side and the storage side help us achieve that in a very cost-efficient way.

For us, experimentation is the most important thing. Spinning out a large number of VMs to do a specific experiment is very valuable and being able to commandeer resources across campus and across data centers is a necessary requirement for something like an environment like this. Flexibility is what we get out of that and agility and speed of operations.
In the older days, you had to go and procure hardware and switch hardware around. Now, we rarely go into our data centers. We used to live in our data centers. We go there from time to time but not as often as we used to do, and that’s very liberating. It’s also very liberating for people like me because it allows me to do my work anywhere.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: VMware.

You may also be interested in:

No comments:

Post a Comment