Wednesday, July 1, 2020

How REI used automation to cloudify infrastructure and rapidly adjust its digital pandemic response


Like many retailers, Recreational Equipment, Inc. (REI) was faced with drastic and rapid change when the COVID-19 pandemic struck. REI’s marketing leaders wanted to make sure that their online e-commerce capabilities would rise to the challenge. They expected a nearly overnight 150 percent jump in REI’s purely digital business.

Fortunately, REI’s IT leadership had already advanced their systems to heightened automation, which allowed the Seattle-based merchandiser to turn on a dime and devote much more of its private cloud to the new e-commerce workload demands.

The next BriefingsDirect Voice of Innovation interview uncovers how REI kept its digital customers and business leadership happy, even as the world around them was suddenly shifting.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

To explore what works for making IT agile and responsive enough to re-factor a private cloud at breakneck speed, we’re joined by Bryan Sullins, Senior Cloud Systems Engineer at REI in Seattle. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.


Here are some excerpts:

Gardner: When the pandemic required you to hop-to, how did REI manage to have the IT infrastructure to actually move at the true pace of business? What put you in a position to be able to act as you did?

Digital retail demands rise 

Sullins: In addition to the pandemic stay-at-home orders a couple months ago, we also had a large sale previously scheduled for the middle of May. It’s the largest sale of the year, our anniversary sale.

And ramping up to that, our marketing and sales department realized that we would have a huge uptick in online sales. People really wanted to get outside, because people could go outside without breaking any of the social distancing rules.

For example, bicycle sales were up 310 percent compared to the same time last year. So in ramping up for that, we anticipated our online presence at rei.com was going to go up by 150 percent, but we wanted to scale up by 200 percent to be sure. In order to do that, we had to reallocate a bunch of ESXi hosts in VMware vSphere. We either had to stand up new ones or reallocate from other clusters and put them into what we call our digital retail presence.

As a result of our fully automated process, using Hewlett Packard Enterprise (HPE) OneView, Synergy, and Image Streamer, we were able to reallocate 6 out of the 17 total hosts needed. We were able to do that in 18 minutes, all at once -- and that’s single touch, that’s launching the automation and then pulling them from one cluster and decommissioning them and placing them all the way into the digital retail clusters.

We also had to move some from our legacy platform -- they aren’t on HPE Synergy yet -- and those took an additional three days. But those are in transition; we are moving through to that fully automated platform all around.

Gardner: That’s amazing because just a few years ago that sort of rapid and automated transition would have been unheard of. Even at a slow pace you weren’t guaranteed to have the performance and operations you wanted.

If you were not able to do this using automation – if the pandemic had hit, heaven forbid, five or seven years ago – what would have been the outcome?

Sullins: There were actually two outcomes from this. The first is the fairly obvious issue of not being able to handle the online traffic on our rei.com retail presence. It could have been that people weren’t able to put stuff into a shopping cart, or inventory decrement, and so on. It could have been a very broad range of things. We needed to make sure we had the infrastructure capacity so that none of that fails under a heavy load. That was the first part.

Gardner: Right, and when you have people in the heat of a purchasing moment, if you’re not there and it’s not working, they have other options. Not only would you lose that sale, you might lose that customer, and your brand suffers as well.

Sullins: Oh, without a doubt, without a doubt.

The other issue, of course, would have been if we did not meet our deadline. We had just under a week to get this accomplished. And if we had to do this without a fully automated approach, we would have had to return to our managers and say, “Yeah, so like we can’t do it that quickly.” But with our approach, we were able to do it all in the time frame -- and be able to get some sleep in the interim. So it was a win-win.

Gardner: So digital transformation pays off after all?

Sullins: Without a doubt.

Gardner: Before we learn more about your journey to IT infrastructure automation, tell us about REI, your investments in advanced automation, and why you consider yourself a data-driven digital business?

Automation all the way 

Sullins: Well, a lot of that precedes me by quite a bit. Going back to the early 2000s, based on what my managers tell me, there was a huge push for REI to become an IT organization that just happens to do retail. The priority is on IT being a driving force behind everything we do, and that is something that, at the time, REI really needed to do. There are other competitors, which we won’t name, but you probably know who they are. REI needed to stay ahead of that curve.

So since then there have been constant sweeping and cyclical changes for that digital transformation. The most recent one is the push for automating all things. So that’s the priority we have. It’s our marching orders.

Gardner: In addition to your company, culture, and technology, tell us about yourself, Bryan. What is it about your background and personal development that led you to be in a position to act so forthrightly and swiftly?

Sullins: I got my start in IT back in 1999. I was a public school teacher before that, and then I made the transition to doing IT training. I did IT training from 1999 to about 2012. During those years, I got a lot of technology certifications, because in the IT training world you have to.

I began with what was, at the time, called the Microsoft Certified Solutions Expert (MCSE) certification. Then I also did the Linux Professional Institute. I really glommed on to Linux. I wanted to set myself apart from the rest of the field back then, so I went all-in on Linux.

And then, 2008-2009-ish, I jumped on the VMware train and went all-in on VMware and did the official VMware curriculum. I taught that for about three years. Then, in 2012, I made the transition from IT training into actually doing this for real as an engineer working at Dell. At the time, Dell had an infrastructure-as-a-service (IaaS) healthcare cloud that was fairly large – 1,200-plus ESXi hosts. We were also responsible for the storage and for the 90-plus storage area network (SAN) arrays as well.

In an environment that large, you really have to automate. I cut my teeth on automating through PowerCLI and Ansible. Since then, about 2015, it’s been the focus of my career. I’m not saying I’m a guru, by any means, but it’s been a focus of my career.

Then, in 2018, REI came calling. I jumped on that opportunity because they are a super-awesome company, and right off the bat I got free rein: if you want to automate it, then you automate it. And I have been doing that ever since August of 2018.

Gardner: What helped you make the transition from training to cloud engineer?

Sullins: I typically jump right into new technology. I don’t know if that comes from the training or if that’s just me as a person. But one of the positives I’ve gotten from the training world is that you learn 100 percent of the feature base that’s available with said technology. I was able to take what I learned and knew from VMware and then say, “Okay, well, now I am going to get the real-world experience to back that up as well.” So it was a good transition.

Gardner: Let’s look at how other organizations can anticipate the shift to automation. What are some of the challenges that organizations typically face when it comes to being agile with their infrastructure?

Manage resistance to cloud 

Sullins: The challenges that I have seen aren’t usually technical. Usually the technologies that people use to automate things are ready at hand. Many are free; Ansible, for example, is free. PowerCLI is free. Jenkins is free.

So, people can start doing that tomorrow. But the real challenge is in changing people’s mindset about a more automated approach. I think that it’s tough to overcome. It’s what I call provisioning by council. More traditional on-premises approaches have application owners who want to roll out x number of virtual machines (VMs), with all their particular specs and whatnot. And then a council of people typically looks at that and kind of scratches their chin and says, “Okay, we approve.” But if you need to scale up, that council approach becomes a sort of gate-keeping process.


With a more automated approach, like we have at REI, we use a cloud management platform to automate the processes. We use that to enable self-service VMs instead of having a roll out by council, where some of the VMs can take days or weeks to roll out because you have a lot of human beings touching it along the way. We have a lot of that process pre-approved, so everybody has already said, “Okay, we are okay with the roll out. We are okay with the way it’s done.” And then we can roll that out in 7 to 10 minutes rather than having a ticket-based model where somebody gets to it when they can. Self-service models are able to do that much better.

But that all takes a pretty big shift in psychology. A lot of people are used to being the gatekeeper. It can make them uncomfortable to change. Fortunately for me, a lot of the people at REI are on-board with this sort of approach. But I think that resistance can be something a lot of people run into.

Gardner: You can’t just buy automation in a box off of a shelf. You have to deal with an accumulation of manual processes and habits. Why is moving beyond the manual processes culture so important?

Sullins: I call it a private cloud because that means there is a healthy level of competition between what’s going in the public cloud and what we do in that data center.

The public cloud team has the capability of “selling” their solution side-by-side with ours. When you have application owners who are technically adept -- and pretty much all of them are at REI -- they can be tempted to say, “Well, I don’t want to wait a week or two to get a VM. I want to create one right now out on the public cloud.”

That’s a big challenge for us. So what we are trying to accomplish -- and we have had success so far through the transition -- is to offer our customers a spectrum of services. So that’s great.

The stakeholders consuming that now gain flexibility. They can say, “Okay, yeah, I have this application. I want to run it in the public cloud, but I can’t based on the needs for that application. We have to run it on-premises.” And now they can do that in an automated way. That’s a big win, and that’s what people expect now, quite honestly.

Gardner: They want the look and feel of a public cloud but with all the benefits of the private cloud. It’s up to you to provide that. Let’s find out how you did.

How did you overcome the challenges that we talked about and what are the investments that you made in tools, platforms, and an ecosystem of players that accomplished it?

Sullins: As I mentioned previously, a lot of our utilities are “free,” the Ansibles of the world, PowerCLI, and whatnot. We also use Morpheus to do self-service and the implications behind automating things on what I call the front end, the customer face. The issue you have there is you don’t get that control of scaling up before you provision the VM. You have to monitor and then roll it out on the backend. So you have to monitor for usage and then scale up on the backend, and seamlessly. The end users aren’t supposed to know that you are scaling up. I don’t want them to know. It’s not their job to know. I want to remain out of their way.


In order to do that, we’ve used a combination of technologies. HPE actually has a GitHub link for a lot of Ansible playbooks that plug right in. And then the underlying hardware-adjacent management ecosystem platform is HPE OneView with HPE Synergy and Image Streamer. With a combination of all of those technologies we were able to accomplish that 18-minute roll-out of our various titles.

Gardner: Even though you have an integrated platform and solutions approach, it sounds like you have also made the leap from ushering pets through the process into herding cattle. If you understand my metaphor, what has allowed you to move from treating each instance as a pet to being able to herd this stuff through on an automated basis?

From brittle pets to agile cattle 

Sullins: There is a psychological challenge with that. In the more traditional approach – and the VMware shop listeners are going to be very well aware of this -- I may need to have a four-node cluster with a number of CPUs, a certain amount of RAM, and so on. And that four-node cluster is static. Yes, if I need to add a fifth down the line I can do that, but for that four-node cluster, that’s its home, sometimes for the entire lifecycle of that particular host.

With our approach, we treat our ESXi hosts as cattle. The HPE OneView-Synergy-Image Streamer technology allows us to do that in conjunction with those tools we mentioned previously, for the end point in particular.

So rather than have a cluster, and it’s static and it stays that way -- it might have a naming convention that indicates what cluster it’s in and where -- in reality we have cattle-based DNS names for ESXi hosts. At any time, the understanding throughout the organization, or at least for the people who need to know, is that any host can be pulled from one cluster automatically and placed into another, particularly when it comes to resource usage on that cluster. My dream is that the robots will do this automatically.

So if you had a cluster that goes into the yellow, with its capacity usage based on a threshold, the robot would interpret that and say, “Oh, well, I have another cluster over here with a host that is underutilized. I’m going to pull it into the cluster that’s in the yellow and then bring it back into the green again.” This would happen all while we sleep. When we wake up in the morning, we’d say, “Oh, hey, look at that. The robots moved that over.”
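The threshold-driven behavior Sullins describes can be illustrated with a short Python sketch. This is only an illustration under stated assumptions, not REI's or HPE's implementation: the 75 percent "yellow" threshold, the `Cluster` class, the host names, and the idea of measuring load in host-equivalents are all hypothetical. A real version would trigger the existing push-button automation rather than move names between lists.

```python
from dataclasses import dataclass, field

# Assumed threshold: a cluster above 75 percent utilization is "in the yellow".
YELLOW_THRESHOLD = 0.75


def utilization(load: float, host_count: int) -> float:
    """Fraction of capacity in use; load is in host-equivalents (hypothetical unit)."""
    if host_count == 0:
        return 0.0 if load == 0 else float("inf")
    return load / host_count


@dataclass
class Cluster:
    name: str
    hosts: list = field(default_factory=list)
    load: float = 0.0

    @property
    def usage(self) -> float:
        return utilization(self.load, len(self.hosts))


def rebalance(clusters):
    """Move hosts from underutilized clusters into yellow ones; return the moves made."""
    moves = []
    while True:
        yellow = [c for c in clusters if c.usage > YELLOW_THRESHOLD]
        if not yellow:
            break
        target = max(yellow, key=lambda c: c.usage)
        # A donor must stay in the green even after giving up one host.
        donors = [
            c for c in clusters
            if c is not target and c.hosts
            and utilization(c.load, len(c.hosts) - 1) <= YELLOW_THRESHOLD
        ]
        if not donors:
            break  # nothing safe to take; a person (or new hardware) is needed
        donor = min(donors, key=lambda c: c.usage)
        host = donor.hosts.pop()
        target.hosts.append(host)
        moves.append((host, donor.name, target.name))
    return moves


# Hypothetical clusters: the retail cluster is over the threshold, batch is idle.
digital = Cluster("digital-retail", ["esx-01", "esx-02"], load=1.9)
batch = Cluster("batch", ["esx-03", "esx-04", "esx-05", "esx-06"], load=1.0)
for host, src, dst in rebalance([digital, batch]):
    print(f"move {host}: {src} -> {dst}")
```

The donor check is the important design choice here: a host is only pulled if the cluster giving it up would still be in the green afterward, so the robot never fixes one cluster by pushing another into the yellow.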

Gardner: Algorithmic operations. It sounds very exciting.

Automation begets more automation 

Sullins: Yes, we have the push-button automation in place for that. It’s the next level of what that engine is that’s going to make those decisions and do all of those things.

Gardner: And that raises another issue. When you take the plunge into IT automation and are making your way down the Chisholm Trail with your cattle, all of a sudden it becomes easier along the way. The automation begets more automation. As you learn and grow, does it become more automated along the way?

Sullins: Yes. Just to put an exclamation point on this topic, imagine the situation we opened the podcast with, which is, “Okay, we have to reallocate a bunch of hosts for rei.com.” If it’s fully automated, and we have robots making those decisions, the response is instantaneous. “Oh, hey, we want to scale up by 200 percent on rei.com.” We can say, “Okay, go ahead, roll out your VM. The system will react accordingly. It will add physical hosts as you see fit, and we don’t have to do anything, we have already done the work with the automation.” Right?

But to the automation begetting automation, which is a great way of putting it, by the way, there are always opportunities for more automation. And on a career side note, I want to dispel the myth that you automate your way out of a job. That is a complete and total myth. I’m not saying it doesn’t happen, where people get laid off as a result of automation. I’m not saying that doesn’t happen, but that’s relatively rare because when you automate something, that automation is going to need to be maintained because things change over time.

The other piece of that is a lot of times you have different organizations at various states of automation. Once you get your head above water to where it’s, “Okay, we have this process and now it’s become trivial because it’s been automated,” you can concentrate on automating more things -- or on new things that need to be automated. And whether that’s the process for only VMs, a new feature base, monitoring, or auto-scaling -- whatever it is -- you have the capability from day one to further automate these processes.

Gardner: What was it specifically about the HPE OneView and Synergy that allowed you to move past the manual processes, firefighting, and culture of gatekeeping into more herding of cattle and being progressively automated?

Sullins: It was two things. The Image Streamer was number one. To date, we don’t run PXE boot infrastructure -- not that we can’t, it’s just not something that we have traditionally done. We needed a more standard process for doing that, and Image Streamer fit that and solved that problem.

The second piece is the provided Ansible playbooks that HPE has to kick off the entire process. If you are somewhat versed in how HPE does things through OneView, you have a server profile that you can impose on a blade, and that can be fully automated through Ansible.

And, by the way, you don’t have to use Image Streamer to use Ansible automation. This is really more of an HPE OneView approach, whereby you can actually use it to do automated profiles and whatnot. But the Image Streamer is really what allows us to say, “Okay, we build a gold image. We can apply that gold image to any frame in the cluster.” That’s the first part of it, and the rest is configuring the other side.

Gardner: Bryan, it sounds like the HPE Composable Infrastructure approach works well with others. You are able to have it your way because you like Ansible, and you have a history of certain products and skills in your organization. Does the HPE Composable Infrastructure fit well into an ecosystem? Is it flexible enough to integrate with a variety of different approaches and partners?

Sullins: It has been so far, yes. We have anticipated leveraging HPE for our bare metal Linux infrastructure. One of the additional driving forces and big initiatives right now is Kubernetes. We are going all-in on Kubernetes in our private cloud, as well as in some of our worker nodes. We eventually plan on running those as bare metal. And HPE OneView, along with Image Streamer, is something that we can leverage for that as well. So there is flexibility, absolutely, yes.

Coordinating containers 

Gardner: It’s interesting, you have seen the transition from having VMware and other hypervisor sprawl to finding a way to manage and automate all of that. Do you see the same thing playing out for containers, with the powerful endgame of being able to automate containers, too?

Sullins: Right. We have been utilizing Rancher as part of our coordination tool for our Kubernetes infrastructure and utilizing vSphere for that. So we are using that.

As far as the containerization approach, REI was doing containers before containers were a big thing. Our containerization platform has been around since at least 2015. So REI has been pretty cutting edge as far as that is concerned.


And now that Kubernetes has won the orchestration wars, as it were, we are looking to standardize that for people who want to do things online, which is to say, going back to the digital transformation journey.

Basically, the industry has caught up with what our super-awesome developers have done with containerization. But we are looking to transition the heavy lifting of maintaining a platform away from the developers. Now that we have a standard approach with Kubernetes, they don’t have to worry so much about it. They can just develop what they need to develop. It will be a big win for us.

Gardner: As you look back at your automation journey, have you developed a philosophy about automation? How should this best work in the future?

Trust as foundation of automation 

Sullins: Right. Have you read Gene Kim’s The Unicorn Project? Well, there is also his The Phoenix Project. My take from that is the whole idea of trust, of trusting other people. And I think that is big.

I see that quite a bit in multiple organizations. For REI, we are going to work as a team and we trust each other. So we have a pretty good culture. But I would imagine that in some places that is still a big challenge.

And if you take a look at The Unicorn Project, a lot of the issues have to do with trusting other human beings. Something happened, somebody made a mistake, and it caused an outage. So they lock it up and lock it away and say only certain people can do that. And then if you multiply that happening multiple times -- and then different individuals locking that down -- it leads to not being able to automate processes without somebody approving it, right?

Gardner: I can't imagine you would have been capable, when you had to transition your private cloud for more online activity, if you didn’t have that trust built into your culture.

Sullins: Yes, and the big challenge that might still come up is the idea of trusting your end users, too. Once you go into the realm of self-service, you come up on the typical what-ifs. What if somebody adds a zero and they meant to only roll out 4 VMs but they roll out 40? That’s possible. How do you create guardrails that are seamless? If you can, then you can trust your users. You decrease the risk and can take that leap of faith that bad things won’t happen.
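One way to picture such a guardrail is a quota check that runs before any self-service request is provisioned. The Python sketch below is a minimal illustration under assumptions: the `Quota` fields, the specific limits, and the `check_request` function are hypothetical and do not represent Morpheus or any REI tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Quota:
    max_vms_per_request: int  # assumed cap on any single self-service request
    max_vms_total: int        # assumed cap on VMs an application owner may run


def check_request(requested: int, already_running: int, quota: Quota):
    """Return (approved, reason) for a self-service VM request."""
    if requested <= 0:
        return False, "request must be for at least one VM"
    if requested > quota.max_vms_per_request:
        return False, (
            f"{requested} VMs exceeds the per-request limit of "
            f"{quota.max_vms_per_request}; did you add a zero?"
        )
    if already_running + requested > quota.max_vms_total:
        return False, (
            f"request would bring the total to {already_running + requested}, "
            f"over the limit of {quota.max_vms_total}"
        )
    return True, "approved"


# Hypothetical limits: 10 VMs per request, 50 in total.
quota = Quota(max_vms_per_request=10, max_vms_total=50)
print(check_request(4, 12, quota))   # the intended request
print(check_request(40, 12, quota))  # the extra-zero mistake
```

Requests inside the limits go through with no human in the loop, which keeps the self-service experience seamless; only the outliers are stopped and explained.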

Gardner: Tell us about your wish list for what comes next. What would you like HPE to be doing?

Small steps and teamwork rewards 

Sullins: My approach is to first automate one thing and then work out from there. You don’t have to boil the ocean. Start with something small and work your way up.

As far as next steps, we want auto-scaling at the physical layer, with the robots doing all of that. The robots will scale our clusters up and down while we sleep.

We will continue to do application programming interface (API)-capable automation with anything that has a REST API. If we can connect to that and manipulate it, we can do pretty much whatever automation we want.


We are also containerizing all things. If an application can be containerized properly, containerize it.

As far as what decision-making engine we have to do the auto-scaling on the physical layer, we haven’t really decided upon what that is. We have some ideas but we are still looking for that.

Gardner: How about more predictive analytics using artificial intelligence (AI) with the data that you have emanating from your data center? Maybe AIOps?

Sullins: Well, without a doubt. I, for one, haven’t done any sort of deep dive into that, but I know it’s all the rage right now. I would be open to pretty much anything that will encompass what I just talked about. If that’s HPE InfoSight, then that’s what it is. I don’t have a lot of experience quite honestly with InfoSight as of yet. We do have it installed in a proof of concept (POC) form, although a lot of the priorities for that have been shifted due to COVID-19. We hope to revisit that pretty soon, so absolutely.


Gardner: To close out, you were ahead of the curve on digital transformation. That allowed you to be very agile when it came time to react to the COVID-19 pandemic. What did that get you? Do you have any results?

Sullins: Yes, as a matter of fact, our boss’s boss, his boss -- so three bosses up from me -- he actually sits in on our load testing. It was an all-hands-on-deck situation during that May online sale. He said that it was the most seamless one that he had ever seen. There were almost no issues with this one.

What I attribute that to is, yes, we had done what we needed on the infrastructure side to make sure that we met dynamic demands. Also, everybody worked as a team. Everybody, all the way up the stacks, from our infrastructure contribution, to the hypervisor and hardware layer, all the way on up to the application layer and the containers, and all of our DevOps stuff. It was very successful. We went past our goals of what we had thought for the sale, so it was a win-win all the way around.

Gardner: Even though you were going through this terrible period of adjustment, that’s very impressive.

Sullins: Yes.

Sponsor: Hewlett Packard Enterprise.
