Tuesday, October 13, 2020

How The Open Group enterprise architecture portfolio enables an agile digital enterprise

The next BriefingsDirect agile business enablement discussion explores how a portfolio approach to standards has emerged as a key way to grapple with digital transformation.

As businesses seek to make agility a key differentiator in a rapidly changing world, applying enterprise architecture (EA) in concert with many other standards has never been more powerful. Stay with us here to explore how to define and corral a comprehensive standards resources approach for making businesses intrinsically agile and competitive. 

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

To learn more about attaining agility via an embrace of a broad toolkit of standards, we are joined by our panel: Chris Frost, Principal Enterprise Architect and Distinguished Engineer, Application Technology Consulting Division, at Fujitsu; Sonia Gonzalez, The Open Group TOGAF® Product Manager; and Paul Homan, Distinguished Engineer and Chief Technology Officer, Industrial, at IBM Services. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Sonia, why is it critical to modernize businesses in a more comprehensive and structured fashion? How do standards help best compete in this digital-intensive era?

Gonzalez: The question is more important than ever. We need to respond very quickly to changes in the market.

It’s not only that we have more technology trends and competitors. Organizations are also changing their business models -- the way they offer products and services. And there’s much more uncertainty in the business environment.

The current situation with COVID-19 has made for a very unpredictable environment. So we need to be faster in the ways we respond. We need to make better use of our resources and to be able to innovate in how we offer our products and services. And since everybody else is also doing that, we must be agile and respond quickly. 

Gardner: Chris, how are things different now than a year ago? Is speed all that we’re dealing with when it comes to agility? Or is there something more to it?

Frost: Speed is clearly a very important part of it, and market trends are driving that need for speed and agility. But this has been building for a lot more than a year.

We now have, with some of the hyperscale cloud providers, the capability to deploy new systems and new business processes more quickly than ever before. And with some of the new technologies -- like artificial intelligence (AI), data analytics, and 5G -- there are new technological innovations that enable us to do things that we couldn’t do before.

Faster, better, more agile

A combination of these things has come together in the last few years that has produced a unique need now for speed. That’s what I see in the market, Dana.

Gardner: Paul, when it comes to manufacturing and industrial organizations, how do things change for them in particular? Is there something about the data, the complexity? Why are standards more important than ever in certain verticals?

Homan: The industrial world in particular, focusing on engineering and manufacturing, has brought together the physical and digital worlds. And whilst these industries have not been as quick to embrace the technologies as other sectors have, we can now see how they are connected. That means connected products, connected factories and places of work, and connected ecosystems.

There are still so many more things that need to be integrated, and fundamentally EA comes back to the how – how do you integrate all of these things? A great deal of the connectivity we’re now seeing around the world needs a higher level of integration.

Gardner: Sonia, to follow this point on broader integration, does applying standards across different parts of any organization now make more sense than in the past? Why does one part of the business need to be in concert with the others? And how does The Open Group portfolio help produce a more comprehensive and coordinated approach to integration?

Integrate with standards

Gonzalez: Yes, what Paul mentioned about being able to integrate and interconnect is paramount for us. Our portfolio of standards, which is more than just the TOGAF® [The Open Group Architecture Framework] Standard, is like having a toolkit of different open standards that you can use to address different needs, depending upon your particular situation.

For example, there may be cases in which we need to build physical products across an extended industrial environment. In that case, certain kinds of standards will apply. Also critical is how the different standards are used together to achieve interoperability. That is why Boundaryless Information Flow™ is one of our trademarks at The Open Group.

Other, more intangible cases, such as digital services, also need standards. For example, the Digital Practitioner Body of Knowledge (DPBoK™) provides a scaling model to support the digital enterprise.

Other standards are coming around agile enterprises and best practices. They support how to make interconnections and interoperability faster -- but at the same time having the proper consistency and integration to align with the overall strategy. At the end of the day, it’s not enough to integrate from just a technical point of view. You need to bring new value to your business. You need to be aligned with your business model, with your business view, and with your strategy.

Therefore, the change is not only to integrate technical platforms, even though that is paramount, but also to change your business and operational model and to go deeper to cover your partners and the way your company is put together.

So we have different standards that cover all of those different areas. As I said at the beginning, these form a toolkit from which you can choose different standards and make them work together, forming a portfolio of standards.

Gardner: So, whether we look to standards individually or together as a toolkit, it’s important that they have a real-world applicability and benefits. I’m curious, Paul and Chris, what’s holding organizations back from using more standards to help them?

Homan: When we use the term traditional enterprise architecture, it always needs to be adapted to suit the environment and the context. TOGAF, for example, has to be tailored to the organization and for the individual assignment.

But I’ve been around in the industry long enough to be familiar with a number of what I call anti-patterns that have grown up around EA practices and which are not helping with the need for agility. This comes from the idea that EA has heavy governance.

We have all witnessed such core practices -- and I will confess to having been part of some of them. And these obviously fly in the face of agility and flexibility, of being able to push decisions out to the edge and pivot quickly, and to make mistakes and be allowed to learn from them. So, a kind of experimental attitude.

And so gaining such adaptation is more than just promoting good architectural decision-making within a set of guardrails -- it allows decision-making to happen at the point of need. So that’s the needed adaptation that I see.

Gardner: Chris, what challenges do you see organizations dealing with, and why are standards so important to helping them attain a higher level of agility?

Frost: The standards are important, not so much because they are a standard but because they represent industry best practices. The way standards are developed in The Open Group is not some sort of theoretical exercise. It’s very much member-driven and brought together by the members drawing on their practical experiences.

To me, the point is more about industry best practice, and not so much the standard. There are good things about standard ways of working, being able to share things, and everybody having a common understanding about what things mean. But that aspect of the standard that represents industry best practices -- that’s the real value right now.

Coming back to what Paul said, there is a certain historical perspective here that we have to acknowledge. EA projects in the past -- and certainly things I have been personally involved in -- were often delivered in a very waterfall fashion. That created a certain perception that somehow EA means big-design-upfront-waterfall-style projects -- and that absolutely isn’t the case.

That is one of the reasons why a certain adaptation is needed. Guidance about how to adapt is needed. The word adapt is very important because it’s not as if all of the knowledge and fundamental techniques that we have learned over the past few years are being thrown away. It’s a question of how we adapt to agile delivery, and the things we have been doing recently in The Open Group demonstrate exactly how to do that.

Gardner: And does this concept of a minimum viable architecture fit in to that? Does that help people move past the notion of the older waterfall structure to EA?

Reach minimum viable architecture

Frost: Yes, very much it does. It’s something that you might regard as reaching first base. In architectural terms, that minimum viable architecture is like reaching first base, and that emphasizes a notion of rapidly getting to something that you can take forward to the next stage. You can get feedback, and there is also an acknowledgment that you will improve and iterate in the future. Those are fundamental to agile working. So, yes, that minimum viable architecture concept is a really important one.

Gardner: Sonia, if we are thinking about a minimum viable architecture we are probably also working toward a maximum value standards portfolio. How do standards like TOGAF work in concert with other open standards, standards not in The Open Group? How do we get to that maximum value when it comes to a portfolio of standards?

Gonzalez: That’s very important. First, it has to do with adapting the practice, and not only the standard. In order to face new challenges, especially ones with agile and digital, the practices need to evolve, and therefore so do the standards -- including the whole portfolio of The Open Group standards, which are constantly evolving and improving. Our members are the ones contributing the content that follows the new trends, best practices, and uses for all of those practices.

The standards need to evolve to cover areas like digital and agile. And with the concept of minimum viable architecture, the standards are evolving to provide guidance on how EA as a practice supports agile. Actually, nothing in the standard says it has to be used in the waterfall way, even though some people may say that.

TOGAF is now building guidance for how people can use the standards supporting the agile enterprise, delivering that in an agile way, and also supporting an agile approach, which is having a different view of how the practice is applied following this new shift and this new adaption.

Adapt to sector-specific needs

The practice needs to be adapted, and the standards need to evolve to fulfill that and need to be applied to specific situations. For example, architecting organizations in which you have ground processes, especially in a back office, is not the same as architecting ones that are more customer facing. For the first ones, the processes are heavier, and they don’t need to be that agile. Agile architecture is for customer-facing areas that need to support a faster pace.

So, you might have cases in which you need to mix different ways to apply the practices and standards: a less agile approach for the back office and a more agile approach for customer-facing applications such as online banking.

Adaptation also depends on the nature of companies. The healthcare industry is one example. We cannot experiment that much in that area because that’s more risk assessment and less subject to experimentation. For these kinds of organizations a different approach is needed.

There is work in progress in different sectors. For example, we have a very good guide and case study about how to use the TOGAF standard along with the ArchiMate® modeling notation in the banking industry using the BIAN® Reference Model. That’s a very good use case in The Open Group library. We also have work in progress in the forum around how governments architect. The IndEA Reference Model is another example, a reference model for government that has been put together based on open standards.

We also have work in progress around security, such as with the SABSA [framework for Business Security Architecture], for example. We have developed guidance about standards and security along with SABSA. We also have a partnership with the Object Management Group (OMG), in which we are pioneers and have a liaison to build products that will go to market to help practitioners use external standards along with our own portfolio.

Gardner: When we look at standards as promoting greater business agility, there might be people who look to the past and say, “Well, yes, but it was associated with a structured waterfall approach for so long.”

But what happens if you don’t have architecture and you try to be agile? What’s the downside if you don’t have enough structure; you don’t put in these best practices? What can happen if you try to be agile without a necessary amount of architectural integrity?

Guardrails required

Homan: I’m glad that you asked, because I have a number of organizations that I have worked with that have experienced the results of diminishing their architectural governance. I won’t name who they are for obvious reasons, but I know of organizations that have embraced agility. They had great responses to being able to do things quickly, find things out, and move fleet-of-foot, and then they combined that with cloud computing capabilities. They had great freedom to exercise where they chose to source commodity cloud services.

And, as an enterprise architect, if I look in, that freedom created a massive amount of mini-silos. As soon as those need to come together and scale -- and scale is the big word -- that’s where the problems started. I’ve seen, for example, around common use of information and standards, processes and workflows that don’t cross between one cloud vendor and another. And these are end-customer-facing services and deliveries that frankly clash from the same organization, from the same brand.

And those sorts of things came about because they weren’t using common reference architectures. There wasn’t a common understanding of the value propositions that were being worked toward, and they manifested because you could rapidly spin stuff out.

When you have a small, agile model of everybody co-located in a relatively contained space -- where they can readily connect and communicate -- great. But unfortunately as soon as you go and disperse the model, have a round of additional development, distribute to more geographies and markets, with lots of different products, you behave like a large organization. It’s inevitable that people are going to plough their own furrow and go in different directions. And so, you need to have a way of bringing it back together again.

And that’s typically where people come in and start asking how to reintegrate. They love the freedom and they want to keep the freedom, but they need to combine that with a way of having some gentle guardrails that allow them to exercise freedom of speed but not diverge too much.

Frost: The word guardrails is really important because that is very much the emphasis of how agile architectures need to work. My observation is that, without some amount of architecture and planning, what tends to go wrong is some of the foundational things – such as using common descriptions of data or common underlying platforms. If you don’t get those right, different aspects of an overall solution can diverge and fail to integrate. 

Some of those things may include what we generally refer to as non-functional requirements, things like capacity, performance, and possibly safety or regulatory compliance. These are often things that easily get overlooked unless there is some degree of planning and architecture, with architecture definitions that think through how to incorporate some of those really important features.

A really important judgment point is what’s just enough architecture upfront to set down those important guardrails without going too far and going back into the big design upfront approach, which we want to avoid to still create the most freedom that we can.

Gardner: Sonia, a big part of the COVID-19 response has been rapidly reorganizing or refactoring supply chains. This requires extended enterprise cooperation and ultimately integration. How are standards like TOGAF and the toolkit from The Open Group important to allow organizations to enjoy agility across organizational boundaries, perhaps under dire circumstances?

COVID-19 necessitates holistic view

Gonzalez: That is precisely when more architecture is needed, because you need to be able to put together a landscape, a whole view of your organization, which is now an extended organization. Your partners, customers, customer alliances, all of your liaisons, are a part of your value chain and you need to have visibility over this.

You mentioned suppliers and providers. These are changing due to the current situation. The way they work, everything is going more digital and virtual, with less face-to-face. So we need to change processes. We need to change value streams. And we need to be sure that we have the right capabilities. Having standards, it’s spot-on, because one of the advantages of having standards, and open standards especially, is that you facilitate communication with other parties. If you are talking the same language it will be easier to integrate and get people together.

Now that most people are working virtually, that implies the need for very good management of your whole portfolio of products and lifecycle. For addressing all this complexity and to gain a holistic view of your capabilities you need to have an architecture focus. Therefore, there are different standards that can fit together in those different areas.

For example, you may need to deliver more digital capabilities to work virtually. You may need to change your whole process view to become more efficient and allow such remote work, and to do that you use standards. In the TOGAF standard we have a set of very good guidance for our business architecture, business models, business capabilities, and value streams; all of them are providing guidance on how to do that.

Another very good guide under the TOGAF standard umbrella, this one for the organization itself, is called the Organization Map Guide. It’s much more than having a formal organizational chart for your company. It’s how you map to different resources to respond quickly to changes in your landscape. So, having a more dynamic view, having a cross-cutting view of your working teams, is required to be agile and to have interdisciplinary teams work together. So you need to have architecture, and you need to have open standards to address those challenges.

Gardner: And, of course, The Open Group is not standing still, along with many other organizations, in trying to react to the environment and help organizations become more digital and enhance their customer and end-user experiences. What are some of the latest developments at The Open Group?

Standards evolve steadily

Gonzalez: First, we are evolving our standards constantly. The TOGAF standard is evolving to address more of these agile-digital trends, including how to adopt new technology trends in a way that accords with your business model, your strategy, and your organizational culture. That’s an improvement that is coming. Also, the structure of the standard has evolved to be easier to use and more agile. It has been designed to evolve through new and improved versions more frequently than in the past.

We also have other components coming into the portfolio. One of them is the Agile Architecture Standard, which is going to be released soon. That one is going straight into the agile space. It’s proposing a holistic view of the organization. This coupling between agile and digital is addressed in that standard. It is also suitable to be used along with the TOGAF standard. Both complement each other. The DPBoK is also evolving to address new trends in the market.

We also have other standards. The Microservice Architecture is a very active working group that is delivering guidance on microservices delivered using the TOGAF standard. Another important one is the Zero Trust Architecture in the security space. Now more than ever, as we go virtual and rely on platforms, we need to be sure that we are having proper consistency in security and compliance. We have, for example, the General Data Protection Regulation (GDPR) considerations, which are stronger than ever. Those kinds of security breaches are addressed in that specific context.

The IT4IT standard, which is another reference architecture, is evolving toward becoming more oriented to a digital product concept to precisely address all of those changes.

All of these standards, all the pieces, are moving together. There are other things coming, for example, delivering standards to serve specific areas like oil, gas, and electricity, which are more facility-oriented, more physically-oriented. We are also working toward those to be sure that we are addressing all of the different possibilities.

Another very important thing here is we are aiming for every standard we deliver into the market to have a certification program along with it. We have that for the TOGAF standard, ArchiMate standard, IT4IT, and DPBoK. So the idea is to continue increasing our portfolio of certification along with the portfolio of standards.

Furthermore, we have more credentials as part of the TOGAF certification to allow people to go into specializations. For example, I’m TOGAF-certified but I also wanted to go for a Solution Architect Practitioner or a Digital Architect. So, we are combining the different products that we have, different standards, to have these building blocks we’re putting together for this learning curve around certifications, which is an important part of our offering.

Gardner: I think it’s important to illustrate where these standards are put to work and how organizations find the right balance between a minimum viable architecture and a maximum value portfolio for agility.

So let’s go through our panel for some examples. Are there organizations you are working with that come to mind that have found and struck the right balance? Are they using a portfolio to gain agility and integration across organizational boundaries?

More tools in the toolkit

Homan: The key part for me is do these resources help people do architecture? And in some of the organizations I’ve worked with, some of the greatest successes have been where they have been able to pick and choose -- cherry pick, if you like -- bits of different things and create a toolkit. It’s not about following just one thing. It’s about having a kit.

The reason I mentioned that is because one of the examples I want to reference has to do with development of ecosystems. In ecosystems, it’s about how organizations work with each other to deliver some kind of customer-centric propositions. I’ve seen this in the construction industry in particular, where lots of organizations historically have had to come together to undertake large construction efforts.

And we’re now seeing what I consider to be an architected approach across those ecosystems. That helps build a digital thread, a digital twin equivalent of what is being designed and constructed. It matters for safety reasons, both for the people doing the building at the time and for the people who then occupy or use what is built. It also matters for being able to share common standards and interoperate across the processes from end to end, so that these things can be done in a more agile way, of course, but in a business-agile way.

So that’s one industry that always had ecosystems, but IT has come in and therefore architects have had to better collaborate and find ways to integrate beyond the boundary of their organization, coming back to the whole mission of boundaryless information flow, if you will.

Gardner: Chris, any examples that put a spotlight on some of the best uses of standards and the best applicability of them particularly for fostering agility?

Frost: Yes, a number of customers in both the private and public sector are going through this transition to using agile processes. Some have been there for quite some time; some are just starting on that journey. We shouldn’t be surprised by this in the public and private sectors because everybody is reacting to the same market fundamentals driving the need for agile delivery.

We’ve certainly worked with a few customers that have been very much at the forefront of developing new agile practices and how that integrates with EA and benefits from all of the good architectural skills and experiences that are in frameworks like the TOGAF standard.

Paul talked about developing ecosystems. We’ve seen things such as organizations embarking on large-scale internal re-engineering where they are adjusting their own internal IT portfolios to react to the changing marketplace that they are confronted by.

I am seeing a lot of common problems about fitting together agile techniques and architecture and needing to work in these iterative styles. But overwhelmingly, these problems are being solved. We are seeing the benefits of this iterative way of working with rapid feedback and the more rapid response to changing market techniques.

I would say even inside The Open Group we’re seeing some of the effects of that. We’ve been talking about the development of some of the agile guidance for the TOGAF standard within The Open Group, and even within the working group itself we’ve seen adaption of more agile styles of working using some of the tools that are common in agile activities. Things like GitLab and Slack and these sorts of things. So it really is quite a pervasive thing we are seeing in the marketplace.

Gardner: Sonia, are there any examples that come to mind that illustrate where organizations will be in the coming few years when it comes to the right intersection of agile, architecture, and the use of open standards? Any early adopters, if you will, or trendsetters that come to mind that illustrate where we should be expecting more organizations to be in the near future?

Steering wheel for agility

Gonzalez: Things are evolving rapidly. In order to be agile and a digital enterprise there are different things that need to change around the organization. It’s a business issue, it’s not something related to only platforms of technology, or technology adoption. It’s going ahead of that to the business models.

For example, we now see more-and-more the need to have an outside-in view of the market and trends. Being efficient and effective is not enough anymore. We need to innovate to figure out what the market is asking for. And sometimes to even generate that demand and generate new digital offerings for your market.

That means more experimentation and more innovation, keeping in mind that in order to really deliver that digital offering you must have the right capabilities, so changes in your business and operational models, your backbone, need to be identified and then of course connected and delivered through technical platforms.

Data is also another key component. We have several working groups and Forums working around data management and data science. If you don’t have the information, you won’t be able to understand your customers. That’s another trend, having a more customer journey-oriented view. At the end, you need to give your value to your end users and of course also internally to your company.

That’s why even internally, at The Open Group, we are considering having our own standards get a closer view of the customer. That is something that companies need to be addressing. And for them to do that, practitioners need to be able to develop new skills and to evolve rapidly. They will need to study not only the new technology trends, but how you can communicate that to your business, so more communications, marketing, and a more aggressive approach through innovation.

Sustainability is another area we are considering at The Open Group: being able to deliver standards that will help organizations make better use of resources internally and externally and select the tools to be sustainable within their environments.

Those are some of the things we see for the coming years. As we have all learned this year, we should be able to shift very quickly. I was recently reading a very good blog that said agile is not only having a good engine, but also having a good steering wheel to be able to change direction quickly. That’s a very good metaphor for how you should evolve. It’s great to have a good engine, but you need to have a good direction, and that direction is precisely what they need to pay attention to, not being agile only for the sake of being agile.

So, that’s the direction we are taking with our whole portfolio. We are also considering other areas. For example, we are trying to improve our offering in vertical industry areas. We have other things on the move like Open Communities, especially for the ArchiMate Standard, which is one of our executable standards and is easier to implement using architecture tools.

So, those are the kinds of things in our strategy at The Open Group as we work to serve our customers.

Gardner: And what’s next when it comes to The Open Group events? How are you helping people become the types of architects who reach that balance between agility and structure in the next wave of digital innovation?

New virtual direction for events

Gonzalez: We have many different kinds of customers. We have our members, of course. We have our trainers. We have people that are not members but are using our standards and they are very important. They might eventually become members. So, we have been identifying those different markets on building customer journeys for all of them in order to serve them properly.

Serving them, for example, means providing better ways for them to find information on our website and to get access to our resources. All of our publications are free to be downloaded and used if you are an end-user organization. You only need a commercial license if you will apply them to deliver services to others.

In terms of events, we have had a very good experience with virtual events. The good thing about a crisis is that you can use it for learning, and we have learned that virtual events are very good. First, because we can reach a wider audience. For example, if you organize a face-to-face event in Europe, probably people from Europe will attend, but it’s very unlikely that people from Asia or even the U.S. or Latin America will attend. But virtual events, which are also free, are attracting people from different countries and different geographies.

We have very good attendance at those virtual events. This year, all four events, except the one that we had in San Antonio, have been virtual. Besides the big ones that we have every three months, we have also organized other smaller ones. We had a very good one in Brazil, we have another one from the Latin-American community in Spanish, and we’re organizing more of these events.

For next year, probably we are going to have some kind of a mix of virtual and face-to-face, because, of course, face-to-face is very important. And for our members, for example, sharing experiences as a network is a value that you can only have if you’re physically there. So, probably for next year, depending on how the situation is evolving, it will be a mix of virtual and face-to-face events.

We are trying to get a closer view of what the market is demanding from us, not only in the architecture space but in general.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: The Open Group.


Thursday, October 8, 2020

The IT intelligence foundation for digital business transformation rests on HPE InfoSight AIOps

 

The next BriefingsDirect podcast explores how artificial intelligence (AI) increasingly supports IT operations.

One of the most successful uses of machine learning (ML) and AI for IT efficiency has been the InfoSight technology developed at Nimble Storage, now part of Hewlett Packard Enterprise (HPE).


Initially targeting storage optimization, HPE InfoSight has emerged as a broad and inclusive capability for AIOps across an expanding array of HPE products and services.

Please welcome a Nimble Storage founder, along with a cutting-edge machine learning architect, to examine the expanding role and impact of HPE InfoSight in making IT resiliency better than ever.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

To learn more about the latest IT operations solutions that help companies deliver agility and edge-to-cloud business continuity, we’re joined by Varun Mehta, Vice President and General Manager for InfoSight at HPE and founder of Nimble Storage, and David Adamson, Machine Learning Architect at HPE InfoSight. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Varun, what was the primary motivation for creating HPE InfoSight? What did you have in mind when you built this technology?

Mehta: Various forms of call home were already in place when we started Nimble, and that’s what we had set out to do. But then we realized that the call home data was used to do very simple actions. It was basically to look at the data one time and try to find problems that the machine was having right then. These were very obvious issues, like a crash. If you had had any kind of software crash, that’s what call home data would identify.

We found that if, instead of just scanning the data one time, we could store it in a database and actually look for problems over time in areas wider than just a single use, we could come up with something very interesting. Part of the problem until then was that a database that could store this amount of data cheaply was just not available, which is why people would just do the one-time scan.

The enabler was that a new database became available. We found that rather than just scan once, we could put everyone’s data into one place, look at it, and discover issues across the entire population. That was very powerful. And then we could do other interesting things using data science such as workload planning from all of that data. So the realization was that if the databases became available, we could do a lot more with that data.

Gardner: And by taking advantage of that large data capability and the distribution of analytics through a cloud model, did the scope and relevancy of what HPE InfoSight did exceed your expectations? How far has this now come?

Mehta: It turned out that this model was really successful. They say that, “imitation is the sincerest form of flattery.” And that was proven true, too. Our customers loved it, our competitors found out that our customers loved it, and it basically spawned an entire set of features across all of our competitors.

The reason our customers loved it -- followed by our competitors -- was that it gave people a much broader idea of the issues they were facing. We then found that people wanted to expand this envelope of understanding that we had created beyond just storage.

Data delivers more than a quick fix

And that led to people wanting to understand how their hypervisor was doing, for example. And so, we expanded the capability to look into that. People loved the solution and wanted us to expand the scope into far more than just storage optimization.

Gardner: David, you hear Varun describing what this was originally intended for. As a machine learning architect, how has HPE InfoSight provided you with a foundation to do increasingly more when it comes to AIOps, dependability, and reliability of platforms and systems?

Adamson: As Varun was describing, the database is full of data that not only tracks everything longitudinally across the installed base, but also over time. The richness of that data set gives us an opportunity to come up with features that we otherwise wouldn’t have conceived of if we hadn’t been looking through the data. Also very powerful from InfoSight’s early days was the proactive nature of the IT support because so many simple issues had now been automated away.
 

That allowed us to spend time investigating more interesting and advanced problems, which demanded ML solutions. Once you’ve cleaned up the Pareto curve of all the simple tasks that can be automated with simple rules or SQL statements, you uncover problems that take longer to solve and require a look at time series and telemetry that’s quantitative in nature and multidimensional. That data opens up the requirement to use more sophisticated techniques in order to make actionable recommendations.

Gardner: Speaking of actionable, something that really impressed me when I first learned about HPE InfoSight, Varun, was how quickly you can take the analytics and apply them. Why has that rapid capability to dynamically impact what’s going on from the data proved so successful? 

Support to succeed

Mehta: It turned out to be one of the key points of our success. I really have to compliment the deep partnership that our support organization has had with the HPE InfoSight team.

The support team right from the beginning prided themselves on providing outstanding service. Part of the proof of that was incredible Net Promoter Scores (NPS), an independent measurement of how satisfied customers are with our products. Nimble’s NPS was 86, which is even higher than Apple’s. We prided ourselves on providing a really strong support experience to the customer.

Whenever a problem would surface, we would work with the support team. Our goal was for a customer to see a problem only once. And then we would rapidly fix that problem for every other customer. In fact, we would fix it preemptively so customers would never have to see it. So, we evolved this culture of identifying problems, creating signatures for these problems, and then running everybody’s data through the signatures so that customers would be preemptively inoculated from these problems. That’s why it became very successful.
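The signature workflow Mehta describes here can be pictured with a small sketch. This is only an illustration under stated assumptions, not HPE InfoSight's implementation: the `Signature` class, the `scan_installed_base` function, the telemetry field names, and the sample systems are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# One system's latest call-home record (hypothetical shape and field names).
Telemetry = Dict[str, object]


@dataclass
class Signature:
    """A named predicate that recognizes one known problem in a telemetry record."""
    name: str
    matches: Callable[[Telemetry], bool]


def scan_installed_base(
    fleet: Dict[str, Telemetry], signatures: List[Signature]
) -> Dict[str, List[str]]:
    """Run every signature against every system and return the hits per system."""
    hits: Dict[str, List[str]] = {}
    for system_id, record in fleet.items():
        matched = [s.name for s in signatures if s.matches(record)]
        if matched:
            hits[system_id] = matched
    return hits


# Hypothetical signature, written after a single customer reported a problem.
retry_storm = Signature(
    name="hypervisor-retry-storm",
    matches=lambda r: r.get("hypervisor_version") == "X.Y"
    and r.get("write_retries", 0) > 10,
)

# Hypothetical installed base: three systems with their latest telemetry.
fleet = {
    "array-001": {"hypervisor_version": "X.Y", "write_retries": 64},
    "array-002": {"hypervisor_version": "X.Y", "write_retries": 0},
    "array-003": {"hypervisor_version": "W.Z", "write_retries": 64},
}

at_risk = scan_installed_base(fleet, [retry_storm])
print(at_risk)  # {'array-001': ['hypervisor-retry-storm']}
```

The point of the design is that a signature is written once and then applied to everyone's data, so the cost of diagnosing a problem is paid a single time.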

Gardner: It hasn’t been that long since we were dealing with red light-green light types of IT support scenarios, but we’ve come a long way. We’re not all the way to fully automated, lights-out, machines running machines operations.

David, where do you think we are on that automated support spectrum? How has HPE InfoSight helped change the nature of systems’ dependability, getting closer to that point where they are more automated and more intelligent?

Adamson: The challenge with fully automated infrastructure stems from the variety of different components in the environments -- and all of the interoperability among those components. If you look at just a simple IT stack, there are typically applications on top of virtual machines (VMs), on top of hosts -- they may or may not have independent storage attached -- and then the networking of all these components. That’s discounting all the different applications and various software components required to run them.

There are just so many opportunities for things to break down. In that context, you need a holistic perspective to begin to realize a world in which that entire unit is managed in a comprehensive way. And so we strive for observability models and services that collect all the data from all of those sources. If we can get that data in one place to look at the interoperability issues, we can follow the dependency chains.

But then you need to add intelligence on top of that, and that intelligence needs to not only understand all of the components and their dependencies, but also what kinds of exceptions can arise and what is important to the end users.

So far, with HPE InfoSight, we go so far as to pull in all of our subject matter expertise into the models and exception-handling automation. We may not necessarily have upfront information about what the most important parts of your environment are. Instead, we can stop and let the user provide some judgment. It’s truly about messaging to the user the different alternative approaches that they can take. As we see exceptions happening, we can provide those recommendations in a clean and interpretable way, so [the end user] can bring context to bear that we don’t necessarily have ourselves.

Gardner: And the timing for these advanced IT operations services is very auspicious. Just as we’re now able to extend intelligence, we’re also at the point where we have end-to-end requirements – from the edge, to the cloud, and back to the data center.

And under such a hybrid IT approach, we are also facing a great need for general digital transformation in businesses, especially as they seek to be agile and best react to the COVID-19 pandemic. Are we able yet to apply HPE InfoSight across such a horizontal architecture problem? How far can it go?

Seeing the future: End-to-end visibility

Mehta: Just to continue from where David started, part of our limitation so far has been from where we began. We started out in storage, and then as Nimble became part of HPE, we expanded it to compute resources. We targeted hypervisors; we are expanding it now to applications. To really fix problems, you need to have end-to-end visibility. And so that is our goal, to analyze, identify, and fix problems end-to-end.

That is one of the axes of development we’re pursuing. The other axis of development is that things are just becoming more and more complex. As businesses require their IT infrastructure to become highly adaptable, they also need scalability, self-healing, and enhanced performance. To achieve this, there is greater and greater complexity. And part of that complexity has been driven by really poor utilization of resources.

Go back 20 years and we had standalone compute and storage machines that were not individually very well-utilized. Then you had virtualization come along, and virtualization gave you much higher utilization -- but it added a whole layer of complexity. You had one machine, but now you could have 10 VMs in that one place.

Now, we have containers coming out, and that’s going to further increase complexity by a factor of 10. And right on the horizon, we have serverless computing, which will increase the complexity another order of magnitude.

So, the complexity is increasing, the interconnectedness is increasing, and yet the demands on businesses to stay agile and competitive and scalable are also increasing. It’s really hard for IT administrators to stay on top of this. And that’s why you need end-to-end automation and to collect all of the data to actually figure out what is going on. We have a lot of work cut out for us.
 
There is another area of research, and David spends a lot of time working on this, which is you really want to avoid false positives. That is a big problem with lots of tools. They provide so many false positives that people just turn them off. Instead, we need to work through all of your data to actually say, “Hey, this is a recommendation that you really should pay attention to.” That requires a lot of technology, a lot of ML, and a lot of data science experience to separate the wheat from the chaff.

 

One of the things that’s happened with the COVID-19 pandemic response is the need for very quick response stats. For example, people have had to quickly set up web sites for contact tracing, reporting on the disease, and for vaccine use. That shows how quickly people now need digital solutions -- and it’s just not possible without serious automation.

Gardner: Varun just laid out the complexity and the demands for both the business and the technology. It sounds like a problem that mere mortals cannot solve. So how are we helping those mere mortals to bring AI to bear in a way that allows them to benefit – but, as Varun also pointed out, allows them to trust that technology and use it to its full potential?

Complexity requires automated assistance

Adamson: The point Varun is making is key. If you are talking about complexity, we’re well beyond the point where people could realistically expect to log in to each machine to find, analyze, or manage exceptions that happen across this ever-growing, complex regime.

Even if you’re at a place where you have the observability solved, and you’re monitoring all of these moving parts together in one place -- even then, it easily becomes overwhelming, with pages and pages of dashboards. You couldn’t employ enough people to monitor and act to spot everything that you need to be spotting.

You need to be able to trust automated exception [finding] methods to handle the scope and complexity of what people are dealing with now. So that means doing a few things.

People will often start with naïve thresholds. They create manual thresholds to give alerts to handle really critical issues, such as all the servers going down.

But there are often more subtle issues that show up that you wouldn’t necessarily have anticipated setting a threshold for. Or maybe your threshold isn’t right. It depends on context. Maybe the metrics that you’re looking at are just the raw metrics you’re pulling out of the system and aren’t even the metrics that give a reliable signal.


What we see from the data science side is that a lot of these problems are multi-dimensional. There isn’t just one metric that you could set a threshold on to get a good, reliable alert. So how do you do that right?
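To make the multi-dimensional point concrete, here is a minimal sketch in Python. It is an assumption-laden illustration, not the technique HPE InfoSight uses: the Mahalanobis distance is swapped in as one common way to score a point against the joint behavior of two metrics, and the metric names, sample values, and the 9 ms threshold are all hypothetical.

```python
from statistics import mean


def mahalanobis_2d(history, point):
    """Distance of `point` from the joint distribution of two correlated metrics."""
    xs = [p[0] for p in history]
    ys = [p[1] for p in history]
    mx, my = mean(xs), mean(ys)
    n = len(history)
    # Sample variances and covariance of the two metrics.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in history) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form with the inverse of the 2x2 covariance matrix.
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return d2 ** 0.5


# Hypothetical history: (IOPS in thousands, latency in ms). Latency rises with load.
history = [(10, 1.0), (20, 2.1), (30, 2.9), (40, 4.2),
           (50, 4.9), (60, 6.1), (70, 6.8), (80, 8.2)]

busy_but_normal = (80, 8.0)   # high latency, explained by high load
quiet_but_slow = (20, 7.0)    # high latency at low load: the suspicious case

LATENCY_THRESHOLD_MS = 9.0    # hypothetical single-metric threshold
for label, sample in [("busy", busy_but_normal), ("quiet", quiet_but_slow)]:
    threshold_alert = sample[1] > LATENCY_THRESHOLD_MS
    score = mahalanobis_2d(history, sample)
    print(label, "threshold alert:", threshold_alert, "joint score:", round(score, 1))
```

Neither sample trips the latency threshold, yet the joint score for the quiet-but-slow sample is far larger than for the busy one, which is the kind of signal a single threshold cannot provide.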

 

For the problems that IT support provides to us, we apply automation and we move down the Pareto chart to solve things in priority of importance. We also turn to ML models. In some of these cases, we can train a model from the installed base and use a peer-learning approach, where we understand the correlations between problem states and indicator variables well enough so that we can identify a root cause for different customers and different issues.

Sometimes though, if the issue is rare enough, scanning the installed base isn’t going to give us a high enough signal-to-noise ratio. Then we can take some of these curated examples from support and do a semi-supervised loop. We basically say, “We have three examples that are known. We’re going to train a model on them.” Maybe it’s a few tens of thousands of data points, but it’s still only three examples, so there’s co-correlation that we are worried about.


In that case we say: “Let me go fishing in that installed base with these examples and pull back what else gets flagged.” Then we can turn those back over to our support subject matter experts and say, “Which of these really look right?” And in that way, you can move past the fact that your starting data set of examples is very small and you can use semi-supervised training to develop a more robust model to identify the issues.
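The loop Adamson outlines can be sketched as follows. This is a toy illustration only: a nearest-centroid rule stands in for whatever model the team actually trains, and the function names, the `radius` parameter, the feature values, and the simulated expert are all hypothetical.

```python
from math import dist
from statistics import mean


def centroid(points):
    """Mean position of the known examples in feature space."""
    return tuple(mean(axis) for axis in zip(*points))


def flag_candidates(known, pool, radius):
    """'Go fishing': return unlabeled points that look like the known examples."""
    center = centroid(known)
    return [p for p in pool if dist(p, center) <= radius]


def semi_supervised_loop(seed_examples, unlabeled, expert_confirms, radius, rounds=3):
    """Grow a small labeled set by alternating model flags with expert review."""
    known = list(seed_examples)
    pool = list(unlabeled)
    for _ in range(rounds):
        candidates = flag_candidates(known, pool, radius)
        if not candidates:
            break
        # Support experts answer: "Which of these really look right?"
        confirmed = [p for p in candidates if expert_confirms(p)]
        known.extend(confirmed)
        # Reviewed points leave the pool so experts are not asked twice.
        pool = [p for p in pool if p not in candidates]
    return known


# Three known examples of a rare issue, as two hypothetical features per system.
seeds = [(0.9, 0.8), (1.0, 1.0), (1.1, 0.9)]
installed_base = [(1.2, 1.1), (0.8, 0.7), (1.3, 0.5), (5.0, 5.0)]


def simulated_expert(p):
    """Stand-in for human review: accepts cases whose second feature is high enough."""
    return p[1] >= 0.7


training_set = semi_supervised_loop(seeds, installed_base, simulated_expert, radius=0.6)
print(len(training_set))  # 5: the three seeds plus two confirmed finds
```

The flagged point that the expert rejects never enters the training set, which is how the human review keeps a tiny starting set from steering the model toward false positives.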

Gardner: As you are refining and improving these models, one of the benefits in being a part of HPE is to access growing data sets across entire industries, regions, and in fact the globe. So, Varun, what is the advantage of being part of HPE and extending those datasets to allow for the budding models to become even more accurate and powerful over time?

Gain a global point of view

Mehta: Being part of HPE has enabled us to leapfrog our competition. As I said, our roots are in storage, but really storage is just the foundation of where things are located in an organization. There is compute, networking, hypervisors, operating systems, and applications. With HPE, we certainly now cover the base infrastructure, which is storage followed by compute. At some point we will bring in networking. We already have hypervisor monitoring, and we are actively working on application monitoring.

HPE has allowed us to radically increase the scope of what we can look at, which also means we can radically improve the quality of the solutions we offer to our customers. And so it’s been a win-win solution, both for HPE where we can offer a lot of different insights into our products, and for our customers where we can offer them faster solutions to more kinds of problems.

Gardner: David, anything more to offer on the depth, breadth, and scope of data as it’s helping you improve the models?

Adamson: I certainly agree with everything that Varun said. The one thing I might add is in the feedback we’ve received over time. And that is, one of the key things in making the notifications possible is getting us as close as possible to the customer experience of the applications and services running on the infrastructure.

We’ve done a lot of work to make sure we identify what look like meaningful problems. But we’re fundamentally limited if the scope of what we measure is only at the storage or hypervisor layer. So gaining additional measurements from the applications themselves is going to give us the ability to differentiate ourselves, to find the important exceptions to the end user, what they really want to take action on. That’s critical for us -- not sending people alerts they are not interested in but making sure we find the events that are truly business-critical.
 

Gardner: And as we think about the extensibility of the solution -- extending past storage into compute, ultimately networking, and applications -- there is the need to deal with the heterogeneity of architecture. So multicloud, hybrid cloud, edge-to-cloud, and many edges to cloud. Has HPE InfoSight been designed in a way to extend it across different IT topologies?

Across all architecture

Mehta: At heart, we are building a big data warehouse. You know, part of the challenge is that we’ve had this explosion in the amount of data that we can bring home. For the last 10 years, since InfoSight was first developed, the tools have gotten a lot more powerful. What we now want to do is take advantage of those tools so we can bring in more data and provide even better analytics.

The first step is to deal with all of these use cases. Beyond that, there will probably be custom solutions. For example, you talked about edge-to-cloud. There will be locations where you have good bandwidth, such as a colocation center, and you can send back large amounts of data. But if you’re sitting as the only compute in a large retail store like a Home Depot, for example, or a McDonald’s, then the bandwidth back is going to be limited. You have to live within that and still provide effective monitoring. So I’m sure we will have to make some adjustments as we widen our scope, but the key is having a really strong foundation and that’s what we’re working on right now.

Gardner: David, anything more to offer on the extensibility across different types of architecture, of analyzing the different sources of analytics?

Adamson: Yes, originally, when we were storage-focused and grew to the hypervisor level, we discovered some things about the way we keep our data organized. If we made it more modular, we could make it easier to write simple rules and build complex models to keep turnaround time fast. We developed some experience and so we’ve taken that and applied it in the most recent release of recommendations into our customer portal.


We’ve modularized our data model even further to help us support more use cases from environments that may or may not have specific components. Historically, we’ve relied on having Nimble Storage as a hub for everything to be collected. But we can’t rely on that anymore. We want to be able to monitor environments that don’t necessarily have that particular storage device, and we may have to support various combinations of HPE products and other non-HPE applications.

Modularizing our data model to truly accommodate that is something we have started down the path toward, and I think we’re making good strides.

The other piece is in terms of the data science. We’re trying to leverage longitudinal data as much as possible, but we want to make sure we have a sufficient set of meaningful ML offerings. So we’re looking at unsupervised learning capabilities that we can apply to environments for which we don’t have a critical mass of data yet, especially as we onboard monitoring for new applications. That’s been quite exciting to work on.

Gardner: We’ve been talking a lot about the HPE InfoSight technology, but there also has to be considerations for culture. A big part of digital transformation is getting silos between people broken down.

Is there a cultural silo between the data scientists and the IT operations people? Are we able to get the IT operations people to better understand what data science can do for them and their jobs? And perhaps, also allow the data scientists to understand the requirements of a modern, complex IT operations organization? How is it going between these two groups, and how well are they melding?

IT support and data science team up

Adamson: One of the things that Nimble did well from the get-go was have tight coupling between the IT support engineers and the data science team. The support engineers were fielding the calls from the IT operations guys. They had their fingers on the pulse of what was most important. That meant not only building features that would help our support engineers solve their escalations more quickly, but also things that we can productize for our customers to get value from directly.

Gardner: One of the great ways for people to better understand a solution approach like HPE InfoSight is through examples.  Do we have any instances that help people understand what it can do, but also the paybacks? Do we have metrics of success when it comes to employing HPE InfoSight in a complex IT operations environment?

Mehta: One of the examples I like to refer to was fairly early in our history but had a big impact. It was at the University Hospital of Basel in Switzerland. They had installed a new version of VMware, and a few weeks afterward things started going horribly wrong with their implementation that included a Nimble Storage device. They called VMware and VMware couldn’t figure it out. Eventually they called our support team and, using InfoSight, our support team was able to figure it out really quickly. The problem turned out to be a result of a new version of VMware. If there was a holdup in the networking, some sort of bottleneck in their networking infrastructure, this VMware version would try really hard to get the data through.

So instead of submitting each write to the storage array once, it would try 64 times. Suddenly, their traffic went up by 64 times. There was a lot of pounding on the network, pounding on the storage system, and we were able to tell with our analytics that, “Hey, this traffic is going up by a huge amount.” As we tracked it back, it pointed to the new version of VMware that had been loaded. We then connected with the VMware support team and worked very closely with all of our partners to identify this bug, which VMware very promptly fixed. But, as you know, it takes time for these fixes to roll out to the field.

We were able to preemptively alert other people who had the same combination of VMware on Nimble Storage and say, “Guys, you should either upgrade to this new patch that VMware has made or just be aware that you are susceptible to this problem.”

So that’s a great example of how our analytics was able to find a problem, get it fixed very quickly -- quicker than any other means possible -- and then prevent others from seeing the same problem.
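As a rough illustration of how a jump like the 64-fold traffic increase in this story might be spotted, the sketch below compares recent write volume with a historical baseline. It is a simplified assumption, not a description of the analytics Mehta's team ran: the function name, the sample numbers, and the alert multiplier of 5 are hypothetical.

```python
from statistics import median


def traffic_multiplier(baseline, recent):
    """How many times larger typical recent traffic is than typical baseline traffic."""
    return median(recent) / median(baseline)


# Hypothetical writes per minute before and after a hypervisor upgrade.
baseline_writes = [1000, 1100, 950, 1050, 1000]
recent_writes = [64000, 66000, 63000]

ALERT_MULTIPLIER = 5  # hypothetical: flag anything more than 5x the baseline

multiplier = traffic_multiplier(baseline_writes, recent_writes)
if multiplier > ALERT_MULTIPLIER:
    print(f"Write traffic is {multiplier:.0f}x the baseline; check recent changes")
```

Medians are used rather than means so that a single noisy minute in either window does not dominate the comparison.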

Gardner: David, what are some of your favorite examples of demonstrating the power and versatility of HPE InfoSight?

Adamson: One that comes to mind was the first time we turned to an exception-based model that we had to train. We had been building infrastructure designed to learn across our installed base to find common resource bottlenecks and identify and rank those very well. We had that in place, but we came across a problem that support was trying to write a signature for. It was basically a drive bandwidth issue.

But we were having trouble writing a signature that would identify the issue reliably. We had to turn to an ML approach because it was fundamentally a multidimensional problem. If we looked across, we had probably 10 to 20 different metrics that we tracked per drive per minute on each system. We needed to, from those metrics, come up with a good understanding of the probability that this was the biggest bottleneck on the system. This was not a problem we could solve by just setting a threshold.

So we had to really go in and say, “We’re going to label known examples of these situations. We’re going to build the sort of tooling to allow us to do that, and we’re going to put ourselves in a regime where we can train on these examples and initiate that semi-supervised loop.”

We actually had two to three customers that hit that specific issue. By the time we wanted to put that in place, we were able to find a few more just through modeling. But that set us up to start identifying other exceptions in the same way.

We’ve been able to redeploy that pattern now several times to several different problems and solve those issues in an automated way, so we don’t have to keep diagnosing the same known flavors of problems repeatedly in the future.

Gardner: What comes next? How will AI impact IT operations over time? Varun, why are you optimistic about the future?

Software eats the world 

Mehta: I think having a machine in the loop is going to be required. As I pointed out earlier, complexity is increasing by leaps and bounds. We are going from virtualization to containers to serverless. The number of applications keeps increasing and demand on every industry keeps increasing. 

Andreessen Horowitz, a famous venture capital firm, once said, “Software is eating the world,” and really, it is true. Everything is becoming tied to a piece of software. The complexity of that is just huge. The only way to manage this and make sure everything keeps working is to use machines.

That’s where the challenge and opportunity is. Because there is so much to keep track of, one of the fundamental challenges is to make sure you don’t have too many false positives. You want to make sure you alert only when there is a need to alert. It is an ongoing area of research.

There’s a big future in terms of the need for our solutions. There’s plenty of work to keep us busy to make sure we provide the appropriate solutions. So I’m really looking forward to it.


There’s also another axis to this. So far, people have stayed in the monitoring and analytics loop and it’s like self-driving cars. We’re not yet ready for machines to take over control of our cars. We get plenty of analytics from the machines. We have backup cameras. We have radars in front that alert us if the car in front is braking too quickly, but the cars aren’t yet driving themselves.

 

It’s all about analytics, yet we haven’t graduated from analytics to control. I think that too is something that you can expect to see in the future of AIOps once the analytics get really good, and once the false positives go away. You will see things moving from analytics to control. So lots of really cool stuff ahead of us in this space.

Gardner: David, where do you see HPE InfoSight becoming more of a game changer and even transforming the end-to-end customer experience where people will see a dramatic improvement in how they interact with businesses?

Adamson: Our guiding light in terms of exception handling is making sure that not only are we providing ML models that have good precision and recall, but we’re making recommendations and statements in a timely manner that come only when they’re needed -- regardless of the complexity.

A lot of hard work is being put into making sure we make those recommendation statements as actionable and standalone as possible. We’re building a differentiator through the fact that we maintain a focus on delivering a clean narrative, a very clear-cut, “human readable text” set of recommendations. 

And that has the potential to save a lot of people a lot of time in terms of hunting, pecking, and worrying about what’s unseen and going on in their environments.
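For readers less familiar with the precision and recall Adamson refers to, the standard definitions are: precision is the share of alerts that pointed at a real problem, and recall is the share of real problems that received an alert. A minimal sketch, with hypothetical system names and a hypothetical helper function:

```python
def precision_recall(alerts, real_problems):
    """Return (precision, recall) for a set of alerts against known real problems."""
    alerts, real_problems = set(alerts), set(real_problems)
    true_positives = len(alerts & real_problems)
    # Precision: of everything we alerted on, how much was real?
    precision = true_positives / len(alerts) if alerts else 0.0
    # Recall: of everything real, how much did we alert on?
    recall = true_positives / len(real_problems) if real_problems else 0.0
    return precision, recall


# Hypothetical week: four systems were flagged, three actually had the problem.
alerted = {"sys-1", "sys-2", "sys-3", "sys-4"}
actual = {"sys-1", "sys-2", "sys-5"}

precision, recall = precision_recall(alerted, actual)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.50 recall=0.67
```

Low precision corresponds to the false positives that lead people to turn a tool off, while low recall corresponds to business-critical events that go unnoticed.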

Gardner: Varun, how should enterprise IT organizations prepare now for what’s coming with AIOps and automation? What might they do to be in a better position to leverage and exploit these technologies even as they evolve?

Pick up new tools

Mehta: My advice to organizations is to buy into this. Automation is coming. Too often we see people stuck in the old ways of doing things. They could potentially save themselves a lot of time and effort by moving to more modern tools. I recommend that IT organizations make use of the new tools that are available.


HPE InfoSight is generally available for free when you buy an HPE product, sometimes with only the support contract. So make use of the resources. Look at the literature on HPE InfoSight. It is one of those tools that can be fire-and-forget, meaning you turn it on and then you don’t have to worry about it anymore.

It’s the best kind of tool because we will come back to you and tell you if there’s anything you need to be aware of. So that would be the primary advice I would have, which is to get familiar with these automation tools and analytics tools and start using them.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: Hewlett Packard Enterprise.
