Tuesday, October 26, 2021

Now’s the time for more industries to adopt a culture of operational resilience

In the last BriefingsDirect sustainable business innovation discussion, we explored how operational resiliency has become a top priority in the increasingly interconnected financial services sector.

We now expand our focus to explore the best ways to anticipate, plan for, and swiftly implement the means for nearly any business to avoid disruption.

New techniques allow for rapid responses to many of the most pressing threats. By predefining root causes and implementing advance responses, many businesses can create a culture of safer and sustained operations.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

To learn more about the many ways that businesses can reach a high level of assured business availability despite persistent threats, please welcome Steve Yon, Executive Director of the EY ServiceNow Practice, and Andrew Zarenski, Senior Manager and ServiceNow Innovation Leader at EY. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Steve, our last chat explored how financial firms are adjusting to heightened threats and increased regulation by implementing operational resiliency plans and platforms. But with so many industries disrupted these days in so many ways, is there a need for a broader adoption of operational resiliency best practices?

Yon: Yes, Dana. Just as we discussed, the pandemic has widened people’s eyes -- not only in financial services but across other industries. And now, with hurricane season and those impacts, we’re continuing to see strong interest to improve operational resiliency capabilities within many firms. Being able to continuously serve clients is how the world works – and it’s not just about technology.

Gardner: What has EY done specifically to make operational resiliency a horizontal capability, if you will, that isn’t specific to any vertical industry?

Resilience solutions for all sectors

Yon: The platform we built the solution on is an integration and automation platform. We set it up in anticipation of, and with the full knowledge that it’s going to become a horizontal capability.

When you think about resiliency and doing work in operational models, it’s a verb-based system, right? How are you going to do it? How are you going to serve? How are you going to manage? How are you going to change, modify, and adjust to immediate recovery? All of those verbs are what make resiliency happen.

What differentiates one business sector from another aren’t those verbs. Those are immutable. It’s the nouns that change from sector to sector. So, focusing on all the same verbs, that same perspective we looked at within financial services, is equally as integratable when you think about telecommunications or power.

With financial services, the nouns might be things around trading and how you keep that capability always moving. Or payments. How do I keep those going? In an energy context, the nouns would be more about power distribution, capacity, and things like that.

With our solutions we want to ensure that you don’t close any doors by creating stove pipes -- because the nature of the interconnectedness of the world is not one of stove pipes. It’s one of huge cross-integration and horizontal integration. And when information and knowledge are set up in a system designed appropriately, it benefits whichever firm or whatever sector you’re in.

Gardner: You’ve created your platform and solution for complex, global companies. But does this operational resiliency capability also scale down? Should small- to medium-size businesses (SMBs) be thinking about this as well?

Yon: Yes. Any firm that cares about being able to operate in the event of potential disruptions, if that’s something meaningful to them, especially in the more highly regulated industries, then the expectation of resiliency needs to be there.

How to Build Resiliency into Operations

We’re seeing resiliency in the top five concerns for board-level folks. They need a solution that can scale up and down. You cannot take a science fair project and impact an industry nor provide value in the quick way these firms are looking for.

The idea is to be able to try it out and experiment. And when they figure out exactly how to calibrate the solution for their culture and level of complexity, then they can rinse, repeat, and replicate to scale it out. Your comment on being able to start small and grow large is absolutely true. It’s a guiding principle in any operational resiliency solution.

Gardner: It sounds like there are multiple adoption vectors, too. You might have a risk officer maturity level, or you might just have a new regulatory hurdle and that’s your on-ramp.

Are there a variety of different personas within organizations that should be thinking about how to begin that crawl, walk, run adoption for business continuity?

Yon: Yes. We think a proper solution should be persona-based. Am I talking to someone with responsibilities with risk, resilience, and compliance? Or am I talking to someone at the board level? Am I talking to a business service owner?

And the solution should also be inclusive of all the people who are remediating the problems on the operational side, and so unifying that entire perspective. That’s irrespective of how your firm may work. It focuses broadly on aligning the people who need to build things at the top level, to understanding the customer experience perspective, and to know what’s going on and how things are being remediated. Unifying with those operational folks is exceptionally important.

The capability to customize a view, if you will, for each of those personas -- irrespective of their titles – in a standard way so they are all able to view, monitor, and manage a disruption, or an avoidance of a disruption, is critical.

Gardner: Because the solution is built on a process and workflow platform, ServiceNow, which is highly integratable, it sounds like you can bring in third parties specific to many industries. How well does this solution augment an existing ecosystem of partners?

Yon: ServiceNow is a market-ubiquitous capability. When you look under the hood of most firms, you’ll find a workflow process capability there. With that comes the connectivity and framework by which you can have transparency into all the assets and actors.

What better platform to then develop a synthesis view of, “Hey, here’s where I’m now detecting the signal that could be something that’s a disruption”? That then allows you to be able to automatically light up a business continuity plan (BCP) and put it into action before a problem actually occurs.

We integrate not only with ServiceNow, but with any other system that can throw a signal -- whether it’s a facilities-based system, order management system, or a human resources system. That includes anything a firm defines as a critical business service, and all the actors and assets that participate in it, along with what state they need for it to be considered valid.

All of that needs to be ingested and synthesized to determine if there’s an issue that needs to be monitored and then a failover plan enacted.
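As a rough illustration of the idea Yon describes (a critical business service, the assets that participate in it, and the state each must be in for the service to be considered valid), here is a minimal Python sketch. The class, function, asset, and state names are all hypothetical and are not taken from the EY or ServiceNow products.

```python
from dataclasses import dataclass, field


@dataclass
class CriticalService:
    """Hypothetical model of a critical business service (illustrative only)."""
    name: str
    # asset name -> state that asset must be in for the service to be valid
    required_states: dict[str, str] = field(default_factory=dict)


def ingest(signals: list[dict]) -> dict[str, str]:
    """Keep only the latest reported state per asset (signals arrive in order)."""
    latest: dict[str, str] = {}
    for signal in signals:
        latest[signal["asset"]] = signal["state"]
    return latest


def assets_out_of_state(service: CriticalService, observed: dict[str, str]) -> list[str]:
    """Return the assets whose observed state differs from the required state.

    An asset that has reported no signal at all is treated as out of state.
    """
    return sorted(
        asset
        for asset, required in service.required_states.items()
        if observed.get(asset, "unknown") != required
    )


# Signals from a facilities system, an order management system, and an HR system.
payments = CriticalService(
    "payments",
    {"order-db": "up", "hq-elevator": "running", "hr-roster": "staffed"},
)
signals = [
    {"asset": "order-db", "state": "up"},
    {"asset": "hq-elevator", "state": "running"},
    {"asset": "hq-elevator", "state": "stopped"},
    {"asset": "hr-roster", "state": "staffed"},
]
problems = assets_out_of_state(payments, ingest(signals))
```

In this sketch a non-empty `problems` list is the cue to start monitoring the service and, if needed, enact a failover plan.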

Gardner: Andrew, please tell us about the core EY ServiceNow alliance operational resilience offering.

Detect disruptions with data

Zarenski: Corporations already have so many mitigation policies in place that understanding and responding to disruptions in real time is obviously essential. Everyone likes to think about the use case of plugging cybersecurity holes as soon as possible to prevent hackers from taking advantage of an exploit. That’s a relatively easy, relatable scenario. But think about a physical office service. For example, an elevator goes down that then prevents your employees from getting to their desks or people in a financial firm getting to their trading floor.

Understanding that disruption is just as important as understanding a cybersecurity threat or if someone has compromised one of your systems or processes. Detection today is generally harder than it’s been in the past because corporations’ physical and logical assets are so fragmented. They’re hard to track in that or any building.

Steve alluded to how service mapping, to understand what assets support services, is incredibly difficult. Detection has become very complicated, and the older ways of picking up the phone just aren’t enough because most corporations don’t know what the office is supporting. Having that concrete business service map and understanding that logical mapping of assets to services lets a solution such as this help operators or chief risk officers (CROs) respond in near real time, which is the new industry standard.

Gardner: So, on one hand, it’s more difficult than ever. But the good news is that nowadays there’s so much more data available. There’s telemetry, edge computing, and sensors. So, while we have a tougher challenge to detect disruptions, we’re also getting some help from the technology side.

Zarenski: Yes, absolutely. And everyone thinks of this generally as just a technology exercise, but there’s so much more to it than the tech. There is the process. The key to enterprise resiliency is understanding what the services are both internally to employees as well as externally to the customers.

We find that most of our clients are just beginning to head down the journey of what we call business service mapping to identify and understand the critical services ahead of time. What are my five critical services? How can I build up those maps to show the quick wins and understand how can I be resilient today? How can I understand those sensors? What are the networks? What objects let me understand what a disruption is and have a dashboard show services that flip from green to red or yellow when something goes wrong?

Yon: And, Dana, there’s so much signal out there to let you know what’s going on. But to be able to cut through and synthesize those material aspects of what’s truly important is what makes this solution fit for duty and usable. It’s not a big processing sink and does not take a lot of time to get done.

A business needs to know what to focus on, from what you imprint the system with to how you define your service map and how you calibrate what the signals represent. Those have to be the minimal number of things you want to ingest and synthesize to provide good, fast telemetry. That’s where the value comes from, knowing how to define it best so the system works in a very fast and efficient way.

Gardner: Clearly, operational resiliency is not something you just buy in a box and deploy. There’s technology, business service mapping, and there’s also culture. Do you put in the technology and processes and then hope you develop a culture of resiliency? Or do you try to instill a culture of resiliency and then put in the ingredients? What’s the synergy between them?

Cultural shift from reactivity

Zarenski: There is synergy, for sure. Obviously, every corporation wants to have a culture of resilience. But at the same time, it’s hard to get there without the enabling technology. If you think about the solution that we at EY have developed, it takes resiliency beyond being just a reactive solution.

It’s easy for a corporation to understand the need for having a BCP or disaster recovery plan in place. That’s generally the first line of enabling a resilient culture. But bringing in another layer of technology that enables investment in the things that are listening for disruption? That is the next layer.

If you look at financial institutions, they all have different tools and processes that look at things like trade execution volume, and so forth. One person may have a system looking to see if trade execution volume has a significant blip and can then compare that to prior history. But to understand if that dip means something is wrong is not an easy process. Using EY’s operational resilience tool helps you understand the patterns, catalog them, and bring in technology that ultimately further enables that culture of resilience.
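To make the "compare that to prior history" step concrete, here is a minimal Python sketch that flags a trade execution volume reading as an unusual dip when it falls far below its historical mean. The z-score test, the threshold of 3.0, and the sample figures are assumptions chosen for illustration; the discussion does not say how the EY tool actually catalogs patterns.

```python
from statistics import mean, stdev


def is_unusual_dip(history: list[float], current: float, z_threshold: float = 3.0) -> bool:
    """Return True if `current` sits more than `z_threshold` sample standard
    deviations below the mean of `history`.

    Illustrative only: a real system would also account for time of day,
    market holidays, and other naturally occurring patterns.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical observations")
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # Perfectly flat history: any drop at all is unusual.
        return current < mu
    return (mu - current) / sigma > z_threshold


# Hypothetical trade counts for the same interval on prior days.
prior_volumes = [1000, 1040, 980, 1010, 995, 1025]
dip_detected = is_unusual_dip(prior_volumes, 700)
normal_reading = is_unusual_dip(prior_volumes, 990)
```

A check like this only says that a reading is statistically unusual. Deciding whether the dip means something is actually wrong still needs the service context described in the discussion.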

Yon: Yes, you want to know if something like that blip happens naturally or not. I liken this back to the days when we went through the evolution from quality control (QC)-oriented thinking to quality assurance (QA)-oriented thinking. QC lets you test stuff out, and lets you know what to do in the event of a failure. That’s what a BCP is all about -- when something happens, you pick up and follow the playbook. And there you go.

QA, which went through some significant headwinds, is about embedding that thought process into the very fabric of your planning and the design to enable the outcomes you really want. If there is QA, you can avoid disruptions.

And that’s exactly the same perspective we’re applying here. Let’s think about how continuity management and the BCP are put together. Yes, they exist, but you know what? When you’re using them, you’re down. Value destruction is actually occurring.

So, think about this culture of resilience as analogous to the evolution to QA, which is, “Be more predictive and know what I’m going to be dealing with.” That is better than, “Test it out and know how to respond later.” I can actually get a heck of a lot better value and keep myself off the front page of the newspaper if I am more thoughtful in the first place.

That also goes back to the earlier point of how to accelerate time to value. That’s why Andrew was asking, “Hey, what are your five critical business services?” This is where we start off. Let’s pick one and find a way to make it work and get lasting value from that.

The best way to get people to change is quickly use data and show an outcome. That’s difficult to disagree with.

Gardner: Andrew, what are the key attributes of the EY ServiceNow resilience solution that helps get organizations past firefighting mode and more into a forward-looking, intelligent, and resilient culture?

React, respond, and reduce risk

Zarenski: The key is preventative and proactive decision support. Now, if you think about what preventative decision support means, the capability lets you build in thresholds for when a service may be approaching a lag in its operational resilience. For example, server capacity may be decreasing for a web site that delivers an essential business service to external customers. As that capacity decreases, the service would begin to flash yellow as it approaches a service threshold. Therefore, someone can be intelligent and quickly do something about it.

But you can do that for virtually any service by setting policies in the database layer to understand what the specific thresholds are. Secondly, broad transparency and visibility are very important.

We’re expanding the usefulness of data for the chief risk officer (CRO). They can log into the dashboard two or three times a day, look at their 10 or 15 critical business services, and all the subservices that support them, and understand the health of each one individually. In an ideal situation, they log in in the morning and see everything as green, then they log in at lunchtime, and see half the stuff as yellow. Then they are going to go do something about it. But they don’t need to drill into the data to understand that something is wrong, they can simply see the service, see the approaching threshold, and – boom – they call the service owner and make sure they take care of it.
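The green, yellow, and red behavior Zarenski describes can be sketched as a simple threshold policy. The following Python is a minimal illustration under assumed names and numbers (`ThresholdPolicy`, the 30 percent and 10 percent capacity thresholds, and the "worst status wins" roll-up are all hypothetical), not the product's actual policy model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdPolicy:
    """Hypothetical policy for a metric where lower values are worse,
    such as the fraction of server capacity remaining."""
    warn_at_or_below: float    # approaching the service threshold -> yellow
    breach_at_or_below: float  # service threshold crossed -> red


def metric_status(value: float, policy: ThresholdPolicy) -> str:
    """Map a single metric reading to a dashboard color."""
    if value <= policy.breach_at_or_below:
        return "red"
    if value <= policy.warn_at_or_below:
        return "yellow"
    return "green"


_SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def service_status(subservice_statuses: list[str]) -> str:
    """Roll subservice colors up to one service color: the worst one wins."""
    return max(subservice_statuses, key=_SEVERITY.__getitem__, default="green")


capacity_policy = ThresholdPolicy(warn_at_or_below=0.30, breach_at_or_below=0.10)
morning = metric_status(0.40, capacity_policy)    # plenty of headroom
lunchtime = metric_status(0.25, capacity_policy)  # approaching the threshold
overall = service_status([morning, lunchtime, "green"])
```

Rolling up with "worst status wins" keeps the top-level view conservative, so a CRO scanning 10 or 15 services sees a warning as soon as any supporting subservice approaches its threshold.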

Yon: By the way, Andrew, they can also just pick up their phone if they get a push notification that something’s askew, too.

Zarenski: Yes, exactly. The major incident response is built into the backend. Of course, we’re proactively allowing the CROs and service owners to understand that something’s gone wrong. Then, by very simply drilling into that alert, they will understand immediately which assets are broken, know the 10 people responsible for those assets, and immediately get them on the phone. Or they can set up a group chat, get them paged, or use any number of other ways to get the problem taken care of.

The key is offering not just the visibility into what’s gone wrong, but also the ability to react, respond, and have full traceability behind that response -- all in one platform. That really differentiates the solution from what else is in the market.

Gardner: It sounds like one of the key attributes is the user experience and interfaces that rapidly broaden the number of appropriate people and get them involved.

Zarenski: You’re spot on. Another extremely important part is the direct log and record of what people did to help fix the problem. Regulations require recording what the disruption was, but also recording every single step and every person who interacted with the disruption. That can then be reported on in the future should they want to learn from it or should regulators and auditors come in. This solution provides that capability all in one place.

Yon: Such post-disruption forensics are very important for a lot of reasons.

Zarenski: Yes, exactly. A regulator will be able to look back and ask the question, “Did this firm act reasonably with respect to its responsibility?”

Easy question, but tough to answer. You would need to go back and recreate your version of what the truth was. This traps the truth. It traps the sequence, and it makes the forensics on answering that question very simple.

Gardner: While we’re talking about the payoffs when you do operational resiliency correctly, what else do you get?

Yon: I’ll give you a couple. One is we don’t have to get a 3 am phone call because something has broken because someone is already working on the issue.

Another benefit impacts the “pull-the-plug test,” where once a year or two we hold our breath to determine if our BCPs are working and that we can recover. In that test, a long weekend is consumed with a Friday night fault or disconnection of something. And then we monitor the recovery and hope everything goes back to normal so we can resume business on the following Tuesday.

When we already understand what the critical business services are, we can quickly hone down essential causes and responses. When service orientation took hold, people bragged about how many services they had, perhaps as many as 900 services. Wow, that seems like a lot.

But are they all critical? Well, no, right? This solution allows you to materially keep what’s important in front of you so you can save money by not needing to drive the same level of focus across too wide of a beachfront.

Secondly, rather than force a test fault and pray, you can do simulations and tests in real time. “Do I think my resiliency strategy is working? Do I believe my resiliency machinery is fit for duty?” Well, now you can prove it, saying, “I know it is because I test this thing every quarter.”

You can frequently simulate all the different pieces, driving up the confidence with regulators, your leadership, and the auditors. That takes the nightmare out of your systems. These are but some of the other ancillary benefits that you get. They may seem intangible, but they’re very real. You can clean out unnecessary spend as well as unnecessary brand-impacting issues with the very people you need to prove your abilities to.

Gardner: Andrew, any other inputs on the different types of value you get when you do operational resiliency right?

Zarenski: If you do this right and set up your service mapping infrastructure correctly, we’ve had clients use this to do comparisons for how they might want to change their infrastructure. Having fully mapped out a digital twin of your business provides many more productivity and efficiency capabilities. That’s a prime example.

Gardner: Well, this year we’ve had many instances of how things can go very wrong -- from wildfires to floods, hurricanes, and problems with electric grids. As a timely use case, how would an organization in the throes of a natural disaster make use of this solution?

Prevent a data deep freeze

Zarenski: This specific use case stemmed from the deep freeze last winter in Dallas. It provides a real-life example. The same conditions can be translated over to hurricanes. Before the deep freeze hit back in the winter, we were ingesting signals from NOAA into the EY operational resiliency platform to understand and anticipate anomalies in temperatures in places that normally don’t see them.

We were able to run simulations in our platform for how some Dallas data centers were going to be hit by the deep freeze and how the power grid would be impacted. We could see every single physical asset being supported by that power grid and therefore understand how it might impact the business operations around the world.

There may be a server there that, in turn, supports servers in Hong Kong. Knowing that, we were able to prepare teams for a failover preemptively over to a data center in Chicago. That’s one example of how we can ingest data from multiple sources, tie that data to what the disruption may be, and be proactive about the response -- before that impact actually occurs.
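The Dallas example amounts to tracing which assets and services sit downstream of a threatened asset. The Python sketch below walks a "supports" graph breadth-first to list everything a failing power grid could affect. The graph contents and node names are invented for illustration and do not describe any real client infrastructure or the platform's internals.

```python
from collections import deque


def downstream_impact(supports: dict[str, list[str]], failed: str) -> set[str]:
    """Return every node that directly or transitively depends on `failed`.

    `supports` maps an asset to the assets or services it supports.
    Breadth-first traversal; the `seen` set guards against cycles.
    """
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in supports.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen - {failed}


# Hypothetical service map fragment.
supports = {
    "dallas-grid": ["dallas-dc"],
    "dallas-dc": ["dallas-server"],
    "dallas-server": ["hong-kong-servers"],
    "hong-kong-servers": ["trading"],
    "chicago-dc": ["chicago-server"],
}
at_risk = downstream_impact(supports, "dallas-grid")
```

Anything in `at_risk` is a candidate for a preemptive failover. In this invented map the Chicago assets are untouched by the Dallas grid, which is what makes Chicago a sensible failover target.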

Gardner: How broadly can these types of benefits go? What industries after power and energy should be considering these capabilities?

Yon: The most relevant ones are the regulated industries. So, finance, power, utilities, gas, and telecom. Those are the obvious ones. But other businesses need to ensure their firm is operational irrespective of whether it’s a regulatory expectation. The horizontal integration to offset disruption is still going to be important.

We’re also seeing interdependency across business sectors. So, talking to telecom, they’re like, “Yup, we need to be able to provide service. I want to be able to let people know when the service is going to go up when our power is down. But I have no visibility into what’s going on there.” So, sometimes the interdependencies cross sectors and industries, and those are the things that are now starting to be highlighted.

Understanding where those dependencies on other industries are can allow you to make better decisions on how you want to position yourself for what might be happening upstream so you can protect your downstream operations and clients.

It’s fascinating when we talk now about how each industry can gain transparency into the others, because there are clear interdependencies. Once that visibility happens, you’ll start to see firms and their ecosystem of suppliers leverage that transparency to their mutual benefit to reduce the impacts and the value disruption that may happen anywhere upstream.

Gardner: Andrew, how are organizations adopting this? Is it on a crawl-walk-run basis?

Map your service terrain

Zarenski: It all starts with identifying your critical services. And while that may seem simple at face value, it’s, in fact, not. By having such broad exposure in so many industries, we’ve developed initial service maps for what a financial institution is, or what an insurance institution looks like.

That head-start helps our clients gain a baseline to define their organizations from a service infrastructure standpoint. Once they have a baseline template, then they can map physical assets, along with the logical assets to those services.

Most organizations start with one or two critical services to prove out the use case. If you can prove out one or two, you can take that as a road show out to the rest of the organization. You’re basically setting yourself up for success because you’ve proven that it works.

Yon: This goes back to the earlier point about scale. You can put something together in a simple way, calibrating to what service you want to clear as resilient. And by calibrating what that service map looks like, you can optimize the spread of the service map, the coverage it provides, and the signals that it ingests. By doing so, you can synthesize its state right away and make very important decisions.

The cool thing about where the technology is now, we’re able to rapidly take advantage of that. You can create a service map and tomorrow you can add to it. It can evolve quickly over time.

You can have a simplistic view of what a service looks like internally and track that to see the nature of where faults enter the system and predict what might materialize in that service map, to see how that evolves with a different signal or an integration to another source system.

These organizations can gain continuous improvement, ensuring that they consistently raise the probability of avoiding disruptions. They can say, “I’m now resilient to the following types of faults,” and tick down that list. The business can make economic choices in terms of how complex it wants to build itself out to be able to answer the question, “Am I acting in a reasonable way for my shareholders, my employees, and for the industry? I’m not going to cause any systemic problems.”

Gardner: You know, there’s an additional pay back to focusing on resiliency that we haven’t delved into, and it gets back to the notion of culture. If you align multiple parts of your organization around the goal of resiliency, it forces people to work across siloes that they might not have easily forded in the past.

So, as we focus on a high-level objective like resilience, does that foster a broader culture of cooperation for different parts of the organization?

Responsible resiliency collaboration

Yon: It definitely does. Resiliency is becoming a sound engineering principle generally. It can be implemented in many different ways. It can be implemented not only with technology, but with product, people, machinery, and governance.

So many different people participate in the construction of an architectural capability like resiliency that it almost demands that collaboration occur. You can’t just do it from a silo. IT just can’t do this on their own. The compliance people can’t do this on their own. It’s not only a horizontal integration across the systems and the signals for which you detect where things are -- but it’s an integration of collaboration itself across those responsibility areas and the people who make it so.

Gardner: Andrew, what in the way the product is designed and used helps facilitate more cultural cooperation and collaboration?

Zarenski: Providing a capability for everyone to understand what’s going on is so important. For me to see that something going wrong in my business may impact someone else’s business gives a sense of shared responsibility. It gives you ownership in understanding the impacts across all the organizations.

Secondly, a lot of this all rolls up to being compliant with different regulations. We’re providing a capability for virtually anyone to support risk and compliance activities -- without even knowing that you’re supporting risk and compliance activities. It makes the job of compliance visual and easy to understand. That ultimately supports the downstream processes that your risk and compliance officers must perform -- but it also impacts and benefits the frontline workers. I think it gives everyone an important role in resiliency without them even knowing it.

Gardner: How do I start the process of getting this capability on-boarded in my company regardless of my persona?

Yon: The quick answer is to turn on the news. Resiliency and continual operation awareness are now at the board level. It’s one of the top-five priorities firms say are important for them to survive through the next 10 years.

Witness all the different things that are being thrown at us all -- whether it’s weather, geopolitical, or pandemic-related. The awareness is there. The interest is definitely there. Then the demand comes from that interest.

Based on the feedback and conversations we’re having with so many clients across so many industries, it is resonating with them. It’s now obvious that this needs to be looked at because turning your digital storefront off is no longer an option. We’ve had too many people see the impact of that over the past year.

And the nature of disruptions just keeps getting more complex. We’ve had near-death business experiences. They’ve had the wake-up call, and that was enough of a motivation to have awareness and interest in it that’s now moving us toward how to best fulfill it.

Gardner: A nice thing about our three-part series is we first focused on the critical timing around the financial industry. We’re talking more specifically today about the solution itself and its wider applicability.

The third part of our series will share the experiences of actual customers and explore how they went about the journey of getting that germ of operational resilience planted and then growing it within their company. Meanwhile, where can our audience go for more information and to learn more about how to make operational resiliency a culture, a technology, and a company-wide capability?

Yon: For those folks who already have responsibilities in this area, their industry trade shows, conversations, and dialogues are actively covering these issues. Second, for those who are EY or ServiceNow customers, talk to your team because they can lead you back to folks like Andrew and myself to confer about more specifics based on where you are on your journey.

Sponsor: ServiceNow and EY.

Tuesday, October 19, 2021

How FinTech innovator Razorpay uses open-source tracing to manage fast-changing APIs

The speed and complexity of microservices-intense applications often leave their developers in the dark. The creators too often struggle to track and visualize the actual underlying architecture of their distributed services.

The designers, builders, and testers of modern API-driven apps, therefore, need an ongoing and instant visibility capability into the rapidly changing data flows, integration points, and assemblages of internal and third-party services.

Thankfully, an open-source project to advance the sophisticated distributed tracing and observability platform called Hypertrace is helping.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

Stay with us here as BriefingsDirect explores the evolution and capabilities of Hypertrace and how an early adopter in the online payment suite business, Razorpay, has gained new insights and deeper understanding of their overall services components.

To learn how Hypertrace discovers, monitors, visualizes, and optimizes increasingly complex services architectures, please welcome Venkat Vaidhyanathan, Architect at Razorpay in Bangalore, India, and Jayesh Ahire, Founding Engineer at Traceable AI and Product Manager for Hypertrace. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Venkat, what does Razorpay do and why is tracing and understanding your services architecture so important?

Venkat: Razorpay’s mission is to enable frictionless banking and payment experiences by powering the entire financial infrastructure for businesses of all shapes and sizes. It’s a full-stack financial solution that enables thousands of small- to medium-sized enterprises (SMEs) and enterprises to accept, process, and disburse payments at scale.


Today, we process billions of dollars of payments from millions of businesses across India. As a leading payments provider, we have been the first to bring to market most of the major online innovations in payments for the last five years.

For the last two years, we have successfully curated neo banking and lending services. We have seen outstanding growth in the last five years and attracted close to $300 million-plus in funding from investors such as Sequoia, Tiger Global, Ribbit, Matrix Partners, and others.

One of the fundamental principles about designing Razorpay has been to build a largely API-driven ecosystem. We are a developer-first company. Our general principle of building is, “It is built by developers for developers,” which means that every single product we build is always going to be API-driven first. In that regard, we must ensure that our APIs are resilient. That they perform to the best and most optimum capacity is of extreme importance to us.

Gardner: What is it about being an API-driven organization that makes tracing and observability such an important undertaking?

Venkat: We are an extremely Agile organization. As a startup, we have an obsession around our customers. A focus on building quality products is paramount to creating the best user experience (UX).

Our customers have amazing stories around our projects, products, and ecosystem. We have worked through extreme times (for example, demonetization, and the Yes Bank outage), and that has helped our customers build a lot of trust in what we do -- and what we can do.

We have quickly taken up the challenge and turned the tables for most of our customers to build a lot of trust in the kinds of things we do.

After all, we are dealing with one of the most sensitive aspects of human lives, which is their money. So, in this regard, the resiliency, security, and all the useability parameters are extremely important for our success.

Gardner: Jayesh, why is Razorpay a good example of what businesses are facing when it comes to APIs? And what requirements for such users are you attempting to satisfy with your distributed tracing and observability platform?

Observability offers scale, insight, resilience

Ahire: Going back to the days when it all started, people began building applications using monoliths. And it was easier then to begin with monolithic applications to get the business moving.


But in recent times, that is not the only important thing for businesses. As we heard, Venkat needs scale and resiliency in the platform while building with APIs. Most modern organizations use microservices, which complicates these modern architectures. They become hard to manage, especially at large-scale organizations where you can have 100 to 300 microservices, with thousands of APIs communicating between those microservices.

It’s just hard now for businesses to have the visibility and observability to determine if they have any issues and to see if the APIs are performing as expected.

I use a list of four brief questions that every organization needs to answer at some point. Are their APIs:

  • Providing the functionality they are supposed to deliver?

  • Performing in the way they are supposed to?

  • Secure for their business users?

  • Understood across all their APIs and microservices uses?

They must understand if the APIs and microservices are performing up to the actual expectations and required functionality. They need something that can provide the answers to these questions, at the very least.

Observability helps answer these essential questions without having to open the black box and go to each service and every API. Instead, the instrumentation data provides those insights. You can ask questions of your system and it will give you the answers. You can ask, for example, how your system is performing -- and it will give you some answers. Such observability helps large-scale organizations keep up with the scale and with the increasing number of users. And that keeps the systems resilient.

Gardner: Venkat, what are your business imperatives for using Hypertrace? Is it for UX? What is the business case for gaining more observability in your services development?

Metrics, logs, and traces limit trouble

Venkat: There are three fundamental legs to what we define as modern observability. One part is with respect to metrics, the next part has to do with the logs, and the third part is in respect to the traces.

Up until recently, we had application performance monitoring (APM) systems that monitored some of these things, with a single place to gather some metrics and insights. However, as microservices grew wider in use, APMs are no longer necessarily the right way to do these things. For such metrics, a lot of work is already going on in the open-source ecosystem with respect to Prometheus and others. I wrote a blog about our journey into scaling our metrics platform to trillions of data points.

Once you can get logs -- whether it is from open-source ELK Stack [Elasticsearch, Logstash, and Kibana], or whether it is from a lot of platform as a service (PaaS) and software as a service (SaaS) log providers -- fundamentally the issue comes down to traces.

Now, traces can be visualized in a very primitive way, such as for instrumenting a particular piece of code to understand its behavior. It could be for a timing function, for example.
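As an editorial illustration of that primitive form of tracing, here is a minimal Python sketch of a timing wrapper. It assumes nothing about Hypertrace or any tracing SDK: the `SPANS` list, `timed_span`, and `lookup_merchant` are hypothetical names invented for this example, and a real tracer would export spans rather than keep them in memory.

```python
import time
from contextlib import contextmanager

# Hypothetical in-memory span store; a real tracer would export these instead.
SPANS = []


@contextmanager
def timed_span(name):
    """Record how long the wrapped block takes, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        SPANS.append({"name": name, "duration_ms": elapsed_ms})


def lookup_merchant(merchant_id):
    # Stand-in for a database or network call (hypothetical).
    with timed_span("db.lookup_merchant"):
        time.sleep(0.01)
        return {"id": merchant_id}
```

Each call to `lookup_merchant` appends one record to `SPANS`. Distributed tracing extends the same idea by linking such records across services so a whole request can be followed end to end.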

However, as microservices evolve, you’re talking about a lot more problems, such as how much time would a network call take? How much time would the database call take? Was my DNS request the biggest impediment? What really happened in the last mile?

And when you’re talking about an entire graph of services, it’s very important to know what particular point in the entire graph breaks down often – or doesn’t break down very often.

Understanding all these things, as Jayesh said, and asking the right questions cannot happen only by using metrics or just logs. They only give different slices of the problems. And it cannot happen only by using tracing, which also only gives a different slice of the problem.

In an ideal, nirvana world, you need to combine all these things and create a single place that can correlate these various things and allow a deep dive with respect to a specific component, module, function, system, query, or whatever. Being able to identify root causes and the mean time to detect (MTTD), these are some of the most paramount things that we probably need to worry about.

In complex, large-scale systems, things go wrong. Why things went wrong is one part, when did things go wrong is another part, and being able to arrive and fix things – the MTTD and the mean time to recovery (MTTR) -- those largely define the success of any business.
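To make the two measures concrete, the following Python sketch computes them from a small set of made-up incident records. It assumes that both MTTD and MTTR are measured from the moment the problem started; teams define these intervals differently, and the field names and timestamps here are invented for illustration.

```python
from datetime import datetime
from statistics import mean


def mean_minutes(incidents, start_key, end_key):
    """Average gap, in minutes, between two timestamps across incidents."""
    return mean(
        (i[end_key] - i[start_key]).total_seconds() / 60 for i in incidents
    )


# Made-up incident log for illustration only.
incidents = [
    {
        "started": datetime(2021, 10, 1, 10, 0),
        "detected": datetime(2021, 10, 1, 10, 12),
        "recovered": datetime(2021, 10, 1, 10, 40),
    },
    {
        "started": datetime(2021, 10, 8, 14, 0),
        "detected": datetime(2021, 10, 8, 14, 4),
        "recovered": datetime(2021, 10, 8, 14, 20),
    },
]

mttd = mean_minutes(incidents, "started", "detected")   # (12 + 4) / 2
mttr = mean_minutes(incidents, "started", "recovered")  # (40 + 20) / 2
```

With these sample records, `mttd` is 8 minutes and `mttr` is 30 minutes.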

We are just one of the many financial ecosystem providers. There are tons of providers in the world. So, the customer has many options to switch from one provider to another. For any business, how they react to these performance issues is the most important.

Observability tools like Hypertrace put us in control, rather than just leaving it to hypothesis.

Gardner: Jayesh, how does Hypertrace improve on such key performance controls as MTTD and MTTR? How is Hypertrace being used to cut down on that all important time to remediation that makes the user experience more competitive?

Tracing eases uncovering the unknown

Ahire: As Venkat pointed out, in these modern systems, there are too many unknown unknowns. Finding out what caused any problem at any point in time is hard.

At Hypertrace, in trying to help businesses, we present entity-focused, API-first views. Hypertrace provides a very detailed service dashboard, an overview, an out-of-the-box service overview. Such a backend API overview helps find what different services are talking to each other, how they are talking to each other, the interactions between the different services, and then what different APIs are talking to the services. It provides a list of APIs.

Hypertrace provides a single pane view into the services and API trace data. The insights gained from the trace data make it easier to find which API or service has some issue. That’s where the entity-first API view makes the most sense. The API dashboard helps people get to the issue very easily and helps reduce the MTTD and MTTR.

Venkat: Just to add to what Jayesh mentioned, in our world our ecosystem is internally a Kubernetes ecosystem. And Kubernetes is extremely dynamic in nature. You’re no longer dealing with single, private IPs or public IPs, or any of those things. Services can come up. Pods can come up. Deployments can come up, go down.

So, service discoverability becomes a problem, which means that tying back a particular behavior to these services, which are themselves a collection of services, and to the underlying infrastructure -- whether you’re talking about queues or network calls -- you’re talking about any number of interconnected infrastructure components as well. That becomes extremely challenging.

The second aspect is that most of our ecosystem implicitly runs on preemptible workloads, or spot workloads. So, nodes can come up, nodes can go down. How do you put these things together? While we can identify a particular service as problematic, I want to find out if it is the service that is problematic or the underlying cloud provider. And within the cloud provider, is it the network or the actual hardware or operating system (OS)? If it is OS, which part precisely? Is it just a particular part that is problematic, or is the entire hardware problematic? That’s one view.

The other view is that cardinality becomes an extremely important issue. Metrics alone cannot solve that problem. Logs alone cannot solve that problem. A very simple request, for example, a payment-create-request in our world, carries at least 30 to 35 different cardinality dimensions (e.g.: the merchant identity, gateway, terminal, network, and whether the payment is domestic vs international, etc.).

A variety of these parameters comes into play. You need to know if it’s an issue overall, is it at a particular merchant, and at what dimension? So, you need to narrow down the problem in a tight production scenario.
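The narrowing-down step described here can be sketched in a few lines of Python. This is an editorial example under stated assumptions, not Razorpay's tooling: the request records, the three tag names, and the helper `failure_share_by_value` are all hypothetical, and a real payment request would carry many more dimensions.

```python
from collections import Counter


def failure_share_by_value(requests, dimension):
    """Return {value: failed / total} for one dimension of the request tags."""
    totals, failures = Counter(), Counter()
    for r in requests:
        value = r["tags"][dimension]
        totals[value] += 1
        if r["failed"]:
            failures[value] += 1
    return {v: failures[v] / totals[v] for v in totals}


# Made-up traced requests; only three of the many possible dimensions shown.
requests = [
    {"tags": {"merchant": "m1", "gateway": "g1", "scope": "domestic"}, "failed": False},
    {"tags": {"merchant": "m1", "gateway": "g2", "scope": "domestic"}, "failed": True},
    {"tags": {"merchant": "m2", "gateway": "g2", "scope": "international"}, "failed": True},
    {"tags": {"merchant": "m2", "gateway": "g1", "scope": "domestic"}, "failed": False},
]

by_gateway = failure_share_by_value(requests, "gateway")
by_merchant = failure_share_by_value(requests, "merchant")
```

In this sample, failures are split evenly across merchants but are concentrated entirely on gateway `g2`, which is the kind of signal that points to one dimension rather than an overall outage.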

To manage those aspects, tools like Hypertrace, or any observability tool, for that matter -- tracing in general -- make it a lot easier to arrive at the right conclusions.

Gardner: You mentioned there are other options for tracing. How did you at Razorpay come to settle on Hypertrace? What’s the story behind your adoption of Hypertrace after looking at the tracing options landscape?

The why and how of choosing Hypertrace

Venkat: When we began our observability journey, we realized we had to go further into visibility tracing because the APMs were not answering a lot of questions we were asking of the APM tool. The best open-source version was that offered by Jaeger. We evaluated a lot of PaaS/SaaS solutions. We really didn't want to build an in-house observability stack.

There were a few challenges in all the PaaS offerings including storage, ability to drill down, retention, and cost versus value offered. Additionally, many of the providers were just giving us Jaeger with add-ons. The overall cost-to-benefit ratio suffered because we were growing with both the number of services and users. Any model that charges us on the user level, data storage level, or services level -- these become prohibitive over time.

Although maintaining an in-house observability tool is not the most natural business direction for us, we soon realized that maybe it’s best for us to do it in-house. We were doing some research and hit upon this solution called Hypertrace. It looked interesting so we decided to give it a try.

They offered the ability for me to jump into a Slack call. And that’s all I did. I just signed up. In fact, I didn’t even sign up with my company email address. I signed up with my personal email address and I just jumped on to their Slack call.

I started asking the Hypertrace team lots of questions. I started with a Docker Compose setup, straight out of their GitHub repo. The integration was quite straightforward. We did a set of proof-of-concepts and said, “Okay, this sort of makes sense.” The UX was on par with any commercial SaaS provider. That blew my mind. How can an open-source product build such a fantastic user interface (UI)? I think that was the first thing that hit most of our heads. And I think that was the biggest sell. We said, “Let’s just jump in and see how it evaluates.” And that’s the story.

Gardner: What sort of paybacks or metrics of success have you enjoyed since adopting Hypertrace? As open source, are you injecting your own requirements or desired functions and features into it?

Venkat: First and foremost, we wanted to understand the beast we were dealing with in our APIs, which meant we had to build in the instrumentation and software development kits (SDKs), including OpenCensus, OpenTracing, and OpenTelemetry agents.

The next step was integrating these tools within our services and ecosystem. There are challenges in terms of internally standardizing all our instrumentation, using best practices, and ensuring that applications are adopted. We had to make internal developer adoption easier by building the right toolkits, the right frameworks, and the right SDKs because applications have their own business asks, and you shouldn’t be adding woes to their existing development life cycle. Integration should be simple! So, we formulated a virtual team internally within Razorpay to build the observability stack.

As we built the SDKs and tooling and started instrumenting, we did a lot of adoption exercises within the organization. Now, we have more than 15 critical services and a lot more in the pipeline. Over a period of time, we were able to make tracing a habit rather than just another “nice to have.”

One of the biggest benefits we started seeing from the production monitoring is our internal engineering teams figured out how to run performance tests in pre-production. Some of these wouldn’t have been possible before; being able to pin down the right problem areas.

Now, during the performance testing, our engineers can early-on pinpoint the root cause of the problems. And they’ve gone back to fix their code even before the code goes into production. And believe me that it’s a lot more valuable for us than the code going into production and then facing these problems.

The misfortune about all monitoring tools is typical metrics might not be applicable. Why? Because when things go right, nobody wants to look at monitoring. It’s only when things go wrong that people log into a monitoring tool.

The benefits of Hypertrace come in terms of how many issues you’re able to detect much earlier in the stages of development. That’s probably the biggest benefit we have gotten.

Gardner: Jayesh, what makes Hypertrace unique in the tracing market?

Democratic data for API analytics

Ahire: There are two different ways to analyze, visualize, and use the data to better understand the systems. The first important thing is how we do data collection. Hypertrace provides data collection from any standard instrumentation.

If your application is instrumented with Jaeger, Zipkin, or OpenTelemetry, and you start sending the instrumentation data to Hypertrace, it will be able to analyze it and show you the dashboard. You then will be able to slice and dice the data using our explorer. You can discover a lot of different things.

That democratization of the data collection aspect is one important thing Hypertrace provides. And if you want to use any other tracing platform you can do that with Hypertrace because we support all the standard instrumentation.

Next is how we utilize that data. Most tracing platforms provide a way to slice and dice their data. So that’s just one explorer view where there’s all the data from the instrumentation available and you can find the information you want. Ask the question and then you will get the information. That’s one way to look at it.

Hypertrace provides, in addition to that explorer view, a detailed service graph. With it, you can go to applications, see the service interactions, the latency markings, and learn which services are having errors right away. Out-of-the-box services derived from instrumentation data provide many necessary metrics and visualizations, including latency, error rate, and call rate.
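To show how metrics such as latency, error rate, and call rate can be derived from instrumentation data, here is a small Python sketch. It is an editorial illustration, not Hypertrace's implementation: the span fields (`service`, `duration_ms`, `error`) and the function `service_metrics` are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def service_metrics(spans, window_seconds):
    """Aggregate call rate, error rate, and mean latency per service."""
    grouped = defaultdict(list)
    for span in spans:
        grouped[span["service"]].append(span)

    metrics = {}
    for service, items in grouped.items():
        metrics[service] = {
            "calls_per_second": len(items) / window_seconds,
            "error_rate": sum(1 for s in items if s["error"]) / len(items),
            "avg_latency_ms": mean(s["duration_ms"] for s in items),
        }
    return metrics


# Made-up spans collected over a 60-second window.
spans = [
    {"service": "payments", "duration_ms": 120.0, "error": False},
    {"service": "payments", "duration_ms": 80.0, "error": True},
    {"service": "ledger", "duration_ms": 30.0, "error": False},
]

metrics = service_metrics(spans, window_seconds=60)
```

For the sample data, the hypothetical `payments` service shows a 50 percent error rate and a mean latency of 100 ms, while `ledger` shows no errors.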

You can see more of the API interactions. You can also compare current data with earlier data; for example, your latency over the last day against the last hour. That is pretty helpful for comparing between deployments, such as checking whether the performance, latency, or error rate has been affected. There are a lot of use cases you can solve with Hypertrace.

With such observability used in early problem detection, you can reduce MTTD and MTTR using these dashboard services. You can achieve early problem detection easily.

Then there’s availability. The expectation is for availability of 99.99 percent. In the case of Razorpay, it’s very critical. Any downtime has a business impact. For most businesses, that’s the case. So, availability is a critical issue.
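For a sense of scale, the arithmetic behind a 99.99 percent target can be worked out directly. The Python sketch below is an editorial aside; the function name and the choice of 365-day and 30-day periods are assumptions made for illustration.

```python
def downtime_budget_minutes(availability_percent, period_days):
    """Minutes of downtime allowed in a period at a given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


per_year = downtime_budget_minutes(99.99, 365)   # roughly 52.56 minutes
per_month = downtime_budget_minutes(99.99, 30)   # roughly 4.32 minutes
```

At 99.99 percent, the allowance is under an hour of downtime per year, or a little over four minutes in a 30-day month.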

The Hypertrace dashboards help you to maintain that as well. Currently, we are working on alerting features on deviations -- and those deviations are calculated automatically. We calculate baselines from the previous data, and whenever a deviation happens, we give an alert. That obviously helps in reducing MTTD as well as increasing availability generally.
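The general idea of baseline-and-deviation alerting can be sketched as follows. This is not Hypertrace's algorithm, which the discussion does not describe; it is a simple editorial example that assumes the baseline is the mean of recent values and that a deviation is anything more than `k` standard deviations away. The function name, the threshold, and the sample latencies are all hypothetical.

```python
from statistics import mean, pstdev


def deviates(history, current, k=3.0):
    """True if `current` is more than k standard deviations from the baseline."""
    baseline = mean(history)
    spread = pstdev(history)
    return abs(current - baseline) > k * spread


# Made-up latency samples (ms) forming the baseline.
history = [100, 102, 98, 101, 99]

normal = deviates(history, 103)   # within the usual spread
anomaly = deviates(history, 140)  # far outside the usual spread
```

Here 103 ms stays inside the band and 140 ms falls outside it. If the history is perfectly flat, the spread is zero and any change at all triggers the check, so a production rule would likely need a minimum tolerance.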

Hypertrace strives to make the UX seamless. As Venkat mentioned, we have a beautiful UI that looks professional and attractive. The UI work we put into our SaaS security solution, Traceable AI, also goes into Hypertrace, and so helps the community. It helps people such as Venkat at Razorpay to solve the problems in their environment. That’s pretty good.

Gardner: Venkat, for other organizations facing similar complexity and a need to speed remediation, what recommendations do you have? What should other companies be thinking about as they evaluate observability and tracing choices? What do you recommend they do as they get more involved with API resiliency?

Evaluate then invest in your journey

Venkat: A fundamental problem today in the open-source world with tracing is the quality of standards. We have OpenCensus on one side going to OpenTelemetry and OpenTracing going to OpenTelemetry. In trying to keep it all compatible, and because it’s all so nascent, there is not a lot of automation.

For most startups, it is quite daunting to build their own observability stack.

My recommendation is to start with an existing tracing provider and evaluate that against your past solutions. Over time it may become cost prohibitive. At some point, you must start looking inward. That’s the time when systems like Hypertrace become quite useful for an organization.

The truth is it’s not easy to build an observability stack. So, experiment with a SaaS provider on a lower scale. Then invest in the right tooling, one that gives the liberty to not maintain the stack, such as Hypertrace. Keep the internal tooling separate, experiment, and come back. That’s what I would recommend.

The cost is not just the physical infrastructure cost, or the licensing cost. Cost is also engineering cost of the stack. If the stack goes down, who monitors the monitor? It’s a big question. So, there are trade-offs. There is no right answer, but it’s a journey.

After our experience with Hypertrace, I have connected with a couple of my friends in different organizations, and I’ve told them of the benefits. I do not know their results, but I’ve told them some of the benefits that we have leveraged using Hypertrace.

Gardner: And just to follow up on your advice for others, Venkat, what is it about open source that helps with those trade-offs?

Venkat: One advantage we have with open-source is there is no vendor lock-in. That’s one major advantage. One of our critical services is in PHP. And hence, we needed to only use OpenCensus for instrumenting it.

But there were a lot of performance and resilience issues with this codebase. Today, the original OpenCensus PHP implementation points to Razorpay’s fork.

And we are working with the Hypertrace community, too, to build some features, whether it is in tool design, Blue Coat, knowledge sharing, or bug-fixing. For us it’s been an interesting and exciting journey.

Ahire: Yes, that has been the mutual experience from our end as well. We learned a lot of things. We had made assumptions in the beginning about what users might expect or want.

But Razorpay worked with us. On some things they said, “Okay, this is not going to work. You have to change this part.” And we modified some things, we added a few features, and we removed a few things. That’s how it came to where it is today. The whole collaboration aspect has been very rewarding.

Venkat: Even though we have only a handful of critical services, the data instrumented from them was over two terabytes a day. And while that is a good problem to have, we have other interesting scaling challenges we need to deal with.

So how do you optimize these things at scale? In the SaaS form, we could have just gone and said, “Hey, this sort of doesn’t work.” We stick with them for a few months then we go ahead with another SaaS provider and say, “Are you going to solve this problem or not?”

The flexibility we get with open source is to say, “Okay, here’s the problem. How do we fix it?” Because, of course, they’re not under our control, right? I think that’s super powerful.

Ahire: Here we all learn together.

Gardner: Yes, it certainly sounds like a partnership relationship. Jayesh, tell us a little bit about the roadmap for Hypertrace, and particularly for the smaller organizations who might prefer a SaaS model, what do you have in store for them?

Ahire: We are currently working on alerting. We’ll soon release dynamic anomaly-based alerting.

We are also working on metric ingestion and integrations throughout the Hypertrace platform. An important aspect of tracing and observability is being able to correlate the data. To propagate context throughout the system is very important. That’s what we will be doing with our metric integration. You will be able to send application metrics, and you will be able to correlate back to trace data and log data.

And talking of SaaS, when it comes to smaller organizations with maybe 10, 20, or 30 developers and a not very well-defined DevOps team, it can be hard to deploy and manage this kind of platform.

So, for those users, we are working toward a SaaS model so smaller companies will be able to use the Hypertrace stack functionality.

Gardner: Where can organizations go to learn more about Hypertrace and start to use some of these features and functions?

Ahire: You can head on to hypertrace.org, our website, and find the details of our use cases. There’s a Slack channel link, GitHub, and everything is available there. Those are good places to start.

Venkat: Just try it first and just go to GitHub and within a few minutes you should have the entire stack up and running. I mean, that’s as simple as simplicity can get.

For further details, just go to the Slack channel and start communicating. Their team is super-duper responsive and super-duper helpful. In fact, we have never had to talk to them saying, “Hey, what’s this?” because we sort of realized that they come back with a patch much faster than you can imagine.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: Traceable AI.  

You may also be interested in: