Tuesday, October 19, 2021

How FinTech innovator Razorpay uses open-source tracing to manage fast-changing APIs

The speed and complexity of microservices-intensive applications often leave their developers in the dark. Too often, the creators struggle to track and visualize the actual underlying architecture of their distributed services.

The designers, builders, and testers of modern API-driven apps therefore need ongoing, instant visibility into the rapidly changing data flows, integration points, and assemblages of internal and third-party services.

Thankfully, an open-source project advancing a sophisticated distributed tracing and observability platform called Hypertrace is helping.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

Stay with us here as BriefingsDirect explores the evolution and capabilities of Hypertrace and how an early adopter in the online payment suite business, Razorpay, has gained new insights and deeper understanding of their overall services components.

To learn how Hypertrace discovers, monitors, visualizes, and optimizes increasingly complex services architectures, please welcome Venkat Vaidhyanathan, Architect at Razorpay in Bangalore, India, and Jayesh Ahire, Founding Engineer at Traceable AI and Product Manager for Hypertrace. The discussion is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Venkat, what does Razorpay do and why is tracing and understanding your services architecture so important?

Venkat: Razorpay’s mission is to enable frictionless banking and payment experiences by powering the entire financial infrastructure for businesses of all shapes and sizes. It’s a full-stack financial solution that enables thousands of small- to medium-sized enterprises (SMEs) and enterprises to accept, process, and disburse payments at scale.


Today, we process billions of dollars of payments from millions of businesses across India. As a leading payments provider, we have been the first to bring to market most of the major online innovations in payments for the last five years.

For the last two years, we have successfully curated neo banking and lending services. We have seen outstanding growth in the last five years and attracted more than $300 million in funding from investors such as Sequoia, Tiger Global, Ribbit Capital, Matrix Partners, and others.

One of the fundamental principles in designing Razorpay has been to build a largely API-driven ecosystem. We are a developer-first company. Our general principle of building is, “It is built by developers for developers,” which means that every single product we build is always going to be API-driven first. In that regard, we must ensure that our APIs are resilient. That they perform at optimum capacity is of extreme importance to us.

Gardner: What is it about being an API-driven organization that makes tracing and observability such an important undertaking?

Venkat: We are an extremely Agile organization. As a startup, we are obsessed with our customers. A focus on building quality products is paramount to creating the best user experience (UX).

Our customers have amazing stories around our projects, products, and ecosystem. We have worked through extreme times (for example, demonetization and the Yes Bank outage), and that has helped our customers build a lot of trust in what we do -- and what we can do.


We have quickly taken up those challenges and turned them around for most of our customers, building a lot of trust in the kinds of things we do.

After all, we are dealing with one of the most sensitive aspects of human lives, which is their money. So, in this regard, resiliency, security, and all the usability parameters are extremely important to our success.

Gardner: Jayesh, why is Razorpay a good example of what businesses are facing when it comes to APIs? And what requirements for such users are you attempting to satisfy with your distributed tracing and observability platform?

Observability offers scale, insight, resilience

Ahire: Going back to the days when it all started, people began building applications as monoliths. And it was easier then to begin with a monolithic application to get the business moving.


But in recent times, that is not the only important thing for businesses. As we heard, Venkat needs scale and resiliency in the platform while building with APIs. Most modern organizations use microservices, which complicates these architectures. They become hard to manage, especially at large organizations that can have 100 to 300 microservices, with thousands of APIs communicating between them.

It’s just hard now for businesses to gain the visibility and observability to determine whether they have any issues and whether their APIs are performing as expected.

I use a list of four brief questions that every organization needs to answer at some point. Are their APIs:

  • Providing the functionality they are supposed to deliver?

  • Performing in the way they are supposed to?

  • Secure for their business users?

  • Known and understood across all their microservices and uses?

They must understand if the APIs and microservices are performing up to the actual expectations and required functionality. They need something that can provide the answers to these questions, at the very least.

Observability helps answer these essential questions without having to open the black box and go to each service and every API. Instead, the instrumentation data provides those insights. You can ask questions of your system -- how is it performing, for example -- and it will give you answers. Such observability helps large-scale organizations keep up with scale and with the increasing number of users. And that keeps the systems resilient.

Gardner: Venkat, what are your business imperatives for using Hypertrace? Is it for UX? What is the business case for gaining more observability in your services development?

Metrics, logs, and traces limit trouble

Venkat: There are three fundamental legs to what we define as modern observability. One is metrics, the next is logs, and the third is traces.

Up until recently, we had application performance monitoring (APM) systems that monitored some of these things, with a single place to gather some metrics and insights. However, as microservices have grown in use, APMs are no longer necessarily the right way to do these things. For metrics, a lot of work is already going on in the open-source ecosystem around Prometheus and others. I wrote a blog about our journey of scaling our metrics platform to trillions of data points.

Once you can get logs -- whether from the open-source ELK Stack [Elasticsearch, Logstash, and Kibana] or from one of the many platform-as-a-service (PaaS) and software-as-a-service (SaaS) log providers -- fundamentally the issue comes down to traces.


Now, traces can be used in a very primitive way, such as instrumenting a particular piece of code to understand its behavior. It could be a timing function, for example.

However, as microservices evolve, you’re talking about a lot more problems, such as how much time would a network call take? How much time would the database call take? Was my DNS request the biggest impediment? What really happened in the last mile?
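
To make that concrete, here is a minimal sketch of that primitive form of instrumentation using the OpenTelemetry Python SDK: a span wrapped around a single database call so the time it takes shows up in a trace. The service, function, and query names are hypothetical, not Razorpay's actual code.

```python
# Minimal sketch: timing one operation with a span (OpenTelemetry Python SDK).
# Service, function, and query names are hypothetical placeholders.
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("payments-service")

def run_query(sql: str, *params):
    """Hypothetical stand-in for a real database client call."""
    time.sleep(0.01)  # simulate query latency
    return {"id": params[0], "status": "captured"}

def fetch_payment(payment_id: str) -> dict:
    # The span records when the call started and ended, so the time spent in the
    # database call (or a network hop, or a DNS lookup) is visible in the trace.
    with tracer.start_as_current_span("db.fetch_payment") as span:
        span.set_attribute("db.system", "mysql")
        span.set_attribute("payment.id", payment_id)
        return run_query("SELECT * FROM payments WHERE id = %s", payment_id)

print(fetch_payment("pay_123"))
```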

And when you’re talking about an entire graph of services, it’s very important to know what particular point in the entire graph breaks down often – or doesn’t break down very often.

Understanding all these things, as Jayesh said, and asking the right questions cannot happen only by using metrics or just logs. They only give different slices of the problems. And it cannot happen only by using tracing, which also only gives a different slice of the problem.

In an ideal, nirvana world, you combine all these things in a single place that can correlate them and allow a deep dive into a specific component, module, function, system, or query. Being able to identify root causes and reduce the mean time to detect (MTTD) -- these are some of the most important things we need to worry about.
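
One widely used way to get that correlation -- offered here as a hedged sketch, not a description of Razorpay's or Hypertrace's internals -- is to stamp every log line with the IDs of the active trace and span, so a slow or failing trace can be joined directly to its logs during root-cause analysis.

```python
# Sketch: injecting the active trace context into log records so traces and
# logs can be correlated in one place. Assumes an OpenTelemetry tracer is
# configured elsewhere (as in the earlier sketch); names are illustrative.
import logging

from opentelemetry import trace

class TraceContextFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x")
        record.span_id = format(ctx.span_id, "016x")
        return True

logger = logging.getLogger("payments")
handler = logging.StreamHandler()
handler.setFormatter(
    logging.Formatter("%(asctime)s trace=%(trace_id)s span=%(span_id)s %(message)s"))
logger.addHandler(handler)
logger.addFilter(TraceContextFilter())
logger.setLevel(logging.INFO)

logger.info("payment capture started")  # carries the IDs of whatever span is active
```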

In complex, large-scale systems, things go wrong. Why things went wrong is one part, when things went wrong is another, and being able to arrive at and fix the problem -- the MTTD and the mean time to recovery (MTTR) -- those largely define the success of any business.

We are just one of many financial ecosystem providers. There are tons of providers in the world, so the customer has many options to switch from one provider to another. For any business, how it reacts to these performance issues is what matters most.

Observability tools like Hypertrace put us in control, rather than leaving us to hypothesize.

Gardner: Jayesh, how does Hypertrace improve on such key performance controls as MTTD and MTTR? How is Hypertrace being used to cut down on that all-important time to remediation that makes the user experience more competitive?

Tracing eases uncovering the unknown

Ahire: As Venkat pointed out, in these modern systems, there are too many unknown unknowns. Finding out what caused any problem at any point in time is hard.

At Hypertrace, in trying to help businesses, we present entity-focused, API-first views. Hypertrace provides a very detailed service dashboard, an overview, an out-of-the-box service overview. Such a backend API overview helps find what different services are talking to each other, how they are talking to each other, the interactions between the different services, and then what different APIs are talking to the services. It provides a list of APIs.

Hypertrace provides a single-pane view into the services and API trace data. The insights gained from the trace data make it easier to find which API or service has an issue. That’s where the entity-first API view makes the most sense. The API dashboard helps people get to the issue very easily and helps reduce MTTD and MTTR.

Venkat: Just to add to what Jayesh mentioned, our ecosystem is internally a Kubernetes ecosystem. And Kubernetes is extremely dynamic in nature. You are no longer dealing with fixed private or public IPs, or any of those things. Services can come up. Pods can come up. Deployments can come up and go down.

So, service discoverability becomes a problem. Tying a particular behavior back to these services -- which are themselves collections of services -- and to the underlying infrastructure, whether queues or network calls, means you are dealing with any number of interconnected infrastructure components as well. That becomes extremely challenging.


The second aspect is that most of our ecosystem implicitly runs on preemptible, or spot, workloads. So, nodes can come up and nodes can go down. How do you put these things together? While we can identify a particular service as problematic, I want to find out whether it is the service that is problematic or the underlying cloud provider. And within the cloud provider, is it the network, the actual hardware, or the operating system (OS)? If it is the OS, which part precisely? Is it just one particular part that is problematic, or is the entire hardware problematic? That’s one view.

The other view is that cardinality becomes an extremely important issue. Metrics alone cannot solve that problem. Logs alone cannot solve that problem. A very simple request -- for example, a payment-create request in our world -- carries at least 30 to 35 different cardinality dimensions (e.g., the merchant identity, gateway, terminal, network, and whether the payment is domestic or international).
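
To illustrate what those dimensions look like on the wire, here is a hedged sketch of attaching a few of them to a span as attributes with the OpenTelemetry Python API. The attribute names and request fields are invented for illustration and are not Razorpay's actual schema.

```python
# Illustrative only: a few of the many cardinality dimensions a payment-create
# request might carry, attached as span attributes. Attribute names and request
# fields are hypothetical, not Razorpay's real schema.
from opentelemetry import trace

tracer = trace.get_tracer("payments-service")

def create_payment(req: dict) -> None:
    with tracer.start_as_current_span("payment.create") as span:
        span.set_attributes({
            "merchant.id": req["merchant_id"],    # high cardinality: one value per merchant
            "payment.gateway": req["gateway"],
            "payment.terminal": req["terminal"],
            "payment.network": req["network"],
            "payment.international": req["international"],
        })
        # ... the actual payment processing would happen here ...

create_payment({"merchant_id": "m_42", "gateway": "hdfc", "terminal": "t_7",
                "network": "visa", "international": False})
```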


A variety of these parameters comes into play. You need to know whether it’s an issue overall or with a particular merchant, and along which dimension. So, you need to narrow down the problem quickly in a tight production scenario.

To manage those aspects, tools like Hypertrace -- or any observability tool, for that matter; tracing in general -- make it a lot easier to arrive at the right conclusions.

Gardner: You mentioned there are other options for tracing. How did you at Razorpay come to settle on Hypertrace? What’s the story behind your adoption of Hypertrace after looking at the tracing options landscape?

The why and how of choosing Hypertrace

Venkat: When we began our observability journey, we realized we had to go deeper into tracing because the APM tools were not answering a lot of the questions we were asking of them. The best open-source option was Jaeger. We evaluated a lot of PaaS/SaaS solutions. We really didn’t want to build an in-house observability stack.

There were a few challenges with all the PaaS offerings, including storage, the ability to drill down, retention, and cost versus the value offered. Additionally, many of the providers were just giving us Jaeger with add-ons. The overall cost-to-benefit ratio suffered because we were growing in both the number of services and the number of users. Any model that charges at the user level, data-storage level, or service level becomes prohibitive over time.

Although maintaining an in-house observability tool is not the most natural business direction for us, we soon realized that maybe it was best for us to do it in-house. We were doing some research and hit upon this solution called Hypertrace. It looked interesting, so we decided to give it a try.

They offered the ability for me to jump into a Slack call. And that’s all I did. I just signed up. In fact, I didn’t even sign up with my company email address. I signed up with my personal email address and I just jumped on to their Slack call.


I started asking the Hypertrace team lots of questions. We started with a Docker Compose setup, straight out of their GitHub repo. The integration was quite straightforward. We did a set of proofs-of-concept and said, “Okay, this sort of makes sense.” The UX was on par with any commercial SaaS provider. That blew my mind. How can an open-source product build such a fantastic user interface (UI)? I think that was the first thing that hit most of our heads -- and it was the biggest sell. We said, “Let’s just jump in and see how it evaluates.” And that’s the story.

Gardner: What sort of paybacks or metrics of success have you enjoyed since adopting Hypertrace? As open source, are you injecting your own requirements or desired functions and features into it?

Venkat: First and foremost, we wanted to understand the beast we were dealing with in our APIs, which meant we had to build in the instrumentation and software development kits (SDKs), including OpenCensus, OpenTracing, and OpenTelemetry agents.


The next step was integrating these tools within our services and ecosystem. There are challenges in terms of internally standardizing all our instrumentation, using best practices, and ensuring that applications are adopted. We had to make internal developer adoption easier by building the right toolkits, the right frameworks, and the right SDKs because applications have their own business asks, and you shouldn’t be adding woes to their existing development life cycle. Integration should be simple! So, we formulated a virtual team internally within Razorpay to build the observability stack.
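
A hedged sketch of what such an internal toolkit can look like in practice: one init call that hides the exporter wiring, plus a decorator teams drop onto their handlers. The module layout, collector endpoint, and names are assumptions for illustration, not Razorpay's actual SDK.

```python
# Sketch of a thin internal tracing toolkit: one init call plus a decorator, so
# application teams never wire up exporters themselves. The collector endpoint
# and all names are hypothetical. Requires opentelemetry-sdk and
# opentelemetry-exporter-otlp-proto-grpc to be installed.
import functools

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def init_tracing(service_name: str, endpoint: str = "otel-collector.internal:4317") -> None:
    """Configure tracing once at service start-up."""
    provider = TracerProvider(resource=Resource.create({"service.name": service_name}))
    provider.add_span_processor(
        BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint, insecure=True)))
    trace.set_tracer_provider(provider)

def traced(name: str):
    """Wrap a function in a span with the given name."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            with trace.get_tracer(fn.__module__).start_as_current_span(name):
                return fn(*args, **kwargs)
        return wrapper
    return decorator

# Usage inside a service:
#   init_tracing("refund-service")
#   @traced("refund.process")
#   def process_refund(refund_id): ...
```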

As we built the SDKs and tooling and started instrumenting, we did a lot of adoption exercises within the organization. Now, we have more than 15 critical services and a lot more in the pipeline. Over a period of time, we were able to make tracing a habit rather than just another “nice to have.”

One of the biggest benefits we started seeing from production monitoring is that our internal engineering teams figured out how to run performance tests in pre-production. Being able to pin down the right problem areas that way wouldn’t have been possible before.


Now, during performance testing, our engineers can pinpoint the root cause of problems early on. And they’ve gone back and fixed their code even before it goes into production. Believe me, that’s a lot more valuable for us than the code going into production and then facing these problems.

The misfortune of all monitoring tools is that the typical metrics might not apply. Why? Because when things go right, nobody wants to look at monitoring. It’s only when things go wrong that people log into a monitoring tool.

The benefits of Hypertrace come in terms of how many issues you’re able to detect much earlier in the stages of development. That’s probably the biggest benefit we have gotten.

Gardner: Jayesh, what makes Hypertrace unique in the tracing market?

Democratic data for API analytics

Ahire: There are two different ways to analyze, visualize, and use the data to better understand the systems. The first important thing is how we do data collection. Hypertrace provides data collection from any standard instrumentation.

If your application is instrumented with Jaeger, Zipkin, or OpenTelemetry, and you start sending the instrumentation data to Hypertrace, it will be able to analyze it and show you the dashboard. You then will be able to slice and dice the data using our explorer. You can discover a lot of different things.

That democratization of the data collection aspect is one important thing Hypertrace provides. And if you want to use any other tracing platform you can do that with Hypertrace because we support all the standard instrumentation.
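
As a concrete example of that, here is a hedged sketch of pointing an OpenTelemetry-instrumented Python service at a Hypertrace collector through a Zipkin-format exporter. The endpoint below assumes a local Hypertrace deployment exposing the conventional Zipkin port; check the Hypertrace documentation for the ingestion endpoints your install actually exposes.

```python
# Hedged sketch: exporting spans in Zipkin format to a Hypertrace collector.
# The URL assumes a local install listening on the conventional Zipkin port
# (9411); verify against the Hypertrace docs for your deployment. Requires
# opentelemetry-sdk and opentelemetry-exporter-zipkin-json.
from opentelemetry import trace
from opentelemetry.exporter.zipkin.json import ZipkinExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

exporter = ZipkinExporter(endpoint="http://localhost:9411/api/v2/spans")
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("checkout.place_order"):
    pass  # spans emitted here show up in the Hypertrace explorer and service graph
```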

Next is how we use that data. Most tracing platforms provide a way to slice and dice the data. That’s the explorer view, where all the data from the instrumentation is available and you can find the information you want: ask the question and you get the answer. That’s one way to look at it.

Hypertrace provides, in addition to that explorer view, a detailed service graph. With it, you can go to applications, see the service interactions and the latency markings, and learn right away which services are having errors. Out-of-the-box service dashboards derived from instrumentation data provide many necessary metrics and visualizations, including latency, error rate, and call rate.

You can see more of the API interactions. You can see comparison data against current data, for example -- whatever your latency was over the last day versus the last hour. It’s pretty helpful to be able to compare between deployments, to see whether performance, latency, or error rate was affected. There are a lot of use cases you can solve with Hypertrace.

With such observability applied to early problem detection, you can reduce MTTD and MTTR using these dashboards.


Then there’s availability. The expectation is for availability of 99.99 percent. In the case of Razorpay, it’s very critical. Any downtime has a business impact. For most businesses, that’s the case. So, availability is a critical issue.

The Hypertrace dashboards help you to maintain that as well. Currently, we are working on alerting features on deviations -- and those deviations are calculated automatically. We calculate baselines from the previous data, and whenever a deviation happens, we give an alert. That obviously helps in reducing MTTD as well as increasing availability generally.
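
To make the idea of a baseline-driven alert concrete, here is a toy sketch of one simple approach: a rolling window with a mean-plus-three-sigma threshold. It is illustrative only and is not the algorithm Hypertrace uses.

```python
# Toy sketch of baseline-based deviation alerting: compare the latest latency
# sample against a rolling mean and standard deviation computed from history.
# Illustrative only; not how Hypertrace computes its baselines.
from collections import deque
from statistics import mean, stdev

class LatencyBaseline:
    def __init__(self, window: int = 1440):  # e.g. one day of per-minute samples
        self.samples = deque(maxlen=window)

    def observe(self, latency_ms: float) -> bool:
        """Record a sample; return True if it deviates from the baseline."""
        alert = False
        if len(self.samples) >= 30:  # need some history before alerting
            mu, sigma = mean(self.samples), stdev(self.samples)
            alert = latency_ms > mu + 3 * sigma
        self.samples.append(latency_ms)
        return alert

baseline = LatencyBaseline()
for latency in [120.0 + i % 7 for i in range(60)]:   # an hour of normal samples
    baseline.observe(latency)
if baseline.observe(900.0):                          # a sudden spike
    print("latency deviated from baseline -- raise an alert")
```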

Hypertrace strives to make the UX seamless. As Venkat mentioned, we have a beautiful UI that looks professional and attractive. The UI work we put into our SaaS security solution, Traceable AI, also goes into Hypertrace, and so helps the community. It helps people such as Venkat at Razorpay solve the problems in their environment. That’s pretty good.

Gardner: Venkat, for other organizations facing similar complexity and a need to speed remediation, what recommendations do you have? What should other companies be thinking about as they evaluate observability and tracing choices? What do you recommend they do as they get more involved with API resiliency?

Evaluate then invest in your journey

Venkat: A fundamental problem today in the open-source tracing world is the state of the standards. We have OpenCensus on one side moving to OpenTelemetry, and OpenTracing also moving to OpenTelemetry. In trying to keep it all compatible, and because it’s all so nascent, there is not a lot of automation.

For most startups, it is quite daunting to build their own observability stack.

My recommendation is to start with an existing tracing provider and evaluate that against your past solutions. Over time it may become cost prohibitive. At some point, you must start looking inward. That’s the time when systems like Hypertrace become quite useful for an organization.

The truth is, it’s not easy to build an observability stack. So, experiment with a SaaS provider at a lower scale. Then invest in the right tooling -- one that gives you the liberty of not maintaining the stack, such as Hypertrace. Keep the internal tooling separate, experiment, and come back. That’s what I would recommend.

The cost is not just the physical infrastructure cost, or the licensing cost. Cost is also engineering cost of the stack. If the stack goes down, who monitors the monitor? It’s a big question. So, there are trade-offs. There is no right answer, but it’s a journey.

After our experience with Hypertrace, I have connected with a couple of my friends in different organizations, and I’ve told them of the benefits. I do not know their results, but I’ve told them some of the benefits that we have leveraged using Hypertrace.

Gardner: And just to follow up on your advice for others, Venkat, what is it about open source that helps with those trade-offs?

Venkat: One advantage we have with open source is that there is no vendor lock-in. That’s one major advantage. One of our critical services is in PHP, and hence we could only use OpenCensus for instrumenting it.


But there were a lot of performance and resilience issues with this codebase. Today, the original OpenCensus PHP implementation points to Razorpay’s fork.

And we are working with the Hypertrace community, too, to build some features, whether in tool design, Blue Coat, knowledge sharing, or bug-fixing. For us, it’s been an interesting and exciting journey.

Ahire: Yes, that has been the mutual experience from our end as well. We learned a lot of things. We had made assumptions in the beginning about what users might expect or want.

But Razorpay worked with us. On some things they said, “Okay, this is not going to work. You have to change this part.” And we modified some things, we added a few features, and we removed a few things. That’s how it came to where it is today. The whole collaboration aspect has been very rewarding.

Venkat: Even though we have only a handful of critical services instrumented, the trace data coming from them was over two terabytes a day. And while that is a good problem to have, we have other interesting scaling challenges we need to deal with.

So how do you optimize these things at scale? With a SaaS product, we could have just gone and said, “Hey, this sort of doesn’t work.” We would stick with them for a few months, then go to another SaaS provider and ask, “Are you going to solve this problem or not?”

The flexibility we get with open source is to say, “Okay, here’s the problem. How do we fix it?” Because, of course, they’re not under our control, right? I think that’s super powerful.

Ahire: Here we all learn together.

Gardner: Yes, it certainly sounds like a partnership relationship. Jayesh, tell us a little bit about the roadmap for Hypertrace, and particularly for the smaller organizations who might prefer a SaaS model, what do you have in store for them?

Ahire: We are currently working on alerting. We’ll soon release dynamic anomaly-based alerting.

We are also working on metrics ingestion and integrations throughout the Hypertrace platform. An important aspect of tracing and observability is being able to correlate the data. Propagating context throughout the system is very important. That’s what we will be doing with our metrics integration: you will be able to send application metrics and correlate them back to trace data and log data.


And talking of SaaS, when it comes to smaller organizations with maybe 10, 20, or 30 developers and no well-defined DevOps team, it can be hard to deploy and manage this kind of platform.

So, for those users, we are working toward a SaaS model so smaller companies will be able to use the Hypertrace stack functionality.


Gardner: Where can organizations go to learn more about Hypertrace and start to use some of these features and functions?

Ahire: You can head on to hypertrace.org, our website, and find the details of our use cases. There’s a Slack channel link, GitHub, and everything is available there. Those are good places to start.

Venkat: Just try it first. Go to GitHub, and within a few minutes you should have the entire stack up and running. I mean, that’s as simple as simplicity can get.

For further details, just go to the Slack channel and start communicating. Their team is super-duper responsive and super-duper helpful. In fact, we have never had to talk to them saying, “Hey, what’s this?” because we sort of realized that they come back with a patch much faster than you can imagine.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: Traceable AI.  


Friday, October 1, 2021

Traceable AI platform builds usage knowledge that detects and thwarts API vulnerabilities


The rapidly expanding use of application programming interfaces (APIs) to accelerate application development and advanced business services has created a vast constellation of interrelated services -- often now called the API Economy.

Yet the speed and complexity of this API adoption spree has largely outrun the capability of existing tools and methods to keep tabs on the services topology -- let alone keep these services secure and resilient.

Stay with us here as BriefingsDirect explores a new platform designed from the ground up specifically to define, manage, secure, and optimize the API underpinnings for so much of what drives today’s digital businesses.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy.

To learn more about how Traceable AI aims to make APIs reach their enormous potential safely and securely, please welcome Sanjay Nagaraj, Chief Technology Officer (CTO) and Co-Founder at Traceable AI. The interview is moderated by Dana Gardner, Principal Analyst at Interarbor Solutions.

Here are some excerpts:

Gardner: Why is addressing API security different from the vulnerabilities of traditional applications and networks? Why do we need a different way to head off API vulnerabilities?

Nagaraj: If you compare this to the analogy of protecting a house, previously there was a single house with a single door. You only had to protect that door to block someone from coming into the house. It was a lot easier.


Now, you have to multiply that because there are many rooms in the house, each with an open window. That means an attacker can come in through any of these windows, rather than only through a single door to the house.

To extend the analogy across the API economy, most businesses today are API-driven businesses. They expose APIs. They also use third-party libraries that connect to even more APIs. All of these APIs are powering the business but are also interacting with both internal and third-party APIs.

APIs and services are everywhere. The microservices are developed to power an entire application, which in turn powers a business. That’s why it is getting so complex compared to what a typical network security or basic application security solution used to handle. Before, you would take care of the perimeter for a particular application and secure the business. Now, that extends to all these services and APIs.

And when you look at network security, that operated at a different layer. It used to be more static. You therefore had a good understanding of how the network was set up and where the different application components were deployed.

Nowadays, with rapidly changing services and APIs coming online all the time, there is no single perimeter. In this complex world, where it is all APIs across the board, you must take more aspects into consideration to understand the security risks to your APIs -- and, in turn, what your business risks are. Business is riskier when it comes to today’s security landscape.

Because it’s so complex, the older security solutions can’t keep up. We at Traceable AI choose to approach security by looking at the data that comes in as part of the calls hitting the URLs. We take more context into consideration to detect whether something is an attack or an anomaly that is not necessarily malicious but may be a reconnaissance-type attack.

All of these issues mean we need more sophisticated solutions that, frankly, the industry hasn’t caught up to, even though development, security, and operations (DevSecOps) practices have moved a lot faster.

Gardner: And, of course, these are business-critical services. We’re talking about mission-critical data moving among and between these APIs, in and out of organizations and across their perimeters. With such critical data at hand, the reputation of your business is at stake because you could end up in a headline tomorrow.

Data is everywhere, exposed

Nagaraj: Exactly. At the end of the day, APIs are exposing data to their business users. That means the data flowing through might be part of the application, or it might be from another business-to-business API. You might be taking the user’s data and pushing it to a third-party service.

We’ve all seen the attacks on very sophisticated technology companies. These are very hard problems. As a developer myself, I can tell you what keeps me up most of the time: Am I doing the right thing when it comes to the functionality of my application? Am I doing the right thing when it comes to the overall quality of it? Am I doing the right thing when it comes to delivering the right kind of performance? Am I meeting the performance expectations of my users?


What do I, as a developer, think about the security of every single API that I’m writing? At the end of the day, it’s about the data that is getting exposed through these APIs. It’s important now to understand how this data is getting used. How is this data getting passed around through internal services and third-party APIs? That’s where the risk associated with your API is.

Gardner: Given that we have a different type of security problem to solve, what was your overarching vision for making APIs both powerful and robust? What is it in your background that helped you get to this vision of how the world should be?

Nagaraj: If you dial back the clock for myself and Jyoti Bansal, my co-founder at Traceable, we built the company AppDynamics, which was at the forefront of helping developers and DevOps teams understand their applications’ performance. When that product started, there was a basic understanding of how applications performed and were delivered to the customers. Over time, we started to think about this in a different way. One of the goals at AppDynamics was to understand applications from the ground up. You had to understand how these applications, with their modules, sub-modules, and sub-services, were interacting with each other.


A basic understanding was required to learn if the end-user experience was being delivered with the expected performance. That gave rise to application performance management (APM) in terms of a fuller understanding of an application’s underlying performance itself.

From an AppDynamics’ perspective, it was very important for us to know how the services were impacting each other. That means when a call gets made from service A to service B, you should understand how much time was consumed on the call and what was happening between the two, as well as how much time was spent within the service, between the services, and how much total time was spent delivering the data back to the user.

This is all in the performance context. But one of the key things we clearly knew as we started Traceable AI was that APIs were exploding. As we talked about with the API Economy, every one of the customers Traceable started to talk to asked us about more than just the performance aspects of APIs. They also wanted to know whether these APIs and applications were secure. That’s where they were having a difficult time. As much as developers like to make sure that APIs are secure, they are unable to do it simply because they don’t understand what goes into securing APIs.

That’s when we started to think about how to bring some of the learning we had in the past around application performance for developers and DevOps teams, and bring that to an understanding of APIs and services. We had to think about application security in a new way.

We started Traceable AI to find the best way to understand applications and the interactions of the applications, as well as understanding the uses. The way to do it was the technology built over the last decade for distributed tracing. By helping us trace the calls from one service to another, we were able to tap the data flowing through the services to understand the context of the data and services.

From the context and the data, you can learn who the users of these APIs are, what type of data is flowing, and which APIs are interacting with each other. You can see which APIs are getting called as part of a single-user session, for example, and which third-party APIs the data is being pulled from or pushed to.
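
To show the kind of derivation being described, here is a simplified sketch that walks a batch of spans to recover which APIs each service exposes and which services call which. The span fields are modeled loosely on common tracing conventions; this is not Traceable AI's internal data model.

```python
# Simplified sketch: deriving an API inventory and a service call graph from
# trace spans. Span fields follow common tracing conventions; this is not
# Traceable AI's internal data model.
from collections import defaultdict

spans = [
    {"service": "checkout", "name": "POST /orders", "peer_service": "payments"},
    {"service": "payments", "name": "POST /payments", "peer_service": "bank-gateway"},
    {"service": "checkout", "name": "GET /orders/{id}", "peer_service": None},
]

api_inventory = defaultdict(set)   # service -> set of API endpoints observed
call_graph = defaultdict(set)      # caller service -> set of callee services

for span in spans:
    api_inventory[span["service"]].add(span["name"])
    if span["peer_service"]:
        call_graph[span["service"]].add(span["peer_service"])

print(dict(api_inventory))
print(dict(call_graph))
```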

This overall context is what we wanted to understand. That’s where we started, and we built on the existing tracing technology to deliver an open-source platform, called Hypertrace. Developers can easily use it for all kinds of tracing use cases, including performance. We have quite a few customers that have started to use it as an open-source resource.

But the goal for us was to use that distributed tracing technology to solve application security challenges. It all starts with so many customers saying, “Hey, I don’t even know where my APIs exist. Developers seem to be pushing out a lot of APIs, and we don’t understand where these APIs are. How are they impacting our overall business in terms of security? What if some of these things get exposed, what happens then? If you must do a forensic analysis of these, what happens then?”

See it to secure it with tracing

We said, “Let’s use this technology to understand the applications from the ground up, detect all these APIs from the ground up.” If the customers don’t understand where the APIs exist, and what the purpose of these APIs are, then they won’t be able to secure them. For us, the basic concept was bringing the discovery of these applications and APIs into focus so that customers can understand it. That’s the vision of where we started.

Then, based on that, we said, “Once they discover and understand what APIs they have, let’s go further to understand what the normal behavior of these APIs are.”

Once APIs are published, there are tools to document them in the form of an OpenAPI or Swagger spec. But if you talk to most enterprises, those records are rarely maintained. What developers do very well is ship code. They ship good functionality; they try to ship bug-free code that performs well.

But, at the same time, the documentation is where things get weak, because they are continuously shipping. With the code changing constantly from a continuous integration/continuous delivery (CI/CD) perspective, developers are not able to keep the spec documentation up to date, especially as the code gets deployed and redeployed into production.


The whole DevSecOps movement needs to come together so the security practitioners are embedded with the developer and DevOps teams. That means the security folks have to have a continuous understanding of the security practices to ensure the APIs that are coming online continuously are understood.

Our customers now also are expecting our solution to help them automate these things. They want to automatically understand the risks of APIs -- which APIs should be blocked from being deployed into production and which APIs should be monitored more. There needs to be a cycle of observing these APIs on a continuous basis. It’s very, very critical.

From our perspective, once we discover and build this ongoing understanding of the APIs, we then want to protect those APIs before they get into production.

The inability to properly protect these APIs is not because some small company doesn’t have the technology skills or the proper engineering. It’s not about developers not having the right kind of training. We are talking about capable companies like Facebook, Shopify, and Tesla. These are technology-rich companies that are still having these issues because the APIs are continuously evolving. And there are still siloed pieces of development. That means in some cases they might understand the dependencies of the services, but in a lot of cases they don’t fully understand the dependencies and the security implications because of those dependencies.

This reality exposes a lot of different types of attacks, such as business logic attacks, as you and Jyoti talked about in your previous conversations. We know why those are very, very critical, right?


How do you protect against these business logic vulnerabilities? The API discovery and understanding the API risk are very key. Then, on top of those, the protection aspects are very, very key. So, that was where we started. This is part of the vision that we have built out.

Because of the way our new platform has been built, we enable all these understandings. We want to expose these understandings to our customers so they can go and hunt for different types of attacks that may be lurking. They can also use and analyze this information not just for heading off prospective attacks but to help influence all the different types of development and security activities.

This was the vision we began with. How do you bring observability into application security? That’s what we built. We help evolve their overall application security practices.

Gardner: In now understanding your vision, and to avoid a firehose of data and observations, how did you design the Traceable platform to attain automation around API intelligence? How did you make API observability a value that scales?

Continuous comprehension

Nagaraj: One of the key aspects of building a solution is to not just throw data at your customers. That means you are curating the data; you are not just presenting a data lake and asking them to slice, dice, and analyze it using manual processes. The goal from the get-go for us was to understand the APIs and to categorize them in useful ways.

That means we must understand which APIs are external-facing, which are internal-facing, and where the sensitive data is. What amount and type of sensitive data is getting carried through these APIs? Who are the users of these APIs? What roles do they have with an API?
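
As a rough illustration of how that kind of categorization can work in principle -- pattern matching over observed payload values -- here is a hedged sketch; the patterns, labels, and endpoint name are examples, not Traceable AI's actual detection logic.

```python
# Illustrative sketch: flagging APIs that appear to carry sensitive data by
# pattern-matching observed payload values. Patterns and labels are examples,
# not Traceable AI's actual classifiers.
import re

SENSITIVE_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card_number": re.compile(r"\b\d{13,16}\b"),
}

def classify_payload(api: str, payload: dict) -> set:
    """Return the sensitive-data labels observed in this API's payload."""
    labels = set()
    for value in payload.values():
        for label, pattern in SENSITIVE_PATTERNS.items():
            if isinstance(value, str) and pattern.search(value):
                labels.add(label)
    return labels

print(classify_payload("POST /v1/customers",
                       {"email": "jane@example.com", "card": "4111111111111111"}))
# -> {'email', 'card_number'} (order may vary)
```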

We are also building a wealth of insights into how the APIs themselves behave. This helps our customers know what to focus on. It is not just about the data. Data forms a basis for all these other insights. It’s not about presenting the data to the customers and saying, “Hey, go ahead and figure things out yourself.” 

We bring insights that enable the security and operations teams -- along with the developers and DevSecOps teams -- to know what security aspects to focus on. That was a key principle we started to build the product on.

The second principle is that we know the security and operations teams are very swamped. Most of the time they are under-resourced in terms of the right people. It was therefore very important that the data we present to those teams is actionable. The types of protection we provide from detection of anomalies must have very low levels of false positives. That was one of the key aspects of building our solution as well.

A third guiding principle for us, from the DevSecOps team’s perspective, is to give them actionable data to understand the code being deployed, even when the services are deployed in a cloud-native fashion. How do you understand, at the code level, which services are making a database call and where that data is flowing? How do you know which cloud-based APIs are making third-party API calls, and whether there are vulnerabilities in them? That is also very important to manage.

We have taken these principles very seriously as we built the solution. We bring our deep understanding of these APIs together with artificial intelligence (AI) and machine learning (ML) on top of the data to extract the right insights -- and make sure those are actionable insights for our users. That is how we built the platform from the ground up. Because continuous delivery (CD) is how applications are deployed today, it’s very important that we are continuously providing these insights.


It’s not enough to just say, “Hey, here are your APIs. Here are the insights on top of those, and here is where you should be focusing from a risk perspective.” We must also continuously adjust and gain new insights as the APIs evolve and change.

There was one last thing we set out to do. We knew our customers are on a journey to microservices. That means we must provide the solution across diverse infrastructures, for customers fully in a cloud-native microservices environment as well as customers making the journey from legacy, monolithic applications -- and everything in between. We must provide a bridge for them to get to their destinations regardless of where they are.

Gardner: Yes, Traceable AI recently released your platform’s first freely available offering in August. Now that it’s in the marketplace, you’re providing a strong value to developers, by helping them to iterate, improve, and catch mistakes in their APIs design and use. Additionally, by being able to define vulnerabilities in production, you’re also helping security operations teams. They can limit the damage when something goes wrong.

By serving both of those two constituencies, you’re able to bridge the gap between them. Consequently, there’s a cultural assimilation value between the developers and the security teams. Is that cultural bond what you expected?

Reduce risk with secure interactions

Nagaraj: Absolutely. I think you said it right. In a lot of cases, these organizations are rapidly getting bigger and bigger. Typically, today’s microservices-based, API-driven development teams have six to eight members building many pieces of functionality, which eventually form an overall application. That’s the case internally at Traceable AI, too, as we build out our product and platform.

And so, in those cases, it’s very important that there is an understanding around how API requests come into an overall application. How do they translate across all the different services deployed? What are the services – defined as part of those small teams -- and how are they interacting with each other to deliver a single customer’s request? That has a huge impact on understanding the overall risk to the application itself.

The overall risk in a lot of cases is based on a combination of factors driven by all the APIs being exposed to those applications. But knowing all the APIs interacting with these services -- and the data that’s going through these services -- is very important to get a holistic understanding of the application, and the overall application infrastructure, to make sure you’re delivering security at an application level.


It’s no longer enough just to say, “Yes, we are secure. We’re practicing all the secure-coding practices.” You must also ask, “But what are the interactions with the rest of the organization?” That’s why it was essential for us to build what we call API Intelligence from the ground up based on the actual data. We attain a deeper understanding of the data itself.

That intelligence now helps us say, “Hey, here are all the APIs used across your organization. Here’s how they’re interacting with each other. Here’s how the data goes between them. Here are the third-party APIs being accessed as part of those services.”

We get that holistic understanding. That broad and inclusive view is very important because it’s just not about external APIs being accessed. It includes all the internal APIs being built and used, as well, from the many small teams.

Customers often tell me after using our solution that their developers are shocked there are so many APIs in use. In some cases, they thought they were duplicate APIs. They never expected those APIs to show up as part of any single service. It feels good to hear that we are bringing that level of visibility and realization. 

Next, based on our API Intelligence, comes the understanding of the risks. And that is so very important because once the developers understand the risks associated with a particular API, the way they go about protecting them also becomes very important. It means the vulnerabilities are going to get prioritized and then the fixes are going to be prioritized the right way, too. The ways they protect the APIs and put in the guards against these API vulnerabilities will change.

At the end of the day, the goal for us is to bring together the developers and the DevOps and security teams. Whether you look at them as a single team or separate teams, it doesn’t matter for an organization. They all must work together to make security happen. We wanted to provide a single pane of glass for them to all see the same types of data and insights.

Gardner: I have been impressed that the single pane view can impact so many different roles and cultures. I also was impressed with the interface. It allows those different personas to drill down specific to the context of their roles and objectives.

Tell us how that drilling down capability within the Traceable AI user interface (UI) gives the developers an opportunity to compress the time of gaining an understanding of what’s going on in API production and bring that knowledge back into pre-production for the next iteration?

Ounce of pre-production prevention

Nagaraj: One of the key things in any development lifecycle is the stages of testing you go through. Typically, applications get tested in the development and quality assurance (QA) stages along the way.

But one of the “testing” opportunities that can get missed in pre-production is to learn from the production data itself. That is what we are addressing here. As a developer, I like to think that all the tests being written in my pre-production environment cover all the use cases. But the reality is that the way customers use the applications in production can be different than expected. And the type of data that flows through can be different too.

This is even more true now because of API-driven applications. With API-driven applications, the developer has an intent of how their APIs are used, and most of their tests mimic that intent. But once you give the APIs to third-party developers – or hackers -- they might see the same APIs that the developer sees yet use them in unintended ways. Once they gain an understanding of how the API logic has been built internally the external users might be able to get a lot more information than they should be able to.


This is where it gets complex. Rather than treating production and pre-production as silos, the thought process is to bring the production learning and knowledge back to improve the application’s security posture in pre-production, because we know how certain APIs are actually being used.

If we understand the true risks associated with these APIs in use, we can present that in-production-use knowledge back into pre-production, such as users accessing APIs they aren’t supposed to be accessing. That means decisions about which APIs need to be protected differently can be made by using the right kinds of controls.

The core benefit to customers is that they can understand their API risks earlier so that they can protect their APIs better.

Gardner: The good news is there’s new value in post-production and pre-production. But who oversees bringing the Traceable AI platform into the organization? Who signs the PO? Who are the people who should be most aware of this value?

APIs behavior in a single pane of glass

Nagaraj: Yes, there are typically various types of organizations at work. It’s no longer a case of a central security team making all the decisions. There are engineering-driven, DevOps teams that are security-conscious. That means many of our customers are engineering leaders who are making security their top priority. It means that the Traceable AI deployment aspects also come to pre-production and production as part of their total development lifecycle.

One of the things we are exploring as part of our August launch is to make the solution increasingly self-service. We’ve provided a low-friction way for developers and DevOps teams to get value from Traceable AI in their pre-production and production systems, and to make it part of their full lifecycle. We are heavily focused on enabling our customers to have easy deployment as a self-service experience.

On the other hand, when the security and operations teams need to encourage the developers or DevOps teams to deploy Traceable AI, then, of course, that ease-of-use experience is also very important.

A big value for the developers is that they get a single pane of glass, which means they are seeing the same information that the security teams are seeing. It is no longer a case of the security people saying, “There are these vulnerabilities, which is a problem,” or, “There are these attacks we are seeing,” while the developers don’t have the same data. Now, we offer the same types of data by bringing observability from a security perspective to provide the same analysis to both sides of the equation. This makes everyone into a more effective team solving the security problems.

Gardner: And, of course, you’re also taking advantage of the ability to enter an organization through the open-source model. You have a free open-source edition, in addition to your commercial edition, that invites people to customize, experiment, and tailor the observability to their particular use cases -- and then share that development back. How does your open-source approach work?

Nagaraj: We built a distributed tracing platform, which was needed to support all the security use cases. That forms a core component for our platform because we wanted to bring in tracing and observability for API security.

That distributed tracing platform, called Hypertrace, is part of the Traceable AI solution, and it enables developers to adopt the distributed tracing element by itself. As you mentioned, we are making it available for free and as open source.

We’ve also launched a free tier of the Traceable AI security solution which includes the basic versions of API discovery, risk monitoring, and basic protection, for securing your applications. This is available to everybody.

Our idea was that we wanted to democratize access to good API security tools, to help developers easily get the functionality of API observability and risk assessment so that everyone can be a proactive part of the solution. To do this, we launched the Free tier and the Team tier, the latter of which includes more of the functionality found in our Enterprise tier.


That means, as a DevOps team, you’re able to understand your APIs and the risks associated with them, and to enable basic protections on those APIs. We’re very excited about opening this up to everyone.

But the thing that excites the engineer in me is that we are making our distributed tracing platform source code available for people to go build solutions on top of. They can use it in their own environments. At the end of the day, the developers can solve their own business problems. We are in the business of helping them solve the security problems, and they can solve their other business needs.

For us, it is about how we secure their APIs. How do we help them understand their APIs? How can they best discover and understand the risks associated with those APIs? That’s our core, and we are putting it out there for developers and DevOps teams to use.

Gardner: Sanjay, going back to your vision and the rather large task you set out for yourselves, as Traceable AI becomes embedded in organizations, is there an opportunity for the API economy to further blossom?

How big of an impact do you expect to have over the next few years, and how important is that for not only the API economy, but the whole economy?

Economy thrives with continuous delivery

Nagaraj: From an API economy perspective, it’s thriving because of the robust use of these APIs and the reuse of services. Any time we hear news about APIs getting hacked or data getting lost, there is an inclination to say, “Hey, let’s stop the code from shipping,” or, “Let's not ship too many features,” or, “Let's make sure it is secure enough before it ships.”


But that means the continuous delivery benefits powering the API economy are not going to work. We, as a community of developers, must come up with ways of ensuring security and privacy so we can continue to maintain the pace of a continuous software development life cycle. Otherwise, this will all stall. And these challenges will only get bigger because APIs are here to stay. The API economy is here to stay. APIs will be continuously evolving, and they will be delivering more and more functionality on a continuous basis.

The only way we can get better at this is by bringing in the technology that enables the continuous delivery of code that is secured in pre-production and not just at runtime. And that’s the goal from our perspective, to build that long-term and viable solution for enterprises.

Listen to the podcast. Find it on iTunes. Read a full transcript or download a copy. Sponsor: Traceable AI.
