Listen to the podcast. Find it on iTunes. Get the mobile app. Read a full transcript or download a copy.
Here to share how analytics paves the way to better understanding of end-user needs and wants, we're joined by Joel Minton, Director of Data Science and Engineering for TurboTax at Intuit in San Diego. The discussion is moderated by me, Dana Gardner, Principal Analyst at Interarbor Solutions.
Here are some excerpts:
Gardner: Let’s start at a high-level, Joel, and understand what’s driving the need for greater analytics, greater understanding of your end-users. What is the big deal about big-data capabilities for your TurboTax applications?
Minton: There were several things, Dana. We were looking to see a full end-to-end view of our customers. We wanted to see what our customers were doing across our application and across all the various touch points that they have with us to make sure that we could fully understand where they were and how we can make their lives better.
And last but not least, there was the explosion of available technologies to ingest, store, and gain insights that was not even possible two or three years ago. All of those things have made leaps and bounds over the last several years. We’ve been able to put all of these technologies together to garner those business benefits that I spoke about earlier.
Gardner: So many of our listeners might be aware of TurboTax, but it’s a very complex tax return preparation application that has a great deal of variability across regions, states, localities. That must be quite a daunting task to be able to make it granular and address all the variables in such a complex application.
Minton: Our goal is to remove all of that complexity for our users and for us to do all of that hard work behind the scenes. Data is absolutely central to our understanding that full end-to-end process, and leveraging our great knowledge of the tax code and other financial situations to make all of those hard things easier for our customers, and to do all of those things for our customers behind the scenes, so our customers do not have to worry about it.
Gardner: In the process of tax preparation, how do you actually get context within the process?
Minton: We're always looking at all of those customer touch points, as I mentioned earlier. Those things all feed into where our customer is and what their state of mind might be as they are going through the application.
To give you an example, as a customer goes though our application, they may ask us a question about a certain tax situation.
When they ask that question, we know a lot more later on down the line about whether that specific issue is causing them grief. If we can bring all of those data sets together so that we know that they asked the question three screens back, and then they're spending a more time on a later screen, we can try to make that experience better, especially in the context of those specific questions that they have.
As I said earlier, it's all about bringing all the data together and making sure that we leverage that when we're making the application as easy as we can.
Gardner: And that's what you mean by a 360-degree view of the user: where they are in time, where they are in a process, where they are in their particular individual tax requirements?
Minton: And all the touch points that they have with not only things on our website, but also things across the Internet and also with our customer-care employees and all the other touch points that we use try to solve our customers’ needs.
Gardner: This might be a difficult question, but how much data are we talking about? Obviously you're in sort of a peak-use scenario where many people are in a tax-preparation mode in the weeks and months leading up to April 15 in the United States. How much data and how rapidly is that coming into you?
Minton: We have a tremendous amount of data. I'm not going to go into the specifics of the complete size of our database because it is proprietary, but during our peak times of the year during tax season, we have billions and billions of transactions.
We have all of those touch points being logged in real-time, and we basically have all of that data flowing through to our applications that we then use to get insights and to be able to help our customers even more than we could before. So we're talking about billions of events over a small number of days.
Gardner: So clearly for those of us that define big data by velocity, by volume, and by variety, you certainly meet the criteria and then some.
Minton: The challenges are unique for TurboTax because we're such a peaky business. We have two peaks that drive a majority of our experiences: the first peak when people get their W-2s and they're looking to get their refunds, and then tax day on April 15th. At both of those times, we're ingesting a tremendous amount of data and trying to get insights as quickly as we can so we can help our customers as quickly as we can.
Gardner: Let’s go back to this concept of user experience improvement process. It's not just something for tax preparation applications but really in retail, healthcare, and many other aspects where the user expectations are getting higher and higher. People expect more. They expect anticipation of their needs and then delivery of that.
This is probably only going to increase over time, Joel. Tell me a little but about how you're solving this issue of getting to know your user and then being able to be responsive to an entire user experience and perception.
Minton: Every customer is unique, Dana. We have millions of customers who have slightly different needs based on their unique situations. What we do is try to give them a unique experience that closely matches their background and preferences, and we try to use all of that information that we have to create a streamlined interaction where they can feel like the experience itself is tailored for them.
It’s very easy to say, “We can’t personalize the product because there are so many touch points and there are so many different variables.” But we can, in fact, make the product much more simplified and easy to use for each one of those customers. Data is a huge part of that.
Specifically, our customers, at times, may be having problems in the product, finding the right place to enter a certain tax situation. They get stuck and don't know what to enter. When they get in those situations, they will frequently ask us for help and they will ask how they do a certain task. We can then build code and algorithms to handle all those situations proactively and be able to solve that for our customers in the future as well.
So the most important thing is taking all of that data and then providing super-personalized experience based on the experience we see for that user and for other users like them.
Gardner: In a sense, you're a poster child for a lot of elements of what you're dealing with, but really on a significant scale above the norm, the peaky nature, around tax preparation. You desire to be highly personalized down to the granular level for each user, the vast amount of data and velocity of that data.
What were some of your chief requirements at your architecture level to be able to accommodate some of this? Tell us a little bit, Joel, about the journey you’ve been on to improve that architecture over the past couple of years?
Lot of detail
Minton: There's a lot of detail behind the scenes here, and I'll start by saying it's not an easy journey. It’s a journey that you have to be on for a long time and you really have to understand where you want to place your investment to make sure that you can do this well.
One area where we've invested in heavily is our big-data infrastructure, being able to ingest all of the data in order to be able to track it all. We've also invested a lot in being able to get insights out of the data, using Hewlett Packard Enterprise (HPE) Vertica as our big data platform and being able to query that data in close to real time as possible to actually get those insights. I see those as the meat and potatoes that you have to have in order to be successful in this area.
On top of that, you then need to have an infrastructure that allows you to build personalization on the fly. You need to be able to make decisions in real time for the customers and you need to be able to do that in a very streamlined way where you can continuously improve.
We use a lot of tactics using machine learning and other predictive models to build that personalization on-the-fly as people are going through the application. That is some of our secret sauce and I will not go into in more detail, but that’s what we're doing at a high level.
Gardner: It might be off the track of our discussion a bit, but being able to glean information through analytics and then create a feedback loop into development can be very challenging for a lot of organizations. Is DevOps a cultural parallel path along with your data-science architecture?
I don’t want to go down the development path too much, but it sounds like you're already there in terms of understanding the importance of applying big-data analytics to the compression of the cycle between development and production.
Minton: There are two different aspects there, Dana. Number one is making sure that we understand the traffic patterns of our customer and making sure that, from an operations perspective, we have the understanding of how our users are traversing our application to make sure that we are able to serve them and that their performance is just amazing every single time they come to our website. That’s number one.
Number two, and I believe more important, is the need to actually put the data in the hands of all of our employees across the board. We need to be able to tell our employees the areas where users are getting stuck in our application. This is high-level information. This isn't anybody's financial information at all, but just a high-level, quick stream of data saying that these people went through this application and got stuck on this specific area of the product.
We want to be able to put that type of information in our developer’s hands so as the developer is actually building a part of the product, she could say that I am seeing that these types of users get stuck at this part of the product. How can I actually improve the experience as I am developing it to take all of that data into account?
We have an analyst team that does great work around doing the analytics, but in addition to that, we want to be able to give that data to the product managers and to the developers as well, so they can improve the application as they are building it. To me, a 360-degree view of the customer is number one. Number two is getting that data out to as broad of an audience as possible to make sure that they can act on it so they can help our customers.
Gardner: Joel, I speak with HPE Vertica users quite often and there are two major areas that I hear them talk rather highly of the product. First, has to do with the ability to assimilate, so that dealing with the variety issue would bring data into an environment where it can be used for analytics. Then, there are some performance issues around doing queries, amid great complexity of many parameters and its speed and scale.
Your applications for TurboTax are across a variety or platforms. There is a shrink-wrap product from the legacy perspective. Then you're more along the mobile lines, as well as web and SaaS. So is Vertica something that you're using to help bring the data from a variety of different application environments together and/or across different networks or environments?
Minton: I don't see different devices that someone might use as a different solution in the customer journey. To me, every device that somebody uses is a touch point into Intuit and into TurboTax. We need to make sure that all of those touch points have the same level of understanding, the same level of tracking, and the same ability to help our customers.
Whether somebody is using TurboTax on their computer or they're using TurboTax on their mobile device, we need to be able to track all of those things as first-class citizens in the ecosystem. We have a fully-functional mobile application that’s just amazing on the phone, if you haven’t used it. It's just a great experience for our customers.
From all those devices, we bring all of that data back to our big data platform. All of that data can then be queried, because you want to understand, many questions, such as when do users flow across different devices and what is the experience that they're getting on each device? When are they able to just snap a picture of their W-2 and be able to import it really quickly on their phone and then jump right back into their computer and finish their taxes with great ease?
We need to be able to have that level of tracking across all of those devices. The key there, from a technology perspective, is creating APIs that are generic across all of those devices, and then allowing those APIs to feed all of that data back to our massive infrastructure in the back-end so we can get those insights through reporting and other methods as well.
Gardner: We've talked quite a bit about what's working for you: a database column store, the ability to get a volume variety and velocity managed in your massive data environment. But what didn't work? Where were you before and what needed to change in order for you to accommodate your ongoing requirements in your architecture?
Minton: Previously we were using a different data platform, and it was good for getting insights for a small number of users. We had an analyst team of 8 to 10 people, and they were able to do reports and get insights as a small group.
But when you talk about moving to what we just discussed, a huge view of the customer end-to-end, hundreds of users accessing the data, you need to be able to have a system that can handle that concurrency and can handle the performance that’s going to be required by that many more people doing queries against the system.
So we moved away from our previous vendor that had some concurrency problems and we moved to HPE Vertica, because it does handle concurrency much better, handles workload management much better, and it allows us to pull all this data.
The other thing that we've done is that we have expanded our use of Tableau, which is a great platform for pulling data out of Vertica and then being able to use those extracts in multiple front-end reports that can serve our business needs as well.
So in terms of using technology to be able to get data into the hands of hundreds of users, we use a multi-pronged approach that allows us to disseminate that information to all of these employees as quickly as possible and to do it at scale, which we were not able to do before.
Gardner: Of course, getting all your performance requirements met is super important, but also in any business environment, we need to be concerned about costs.
Is there anything about the way that you were able to deploy Vertica, perhaps using commodity hardware, perhaps a different approach to storage, that allowed you to both accomplish your requirements, goals in performance, and capabilities, but also at a price point that may have been even better than your previous approach?
Minton: From a price perspective, we've been able to really make the numbers work and get great insights for the level of investment that we've made.
How do we handle just the massive cost of the data? That's a huge challenge that every company is going to have in this space, because there's always going to be more data that you want to track than you have hardware or software licenses to support.
So we've been very aggressive in looking at each and every piece of data that we want to ingest. We want to make sure that we ingest it at the right granularity.
Vertica is a high-performance system, but you don't need absolutely every detail that you’ve ever had from a logging mechanism for every customer in that platform. We do a lot of detail information in Vertica, but we're also really smart about what we move into there from a storage perspective and what we keep outside in our Hadoop cluster.
We have a Hadoop cluster that stores all of our data and we consider that our data lake that basically takes all of our customer interactions top to bottom at the granular detail level.
We then take data out of there and move things over to Vertica, in both an aggregate as well as a detail form, where it makes sense. We've been able to spend the right amount of money for each of our solutions to be able to get the insights we need, but to not overwhelm both the licensing cost and the hardware cost on our Vertica cluster.
The combination of those things has really allowed us to be successful to match the business benefit with the investment level for both Hadoop and with Vertica.
Gardner: Measuring success, as you have been talking about quantitatively at the platform level, is important, but there's also a qualitative benefit that needs to be examined and even measured when you're talking about things like process improvements, eliminating bottlenecks in user experience, or eliminating anomalies for certain types of individual personalized activities, a bit more quantitative than qualitative.
Do you have any insight, either anecdotal or examples, where being able to apply this data analytics architecture and capability has delivered some positive benefits, some value to your business?
Minton: We basically use data to try to measure ourselves as much as possible. So we do have qualitative, but we also have quantitative.
Just to give you a few examples, our total aggregate number of insights that we've been able to garner from the new system versus the old system is a 271 percent increase. We're able to run a lot more queries and get a lot more insights out of the platform now than we ever could on the old system. We have also had a 41 percent decrease in query time. So employees who were previously pulling data and waiting twice as long had a really frustrating experience.
Now, we're actually performing much better and we're able to delight our internal customers to make sure that they're getting the answers they need as quickly as possible.
We've also increased the size of our data mart in general by 400 percent. We've massively grown the platform while decreasing performance. So all of those quantitative numbers are just a great story about the success that we have had.
From a qualitative perspective, I've talked to a lot of our analysts and I've talked to a lot of our employees, and they've all said that the solution that we have now is head and shoulders over what we had previously. Mostly that’s because during those peak times, when we're running a lot of traffic through our systems, it’s very easy for all the users to hit the platform at the same time, instead of nobody getting any work done because of the concurrency issues.
Because we have much better tracking of that now with Vertica and our new platform, we're actually able to handle that concurrency and get the highest priority workloads out quickly, allow them to happen, and then be able to follow along with the lower-priority workloads and be able to run them all in parallel.
The key is being able to run, especially at those peak loads, and be able to get a lot more insights than we were ever able to get last year.
Gardner: And that peak load issue is so prominent for you. Another quick aside, are you using cloud or hybrid cloud to support any of these workloads, given the peak nature of this, rather than keep all that infrastructure running 365, 24×7? Is that something that you've been doing, or is that something you're considering?
Minton: Sure. On a lot of our data warehousing solutions, we do use cloud in points for our systems. A lot of our large-scale serving activities, as well as our large scale ingestion, does leverage cloud technologies.
We don't have it for our core data warehouse. We want to make that we have all of that data in-house in our own data centers, but we do ingest a lot of the data just as pass-throughs in the cloud, just to allow us to have more of that peak scalability that we wouldn’t have otherwise.
Gardner: We're coming up toward the end of our discussion time. Let’s look at what comes next, Joel, in terms of where you can take this. You mentioned some really impressive qualitative and quantitative returns and improvements. We can always expect more data, more need for feedback loops, and a higher level of user expectation and experience. Where would you like to go next? How do you go to an extreme focus even more on this issue of personalization?
Minton: There are a few things that we're doing. We built the infrastructure that we need to really be able to knock it out of the park over the next couple of years. Some of the things that are just the next level of innovation for us are going to be, number one, increasing our use of personalization and making it much easier for our customers to get what they need when they need it.
So doubling down on that and increasing the number of use cases where our data scientists are actually building models that serve our customers throughout the entire experience is going to be one huge area of focus.
Another big area of focus is getting the data even more real time. As I discussed earlier, Dana, we're a very peaky business and the faster than we can get data into our systems, the faster we're going to be able to report on that data and be able to get insights that are going to be able to help our customers.
Our goal is to have even more real-time streams of that data and be able to get that data in so we can get insights from it and act on it as quickly as possible.
The other side is just continuing to invest in our multi-platform approach to allow the customer to do their taxes and to manage their finances on whatever platform they are on, so that it continues to be mobile, web, TVs, or whatever device they might use. We need to make sure that we can serve those data needs and give the users the ability to get great personalized experiences no matter what platform they are on. Those are some of the big areas where we're going to be focused over the coming years.
Gardner: Now you've had some 20/20 hindsight into moving from one data environment to another, which I suppose is equivalent of keeping the airplane flying and changing the wings at the same time. Do you have any words of wisdom for those who might be having concurrency issues or scale, velocity, variety type issues with their big data, when it comes to moving from one architecture platform to another? Any recommendations you can make to help them perhaps in ways that you didn't necessarily get the benefit of?
Minton: To start, focus on the real business needs and competitive advantage that your business is trying to build and invest in data to enable those things. It’s very easy to say you're going to replace your entire data platform and build everything soup to nuts all in one year, but I have seen those types of projects be tried and fail over and over again. I find that you put the platform in place at a high-level and you look for a few key business-use cases where you can actually leverage that platform to gain real business benefit.
When you're able to do that two, three, or four times on a smaller scale, then it makes it a lot easier to make that bigger investment to revamp the whole platform top to bottom. My number one suggestion is start small and focus on the business capabilities.
Number two, be really smart about where your biggest pain points are. Don’t try to solve world hunger when it comes to data. If you're having a concurrency issue, look at the platform you're using. Is there a way in my current platform to solve these without going big?
Frequently, what I find in data is that it’s not always the platform's fault that things are not performing. It could be the way that things are implemented and so it could be a software problem as opposed to a hardware or a platform problem.
So again, I would have folks focus on the real problem and the different methods that you could use to actually solve those problems. It’s kind of making sure that you're solving the right problem with the right technology and not just assuming that your platform is the problem. That’s on the hardware front.
As I mentioned earlier, looking at the business use cases and making sure that you're solving those first is the other big area of focus I would have.
Listen to the podcast. Find it on iTunes. Get the mobile app. Read a full transcript or download a copy. Sponsor: Hewlett Packard Enterprise.
You may also be interested in:
- Spirent leverages big data to keep user experience quality a winning factor for telcos
- Powerful reporting from YP's data warehouse helps SMBs deliver the best ad campaigns
- IoT brings on development demands that DevOps manages best, say experts
- Big data generates new insights into what’s happening in the world's tropical ecosystems
- DevOps and security, a match made in heaven
- How Sprint employs orchestration and automation to bring IT into DevOps readiness
- How fast analytics changes the game and expands the market for big data value
- How HTC centralizes storage management to gain visibility and IT disaster avoidance
- Big data, risk, and predictive analysis drive use of cloud-based ITSM, says panel
- Rolta AdvizeX experts on hastening big data analytics in healthcare and retail
- The future of business intelligence as a service with GoodData and HP Vertica
- Enterprises opting for converged infrastructure as stepping stone to hybrid cloud