Monday, November 9, 2009

Part 3 of 4: Web data services--Here's why text-based content access and management plays a crucial role in real-time BI

Listen to the podcast. Find it on iTunes/iPod and Podcast.com. View a full transcript or download a copy. Learn more. Sponsor: Kapow Technologies.

Text-based content and information from across the Web are growing in importance to businesses. The need to analyze web-based text in real time is approaching the importance that structured data held just several years ago.

Indeed, for businesses looking to do even more commerce and community building across the Web, text access and analytics form a new mother lode of valuable insights to mine.

As the recession forces the need to identify and evaluate new revenue sources, businesses need to capture such web data services for their business intelligence (BI) to work better, deeper, and faster.

In this podcast discussion, Part 3 of a series on web data services for BI, we discuss how an ecology of providers and a variety of content and data types come together in several use-case scenarios.

In Part 1 of our series, we discussed how external data has grown in both volume and importance across the Internet, social networks, portals, and applications. In Part 2, we dug even deeper into how to make the most of web data services for BI, along with the need to share inferences from those web data services quickly and easily.

Our panel now looks specifically at how near real-time text analytics fills out a framework of web data services that can form a whole greater than the sum of the parts, and this brings about a whole new generation of BI benefits and payoffs.

To help explain the benefits of text analytics and their context in web data services, we're joined by Seth Grimes, principal consultant at Alta Plana Corp., and Stefan Andreasen, co-founder and chief technology officer at Kapow Technologies. The discussion is moderated by me, Dana Gardner, principal analyst at Interarbor Solutions.

Here are some excerpts:
Grimes: "Noise free" is an interesting and difficult concept when you're dealing with text, because text is just a form of human communication. Whether it's written materials, or spoken materials that have been transcribed into text, human communications are incredibly chaotic ... and they are full of "noise." So really getting to something that's noise-free is very ambitious.

... It's become an imperative to try to deal with the great volume of text -- the fire hose, as you said -- of information that's coming out. And, it's coming out in many, many different languages, not just in English, but in other languages. It's coming out 24 hours a day, 7 days a week -- not only when your business analysts are working during your business day. People are posting stuff on the web at all hours. They are sending email at all hours.

... There are hundreds of millions of people worldwide who are on the Internet, using email, and so on. There are probably even more people who are using cell phones, text messaging, and other forms of communication.

If you want to keep up, if you want to do what business analysts have been referring to as a 360-degree analysis of information, you've got to have automated technologies to do it. You simply can't cope with the flood of information without them.

Fortunately, the software is now up to the job in the text analytics world. It's up to the job of making sense of the huge flood of information from all kinds of diverse sources, high volume, 24 hours a day. We're in a good place nowadays to try to make something of it with these technologies.

Andreasen: ... There is also a huge amount of what I call "deep web," very valuable information that you have to get to in some other way. That's where we come in and allow you to build robots that can go to the deep web and extract information.

... Eliminating noise is getting rid of all this stuff around the article that is really irrelevant, so you get better results.

The other thing around noise-free is the structure. ... The key here is to get noise-free data and to get full data. It's not only to go to the deep web, but also get access to the data in a noise-free way, and in at least a semi-structured way, so that you can do better text analysis, because text analysis is extremely dependent on the quality of data.
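As a rough illustration of what "noise-free" extraction can mean in practice, here is a minimal Python sketch that keeps only the text inside a page's article body and skips navigation, sidebars, and footers. It is not Kapow's method. The assumption that the main text sits inside an `<article>` element, the `NOISE_TAGS` list, and the names `ArticleTextExtractor` and `extract_article_text` are all hypothetical choices made for this example.

```python
from html.parser import HTMLParser

# Assumption: the main text lives inside <article>; these tags count as noise.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer"}


class ArticleTextExtractor(HTMLParser):
    """Collects text found inside <article> but outside any noise tag."""

    def __init__(self):
        super().__init__()
        self.in_article = 0  # nesting depth of <article>
        self.in_noise = 0    # nesting depth of noise tags
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article += 1
        elif tag in NOISE_TAGS:
            self.in_noise += 1

    def handle_endtag(self, tag):
        if tag == "article" and self.in_article:
            self.in_article -= 1
        elif tag in NOISE_TAGS and self.in_noise:
            self.in_noise -= 1

    def handle_data(self, data):
        # Keep non-empty text only when inside the article and outside noise.
        if self.in_article and not self.in_noise and data.strip():
            self.chunks.append(data.strip())


def extract_article_text(html):
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = """
<html><body>
<nav>Home | About</nav>
<article><h1>Flu update</h1><p>Cases rose this week.</p>
<aside>Buy vitamins now!</aside></article>
<footer>Copyright</footer>
</body></html>
"""
print(extract_article_text(page))
```

Real pages rarely mark up their content this cleanly, so a production extractor would need per-site rules or heuristics. The point is only that the advertisement and navigation text never reach the text analysis step.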

Grimes: ... [There are] many different use-cases for text analytics. This is not only on the Web, but within the enterprise as well, and crossing the boundary between the Web and the inside of the enterprise.

Those use-cases can be the early warning of a swine flu epidemic or other medical issues. You can be sure that there is text analytics going on with Twitter and other instant messaging streams and forums to try to detect what's going on.

... You also have brand and reputation management. If someone has started posting something very negative about your company or your products, then you want to detect that really quickly. You want early warning, so that you can react to it really quickly.
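To make the early-warning idea concrete, here is a deliberately naive Python sketch that flags posts mentioning a brand alongside a negative word. Real text analytics products use far richer sentiment models. The brand name "Acme", the `NEGATIVE_WORDS` list, and the function `flag_negative_mentions` are all assumptions invented for this example.

```python
import re

# Hypothetical negative-word list, for illustration only.
NEGATIVE_WORDS = {"terrible", "broken", "refund", "scam", "worst"}


def flag_negative_mentions(posts, brand):
    """Return posts that mention the brand and at least one negative word."""
    flagged = []
    for post in posts:
        # Lowercase the post and split it into simple word tokens.
        words = set(re.findall(r"[a-z']+", post.lower()))
        if brand.lower() in words and words & NEGATIVE_WORDS:
            flagged.append(post)
    return flagged


posts = [
    "Loving my new Acme blender!",
    "Acme support is the worst, I want a refund.",
    "Worst commute ever today.",
]
print(flag_negative_mentions(posts, "Acme"))
```

This sketch only handles single-word brand names and ignores negation and sarcasm. It is meant to show the shape of the task, not a workable detector.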

We have some great challenges out there, but . . . we have great technologies to respond to those challenges.



We have a great use case in the intelligence world. That's one of the earliest adopters of text analytics technology. The idea is that if you are going to do something to prevent a terrorist attack, you need to detect and respond really quickly to the signals out there that something is pending, and you have to have a high degree of certainty that you're looking at the right thing and that you're going to react appropriately.

... Text analytics actually predates BI. The basic approaches to analyzing textual sources were defined in the late '50s. Actually, there is a paper from an IBM researcher from 1958 that defines BI as the analysis of textual sources.

...[Now] we want to take a subset of all of the information that's out there in the so-called digital universe and bring in only what's relevant to our business problems at hand. Having the infrastructure in place to do that is a very important aspect here.

Once we have that information in hand, we want to analyze it. We want to do what's called information extraction, entity extraction. We want to identify the names of people, geographical location, companies, products, and so on. We want to look for pattern-based entities like dates, telephone numbers, addresses. And, we want to be able to extract that information from the textual sources.
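The pattern-based part of that extraction step can be sketched with regular expressions. The Python example below is a minimal illustration, not how any particular text analytics product works. The two patterns cover only ISO-style dates and one US-style phone format, and the names `DATE_RE`, `PHONE_RE`, and `extract_pattern_entities` are hypothetical. Names of people, companies, and products generally need dictionaries or statistical models rather than simple patterns.

```python
import re

# Hypothetical, deliberately narrow patterns; real systems cover many formats.
DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")   # ISO dates such as 2009-11-09
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")  # US-style numbers such as 555-867-5309


def extract_pattern_entities(text):
    """Return pattern-based entities found in text, grouped by type."""
    return {
        "dates": DATE_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }


sample = "Call 555-867-5309 before 2009-11-09 to register."
print(extract_pattern_entities(sample))
```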

Suitable technologies

All of this sounds very scientific and perhaps abstruse -- and it is. But, the good message here is one that I have said already. There are now very good technologies that are suitable for use by business analysts, by people who aren't wearing those white lab coats and all of that kind of stuff. The technologies that are available now focus on usability by people who have business problems to solve and who are not going to spend the time learning the complexities of the algorithms that underlie them.

Andreasen: ... Any BI or any text analysis is no better than the data source behind it. There are four extremely important parameters for the data sources. One is that you have the right data sources.

There are so many examples of people making these kinds of BI applications, text analytics applications, while settling for second-tier data sources, because they are the only ones they have. This is one area where Kapow Technologies comes in. We help you get exactly the right data sources you want.

The other thing that's very important is that you have a full picture of the data. So, if you have data sources that are relevant from all kinds of verticals, all kinds of media, and so on, you really have to be sure you have full coverage of data sources. Getting full coverage of data sources is another thing that we help with.

Noise-free data

We already talked about the importance of noise-free data to ensure that when you extract data from your data source, you get rid of the advertisements and you try to get the major information in there, because it's very valuable in your text analysis.

Of course, the last thing is the timeliness of the data. We all know that people who do stock research get real-time quotes. They get them for a reason: the newer the quotes are, the more confidently they can look into the crystal ball and make predictions about the next few seconds.

The world is really changing around us. Companies need to look into the crystal ball at a nearer and nearer future. If you are predicting what happens in two years, that doesn't really matter. You need to know what's happening tomorrow.