By Garth Wittles
Traditional business intelligence (BI) vendors tend to agree that clean data is essential to analysis and reporting. But GARTH WITTLES, district manager for Verity in South Africa, believes organisations are losing a huge amount of information if they rigidly apply this approach to data gathering and analysis.
Typically 20% of an organisation's data is structured, which means that it can be used in BI data analysis. The remaining 80% is usually cast aside and is not factored into the analysis and reporting that business executives find so crucial to making decisions.
The volumes associated with those percentages are growing rapidly every year as data storage costs continue to decline and regulatory issues drive data retention. With the ubiquitous deployment of ERP, customer relationship management, supply chain management and other information-gathering systems, the amount of data companies can collect is also rapidly increasing.
The collection of this data does not always follow strict processes - or the processes are not always strictly applied - and the result is unstructured data. For instance, a company may collect customer information through its call centre, a website and paper forms, and the same rules are rarely applied uniformly across these channels. Much of an organisation's information is also stored in text files and Word documents.
Even if all the information is gathered in a database, popular opinion says it remains unsuitable for analysis because of the incomplete fields in the rows of its relational tables. With the right tools, however, dirty XML (eXtensible Markup Language), incomplete relational data and metadata still have value.
The BI world has evolved on a base of structured data with good reason: BI is best defined as giving business executives the facts on which to base their decisions. In the BI technology that delivers them to modern executives, these facts are inherently relational: they are expressed in tables and reduced to the smallest possible pieces of data. For instance, customer data will express a first name; surname; date of birth by day, month and year; identity number; vehicle registration number; and street address, by number, street, suburb and city, along with a postal code. If the postal code is missing, is there no value left in the record?
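The question can be made concrete with a small sketch. The customer records, field names and the customers_per_city helper below are all hypothetical, invented for illustration; the point is only that a report which needs the city loses records unnecessarily when every incomplete row is discarded.

```python
from collections import Counter

# Hypothetical customer records; None marks a field that was never captured.
customers = [
    {"surname": "Nkosi", "city": "Johannesburg", "postal_code": "2196"},
    {"surname": "Botha", "city": "Cape Town", "postal_code": None},
    {"surname": "Naidoo", "city": "Durban", "postal_code": "4001"},
    {"surname": "Smith", "city": "Cape Town", "postal_code": "8001"},
    {"surname": "Dlamini", "city": None, "postal_code": None},
]


def customers_per_city(records, strict):
    """Count customers per city.

    strict=True mimics a rigid clean-data rule: any record with a missing
    field is discarded. strict=False keeps every record that has the one
    field this particular question needs (the city).
    """
    counts = Counter()
    for record in records:
        if strict and any(value is None for value in record.values()):
            continue
        if record["city"] is not None:
            counts[record["city"]] += 1
    return counts


strict_counts = customers_per_city(customers, strict=True)
tolerant_counts = customers_per_city(customers, strict=False)
print("clean records only:", dict(strict_counts))
print("all usable records:", dict(tolerant_counts))
```

In this made-up sample the record without a postal code still contributes to the city count, while the record that lacks a city is left out of both counts.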
Many businesses have attempted to draw their unstructured data into this format, because they realise there is value to be had. But what they have been unable to find is a method of getting it there. Typically, they have employed statistical concept extraction and topic-based categorisation. This has, however, proven inadequate from a traditional BI point of view.
Text files or Word documents are unstructured, being nothing more than a collection of concepts or topics to database operators, making it difficult to extract meaningful data that can be used to develop a relational structure. But that view is changing.
One method of attaining structure from text is entity extraction. For instance, an entity extractor analyses unstructured data containing street addresses and, assisted by a geographic dictionary, pulls out all the addresses and standardises their format using grammars that are layered above the dictionary.
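A minimal sketch of the idea, not a description of any vendor's product, might look like the following. It assumes a tiny hand-made gazetteer (GAZETTEER), a table of street-type abbreviations (STREET_TYPES) and a single regular expression standing in for the grammar; all of these names and the sample text are hypothetical, and a real extractor would use far larger dictionaries and richer grammars.

```python
import re

# Hypothetical geographic dictionary: suburb (lower case) -> city.
GAZETTEER = {
    "rosebank": "Johannesburg",
    "sandton": "Johannesburg",
    "claremont": "Cape Town",
}

# Hypothetical table used to standardise street-type abbreviations.
STREET_TYPES = {
    "st": "Street", "street": "Street",
    "rd": "Road", "road": "Road",
    "ave": "Avenue", "avenue": "Avenue",
}

# The "grammar" layered above the dictionary: a number, a street name,
# a street type, then a suburb that must appear in the gazetteer.
suburbs = "|".join(re.escape(s) for s in GAZETTEER)
types = "|".join(
    re.escape(t) for t in sorted(STREET_TYPES, key=len, reverse=True)
)
ADDRESS = re.compile(
    rf"(?P<number>\d+)\s+(?P<name>[A-Za-z]+(?:\s+[A-Za-z]+)*?)\s+"
    rf"(?P<type>{types})\.?,?\s+(?P<suburb>{suburbs})\b",
    re.IGNORECASE,
)


def extract_addresses(text):
    """Return every address found in text as a standardised record."""
    results = []
    for match in ADDRESS.finditer(text):
        street_name = match.group("name").title()
        street_type = STREET_TYPES[match.group("type").lower()]
        suburb = match.group("suburb").title()
        results.append({
            "number": match.group("number"),
            "street": f"{street_name} {street_type}",
            "suburb": suburb,
            # The city is not in the text; it comes from the dictionary.
            "city": GAZETTEER[suburb.lower()],
        })
    return results


notes = (
    "Customer called from 12 Oxford rd, Rosebank about a late delivery. "
    "Please send the invoice to 7 Main Street Claremont instead."
)
found = extract_addresses(notes)
for address in found:
    print(address)
```

The extracted records have the same fields regardless of how each address was typed, and the city is filled in from the dictionary even though the free text never mentions it.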
Besides the data itself, metadata is key to good structure. Once the unstructured data is processed, key fields can be extracted from the data, and the missing or incorrect metadata fields updated.
One company that used this entity extraction approach is Superpages.com, a US-based online yellow pages directory. It deployed a multifaceted search function that serves nine million unique visitors, generating 150 million page views each month, across structured and unstructured data stored in both English and Spanish. The result of searching both types of data - structured and unstructured - is that traffic to the site grew by 75% in 2002 over the previous year.
The advent of these new entity extraction software tools allows companies to make use of a large portion of their data that until now only consumed vast amounts of disk space.
Companies need only keep this in context when dealing with the information: the data may be incomplete, but that does not mean it has no value, as Superpages.com has found.
The ability to analyse unstructured data will give organisations the information they need to deal with pressure from regulators and shareholders, and to meet a growing need for transparency and accountability.
Headquartered in Sunnyvale, California, Verity is a leading provider of intellectual asset management software. Verity software gives businesses a multitude of ways to improve access to vital information and perform a range of e-business operations, while enhancing the end-user experience.
Verity-powered business portals, which include corporate intranets used for sharing information within an enterprise, e-commerce sites for online selling and market exchange portals for B2B activities, all provide personalised information to employees, partners, customers and suppliers.
Verity products are used by 80% of the Fortune 50 and by more than 4 000 corporations in various markets. Global customers include Adobe Systems, AT&T, Cap Gemini Ernst & Young, Cisco, CNET, Compaq, Dow Jones, EDGAR Online, FairMarket, Financial Times, Globe and Mail, Home Depot, Lotus, SAP, Siemens, Sybase, Time New Media and Timex.
African customers include Absa, Ananzi.co.za, Armscor, BHP Billiton, Caltex Oil, Cipro, Debis Fleet Management, Discovery Health, the Independent Development Corporation (IDC), South African Government Communication and Information System (GCIS), Shell Nigeria, Swaziland Deeds Office and more.
Garth Wittles, Verity, (011) 447-0655, firstname.lastname@example.org
Renee Conradie, FHC Strategic Communications, (011) 608-1228, email@example.com