Text mining

Information often lies, unstructured and inaccessible, in disparate document formats. Data mining can open up this seam of data and deliver valuable business intelligence from it.

Raoul Jetley ABB Corporate Research, Bangalore, India, raoul.jetley@in.abb.com

It is estimated that up to 80 percent of all information in organizations is stored in an unstructured text format. This information includes customer requirements, sales dossiers, technical specifications, maintenance reports and stakeholder feedback. It is difficult to extract business intelligence from such disparate data using traditional data analysis methods, so text-based data mining, or text mining, is used instead →1.

01 Large amounts of valuable data can be hidden in company documentation. Data mining can bring it to light.

Simply put, text mining is the set of processes required to transform unstructured text documents or resources into meaningful, structured information. The structured information can then be used to automatically discover hidden patterns and predict future outcomes using a combination of statistical, linguistic and pattern-recognition techniques.

Text mining is an interdisciplinary field that draws on information retrieval, data mining, machine learning, statistics and computational linguistics. These techniques are used to discover and present knowledge – facts, business rules and relationships – that is otherwise locked in textual form, impenetrable to automated processing.

A typical text mining process includes the following steps:
• Identify and preprocess the text to be mined. This step involves cleaning up the text to remove unnecessary information, splitting it into individual tokens (i.e., smaller components) and identifying parts of speech based on the grammar of the language used.
• Extract relevant information and transform it into structured data. Information is retrieved by searching through the tokenized text and storing the results in a more structured, organized manner that is amenable to further analyses.
• Select important features to build concept and category models. The number of concepts present in unstructured data is typically very large. The key to this step is to identify the most relevant features and use these to build meaningful models based on data categories and relationships.
• Analyze the structured data to discover relationships between the concepts. At this point, the text mining process merges with the traditional data mining process. Classic data mining techniques, such as clustering, prediction and classification, can be used on the structured data resulting from the previous steps.
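
The four steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not a production pipeline: the sample documents, the stop-word list and all function names are hypothetical, part-of-speech tagging is omitted, feature selection is reduced to a simple document-frequency threshold, and the analysis step is a nearest-neighbor search using cosine similarity rather than a full clustering algorithm.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list and sample documents, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "to", "in", "after"}
DOCS = [
    "Pump bearing failure reported after overheating.",
    "Customer praised the fast delivery of the drive.",
    "Bearing overheating caused a second pump failure.",
    "Delivery of the drive was fast, the customer said.",
]

def preprocess(text):
    """Step 1: clean up the text and split it into lowercase tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def to_structured(docs):
    """Step 2: store each document as a structured term-count record."""
    return [Counter(preprocess(d)) for d in docs]

def select_features(records, min_df=2):
    """Step 3: keep only terms that occur in at least min_df documents."""
    doc_freq = Counter(term for record in records for term in record)
    return sorted(t for t, count in doc_freq.items() if count >= min_df)

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_neighbours(vectors):
    """Step 4: for each document, find the index of the most similar other one."""
    result = []
    for i, u in enumerate(vectors):
        scored = [(cosine(u, v), j) for j, v in enumerate(vectors) if j != i]
        result.append(max(scored)[1])
    return result

records = to_structured(DOCS)
features = select_features(records)
vectors = [[record[f] for f in features] for record in records]
neighbours = nearest_neighbours(vectors)

for i, j in enumerate(neighbours):
    print(f"document {i} is most similar to document {j}")
```

With these sample sentences, the two maintenance reports pair up with each other, as do the two pieces of customer feedback. In practice, a dedicated natural language processing library would typically replace the hand-written tokenizer and stop-word list.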

Common applications resulting from these analyses include recognition of named entities, automatic summarization, categorization based on relevant features, and mining for customer sentiments and opinions expressed within the text→2.

02 Data mining can deliver important, hitherto unavailable, insights for sales and marketing, business decision making, investment or purchase decisions, customer relationships, etc.
