With the tremendous growth in the volume of semi-structured and unstructured content within enterprises
(e.g., email archives and customer support databases), there is increasing interest in harnessing this
content to power search and business intelligence applications. Traditional enterprise infrastructure
for analytics is geared towards structured data (in support of OLAP-driven reporting and
analysis) and is not designed to meet the demands of large-scale, compute-intensive analytics over
semi-structured content. At the IBM Almaden Research Center, we are developing an “enterprise content
analytics platform” that leverages the Hadoop map-reduce framework to support this emerging class of
analytic workloads. Two core components of this platform are SystemT, a high-performance rule-based
information extraction engine, and Jaql, a declarative language for expressing transformations over
semi-structured data. In this paper, we present our overall vision of the platform, describe how SystemT
and Jaql fit into this vision, and briefly describe some of the other components that are under active
development.