Enterprise Thinking
Security, Privacy, Big Data, and Informatica: Making Data Safe at the Source of Use
Josh Greenbaum
MAY 21, 2014 01:32 AM

It’s hard to find a set of topics more relevant to the interplay of technology and society than security and privacy. From Glenn Greenwald’s new book on NSA leaker Edward Snowden to the recent finding of a European Union court that Google has to drastically alter the persistence of user data in its services, the societal fallout from the Internet as it enters its Big Data phase is everywhere.

So it was with no small amount of interest that I sat through the first day of Informatica’s user conference last week, listening to how this formerly somewhat boring, and still very nerdy, data integration company is transforming itself into a front-line player in what has become an all-out war to protect the privacy and security of our companies and persons.

The position of Informatica is simple: for optimal usability and control, manage data at the point of use, not at the point of origin. Companies still get to run their back-end data centers using all those legacy tools and skills the IT department cherishes. But when it comes to the multi-petabyte world of wildly disparate data from every conceivable (and a few inconceivable) source, trying to manage, massage, transform, protect, reject, and otherwise deal with data at the source is a Sisyphean task best left to the realm of mythology.

Of course, it’s still easy to walk out of a presentation like Agile Data Integration for Big Data Analytics at GE Aviation and miss this not-so-hidden message: GE Aviation tried doing data transformation at the source for the dozens of engine types and thousands of engines it monitors, and realized after pushing that boulder up the hill that it was better to do the transformation as the data were being loaded into a “data lake” for analysis. Faster processing, more agility, and better results were the key takeaways from GE Aviation’s efforts.
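To make the contrast concrete, here is a minimal, hypothetical sketch of that transform-on-load pattern: raw readings arrive from different source feeds in different shapes and units, and are normalized once, as they land in the “lake,” rather than at each source system. The field names, units, and conversion rules here are all invented for illustration; they are not GE Aviation’s actual schema.

```python
def normalize_reading(raw: dict) -> dict:
    """Map heterogeneous source records onto one schema at load time."""
    # Hypothetical: some feeds report temperature in Fahrenheit, others Celsius.
    if "temp_f" in raw:
        celsius = (raw["temp_f"] - 32) * 5 / 9
    else:
        celsius = raw["temp_c"]
    return {
        # Different feeds also use different identifier keys.
        "engine_id": str(raw.get("engine_id") or raw.get("id")),
        "temp_c": round(celsius, 2),
    }

def load_into_lake(sources: list[list[dict]]) -> list[dict]:
    """Apply the transformation once, while loading, for every source feed."""
    lake = []
    for source in sources:
        lake.extend(normalize_reading(r) for r in source)
    return lake

if __name__ == "__main__":
    source_a = [{"engine_id": 101, "temp_f": 212.0}]  # imperial feed
    source_b = [{"id": 202, "temp_c": 95.5}]          # metric feed
    print(load_into_lake([source_a, source_b]))
```

The point of the pattern is that each source feed stays untouched; the single load path is the one place where schema and unit decisions are made.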

As the conference wore on, with customer stories and announcements of new capabilities like Project Springbok, the Intelligent Data Platform, and Secure@Source, it became clear that Informatica’s brand is poised to become synonymous with something far removed from the collection of three-letter acronyms – MDM, TDM, ILM, DQ, and others – that characterizes much of Informatica’s messaging today.

The big-picture problem that Informatica solves is a not-so-hidden side of the Big Data gold rush now under way. As data grows exponentially in quantity and sources, the ability of companies to manage that data diminishes proportionally. Indeed, what constitutes “managing” data is itself changing at an unmanageable clip.

In the new world of Big Data, data quality has to be managed along five main parameters: is it the right data for the job, is it the right amount of data, is it in the right format to be useful, is its access and use being controlled appropriately, and is it being analyzed and deployed appropriately?

These big, broad parameters in turn raise a whole set of questions about data and its uses: data has to be safe and secure, it has to be reliable and timely, it has to be blended and transformed in order to be useful, it has to be moved in and out of the right kind of databases, it has to be analyzed, archived, tested for quality, made as accessible as necessary and hidden from unauthorized use. Data has to journey from an almost infinite number of potential sources and formats to an equally infinite number of targets, pass through increasingly rigorous regulatory regimes and controls, and emerge safe, useful, reliable, and defensible.

Our data warehouse legacy treats data like water, and models data management on the central utility model that delivers potable water to our communities: centralize all the sources of water into a single water treatment plant, treat the water according to the most rigorous drinking-water standard, and send it out to our homes and businesses. There it moves through a single set of pipes to the sinks, tubs, dishwashers, scrubbers, irrigation systems, and the like, where it is used once and sent on down the drain.

But data isn’t like water in so many ways. Primarily, big data comes from many sources in many different formats, and requires an enormous amount of work before it can be useful. And being useful is very different depending on which data is to be used in which way. Time series data is useful for spotting anomalies, sentiment data has a lot of noise that needs to be filtered, customer data is fraught with errors and duplicates, sensor data is voluminous in the extreme, financial and health-related data are highly regulated and controlled. And if you want to develop new apps and services, you’ll need to figure out how to get your hands on a data set for testing purposes that accurately reflects the real data you’ll eventually want to use without actually using real data that might have confidential or regulated information in it.
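That last problem, producing realistic but safe test data, can be sketched in a few lines: regulated values are replaced with deterministic pseudonyms while the record’s shape and non-identifying fields are kept so tests still behave realistically. The field names and masking scheme below are invented for illustration; commercial test data management tools are far more sophisticated.

```python
import hashlib

def mask_value(value: str, keep_last: int = 0) -> str:
    """Deterministically pseudonymize a string, optionally keeping a suffix."""
    digest = hashlib.sha256(value.encode()).hexdigest()[:8]
    suffix = value[-keep_last:] if keep_last else ""
    return f"X{digest}{suffix}"

def mask_record(record: dict) -> dict:
    """Hypothetical masking rules for a customer record."""
    return {
        "name": mask_value(record["name"]),                     # fully masked
        "account": mask_value(record["account"], keep_last=4),  # keep last 4 digits
        "balance": record["balance"],  # non-identifying, kept for realistic tests
    }

if __name__ == "__main__":
    real = {"name": "Ada Smith", "account": "4111111111111111", "balance": 250.0}
    print(mask_record(real))
```

Determinism matters here: the same real value always masks to the same pseudonym, so joins and duplicates in the test set behave the way they would against production data.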

Trying to deal with these issues as data emerges from its myriad sources isn’t just hard; at times it’s impossible. All too often the data a company uses for mission-critical processes like planning and forecasting comes from a third party – a retailer’s POS data or a supply chain partner’s inventory data – over which the user has no control. All the more reason why Informatica’s notion of dealing with data at the point of use makes the most sense.

So where does Informatica go from here? Judging by my conversations with its customers, there’s a huge market demand, though much of it is not necessarily understood in precisely the terms that Informatica is now addressing. Point-of-use data issues abound in the enterprise; the trick for Informatica is to see that its brand is identified as the solution to the problem at all levels of the enterprise.

Right now there are lots of ways in which these problems are solved that don’t involve Informatica – I was just at Anaplan’s user conference listening to yet another example of how a customer is using Anaplan’s planning tool to do basic master data management at the point of use by training business users to spot data anomalies in the analytics they run against their data.
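The kind of point-of-use anomaly spotting those business users are doing can be as simple as flagging values that sit far from the rest of a series. Here is a minimal sketch using a z-score test; the data and threshold are invented for illustration, and real planning tools apply much richer checks.

```python
from statistics import mean, stdev

def flag_anomalies(values: list[float], threshold: float = 1.5) -> list[float]:
    """Return values more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

if __name__ == "__main__":
    monthly_orders = [100.0, 102.0, 98.0, 101.0, 500.0]  # one suspect entry
    print(flag_anomalies(monthly_orders))  # the 500.0 stands out
```

A check like this catches the symptom at the point of use, which is exactly the author’s point: it tells you the data looks wrong, not why, and it has to be repeated in every tool that consumes the data.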

Using Anaplan to do this isn’t a bad idea – other users of planning engines like Kinaxis do the same thing – but Informatica can and should make the case that planning is planning and data management is data management. Doing this level of analysis at the point of use is – back to the water analogy – akin to testing your water for contamination right before you start to cook. Wouldn’t you rather just start the whole cooking process knowing the water was safe in the first place?
