Big Data: How to Use It
Meaningful Use of Data is Business' Greatest Challenge
"When incidents occur, we don't have to track down the data to be able to analyze it," Caldiero, a data scientist at Zions, says in an interview with Information Security Media Group [transcript below]. "We're able to have that at our fingertips ready to go to get answers to those questions."
Big data also helps Zions with cross-channel fraud detection, he says. "Our desire is to perform a little bit better in bringing together all the different, disparate data sources," Caldiero says. "It has given us the ability to put all those puzzle pieces together in a way that makes sense."
But big data is only as beneficial as the security professionals behind it, says Fowkes, who heads up Zions' security analytics department. One major challenge: keeping biases from interfering with analytics.
"We all do this," Fowkes says. "But if we're not careful, these biases can impede our ability to understand new attack methods and trends."
During this interview, Fowkes and Caldiero discuss:
- How open-source and proprietary big data tools can be used in tandem to improve fraud detection and prevention;
- The role big data plays in forensics investigations and in predictive fraud analysis;
- How organizations can enhance big-data teams by bringing on experts who understand the information.
At Zions Bancorp, Fowkes is responsible for driving the strategy for the bank's security data warehouse, which also involves the oversight of the security analytics and fraud prevention departments. Fowkes has 15 years' experience in audit, information security and fraud prevention, including more than eight years of experience in working with big data.
Caldiero performs data mining, statistical modeling and analytics on Zions' security data warehouse, known as the Hadoop cluster. Using data science tools, Caldiero builds, implements and maintains risk models for fraud and malware detection. He has nearly a decade of experience in analytics and financial services.
Security Analytics Department
TRACY KITTEN: What can you tell us about your security analytics department and the role it plays within your institution?
MICHAEL FOWKES: The Security Analytics Department was formed a few years ago after we had some initial success with using some of the newly available big data tools, and that was success with aggregating our security logs, performing long-term data correlation, where the correlations were spanning days and not minutes, as well as having a low-cost solution in place to store large amounts of data over a long period of time. Once we had those initial successes, the team was then formally put together. It's primarily leveraged by our information security, fraud prevention and risk departments to assist with ingesting new data sets into our security data stores and doing data-mining activities and model-building activities using that data, as well as building software tools and utilities to assist in achieving our data analytics goals.
Improving Information Sharing
KITTEN: Has the creation of the department helped the bank improve information sharing among its departments as well as with external entities?
FOWKES: Yes it has. This is not something that we initially expected or set out to accomplish. We've seen some of the benefits that we have gotten from this and it has been the ability to provide detailed information about specific events in a timely fashion, or having conversations with others, and the ability to quickly test out hypotheses or hunches about particular security events. We found by being able to do that, it has greatly improved the quality of the conversations and the resulting decisions that are being made when working with others. But, again, it wasn't something that we initially set out to accomplish. It was a pleasant side-effect.
Fraud Analytics: Lessons Learned
KITTEN: What would you say have been some of the biggest lessons Zions Bank has learned, where analytics relative to fraud prevention and fraud detection are concerned?
FOWKES: To start, it's very hard to predict what your data needs are going to be in the future. On collecting data, there's a tendency, which is probably a carry-over from the days of building traditional data warehouses, to summarize data where possible and to remove data that's not relevant to the problems or questions you're trying to find answers to right now. After a while, we got tired of having to constantly go back and make adjustments with what data elements we were loading into our data store. It got to the point where we just decided to load everything, and, from an operational perspective, that made life much simpler.
However, it did create some other problems that we didn't expect, and I think these are problems that exist with big data in general. Big data also means big noise, and this is a topic that has been covered to some degree in Nate Silver's recent book. It can be a lot easier to find relationships in things where they don't actually exist when you have access to a lot of data, so it kind of ends up being a double-edged sword.
Another challenge that we ran across is that you need to figure out a way to tame the biases that you have about the data or events that you're reviewing. We all do this. We build up biases over time as part of our journey in becoming security experts. But if we're not careful, these biases can impede our ability to understand new attack methods and trends.
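Fowkes' "big noise" point can be illustrated with a small simulation. The sketch below is not Zions' method. It assumes purely random data, an arbitrary sample size of 30 and an arbitrary correlation cutoff of 0.4, and it uses hypothetical function names. It counts how many random features appear correlated with a random target, showing that the number of false "relationships" can only grow as more features are loaded.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_spurious(num_features, num_samples=30, threshold=0.4, seed=0):
    """Count random features whose correlation with a random target
    exceeds the threshold. Every hit is noise by construction, because
    the features and the target are generated independently."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    target = [rng.gauss(0, 1) for _ in range(num_samples)]
    hits = 0
    for _ in range(num_features):
        feature = [rng.gauss(0, 1) for _ in range(num_samples)]
        if abs(pearson(feature, target)) > threshold:
            hits += 1
    return hits


# Compare a narrow data set with a "load everything" data set.
few = count_spurious(num_features=20)
many = count_spurious(num_features=2000)
print(f"spurious hits with 20 features: {few}")
print(f"spurious hits with 2000 features: {many}")
```

Because the seed is fixed, the first 20 features are the same in both runs, so the larger run can never report fewer false hits than the smaller one.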
Improving Cross-Channel Detection
KITTEN: How has big data and some of the work you've done helped the bank to improve fraud detection from a cross-channel perspective?
AARON CALDIERO: Cross-channel fraud is a very complex problem - and a problem that's well-suited for big data. There are a lot of commercial products in this space that don't really truly perform on full cross-channel analytics. They just aggregate fraud alerts together from several channels, and our desire is to perform a little bit better than those in bringing together all the different disparate data sources. [It] has given us the ability to put all those puzzle pieces together in a way that makes sense and a way that can be impactful to the business.
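The difference between aggregating alerts and correlating raw events across channels can be sketched in a few lines. Caldiero does not describe Zions' actual models, so everything here is an assumption made for illustration: the event layout (account, channel, timestamp), the channel names, the three-day window and the function name. The sketch flags accounts that show activity in at least two distinct channels within the window.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def cross_channel_accounts(events, window=timedelta(days=3), min_channels=2):
    """Return accounts with events in at least `min_channels` distinct
    channels inside any time span no longer than `window`.

    `events` is an iterable of (account, channel, timestamp) tuples;
    this layout is hypothetical, not a real schema.
    """
    by_account = defaultdict(list)
    for account, channel, timestamp in events:
        by_account[account].append((timestamp, channel))

    flagged = set()
    for account, items in by_account.items():
        items.sort()  # chronological order
        start = 0
        counts = defaultdict(int)  # channel -> events inside the window
        for end in range(len(items)):
            counts[items[end][1]] += 1
            # Shrink the window from the left until it spans <= `window`.
            while items[end][0] - items[start][0] > window:
                old_channel = items[start][1]
                counts[old_channel] -= 1
                if counts[old_channel] == 0:
                    del counts[old_channel]
                start += 1
            if len(counts) >= min_channels:
                flagged.add(account)
                break
    return flagged


# Hypothetical events from two channels.
sample_events = [
    ("acct-1", "online", datetime(2013, 1, 1, 9, 0)),
    ("acct-1", "wire", datetime(2013, 1, 2, 14, 0)),
    ("acct-2", "online", datetime(2013, 1, 1, 9, 0)),
    ("acct-2", "wire", datetime(2013, 1, 20, 9, 0)),
]
print(cross_channel_accounts(sample_events))
```

A window measured in days rather than minutes matches the long-term correlation Fowkes mentions earlier. A production model would weigh far more than channel counts.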
KITTEN: What about fraud prevention? How is your bank using big data?
CALDIERO: I can't get too specific there. But I can say that it's part of our daily operations. We're able to use it from soup to nuts, from beginning to end, as part of our full fraud prevention. We use it for researching, forensics and all sorts of analytics in between.
KITTEN: How is big data being used for forensics after an attack or data mining to forecast potential attacks?
CALDIERO: In this space, it allows us to perform the work much quicker than we were able to in the past. When incidents occur, we don't have to track down the data to be able to analyze it. We're able to have that at our fingertips ready to go to get answers to those questions, and the data repository also acts as a large source for testing models as we develop them.
Big Data Management
KITTEN: Zions handles its own data in-house with an open-source platform. What can you tell me about the platform that you use for your big data management?
FOWKES: As far as the big data repository goes, we're using Hadoop and Hive. Hive is a data warehousing application that was developed for Hadoop, and we use this as our primary data store. The vendor we have chosen for Hadoop is MapR. However, we've used others in the past. To load data into this environment, we've developed some of our own ETL tools in-house - ETL stands for extract, transform and load - to manage loading data into our Hadoop environment in clusters. We use these custom tools to take care of scheduling interdependencies that exist in job flows - for example, where you wouldn't want to kick off a model or a report to run until the data has been fully loaded into our environment.
CALDIERO: In the analytics space, we use a lot of custom scripting as well as Hive in order to access the data through Hadoop, and we also use R, which is an open-source programming language for stats, for doing a lot of the heavy lifting for analytics, and anything in between - the right tool for the right job. We even use Excel for a lot of things when we need to, and for communicating results to the business.
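The job-flow scheduling Fowkes describes, where a model or report must wait until its data has fully loaded, can be sketched as a dependency graph. Zions' in-house ETL tools are not public, so this is only a minimal illustration under stated assumptions: the job names and the `run_jobs` helper are hypothetical, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter


def run_jobs(dependencies, actions):
    """Run jobs in dependency order.

    `dependencies` maps each job name to the set of jobs that must
    finish first; `actions` maps each job name to a callable.
    A job is skipped if it raises or if any upstream job did not
    complete. Returns (completed, skipped) as lists of job names.
    """
    completed, skipped = [], []
    for job in TopologicalSorter(dependencies).static_order():
        upstream = dependencies.get(job, ())
        if any(dep not in completed for dep in upstream):
            skipped.append(job)  # data not fully loaded, do not run
            continue
        try:
            actions[job]()
        except Exception:
            skipped.append(job)
            continue
        completed.append(job)
    return completed, skipped


# Hypothetical job flow: two loads feed a model, which feeds a report.
JOB_DEPENDENCIES = {
    "load_security_logs": set(),
    "load_transactions": set(),
    "fraud_model": {"load_security_logs", "load_transactions"},
    "daily_report": {"fraud_model"},
}

job_actions = {
    name: (lambda name=name: print(f"running {name}"))
    for name in JOB_DEPENDENCIES
}

done, not_run = run_jobs(JOB_DEPENDENCIES, job_actions)
print("completed:", done)
print("skipped:", not_run)
```

If either load fails, the model and the report are skipped rather than run against partial data, which is the situation the custom tools are described as guarding against.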
Open-Source vs. Proprietary Platforms
KITTEN: What would you say is the primary difference between an open-source platform and a proprietary platform?
FOWKES: The big data space is very fluid right now. There are a lot of innovative commercial products that are being developed and are hitting the market, as well as great tools that have been developed by companies and academia that have been turned into open-source projects. One of the key differences we have found between the two is the level of support that's provided. If you decide to use an open-source application that's popular, then there's usually pretty good support available through the open-source community. But a lot of times the people that are providing this support are doing it on the side, and it's not their primary job. You don't have a guarantee in what the response time will be if you ask a question to help you out with a problem you've run into. This is one of the main potential benefits that exists with using a commercial provider. In a sense, it boils down to the level of expertise you want to develop and maintain in-house.
KITTEN: Michael, would you say that it's possible for an institution to develop some type of hybrid approach, one that includes some open-source as well as some proprietary options?
FOWKES: Certainly they can, and this is the path that we have chosen ourselves, where we're using a collection of supported products and tools in this big data space as well as some open-source tools, along with some custom in-house-developed things that have been built. You can certainly go either direction, fully supported or commercially available, or all open-source and custom-developed. We've taken the approach of picking the best of breed.
In-House Approach
KITTEN: Would you say that handling big data in-house allows your institution to be somewhat more nimble and respond to fraud events more quickly?
FOWKES: Yes. Handling our big-data implementation along with the various analytic tools within our department has allowed us to be more nimble, mainly from the perspective of how quickly we can implement a new tool or load new data into our Hadoop cluster or even implement a new model. As I mentioned before, there's a trade-off. In order to be successful, you need to make sure that you build the right team with members that have the right skill set.
At least through our experience, we flagged four different areas where we believed we needed to have some level of expertise in-house. First, we need to have people who have domain expertise, so that you're able to understand the data you're capturing and loading. Second, you need to have people on the team with data analytics and statistics backgrounds, in order to efficiently work with the data and build models. Also we've found, especially in the tool space, that having some folks with software development skills is quite handy. To build data-wrangling tools, if you're using an open-source application, you may need to tweak it to meet your needs. Lastly, with system and application administration, you need to have folks who can manage these tools. Those last few, if you're using a commercial tool, you may not have to do as much in-house. But we believe that there can be a trade-off. I guess it doesn't necessarily have to be a trade-off with how quickly you can implement something or make a change. We've made the decision that we definitely want the ability to change fast, so we decided to pull that expertise in-house.
CALDIERO: From a cost-savings perspective, it's a balance as well. If you have to create an entire department and bring in different levels of expertise, you're obviously going to increase your cost. But you might be saving, if you're not relying on another provider.
FOWKES: When we started this, we selected people from within the existing security division to pull onto this team. There was not a lot of new capital to put this group together. The main area that we had to put some additional spend in place was on the data analytics side. We didn't have people already in-house that had backgrounds like Aaron has with statistics and advanced mathematical modeling.
Building the Right Team
KITTEN: Are there any final thoughts about the creation of a new department or just big data and fraud prevention and detection generally you'd like to share?
CALDIERO: I just want to reiterate the comments about having the right people on the team. In the big data space, there are all sorts of tools out there; but the tools are only as good as the people using them. If you have the right people using those right tools, it's going to give you the biggest bang for the buck, and be the most effective and impactful.
FOWKES: This is a really exciting time to be involved in security and fraud analytics. The barriers to entry have been lowered with the availability of new open-source tools. At the same time, it's creating all kinds of interesting new possibilities and career opportunities as well with data science, with data science being specifically applied to fraud and information security. But as Aaron alluded to, there also seems to be a misconception with big data that it will magically solve all of your analytical problems. This is certainly not the case. There are definitely new tools and techniques that can be used to solve problems when using data at a scale and at a price that haven't been possible before. But, as Aaron mentioned, it still requires people to interpret the result and derive insight or information from the data. Big data tools do not solve the problem as it's sometimes alluded to in the marketing materials. It does take people, and I guess if there's one thing that we could try to get across to people it is that. That's the key thing.