Does This Exposed Chinese Database Pose a Security Threat?
The Zhenhua Data Leak Is Scraped Public Data. It Poses No Threat.
A leaked database compiled by a Chinese company has suddenly become the focus of media reports warning that it could be used for espionage by Beijing. But on closer examination, the data is public information that's been scraped, largely from social media sites and other public sources.
On Monday, news outlets including the ABC and the Australian Financial Review released a coordinated scoop about a leaked database from China. The database includes information on prominent members of Australian society, including many politicians.
The database contains details on at least 2.4 million people, including 35,000 Australians, among them Prime Minister Scott Morrison and many business people.
The breathless reporting about the database has stoked fears Beijing may be collecting data on Australians and other people around the world to spy on them. But while it's easy to spin up a furor over anything involving China and cybersecurity, this data exposure deserves a more precise examination.
The database comes from a company called Zhenhua Data. According to Christopher Balding, an American academic in Vietnam, a source in China passed him the data, putting the source "at risk" from the Chinese Communist Party.
"The individual who provided the Shenzhen Zhenhua database by putting themselves at risk to get this data out has done an enormous service and is proof that many inside China are concerned about CCP authoritarianism and surveillance," Balding writes in a blog post on Monday.
What is the OKIDB?
The database is called the Overseas Key Information Database, or OKIDB. As I read the reports about it, I thought I might have seen it before. In fact, I had.
By virtue of being on the cybersecurity beat, I often receive tips about leaks. I've amassed files filled with random leaked data, many of which remain unconfirmed. The OKIDB information had remained in that bucket.
I started posting screenshots on Twitter of the version of the OKIDB I'd seen. I tagged Balding and Robert Potter, the co-founder of a Canberra-based company called Internet 2.0. Balding shared the database with Internet 2.0 to put it into a more digestible format because the version he'd received was corrupted.
I called Potter on Monday morning, and it became clear that the OKIDB that I saw is the same database Balding and Potter possess. In response, some people have rightly asked me why I didn't write about this sooner. Here's the skinny.
Jeremy Kirk (@Jeremy_Kirk), on Twitter, September 14, 2020: "(1/6) The China database is causing a fair amount of stir in Australia, but before we get too spun up about China-spying-targeting-etc., there are a few important points to keep in mind."
The database was brought to my attention in late December 2019 or early January by a computer security researcher who is not based in China. The database had been left on the internet, open for anyone to access, presumably by mistake. In more precise terms, it was an unsecured Elasticsearch cluster.
Elasticsearch is an open-source platform for storing and querying data. By default, Elasticsearch clusters are not publicly accessible. But the clusters can be rolled out in a misconfigured manner, leaving data open on the Internet. Often, it's possible to hunt out misconfigured Elasticsearch instances using device-focused search engines such as Shodan.io.
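As a rough illustration of what "misconfigured" can mean here, the sketch below checks a handful of Elasticsearch settings for two common warning signs: a node bound to a non-loopback address and security features switched off. The setting names network.host and xpack.security.enabled are real Elasticsearch settings, but the helper function, its rules and the treatment of missing values are simplified assumptions for illustration, not Elastic's guidance and not a description of how the OKIDB cluster was actually configured. Defaults differ between Elasticsearch versions, so this is a sketch rather than an audit tool.

```python
# Hypothetical helper: flag elasticsearch.yml settings that tend to expose a node.
# The rules below are simplified assumptions, not Elastic's official guidance.

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1", "_local_"}


def exposure_warnings(settings):
    """Return a list of warnings for a flat dict of elasticsearch.yml settings."""
    warnings = []

    # A node bound only to loopback cannot be reached from other machines.
    host = str(settings.get("network.host", "localhost")).strip()
    if host not in LOOPBACK_HOSTS:
        warnings.append(f"network.host={host} binds beyond loopback")

    # Treated as off when absent: an assumption, since defaults vary by version.
    security = str(settings.get("xpack.security.enabled", "false")).lower()
    if security != "true":
        warnings.append("xpack.security.enabled is not true (no authentication)")

    return warnings


if __name__ == "__main__":
    risky = {"network.host": "0.0.0.0", "xpack.security.enabled": "false"}
    for line in exposure_warnings(risky):
        print("WARNING:", line)
```

A cluster that triggers both warnings and is also reachable through the firewall is the kind of setup that can leave data open to anyone who finds it.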
When I reviewed the data stored in the OKIDB, it appeared impressive mostly for its size - hundreds and hundreds of gigabytes - but otherwise the data didn't appear to be sensitive. All of it seemed to be public. For example, there were bits from U.S. Navy press releases announcing deployments of ships, some of which had been translated into Mandarin.
One of the indices contained a list of U.S. Air Force personnel. It included names and addresses but no birth dates. Those listings contained a couple of interesting fields, such as "airmenID" and "medicalExpirationDate." But that data turned out to be public. There were also entries for U.S. Navy officers, but those included links to public biographies that have been posted on Navy websites.
Other indices contained what appeared to be research papers from think tanks. Copious amounts of data had been copied from sources including Crunchbase and EveryPolitician. Largely, however, I didn't see anything that raised alarms.
Social Media Scraping
So where did all this data originate? The database is related to a domain, aggso[dot]com, which belonged to a commercial Chinese company. The company specialized in aggregating data.
The front page of its now-shuttered website, okidb.aggso[dot]com, mentioned numerous data sources, including LinkedIn, Facebook, Instagram, YouTube, Twitter and Medium. It appeared quite similar to U.S. companies such as Spokeo or Pipl, which mine a variety of public data sources and link them together.
Early views of aggso[dot]com on the Wayback Machine from around 2012 show that it started out as something called the Weiju Social Media Management System.
Over time, the company changed how it marketed itself. The Australian Financial Review reports that Zhenhua Data was recently marketing the data it holds as the "Internet Big Data Military Intelligence System." While the company's website is now offline, it had listed such customers as the People's Liberation Army and Communist Party, the Australian Financial Review reports.
After reviewing the data set in January, I didn't see much to merit a story. To be sure, it contained a huge amount of data, some of which had obvious ties to China, but nothing appeared to be overtly nefarious. I also tried to contact the registrant for aggso[dot]com but received no reply. OKIDB joined the long list of other data exposures that I have learned about but not seen fit to report on.
Risky Data Collection?
I asked Potter this key question: What kind of non-public data is in the database? Because if there is any, it might give more weight to suggestions that the collected data poses a risk.
Potter responded that "it depends on how you define open source" and that "there seemed to be a fair amount in there that had been pinched from other platforms, which in and of itself wasn't open source as a method it was ingested in."
When I asked Potter to define exactly what that meant, he told me that there seemed to be data that was "not classified but they're not public sources." He mentioned data from Factiva, the news-monitoring and research tool from Dow Jones. I pointed out that Factiva content isn't sensitive; it is simply available only to subscribers.
To be sure, there are reasons to be worried about China's cyber activity. U.S. prosecutors have pinned on China some of the largest and most worrisome hacks in memory, including those against the U.S. Office of Personnel Management, Equifax and health insurance giant Anthem. Here in Australia, China has been blamed for attacks on Parliament's email system and against Australian National University. The data from those hacks has never publicly surfaced. If Zhenhua's repository had that kind of data, this would be a much more significant finding.
Caution: I have seen only a small slice of the data. There could be material in there that is highly sensitive. But if that is true, then I call on anyone who's making this out to be a significant national security concern to describe that highly sensitive data more fully. So far, that hasn't happened.
I cringed when I saw the Australian Financial Review's hyperbolic headline contending that this material comprises a "social media warfare database." Anyone who posts material to social media sites or the internet in general should expect that data to be scraped by marketing agencies and others. By this point in the internet's history, everyone should have gotten fair warning that this is the current state of affairs. Be careful what you expose.
Zhenhua Data looks like a company that has done what countless other Western companies have done in the age in which data is the new oil: Collect it and sell it. The company wasn't trying to hide. Neither was it very good at securing its own data.