80% of Your Data is Unstructured and Potentially Unprotected – 4 Strategies to Tackle This

Many organizations have secured their structured data – about 20% of their data assets. Here are practical approaches to safeguarding the rest of your sensitive information.

When it comes to data-centric security, there are two priorities that every organization must contend with. The first is to protect structured data (information stored in rows and columns within databases), which typically makes up 10-20% of a data estate. The second is to protect unstructured data (documents, emails, and everything else), which represents the remaining 80-90%.

There are a few key reasons why IT and risk management teams often prioritize protecting structured data over unstructured data:

  • Structured data is easier to discover and classify. It resides in well-known systems like databases and servers. Unstructured data is more scattered and opaque.

  • Structured data is known to contain sensitive information such as financial data, customer information, and intellectual property, so the risk impact of its exposure is more apparent.

  • There are well-defined schemas, standards, and access control methods for structured data. Unstructured data is considered messy and much harder to govern at scale.

  • Mature security tools for encryption, tokenization, masking, etc., exist for databases, making structured data easier to govern. Standards like Trusted Data Format (TDF) and best practices for protecting unstructured data are emergent.

Despite the dynamics above, organizations must focus on effectively securing and governing both structured and unstructured data simultaneously as part of an end-to-end zero-trust security strategy.

Here are some strategies to employ right now to help accomplish this feat.

1. Inventory and identify all data

As security expert Bruce Schneier once said, “You can’t secure what you don’t understand.” This applies to systems as well as to the scope of the data your organization manages. Think about your HR department, your sales teams that can access sensitive customer data instantly, and your IT teams managing passwords and API tokens.

Sensitive data lives — and moves — everywhere, and organizations should start by inventorying all their data assets, both structured and unstructured, and identifying where that data resides.

Of course, this is impossible to do manually, as there is simply too much information to consider. Data discovery tools can help enterprises not only shine a light on all their datasets but also add context to that data, including who has access to it, what it is being used for, and more.
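
To make this concrete, below is a minimal sketch (in Python) of what a first-pass discovery scan over a file share might produce. The root path is hypothetical, and the owner lookup relies on the POSIX-only pwd module; real discovery tools also cover databases, SaaS repositories, email, and endpoints, but they generate the same kind of inventory record (location, owner, type, size) shown here.

```python
# Minimal discovery sketch: walk a directory tree and emit one inventory
# record per file. Uses the POSIX-only 'pwd' module to resolve owners.
from pathlib import Path
import pwd

def inventory(root: str) -> list[dict]:
    records = []
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        st = path.stat()
        records.append({
            "path": str(path),                          # where the data resides
            "owner": pwd.getpwuid(st.st_uid).pw_name,   # who controls it
            "type": path.suffix.lower() or "unknown",   # rough content type
            "size_bytes": st.st_size,
        })
    return records

if __name__ == "__main__":
    for record in inventory("/srv/shared"):  # hypothetical file share
        print(record)
```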

2. Classify data by sensitivity, value, and purpose

Once an enterprise has inventoried and identified all data, and put it into context, it can then prioritize and classify the information. Data should be sorted and classified based on its sensitivity and its value or purpose.

For example, a particular category of unstructured data may be highly valuable, very sensitive, or (most likely) both, and should be classified accordingly. If the data contains personally identifiable information, for instance, the set could be classified as “top secret” or “secret.”

These classifications could also apply to manufacturing organizations handling intellectual property or critical infrastructure data. While these may seem like government-specific terms, many global regulations, such as the General Data Protection Regulation (GDPR), require enterprises to classify data in similar ways.

Conversely, most organizations also have a lot of unstructured data that is not extremely important or sensitive. This information can receive a lower classification – for instance, “confidential” or even “unclassified.” Such data might need minimal protection. In these instances, security managers can focus most of their attention on the higher classification levels, saving time while still ensuring that high-value or sensitive assets are protected.
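
As a rough illustration of how this tiering can be automated, the sketch below assigns one of the levels mentioned above based on simple pattern matching. The patterns and tier rules are illustrative assumptions only; production classifiers use validated detectors, context, and machine learning, but the tiering logic is similar.

```python
# Illustrative rule-based classifier: map pattern hits to the tiers
# discussed above. Patterns are deliberately simplistic.
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def classify(text: str) -> str:
    hits = {name for name, pattern in PII_PATTERNS.items() if pattern.search(text)}
    if {"ssn", "credit_card"} & hits:
        return "secret"        # direct identifiers warrant a high tier
    if hits:
        return "confidential"  # indirect identifiers, e.g. email addresses
    return "unclassified"      # no sensitive markers detected

print(classify("SSN on file: 123-45-6789"))    # secret
print(classify("Contact: jane@example.com"))   # confidential
print(classify("Quarterly all-hands agenda"))  # unclassified
```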

This exercise can also benefit from cross-functional collaboration with risk management and data governance teams to ensure the full scope of organizational compliance needs is addressed.

3. Tag data appropriately

Once data is classified, organizations can use digital tools to tag each data object according to its classification level. Tags help dictate the security policies organizations place around certain datasets and restrict access to authorized users.
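
As a minimal illustration of the mechanics, the sketch below records each object’s tags in a sidecar JSON file. The sidecar design is an assumption made for simplicity; enterprise tagging tools typically embed labels in the object’s own metadata so the tags travel with the file.

```python
# Minimal tagging sketch: persist classification tags in a sidecar JSON
# file next to the object so downstream policy engines can read them.
import json
from pathlib import Path

def tag_file(path: str, classification: str, owner: str) -> None:
    Path(path + ".tags.json").write_text(json.dumps({
        "source": path,
        "classification": classification,  # e.g. "secret", "confidential"
        "owner": owner,
    }, indent=2))

def read_tags(path: str) -> dict:
    return json.loads(Path(path + ".tags.json").read_text())

tag_file("q3-forecast.xlsx", "confidential", "finance-team")  # hypothetical file
print(read_tags("q3-forecast.xlsx")["classification"])        # confidential
```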

Organizations at the beginning of this journey can start simply with labels in Google Drive, keeping in mind that Drive labels are purely for organization and do not have any inherent access controls associated with them.

Still, tagging presents certain challenges, particularly when it comes to unstructured data. Sensitive unstructured data can show up anywhere, including on a person’s laptop or mobile device. Plus, data handling requirements are continually changing. Data that might be considered highly sensitive or confidential one day could be made more widely available the next.

Access privileges also change frequently as people leave or are promoted within organizations. As with the classification process, cross-functional collaboration and leadership go a long way toward getting teams on board and invested in appropriately categorizing and tagging the data they create.

This is an information security priority, but it is also a data governance, risk management, and organization-wide initiative. Even with cross-functional support, however, it is difficult for even the most diligent security team to keep up. This leads to the final and perhaps most important data security strategy.

4. Employ data-centric security policies

Ultimately, it is the data we are trying to protect. Applying security to the data itself ensures that all data, whether structured or unstructured, remains well protected wherever it resides. With data-centric security, all types of unstructured data assets like documents, videos, and emails, can receive their own protective wrapper, replete with unique policies, access controls, and encryption.
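
The sketch below illustrates the wrapper idea in miniature. It is not the TDF specification, just an assumed minimal design in which the payload is encrypted and bundled with the policy that governs it, using the third-party cryptography package.

```python
# Illustrative per-object wrapper (not the actual TDF format): encrypt the
# payload and bundle it with the policy that governs access to it.
# Requires: pip install cryptography
from cryptography.fernet import Fernet

def wrap(payload: bytes, policy: dict) -> dict:
    key = Fernet.generate_key()
    return {
        "ciphertext": Fernet(key).encrypt(payload),
        "policy": policy,  # attributes a requester must satisfy
        # In a real system the key would be held by a key-access service
        # that releases it only after a policy check passes; it is kept
        # inline here purely to make the sketch self-contained.
        "key": key,
    }

obj = wrap(b"2024 expansion plan", {"department": "strategy", "clearance": "secret"})
print(obj["policy"])
```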

The key ingredient is an open standard called the Trusted Data Format (TDF). First invented as a means to securely share information among U.S. intelligence agencies, TDF has since become widely used by corporations that leverage it to apply fine-grained security and access control around files and attachments.

Applying TDF to a file allows users to securely share that file internally or externally without worrying that the information it contains will be compromised or accessed by the wrong person. The original owner of the file and its associated data sets retains control over the information and is able to dictate who can access it, how it’s used, how it’s encrypted, and more.

To do this, TDF employs fine-grained attribute-based access control (ABAC). Unlike role-based access control, which grants access based on organizational roles, ABAC allows data owners to assign attributes to both the data and the users attempting to access it. When a user requests the data, the protective TDF wrapper compares the user’s attributes to those embedded in the data: if they match, access is granted; if they do not, access is denied.
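
The comparison itself can be sketched in a few lines. Here, access requires the requesting user to satisfy every attribute embedded in the object’s policy; the attribute names are hypothetical and continue the wrapper example above.

```python
# Sketch of the ABAC check described above: grant access only when the
# user's attributes satisfy every attribute in the object's policy.
def abac_check(user_attrs: dict, policy: dict) -> bool:
    return all(user_attrs.get(name) == value for name, value in policy.items())

policy = {"department": "strategy", "clearance": "secret"}

analyst = {"department": "strategy", "clearance": "secret", "region": "emea"}
intern = {"department": "strategy", "clearance": "confidential"}

print(abac_check(analyst, policy))  # True: every required attribute matches
print(abac_check(intern, policy))   # False: clearance does not match
```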

A true zero-trust approach

While developing a zero-trust approach to cybersecurity is now the norm, true zero trust can only be achieved if all data – structured and unstructured – is protected. Even then, the people who need access to information must be able to get that access without friction to keep workflows moving.

The only real way to do that is to follow the steps outlined above. Inventorying, classifying, and tagging data are foundational elements of security. Overlaying a data-centric approach onto this foundation enables access to information while significantly mitigating risks and providing an unmatched level of data protection.

About the Author:

A proven executive and entrepreneur with over 25 years of experience developing high-growth software companies, Matt Howard serves as Virtru’s CMO and leads all aspects of the company’s go-to-market motion within the data protection and Zero Trust security ecosystems.

Prior to Virtru, Howard served 6 years as SVP and CMO at Sonatype where he designed, built, and led global marketing and demand generation for a pioneer in software supply chain management and DevOps automation.

Earlier in his career, Howard co-founded, developed, and successfully sold two software companies. He also led sales and marketing at USinternetworking (acquired by AT&T) and Groove Networks (acquired by Microsoft). He holds a Bachelor of Arts degree from The George Washington University and a Master of Arts from George Mason University.
