How to configure a Keyword Scan for GDPR (or anything else)

In this product tutorial, we’ll see how to configure and use the Keyword Scan feature to support a GDPR assessment of your application portfolio. The feature can search for any kind of keyword (API secret tokens or clear-text passwords, for instance), but it is particularly useful in a GDPR initiative. Does your codebase manipulate PII data? You’ll get some hints very soon with CAST Highlight.

What’s GDPR

According to Wikipedia: “The General Data Protection Regulation (GDPR) is a regulation in EU law on data protection and privacy for all individuals within the European Union and the European Economic Area. It also addresses the export of personal data outside the EU and EEA areas. The GDPR aims primarily to give control to citizens and residents over their personal data and to simplify the regulatory environment for international business by unifying the regulation within the EU.”

Concretely, this means that organizations have to know whether their applications read, process or store Personally Identifiable Information (PII) data of EU-based users, so that they can take the appropriate actions (register the application, declare a Data Processor, qualify the nature and purpose of data collection, modify the application behavior to ask users whether they consent to share their data or not, etc.).

What we need to scan in the code, and why

According to the GDPR, organizations now have to know whether their applications process PII data. This is fairly easy to determine when an application is connected to a central database that holds tables or columns like “first_name”, “email_address”, “social_security_number”, etc. But in software, nothing is that obvious anymore.

First, applications and databases don’t necessarily map 1:1. You may have a few central databases that are accessed by hundreds of apps. GDPR is therefore not only about identifying the databases you have and putting them through the GDPR process: the verification also needs to be done at the application level.

Second, applications can manipulate PII data without any database. APIs, JSON, web services and microservices are now the norm, which means that a piece of source code can read, process and share data with other components without knowing anything about the database that initially stored it. A small script cooked up by your HR department reads LinkedIn’s API to hire the best profiles? There is a risk that it manipulates PII data: at least names, locations and profile pictures.

Fortunately, developers love code they can easily read and maintain: most of the time, they give their classes, methods and parameters names that are not obscure (e.g. getCustomerName, updateProfile($CreditCardNumber), etc.). As a result, it is possible to approximate (if not determine) whether an application processes PII data by scanning its source code and counting occurrences of PII-related keywords. Scanning code to search for patterns? That’s exactly where Highlight comes into the game.
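To make the idea concrete, here is a minimal sketch of keyword counting in Python. It is not how Highlight's analyzers are implemented: the keyword list, the function name count_keywords and its two options are assumptions made for illustration only.

```python
import re

# Hypothetical keyword list, for illustration only.
PII_KEYWORDS = ["name", "email", "creditcardnumber", "ssn"]


def count_keywords(source, keywords, case_sensitive=False, full_word=False):
    """Return {keyword: number of occurrences} for a piece of source text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    counts = {}
    for keyword in keywords:
        pattern = re.escape(keyword)
        if full_word:
            # Only match when the keyword is not glued to other
            # letters, digits or underscores (i.e. a whole identifier).
            pattern = r"(?<![A-Za-z0-9_])" + pattern + r"(?![A-Za-z0-9_])"
        counts[keyword] = len(re.findall(pattern, source, flags))
    return counts


# Made-up snippet reusing the identifier names mentioned above.
snippet = """
String getCustomerName(int id) { ... }
void updateProfile(String creditCardNumber, String email) { ... }
"""

print(count_keywords(snippet, PII_KEYWORDS))
print(count_keywords(snippet, PII_KEYWORDS, full_word=True))
```

With partial matching, "name" is found inside getCustomerName. With full-word matching it is not, which shows how much the matching options can change the counts.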

How to configure a Keyword Scan

The Keyword Scan feature works with our command line, which takes the path to your keyword configuration file (--keywordScan "path/to/your/file.xml"). This file tells the analyzers, in a structured way, what to search for during a code scan. Its structure is detailed below:

  • UserScan: the root node that contains the configuration.
  • keywordScan: the main node for a keyword topic. You can indicate a name and a version (e.g. name="GDPR" version="1.2"). You can have multiple topics in a single configuration file, as you may want to search for GDPR-related keywords but also for keywords related to licenses, specific unauthorized functions, other regulation tags…
  • keywordGroup: the node that searches the code for a keyword or a set of similar keywords (e.g. “social security number”, “ssn”, “social security nbr”, etc.). For each keyword group, you can define a specific weight (for instance, in a GDPR context, a passport number will weigh more than a first name) and search options such as case sensitivity or full vs. partial word matching.
  • keywordItem: one of the search elements. You can have multiple items for a given keyword group.

Your final configuration file would look like this:

<UserScan>
<keywordScan name="GDPR" version="1.0">
	<keywordGroup name="People" weight="1" sensitive="0" full_word="1">
		<keywordItem>firstname</keywordItem>
		<keywordItem>forename</keywordItem>
		<keywordItem>1stname</keywordItem>
		<keywordItem>email</keywordItem>
		<keywordItem>...</keywordItem>
	</keywordGroup>
	<keywordGroup name="Social Security" weight="10" sensitive="0" full_word="1">
		<keywordItem>social security number</keywordItem>
		<keywordItem>socialsecuritynumber</keywordItem>
		<keywordItem>ssn</keywordItem>
	</keywordGroup>
	<keywordGroup name="Passport" weight="10" sensitive="0" full_word="1">
		<keywordItem>...</keywordItem>
	</keywordGroup>
</keywordScan>
</UserScan>

Once your XML configuration file is ready, just include it when running your command line as follows:

C:\Highlight\Highlight-Automation-Command>java -jar HighlightAutomation.jar --sourceDir "C:\sourcecode" --workingDir "C:\sourcecode\HighlightResults" --keywordScan "\\network\path\GDPR_keywords.xml" --skipUpload
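Before launching a scan, it can be handy to check that the configuration file is well-formed XML and to list what it will search for. The sketch below does this with Python's standard library. The function name summarize_config and the checks it performs (a numeric weight on every group, at least one non-empty item per group) are our own assumptions, not validation rules documented by Highlight.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the configuration structure described above.
SAMPLE_CONFIG = """<UserScan>
<keywordScan name="GDPR" version="1.0">
  <keywordGroup name="People" weight="1" sensitive="0" full_word="1">
    <keywordItem>firstname</keywordItem>
    <keywordItem>email</keywordItem>
  </keywordGroup>
  <keywordGroup name="Social Security" weight="10" sensitive="0" full_word="1">
    <keywordItem>ssn</keywordItem>
  </keywordGroup>
</keywordScan>
</UserScan>"""


def summarize_config(xml_text):
    """Return a list of (topic, group, weight, items) tuples.

    Raises ValueError when the file does not follow the structure
    described in this tutorial (assumed checks, see the note above).
    """
    root = ET.fromstring(xml_text)  # raises ParseError on malformed XML
    if root.tag != "UserScan":
        raise ValueError("root node should be UserScan")
    summary = []
    for scan in root.findall("keywordScan"):
        for group in scan.findall("keywordGroup"):
            weight = group.get("weight")
            if weight is None:
                raise ValueError(f"group {group.get('name')!r} has no weight")
            items = [
                item.text.strip()
                for item in group.findall("keywordItem")
                if item.text and item.text.strip()
            ]
            if not items:
                raise ValueError(f"group {group.get('name')!r} has no items")
            summary.append((scan.get("name"), group.get("name"), float(weight), items))
    return summary


for topic, group, weight, items in summarize_config(SAMPLE_CONFIG):
    print(topic, group, weight, items)
```

To check a real file, read it first (for example with open(path, encoding="utf-8").read()) and pass the text to summarize_config.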

When scanning an app with the feature active, the command line will produce one result CSV per keywordScan and per technology (e.g. Java-[date].KeywordScan.GDPR.csv, Java-[date].KeywordScan.Passwords.csv, Python-[date].KeywordScan.GDPR.csv, Python-[date].KeywordScan.Passwords.csv, etc.).

Each produced CSV will contain the list of scanned files as rows and the number of found keyword group occurrences as columns.
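If you want a quick look at a result file outside the dashboard, a few lines of Python are enough. The sketch below assumes a layout that matches the description above (first column is the file path, every other column is a keyword group), a comma delimiter and a header row. The column names and sample values are made up, so adjust them to the CSV you actually get.

```python
import csv
import io

# Made-up sample in the assumed layout: one row per scanned file,
# one column per keyword group.
SAMPLE_CSV = """File,People,Social Security,Passport
src/Customer.java,12,0,1
src/Payroll.java,3,4,0
src/Utils.java,0,0,0
"""


def group_totals(csv_text, delimiter=","):
    """Sum the occurrences of each keyword group over all scanned files."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    # Assumption: the first column identifies the file, the rest are groups.
    totals = {name: 0 for name in reader.fieldnames[1:]}
    for row in reader:
        for name in totals:
            totals[name] += int(row[name])
    return totals


print(group_totals(SAMPLE_CSV))
```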

Explore the results

In Highlight dashboards, Keyword Scan information can be visualized at the portfolio level from the menu entry “KEYWORD SCAN”. Using the weights and the number of occurrences for each keyword group, the dashboard displays aggregated scores by domain, application or keyword. Keyword scores are simple to understand, as their purpose is to quickly identify the relative volume of occurrences (weighted by keyword severity), the density and the file scope of a keyword set:

  • Score: number of occurrences * weight
  • Density: score / total files of the application
  • Impacted Files: number of files where keyword occurrences have been found
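As a worked example, here is a small Python sketch that applies these three formulas to made-up per-file counts. The weights reuse the sample configuration. Summing the weighted occurrences across all keyword groups to get a single application score is our assumption about how the aggregation works.

```python
# Weights per keyword group, as in the sample configuration file.
WEIGHTS = {"People": 1, "Social Security": 10, "Passport": 10}

# Made-up occurrences per file and per keyword group.
FILES = {
    "src/Customer.java": {"People": 12, "Social Security": 0, "Passport": 1},
    "src/Payroll.java": {"People": 3, "Social Security": 4, "Passport": 0},
    "src/Utils.java": {"People": 0, "Social Security": 0, "Passport": 0},
}


def keyword_metrics(files, weights):
    """Compute score, density and impacted files for one application."""
    # Score: number of occurrences * weight, summed over groups and files.
    score = sum(
        count * weights[group]
        for counts in files.values()
        for group, count in counts.items()
    )
    # Impacted files: files with at least one keyword occurrence.
    impacted = sum(1 for counts in files.values() if any(counts.values()))
    # Density: score / total files of the application.
    density = score / len(files) if files else 0.0
    return {"score": score, "density": density, "impacted_files": impacted}


print(keyword_metrics(FILES, WEIGHTS))
```

Here the score is 15 × 1 + 4 × 10 + 1 × 10 = 65, spread over 3 files (density of about 21.7), with 2 impacted files.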

Visually, you can easily spot applications that contain many occurrences and/or high-severity occurrences by looking at the scores (horizontal axis): the right side of the chart corresponds to high scores. Depending on your use case, you can also change the vertical axis to one of the main Highlight KPIs (Business Impact, FTEs, Lines of Code, Cloud readiness, etc.) and see how your application portfolio is distributed on this new metric.

Take concrete actions

In a GDPR assessment context, it also makes sense to leverage one of the features we introduced in the last version: the capability to filter application results on survey answers.

Create a custom survey and simply ask your application owners the question “Does your application manipulate Personally Identifiable Information?”. Back in the dashboard, select the answer “No” and see whether the applications that are not supposed to manipulate PII data do in fact contain occurrences of GDPR-related keywords. You now have a solid list of application candidates to investigate for a GDPR registration.
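The same cross-check can be sketched in a few lines of Python if you work from exported data. Everything below is hypothetical: the application names, the field names pii_answer and gdpr_score, and the values are made up for illustration and do not reflect a Highlight export format.

```python
# Hypothetical portfolio data: survey answer and GDPR keyword score per app.
APPS = [
    {"name": "HR Sourcing Script", "pii_answer": "No", "gdpr_score": 65},
    {"name": "Billing", "pii_answer": "Yes", "gdpr_score": 240},
    {"name": "Build Tools", "pii_answer": "No", "gdpr_score": 0},
    {"name": "Intranet", "pii_answer": "No", "gdpr_score": 12},
]


def candidates_to_investigate(apps):
    """Apps declared as not handling PII but with GDPR keyword hits,
    ordered from highest to lowest score."""
    flagged = [
        app for app in apps
        if app["pii_answer"] == "No" and app["gdpr_score"] > 0
    ]
    flagged.sort(key=lambda app: app["gdpr_score"], reverse=True)
    return [app["name"] for app in flagged]


print(candidates_to_investigate(APPS))
```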