In the olden days, around 2010 or so, public cloud vendors didn’t offer that many services. AWS launched with just queuing (SQS), virtual servers (EC2), object storage (S3) and some basic identity services. Cloud “migration” by and large meant taking the zoning model and security controls that were already on-prem and replicating that design in the public cloud. Even when cloud-native firewalls and network controls became available, nobody trusted them, so existing security vendors sold “virtual appliances” and copying the existing data centre architecture was common.
Apart from a few exceptions, that’s not done anymore. Trust in cloud vendors and cloud-native services has increased, as has the number of people educated and trained in IaaS clouds. Many services are available to solve specific problems, and functionality is constantly being added. Some new solutions have no real on-prem equivalent, forcing controls to be cloud native.
Legacy Security Design Assurance
Even with the massive uptake in cloud usage, many organisations haven’t yet evolved their security assurance operating model and are still tied to the same model used when everything was on-prem. Typically, when a project team wants to use the enterprise public cloud platform, a condition of use is that the project must comply with all the company standards when designing their solution. This design then goes through a manual review process, where the architecture is reviewed, any pen testing is performed, code scanning is run, and perhaps a third-party vendor assessment is undertaken if an external vendor is involved. Then, the outputs are collated, and a risk assessment might be performed. If you want to move fast, there are multiple problems with this approach:
Timing. In a large enterprise, this is typically a long, involved process. The business wants to be agile and release new functionality every two weeks (or faster), but the security assessment process is primarily manual and, in some cases, I’ve seen it take up to six months(!!). This turns the security assessment team into a bottleneck, and suddenly, the entire company views the security team as a blocker and discovers it’s easier to find ways to bypass the official process rather than following it.
And these timeframes may have made sense in the past, but with the advent of the cloud and self-service, the ops teams no longer need three months to purchase, design, rack, and configure physical servers. Increasingly, the development team and “server” admin team could be the same team—possibly even the same people. Maybe it was acceptable to take 3-6 months to review a design when it took the same amount of time to build the production environment, but that’s no longer the case.
Build Assurance. If the assessment process is mostly manual, and it’s a point-in-time design assessment, what guarantee does the assurance team have that what was designed is what was implemented? It doesn’t matter how rigorous the governance process is around design assurance if the development team can then go away and implement something completely different. And this may not be malicious – it could be purely innocent, where a last-minute design change is made that has unintended security consequences, but nobody thinks to check with the security team because the design has already been “approved”.
Operational Assurance. Taking too long and making a last-minute change is one thing, but the larger problem is that if all the processes and technology are set up to perform an assurance assessment at the design phase only, what happens after deployment? Often, developers or administrators need production access to diagnose and fix defects that have been discovered. Having access to prod means they can also change the running config. What if a manual change to fix a functional defect inadvertently breaks security policy?
Even if the change doesn’t break policy, the production config is no longer consistent if that manual change isn’t also replicated into the other environments. Without a way to continuously assess the security posture of your runtime environment, you won’t know there’s a problem until it’s too late and somebody finds company data for sale on the dark web.
What does good look like?
If we were to build a solution for this, what sort of outcomes would we like to see?
- Scalable. The solution would need to be able to scale up and down with demand – not just for newly proposed solutions, but for existing applications as well.
- Fast. It would be desirable if proposed designs could be assessed quickly so the security team is not a bottleneck. “Security at the speed of business” is the end goal.
- Continuous. It must be more than a point-in-time assessment; existing solutions must be included as well, to catch config drift and old software that contains vulnerabilities.
- Secure. It must give the security team confidence that the solutions being deployed are actually secure and that this isn’t a ‘box-ticking’ exercise.
- Cost Effective. Ideally, it should be cost neutral or even save money over time.
- Handles Exceptions. Some solutions won’t be able to fit into this improved process, due to legacy software or hardware. There would need to be an exception process for solutions that don’t fit.
Enter Stage Right – CNAPP
A bunch of SaaS tools are available now in the nebulous “posture management” category: Cloud Security (CSPM), Data Security (DSPM), Application Security (ASPM), Cloud Workload Protection Platform (CWPP). The nomenclature is still evolving, but these posture management tools seem to fall under the “Cloud Native Application Protection” moniker, and of course, the vendors are selling a “Platform”, so it seems we’re calling this new-ish category CNAPP.
Rather than configuring your infrastructure to send logs to a common endpoint, these tools connect to your public IaaS accounts via API and retrieve all the data needed on a continuous basis. Because CNAPP is connected directly to the “control plane” of your cloud environment, it gives you complete visibility of what’s running in your cloud without installing an agent or configuring a service to send logs. (Well, that’s the end goal – some solutions, such as Kubernetes, require an agent – but it’s a considerable step up in visibility)
Because CNAPP has visibility across the entire cloud estate and understands how the network is configured, it unlocks new ways to manage and address vulnerabilities. Vulnerabilities that are discovered but not reachable because of network configuration (for example) are given a much lower weighting – even if they have a high CVSS score – since a vulnerable service that isn’t exposed to the internet is less likely to be compromised. Any compromise would have to come from an insider, who is (rightly or wrongly) more trusted than anonymous internet traffic.
Conversely, internet-exposed vulnerabilities might be given a higher rating because they have a greater potential for exploitation. Depending on your risk appetite and desired cloud network topology, you can customise these “toxic combinations” of findings and their associated risk ratings using criteria such as the data classification the service is handling or your environment designation (dev, test, staging, prod).
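The weighting logic above can be sketched in a few lines. To be clear, the fields and multipliers here are illustrative, not any vendor’s actual scoring algorithm; the point is that context (exposure, data classification, environment) adjusts the raw CVSS number.

```python
# Hypothetical contextual risk scoring, as a CNAPP might apply it.
# The weightings and field names are made up for illustration.
from dataclasses import dataclass

@dataclass
class Finding:
    cve: str
    cvss: float                 # base CVSS score, 0.0 - 10.0
    internet_exposed: bool      # reachable from the internet per network config
    data_classification: str    # e.g. "public", "internal", "confidential"
    environment: str            # e.g. "dev", "test", "prod"

def contextual_risk(f: Finding) -> float:
    """Adjust the raw CVSS score using environmental context."""
    score = f.cvss
    # Not internet-facing: only an insider could reach it, so down-weight.
    if not f.internet_exposed:
        score *= 0.4
    # Sensitive data in production is a "toxic combination": up-weight.
    if f.data_classification == "confidential" and f.environment == "prod":
        score *= 1.5
    return min(round(score, 1), 10.0)

internal = Finding("CVE-2024-0001", 9.8, False, "internal", "dev")
exposed  = Finding("CVE-2024-0002", 6.5, True, "confidential", "prod")

print(contextual_risk(internal))  # high CVSS, but unreachable: down-weighted
print(contextual_risk(exposed))   # lower CVSS, but exposed and sensitive
```

The “medium” CVE ends up ranked above the “critical” one, which is the whole argument for context-aware prioritisation.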
Security teams can then tune the security policies pertaining to each cloud service. Many companies now define “cloud blueprints” or “cloud service certifications” that list each cloud service’s required and desired settings. If these policies are implemented accurately in the tool, then when a development team attempts to deploy a service outside of these parameters, the CNAPP can block the deployment.
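A blueprint check like this is, at its core, just comparing a proposed config against a list of required settings. Here’s a minimal sketch; the blueprint format, setting names and the “storage_bucket” service type are invented for the example.

```python
# Illustrative "cloud blueprint" check at deployment time.
# The blueprint schema and settings are made up for this example.
BLUEPRINT = {
    "storage_bucket": {
        "public_access": False,        # buckets must never be public
        "encryption_at_rest": True,    # encryption is mandatory
        "allowed_regions": {"eu-west-1", "eu-west-2"},
    }
}

def validate(service_type: str, config: dict) -> list[str]:
    """Return policy violations; an empty list means the deploy may proceed."""
    rules = BLUEPRINT[service_type]
    violations = []
    if config.get("public_access") != rules["public_access"]:
        violations.append("public access must be disabled")
    if config.get("encryption_at_rest") != rules["encryption_at_rest"]:
        violations.append("encryption at rest is required")
    if config.get("region") not in rules["allowed_regions"]:
        violations.append(f"region {config.get('region')} is not approved")
    return violations

proposed = {"public_access": True, "encryption_at_rest": True,
            "region": "us-east-1"}
problems = validate("storage_bucket", proposed)
if problems:
    print("deployment blocked:", "; ".join(problems))
```

In a real pipeline this check would run as an admission gate, and a non-empty violations list fails the deployment rather than just printing a message.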
CNAPP is also effective when administrators manually change the configuration. If the new configuration no longer complies with the security policy (configuration drift), the tool can revert to the previous config or raise an alert (if you prefer not to give your CNAPP write access to your cloud environment).
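Drift detection follows the same comparison idea, but continuously against the running environment. This sketch simulates it with plain dictionaries; a real tool would fetch the actual config from the cloud API and either alert or (with write access) revert.

```python
# Minimal sketch of configuration-drift detection: compare the approved
# (desired) config with what's actually running, then alert or revert.
# Both configs are simulated dicts; a real tool would call cloud APIs.
def detect_drift(desired: dict, actual: dict) -> dict:
    """Return the settings whose runtime value no longer matches approval."""
    return {k: {"expected": v, "actual": actual.get(k)}
            for k, v in desired.items() if actual.get(k) != v}

approved = {"public_access": False, "tls_min_version": "1.2"}
running  = {"public_access": True,  "tls_min_version": "1.2"}  # manual change

drift = detect_drift(approved, running)
for setting, values in drift.items():
    # Alert-only mode; with write access, revert to values["expected"].
    print(f"DRIFT: {setting} expected {values['expected']},"
          f" got {values['actual']}")
```

Whether you auto-revert or only alert is exactly the write-access decision mentioned above, and it’s worth deciding deliberately per environment.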
This, in my opinion, is a game-changer.
Shifting Left and Right
One of the more popular CNAPP tools, Wiz, recently announced Wiz Code. This means that in addition to monitoring the cloud operating environment, you can also monitor the application build pipeline and the infrastructure pipeline (Infrastructure as Code – IaC), applying Policy as Code (PaC) to shift enforcement left. This gives security teams a single tool to monitor and enforce policies at build, deployment, and run time.
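Shifting left means running the same sort of policy checks against the IaC pipeline before anything is deployed. The sketch below scans a simulated, simplified representation of Terraform-planned resources for one classic violation; the resource shape and rule are illustrative, not the format any specific scanner uses.

```python
# Hedged sketch of a shift-left IaC check: scan (simulated) planned
# resources for policy violations before deployment. The resource
# structure shown here is simplified for illustration.
def scan_plan(resources: list[dict]) -> list[str]:
    """Flag planned security-group rules that open SSH to the internet."""
    findings = []
    for r in resources:
        if r["type"] == "aws_security_group_rule":
            cfg = r["config"]
            if cfg.get("cidr") == "0.0.0.0/0" and cfg.get("port") == 22:
                findings.append(f"{r['name']}: SSH open to the internet")
    return findings

plan = [
    {"type": "aws_security_group_rule", "name": "bastion_ssh",
     "config": {"cidr": "0.0.0.0/0", "port": 22}},
    {"type": "aws_s3_bucket", "name": "logs", "config": {}},
]

findings = scan_plan(plan)
for finding in findings:
    print("PIPELINE FAIL:", finding)  # fail the build before deployment
```

Catching this in the pipeline is far cheaper than catching it at runtime: the insecure rule never exists in the cloud, so there’s nothing to drift, revert, or remediate.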
Once these policy guardrails are in place, most organisations should no longer need to rely on a detailed security assessment process. If the policies and guardrails are configured correctly and the right operational processes are in place, the development teams will be forced to deploy solutions that are inherently secure – also known as secure by design. There’s no other way those cloud services can be consumed. The other benefit of doing this is that if your policies comply with any frameworks or standards relevant to your industry, your systems should be compliant by default – and you can continuously verify and report on this.
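The “compliant by default” claim rests on mapping each internal policy to the framework controls it satisfies, so that policy pass/fail rolls up into a compliance report automatically. A tiny sketch of that mapping follows; the policy IDs and the CIS-style control numbers are placeholders, not real benchmark references.

```python
# Illustrative mapping of internal policies to framework controls, so that
# passing policy checks doubles as continuous compliance reporting.
# Policy IDs and control numbers below are placeholders.
POLICY_TO_CONTROLS = {
    "no-public-buckets":  ["CIS-AWS-2.1.5"],
    "encryption-at-rest": ["CIS-AWS-2.1.1"],
}

def compliance_report(policy_results: dict[str, bool]) -> dict[str, bool]:
    """Roll policy pass/fail up to controls (a control fails if any
    contributing policy fails)."""
    report: dict[str, bool] = {}
    for policy, passed in policy_results.items():
        for control in POLICY_TO_CONTROLS[policy]:
            report[control] = report.get(control, True) and passed
    return report

results = {"no-public-buckets": True, "encryption-at-rest": False}
report = compliance_report(results)
print(report)
```

Because the policies run continuously, this report can be regenerated on demand rather than assembled by hand at audit time.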
Of course, it’s not all rainbows and unicorns. These tools are not ‘set and forget’ and must be updated when new policies are published, or when new services become available. The security team must closely monitor the tool and follow up with the asset owners when discovered issues aren’t being dealt with quickly. Also, as mentioned previously, not all solutions can be shoe-horned into the design patterns you publish. Sometimes, COTS software has certain… limitations. So, you will need a defined exemption process as well for solutions that don’t quite fit.
Democratising Security
If the Security team wants to stop being a bottleneck or the “handbrake to happiness,” as a colleague of mine likes to say, they must let go of the reins a little. Depending on the organisation’s industry and risk policy, this approach could be met with significant opposition. It’s impossible to solve these problems with just technology; you also need the process defined and the right people on board the bus, or you will never reach your destination.
Potentially, the biggest mind shift for the security team is delegating the responsibility of tracking and fixing any vulnerabilities to the individual development teams that own the cloud assets. If the CNAPP tool identifies an issue, it should be the responsibility of the asset owner and their dev team to address it—and within the SLA defined by the security team. The cost of being allowed to ‘move fast and break things’ means you are also responsible for fixing what breaks.
The effectiveness of this approach depends heavily on the maturity of the organisation as well as the individual teams. More effective teams might be allowed less oversight, while less effective or mature teams might be kept on a shorter leash. The people involved also need to understand the importance of fixing security defects and formally allocate time in their sprints for it. For example, teams should be dissuaded from flagging every finding as a ‘false positive’ just so the findings disappear from the dashboard. A culture that encourages flagging security issues and increasing visibility should be promoted. Findings can then be adequately addressed, not hidden or brushed under the carpet to meet delivery timeframes.
This means making the security tools as user-friendly for developers as possible. Rolling out an automated assurance process will require strong change management and constant consultation with the development teams to ensure they get exactly what they want. If you’re not simultaneously making their lives easier and their software more secure, they won’t want to use your tools or follow your process. And if they don’t use what you put in place, they will start finding workarounds to bypass the official process. Even attempting CNAPP might be a waste of time in this type of corporate culture.
The Benefits
Provided the security team is comfortable that these automated checks are producing secure designs, this approach should greatly reduce the manual effort required of the security assurance/design assessment team. This should meet our goals of decreasing both the time per change and the cost per change to the business. Over time, this solution should remove a pain point and eventually save the business money. This approach also moves away from a point-in-time assessment to a continuous assurance model. The existing assurance team can then shift their focus from checking everybody else’s homework to higher value tasks such as:
- Fine-tuning the security policies as new cloud services are consumed.
- Proactive risk management – following up on existing risks/exceptions/vulnerabilities that haven’t been addressed.
- Pen testing / Red teaming existing solutions that are already in production, rather than incremental changes.
- Handling edge cases and exceptions.
- Coming up with new approved integration patterns.
Automating the assurance process will also standardise assessments because solutions either comply with the policies or they don’t. Manual reviews give different outcomes depending on the skillset and background of the reviewer; automation eliminates this variability.
This will take quite a mind shift for some. The security team will need to stop thinking of themselves as the security police and instead start thinking of themselves as legislators writing the laws. We’ve had security view themselves as the company policeman for quite some time now, and that approach no longer works at the pace the business wants to move. Automation is the name of the game, and enforcement is now RoboCop’s (CNAPP’s) responsibility.
So, to summarise, “all” you need to do is use the design principles of zero trust and overlapping controls (defence in depth) to define patterns that everybody agrees are secure, then use software, automation, and processes to enforce those constraints on solution designs. Yes, I know, it sounds so simple! I realise it’s quite a bit of effort to implement, but over time I reckon the payoff is worth it.
And if you don’t fully trust automation just yet, an alternative approach that might make CISOs more comfortable is using automation to augment the existing manual process. This wouldn’t meet the cost requirements, but it would significantly increase the security posture of the organisation over time – which can only be a good thing!
Further Reading
CIS Benchmarks – A great starting point to build cloud security blueprints.
Wiz CNAPP – More information about CNAPP – what it is, how it works and other benefits.
Policy as Code (PaC) – What PaC is, what its benefits are, and implementing Policy as Code using Open Policy Agent (OPA) and Rego.
Infrastructure as Code – How to implement IaC using Terraform.
