

WebGoat Benchmark Edition

The OWASP WebGoat Benchmark Edition (WBE) is a test suite designed to evaluate the speed, coverage, and accuracy of vulnerability detection tools. Without the ability to measure these tools, it is difficult to understand their value or interpret vendor claims. The WBE contains over 20,000 test cases that are fully runnable and exploitable.

You can use the WBE with Static Application Security Testing (SAST) and Interactive Application Security Testing (IAST) tools. A future goal is to support the evaluation of Dynamic Application Security Testing (DAST) tools like OWASP ZAP. The current version is implemented in Java. Future versions may expand to include other languages.

Project Philosophy

Security tools (SAST, DAST, and IAST) are amazing when they find a complex vulnerability in your code. But they can drive everyone crazy with complexity, false alarms, and missed vulnerabilities. Using these tools without understanding their strengths and weaknesses can lead to a dangerous false sense of security.

We are on a quest to measure just how good these tools are at discovering and properly diagnosing security problems in applications. We rely on the long history of military and medical evaluation of detection technology as a foundation for our research. Therefore, the test suite tests both real and fake vulnerabilities.

There are four possible test outcomes in the WBE:

  1. Tool correctly identifies a real vulnerability (True Positive - TP)
  2. Tool fails to identify a real vulnerability (False Negative - FN)
  3. Tool correctly ignores a false alarm (True Negative - TN)
  4. Tool fails to ignore a false alarm (False Positive - FP)

Note that a tool that simply flags every line of code as vulnerable will perfectly identify all vulnerabilities in an application, but will also have a 100% false positive rate. Similarly, a tool that reports nothing will have no false positives, but will also identify zero real vulnerabilities. Imagine a tool that flips a coin to decide whether to report each vulnerability for every test case: the result would be a 50% true positive rate and a 50% false positive rate. All of these points fall on the diagonal from (0,0) to (100,100), which roughly translates to "random guessing." The ultimate measure of a security tool is how much better it can do than random guessing.

We recognize that the WBE test cases are not as complex as real code. These tests are designed to test various capabilities of security tools, but the results for real code may be different or worse than the performance on the WBE. Please submit examples of code that present a challenge for security tools so that we can improve the WBE.

Scoring and Reporting Results

We encourage tool vendors, open source projects, and end users alike to verify their application security tools against the WBE and to contribute their results to the project. In order to ensure that the results are fair and useful, we ask that you follow a few simple rules when publishing results.

  1. Provide an easily reproducible procedure (script preferred) to run the tool on the WBE, including:
    1. A description of the default “out-of-the-box” installation, version numbers, etc…
    2. All configuration, tailoring, onboarding, etc… performed to make the tool run
    3. All changes to default security rules, tests, or checks to achieve the results
    4. Easily reproducible steps for achieving the result
  2. Results should be in the following table format:

     Security Category | TP | FN | TN | FP | Total | TPR | FPR | Score

     For each general security category of test cases:
       • TP (True Positives): tests with real vulnerabilities that the tool correctly reported as vulnerable
       • FN (False Negatives): tests with real vulnerabilities that the tool failed to report as vulnerable
       • TN (True Negatives): tests with fake vulnerabilities that the tool correctly did not report
       • FP (False Positives): tests with fake vulnerabilities that the tool incorrectly reported as vulnerable
       • Total: total number of test cases in the category
       • TPR (True Positive Rate): TP / (TP + FN)
       • FPR (False Positive Rate): FP / (FP + TN)
       • Score: distance from the “guess line”, normalized to a 0-100 scale (more explanation TBD)

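To make the arithmetic concrete, here is a minimal sketch (in Java, the language of the benchmark) of how the per-category numbers could be computed from raw counts. The TPR and FPR formulas come straight from the table above; the score calculation is our own assumption (normalized distance from the TPR = FPR "guess line"), since the exact normalization is still marked TBD.

 // Sketch only: TPR and FPR follow the table above; the score formula is an
 // assumed normalization of the distance from the "guess line" (TPR == FPR).
 public class CategoryScoreSketch {

     static double truePositiveRate(int tp, int fn) {
         return (tp + fn) == 0 ? 0.0 : (double) tp / (tp + fn);
     }

     static double falsePositiveRate(int fp, int tn) {
         return (fp + tn) == 0 ? 0.0 : (double) fp / (fp + tn);
     }

     // |TPR - FPR| is the perpendicular distance from (FPR, TPR) to the diagonal,
     // divided by the maximum possible distance; scaling by 100 gives a 0-100 score.
     // A coin-flip tool (TPR == FPR) scores 0.
     static double guessLineScore(double tpr, double fpr) {
         return 100.0 * Math.abs(tpr - fpr);
     }

     public static void main(String[] args) {
         int tp = 90, fn = 10, tn = 70, fp = 30;  // hypothetical counts for one category
         double tpr = truePositiveRate(tp, fn);   // 0.90
         double fpr = falsePositiveRate(fp, tn);  // 0.30
         System.out.printf("TPR=%.2f FPR=%.2f Score=%.1f%n", tpr, fpr, guessLineScore(tpr, fpr));
         // Prints: TPR=0.90 FPR=0.30 Score=60.0
     }
 }

With these hypothetical counts the tool lands well above the guess line; a coin-flip tool (TPR = FPR = 0.5) scores 0, and a perfect tool (TPR = 1, FPR = 0) scores 100.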
Code Repo

The code for this project is hosted at the OWASP Git repository. Along with the code comes a Maven pom.xml file so you can download all the dependencies and build the entire project with ease using Maven.

Using the pom, it should be easy to verify all the code compiles correctly. To download and build everything, if you already have git and maven installed, all you have to do is:

 $ git clone https://github.com/OWASP/webgoat-benchmark
 $ cd webgoat-benchmark
 $ mvn compile

Licensing

The OWASP WebGoat Benchmark is free to use under the GNU General Public License v2.0.

Mailing List

OWASP WebGoat Benchmark Mailing List

Project Leaders

Dave Wichers ([email protected])

Related Projects

Quick Download

All test code and project files can be downloaded from OWASP GitHub.

News and Events

  • April 15, 2015 - First Version Released

Classifications

  • Incubator Project
  • Builders / Defenders
  • GNU General Public License v2.0
  • Project Type: Code

This initial release of the WBE has 20,983 test cases. The test case areas and quantities for the April 15, 2015 release are:

To download a spreadsheet that lists every test case, the vulnerability category, the CWE number, and the expected result (true finding/false positive), click here.

Every test case is:

  • a servlet or JSP (currently they are all servlets, but we plan to add JSPs soon)
  • either a true vulnerability or a false positive for a single issue

The test cases exercise a broad array of application and framework behavior, including:

  • HTTP request and response handling
  • Simple and complex data flow
  • Simple and complex control flow
  • Popular frameworks
  • Inversion of control
  • Reflection
  • Class loading
  • Annotations
  • Popular UI technologies (particularly JavaScript frameworks)

Future enhancements could cover:

  • All vulnerability types in the OWASP Top 10
  • Does the tool find flaws in libraries?
  • Does the tool find flaws spanning custom code and libraries?
  • Does the tool handle web services (REST, XML, GWT, etc.)?
  • Does the tool work with different app servers and Java platforms?

Example Test Case

Each test case is a simple Java EE servlet. BenchmarkTest00001 is an LDAP Injection test with the following metadata in the accompanying BenchmarkTest00001.xml file:

 <test-metadata>
   <category>ldapi</category>
   <test-number>00001</test-number>
   <vulnerability>true</vulnerability>
   <cwe>90</cwe>
 </test-metadata>

BenchmarkTest00001.java simply reads in all the cookie values, looks for a cookie named "foo" and uses the value of this cookie when performing an LDAP query. Here's the code for BenchmarkTest00001.java:

 package org.owasp.webgoat.benchmark.testcode;
 
 import java.io.IOException;
 
 import javax.servlet.ServletException;
 import javax.servlet.annotation.WebServlet;
 import javax.servlet.http.HttpServlet;
 import javax.servlet.http.HttpServletRequest;
 import javax.servlet.http.HttpServletResponse;
 
 @WebServlet("/BenchmarkTest00001")
 public class BenchmarkTest00001 extends HttpServlet {
 	
 	private static final long serialVersionUID = 1L;
 	
 	@Override
 	public void doGet(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
 		doPost(request, response);
 	}
 
 	@Override
 	public void doPost(HttpServletRequest request, HttpServletResponse response) throws ServletException, IOException {
 		// some code
 
 		javax.servlet.http.Cookie[] cookies = request.getCookies();
 		
 		String param = null;
 		boolean foundit = false;
 		if (cookies != null) {
 			for (javax.servlet.http.Cookie cookie : cookies) {
 				if (cookie.getName().equals("foo")) {
 					param = cookie.getValue();
 					foundit = true;
 				}
 			}
 			if (!foundit) {
 				// no cookie found in collection
 				param = "";
 			}
 		} else {
 			// no cookies
 			param = "";
 		}
 		
 		try {
 			javax.naming.directory.DirContext dc = org.owasp.webgoat.benchmark.helpers.Utils.getDirContext();
 			Object[] filterArgs = {"a","b"};
 			dc.search("name", param, filterArgs, new javax.naming.directory.SearchControls());
 		} catch (javax.naming.NamingException e) {
 			throw new ServletException(e);
 		}
 	}
 }
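Because the test cases are runnable, each one can also be exercised end to end. The snippet below is only a hedged illustration: it assumes the benchmark has been packaged and deployed locally under a hypothetical http://localhost:8080/benchmark context, and simply sends a request whose foo cookie carries an LDAP filter wildcard so that attacker-controlled data reaches the dc.search() call.

 // Hypothetical client for the test case above. The base URL and context path are
 // assumptions about a local deployment; adjust them to wherever the WAR is running.
 import java.net.HttpURLConnection;
 import java.net.URL;
 
 public class ExerciseBenchmarkTest00001 {
     public static void main(String[] args) throws Exception {
         URL url = new URL("http://localhost:8080/benchmark/BenchmarkTest00001");
         HttpURLConnection conn = (HttpURLConnection) url.openConnection();
         // The servlet copies this cookie value straight into the LDAP search filter,
         // so wildcard and filter metacharacters are interpreted by the LDAP server.
         conn.setRequestProperty("Cookie", "foo=*");
         System.out.println("HTTP status: " + conn.getResponseCode());
     }
 }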

As of this initial release, we don't have any vulnerability detection tool results to publish. We are working on generating results for FindBugs as our first example, and then plan to work on more after that. If you would like to contribute to this project by running a tool against the benchmark and producing a set of results in the format described in the Scoring and Reporting Results section on the main project tab, please contact the project lead.

Our vision for this project is to develop automated test harnesses for many vulnerability detection tools, so that we can repeatably run each tool against every version of the benchmark and automatically produce results in our desired format.
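As one illustration of what a harness step might look like, the sketch below (our own hypothetical code, not part of the benchmark) reads a test case's metadata file in the <test-metadata> format shown earlier and classifies the tool's verdict for that test as TP, FN, TN, or FP. How a particular tool's report gets mapped to a per-test verdict for the expected CWE is tool-specific and is not shown.

 // Hypothetical harness step: combine a test's expected result (from its metadata
 // XML, format shown in the Example Test Case section) with the tool's verdict.
 import java.io.File;
 import javax.xml.parsers.DocumentBuilderFactory;
 import org.w3c.dom.Document;
 
 public class HarnessSketch {
 
     enum Outcome { TRUE_POSITIVE, FALSE_NEGATIVE, TRUE_NEGATIVE, FALSE_POSITIVE }
 
     // metadataFile: e.g. BenchmarkTest00001.xml
     // toolFlaggedIt: whether the tool reported this test as vulnerable for the expected CWE
     static Outcome classify(File metadataFile, boolean toolFlaggedIt) throws Exception {
         Document doc = DocumentBuilderFactory.newInstance()
                 .newDocumentBuilder().parse(metadataFile);
         boolean realVulnerability = Boolean.parseBoolean(
                 doc.getElementsByTagName("vulnerability").item(0).getTextContent().trim());
 
         if (realVulnerability) {
             return toolFlaggedIt ? Outcome.TRUE_POSITIVE : Outcome.FALSE_NEGATIVE;
         }
         return toolFlaggedIt ? Outcome.FALSE_POSITIVE : Outcome.TRUE_NEGATIVE;
     }
 }

Aggregating these per-test outcomes by category, and then applying the TPR/FPR/score arithmetic from the Scoring and Reporting Results section, would produce the table that each harness is expected to emit.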

We want to test the WBE against as many tools as possible. If you are:

  • A tool vendor and want to participate in the project
  • Someone who wants to help score a free tool against the project
  • Someone who has a license to a commercial tool and the terms of the license allow you to publish tool results, and you want to participate

please let me know!

The following people have contributed to this project and their contributions are much appreciated!

  • Juan Gama - Development of initial release and continued support
  • Ken Prole - Assistance with automated score card development using CodeDx
  • Nick Sanidas - Development of initial release

We are looking for volunteers. Please contact Dave Wichers if you are interested in contributing new test cases, tool results run against the benchmark, or anything else.