Test Cases
 
Version 1.0 of the Benchmark was published on April 15, 2015, and had 20,983 test cases. On May 23, 2015, version 1.1 of the Benchmark was released. The 1.1 release improves on the previous version by ensuring that there are both true positives and false positives in every vulnerability area. Version 1.2beta was released on August 15, 2015.

From version 1.2beta onward, the Benchmark is a fully executable web application, which means it is scannable by any kind of vulnerability detection tool. The 1.2beta release has been limited to slightly fewer than 3,000 test cases, to make it easier for DAST tools to scan (so scans don't take as long, run out of memory, or blow up the size of the tools' databases). The final 1.2 release is expected to be the same size. The 1.2beta release covers the same vulnerability areas that 1.1 covers; we added a few Spring database SQL Injection tests, but that's it. The bulk of the work was turning each test case into something that actually runs correctly AND is fully exploitable, and then generating a working UI on top of it, in order to turn the test cases into a real running application.

Given that 1.2beta is temporary, we aren't updating the chart below. You can still download the version 1.1 release of the Benchmark by cloning the release marked with the GIT tag '1.1'.

{|
|-
! Vulnerability Area
! # of Tests in v1.1
! # of Tests in v1.2
! CWE Number
|-
| [[Command Injection]]
| 2708
| 251
| [https://cwe.mitre.org/data/definitions/78.html 78]
|-
| Weak Cryptography
| 1440
| 246
| [https://cwe.mitre.org/data/definitions/327.html 327]
|-
| Weak Hashing
| 1421
| 236
| [https://cwe.mitre.org/data/definitions/328.html 328]
|-
| [[LDAP injection | LDAP Injection]]
| 736
| 59
| [https://cwe.mitre.org/data/definitions/90.html 90]
|-
| [[Path Traversal]]
| 2630
| 268
| [https://cwe.mitre.org/data/definitions/22.html 22]
|-
| Secure Cookie Flag
| 416
| 67
| [https://cwe.mitre.org/data/definitions/614.html 614]
|-
| [[SQL Injection]]
| 3529
| 504
| [https://cwe.mitre.org/data/definitions/89.html 89]
|-
| [[Trust Boundary Violation]]
| 725
| 126
| [https://cwe.mitre.org/data/definitions/501.html 501]
|-
| Weak Randomness
| 3640
| 493
| [https://cwe.mitre.org/data/definitions/330.html 330]
|-
| [[XPATH Injection]]
| 347
| 35
| [https://cwe.mitre.org/data/definitions/643.html 643]
|-
| [[XSS]] (Cross-Site Scripting)
| 3449
| 455
| [https://cwe.mitre.org/data/definitions/79.html 79]
|-
| Total Test Cases
| 21,041
| 2,740
|
|}
  
* either a true vulnerability or a false positive for a single issue

The Benchmark is intended to help determine how well analysis tools correctly analyze a broad array of application and framework behavior, including:

* HTTP request and response problems


OWASP Benchmark Project

The OWASP Benchmark for Security Automation (OWASP Benchmark) is a free and open test suite designed to evaluate the speed, coverage, and accuracy of automated software vulnerability detection tools and services (henceforth simply referred to as 'tools'). Without the ability to measure these tools, it is difficult to understand their strengths and weaknesses, and compare them to each other. The OWASP Benchmark contains over 20,000 test cases that are fully runnable and exploitable, each of which maps to the appropriate CWE number for that vulnerability.

You can use the OWASP Benchmark with Static Application Security Testing (SAST) tools, Dynamic Application Security Testing (DAST) tools (such as OWASP ZAP), and Interactive Application Security Testing (IAST) tools. The current version of the Benchmark is implemented in Java. Future versions may expand to include other languages.

Benchmark Project Scoring Philosophy

Security tools (SAST, DAST, and IAST) are amazing when they find a complex vulnerability in your code. But with widespread misunderstanding of the specific vulnerabilities automated tools cover, end users are often left with a false sense of security.

We are on a quest to measure just how good these tools are at discovering and properly diagnosing security problems in applications. We rely on the long history of military and medical evaluation of detection technology as a foundation for our research. Therefore, the test suite tests both real and fake vulnerabilities.

There are four possible test outcomes in the Benchmark:

  1. Tool correctly identifies a real vulnerability (True Positive - TP)
  2. Tool fails to identify a real vulnerability (False Negative - FN)
  3. Tool correctly ignores a false alarm (True Negative - TN)
  4. Tool fails to ignore a false alarm (False Positive - FP)
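These counts determine the two rates used throughout the rest of this page (the definitions are the standard ones, stated here for convenience):

 True Positive Rate (TPR) = TP / (TP + FN)   (also called sensitivity)
 False Positive Rate (FPR) = FP / (FP + TN)   (equal to 1 - specificity)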

We can learn a lot about a tool from these four metrics. Consider a tool that simply flags every line of code as vulnerable. This tool will perfectly identify all vulnerabilities! But it will also have 100% false positives, and thus adds no value. Similarly, consider a tool that reports absolutely nothing. This tool will have zero false positives, but will also identify zero real vulnerabilities, and is equally worthless. You can even imagine a tool that flips a coin to decide whether to report each test case as vulnerable; the result would be 50% true positives and 50% false positives. We need a way to distinguish valuable security tools from these trivial ones.

The line connecting all these points, from (0,0) to (100,100), roughly translates to "random guessing." The ultimate measure of a security tool is how much better it can do than this line. The diagram below shows how we will evaluate security tools against the Benchmark.

[Diagram: Wbe guide.png, the Benchmark scoring guide]

A point plotted on this chart provides a visual indication of how well a tool did, considering both the True Positives and the False Positives it reported. We also want to compute an individual score for that point in the range 0 to 100, which we call the Benchmark Accuracy Score.

The Benchmark Accuracy Score is essentially a Youden Index, which is a standard way of summarizing the accuracy of a set of tests. Youden's index is one of the oldest measures of diagnostic accuracy: a global measure of test performance, used to evaluate the overall discriminative power of a diagnostic procedure and to compare it with other tests. It is calculated by subtracting 1 from the sum of a test's sensitivity and specificity, expressed as fractions rather than percentages: (sensitivity + specificity) - 1. For a test with poor diagnostic accuracy, Youden's index equals 0; for a perfect test, it equals 1.
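Since sensitivity is the True Positive Rate (TPR) and specificity is 1 - FPR, the index reduces to the gap between the two rates:

 Youden Index = sensitivity + specificity - 1
              = TPR + (1 - FPR) - 1
              = TPR - FPR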

 So for example, if a tool has a True Positive Rate (TPR) of .98 (i.e., 98%) 
   and False Positive Rate (FPR) of .05 (i.e., 5%)
 Sensitivity = TPR (.98)
 Specificity = 1-FPR (.95)
 So the Youden Index is (.98+.95) - 1 = .93
 
 And this would equate to a Benchmark score of 93 (since we normalize this to the range 0 - 100)
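The same arithmetic is easy to automate when scoring several tools. Below is a minimal Java sketch of the calculation; the class and method names are ours for illustration, and are not part of the Benchmark's actual scoring code:

  // Minimal sketch of the Benchmark Accuracy Score calculation.
  // Names are illustrative only, not the Benchmark's own API.
  public class BenchmarkScoreSketch {

      /** Score in the range -100..100: (TPR - FPR) * 100. */
      static double score(int tp, int fn, int tn, int fp) {
          double tpr = (double) tp / (tp + fn); // sensitivity
          double fpr = (double) fp / (fp + tn); // 1 - specificity
          return (tpr - fpr) * 100.0;           // normalized Youden index
      }

      public static void main(String[] args) {
          // The worked example above: TPR = .98, FPR = .05
          System.out.println(score(98, 2, 95, 5));   // prints ~93.0
          // A coin-flipping "tool": TPR = FPR = .50
          System.out.println(score(50, 50, 50, 50)); // prints 0.0, no better than guessing
      }
  }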

On the graph, the Benchmark Score is the length of the line from the tool's point down to the diagonal "guessing" line. Note that a Benchmark score can be negative if the point falls below that line, which happens when the False Positive Rate is higher than the True Positive Rate.

Benchmark Validity

The Benchmark tests are not exactly like real applications. The tests are derived from coding patterns observed in real applications, but the majority of them are considerably simpler than real applications. That is, most real-world applications will be considerably harder to successfully analyze than the OWASP Benchmark Test Suite. Although the tests are based on real code, it is possible that some tests have coding patterns that don't occur frequently in real code.

Remember, we are trying to test the capabilities of the tools and make them explicit, so that users can make informed decisions about what tools to use, how to use them, and what results to expect. This is exactly aligned with the OWASP mission to make application security visible.

Generating Benchmark Scores

Anyone can use this Benchmark to evaluate vulnerability detection tools. The basic steps are:

  1. Download the Benchmark from GitHub
  2. Run your tools against the Benchmark
  3. Run the BenchmarkScore tool on the reports from your tools

That's it!

Full details on how to do this are at the bottom of the page on the Quick_Start tab.

We encourage vendors, open source tool developers, and end users to verify their application security tools against the Benchmark. To ensure that the results are fair and useful, we ask that you follow a few simple rules when publishing results; we won't recognize any results that aren't easily reproducible. Published results should include:

  1. A description of the default “out-of-the-box” installation, version numbers, etc…
  2. Any and all configuration, tailoring, onboarding, etc… performed to make the tool run
  3. Any and all changes to default security rules, tests, or checks used to achieve the results
  4. Easily reproducible steps to run the tool

Reporting Format

The Benchmark includes tools to interpret raw tool output, compare it to the expected results, and generate summary charts and graphs. We use the following table format in order to capture all the information generated during the evaluation.

Code Repo and Build/Run Instructions

See the Getting Started and Getting, Building, and Running the Benchmark sections on the Quick Start tab.

Licensing

The OWASP Benchmark is free to use under the GNU General Public License v2.0.

Mailing List

OWASP Benchmark Mailing List

Project Leaders

Dave Wichers


Quick Download

All test code and project files can be downloaded from OWASP GitHub.


News and Events

  • LOOKING FOR VOLUNTEERS!! - We are looking for individuals and organizations to join and make this a much more community-driven project, including additional co-leaders to help take this project to the next level. Contributors could work on things like new test cases, additional tool scorecard generators, support for languages beyond Java, and a host of other improvements. Please contact me if you are interested in contributing at any level.
  • Sep 24, 2015 - Benchmark introduced to broader OWASP community at AppSec USA
  • Aug 27, 2015 - U.S. Dept. of Homeland Security (DHS) is financially supporting the Benchmark project.
  • Aug 15, 2015 - Benchmark Version 1.2beta Released with full DAST Support. Checkmarx and ZAP scorecard generators also released.
  • July 10, 2015 - Benchmark Scorecard generator and open source scorecards released
  • May 23, 2015 - Benchmark Version 1.1 Released
  • April 15, 2015 - Benchmark Version 1.0 Released

Classifications

  • Incubator Project
  • Builders / Defenders
  • GNU General Public License v2.0
  • Project Type: Code