# oss-analysis-census2-prototype **Repository Path**: mirrors_ossf/oss-analysis-census2-prototype ## Basic Information - **Project Name**: oss-analysis-census2-prototype - **Description**: Prototype of Census 2 of open source software (NOT MAINTAINED) - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2021-10-22 - **Last Updated**: 2026-03-22 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Core Infrastructure Initiative Census 2 Prototype This "Open Source Software Analysis Census 2 Prototype" is a prototype for a process for identifying the most important open source software (OSS), and from them, identifying the "highest security risk" OSS among the most important OSS. It was originally developed in 2017 under the Core Infrastructure Initiative (CII), and is now shared as part of the [Open Source Security Foundation (OpenSSF)](https://openssf.org/) [Securing Critical Projects Working Group (WG)](https://github.com/ossf/wg-securing-critical-projects). The purpose of sharing this prototype is to help future related work. For more information about this prototype, see the paper ["Core Infrastructure Initiative (CII) Open Source Software Census II Strategy"](https://www.ida.org/research-and-publications/publications/all/c/co/core-infrastructure-initiative-cii-open-source-software-census-ii-strategy) by David A. Wheeler and Jason N. Dossett, IDA Document D-8777, October 2017. That document describes the overall approach. To determine the feasibility of this approach, we developed this prototype that did some basic dependency analysis, computed a simple risk indicator, and then reported the combination. When developing this prototype, we identified approaches that enabled the analysis to scale to the millions of repositories and projects necessary. In developing this prototype we found that there are subtleties in versioning that must be carefully handled to produce accurate results; this protype does not (yet) implement the proposed changes to handle this. This prototype primarily uses the [dataset provided by Libraries.io](https://libraries.io/data); we gratefully acknowledge their work. We also used data from Google's BigQuery [githubarchive monthly dataset](https://bigquery.cloud.google.com/dataset/githubarchive:month). ## Related work Other related work includes: * ["Open Source Software Projects Needing Security Investments" by David A. Wheeler and Samir Khakimov, 19 June 2015, IDA document D-5459](https://www.coreinfrastructure.org/wp-content/uploads/sites/6/2018/04/pub_ida_lf_cii_070915.pdf). * [Vulnerabilities in the Core, a Preliminary Report and Census II of Open Source Software. from Harvard and the Linux Foundation (2020)](https://www.coreinfrastructure.org/programs/census-program-ii/) ## Installation You must install make and PostgreSQL. On Ubuntu, do the following: ~~~sh sudo apt update sudo apt install postgresql postgresql-contrib make sudo su postgres createdb census2 createuser --interactive # Enter name of role to add: $(yourusername) ; say "y" for superuser exit ~~~ You must then download the `libraries.io` dataset: ~~~sh make download make libio_data ~~~ ## Usage To run the analysis, simply run "make"; by default this will use all but one of the processors on your system. However, it takes a while, so you should probably run: > nohup make & You can set the environment variable NUM_CPU to set the number of processors to use. You can also set TARGET_DB to set the database (by default, "census2"). This will produce various files, including the final result, the file "filtered_runtime_deps_traversed_dependent_counts_labeled.csv" which reports the ordered list of OSS packages that are most depended upon by other package and repositories (which is our proxy for importance), only for runtime dependencies. Some other files it produces: * `filtered_runtime_deps.csv` - all direct dependencies for the filtered projects (from libraries.io dataset) * `filtered_runtime_deps_traversed/` - `NUM_CPU` gzipped .csv files containing all transitive dependencies To calculate security risks, see the file "sm_score-NOTES". It describes how to run the Google BigQuery query to get the one-year contributor data. You export that data to a .csv file and place it in your directory as repo_contributors_1yr.csv. Then run `make security_metrics.loaded` to calculate security risk scores. Once that is done, you can use the queries described in `sm_score-NOTES` to see the security risk scores, in particular, to sort the "most important" packages (in each package manager) by risk score. ## License All material is released under the [MIT license](./LICENSE). All material that is not executable, including all text when not executed, is also released under the [Creative Commons Attribution 3.0 International (CC BY 3.0) license](https://creativecommons.org/licenses/by/3.0/) or later. In SPDX terms, everything here is licensed under MIT; if it's not executable, including the text when extracted from code, it's "(MIT OR CC-BY-3.0+)".