Awesome
Empirical Software Engineering 
A curated repository of data sets and tools that can be used for
conducting evidence-based, data-driven research on software systems.
This research approach is often termed experimental
or empirical software engineering. Many of the data sets can also be
useful in research that uses search-based
software engineering methods. The repository is named after the Mining Software Repositories (MSR)
conference series. For examples of such work, see the MSR conference’s Hall of Fame.
- This list requires your input for its continuous improvement. Read
the contribution guide for instructions on
how you can contribute. Alternatively, you can send me an email if you find the process too
cumbersome or confusing.
- For more awesome lists, see awesome.
Contents
Repositories
- SIR -
Software-artifact infrastructure repository; Java, C, C++, and C#
software together with test suites and fault data.
- PROMISE -
About 20 datasets related to software engineering research.
- FLOSSmole -
Collaborative collection and analysis of free/libre/open source project
data.
- Zenodo - Software data collections
in CERN’s open-access repository.
Data Sets
- AndroidTimeMachine -
Graph-based dataset of commit history of 8,431 real-world Android
apps.
- AndroZoo - Collection of
Android Applications.
- Bug Prediction Dataset -
Collection of models and metrics from Eclipse JDT Core, PDE UI,
Equinox Framework, Lucene, Mylyn, and their histories.
- Code Reviews -
Code reviews of OpenStack, LibreOffice, AOSP, Qt, and Eclipse.
- CoREBench -
Collection of 70 realistically Complex Regression Errors that were
systematically extracted from the repositories and bug reports of four
open-source software projects: Make, Grep, Findutils, and
Coreutils.
- Cryptocurrency
GitHub Activity and Market Cap Dataset - Activity such as commits,
stars, prices, and market cap of over 200 cryptocurrency projects on
GitHub over time. Raw, historic data is also available.
- Defects4J -
Collection of 395 reproducible bugs collected with the goal of advancing
software testing research.
- Eclipse
AERI stacktraces - Collection of stacktraces of Exceptions
encountered by users of the Eclipse IDE, as retrieved by the AERI
reporting system.
- Enron
Spreadsheets and Emails - All the spreadsheets and emails used in
the paper ‘Enron’s Spreadsheets and Related Emails: A Dataset and
Analysis’.
- Findbugs-maven -
Set of FindBugs reports for the Java projects of the Maven repository.
- GHTorrent - Scalable, queryable,
offline mirror of data offered through the GitHub REST API.
- GitHub
Bug Dataset - Bug Dataset of 15 Java open-source projects
characterized by static source code metrics.
- GitHub on
Google BigQuery - GitHub data accessible through Google’s BigQuery
platform.
- Grammar Zoo - Collection
of grammars of DSLs and GPLs, some extracted from metamodels and
document schemata.
- KaVE - Developer tool
interaction data.
- Linux
Kernel 4.21 Call Graphs - The Linux Kernel 4.21 Call Graphs produced
using CScout.
- Maven metrics -
Collection of software complexity and sizing metrics for the Maven Repository.
- Maven Dependency
Graph - Snapshot of the whole Maven Central taken on September 6,
2018, stored in a graph database.
- mzdata -
Multi-extract and multi-level dataset of Mozilla issue tracking
history.
- npm-miner -
Analysis results of five open-source software quality tools (eslint,
escomplex, nsp, jsinspect, and sonarjs) for 2000
popular (in terms of stars and downloads) npm packages.
- OCL Expressions on
GitHub - Data set of 9188 OCL expressions originating from 504 EMF
meta-models in 245 systematically selected GitHub repositories.
- RepoReapers Data Set -
Data set containing a collection of engineered software
projects from GHTorrent.
- Software Heritage
Graph Dataset - Graph of the development history and file metadata
of more than 80 million software projects from various forges (GitHub, GitLab,
Debian, PyPI, Google Code, etc.) in a deduplicated and unified
representation (paper here).
- STAMINA - (STAte
Machine INference Approaches) data are used to benchmark techniques for
learning deterministic finite state machines (FSMs).
- Stack
Exchange - Anonymized dump of all user-contributed content on the
Stack Exchange network.
- TravisTorrent -
Provides free and easy-to-use Travis CI build analyses.
- Ultimate
Debian Database (UDD) - Data about various aspects of Debian
(e.g. packages, bugs, maintainers) in the same SQL database.
- Unified
Bug Dataset - Static source-code-based datasets that include the
Bugcatchers Bug Dataset, the Bug Prediction Dataset, the
Eclipse Bug Dataset, the GitHub Bug Dataset, and some datasets
from the PROMISE repository.
- Unix
history - Git repository covering 46 years of Unix evolution.
Tools
- astminer -
Library and tool for mining of path-based representations of code and
other data derived from ASTs.
- Boa - Domain-specific
language and infrastructure that eases mining software
repositories.
- buckwheat -
Multi-language tokenizer for extracting identifiers from source
code.
- ckjm - Chidamber and
Kemerer Java Metrics.
- Coming - A Java
framework for analyzing code changes and mining instances of change
patterns from Git repositories.
- CryptOSS - Mine
GitHub activity and market cap data for cryptocurrency projects.
- DbDeo - Extract
embedded SQL statements and detect database schema smells.
- Designite - Compute
source code metrics and detect a variety of implementation, design, and
architecture smells for C#.
- DesigniteJava -
Compute source code metrics and detect a variety of implementation and
design smells for Java.
- Diggit - Agile Ruby
Tool to analyze Git repositories.
- GrimoireLab -
Free/Libre/Open Source tools for Software Development Analytics.
- MetricMiner -
Lean Java DSL to mine and extract data (e.g. commits, developers,
modifications, diffs) from Git and SVN repositories.
- Maven-miner -
Java tools and infrastructure to resolve the whole Maven dependency
graph, hosted in Maven Central, in the form of a Neo4j Graph.
- Perceval -
Fetch repository data from tens of back-ends.
- Puppeteer -
Detect configuration smells in Puppet code.
- PyDriller -
Python Framework to analyse Git repositories.
- qmcalc -
Calculate quality metrics from C source code.
- reaper - Python
tool to compute a score for a repository from GHTorrent. The score
quantifies the extent to which the project contained within the
repository is engineered.
- RefactoringMiner -
Library/API for detection of refactorings in changes of Java
code.
- VulData7 -
Java framework enabling the automated collection of commits fixing
vulnerabilities that are reported in NVD (links NVD with Git).
Research Outlets
- Outlets exclusively devoted to empirical software engineering
research
- Outlets that publish empirical software engineering research
License

To the extent possible under law, Diomidis Spinellis has waived all
copyright and related or neighboring rights to this work.