{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "

Tutorial: Mining Software Repositories

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Introduction

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

In this tutorial we will explore the basic tasks of mining software repositories and the associated data collection procedures.

\n", "\n", "

Mining Software Repositories (MSR): a field that analyzes source code repositories to obtain interesting information and actionable insights about practical aspects of software engineering. The data obtained from repositories is the basis of most empirical software engineering research.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Review of the Git version control system and PyDriller

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Git is a free and open source version control system. \n", "It helps keep track of changes in files, including source code, and \n", "helps developers share and collaborate on software projects. Git is available as a GUI and as a command line tool. We use the command line tool for more flexibility.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Installation

\n", "

Check out this page to get Git for your system. You can check whether Git is already installed by typing git on the command line.

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#1 check if git is already installed, if not installed download and \n", "!git" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

For the remaining exercises, we are going to explore some git commands by mining the MyExpenses\n", " project. This is an open source Android app for tracking expenses.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Cloning repository

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#clone the project MyExpenses, we are basically downloading the default branch of the source code.\n", "!git clone https://github.com/mtotschnig/MyExpenses MyExpenses" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Go to the project folder\n", "%cd MyExpenses/\n", "!pwd" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#check the status of the repository\n", "!git status" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#check the branches of the repo \n", "!git branch\n", "#!echo \"-r option\" #for also showing remote branches\n", "#!git branch -r\n", "#!echo \"show-branch\"\n", "#!git show-branch #shows the default branch and the last commit message" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Commits

\n", "

Commits are snapshots of a repository at different points in time. They are the basic building blocks of repositories.

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#use variations of git log command to explore commits associated with a repository\n", "!git log #shows the commits from most recent to the oldest" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!git log -n #view most recent n-commits\n", "!git log -3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!git log --oneline #just view single line per commit" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#git command to get the commit ids of list of commits associated with a branch, \n", "#helpful when scripting data collection. get rev-list \n", "!git rev-list master" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Checking out a repository

\n", "

We use the git checkout command to go to any snapshot of the repository in time using its commit id.\n", "This is a very important command when you need to analyze (e.g., detect smells in) every commit of a repository.
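
When every commit has to be analyzed, the checkout step can be scripted. The following is a minimal sketch, not a definitive implementation: it assumes git is on the PATH, and the helper names parse_rev_list and for_each_commit, as well as the analyze callback, are hypothetical.

```python
import subprocess


def parse_rev_list(output):
    # git rev-list prints one commit id per line
    return [line.strip() for line in output.splitlines() if line.strip()]


def for_each_commit(repo_path, branch, analyze):
    # Collect the commit ids of the branch, oldest first
    out = subprocess.run(
        ['git', 'rev-list', '--reverse', branch],
        cwd=repo_path, capture_output=True, text=True, check=True,
    ).stdout
    try:
        for commit_id in parse_rev_list(out):
            # Move the working tree to this snapshot, then analyze it
            subprocess.run(['git', 'checkout', '--quiet', commit_id],
                           cwd=repo_path, check=True)
            analyze(repo_path, commit_id)
    finally:
        # Always return to the branch we started from
        subprocess.run(['git', 'checkout', '--quiet', branch],
                       cwd=repo_path, check=True)
```

For example, for_each_commit('MyExpenses', 'master', lambda path, commit_id: print(commit_id)) would visit every snapshot in turn, with the lambda standing in for a real analysis such as a smell detector run.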

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#First use git log or git rev-list to grab some commit id\n", "!echo \"before checkout\"\n", "!git status" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#then use checkout\n", "!git checkout 1b4f4dc00dba7d41afc9d51641b196c38fbd7488" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!echo \"after checkout\"\n", "!git status" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#to get back to the current state of the default branch for this case\n", "!git checkout master" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!git status #now its back to master" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Using PyDriller

\n", "

PyDriller is a Python library/framework that helps developers and researchers do MSR. It is especially helpful for extracting information associated with commits, and it implements the SZZ algorithm to find bug-inducing commits given bug-fix commits.
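
As a rough sketch of the SZZ-style lookup, PyDriller's Git class offers get_commits_last_modified_lines, which returns, per modified file, the commits that last touched the lines changed by a given commit. The function names bug_inducing_commits and flatten_candidates are hypothetical, and fix_hash is assumed to be the id of a known bug-fix commit.

```python
def bug_inducing_commits(repo_path, fix_hash):
    # Imported lazily so the helper below can be used without PyDriller
    from pydriller import Git

    repo = Git(repo_path)
    fix_commit = repo.get_commit(fix_hash)
    # Maps each modified file to the set of commits that last touched
    # the lines changed by the fix
    return repo.get_commits_last_modified_lines(fix_commit)


def flatten_candidates(last_modified):
    # Merge the per-file sets into one sorted list of commit ids
    commits = set()
    for hashes in last_modified.values():
        commits.update(hashes)
    return sorted(commits)
```

The returned commits are only candidates; whether one of them really introduced the bug still needs to be checked.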

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#installation command\n", "pip install pydriller" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#check if the installation is successful\n", "from pydriller import Repository" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Repository: the main class of PyDriller, used to initialize a repository and traverse its commits. To tell PyDriller which repository to analyze, you can pass a local path, a remote URL, or a list of local paths or URLs. You can target specific commits and commit ranges using filters. You can analyze a single commit by passing its commit id to the single parameter. You can use the since, from_commit, to, and to_commit options to analyze a specific set of commits. The order parameter determines the order of the commit objects returned by traverse_commits().

\n", "\n", "

The Repository class has a function traverse_commits(). By default, this function returns commit objects from the oldest commit to the newest.
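
The filters described above can be combined as in the following sketch. The helpers build_filters and commits_between are hypothetical, and the value 'reverse' for order is an assumption about the values PyDriller accepts; check the PyDriller documentation for your version.

```python
import datetime


def build_filters(since=None, to=None, from_commit=None,
                  to_commit=None, order=None):
    # Keep only the filters that were actually set
    options = {
        'since': since, 'to': to,
        'from_commit': from_commit, 'to_commit': to_commit,
        'order': order,
    }
    return {name: value for name, value in options.items()
            if value is not None}


def commits_between(repo_path, **filters):
    # Imported lazily so build_filters works without PyDriller installed
    from pydriller import Repository

    return Repository(repo_path, **build_filters(**filters)).traverse_commits()


# Commits since 1 September 2021, newest first
example_filters = build_filters(since=datetime.datetime(2021, 9, 1),
                                order='reverse')
```

Calling commits_between('MyExpenses', **example_filters) would then yield the matching commit objects.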

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#Initialize pydriller with the cloned repository MyExpenses\n", "import datetime\n", "r=Repository(\"MyExpenses\")\n", "#to analyze only the commits made after a given date, use instead:\n", "#r=Repository(\"MyExpenses\",since=datetime.datetime(2021, 9, 1))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#traversing commits with pydriller\n", "for c in r.traverse_commits():\n", "    print(\"Commit:{}\\n time:{} \\n\".format(c.hash,c.committer_date))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

The Commit object contains all the information associated with a git commit, including hash, msg, author, committer, author_date, author_timezone, committer_date, modified_files (the list of files modified in the commit), and some statistics about the commit.

\n", "\n", "

The ModifiedFile object stores information about a file modified in a commit. It contains attributes such as old_path, new_path, filename, change_type, diff, diff_parsed, nloc, complexity, number_of_methods etc.
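
To illustrate why diff_parsed is convenient, here is a small sketch that pulls the changed line numbers out of it. It relies on diff_parsed being a dictionary with 'added' and 'deleted' lists of (line number, text) pairs; the function name changed_line_numbers and the sample data are made up for illustration.

```python
def changed_line_numbers(diff_parsed):
    # diff_parsed holds 'added' and 'deleted' lists of (line number, text)
    added = [number for number, _ in diff_parsed['added']]
    deleted = [number for number, _ in diff_parsed['deleted']]
    return added, deleted


# Stand-in for ModifiedFile.diff_parsed, so the sketch runs without a repository
sample = {
    'added': [(10, 'int total = 0;'), (11, 'return total;')],
    'deleted': [(10, 'return 0;')],
}
added, deleted = changed_line_numbers(sample)
print(added, deleted)
```

In a real analysis you would pass f.diff_parsed for each file f in commit.modified_files.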

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#PyDriller commit object properties\n", "modified_files=[]\n", "for commit in Repository(\"MyExpenses\",single='52493a4a110fcb0fdbf292976bff129ae9a8ed26').traverse_commits():\n", " print(\"commit id: {}\\n Commit message: {} \\n Author name: {} \\n Committer name: {} \\n Committer date: {}\".format(commit.hash, commit.msg,commit.author.name, commit.committer.name, commit.committer_date))\n", " modified_files=commit.modified_files" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#iterate through modifed files of the commit\n", "for f in modified_files:\n", " print(f.filename)\n", " print(f.new_path)\n", " print(f.diff)\n", " print(f.diff_parsed) #very useful to extract lines of code where change happens" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Selecting subject systems from GitHub

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

The choice of subject systems for your study determines the quality of your research outcomes. \n", "GitHub is public and open to anyone, so it also hosts archived projects, toy projects, class tutorials, introductory examples, and so on. Care should be taken to make sure the selected subject systems are appropriate for your context. Please read the paper The promises and perils of mining GitHub for more detail and for techniques that help ensure we get the right projects from GitHub.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

When choosing projects, check the number of stars, number of forks, number of issues, number of open issues, relevance and most recent commit time to avoid toy projects.
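
These checks can be turned into a simple filter. The sketch below is only an illustration: the dictionary keys, the function name is_candidate, and all thresholds are hypothetical and should be adapted to your study.

```python
import datetime


def is_candidate(repo, now, min_stars=50, min_forks=10, max_idle_days=365):
    # repo is a plain dict; the thresholds are illustrative, not recommended values
    idle_days = (now - repo['last_commit']).days
    return (repo['stars'] >= min_stars
            and repo['forks'] >= min_forks
            and idle_days <= max_idle_days
            and not repo.get('archived', False))
```

A project with few stars, no forks, and no commits for several years would be rejected by this filter, which is the kind of toy or abandoned project the criteria are meant to exclude.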

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Typical subject system selection steps

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
1. First, explore the related literature to see whether there are already popular subject systems, and use or include them in your study. E.g., the paper considers projects in F-Droid repositories. F-Droid repositories are popular in studies that involve Android apps. If you don't find any, continue with the following steps.\n", "
2. Identify keywords that could describe your target projects.\n", "
3. Go to the GitHub advanced search and apply the keywords. Check whether the projects you get are appropriate. Rank them in decreasing order of stars to get the more popular projects.\n", "
4. If you are satisfied with the keywords, write a script to automatically search repositories and gather your hits. There are many ways to do it, but the GitHub search API and the PyGithub library are popular for this activity. Watch out for rate limits and other API constraints.\n", "
\n", "

There is no standard rule for deciding how many projects are enough for a study. Aim for as many as possible given the available resources and data-collection time. Also, look at related work in the area to see how many subject systems are typically used. The right number also depends on the type of study. The number of projects affects the generalizability of your results.\n", "\n", "

\n", "\n", "

 \n", " Example: Let's use the GitHub advanced search to look for machine learning projects in Python\n", "
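
The same search can be scripted. The following is a hedged sketch using PyGithub's search_repositories call; the helpers build_query and top_repositories, the star threshold, and the limit are hypothetical, and newer PyGithub releases may prefer a different way of passing the token.

```python
def build_query(keywords, language=None, min_stars=None):
    # GitHub search qualifiers are appended to the free-text keywords
    parts = [keywords]
    if language:
        parts.append('language:' + language)
    if min_stars is not None:
        parts.append('stars:>=' + str(min_stars))
    return ' '.join(parts)


def top_repositories(query, limit=20, token=None):
    # Imported lazily so build_query can be used without PyGithub installed
    from github import Github

    client = Github(token) if token else Github()
    results = client.search_repositories(query, sort='stars', order='desc')
    # Slicing the paginated result avoids fetching every page (rate limits)
    return [(repo.full_name, repo.stargazers_count) for repo in results[:limit]]


print(build_query('machine learning', language='python', min_stars=100))
```

Unauthenticated requests are subject to much stricter rate limits, so passing a personal access token is advisable for anything beyond a quick test.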

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Code smell detection tools

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Test smell detection using TsDetect

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
1. The first step is to detect test files; use TestFileDetector.\n", "
2. TsDetect also needs each test file matched to its production file; use Test File Mapping.\n", "
3. Prepare a CSV file containing the test file name and the production file name, and run TsDetect with the CSV file as a parameter.\n", "
4. Repeat these steps for each version and for each subject system.\n", "
5. Modify or adapt the test file detection and test file mapping scripts to work with Python code. Use the Python test smell detector.\n", "
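
The mapping and CSV steps above can be sketched as follows. This is not the TestFileDetector or Test File Mapping tool itself: the naming convention (FooTest.java or TestFoo.java tests Foo.java), the column layout (app name, test file, production file), and all function names are assumptions made for illustration, so check the TsDetect documentation for the exact input format.

```python
import csv
import os


def production_name(test_file):
    # Assumed naming convention: FooTest.java or TestFoo.java tests Foo.java
    stem, ext = os.path.splitext(os.path.basename(test_file))
    if stem.endswith('Test'):
        return stem[:-len('Test')] + ext
    if stem.startswith('Test'):
        return stem[len('Test'):] + ext
    return None


def map_tests(app_name, test_files, production_files):
    # Index production files by file name for a simple lookup
    by_name = {os.path.basename(path): path for path in production_files}
    rows = []
    for test_file in test_files:
        match = by_name.get(production_name(test_file))
        if match:
            rows.append([app_name, test_file, match])
    return rows


def write_rows(rows, csv_path):
    # One row per matched pair, without a header line
    with open(csv_path, 'w', newline='') as handle:
        csv.writer(handle).writerows(rows)
```

Test files without a matching production file are skipped here; depending on the study, you may prefer to record them separately.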
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Traditional code smell detection

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Demo: Use DECOR for traditional code smell detection in Java

" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "#!java -jar DECOR_JAVA.jar \"\"\n", "\n", "!java -jar DECOR_JAVA.jar MyExpenses MyExpenses \"\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The result of the detection is stored two folder lavels above the detector, make sure you consider that" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }