Rebuilding Git in Ruby

Git is a distributed version control system (DVCS) that we use every day to manage our code. It is a powerful tool, but have you ever wondered how it works its magic? The Git internals documentation can be intimidating and incomplete, and it is short on examples. Digging through Git’s implementation can also be intimidating, particularly if you aren’t familiar with C.

Pulling apart the engine and putting it back together is one of the best ways to understand how a system works. However, instead of writing C, let’s use something more familiar to us as Rails developers. Let’s re-implement Git in Ruby!

If you want to dig deeper into the implementation, check out the RGit source on GitHub.

Git commands

Git is built in a modular fashion, following the UNIX philosophy of small, sharp tools. Each command is its own script file and the top-level git command simply proxies to them. Git ships with a number of built-in commands, but custom commands can be written as long as they follow a naming convention: an executable named git-<command> on your PATH can be run as git <command>.

#!/usr/bin/env ruby

# bin/rgit

command, *args = ARGV

if command.nil?
  $stderr.puts "Usage: rgit <command> [<args>]"
  exit 1
end

path_to_command = File.expand_path("../rgit-#{command}", __FILE__)
if !File.exist? path_to_command
  $stderr.puts "No such command"
  exit 1
end

exec path_to_command, *args

This script does one of three things when we call it:

Outputs usage information if no subcommand was given
Outputs an error message if no script for the subcommand was found
Runs the given subcommand if it is found

Notice that we pass on any additional arguments to the subcommand.

As good UNIX citizens, we output messages to the standard error stream and return a non-zero exit code when errors occur.
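
To make the naming convention concrete, here is a minimal sketch of the lookup step on its own, separated from exec so it can be tried in isolation. The resolve_command helper and the rgit-hello script are hypothetical names used only for illustration; they are not part of RGit.

```ruby
require "tmpdir"

# Hypothetical helper (not in RGit): returns the path of the script that
# implements a subcommand, or nil when no executable with that name exists.
def resolve_command(bin_dir, command)
  return nil if command.nil? || command.empty?

  path = File.join(bin_dir, "rgit-#{command}")
  File.file?(path) && File.executable?(path) ? path : nil
end

# Usage: a throwaway bin directory containing one custom command.
Dir.mktmpdir do |bin_dir|
  script = File.join(bin_dir, "rgit-hello")
  File.write(script, "#!/usr/bin/env ruby\nputs \"hello\"\n")
  File.chmod(0o755, script)

  p resolve_command(bin_dir, "hello")   # path ending in "rgit-hello"
  p resolve_command(bin_dir, "missing") # nil
end
```

Any executable dropped next to bin/rgit with the rgit- prefix would become a new subcommand under this scheme, with no changes to the dispatcher.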

Initializing a repository

Git stores all of its data and metadata in a .git directory in the root of the repository. The git init command initializes the .git directory and a few subdirectories as follows:

.git
├── HEAD
├── config
├── objects
│  ├── info
│  └── pack
└── refs
    ├── heads
    └── tags

HEAD is a file that has the hard-coded value ref: refs/heads/master. We’ll need this file later. config contains configuration for the repo. We’ll ignore it for now in the interest of simplicity. The remaining items in the tree are empty directories.

Generating this structure is mostly a lot of calls to Dir.mkdir:

#!/usr/bin/env ruby

# bin/rgit-init

RGIT_DIRECTORY = ".rgit".freeze
OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
REFS_DIRECTORY = "#{RGIT_DIRECTORY}/refs".freeze

# Dir.exists? was deprecated and later removed from Ruby; Dir.exist? is the supported name
if Dir.exist? RGIT_DIRECTORY
  $stderr.puts "Existing RGit project"
  exit 1
end

def build_objects_directory
  Dir.mkdir OBJECTS_DIRECTORY
  Dir.mkdir "#{OBJECTS_DIRECTORY}/info"
  Dir.mkdir "#{OBJECTS_DIRECTORY}/pack"
end

def build_refs_directory
  Dir.mkdir REFS_DIRECTORY
  Dir.mkdir "#{REFS_DIRECTORY}/heads"
  Dir.mkdir "#{REFS_DIRECTORY}/tags"
end

def initialize_head
  File.open("#{RGIT_DIRECTORY}/HEAD", "w") do |file|
    file.puts "ref: refs/heads/master"
  end
end

Dir.mkdir RGIT_DIRECTORY
build_objects_directory
build_refs_directory
initialize_head

$stdout.puts "RGit initialized in #{RGIT_DIRECTORY}"

This script is called rgit-init in keeping with the conventions expected by the rgit command we built. If there is already a .rgit directory, we output an error message and exit with a non-zero exit code. Real Git allows you to safely “re-initialize” a repository but let’s opt out of this edge case for our MVP.
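
For the curious, a re-runnable variant is not much more work. The sketch below is an assumption about how such a “re-initialize” could behave, not RGit’s actual code: init_repository is a hypothetical function that creates any missing directories and leaves an existing HEAD untouched.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical re-runnable variant of rgit-init (an illustration, not RGit's code).
def init_repository(root)
  rgit = File.join(root, ".rgit")
  already_there = Dir.exist?(rgit)

  %w[objects/info objects/pack refs/heads refs/tags].each do |dir|
    # mkdir_p creates parents as needed and does not fail on existing directories
    FileUtils.mkdir_p(File.join(rgit, dir))
  end

  head = File.join(rgit, "HEAD")
  # Only write HEAD when it is missing, so re-running never moves the current branch
  File.write(head, "ref: refs/heads/master\n") unless File.exist?(head)

  already_there ? :reinitialized : :initialized
end

# Usage: the second call is harmless.
Dir.mktmpdir do |root|
  p init_repository(root)
  p init_repository(root)
end
```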

The init command is a little verbose but very boring. It creates a bunch of directories as well as the HEAD file.

Adding files to the staging area

Git lets us capture a snapshot of the current state of a file via the git add command. The set of these snapshots is called the staging area. A list of snapshots and their metadata is stored at .git/index (.rgit/index for RGit). Staging a file takes a few steps:

Create a SHA based on the file contents
Create a blob by compressing the file contents
Save that blob as .rgit/objects/<first-two-characters-of-sha>/<rest-of-sha>
Add the SHA and original file path to the index so we can retrieve it later.

The index is a binary file that has the following format:

DIRC <version_number> <number of entries>

<ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
<ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>
<ctime> <mtime> <dev> <ino> <mode> <uid> <gid> <SHA> <flags> <path>

# more entries

A lot of this metadata comes in handy for calculations done by other commands. If you try to open this file, however, you will see a bunch of gibberish.

cat .git/index

bin/rgit-initTREE52 1?Ibin/rgitU?U?2????        ???
C??B=????''9bin2 0
?Cԣ̏k?i??`V:??3'9Z?6??赠xa?cǢbF

This is because the contents of the index file are stored in a binary format for performance reasons.
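
Even without decoding the entries, the fixed-size header at the start of a real Git index is easy to inspect. The sketch below assumes the documented layout of a 4-byte DIRC signature followed by two 32-bit big-endian integers (version and number of entries). read_index_header is a hypothetical helper name, and the demo writes a synthetic header rather than reading a real repository.

```ruby
require "tmpdir"

# Hypothetical helper: reads only the 12-byte header of a Git index file.
def read_index_header(path)
  header = File.binread(path, 12)
  raise ArgumentError, "index too short" if header.nil? || header.bytesize < 12

  # a4 = 4 raw bytes, N = 32-bit unsigned integer in network (big-endian) order
  signature, version, entry_count = header.unpack("a4NN")
  raise ArgumentError, "not a Git index" unless signature == "DIRC"

  { version: version, entries: entry_count }
end

# Usage: build a synthetic header (version 2, 3 entries) and read it back.
Dir.mktmpdir do |dir|
  path = File.join(dir, "index")
  File.binwrite(path, ["DIRC", 2, 3].pack("a4NN"))
  p read_index_header(path)
end
```

Pointing the same helper at .git/index in a real repository should report that index’s version and the number of staged entries.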

For simplicity and human-readability, let’s ignore most of the metadata and use a text format. We can return and add these features as they become necessary in the future.

For now, RGit’s index format will look like:

<SHA> <path>
<SHA> <path>
<SHA> <path>

# more entries

Let’s look at the actual Ruby code to do all this!

#!/usr/bin/env ruby

# bin/rgit-add

require "digest"
require "zlib"
require "fileutils"

RGIT_DIRECTORY = ".rgit".freeze
OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze
INDEX_PATH = "#{RGIT_DIRECTORY}/index"

# Dir.exists? was deprecated and later removed from Ruby; Dir.exist? is the supported name
if !Dir.exist? RGIT_DIRECTORY
  $stderr.puts "Not an RGit project"
  exit 1
end

path = ARGV.first

if path.nil?
  $stderr.puts "No path specified"
  exit 1
end

file_contents = File.read(path)
sha = Digest::SHA1.hexdigest file_contents
blob = Zlib::Deflate.deflate file_contents
object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
FileUtils.mkdir_p object_directory
blob_path = "#{object_directory}/#{sha[2..-1]}"

# Binary mode: the deflated blob is raw bytes, not text
File.open(blob_path, "wb") do |file|
  file.print blob
end

File.open(INDEX_PATH, "a") do |file|
  file.puts "#{sha} #{path}"
end

Let’s start versioning RGit with RGit! First we need to add a file to the staging area:

rgit add bin/rgit

Our .rgit directory now looks like:

.rgit
├── HEAD
├── index
├── objects
│   ├── b3
│   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
│   ├── info
│   └── pack
└── refs
    ├── heads
    └── tags

Notice that we now have a file in the objects directory. It contains the compressed source of bin/rgit.
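
To check that the object really does hold the file, we can inflate it again. The following is a minimal sketch with hypothetical helper names (blob_path, write_blob, read_blob); write_blob mirrors what rgit-add does so the example is self-contained. Note that real Git prepends a small header to the contents before hashing and compressing, so these SHAs will not match the ones git itself produces.

```ruby
require "digest"
require "zlib"
require "fileutils"
require "tmpdir"

# Hypothetical helpers illustrating RGit's blob layout.
def blob_path(objects_dir, sha)
  File.join(objects_dir, sha[0..1], sha[2..-1])
end

# Mirrors rgit-add: SHA of the raw contents, deflated contents on disk.
def write_blob(objects_dir, contents)
  sha = Digest::SHA1.hexdigest(contents)
  path = blob_path(objects_dir, sha)
  FileUtils.mkdir_p(File.dirname(path))
  File.binwrite(path, Zlib::Deflate.deflate(contents))
  sha
end

# The reverse direction: locate the object by SHA and inflate it.
def read_blob(objects_dir, sha)
  Zlib::Inflate.inflate(File.binread(blob_path(objects_dir, sha)))
end

# Usage: round-trip a small file through a temporary objects directory.
Dir.mktmpdir do |objects_dir|
  sha = write_blob(objects_dir, "puts 'hello'\n")
  puts read_blob(objects_dir, sha)
end
```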

Finally, our index looks like:

cat .rgit/index

b302dd6f8cd2b385b170e78c14503342c0ba6ae8 bin/rgit
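
Because rgit-add appends to the index, staging the same file twice leaves two lines for one path. One way to handle that when reading the index, sketched below under the assumption that the most recent entry should win, is to parse the lines into a hash keyed by path. parse_index is a hypothetical helper and the short SHAs are placeholders.

```ruby
# Hypothetical helper: turns the "<SHA> <path>" text index into { path => sha }.
# Later lines overwrite earlier ones, so re-staged files keep their newest SHA.
def parse_index(text)
  text.each_line.each_with_object({}) do |line, entries|
    # Split on the first space only, so paths containing spaces stay intact
    sha, path = line.chomp.split(" ", 2)
    next if path.nil? || path.empty?

    entries[path] = sha
  end
end

# Usage: bin/rgit was staged twice; only its latest SHA survives.
index = <<~INDEX
  aaa111 bin/rgit
  bbb222 bin/rgit-add
  ccc333 bin/rgit
INDEX

p parse_index(index)
```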

Committing files

Blobs are the contents of a particular file at a particular time. In order to capture a snapshot of the entire project, Git bundles a bunch of these into a commit.

In order to capture the directory structure of the project, Git creates a “tree” object for each directory of a project. Each tree object contains a list of the tracked files and their associated blobs as well as tree objects for subdirectories.

This gives us a tree structure that mirrors the tracked project’s filesystem. Directories are represented by “tree” objects while files are “blobs”. This whole tree structure is then tied to a “commit” object so that we can refer to it later.

The commit command does three things:

Build the tree/blob structure
Create a commit object that points to that structure
Update the current branch to point to this commit.

Because creating objects is a common task, I’ve extracted it to RGit::Object.

# lib/rgit/object.rb

require "fileutils"

module RGit
  RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
  OBJECTS_DIRECTORY = "#{RGIT_DIRECTORY}/objects".freeze

  class Object
    def initialize(sha)
      @sha = sha
    end

    def write(&block)
      object_directory = "#{OBJECTS_DIRECTORY}/#{sha[0..1]}"
      FileUtils.mkdir_p object_directory
      object_path = "#{object_directory}/#{sha[2..-1]}"
      File.open(object_path, "w", &block)
    end

    private

    attr_reader :sha
  end
end

This class handles all of the directory/path related tasks as well as opening the file. It then yields to the given block for the actual writing of the object’s contents.

With this refactor done, let’s take a look at the commit command:

#!/usr/bin/env ruby

# bin/rgit-commit

$LOAD_PATH << File.expand_path("../../lib", __FILE__)
require "digest"
require "time"
require "rgit/object"

RGIT_DIRECTORY = "#{Dir.pwd}/.rgit".freeze
INDEX_PATH = "#{RGIT_DIRECTORY}/index"
COMMIT_MESSAGE_TEMPLATE = <<-TXT
# Title
#
# Body
TXT

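def index_files
  # A missing index simply means nothing has been staged yet
  return [] unless File.exist?(INDEX_PATH)

  # File.foreach yields lines without leaving a file handle open
  File.foreach(INDEX_PATH)
end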

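def index_tree
  index_files.each_with_object({}) do |line, obj|
    # Each index line is "<SHA> <path>"
    sha, path = line.split
    *directories, filename = path.split("/")
    # Walk (or create) one nested hash per directory, then store the blob SHA.
    # Splitting off the filename first keeps paths like "bin/bin" working.
    subtree = directories.reduce(obj) { |memo, dir| memo[dir] ||= {} }
    subtree[filename] = sha
  end
end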

def build_tree(name, tree)
  sha = Digest::SHA1.hexdigest(Time.now.iso8601 + name)
  object = RGit::Object.new(sha)

  object.write do |file|
    tree.each do |key, value|
      if value.is_a? Hash
        dir_sha = build_tree(key, value)
        file.puts "tree #{dir_sha} #{key}"
      else
        file.puts "blob #{value} #{key}"
      end
    end
  end

  sha
end

def build_commit(tree:)
  commit_message_path = "#{RGIT_DIRECTORY}/COMMIT_EDITMSG"

  `echo "#{COMMIT_MESSAGE_TEMPLATE}" > #{commit_message_path}`
  `$VISUAL #{commit_message_path} >/dev/tty`

  message = File.read commit_message_path
  committer = "user"
  sha = Digest::SHA1.hexdigest(Time.now.iso8601 + committer)
  object = RGit::Object.new(sha)

  object.write do |file|
    file.puts "tree #{tree}"
    file.puts "author #{committer}"
    file.puts
    file.puts message
  end

  sha
end

def update_ref(commit_sha:)
  current_branch = File.read("#{RGIT_DIRECTORY}/HEAD").strip.split.last

  File.open("#{RGIT_DIRECTORY}/#{current_branch}", "w") do |file|
    file.print commit_sha
  end
end

def clear_index
  File.truncate INDEX_PATH, 0
end

if index_files.count == 0
  $stderr.puts "Nothing to commit"
  exit 1
end

root_sha = build_tree("root", index_tree)
commit_sha = build_commit(tree: root_sha)
update_ref(commit_sha: commit_sha)
clear_index

This file does several things:

Exits with error code and message if there are no files to commit
Creates all the necessary tree objects for the files in the index
Creates a commit object pointing to the root tree object
Updates the current branch to point to the commit
Clears the index

Building the tree is done in two passes. First the index is converted into a hash structure representing the file tree. Secondly, this structure is converted to tree objects on the filesystem. Both steps are done recursively.
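
The second pass is easier to follow with the filesystem taken out of the picture. The sketch below renders a nested hash into the text of each tree object and collects them in memory. render_tree is a hypothetical helper and the short blob SHAs are placeholders. One deliberate difference from the script above: each tree’s SHA is derived from its content rather than from the current time, which is closer in spirit to how Git addresses objects.

```ruby
require "digest"

# Hypothetical in-memory version of the second pass. Returns the SHA of the
# tree and records every tree's text in `objects` (sha => content).
def render_tree(tree, objects = {})
  lines = tree.map do |name, value|
    if value.is_a?(Hash)
      # Subdirectory: render it first so we know which SHA to point at
      "tree #{render_tree(value, objects)} #{name}\n"
    else
      # File: the value is already the blob SHA from the index
      "blob #{value} #{name}\n"
    end
  end

  content = lines.join
  sha = Digest::SHA1.hexdigest(content) # content-derived, unlike the script above
  objects[sha] = content
  sha
end

# Usage: two files under bin/ produce a root tree and a bin tree.
objects = {}
root_sha = render_tree({ "bin" => { "rgit" => "b302", "rgit-add" => "6345" } }, objects)
puts objects[root_sha]
puts objects.size
```

A side effect of hashing the content is that committing an unchanged directory yields the same tree SHA every time, so identical trees are stored only once.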

For the commit message, we simply open a file using the user’s $VISUAL editor. Once the user exits their editor, we read the file and put the contents into the commit.
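
Two rough edges are worth noting here: $VISUAL may not be set, and the commented-out template lines end up in the commit message. The sketch below shows one possible way to handle both. pick_editor and strip_comment_lines are hypothetical helpers, and the fallback order (VISUAL, then EDITOR, then vi) is an assumption made for illustration.

```ruby
# Hypothetical helper: choose an editor command from the environment.
def pick_editor(env = ENV)
  [env["VISUAL"], env["EDITOR"]].find { |value| value && !value.empty? } || "vi"
end

# Hypothetical helper: drop template lines that start with "#" and trim
# surrounding blank space from what remains.
def strip_comment_lines(message)
  message.lines.reject { |line| line.start_with?("#") }.join.strip
end

# Usage: the template lines disappear, the user's text stays.
edited = <<~MSG
  # Title
  #
  # Body
  Add rgit-commit

  Builds tree objects from the index.
MSG

puts pick_editor("EDITOR" => "nano")
puts strip_comment_lines(edited)
```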

Let’s see it all come together. Staging and committing bin/rgit and bin/rgit-add gives us the following results in .rgit:

.rgit
├── COMMIT_EDITMSG
├── HEAD
├── index
├── objects
│   ├── 63
│   │   └── 45493c987e6144cc68142ad2405db681b28628
│   ├── 8c
│   │   └── fe566596683acae588039156f40ecaff282c30
│   ├── ae
│   │   └── 161568392ed9aa321466446a9bb01acb111e4f
│   ├── b3
│   │   └── 02dd6f8cd2b385b170e78c14503342c0ba6ae8
│   ├── f9
│   │   └── 60e7d48c47e86289a653b0afc0b7a13a9d372e
│   ├── info
│   └── pack
└── refs
    ├── heads
    │   └── master
    └── tags

In order to find the current state, we first look up what branch we are on by checking .rgit/HEAD. This points to .rgit/refs/heads/master, the master branch. The master branch points to its latest commit. The commit in turn points to a tree object representing the root of the project. This tree object points to another tree object representing the bin/ directory which in turn points to two blob objects containing the compressed contents of bin/rgit and bin/rgit-add at the time of the commit.
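
That chain of lookups can be sketched in a few lines. The helper names below (object_path, resolve_head) are hypothetical, and the code assumes RGit’s layout as written by the commit script: commit objects are plain text whose first line is tree <sha>. The demo builds a tiny stand-in repository with made-up SHAs.

```ruby
require "tmpdir"
require "fileutils"

# Hypothetical helper: where an object with this SHA lives on disk.
def object_path(rgit_dir, sha)
  File.join(rgit_dir, "objects", sha[0..1], sha[2..-1])
end

# Hypothetical helper: follows HEAD -> branch ref -> commit -> root tree.
def resolve_head(rgit_dir)
  ref = File.read(File.join(rgit_dir, "HEAD")).strip.split.last # "refs/heads/master"
  commit_sha = File.read(File.join(rgit_dir, ref)).strip
  commit = File.read(object_path(rgit_dir, commit_sha))
  tree_sha = commit.lines.first.split.last # first line is "tree <sha>"

  { ref: ref, commit: commit_sha, tree: tree_sha }
end

# Usage: a stand-in .rgit directory with one commit on master.
Dir.mktmpdir do |dir|
  rgit = File.join(dir, ".rgit")
  commit_sha = "c" * 40 # placeholder SHAs, not real hashes
  tree_sha = "d" * 40

  FileUtils.mkdir_p(File.join(rgit, "refs", "heads"))
  FileUtils.mkdir_p(File.dirname(object_path(rgit, commit_sha)))
  File.write(File.join(rgit, "HEAD"), "ref: refs/heads/master\n")
  File.write(File.join(rgit, "refs", "heads", "master"), commit_sha)
  File.write(object_path(rgit, commit_sha), "tree #{tree_sha}\nauthor user\n\nFirst commit\n")

  p resolve_head(rgit)
end
```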

This structure of objects pointing to each other is what makes Git so powerful. By simply changing a few of these pointers, we can switch to different points in history.
