Do Not Resuscitate

Learning static analysis of Ruby by building a dead code detector

Jason Gedge

Founding Engineer

Gadget

Hey, everyone. My name's Jason, but everyone knows me by my last name, Gedge. I work at a local startup called Gadget, where I write next to no Ruby, so it makes total sense that I'm here to talk about static analysis in Ruby.

So what do we mean by static analysis? It's when we analyze some source code without actually running it. Ruby makes this an exceptionally hard problem, since there can be a lot of metaprogramming, but that doesn't mean we can't do interesting things with it.

Tonight we'll learn a bit about this through building a script that can detect unused methods in a codebase. Let's start by learning a bit about abstract syntax trees.

https://github.com/Shopify/packwerk

Abstract Syntax Tree (AST)

It is a tree representation of the abstract syntactic structure of text (often source code) written in a formal language.
Wikipedia

ruby-parse -e "
  # a simple class
  class MyClass;
    def initialize
      @ivar = 1
    end
  end
"

(class
  (const nil :MyClass)
  nil
  (def :initialize
    (args)
    (ivasgn :@ivar
      (int 1))))

The parser gem ships with a great tool called ruby-parse that can take some Ruby code and output an AST in a format known as an "s-expression" (popularized by Lisp).

It's a great tool to use so you can quickly visualize an AST for a statement. In this example you can see several different kinds of nodes: class, const, def, args, ivasgn, and int (the word immediately after each opening parenthesis).

Note how there's no reference to line/column numbers, and the comment is nowhere to be found. That's what makes it an AST.

But you might wonder: what are all the possible node types? And what are the other bits in the s-expression like those nil values?

node_dump.c

Where you can find all of the possible node types you can encounter.

static void
dump_node(VALUE buf, VALUE indent, int comment, const NODE * node) {

  // ...

    case NODE_CLASS:

  // ...

}

case NODE_CLASS:
	ANN("class definition");
	ANN("format: class [nd_cpath] < [nd_super]; [nd_body]; end");
	ANN("example: class C2 < C; ..; end");
	F_NODE(nd_cpath, "class path");
	F_NODE(nd_super, "superclass");
	LAST_NODE;
	F_NODE(nd_body, "class definition");
	return;

(class
  (const nil :MyClass)
  nil
  (def :initialize
    (args)
    (ivasgn :@ivar
      (int 1))))

Parser options

💎 parser
👮 RuboCop
🌲 RubyVM::AbstractSyntaxTree

First, we have the parser gem. It has the benefit of being able to parse many different versions of Ruby code. Being pure Ruby, it isn't the fastest option though.

Secondly, we have RuboCop, which is built on the parser gem. It makes it really easy to deal with parsing a lot of files, and it comes with a DSL that makes ASTs easier to work with.

Finally, available since Ruby 2.6, is the RubyVM::AbstractSyntaxTree module. It's built into CRuby itself, so there's nothing to install, and it's really fast, but it's much more "raw".

We're going to focus on this one.

🚨 🚨 🚨 🚨 🚨


This class is experimental and its API is not stable, therefore it might change without notice. As examples, the order of children nodes is not guaranteed, the number of children nodes might change, there is no way to access children nodes by name, etc.

If you are looking for a stable API or an API working under multiple Ruby implementations, consider using the parser gem or Ripper. If you would like to make RubyVM::AbstractSyntaxTree stable, please join the discussion at https://bugs.ruby-lang.org/issues/14844 .

https://github.com/ruby/ruby/blob/master/ast.rb

🚨 🚨 🚨 🚨 🚨

RubyVM::AbstractSyntaxTree.parse(string)

RubyVM::AbstractSyntaxTree.parse_file(pathname)

RubyVM::AbstractSyntaxTree::Node

type
children
https://ruby-doc.org/core-trunk/RubyVM/AST/Node.html for the rest
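
As a minimal sketch of the parse entry point and the two Node methods listed above, here is what inspecting a parsed string might look like. The `summarize` helper and the sample source are made up for illustration, and the exact child layout depends on the Ruby version, as the warning says.

```ruby
# Sample source; the comment on the first line does not appear in the AST.
source = <<~RUBY
  # a simple class
  class MyClass
    def initialize
      @ivar = 1
    end
  end
RUBY

root = RubyVM::AbstractSyntaxTree.parse(source)

# Hypothetical helper: report a node's type (a Symbol) and its child count.
# Children can be other nodes, symbols, arrays, or nil.
def summarize(node)
  { type: node.type, child_count: node.children.length }
end

puts summarize(root)          # the root of a parsed program is a :SCOPE node
body = root.children.last     # in current CRuby, the last child of SCOPE is its body
puts summarize(body)          # here the body is the :CLASS node
```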

ast.c

Shows the ordering of children, which can differ from ruby-parse / node_dump.c.

case NODE_CALL:
  return rb_ary_new_from_args(
    3,
    NEW_CHILD(ast_value, RNODE_CALL(node)->nd_recv),
    ID2SYM(RNODE_CALL(node)->nd_mid),
    NEW_CHILD(ast_value, RNODE_CALL(node)->nd_args)
  );
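
To see that ordering from the Ruby side, the sketch below finds a CALL node and destructures its children in the same order as the C code above: receiver, method name, arguments. The `find_node` helper is hypothetical, and the layout is only what current CRuby happens to produce.

```ruby
# Hypothetical helper: depth-first search for the first node of a given type.
def find_node(node, type)
  return nil unless node.is_a?(RubyVM::AbstractSyntaxTree::Node)
  return node if node.type == type

  node.children.each do |child|
    found = find_node(child, type)
    return found if found
  end
  nil
end

root = RubyVM::AbstractSyntaxTree.parse("receiver.bar(1)")
call = find_node(root, :CALL)

# Same order as rb_ary_new_from_args in ast.c.
receiver, method_name, arguments = call.children

puts receiver.type   # the receiver is itself a node
puts method_name     # the method name is a plain Symbol
puts arguments.type  # the argument list is another node
```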

Designing our dead code detector

For every Ruby file in the current path, collect two pieces of information:

Function calls
Method definitions

We're going to start off keeping our approach really simple. We'll collect two pieces of information while traversing the ASTs of all files under a directory.

First, we need to find all of the function calls. Second, we need to find all of the method definitions.

We're almost certainly going to have false positives here, but the idea is to start simple, and figure out how to build heuristics on top of that to filter out those false positives. As an example: Rails, being a framework, will call many methods in our codebase. We're not going to find a function call, for example, for a controller's index function.
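
The "every Ruby file" part of this design could be sketched as follows. `each_ruby_ast` is a hypothetical helper, not something from the talk's script; it assumes files end in `.rb` and simply skips anything that fails to parse.

```ruby
# Hypothetical helper: yield [path, root_node] for every .rb file under dir.
def each_ruby_ast(dir)
  return enum_for(:each_ruby_ast, dir) unless block_given?

  Dir.glob(File.join(dir, "**", "*.rb")).sort.each do |path|
    root =
      begin
        RubyVM::AbstractSyntaxTree.parse_file(path)
      rescue SyntaxError => e
        # A file we cannot parse should not abort the whole run.
        warn "Skipping #{path}: #{e.message.lines.first.to_s.chomp}"
        nil
      end

    yield path, root if root
  end
end

each_ruby_ast(Dir.pwd) do |path, root|
  puts "#{path}: #{root.type}"
end
```

Sorting the paths keeps the output stable between runs, which helps when comparing reports.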

def traverse(node, &blk)
  return unless node.is_a?(RubyVM::AbstractSyntaxTree::Node)

  blk.call(node)

  node.children.each do |child|
    traverse(child, &blk)
  end
end

traverse(RubyVM::AbstractSyntaxTree.parse_file(pathname)) do |node|
  # TODO
end
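
Before filling in the TODO, it can help to see what the traversal visits. This sketch repeats the traversal (using `is_a?`) so it runs on its own; `node_types` is a hypothetical helper that records each node's type in visiting order.

```ruby
# Pre-order traversal: visit the node, then each child that is itself a node.
def traverse(node, &blk)
  return unless node.is_a?(RubyVM::AbstractSyntaxTree::Node)

  blk.call(node)
  node.children.each { |child| traverse(child, &blk) }
end

# Hypothetical helper: list node types in the order they are visited.
def node_types(source)
  types = []
  traverse(RubyVM::AbstractSyntaxTree.parse(source)) { |node| types << node.type }
  types
end

types = node_types("class MyClass; def initialize; @ivar = 1; end; end")
puts types.inspect
```

Note the names differ from ruby-parse: this API reports `:CLASS`, `:DEFN`, and `:IASGN` where ruby-parse printed `class`, `def`, and `ivasgn`.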

require "set" # only needed before Ruby 3.2, which loads Set automatically

calls = Set.new
defs = Set.new

traverse(RubyVM::AbstractSyntaxTree.parse_file(pathname)) do |node|
  # TODO
end

puts "Calls: #{calls.to_a}"
puts " Defs: #{defs.to_a}"

calls = Set.new
defs = Set.new

traverse(RubyVM::AbstractSyntaxTree.parse_file(pathname)) do |node|
  case node.type
  when :FCALL
    calls << node.children[0]
  when :CALL
    calls << node.children[1]
  end
end

puts "Calls: #{calls.to_a}"
puts " Defs: #{defs.to_a}"
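
FCALL and CALL do not cover every call form. As a possible extension (not part of the script above), the sketch below also records VCALL, QCALL, and OPCALL nodes. The index table is an assumption based on the child order ast.c uses in current CRuby, and the names `CALL_NAME_INDEX` and `collect_calls` are made up here.

```ruby
require "set"

def traverse(node, &blk)
  return unless node.is_a?(RubyVM::AbstractSyntaxTree::Node)

  blk.call(node)
  node.children.each { |child| traverse(child, &blk) }
end

# Assumed position of the method name within node.children, per node type.
CALL_NAME_INDEX = {
  FCALL: 0,  # helper(1)  call with arguments and no receiver
  VCALL: 0,  # helper     bare identifier that may be a method call
  CALL: 1,   # obj.run
  QCALL: 1,  # obj&.stop
  OPCALL: 1, # 1 + 2
}.freeze

def collect_calls(source)
  calls = Set.new
  traverse(RubyVM::AbstractSyntaxTree.parse(source)) do |node|
    index = CALL_NAME_INDEX[node.type]
    calls << node.children[index] if index
  end
  calls
end

puts collect_calls("helper\nputs(1)\nobj.run\nobj&.stop\n1 + 2\n").to_a.inspect
```

Counting VCALL will also pick up bare receivers such as `obj`, which only makes the detector more conservative.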

calls = Set.new
defs = Set.new

traverse(RubyVM::AbstractSyntaxTree.parse_file(pathname)) do |node|
  case node.type
  when :FCALL
    calls << node.children[0]
  when :CALL
    calls << node.children[1]
  when :DEFN
    defs << node.children[0]
  when :DEFS
    defs << node.children[1]
  end
end

puts "Calls: #{calls.to_a}"
puts " Defs: #{defs.to_a}"
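
A report is more useful if it says where each method lives. As a sketch of one way to do that, the hypothetical `collect_def_lines` below keeps the same DEFN/DEFS indexing but stores `first_lineno` for each definition instead of only the name.

```ruby
def traverse(node, &blk)
  return unless node.is_a?(RubyVM::AbstractSyntaxTree::Node)

  blk.call(node)
  node.children.each { |child| traverse(child, &blk) }
end

# Map each defined method name to the line where its definition node starts.
def collect_def_lines(source)
  lines = {}
  traverse(RubyVM::AbstractSyntaxTree.parse(source)) do |node|
    case node.type
    when :DEFN then lines[node.children[0]] = node.first_lineno # def squared
    when :DEFS then lines[node.children[1]] = node.first_lineno # def self.create
    end
  end
  lines
end

sample = <<~RUBY
  class Num
    def self.create
      new
    end

    def squared
    end
  end
RUBY

def_lines = collect_def_lines(sample)
def_lines.each { |name, line| puts "#{name} defined on line #{line}" }
```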

calls = Set.new
defs = Set.new

traverse(RubyVM::AbstractSyntaxTree.parse_file(pathname)) do |node|
  case node.type
  when :FCALL
    calls << node.children[0]
  when :CALL
    calls << node.children[1]
  when :DEFN
    defs << node.children[0]
  when :DEFS
    defs << node.children[1]
  end
end

puts "     Calls: #{calls.to_a}"
puts "      Defs: #{defs.to_a}"
puts "Maybe dead: #{(defs - calls).to_a}"

$ ruby find_dead_code.rb unused_example.rb

     Calls: [:include, :attr_reader, :new, :value, :create, :puts, :to_s, :squared]
      Defs: [:create, :initialize, :squared, :incremented, :<=>]
Maybe dead: [:initialize, :incremented, :<=>]
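
Two of those three are false positives of the kind mentioned earlier: `initialize` is invoked through `new`, and `<=>` is invoked by things like Comparable and sort. One possible first heuristic, sketched here with a hypothetical allowlist (`IMPLICITLY_CALLED` and `maybe_dead` are names invented for this example), is to ignore methods that Ruby calls implicitly.

```ruby
require "set"

# Hypothetical allowlist of methods Ruby invokes without an explicit call site.
IMPLICITLY_CALLED = Set[
  :initialize, :<=>, :==, :hash, :each, :inspect, :method_missing, :respond_to_missing?
].freeze

# Definitions that are never called by name and are not on the allowlist.
def maybe_dead(defs, calls, ignored: IMPLICITLY_CALLED)
  (defs - calls - ignored).to_a
end

# The sets printed by the run above.
calls = Set[:include, :attr_reader, :new, :value, :create, :puts, :to_s, :squared]
defs = Set[:create, :initialize, :squared, :incremented, :<=>]

puts "Maybe dead: #{maybe_dead(defs, calls)}" # prints "Maybe dead: [:incremented]"
```

A framework-specific list (for example, controller actions in Rails) could be passed through the `ignored:` keyword in the same way.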

https://gist.github.com/thegedge/c05da047120c6b208dccce6025232610

Thanks for listening!

❤️ ❤️ ❤️ ❤️ ❤️ ❤️

https://gedge.ca/presentations/2024-05-28-static_analysis_of_ruby

thegedge thegedge jason-gedge