Do Not Resuscitate Learning static analysis of Ruby by building a dead code detector Jason Gedge
Founding Engineer
Hey, everyone. My name's Jason, but everyone knows me by my last name, Gedge. I work at local startup called Gadget,
where I write next to no Ruby, so it makes total sense that I'm here to talk about static analysis in Ruby.
So what do we mean by static analysis? It's when we analyze some source code without actually running it. Ruby makes
this an exceptionally hard problem, since there can be a lot of metaprogramming, but that doesn't mean we can't do
interesting things with it.
Tonight we'll learn a bit about this through building a script that can detect unused methods in a codebase. Let's
start by learning a bit about abstract syntax trees.
My biggest journey into static analysis was at Shopify, working on a tool called packwerk.
We had a requirement that we needed a tool that could very quickly analyze Shopify's huge monolith quickly. The tricky
part was properly resolving constants statically, but it was a ton of fun and we learned a lot.
Abstract Syntax Tree (AST)
It is a tree representation of the abstract syntactic structure of text (often source code)
written in a formal language.
Wikipedia
The key thing to remember is that we can break down our code into a tree-like structure.
Note "abstract" here means it's not necessarily directly connected to the original text. For example, whitespace and
comments may not be represented in an AST.
If the tree can map perfectly back to the source text, it's called a concrete syntax tree.
So what does that look like?
ruby-parse -e "
# a simple class
class MyClass;
def initialize
@ivar = 1
end
end
"
( class
( const nil :MyClass )
nil
( def :initialize
( args
( ivasgn
:@ivar
( int 1 ) ) ) ) )
Ruby has this great tool called ruby-parse that can take some Ruby code and output an AST in format known as
"s-expression" (popularized by Lisp).
It's a great tool to use so you can quickly visualize an AST for a statement. In this example you can see several
different kinds of nodes: class, const, def, args, ivasgn, and int (the word immediately after each opening
parenthesis).
Note how there's no reference to line/column numbers, and the comment is nowhere to be found. That's what makes it an
AST.
But you might wonder: what are all the possible node types? And what are the other bits in the s-expression like those
nil values?
node_dump.c
Where you can find all of the possible node types you can encounter.
static void
dump_node ( VALUE buf , VALUE indent , int comment , const NODE * node ) {
case NODE_CLASS :
}
Well, there's no real reference that I know of. When that happens, I like to go to the source!
In the Ruby source code you can find a file called node_dump.c
. In there is a method called "dump_node", which has
some great examples of the node types, what each part of the s-expression for that node means, and examples of what it
looks like in source code.
case NODE_CLASS :
ANN ( "class definition" ) ;
ANN ( "format: class [nd_cpath] < [nd_super]; [nd_body]; end" ) ;
ANN ( "example: class C2 < C; ..; end" ) ;
F_NODE ( nd_cpath , "class path" ) ;
F_NODE ( nd_super , "superclass" ) ;
LAST_NODE ;
F_NODE ( nd_body , "class definition" ) ;
return ;
For example, here's an example of the NODE_CLASS node type, which we saw earlier. Focus on each of the F_NODE
calls to
get a good description of each subnode.
case NODE_CLASS :
ANN ( "class definition" ) ;
ANN ( "format: class [nd_cpath] < [nd_super]; [nd_body]; end" ) ;
ANN ( "example: class C2 < C; ..; end" ) ;
F_NODE ( nd_cpath , "class path" ) ;
F_NODE ( nd_super , "superclass" ) ;
LAST_NODE ;
F_NODE ( nd_body , "class definition" ) ;
( class
( const nil :MyClass )
nil
( def :initialize
( args )
( ivasgn :@ivar
( int 1 ) ) ) )
For example, here's an example of the NODE_CLASS node type, which we saw earlier. Focus on each of the F_NODE
calls to
get a good description of each subnode.
And here's the s-expression for the Ruby code we saw earlier. You can see how the s-expression maps to the node type. We
now know that the nil
we saw earlier was the super class.
So how do we actually get an AST in Ruby?
I've been out of the Ruby game for some time, so there may be newer options, but I'll highlight three I'm familiar with.
Parser options 💎 parser
👮 Rubocop
🌲 RubyVM::AbstractSyntaxTree
First, we have the parser gem. It has the benefit of being able to parse many different versions of Ruby code. Being
pure Ruby, it isn't the fastest option though.
(click)
Secondly, we have Rubocop, which is built on the parser gem. It can make it really easy to deal with parsing a lot of
files, and it comes with a DSL to make it easier to work with ASTs.
(click)
Finally, available since Ruby 2.6, is the AbstractSyntaxTree
module. It's built into the standard library, and is
really fast, but is much more "raw".
We're going to focus on this one.
This class is experimental and its API is not stable, therefore it might change without notice. As examples, the order
of children nodes is not guaranteed, the number of children nodes might change, there is no way to access children nodes
by name, etc.
If you are looking for a stable API or an API working under multiple Ruby implementations, consider using the parser
gem or Ripper. If you would like to make RubyVM::AbstractSyntaxTree stable, please join the discussion at
https://bugs.ruby-lang.org/issues/14844 .
It's important to call out that the builtin AST module is not considered a stable API yet. Expect it to break from one
version to the next.
For me, that's okay. Most of the time I'm doing these analyses I am making quick scripts that are meant to be thrown
away.
RubyVM :: AbstractSyntaxTree . parse ( string )
RubyVM :: AbstractSyntaxTree . parse_file ( pathname )
Fortunately, producing an AST is super simple in Ruby!
You simply call the parse_file
method with a path to some Ruby file and voila, you have the AST. There's also the
parse method if you have a string with some Ruby source code.
So that gives you a tree-like structure that you can traverse, but how do we do that?
RubyVM::AbstractSyntaxTree::Node
We're primarily going to focus on two attributes.
(click)
type
will be the type of node (uppercase). You can find those saw in the dump_node
method earlier. Don't forget you
can use ruby-parse
too!
(click)
children
are the child nodes under the current node. You'll generally be writing a recursive function to traverse the
tree. Note that not all children will be nodes. Sometimes there'll be symbol or other literal values, depending on the
type of node we're in.
ast.c
shows the ordering of children, which can be different than ruby-parse
/ node_dump.c
.
case NODE_CALL :
return rb_ary_new_from_args (
3 ,
NEW_CHILD ( ast_value , RNODE_CALL ( node ) -> nd_recv ) ,
ID2SYM ( RNODE_CALL ( node ) -> nd_mid ) ,
NEW_CHILD ( ast_value , RNODE_CALL ( node ) -> nd_args )
) ;
Just one side note.
Most of the times the ordering of children you get from ruby-parse will be the same, but if it isn't you can either
print out the types yourself, or head over to ast.c
to see.
Okay, let's traverse the tree. We're going to build what's known as a "depth-first traversal".
This means we'll go as deep as possible down one branch of the tree before moving on to the next. We'll process nodes in
"pre-order", which means we'll do something before descending. That's the red dots in this diagram.
Public domain source Designing our dead code detector For every Ruby file in current path, collect two pieces of information:
Function calls
Method definitions
We're going to start off keeping our approach really simple. We'll collect two pieces of information while traversing
the ASTs of all files under a directory.
First, we need to find all of the function calls. Second, we need to find all of the method definitions.
We're almost certainly going to have false positives here, but the idea is to start simple, and figure out how to build
heuristics on top of that to filter out those false positives. As an example: Rails, being a framework, will call many
methods in our codebase. We're not going to find a function call, for example, for a controller's index
function.
def traverse ( node , & blk )
return unless node . is_a ( RubyVM :: AbstractSyntaxTree :: Node )
blk . call ( node )
node . children . each do | child |
traverse ( child , & blk )
end
end
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
end
So here's the usual template for our traversal function. We're going to take a block to make it really easy to reuse.
Let's start filling it in.
calls = Set . new
defs = Set . new
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
end
puts "Calls: #{ calls . to_a } "
puts " Defs: #{ defs . to_a } "
Next, we'll create our sets that will keep track of the function calls and definitions
calls = Set . new
defs = Set . new
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
case node . type
when :FCALL
calls << node . children [ 0 ]
when :CALL
calls << node . children [ 1 ]
end
end
puts "Calls: #{ calls . to_a } "
puts " Defs: #{ defs . to_a } "
Now, let's start tracking function calls. The tricky bit is always getting the right child.
Note how we only care about the name of the functional call. We're ignoring the receiver, but a more intelligent dead
code detector may want to use that.
calls = Set . new
defs = Set . new
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
case node . type
when :FCALL
calls << node . children [ 0 ]
when :CALL
calls << node . children [ 1 ]
when :DEFN
defs << node . children [ 0 ]
when :DEFS
defs << node . children [ 1 ]
end
end
puts "Calls: #{ calls . to_a } "
puts " Defs: #{ defs . to_a } "
Now, function definitions.
calls = Set . new
defs = Set . new
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
case node . type
when :FCALL
calls << node . children [ 0 ]
when :CALL
calls << node . children [ 1 ]
when :DEFN
defs << node . children [ 0 ]
when :DEFS
defs << node . children [ 1 ]
end
end
puts " Calls: #{ calls . to_a } "
puts " Defs: #{ defs . to_a } "
puts "Maybe dead: #{ ( defs - calls ) . to_a } "
Finally, we tie it all together by removing all of the called methods from the definitions set.
$ ruby find_dead_code.rb unused_example.rb
Calls: [:include, :attr_reader, :new, :value, :create, :puts, :to_s, :squared]
Defs: [:create, :initialize, :squared, :incremented, :<=>]
Maybe dead: [:initialize, :incremented, :<=>]
Let's run that against our test file.
(demo)
So, we got pretty close! The only false positive we got was initialize
, but it's actually implicitly called when we
instantiate a class.
The spaceship operator -- <=>
-- is used for comparisons, and could also be called implicitly by other code.
So there you have it. We built a primitive dead code detector in Ruby.
If you want a slightly more sophisticated example, you can check out one of my gists. It has a few bonus features like
parsing many files, ERB support, and some heuristics for excluding methods that are likely to be called, even if we
couldn't find a call for them.
Thanks for listening!️
❤️ ❤️ ❤️ ❤️ ❤️ ❤️
Thanks for listening! I'm looking forward to seeing you all build some cool static analysis tools :)