Do Not Resuscitate
Learning static analysis of Ruby by building a dead code
detector
Jason Gedge
Founding Engineer
Hey, everyone. My name's Jason, but everyone knows me by my
last name, Gedge. I work at local startup called Gadget, where I
write next to no Ruby, so it makes total sense that I'm
here to talk about static analysis in Ruby.
So what do we mean by static analysis? It's when we analyze
some source code without actually running it. Ruby makes this an
exceptionally hard problem, since there can be a lot of
metaprogramming, but that doesn't mean we can't do
interesting things with it.
Tonight we'll learn a bit about this through building a
script that can detect unused methods in a codebase. Let's
start by learning a bit about abstract syntax trees.
My biggest journey into static analysis was at Shopify, working
on a tool called packwerk.
We had a requirement that we needed a tool that could very
quickly analyze Shopify's huge monolith quickly. The tricky
part was properly resolving constants statically, but it was a
ton of fun and we learned a lot.
Abstract Syntax Tree (AST)
It is a tree representation of the abstract
syntactic structure of text (often source code)
written in a formal language.
Wikipedia
The key thing to remember is that we can break down our code
into a tree-like structure.
Note "abstract" here means it's not necessarily
directly connected to the original text. For example, whitespace
and comments may not be represented in an AST.
If the tree can map perfectly back to the source text, it's
called a concrete syntax tree.
So what does that look like?
ruby-parse -e "
# a simple class
class MyClass;
def initialize
@ivar = 1
end
end
"
( class
( const nil :MyClass )
nil
( def :initialize
( args
( ivasgn
:@ivar
( int 1 ) ) ) ) )
Ruby has this great tool called ruby-parse that can take some
Ruby code and output an AST in format known as
"s-expression" (popularized by Lisp).
It's a great tool to use so you can quickly visualize an
AST for a statement. In this example you can see several
different kinds of nodes: class, const, def, args, ivasgn, and
int (the word immediately after each opening parenthesis).
Note how there's no reference to line/column numbers, and
the comment is nowhere to be found. That's what makes it an
AST.
But you might wonder: what are all the possible node types? And
what are the other bits in the s-expression like those nil
values?
node_dump.c
Where you can find all of the possible node types you can
encounter.
static void
dump_node ( VALUE buf , VALUE indent , int comment , const NODE * node ) {
case NODE_CLASS :
}
Well, there's no real reference that I know of. When that
happens, I like to go to the source!
In the Ruby source code you can find a file called
node_dump.c
. In there is a method
called "dump_node", which has some great examples of
the node types, what each part of the s-expression for that node
means, and examples of what it looks like in source code.
case NODE_CLASS :
ANN ( "class definition" ) ;
ANN ( "format: class [nd_cpath] < [nd_super]; [nd_body]; end" ) ;
ANN ( "example: class C2 < C; ..; end" ) ;
F_NODE ( nd_cpath , "class path" ) ;
F_NODE ( nd_super , "superclass" ) ;
LAST_NODE ;
F_NODE ( nd_body , "class definition" ) ;
return ;
For example, here's an example of the NODE_CLASS node type,
which we saw earlier. Focus on each of the
F_NODE
calls to get a good
description of each subnode.
case NODE_CLASS :
ANN ( "class definition" ) ;
ANN ( "format: class [nd_cpath] < [nd_super]; [nd_body]; end" ) ;
ANN ( "example: class C2 < C; ..; end" ) ;
F_NODE ( nd_cpath , "class path" ) ;
F_NODE ( nd_super , "superclass" ) ;
LAST_NODE ;
F_NODE ( nd_body , "class definition" ) ;
( class
( const nil :MyClass )
nil
( def :initialize
( args )
( ivasgn :@ivar
( int 1 ) ) ) )
For example, here's an example of the NODE_CLASS node type,
which we saw earlier. Focus on each of the
F_NODE
calls to get a good
description of each subnode.
And here's the s-expression for the Ruby code we saw
earlier. You can see how the s-expression maps to the node type.
We now know that the nil
we saw
earlier was the super class.
So how do we actually get an AST in Ruby?
I've been out of the Ruby game for some time, so there may
be newer options, but I'll highlight three I'm
familiar with.
Parser options
💎 parser
👮 Rubocop
🌲 RubyVM::AbstractSyntaxTree
First, we have the parser gem. It has the benefit of being able
to parse many different versions of Ruby code. Being pure Ruby,
it isn't the fastest option though.
(click)
Secondly, we have Rubocop, which is built on the parser gem. It
can make it really easy to deal with parsing a lot of files, and
it comes with a DSL to make it easier to work with ASTs.
(click)
Finally, available since Ruby 2.6, is the
AbstractSyntaxTree
module. It's
built into the standard library, and is really fast, but is much
more "raw".
We're going to focus on this one.
This class is experimental and its API is not stable,
therefore it might change without notice. As examples, the
order of children nodes is not guaranteed, the number of
children nodes might change, there is no way to access
children nodes by name, etc.
If you are looking for a stable API or an API working under
multiple Ruby implementations, consider using the
parser
gem or Ripper. If you would like to make
RubyVM::AbstractSyntaxTree stable, please join the
discussion at
https://bugs.ruby-lang.org/issues/14844 .
It's important to call out that the builtin AST module is
not considered a stable API yet. Expect it to break from one
version to the next.
For me, that's okay. Most of the time I'm doing these
analyses I am making quick scripts that are meant to be thrown
away.
RubyVM :: AbstractSyntaxTree . parse ( string )
RubyVM :: AbstractSyntaxTree . parse_file ( pathname )
Fortunately, producing an AST is super simple in Ruby!
You simply call the
parse_file
method with a path to
some Ruby file and voila, you have the AST. There's also
the parse method if you have a string with some Ruby source
code.
So that gives you a tree-like structure that you can traverse,
but how do we do that?
RubyVM::AbstractSyntaxTree::Node
We're primarily going to focus on two attributes.
(click)
type
will be the type of node
(uppercase). You can find those saw in the
dump_node
method earlier. Don't
forget you can use ruby-parse
too!
(click)
children
are the child nodes under
the current node. You'll generally be writing a recursive
function to traverse the tree. Note that not all children will
be nodes. Sometimes there'll be symbol or other literal
values, depending on the type of node we're in.
ast.c
shows the ordering of children, which can be different than
ruby-parse
/
node_dump.c
.
case NODE_CALL :
return rb_ary_new_from_args (
3 ,
NEW_CHILD ( ast_value , RNODE_CALL ( node ) -> nd_recv ) ,
ID2SYM ( RNODE_CALL ( node ) -> nd_mid ) ,
NEW_CHILD ( ast_value , RNODE_CALL ( node ) -> nd_args )
) ;
Just one side note.
Most of the times the ordering of children you get from
ruby-parse will be the same, but if it isn't you can either
print out the types yourself, or head over to
ast.c
to see.
Okay, let's traverse the tree. We're going to build
what's known as a "depth-first traversal".
This means we'll go as deep as possible down one branch of
the tree before moving on to the next. We'll process nodes
in "pre-order", which means we'll do something
before descending. That's the red dots in this diagram.
Public domain
source
Designing our dead code detector
For every Ruby file in current path, collect two pieces of
information:
Function calls
Method definitions
We're going to start off keeping our approach really
simple. We'll collect two pieces of information while
traversing the ASTs of all files under a directory.
First, we need to find all of the function calls. Second, we
need to find all of the method definitions.
We're almost certainly going to have false positives here,
but the idea is to start simple, and figure out how to build
heuristics on top of that to filter out those false positives.
As an example: Rails, being a framework, will call many methods
in our codebase. We're not going to find a function call,
for example, for a controller's
index
function.
def traverse ( node , & blk )
return unless node . is_a ( RubyVM :: AbstractSyntaxTree :: Node )
blk . call ( node )
node . children . each do | child |
traverse ( child , & blk )
end
end
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
end
So here's the usual template for our traversal function.
We're going to take a block to make it really easy to
reuse.
Let's start filling it in.
calls = Set . new
defs = Set . new
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
end
puts "Calls: #{ calls . to_a } "
puts " Defs: #{ defs . to_a } "
Next, we'll create our sets that will keep track of the
function calls and definitions
calls = Set . new
defs = Set . new
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
case node . type
when :FCALL
calls << node . children [ 0 ]
when :CALL
calls << node . children [ 1 ]
end
end
puts "Calls: #{ calls . to_a } "
puts " Defs: #{ defs . to_a } "
Now, let's start tracking function calls. The tricky bit is
always getting the right child.
Note how we only care about the name of the functional call.
We're ignoring the receiver, but a more intelligent dead
code detector may want to use that.
calls = Set . new
defs = Set . new
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
case node . type
when :FCALL
calls << node . children [ 0 ]
when :CALL
calls << node . children [ 1 ]
when :DEFN
defs << node . children [ 0 ]
when :DEFS
defs << node . children [ 1 ]
end
end
puts "Calls: #{ calls . to_a } "
puts " Defs: #{ defs . to_a } "
Now, function definitions.
calls = Set . new
defs = Set . new
traverse ( RubyVM :: AbstractSyntaxTree . parse_file ( pathname ) ) do | node |
case node . type
when :FCALL
calls << node . children [ 0 ]
when :CALL
calls << node . children [ 1 ]
when :DEFN
defs << node . children [ 0 ]
when :DEFS
defs << node . children [ 1 ]
end
end
puts " Calls: #{ calls . to_a } "
puts " Defs: #{ defs . to_a } "
puts "Maybe dead: #{ ( defs - calls ) . to_a } "
Finally, we tie it all together by removing all of the called
methods from the definitions set.
$ ruby find_dead_code.rb unused_example.rb
Calls: [:include, :attr_reader, :new, :value, :create, :puts, :to_s, :squared]
Defs: [:create, :initialize, :squared, :incremented, :<=>]
Maybe dead: [:initialize, :incremented, :<=>]
Let's run that against our test file.
(demo)
So, we got pretty close! The only false positive we got was
initialize
, but it's actually
implicitly called when we instantiate a class.
The spaceship operator --
<=>
-- is used for
comparisons, and could also be called implicitly by other code.
So there you have it. We built a primitive dead code detector in
Ruby.
If you want a slightly more sophisticated example, you can check
out one of my gists. It has a few bonus features like parsing
many files, ERB support, and some heuristics for excluding
methods that are likely to be called, even if we couldn't
find a call for them.
Thanks for listening!️
❤️ ❤️ ❤️ ❤️ ❤️ ❤️
Thanks for listening! I'm looking forward to seeing you all
build some cool static analysis tools :)