I have always been a fan of visualizations as I believe firmly that it is easier to crunch visual information than anything else. Visualizations are especially helpful for finding out patterns in data that are not expected and for patterns that are difficult to express textually in a concise manner.
The beast
A couple of weeks back I had the task of refactoring a very badly written piece of python code which violated tons of programming guidelines because of rampant usage of global variables, horrendous variable names and functions whose body extended to a couple of hundred lines of code!
As a first step to taming the beast, I removed all globals used and replaced with explicit arguments to function calls.
After this I had to split the code into multiple files so that different coherent parts maintained by respective teams/persons would go into seperate files. The file I was looking at had 5000+ lines of code and 100+ functions calling each other happily.
The approach
I was trying to figure out how to partition the file when I was struck by an interesting idea. The idea was to generate a call graph by statically analyzing the python code and then look at the generated graph to figure out which functions I could group in a given file split.
Python has a built-in module called parser that can construct an syntax tree by parsing a given piece of python code.
ast = parser.suite(python_code) ast_list = parser.ast2list(ast)
The syntax tree generated for the python function
def a(): c()
is
['stmt', ['compound_stmt', ['funcdef', ['NAME', 'def'], ['NAME', 'a'], ['parameters', ['LPAR', '('], ['RPAR', ')']], ['COLON', ':'], ['suite', ['NEWLINE', ''], ['INDENT', ''], ['stmt', ['simple_stmt', ['small_stmt', ['expr_stmt', ['testlist', ['test', ['or_test', ['and_test', ['not_test', ['comparison', ['expr', ['xor_expr', ['and_expr', ['shift_expr', ['arith_expr', ['term', ['factor', ['power', ['atom', ['NAME', 'c']], ['trailer', ['LPAR', '('], ['RPAR', ')']]]]]]]]]]]]]]]]]], ['NEWLINE', '']]]]]]]
I wrote a script to get the parse tree for the code I had to refactor. For every function definition, I found the name of the defined function and the functions it was calling (I removed references to built-in functions from the accounting). From the resulting call graph, I generated a graph definition in the dot language used by Graphviz. When this “dot” file is given as input to Graphviz, the final graph image file is generated.
You can download the script from here: construct_call_graph.py.
Usage: (test.py is the input file containing python code and test.dot is the graph definition file for Graphviz
cat test.py | ./construct_call_graph.py > test.dot
The output

I was able to identify orphaned functions that we not being used and functions that were being used by a lot of other functions (utility functions) which I moved to a seperate file. The shaded regions are fairly isolated from the rest of the program and represent a set of functions doing a common task. I moved the two regions to their own files.
Here is the output file for a smaller code file.

callg1.py
def b(): pass def d(): pass def c(): b() d() def a(): c() b() a() |
digraph G { rankdir=LR a -> c; a -> b; c -> b; c -> d; b; d; } |

callg2.py
this code has a cycle in the call graph. b() -> c() -> b() -> …
def b(): c() def c(): b() def a(): c() b() a() |
digraph G { rankdir=LR a -> c; a -> b; c -> b; b -> c; } |

callg3.py
This code has an orphaned function, b().
def b(): pass def c(): pass def a(): c() a() |
digraph G { rankdir=LR a -> c; c; b; } |

Try your code here
I set up the call graph generator here so you can try it yourself. Before submitting a piece of python code, make sure it does not have syntax errors.
To generate the graph image from “dot” specification file, you will need to install Graphviz. If you have a graph defined in test.dot, the following command will generate a png file for it.
cat test.dot | dot -Tpng > test.png
Links
There is python module out there called “pycallgraph” that builds call graphs by collecting information at run-time. I tried it but it proved too slow for the program I applied it to. I could have filtered to reduce the number of functions it hooked into but was too lazy to try this. You can follow these links to learn more about it.
- http://thinktoomuch.net/2007/06/06/python-call-graphs/
- http://blogs.sun.com/richb/entry/orca_python_call_graph















Hugo wrote,
Wow! That’s sweet… I’ll give it a whirl when I have a big graph to generate. pycallgraph’s speed (slow) made me think it is most useful to use it selectively on particular unit tests - but then I’d have to have good unit tests, which I don’t always have.
How tough is it to extend this kind of thing to multiple modules? And what will it not work on, eval statements, obviously. How about use of “getattr” and similar? Basically any time a function is passed as a string, the call will be missed. Right?
Link | November 14th, 2007 at 12:14 pm
prashanthellina wrote,
Although I have not tried yet, extending the script to handle modules and classes should be straight-forward. Like you said, evals, execs and functions being passed around are difficult to handle. Do keep me posted if you get somewhere along these lines. Thanks!
Link | November 14th, 2007 at 12:26 pm
vikraman wrote,
This is a good way to split a big python script into smaller ones and including them in one “master script”. But i’m not sure whether such a division will be correct “functionally” because you might have to group functions that perform entirely different things in one file.
plus, There may not be an increase in speed in doing such a thing. However from the point of view of a programmer its pretty neat.
Link | November 23rd, 2007 at 4:00 pm
prashanthellina wrote,
I find it hard to imagine that “entirely different” functions will be calling each other. an example? I believe that functions that call each other will be doing some common task. Also, increase in speed is not a necessary goal of refactoring. In fact in python when you split a function into smaller functions, you incur a performace hit (albeit very small). The goal is to make the code more maintainable.
Link | November 23rd, 2007 at 6:13 pm