-
Notifications
You must be signed in to change notification settings - Fork 5
/
Copy pathsoftware_archeology.html
68 lines (68 loc) · 8.2 KB
/
software_archeology.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link href="https://adriann.github.io/feed.rss" rel="alternate" type="application/rss+xml" title="What's new on adriann.github.io" />
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<meta name="generator" content="pandoc" />
<meta name="author" content="Adrian Neumann ([email protected])" />
<meta name="date" content="2015-11-15" />
<title>Software Archeology</title>
<style>
.caption{font-size:66%;text-align:right;}
.figure{float:right;padding-bottom:1em;padding-left:1em;}
.figure>img{display:block;margin:0 auto;}
.footnotes{font-size:80%;}
.block{border-left:1ex solid gray;padding-left:2em;}
li{padding:0.25em;}
a:hover{text-shadow: 0 0 5px;}
body{font-family:sans-serif;max-width:100ex;padding-left:3em;padding-right:2em;}
code{font-family:Consolas, Inconsolata, Monaco, monospace;}
p{text-align:justify;}
</style>
</head>
<body>
<div id="header">
<h1 class="title">Software Archeology</h1>
<p style="text-align:right">2015-11-15</p>
</div>
<p>So, you work in a <em><a href="http://lambda-the-ultimate.org/node/4424">mature programming environment</a></em> and your boss gave you the job of figuring out how an important piece of code in your company’s infrastructure works. Unfortunately, the tool is about a million lines of poorly documented legacy code. What documentation you can find seems outdated. There are few tests.</p>
<p>In this post I will share some techniques that I found helpful in the daunting task of software archeology.</p>
<h2 id="where-to-start">Where to Start</h2>
<p>The first thing you should do is trying to build the software and check that it runs OK. This might be as easy as <code>make</code>, but depending on your luck you might have to hunt for the requisite dependencies in exactly the right versions and define magic environment variables to make everything work. If you have a working system somewhere, but can’t get it to work on your machine, it is worthwhile to examine the binary there. You might want to check <a href="http://superuser.com/questions/239590/find-libraries-a-binary-was-linked-against">which libraries it’s linked against</a>.</p>
<p>Once you have a working copy, you can start understanding the code. When faced with a huge occult code base, it’s often difficult to find a spot from where to start understanding it.</p>
<p>If you’re looking at a runnable program, starting from its entry point is a good way. Finding that entry point might be difficult if you have multiple <code>main</code> methods. Try examining the build scripts to see which of the alternatives is used. You could also try putting <code>assert(false)</code> in the code to see which <code>main</code> is actually called.</p>
<p>For a library, this task is more difficult. I suggest looking at users of that library to get an idea what parts are most important. You can also try finding which functions are exported, but probably that’s a huge list.</p>
<p>If your project is under version control, you can try finding files that are modified often. These tend to contain some crucial functionality.</p>
<h2 id="getting-around">Getting Around</h2>
<p>You found a starting point and begin your archeological journey into the great unknown. If you use a sane build system and not some bundle of hand written Perl scripts, you can try generating a project file for some full featured IDE. Modern IDEs are really good at jumping around the code and telling you from where a function is used.</p>
<p>If that’s not possible, or you don’t like IDEs very much, there are a number of other tools that can be helpful. You probably already know <code>grep</code>. With some knowledge of regular expressions you can quickly get an overview of your codebase, I recommend reading its manpage. There are faster alternatives to <code>grep</code> available, like <a href="https://github.com/ggreer/the_silver_searcher">The Silver Searcher</a>. Unfortunately, regular expressions only get you so far.</p>
<p>However, there are some neat tools that actually understand your programming language. I’ve so far only investigated C and C++, so my advice here is somewhat limited. There are tools like <code>ctags</code> that produce an index of function and variable declarations/definitions and integrate nicely in your text editor of choice. I also found the copy-paste-detector from <a href="http://pmd.github.io/">PMD</a> to be very useful.</p>
<p>For an overview of some of these tools (again with a strong C bias), see</p>
<ul>
<li><a href="http://www.drdobbs.com/navigating-linux-source-code/184401358">Navigating Linux Source Code</a></li>
<li><a href="http://www.lemis.com/grog/software/source-code-navigation.php">Greg’s source code navigation tools</a> (note the Unix Beard Of Authority on the author, his advice must be really good!)</li>
<li><a href="http://stackoverflow.com/questions/100143/what-tool-do-you-use-to-index-c-c-code">What tool do you use to index C/C++ code?</a></li>
</ul>
<h2 id="draw-maps">Draw Maps</h2>
<p>It’s important that you document your findings. Remember, it’s only science if you write things down.</p>
<p>I recommend adding comments to the code where necessary (but <a href="http://stackoverflow.com/questions/143429/whats-the-least-useful-comment-youve-ever-seen">don’t overdo it</a>) and writing structural insights down in a wiki. Try to create a consistent narrative for a particular use case from system startup to shutdown. This provides a skeleton that you can then flesh out as you investigate other parts of the code.</p>
<p>Sometimes, UML diagrams can be useful. I think simpler is better, so I would recommend a simple tool like <a href="http://argouml.tigris.org/">ArgoUML</a> or even <a href="http://alexdp.free.fr/violetumleditor/page.php">VioletUML</a> over Enterprise Architect or the like. Something like <a href="https://inkscape.org/">Inkscape</a> can also work. In my opinion, UML is best if used very informally. <a href="http://www.cs.bsu.edu/homepages/pvg/misc/uml/">All the UML you need to know</a> provides a succint overview of class diagrams. Some UML tools can automatically analyze your codebase and draw diagrams for you. I’ve yet to see a nontrivial code base where this is helpful, but I wouldn’t consider myself particularly experienced.</p>
<h2 id="write-tests">Write Tests</h2>
<p>Usually, legacy code is poorly tested. Writing tests serve two purposes. First it helps you understand what’s going on by giving you a possibility to test your assumptions. This alone is a good reason to write some tests as you dwelve into the code. Another important benefit is that tests allow you to refactor some code and being reasonably sure that you didn’t break other parts of your <a href="http://joeyoder.com/PDFs/mud.pdf">Big Ball Of Mud</a>. That is of course only useful if your pointy haired boss gives you time to do any refactoring. You company probably already has a test framework of choice. Otherwise googling will reveal a number of nice unit testing frameworks for your language. As always, avoid rolling your own if possible.</p>
<h2 id="related-links">Related Links</h2>
<p>As usual, I’m not the first person to write about this. Here are some other resources that I found interesting</p>
<ul>
<li><a href="http://queue.acm.org/detail.cfm?id=945136">Code Spelunking</a></li>
<li><a href="http://media.pragprog.com/articles/mar_02_archeology.pdf">Software Archeology</a></li>
<li><a href="http://herraiz.org/papers/english/icsm05short.pdf">An Empirical Approach to Software Archeology</a></li>
</ul>
<hr/>
<div style="display:inline-flex;flex-wrap:wrap;justify-content:space-between;font-size:80%">
<p style="margin-right:2ex">CC-BY-SA <a href="mailto:[email protected]">Adrian Neumann</a> (PGP Key <a href="https://adriann.github.io/ressources/pub.asc">A0A8BC98</a>)</p>
<p style="margin-left:1ex;margin-right:1ex"><a href="http://adriann.github.io">adriann.github.io</a></p>
<p style="margin-left:2ex"><a href="https://adriann.github.io/feed.rss">RSS</a></p>
</div>
</body>
</html>