Spiral is a Python module that provides several different functions for splitting identifiers found in source code files. The name Spiral is a loose acronym based on "SPlitters for IdentifieRs: A Library".
Authors: Michael Hucka
Repository: https://github.com/casics/spiral
License: Unless otherwise noted, this content is licensed under the GPLv3 license.
April 2018: Version 1.1.0 fixes a bug that prevented importing Spiral, and another bug that cause setup.py
to fail to install dependencies automatically. Additional enhancements include improved command-line help and internal code refactoring.
- Introduction
- Please cite the paper
- Installation instructions
- Basic operation
- Performance of Ronin
- Other splitters in Spiral
- Limitations
- More information
- Getting help and support
- Contributing — info for developers
- Acknowledgments
Spiral is a Python 3 package that implements numerous identifier splitting algorithms. Identifier splitting (also known as identifier name tokenization) is the task of breaking apart program identifier strings such as getInt
or readUTF8stream
into component tokens: [get
, int
] and [read
, utf8
, stream
]. The need for splitting identifiers arises in a variety of contexts, including natural language processing (NLP) methods applied to source code analysis and program comprehension. Spiral provides some basic naive splitting algorithms, such as a straightforward camel-case splitter, as well as more elaborate heuristic splitters, such as a new algorithm called Ronin.
Article citations are critical for academic developers. If you use Spiral and you publish papers about your software, we ask that you please cite the Spiral paper:
- Hucka, M. (2018). Spiral: splitters for identifiers in source code files. Journal of Open Source Software, 3(24), 653, https://doi.org/10.21105/joss.00653
Please also indicate the specific version of Spiral you use, to improve other people's ability to reproduce your results. You can use the Zenodo DOIs we provide for this purpose:
- Spiral release 1.0.1 ⇒ 10.5281/zenodo.1211835
For basic usage, the following is probably the simplest and most direct way to install Spiral on your computer:
sudo pip3 install git+https://github.com/casics/spiral.git
Alternatively, you can clone this GitHub repository and then run setup.py
:
git clone https://github.com/casics/spiral.git
cd spiral
sudo python3 -m pip install .
The above should be all you need to run Spiral out of the box. If you plan on experimenting with alternative parameter values or alternative dictionaries, you will additionally need to install the following:
-
Platypus-Opt, an optimization library (if you want to optimize parameters). Important: This Platypus is not the same as the other package called "Platypus" in PyPI. Make sure to get Platypus-Opt or install from the Platypus repo.
-
Data modules from NLTK, particularly
nltk_words
andntlk_wordnet
from thenltk.corpus
module and thenltk.stem
module. These cannot be installed automatically bysetup.py
—please refer to the NLTK installation instructions for instructions on how to do it.
Spiral is extremely easy to use. To use a Spiral splitter in a Python program, simply import a splitter and then call it.
from spiral.simple_splitters import pure_camelcase_split
print(pure_camelcase_split('TestString'))
Some splitters take optional parameters, and the more complex splitters have an init()
function that you can optionally call to set additional parameters or load data. Currently, only Ronin and Samurai have initialization functions, and calling init()
is optional—if you do not call it, Ronin and Samurai will call their init()
functions automatically.
Here are some examples of using the Ronin splitter algorithm. The following input,
from spiral import ronin
for s in [ 'mStartCData', 'nonnegativedecimaltype', 'getUtf8Octets', 'GPSmodule', 'savefileas', 'nbrOfbugs']:
print(ronin.split(s))
produces the following output:
['m', 'Start', 'C', 'Data']
['nonnegative', 'decimal', 'type']
['get', 'Utf8', 'Octets']
['GPS', 'module']
['save', 'file', 'as']
['nbr', 'Of', 'bugs']
Spiral also includes a command-line program named spiral
in the bin subdirectory; it will split strings provided on the command line or in a file. (Note: Ronin and Samurai load internal data files when they start up. In normal use, called from an application program, the resulting startup delay will only happen once. In the command-line program, the data is reloaded at every invocation, causing a startup delay every time. The delay is not typical for normal Spiral usage.)
By the way, if your goal is to split identifiers obtained during source code mining and you need to filter out gibberish strings of characters before attempting to split them, you may find Nostril (the Nonsense String Evaluator) useful.
Splitting identifiers is deceptively difficult. It is a research-grade problem for which no perfect solution exists. Even in cases where the input consists of identifiers that strictly follow conventions such as camel case, ambiguities can arise. For example, correctly splitting J2SEProjectTypeProfiler
into J2SE
, Project
, Type
, Profiler
requires the reader to recognize J2SE
as a unit. The task of splitting identifiers is made more difficult when there are no case transitions or other obvious boundaries in an identifier. And then there is the small problem that humans are often simply inconsistent!
Ronin is an advanced splitter that uses a variety of heuristic rules, English dictionaries, and tables of token frequencies obtained from mining source code repositories. Ronin includes a default table of term frequencies derived from an analysis of over 46,000 randomly selected software projects in GitHub that contained at least one Python source code file. The tokens were extracted using software from Spiral's parent project, CASICS (specifically, the extractor package), and the frequency table constructed using a procedure encoded in the small program create_frequency_file
included with Spiral. Ronin also has a number of parameters that need to be tuned; this can be done using the optimization program optimize-ronin
in the dev/optimization subdirectory. The default parameter values were derived by optimizing performance against two data sets available from other research groups:
- The Loyola University of Delaware Identifier Splitting Oracle (Ludiso)
- The INTT data set, extracted from the zip archive of INTT
Spiral includes copies of these data sets in the tests/data subdirectory. The parameters derived primarily by running against the INTT database of 18,772 identifiers and their splits. The following table summarizes the results:
Data set | Number of splits matched by Ronin | Total in data set | Accuracy |
---|---|---|---|
INTT | 17,287 | 18,772 | 92.09% |
Ludiso | 2,248 | 2,663 | 84.42% |
Many of the "failures" against these sets of identifiers are actually not failures, but cases where Ronin gets a more correct answer or where there is a legitimate difference in interpretation. Here are some examples:
Identifier | Ludiso result | Ronin split |
---|---|---|
a.ecart |
a ecart |
a e cart |
ConvertToAUTF8String |
Convert To AUTF 8 String |
Convert To A UTF8 String |
MAPI_BCC |
MAPI BCC |
M API BCC |
GetWSIsEnabled |
Get WSIs Enabled |
Get WS Is Enabled |
fread |
fread |
f read |
FF_LOSS_COLORSPACE |
FF LOSS COLORSPACE |
FF LOSS COLOR SPACE |
m_bFilenameMode |
m b File name Mode |
m b Filename Mode |
Names like fread
could be considered appropriate to leave alone, but the f
in fread
actually stands for "file", and thus IMHO it makes sense to split it—splitting identifiers into meaningful tokens is the purpose of Ronin, after all. Conversely, tokens such as Utf8
should be left as units because they are meaningful. Differences involving some other terms such as color space
are due to Ronin splitting terms that are typically considered separate words but are treated as one word in Ludiso, or vice versa. (The main entry in Wikipedia for "color space" is two words, while for "filename" it is one word.) This sometimes also arises because Ronin recognizes prefixes and sometimes does not split words because they would not be in normal written English, such as nonblocking
instead of non
blocking
. Still other differences between Ronin's output and the expected splits given in Ludiso and INTT are due to inconsistencies in the Ludiso and INTT data sets. For example, sometimes a token within a larger identifier is split in one case but not in another. Finally, some other differences are cases where the Ludiso split seems to favor program-specific names. For example, QWheelEvent
is split as [ QWheel
, Event
] whereas Ronin will split it as [ Q
, Wheel
, Event
].
So, Ronin's performance may be better than the pure numbers imply. However, without checking every Ludiso case and manually reinterpreting the splits, we can't say for sure. All that aside, Ronin definitely gets many cases wrong.
Here is a list of all the splitters implemented in Spiral at this time:
Splitter name | Operation |
---|---|
delimiter_split |
split only at characters $ ~ _ . : / @ |
digit_split |
split only at digits |
pure_camelcase_split |
split at forward camel case transitions (lower to upper case) |
safe_simple_split |
split at hard delimiter characters and forward camel case only; won't split strings that don't follow strict camel case |
simple_split |
split at hard delimiter characters and forward camel case, even if a string doesn't follow strict camel case conventions |
elementary_split |
split by hard delimiters, forward camel case, and digits |
heuristic_split |
split by hard delimiters, forward camel case, and digits, but recognize special cases such as utf8 , sha256 , etc. |
Samurai | frequency-based algorithm published in the literature |
Ronin | frequency-based algorithm based on Samurai but greatly extended (see previous section) |
The following table illustrates the differences between the simpler splitters.
Input string | pure camel | safe simple | simple | elementary | heuristic |
---|---|---|---|---|---|
alllower |
alllower |
alllower |
alllower |
alllower |
alllower |
ALLUPPER |
ALLUPPER |
ALLUPPER |
ALLUPPER |
ALLUPPER |
ALLUPPER |
a_delimiter |
a_delimiter |
a delimiter |
a delimiter |
a delimiter |
a delimiter |
a.delimiter |
a.delimiter |
a delimiter |
a delimiter |
a delimiter |
a delimiter |
a$delimiter |
a$delimiter |
a delimiter |
a delimiter |
a delimiter |
a delimiter |
a:delimiter |
a:delimiter |
a delimiter |
a delimiter |
a delimiter |
a delimiter |
a_fooBar |
a_foo Bar |
a foo Bar |
a foo Bar |
a foo Bar |
a foo Bar |
fooBar |
foo Bar |
foo Bar |
foo Bar |
foo Bar |
foo Bar |
FooBar |
Foo Bar |
Foo Bar |
Foo Bar |
Foo Bar |
Foo Bar |
Foobar |
Foobar |
Foobar |
Foobar |
Foobar |
Foobar |
fooBAR |
foo BAR |
fooBAR |
foo BAR |
foo BAR |
foo BAR |
fooBARbif |
foo BARbif |
fooBARbif |
foo BARbif |
foo BARbif |
foo BARbif |
fooBARzBif |
foo BARz Bif |
fooBARzBif |
foo BARz Bif |
foo BARz Bif |
foo BARz Bif |
ABCfoo |
ABCfoo |
ABCfoo |
ABCfoo |
ABCfoo |
ABCfoo |
ABCFoo |
ABCFoo |
ABCFoo |
ABCFoo |
ABCFoo |
ABCFoo |
ABCFooBar |
ABCFoo Bar |
ABCFooBar |
ABCFoo Bar |
ABCFoo Bar |
ABCFoo Bar |
ABCfooBar |
ABCfoo Bar |
ABCfooBar |
ABCfoo Bar |
ABCfoo Bar |
ABCfoo Bar |
fooBar2day |
foo Bar2day |
foo Bar2day |
foo Bar2day |
foo Bar 2 day |
foo Bar 2 day |
fooBar2Day |
foo Bar2Day |
foo Bar2Day |
foo Bar2Day |
foo Bar 2 Day |
foo Bar 2 Day |
fooBAR2day |
foo BAR2day |
fooBAR2day |
foo BAR2day |
foo BAR 2 day |
foo BAR 2 day |
fooBAR2Day |
foo BAR2Day |
fooBAR2Day |
foo BAR2Day |
foo BAR 2 Day |
foo BAR 2 Day |
foo3000 |
foo3000 |
foo3000 |
foo3000 |
foo 3000 |
foo 3000 |
99foo3000 |
99foo3000 |
99foo3000 |
99foo3000 |
99 foo 3000 |
99 foo 3000 |
foo2Bar |
foo2Bar |
foo2Bar |
foo2Bar |
foo 2 Bar |
foo 2 Bar |
foo2bar2 |
foo2bar2 |
foo2bar2 |
foo2bar2 |
foo 2 bar 2 |
foo 2 bar 2 |
Foo2Bar2 |
Foo2Bar2 |
Foo2Bar2 |
Foo2Bar2 |
Foo 2 Bar 2 |
Foo 2 Bar 2 |
MyInt32 |
My Int32 |
My Int32 |
My Int32 |
My Int 32 |
My Int32 |
MyInt42 |
My Int42 |
My Int42 |
My Int42 |
My Int 42 |
My Int 42 |
MyInt32Var |
My Int32Var |
My Int32Var |
My Int32Var |
My Int 32 Var |
My Int32 Var |
2ndvar |
2ndvar |
2ndvar |
2ndvar |
2 ndvar |
2nd var |
the2ndvar |
the2ndvar |
the2ndvar |
the2ndvar |
the 2 ndvar |
the 2nd var |
the2ndVar |
the2nd Var |
the2nd Var |
the2nd Var |
the 2 nd Var |
the 2nd Var |
row10 |
row10 |
row10 |
row10 |
row 10 |
row 10 |
utf8 |
utf8 |
utf8 |
utf8 |
utf 8 |
utf8 |
aUTF8var |
a UTF8var |
aUTF8var |
a UTF8var |
a UTF 8 var |
a UTF8 var |
J2SE4me |
J2SE4me |
J2SE4me |
J2SE4me |
J 2 SE 4 me |
J2SE 4 me |
IPv4addr |
IPv4addr |
IPv4addr |
IPv4addr |
IPv 4 addr |
IPv4 addr |
Ronin and Samurai are more advanced than any of the simple splitters above. Ronin in particular does everything heuristic_splitter
does, but handles far more difficult cases.
Ronin is the most advanced splitter in the Spiral package, but it is not perfect by any means. Certain limitations are known:
-
Ronin is tuned for splitting program identifiers, which is not the same as splitting strings of concatenated words. With its default parameter values, it is not good at splitting strings like
driveourtrucks
which are not composed of tokens commonly encountered in source code contexts. (By the way, reducing the value of parameterlow_freq_cutoff
inronin.init()
to something like10
makes Ronin more likely to split concatenated word strings likedriveourtrucks
orsocietynamebank
. However, this comes at the expense of worsening scores on the INTT and Ludiso data sets.) -
Ronin was trained on source code repositories containing Python code. The default frequency table may not be optimal for splitting things that come from non-Python source files, although the test results on Ludiso and INTT show that it is pretty good on non-Python content.
-
Spiral includes a set of predefined special terms in the file
constants.py
so thatheuristic_split
and the Ronin splitter can recognize certain character sequences that should be treated as units. Examples includeutf8
,ipv4
,checkbox
,J2SE
and others. The use of specialized dictionaries like this is an approach taken in other splitters (including Simon Butler's INTT) and is the only known way to prevent incorrect splits of special terms, but it is also an obvious limitation: the list is surely incomplete (and can never be complete because new terms are invented all the time), and may even cause incorrect splits in some contexts. -
The current algorithm implemented in Ronin arose organically after a lot of trial-and-error experimentation, and as a consequence, is rather gnarly and difficult to explain. The implementation could be made more efficient and the algorithm clarified by a nice about of refactoring.
The name "Ronin" is a play on the use of the name "Samurai" by Enslen et al. (2009) for their identifier splitting algorithm. The core loop of Ronin is based on Samurai, but it would be inappropriate to call this implementation Samurai too. In an effort to imply the lineage of this modified algorithm, I chose "Ronin" (a name referring to a drifter samurai without a master, during the Japanese feudal period).
I implemented Samurai based on the description of the algorithm published in the following paper, then modified it repeatedly in an attempt to improve performance. Ronin is the result. A goal for Ronin was to produce a splitter that had good performance even without using a local frequency table.
Enslen, E., Hill, E., Pollock, L., & Vijay-Shanker, K. (2009). Mining source code to automatically split identifiers for software analysis. In Proceedings of the 6th IEEE International Working Conference on Mining Software Repositories (MSR'09)
Abstract: Automated software engineering tools (e.g., program search, concern location, code reuse, quality assessment, etc.) increasingly rely on natural language information from comments and identifiers in code. The first step in analyzing words from identifiers requires splitting identifiers into their constituent words. Unlike natural languages, where space and punctuation are used to delineate words, identifiers cannot contain spaces. One common way to split identifiers is to follow programming language naming conventions. For example, Java programmers often use camel case, where words are delineated by uppercase letters or non-alphabetic characters. However, programmers also create identifiers by concatenating sequences of words together with no discernible delineation, which poses challenges to automatic identifier splitting. In this paper, we present an algorithm to automatically split identifiers into sequences of words by mining word frequencies in source code. With these word frequencies, our identifier splitter uses a scoring technique to automatically select the most appropriate partitioning for an identifier. In an evaluation of over 8000 identifiers from open source Java programs, our Samurai approach outperforms the existing state of the art techniques.
The Ludiso data set is described in the following paper:
Hill, E., Binkley, D., Lawrie, D., Pollock, L., & Vijay-Shanker, K. (2014). An empirical study of identifier splitting techniques. Empirical Software Engineering, 19(6), 1754–1780.
Abstract: Researchers have shown that program analyses that drive software development and maintenance tools supporting search, traceability and other tasks can benefit from leveraging the natural language information found in identifiers and comments. Accurate natural language information depends on correctly splitting the identifiers into their component words and abbreviations. While conventions such as camel-casing can ease this task, conventions are not well-defined in certain situations and may be modified to improve readability, thus making automatic splitting more challenging. This paper describes an empirical study of state-of-the-art identifier splitting techniques and the construction of a publicly available oracle to evaluate identifier splitting algorithms. In addition to comparing current approaches, the results help to guide future development and evaluation of improved identifier splitting approaches.
The INTT splitter and data set are described in the following paper:
Butler, S., Wermelinger, M., Yu, Y., & Sharp, H. (2011). Improving the Tokenisation of Identifier Names. In ECOOP 2011 – Object-Oriented Programming (pp. 130–154). Springer, Berlin, Heidelberg.
Abstract: Identifier names are the main vehicle for semantic information during program comprehension. Identifier names are tokenised into their semantic constituents by tools supporting program comprehension tasks, including concept location and requirements traceability. We present an approach to the automated tokenisation of identifier names that improves on existing techniques in two ways. First, it improves tokenisation accuracy for identifier names of a single case and those containing digits. Second, performance gains over existing techniques are achieved using smaller oracles. Accuracy was evaluated by comparing the output of our algorithm to manual tokenisations of 28,000 identifier names drawn from 60 open source Java projects totalling 16.5 MSLOC. We also undertook a study of the typographical features of identifier names (single case, use of digits, etc.) per object-oriented construct (class names, method names, etc.), thus providing an insight into naming conventions in industrial-scale object-oriented code. Our tokenisation tool and datasets are publicly available.
If you find an issue, please submit it in the GitHub issue tracker for this repository.
A lot remains to be done on CASICS in many areas. We would be happy to receive your help and participation if you are interested. Please feel free to contact the developers either via GitHub or the mailing list [email protected].
Everyone is asked to read and respect the code of conduct when participating in this project.
This material is based upon work supported by the National Science Foundation under Grant Number 1533792 (Principal Investigator: Michael Hucka). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
The installation of Spiral includes a dictionary containing words from the NTLK project, specifically the words
and wordnet
dictionaries. The word
dictionary is public domain, but the wordnet
dictionary is copyright 2006 by Princeton University and made available under license terms that permit free redistribution. For more information, please see Princeton University "About WordNet" (Princeton University, 2010).
The vector artwork of the segmented spiral illusion at the top of this page was obtained from www.123freevectors.com.