Home
INTT is a library for tokenising identifier names. Written in Java, INTT attempts to solve a number of problems concerning the tokenisation of identifier names. Identifier names can be tokenised by relying on typographical boundaries, so, for example, anIdentifierName and an_identifier_name will both yield the list of tokens {an, identifier, name}. However, not all typographical boundaries are unambiguous, even though a human may find the names perfectly readable. All developers will have created identifier names with ambiguous token boundaries, quite probably without realising it. The problem matters because software that checks identifier name structure in an IDE, or splits names to support code search, concept location, or automated code reuse, depends on identifying the tokens present in a name; the better the tokenisation, the better the quality of the program output.
What are ambiguous token boundaries? An ambiguous token boundary occurs where the developer perceives a boundary, but software may struggle to identify it, chiefly because the boundary between tokens cannot be easily defined. The idea is perhaps best understood by first defining the conventional typographical boundaries. A separator character, such as an underscore, defines a boundary between two tokens very precisely, and may be identified algorithmically with ease, essentially by finding the separator and splitting the string on either side of it, e.g. an_identifier_name. The exception is, of course, where a developer has deliberately inserted separator characters to divide words, e.g. an_iden_ti_fier_na_me. A human can still read the name, but programmatically it is more difficult to identify the tokens (NB solutions that suggest themselves at this point may be useful later). The other conventional typographical boundary is the transition from lower to upper case, as in anIdentifierName, where, unless the developer has been devious, the typographical transition coincides with a token boundary. All other token boundaries are ambiguous, as is the absence of any token boundary.
The statement seems rather strong at first, but it is correct. Some names are written using upper-case abbreviations, for example HTMLEditorKit, and a simple rule might be that where there is a sequence of two or more upper-case characters, the tokenisation boundary occurs immediately following the penultimate upper-case character. However, counter-examples are relatively easy to find, e.g. NBinitialize, and there are many more documented. From this we can develop a heuristic that says something like: two or more upper-case characters followed by a lower-case character imply a token boundary; identifying the candidate tokens is then the next challenge. Digits are another complication. Books on naming conventions do not give (comprehensive) rules on how to use digits in identifier names, but people use them anyway. Characterising the acronyms and abbreviations MP3, 2D, J2SE and 2of5 quickly shows that common digit abbreviations have no useful typographical properties that can be exploited for tokenisation. Put slightly differently, the best available heuristic says something like: where there is a group of one or more digits in an identifier name, there may be a token boundary nearby.
So far, the typographical boundaries observed exist, or are implied, at a change of case or a change in the type of character. There is another set of problems with implications for those above: what if there are no typographical token boundaries at all? An identifier name like indexpointer, for example, contains two words that a human can read, but how can they be separated programmatically? One solution is a greedy algorithm that processes the string from left to right searching for the longest word. Such an algorithm would find index and then find another word, pointer, in the remainder of the string. However, not all identifier names are composed of whole words found in dictionaries, and it is possible to find examples with two or more possible tokenisations. One example, proposed by Dawn Lawrie and David Binkley, is thenewestone, which, while clearly not a real identifier name, illustrates some of the issues. A simple greedy left-to-right search for the longest word gives {then, ewes, tone}, three tokens that are all found in the dictionary but are nonsensical together. The solution Lawrie and Binkley suggest is to use word frequencies to determine the probability of each tokenisation, and they have also tried n-gram dictionaries. An additional complication is that some words in English are compounds of other words, e.g. output, and that many languages, such as Swedish and German, use compound nouns, sometimes created spontaneously.
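The greedy longest-match approach discussed above can be sketched as follows. This is a hypothetical illustration with an invented class name, not the algorithm INTT or Lawrie and Binkley actually use; the dictionary is supplied by the caller:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of greedy longest-match tokenisation: scan left to
// right, at each position taking the longest dictionary word that matches.
public class GreedySplitter {

    public static List<String> split(String name, Set<String> dictionary) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < name.length()) {
            int end = pos + 1;                 // fall back to a single character
            // Search for the longest dictionary word starting at pos.
            for (int i = name.length(); i > pos; i--) {
                if (dictionary.contains(name.substring(pos, i))) {
                    end = i;
                    break;
                }
            }
            tokens.add(name.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = Set.of("the", "then", "newest", "one",
                                  "ewes", "tone", "index", "pointer");
        System.out.println(split("indexpointer", dict)); // [index, pointer]
        System.out.println(split("thenewestone", dict)); // [then, ewes, tone]
    }
}
```

With a suitable dictionary it splits indexpointer correctly, but on thenewestone the greedy choice of "then" over "the" commits it to the nonsensical {then, ewes, tone}, reproducing the failure the text describes and motivating the probabilistic approaches mentioned above.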
Of course there is machine learning; surely that provides a solution? It may do, and plenty of people have tried. However, like word frequencies and n-gram dictionaries, machine learning requires storage and processing overhead. While that may be practical in a conventions-checking tool run at check-in time or on a build server, it is not, for the moment, a practical solution for in-IDE applications.