SWISH++ is a Unix-based file indexing and searching engine (typically used to index and search files on web sites). It was based on SWISH-E although SWISH++ is a complete rewrite. SWISH++ was developed to circumvent my difficulties with using the SWISH-E package.
SWISH++ has been ported to compile and run under Microsoft Windows by Robert J. Lebowitz and Christoph Conrad.
The git repository contains only the necessary source code.
Things like configure are derived sources and should not be included in repositories.
If you have autoconf, automake, and m4 installed, you can generate configure yourself by doing:
autoreconf -fiv
Then follow the generic installation instructions given in INSTALL.
SWISH++ attains its speed chiefly in two ways: by using good algorithms and data structures, and by doing fast I/O.
- SWISH++ uses the C++ Standard Template Library's map class, which is typically implemented as either a red-black or an AVL tree, for which the worst-case running time of a lookup or insertion is O(lg n).
- SWISH++ uses the mmap(2) Unix system call to read files instead of using standard I/O. If you are unfamiliar with mmap, it "maps" a file into memory using the same virtual memory management mechanism the operating system itself uses. When the first character of a file is read, a page fault occurs and the operating system maps a page of the file into memory. It is impossible to get faster access. Additionally, because the file is in memory, the characters in it are accessed via pointers using simple pointer arithmetic rather than through library function calls and input buffers.
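As a rough illustration of the memory-mapped approach (this is not SWISH++'s actual C++ code), the following Python sketch maps a file read-only and scans its bytes in place. The function name count_words_mmap and the definition of a word as a run of ASCII letters are assumptions made for the example.

```python
import mmap
import re

# Assumed word definition, for illustration only.
WORD = re.compile(rb"[A-Za-z]+")

def count_words_mmap(path):
    """Count word-like runs in a non-empty file by scanning a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are brought in on first access.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
            # The regex engine reads the mapped bytes directly; no read() calls.
            return sum(1 for _ in WORD.finditer(m))
```

Note that mapping an empty file raises an error in Python, so a real caller would need to check the file size first.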
Other factors contributing to SWISH++'s speed are that it does very little explicit dynamic memory allocation, uses function inlining, and makes very few function calls in inner loops.
For HTML or XHTML files, SWISH++ indexes words in META element CONTENT attributes and associates them with the NAME attributes.
Meta names can later be queried against specifically, e.g.:
search author = hawking
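The association between META names and indexed words can be pictured with a small Python sketch. It uses the standard html.parser module rather than SWISH++'s own HTML parser, and the names MetaIndexer and index_meta are hypothetical, chosen only for this example.

```python
from collections import defaultdict
from html.parser import HTMLParser

class MetaIndexer(HTMLParser):
    """Collect words from META CONTENT attributes, keyed by the NAME attribute."""

    def __init__(self):
        super().__init__()
        self.index = defaultdict(set)   # meta name -> set of words

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        a = dict(attrs)                 # HTMLParser lowercases tag and attribute names
        name, content = a.get("name"), a.get("content")
        if name and content:
            self.index[name.lower()].update(content.lower().split())

def index_meta(html):
    """Return a dict mapping each meta name to the words found in its content."""
    parser = MetaIndexer()
    parser.feed(html)
    parser.close()
    return dict(parser.index)
```

With such a mapping, a query like author = hawking only has to look up the word under the "author" key.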
SWISH++ also indexes the words in ALT attributes (for the AREA, IMG, and INPUT elements), STANDBY attributes (for the OBJECT element), SUMMARY attributes (for the TABLE element), and TITLE attributes (for any HTML or XHTML element).
Text within HTML or XHTML elements belonging to specified classes can be excluded from indexing. This is most useful for skipping text in common page headers, footers, and pop-up menus.
SWISH++ indexes words in headers and associates them with the name of the headers as meta names that can later be queried against specifically, e.g.:
search subject = big-bang
Similarly, words in vCard fields are associated with the names of the fields as meta names that can also later be queried against, e.g.:
search title = professor
search org = SLAC
Additionally, plain text, enriched text, and HTML in any one of the ASCII, ISO-8859-1, UTF-7, or UTF-8 character sets, and in any one of the 7-bit, 8-bit, quoted-printable, or Base64 encodings, are decoded and converted on the fly, so encoded bodies and attachments are indexed properly.
Lastly, attachments having other MIME types can be filtered on the fly before being indexed, e.g., to convert Microsoft Word or PDF attachments to plain text.
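To make the decoding step concrete, here is a minimal Python sketch that uses the standard email package to undo the transfer encoding and character set of each text part before its words would be indexed. It illustrates the idea only and is not how SWISH++ implements it; the function name decoded_text_parts is hypothetical.

```python
from email import message_from_bytes

def decoded_text_parts(raw_message):
    """Yield the decoded text of every text/* part of a MIME message."""
    msg = message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue                                  # skip containers and non-text parts
        payload = part.get_payload(decode=True)       # undoes quoted-printable or Base64
        charset = part.get_content_charset() or "ascii"
        yield payload.decode(charset, errors="replace")
```

Parts of other MIME types are skipped here; in the scheme described above they would instead be passed through a filter first.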
SWISH++ indexes words in sections and associates them with the name of the section as meta names that can later be queried against specifically, e.g.:
search description = environment
search author = lucas
SWISH++ can therefore be used as a much better apropos(1) command replacement.
SWISH++ can ignore LaTeX and RTF markup.
Additionally, for LaTeX documents, SWISH++ sets the document title to the content of the \title{...} command.
SWISH++ indexes words in ID3 tags of MP3 files and associates them with the names of the fields as meta names that can later be queried against specifically, e.g.:
search artist = roxette
search title = dangerous
All ID3 tag versions through 2.4 are supported. Additionally, text fields in any one of ASCII, ISO-8859-1, UTF-8, or UTF-16 character sets are supported.
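For a feel of what reading tag fields involves, the sketch below handles only the simplest case, an ID3v1 tag, which occupies the last 128 bytes of a file and begins with the marker "TAG". The 2.x versions use a different, frame-based layout at the start of the file and are not covered. The function name read_id3v1 and the choice of ISO-8859-1 for decoding are assumptions for this example.

```python
import struct

# ID3v1 layout: "TAG", title, artist, album, year, comment, genre (128 bytes total).
ID3V1 = struct.Struct("<3s30s30s30s4s30sB")

def _text(field):
    """Cut a fixed-width field at the first NUL and decode it."""
    return field.split(b"\0", 1)[0].decode("latin-1").strip()

def read_id3v1(data):
    """Return a dict of ID3v1 text fields from file contents, or None if no tag."""
    if len(data) < ID3V1.size:
        return None
    tag, title, artist, album, year, _comment, _genre = ID3V1.unpack(data[-ID3V1.size:])
    if tag != b"TAG":
        return None
    return {"title": _text(title), "artist": _text(artist),
            "album": _text(album), "year": _text(year)}
```

Each returned key plays the role of a meta name, so the words in the "artist" value would be the ones matched by a query such as artist = roxette.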
A separate text-extraction utility, extract, is included to assist in indexing non-text files.
It is essentially a more sophisticated version of the Unix strings(1) command, but employs the same word-determination heuristics used for indexing.
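The basic strings(1)-style idea can be sketched in a few lines of Python. This is not the extract utility's algorithm: its word-determination heuristics are not described here, so the sketch simply assumes that a word is a run of at least four ASCII letters. The function name extract_words and the min_len parameter are hypothetical.

```python
import re

def extract_words(data, min_len=4):
    """Return runs of at least min_len ASCII letters found in binary data."""
    pattern = re.compile(rb"[A-Za-z]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]
```

Short runs are dropped because in binary data they are far more likely to be noise than real words.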
Based on filename patterns, files can be filtered before being indexed, e.g., compressed files can be uncompressed and PDF files converted to plain text.
New indexing modules can be written to index other file formats directly (without filters).
New files can be indexed and added to an existing index incrementally.
A separate utility, httpindex, is included that interfaces SWISH++ to the wget(1) command, enabling remote web sites to be indexed.
This is useful for searching all the servers in your local area network simultaneously.
SWISH++ automatically splits and merges partial indices for large collections of files as it goes thereby not bringing your machine to its knees by exhausting physical memory and causing it to swap like mad.
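The merge step can be pictured as a merge of sorted runs. The Python sketch below assumes each partial index is a sequence of (word, list of file ids) pairs already sorted by word; that representation and the name merge_partial_indices are assumptions for illustration, not SWISH++'s on-disk format.

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_partial_indices(*partials):
    """Merge sorted (word, file_ids) sequences into one sorted sequence."""
    # heapq.merge streams the inputs, holding only one pending entry per partial index.
    merged = heapq.merge(*partials, key=itemgetter(0))
    for word, group in groupby(merged, key=itemgetter(0)):
        # Combine the postings of a word that occurs in several partial indices.
        files = sorted({f for _, ids in group for f in ids})
        yield word, files
```

Because the inputs are consumed lazily, memory use depends on the number of partial indices rather than on their total size.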
The same mmap(2) technique used for indexing is used again for searching. The generated index file is written to disk such that it can be mmap'ed back into memory and binary searched immediately, with no parsing of the data, also in O(lg n) time.
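A minimal Python sketch of the same idea follows: an index written as sorted fixed-size records can be mapped and binary searched in place, without being parsed into another structure first. The 20-byte record layout (a 16-byte NUL-padded ASCII word plus a 32-bit value) and the names write_index and lookup are hypothetical; the real index file format is different.

```python
import mmap
import struct

# Hypothetical record: 16-byte NUL-padded word, then a 32-bit unsigned value.
RECORD = struct.Struct("<16sI")

def write_index(path, entries):
    """Write (word, value) pairs, sorted by word, as fixed-size records."""
    with open(path, "wb") as f:
        for word, value in sorted(entries.items()):
            f.write(RECORD.pack(word.encode("ascii"), value))

def lookup(path, word):
    """Binary search the memory-mapped index; return the value or None."""
    key = word.encode("ascii").ljust(16, b"\0")
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        lo, hi = 0, len(m) // RECORD.size
        while lo < hi:
            mid = (lo + hi) // 2
            rec_word, value = RECORD.unpack_from(m, mid * RECORD.size)
            if rec_word == key:
                return value
            if rec_word < key:
                lo = mid + 1
            else:
                hi = mid
    return None
```

Only the pages holding the records actually probed are touched, which is what keeps the start-up cost of a search low.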
SWISH++ allows stemming to be performed at the time of searches, not at the time of index generation. This allows users to decide whether to perform stemming or not.
SWISH++'s search engine can run in the background as a multi-threaded daemon process to function as a search server accepting query requests and returning results via a Unix domain or TCP socket or both. For search-intensive applications, such as a search engine on a heavily used web site, this can yield a large performance improvement since the start-up cost (fork(2), exec(2), and initialization) is paid only once.
SWISH++ outputs its search results in the form:
rank path_name file_size file_title
By placing the file_title, which may contain spaces, last, you can easily parse it, e.g.:
($rank,$path,$size,$title) = split( / /, $result, 4 ); # Perl
($rank,$path,$size,$title) = result.split( ' ', 3 ) # Python
Alternatively, SWISH++ can output search results in XML for increased interoperability with other XML applications.
The source code is clearly written with lots of comments, including references to other works, in case you want to modify it under the terms of the GNU General Public License.
The following is a list of the features SWISH++ does not have that SWISH-E does. I wrote SWISH++ to solve my immediate indexing problems; therefore, I implemented only those features useful to me. If others can also benefit from the work, great. I may implement other features as time permits.
SWISH++ has no equivalent means for searching within specific HTML tags (the SWISH-E -t option).
I didn't have a need for this feature.
This functionality can be achieved by using the extract_meta() function in the included WWW Perl module.
SWISH++ will not crash while indexing any file. Period. If it does, there's a bug and I'll fix it.
Copyright (C) 1998-2015 by Paul J. Lucas
SWISH++ is available under the GNU General Public License.