If you have search functionality based on PHP and MySQL but don't have the luxury of waiting for Boolean support with the release of MySQL 4.0, then this PHP hack may be of use to you.

The Problem

So you've finished two-thirds of an application when your project manager comes to you with a great idea (soon to be requirement), "I was thinking, it would be nice if we could do Boolean fulltext searches... what do you think?" Now, if the content data is stored in a MySQL database and accessed from a PHP framework, here are a few canned responses to choose from:

Unfortunately, most of us have settled on like variations of the second response; this is functionality that MySQL has promised with its next full version release. The fact remains, the more web savvy users have come to expect a few options when searching:

Support for such options are doubly important when we are dealing with large datasets like knowledge bases and news sites. Fortunately, these are two areas in which MySQL already excels, just not both at the same time. However, there is at least one PHP hack for those of us that can't wait for the 4.0 release.

A Solution

The Code

The functions (will open in a new window): funcs.mysql.boolean.php

An example implementation complete with sample data: example.mysql.boolean.php

Interactive Example

I've also setup a script that lets you test drive this functionality on a small sample database: form.mysql.boolean.php

Script Implementation

The provided functions are mainly used to generate SQL elements of the SELECT and WHERE clauses as follows. First, we must retrieve the FULLTEXT KEY column names from our table, $table_name, via our database connection, $db_connect:

   $fulltext_key = get_fulltext_key($table_name,$db_connect);

And then we build the SQL statement using $fulltext_key and the user input, $search_string:

   $sql = "SELECT id, author, content, \n"
     .boolean_sql_select(
         boolean_inclusive_atoms($search_string),
         $fulltext_key)." \n"
     ."as relevance \n"
     ."FROM $table_name \n"
     ."WHERE \n"
     .boolean_sql_where($search_string,$fulltext_key)." \n"
     ."HAVING relevance>0 \n"
     ."ORDER BY relevance DESC \n";

Supported Input

Fundamental Operators
The code supports five basic operators:
Shorthand Operators
The code also supports a shorthand syntax similar to ebay's boolean syntax; the two syntaxes can be intermixed at the users discretion:
Nesting, Grouping, Logical Precedence
The code is also capable of nested Boolean statements to any plausible depth (that which MySQL can handle) via the use of parentheses. These characters can also be used to force the order of evaluation of any given statement. Remember that the AND operator has precedence over the OR operator.

Allowed Characters
As it exists, the code allows alphanumeric characters as well as the special characters '.', '_', and '-', providing they are not used as the first character in a subject word. Regardless of the characters allowed in the code, their are certain limitations imposed by MySQL when it runs the fulltext searches.

Minimum Word Length
The actual minimum word length is defined when MySQL is compiled in the myisam/ftdefs.h file:

   #define MIN_WORD_LEN 4

If you need more functionality and don't want to go through the trouble of recompiling MySQL, no worry: the code is currently written to be adaptive to words of three or less characters. For a given word, the SQL element in the SELECT clause will switch to a simple scoring algorithm that does a case insensitive count of the subject word in all of the FULLTEXT KEY columns; the SQL element in the WHERE clause will switch to a LIKE string comparison of the subject word with all of the FULLTEXT KEY columns. Note that enabling this additional functionality may create some performance issues with larger datasets as it runs significantly slower than the fulltext system used for words of 4+ characters, perhaps because it does not use a compiled algorithm or the fulltext index.

Stop Words
Restrictions on stop words (common words) when using the fulltext search are still dictated by MySQL once compiled. However, if the adaptive functions are enabled then the generated SQL will return records with stop words of three or less characters. Again, keep in mind that this functionality does not use the fullext index and is typically quite slow.

Fundamental Constructs
Some basic input:
Advanced Constructs
Some more advanced input (based on the shorthand syntax):

The Concept

The WHERE Clause
Assuming we have a table with columns title,content indexed on the FULLTEXT KEY, the typical syntax for implementing a fulltext search on said table might be:

   MATCH (title,content) AGAINST ('george')

If the fulltext index contains the word george, the above will return a floating point number, typically between 0.0 (exclusive) and 5.0 for any given record. If the fulltext index does not contain the word george, the above will return 0.0 for any given record. Thus, we can make the above statement evaluate to TRUE or FALSE with the following syntax:

   MATCH (title,content) AGAINST ('george') > 0

In order to facilitate boolean capability for an expression like

   george AND dubya

the corresponding SQL code in the WHERE clause will be

   MATCH (title,content) AGAINST ('george') > 0
   AND
   MATCH (title,content) AGAINST ('dubya') > 0

The SELECT Clause
The SQL in the SELECT clause can be used to calculate the total relevance as declared by the relevance alias for the above example like so:

   MATCH (title,content) AGAINST ('george')
   +
   MATCH (title,content) AGAINST ('dubya')
   as relevance

but the code utilizes a simplification which produces a numerical equivalent:

   MATCH (title,content) AGAINST ('george dubya')
   as relevance

The HAVING Clause
We can also add a condition to the HAVING clause which will ensure that the query only returns records with a non-zero relevance:

   relevance > 0

While this may seem redundant considering the SQL in the WHERE clause which already deals which the inclusion and exclusion of records, this allows the developer to impose a minimum relevance restriction on the result set.

   relevance > 0.2

The ORDER BY Clause
This clause sorts the results by our calculated relevance starting with the records with the highest relevance:

   relevance DESC

Other Issues

Performance
I've implemented these functions in a knowledge base with 16,000+ records, and they have been running wonderfully since June 2001. The largest table on this system is about 7,000+ records containing 77.1 MB of data with an 8.5 MB fulltext index. Query times for a typical two to four atom statement on this table average about 0.25 seconds providing they use the fulltext index. However, if the SQL is forced to adapt to a three or less character word, query times jump to over 5.0 seconds in some cases. Query times on smaller tables of with 1 MB of data and a 300 KB fulltext index are almost negligible, averaging 0.005 seconds with the fulltext index and 0.02 seconds without.
Security
The string replacements in the functions are only utilized to properly manage whitespace in the input string, they do not check for malicious user input. Such checks should be made prior to passing the user input to these functions.

Keep in mind that some of the parameters in these functions are arbitrary or may be specific to the context in which they are used. However, I do encourage you to email me with any general comments or improvements.

-David Altherr altherda@email.uc.edu davidaltherr.net