alg1 is a software module for finding "frequent" subsets among a
collection of sets T1....Tn. A subset is called "frequent" if it is found
within at least C sets, where C is a supplied parameter, typically less
than 1% of the total number of sets. Frequent subsets usually indicate
an association between their members which goes beyond simple coincidence.
For instance, products which are frequently bought together, or events
which are causally related to each other, will show up in this type of
analysis.
Typical applications of associations mining are: "market baskets", studying shopping transactions to find products which customers buy together (eg cereal and milk); web log mining, to see which pages users visit during the same session, and just about any mass activity, where multiple collections of events or objects can be studied for hidden relationships.
For simplicity's sake, set members are represented as integers. Since integers occur rarely in nature, it is up to the caller to translate input data from more meaningful formats, and translate output back from integers to meaningful values.
The current, alpha version reads a binary file format generated by the IBM Almaden synthetic data generator, a utility which creates test data. The file is a set of consecutive records of the format [customer ID (ignored)] [transaction ID (ignored)] [ n ] [item1]....[item n], with each element a 4-byte int. alg1 accepts sets of up to 63 items in length, and the items do not need to be sorted.
Usage is: alg1 filename nrecs C
Filename is the input file described above, nrecs is the number of sets to read and mine, and C is the cutoff value for frequent subsets. Output is not handled much at all, except by printing out frequent subsets for diagnostic purposes.
The current build is just one medium-length C++ file with several object definitions and a main(), which can be gcc'ed straight to a binary. The next stage will be to structure the code into modules, put together a proper makefile, and decide on what to do for diagnostics instead of just dumping lots of statistics to stdio.
Down the line I hope to build some useful interfaces to, say, perl, so that this thing can be used in other applications.
Performance so far seems good, although possibly not as good as some published algorithms. For "typical" data generated by IBM's data generator, the algorithm is taking between .1 and .25 seconds per 1000 records on a Duron 650 with plenty of RAM, excluding some time spent reading in data and building a heap with it. There are still some optimizations possible.
alg1 uses the principle of ordered subset enumeration, in which
possible subsets are generated in a specific order, and grouped into consecutive
runs of C or more identical instances from different sets Ti. Each
set Ti generates a an ordered list of subsets, and these lists are merge-joined
together to find members which repeat C or more times.
Example: take three sets of integers (with their members in descending
numerical order): {4,3,2,1}, {4,2,1}, and {4,3,2}, and let
C=3. Generate the lists of subsets for each set, and sort them lexicographically,:
4321: | 421: | 432: |
1 | 1 | 2 |
2 | 2 | 3 |
21 | 21 | 32 |
3 | 4 | 4 |
31 | 41 | 42 |
32 | 42 | 43 |
321 | 421 | 432 |
4 | ||
41 | ||
42 | ||
421 | ||
43 | ||
431 | ||
432 | ||
4321 |
To find members common to all three columns, we start with the first row, and find the lexicographically minimum subset in any column (1). We then find all instances of this subset and count them before deleting them and moving their respective lists up a notch. We repeat, finding {2} occurs in all three lists, and so on until the lists are exhausted.
In practice, we don't need to generate each list in its entirety, since we only look at one value at a time during the counting, and it is easy to generate the next subset from the one just examined. If we create a bitmap to represent each subset as drawn from its parent set, the ordering is particularly apparent: 0...0001, 0...0010, 0...0011, ..., 1...1111 - simple binary counting. alg1 uses just such an integer bitmap to generate successive subsets.
alg1 uses a heap to do this merging, iteratively taking the minimum not-yet-visited subset from the heap, counting it, and constructing the next subset which is reinserted later in the ordering.
{1} {1} {2}.....
|
^
|
|
+-->{2}-----+
count/find successor/re-insert
Its primary optimization is to use lookahead to skip runs of infrequent sets. Using a FIFO of C elements, alg1 knows if a subset is frequent if it occurs in C consecutive positions. If the head and tail of the FIFO are identical, then we have found such a frequent subset. Otherwise, if the tail of the FIFO is X, say, and the head is Z (X < Z), we can skip any successors Y of X for which Y < Z. The reasoning is as follows: if we remove X from the list, generate successor Y, and re-insert it into the ordering, we still have C-1 elements preceding Z, so there is no way a run of Y's could exceed C-1 elements. Therefore, Y must be infrequent, so it can be skipped. alg1 uses a "leapfrog" routine which, given X and Z, finds a successor of X which is greater than or equal to Z.
FIFO: X X...Y Y Y...Z <---- Heap: Z Z ....
|
^
|
|
+------skip Y ---------------------+