generally accounts for a higher percentage of the project time than the high-tech implementation of sophisticated mathematics. (This is, of course, not to imply that the mathematics is not important; indeed, it is often crucial. But much preparatory work usually needs to be done before the mathematics can be applied.)
The case McCue describes involves the establishment of a fraudulent telephone account that was used to conduct a series of international telephone conferences. The police investigation began when a telephone conference call service company sent them a thirty-seven-page conference call invoice that had gone unpaid. Many of the international conference calls listed on the invoice lasted for three hours or more. The conference call company had discovered that the information used to open the account was fraudulent. Their investigation led them to suspect that the conference calls had been used in the course of a criminal enterprise, but they had nothing concrete to go on to identify the perpetrators. McCue and her colleagues set to work to see if a data-mining analysis of the conference calls could provide clues to their identities.
The first step in the analysis was to obtain an electronic copy of the telephone bill in an easily processed text format. With telephone records, this is fairly easy to do these days, but as data-mining experts the world over will attest, in many other kinds of cases a great deal of time and effort has to be expended at the outset in re-keying data as well as double-checking the keyed data against the hard-copy original.
The next stage was to remove from the invoice document all of the information not directly pertinent to the analysis, such as headers, information about payment procedures, and so forth. The resulting document included the conference call ID that the conference service issued for each call, the telephone numbers of the participants, and the dates and durations of the calls. Fewer than 5 percent of entries had a customer name, and although the analysts assumed those were fraudulent, they nevertheless kept them in case they turned out to be useful for additional linking.
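The stripping step can be sketched in a few lines of Python. This is only an illustration: the line layout (conference ID, phone number, date, duration in minutes), the sample text, and the names CALL_LINE and extract_call_lines are all assumptions made for the example, since the actual invoice format is not described here.

```python
import re

# Hypothetical line layout (an assumption, not the real invoice format):
# conference ID, participant phone number, ISO date, duration in minutes.
CALL_LINE = re.compile(
    r"^(?P<conf_id>\d+)\s+"
    r"(?P<phone>\d{3}-\d{3}-\d{4})\s+"
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+"
    r"(?P<minutes>\d+)$"
)


def extract_call_lines(raw_text):
    """Keep only lines that look like call records; drop headers and notices."""
    records = []
    for line in raw_text.splitlines():
        match = CALL_LINE.match(line.strip())
        if match:
            records.append(match.groupdict())
    return records


# Made-up sample text, for illustration only.
sample = """\
CONFERENCE SERVICES INVOICE      Page 1
Please remit payment within 30 days
1001  804-555-0101  2003-03-04  187
1001  212-555-0199  2003-03-04  1
Thank you for your business
"""

calls = extract_call_lines(sample)
for call in calls:
    print(call)
```

Anything that does not match the pattern, such as a page header or a payment notice, simply falls away. In practice the pattern would have to be tuned to the real document, and the rejected lines would be worth a glance in case a genuine record had been dropped.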
The document was then formatted into a structured form amenable to statistical analysis. In particular, the area codes were separated from the other information, since they enabled linking based on area locations, and likewise the first three digits of the actual phone number were coded separately, since they too link to more specific location information. Dates were enhanced by adding in the days of the week, in case a pattern emerged.
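A minimal Python sketch of this structuring step follows. It assumes ten-digit North American numbers and ISO-formatted dates, and the function and field names (structure_record, area_code, exchange, and so on) are invented for the illustration rather than taken from the analysts' own tools.

```python
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def structure_record(conf_id, phone, iso_date, minutes):
    """Split one call entry into separately coded fields.

    Assumes a ten-digit North American number and a YYYY-MM-DD date;
    both are assumptions made for this example.
    """
    digits = "".join(ch for ch in phone if ch.isdigit())
    if len(digits) != 10:
        raise ValueError(f"expected a ten-digit number, got {phone!r}")
    call_date = date.fromisoformat(iso_date)
    return {
        "conf_id": conf_id,
        "area_code": digits[:3],     # links to a broad area
        "exchange": digits[3:6],     # first three digits of the local number
        "line": digits[6:],
        "date": call_date,
        "day_of_week": DAY_NAMES[call_date.weekday()],  # Monday is 0
        "minutes": int(minutes),
    }


# Made-up entry, for illustration only.
row = structure_record("1001", "804-555-0101", "2003-03-04", "187")
print(row["area_code"], row["exchange"], row["day_of_week"])
```

Coding the area code and exchange as their own fields means that two calls can be linked by location even when the full numbers differ.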
At this point, the document contained 2,017 call entries. However, an initial visual check through the data showed that on several occasions a single individual had dialed in to the same conference more than once. In such cases, most of the repeated calls were short, lasting less than a minute, with just one lasting much longer. The most likely explanation was that the individuals concerned had difficulty connecting to the conference or maintaining a connection. Accordingly, these duplications were removed. That left a total of 1,047 calls.
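One plausible way to remove such duplications is sketched below in Python. The rule used here, keeping only the longest connection for each participant in each conference, is an assumption; the account above does not say exactly which entries the analysts retained. The function name collapse_redials and the sample entries are likewise invented for the example.

```python
def collapse_redials(calls):
    """Keep one entry per (conference, phone number): the longest connection.

    Keeping the longest entry is an assumed rule for this sketch.
    """
    longest = {}
    for call in calls:
        key = (call["conf_id"], call["phone"])
        if key not in longest or call["seconds"] > longest[key]["seconds"]:
            longest[key] = call
    return list(longest.values())


# Made-up entries: the first participant needed two attempts to join.
calls = [
    {"conf_id": "1001", "phone": "804-555-0101", "seconds": 35},
    {"conf_id": "1001", "phone": "804-555-0101", "seconds": 10920},
    {"conf_id": "1001", "phone": "212-555-0199", "seconds": 11100},
    {"conf_id": "1002", "phone": "804-555-0101", "seconds": 42},
]

unique = collapse_redials(calls)
print(len(calls), "entries reduced to", len(unique))
```

Note that the same number dialing in to a different conference is kept, since the key combines the conference ID with the phone number.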
The data was then submitted to a Kohonen-style neural network (a self-organizing map) for analysis. The network revealed three clusters of similar calls, based on the day of the month on which the call took place and the number of participants involved in a particular call.
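To give a feel for how a Kohonen-style network groups records, here is a deliberately tiny self-organizing map written in plain Python. It is a sketch under stated assumptions, not the software the analysts used: the map has just three units arranged in a line, the learning-rate and neighbourhood schedules are arbitrary choices, and the observations (day of the month, number of participants) are made up.

```python
import math
import random


def best_matching_unit(weights, x):
    """Index of the unit whose weight vector is closest to x."""
    return min(
        range(len(weights)),
        key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)),
    )


def train_som(samples, n_units=3, epochs=200, seed=0):
    """Train a one-dimensional Kohonen map; return the unit weight vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Start each unit at a randomly chosen training sample.
    weights = [list(rng.choice(samples)) for _ in range(n_units)]
    for epoch in range(epochs):
        frac = epoch / epochs
        learning_rate = 0.5 * (1.0 - frac)               # shrinks over time
        radius = max(n_units / 2.0 * (1.0 - frac), 0.5)  # neighbourhood width
        for x in rng.sample(samples, len(samples)):      # shuffled pass
            bmu = best_matching_unit(weights, x)
            for j in range(n_units):
                # Units near the winner on the map are pulled along with it.
                influence = math.exp(-((j - bmu) ** 2) / (2.0 * radius ** 2))
                step = learning_rate * influence
                for d in range(dim):
                    weights[j][d] += step * (x[d] - weights[j][d])
    return weights


# Made-up observations: (day of month, number of participants).
observations = [(2, 3), (3, 4), (4, 3), (14, 6), (15, 7), (16, 6),
                (27, 15), (28, 14), (29, 16)]
max_participants = max(p for _, p in observations)
# Scale both features to the range 0..1 so neither dominates the distance.
scaled = [(day / 31.0, p / max_participants) for day, p in observations]

weights = train_som(scaled)
labels = [best_matching_unit(weights, x) for x in scaled]
for obs, label in zip(observations, labels):
    print(obs, "-> unit", label)
```

Each call is assigned to whichever unit ends up closest to it, and calls sharing a unit form a cluster. The number of units is fixed in advance in this sketch; a real analysis would typically use a larger map and inspect it to see how many groupings the data supports.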
Further analysis of the calls within the three clusters suggested the possibility that the shorter calls placed early in the month involved the leaders, and that the calls at the end of the month involved the whole group. Unfortunately for the police (and for the telephone company whose bill was not paid), at around that time the gang ceased their activity, so there was no opportunity to take the investigation any further. The analysts assumed that the sudden cessation was preplanned, since the gang organizers knew that when the bill went unpaid, the authorities would begin an investigation.
No