{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Finding frequent itemsets with the Apriori algorithm\n", "\n", "This Jupyter notebook is based on the following publication:\n", " \n", " Harrington, P. Machine Learning in Action. \n", " Association analysis with the Apriori algorithm (Chapter 11).\n", " Manning Publications Co., 2012" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Apriori algorithm\n", "\n", "The way to find frequent itemsets is the **Apriori algorithm**. The Apriori algorithm\n", "needs a **minimum support level** as an input and a data set. The algorithm will generate\n", "a list of all **candidate itemsets** with one item. The transaction data set will then be\n", "scanned to see which sets meet the minimum support level. Sets that don't meet the\n", "minimum support level will get tossed out.\n", "\n", "The remaining sets will then be combined\n", "to make itemsets with two elements. Again, the transaction dataset will be scanned\n", "and itemsets not meeting the minimum support level will get tossed. This procedure\n", "will be repeated until all sets are tossed out." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function creates a simple data set." ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "def Load_data():\n", " baskets_data = [\n", " ['kruh', 'mleko'],\n", " ['kruh', 'plenice', 'union'],\n", " ['mleko', 'plenice', 'union'],\n", " ['kruh', 'mleko', 'plenice', 'union'],\n", " ['kruh', 'mleko', 'plenice'],\n", " ['mleko', 'plenice'],\n", " ['plenice'],\n", " ['mleko', 'union', 'plenice'],\n", " ['plenice', 'union'],\n", " ['mleko', 'plenice', 'union'],\n", " ]\n", " return baskets_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Generating candidate itemsets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pseudocode for scanning the dataset would look like this:\n", " \n", " For each transaction in tran the dataset:\n", " For each candidate itemset, can:\n", " Check to see if can is a subset of tran\n", " If so increment the count of can\n", " \n", " For each candidate itemset:\n", " If the support meets the minimum, keep this item\n", " \n", " Return list of frequent itemsets" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function creates C1: a candidate itemset of size one. In the Apriori algorithm, we create C1, and then we'll scan the dataset to see if these one itemsets meet our minimum support requirements.\n", "\n", "We need a special function for the first list of candidate itemsets because\n", "initially you're reading from input, whereas later lists will be properly formatted." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "def createC1(data):\n", " \"\"\"\n", " Create a list of unique items in transaction data.\n", " Represent each item as a set of length 1.\n", " \"\"\"\n", " C1 = []\n", " for transaction in data:\n", " for item in transaction:\n", " if not [item] in C1:\n", " C1.append([item])\n", " C1.sort()\n", " \n", " # create a set for each item in C1\n", " return [set(x) for x in C1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function createCk() will take a list of frequent itemsets, Lk, and the size of\n", "the itemsets, k, to produce Ck. This is accomplished by first creating an empty\n", "list and then measuring how many elements are in Lk.\n", "\n", "Next, compare each item in Lk with all of the other items in Lk. The two `for` loops accomplish that. \n", "\n", "Next, take two sets in our list and compare them. If **the first `k-2` items** are equal, then combine\n", "the two sets to make a set of size k. B The sets are combined using the set union,\n", "which is the | symbol in Python." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "def createCk(Lk, k):\n", " \"\"\"\n", " Create a list of candidates of length k.\n", " \n", " Arguments:\n", " Lk: a list of frequent itemsets\n", " k: the size of the itemsets\n", " \n", " \"\"\"\n", " cand_list = []\n", " len_Lk = len(Lk)\n", " \n", " # join sets if first k-2 items are equal\n", " for i in range(len_Lk):\n", " for j in range(i+1, len_Lk):\n", " L1 = list(Lk[i])[:k-2]\n", " L2 = list(Lk[j])[:k-2]\n", " L1.sort()\n", " L2.sort()\n", " if L1==L2:\n", " cand_list.append(Lk[i] | Lk[j])\n", " \n", " return cand_list" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following function is used to generate a **list of itemsets** from candidate sets. It also returns a dictionary with **support values** for use later.\n", "\n", "This function creates an empty dictionary, `count`, and then goes over all the transactions in\n", "the dataset and all the candidate sets in `Ck`. If the sets of `Ck` are part of the dataset,\n", "then increment the count in the dictionary. The set is the key in the dictionary.\n", "\n", "After scanning over all the items in the dataset and all the candidate sets, we\n", "need to calculate the support. Sets that don't meet your minimum support levels\n", "won't be output. \n", "\n", "First, we create an empty list that will hold the sets that do meet the\n", "minimum support. The next loop goes over every element in the dictionary and measures\n", "the support." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "def scanD(data, Ck, min_support):\n", " \"\"\"\n", " Scan through transaction data and return a list of candidates that meet\n", " the support threshold, and support data about the current candidates.\n", " \n", " Arguments:\n", " data: data set,\n", " Ck: a list of candidate sets\n", " min_support: the minimum support\n", " \"\"\"\n", " count = {}\n", " for transaction in data:\n", " tr = set(transaction)\n", " for candidate in Ck:\n", " if candidate.issubset(tr):\n", " can = frozenset(candidate)\n", " if can not in count:\n", " count[can] = 1\n", " else:\n", " count[can] += 1\n", " num_items = float(len(D))\n", " \n", " cand_list = []\n", " support_data = {}\n", " \n", " # calculate support for every itemset\n", " for key in count:\n", " support = count[key]/num_items\n", " \n", " # If the support meets the minimum support requirements, \n", " # add it to the list of itemsets.\n", " if support >= min_support:\n", " cand_list.insert(0, key)\n", " support_data[key] = support\n", " \n", " return cand_list, support_data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Run Apriori\n", "\n", "Pseudo-code for the whole Apriori algorithm would look like this:\n", "\n", " While the number of items in the set is greater than 0:\n", " Create a list of candidate itemsets of length k\n", " Scan the dataset to see if each itemset is frequent\n", " Keep frequent itemsets to create itemsets of length k+1" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "min_support = 0.5" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[['kruh', 'mleko'],\n", " ['kruh', 'plenice', 'union'],\n", " ['mleko', 'plenice', 'union'],\n", " ['kruh', 'mleko', 'plenice', 'union'],\n", " ['kruh', 'mleko', 'plenice'],\n", " ['mleko', 'plenice'],\n", " ['plenice'],\n", " ['mleko', 'union', 'plenice'],\n", " ['plenice', 'union'],\n", " ['mleko', 'plenice', 'union']]" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = Load_data()\n", "data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### k = 1" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'kruh'}, {'mleko'}, {'plenice'}, {'union'}]" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C1 = createC1(data)\n", "C1" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[{'kruh', 'mleko'},\n", " {'kruh', 'plenice', 'union'},\n", " {'mleko', 'plenice', 'union'},\n", " {'kruh', 'mleko', 'plenice', 'union'},\n", " {'kruh', 'mleko', 'plenice'},\n", " {'mleko', 'plenice'},\n", " {'plenice'},\n", " {'mleko', 'plenice', 'union'},\n", " {'plenice', 'union'},\n", " {'mleko', 'plenice', 'union'}]" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "D = list(map(set, data))\n", "D" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "L1, support_data1 = scanD(D, C1, min_support)" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[frozenset({'union'}), frozenset({'plenice'}), frozenset({'mleko'})]" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L1" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{frozenset({'kruh'}): 0.4,\n", " frozenset({'mleko'}): 0.7,\n", " frozenset({'plenice'}): 0.9,\n", " frozenset({'union'}): 0.6}" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "support_data1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### k = 2" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[frozenset({'plenice', 'union'}),\n", " frozenset({'mleko', 'union'}),\n", " frozenset({'mleko', 'plenice'})]" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C2 = createCk(L1, k=2)\n", "C2" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "L2, support_data2 = scanD(D, C2, min_support)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[frozenset({'mleko', 'plenice'}), frozenset({'plenice', 'union'})]" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L2" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{frozenset({'plenice', 'union'}): 0.6,\n", " frozenset({'mleko', 'union'}): 0.4,\n", " frozenset({'mleko', 'plenice'}): 0.6}" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "support_data2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### k = 3" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[frozenset({'mleko', 'plenice', 'union'})]" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "C3 = createCk(L2, k=3)\n", "C3" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "L3, support_data3 = scanD(D, C3, min_support)" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "L3" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{frozenset({'mleko', 'plenice', 'union'}): 0.4}" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "support_data3" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.10" } }, "nbformat": 4, "nbformat_minor": 2 }