Have you ever gone to a mall to buy a few things you needed and ended up buying many more that you had not planned to buy? This often happens because of the sales and discounts that malls offer. To understand it in more detail, we will discuss the origin of this technique.
A salesperson at Wal-Mart tried to increase the store's sales by bundling products together and offering discounts on them. He bundled bread and jam, which made it easy for customers to find them together and, because of the discount, encouraged customers to buy them together.
To find more products that could be bundled, the salesperson analyzed all the sales records. He found something interesting: many customers who purchased diapers also bought beer. Even though the two products seem unrelated, customers frequently bought this combination, so he decided to dig deeper. He concluded that raising kids is exhausting and that parents bought beer to relieve their stress. He then paired diapers with beer, and sales increased. This is a classic example of association rules in data mining.
The Apriori algorithm is a classic data mining algorithm. It is used to mine frequent itemsets and the association rules derived from them. It is designed to operate on a database that contains a large number of transactions, for example, the items bought by customers in a store.
Association Rule Mining
An association rule is similar to an IF-THEN relationship: if a customer buys item A, the rule estimates the chance that the same customer also picks item B under the same transaction ID.
Each rule has two parts:
Antecedent (IF): An item or group of items found in the itemsets or dataset.
Consequent (THEN): An item or group of items found together with the antecedent, such as a complementary good.
There are three measures of association:
Support: The fraction of all transactions that contain both item A and item B. Support tells us which items, or combinations of items, are bought frequently.
With support, we can filter out the items with low frequency.
Confidence: How often items A and B occur together, relative to the number of times A occurs.
To put it simply, Confidence(A → B) = Support(A and B) / Support(A).
Suppose that, even after filtering out many items, you are still left with around 5000 items. Creating association rules for all of them is practically impossible. This is where the concept of lift comes into play.
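Lift: Lift measures the strength of a rule compared with the random co-occurrence of A and B: Lift(A → B) = Support(A and B) / (Support(A) × Support(B)).
Look at the denominator: it is the product of the individual support values of A and B, which is the support we would expect if A and B were bought independently of each other. A lift greater than 1 means that A and B occur together more often than chance would predict, and the higher the lift, the stronger the rule.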
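The three measures can be computed directly from transaction counts. The following is a minimal Python sketch; the five transactions and the helper names (`count`, `support`, `confidence`, `lift`) are made up for illustration and do not come from real sales data or from any library.

```python
# Hypothetical toy data: each transaction is the set of items in one basket.
transactions = [
    {"bread", "jam"},
    {"bread", "jam", "diapers"},
    {"diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"diapers", "beer", "jam"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset):
    """Fraction of all transactions that contain `itemset`."""
    return count(itemset) / len(transactions)


def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both) / count(antecedent)


def lift(antecedent, consequent):
    """Observed joint support divided by the support expected under independence."""
    both = set(antecedent) | set(consequent)
    n = len(transactions)
    return count(both) * n / (count(antecedent) * count(consequent))


print(support({"diapers", "beer"}))       # 3 of the 5 transactions
print(confidence({"diapers"}, {"beer"}))  # 3 of the 4 transactions that contain diapers
print(lift({"diapers"}, {"beer"}))        # above 1: bought together more often than chance predicts
```

In this toy data the rule diapers → beer has support 0.6, confidence 0.75 and lift 1.25. The sketch assumes the antecedent and consequent each occur at least once; otherwise the divisions would fail.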
Frequent Pattern Mining (FPM)
Frequent pattern mining is one of the most important data mining techniques for finding relationships between different items in a dataset. These relationships are represented in the form of association rules, which also help to find irregularities in the data.
Apriori was one of the first algorithms proposed for frequent itemset mining. It uses two steps, join and prune, to reduce the search space, and it is an iterative approach to finding the frequent itemsets.
Whether an item or itemset is frequent is decided as follows:
- If P(I) < minimum support threshold, then item I is not frequent.
- If P(I+A) < minimum support threshold, then I+A is not frequent, where A is another item in the itemset.
- If an itemset has support below the minimum support threshold, then all of its supersets will also fall below minimum support and can be ignored. This is called the antimonotone property.
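The antimonotone property can be checked on a small example: adding an item to an itemset can never increase its support count, because every transaction that contains the larger set also contains the smaller one. The sketch below uses hypothetical toy transactions, and the names in it are illustrative only.

```python
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items in one basket.
transactions = [
    {"bread", "jam"},
    {"bread", "jam", "diapers"},
    {"diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"diapers", "beer", "jam"},
]


def count(itemset):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= t)


all_items = sorted(set().union(*transactions))

# Compare every itemset of size 2 or 3 with each subset obtained by dropping one item.
violations = []
for size in (2, 3):
    for superset in combinations(all_items, size):
        for subset in combinations(superset, size - 1):
            if count(superset) > count(subset):
                violations.append((subset, superset))

print(violations)  # an empty list: no superset is more frequent than its subset
```

This is why Apriori can skip every superset of an itemset that has already failed the minimum support test.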
The steps followed in the Apriori algorithm are:
- Join step: This step generates candidate (K+1)-itemsets from the frequent K-itemsets by joining the set of K-itemsets with itself.
- Prune step: This step scans the database to count the support of every candidate. If a candidate does not meet minimum support, it is regarded as infrequent and is removed. This step is performed to reduce the size of the candidate itemsets.
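The two steps can be sketched as a pair of small Python functions. This is an illustrative sketch rather than a reference implementation: the function names `join` and `prune`, the `min_count` parameter (minimum support expressed as a number of transactions) and the toy transactions are all assumptions made for the example.

```python
from itertools import combinations

# Hypothetical toy data: each transaction is the set of items in one basket.
transactions = [
    {"bread", "jam"},
    {"bread", "jam", "diapers"},
    {"diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"diapers", "beer", "jam"},
]


def join(frequent_k):
    """Join step: combine frequent k-itemsets pairwise into candidate (k+1)-itemsets."""
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        union = a | b
        if len(union) == len(a) + 1:  # a and b differ in exactly one item
            candidates.add(union)
    return candidates


def prune(candidates, transactions, min_count):
    """Prune step: keep only candidates found in at least `min_count` transactions."""
    kept = {}
    for candidate in candidates:
        n = sum(1 for t in transactions if candidate <= t)
        if n >= min_count:
            kept[candidate] = n
    return kept


# Start from the four single items, all of which occur at least twice here.
frequent_1 = {frozenset([item]) for item in ("bread", "jam", "diapers", "beer")}
candidates_2 = join(frequent_1)
frequent_2 = prune(candidates_2, transactions, min_count=2)
print(len(candidates_2), len(frequent_2))  # 6 candidate pairs, 4 of them survive pruning
```

Itemsets are stored as `frozenset` values so that they can be placed in sets and used as dictionary keys.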
Steps followed in Apriori algorithm:
The Apriori algorithm is a series of steps for finding the frequent itemsets in a given database. The join and prune steps discussed earlier are performed iteratively until no further frequent itemsets can be found. A minimum support threshold is given in the problem or chosen by the user.
In the first iteration of the algorithm, each item is taken as a candidate 1-itemset. The algorithm counts the occurrences of each item.
Assume some minimum support. The 1-itemsets whose occurrence counts satisfy the minimum support are determined. Only the candidates whose count is greater than or equal to the minimum support are carried forward to the next iteration; the others are pruned.
Next, the frequent 2-itemsets with minimum support are determined. In the join step, the candidate 2-itemsets are generated by combining the frequent 1-itemsets with each other to form groups of two.
The 2-itemset candidates are pruned using the minimum support as the threshold. The table now contains only the 2-itemsets that meet minimum support.
The third iteration forms the 3-itemsets using the join and prune steps. This iteration uses the antimonotone property discussed earlier: for each candidate 3-itemset, every one of its 2-itemset subsets must meet minimum support. If all of its 2-itemset subsets are frequent, the candidate is kept and its support is counted; otherwise it is pruned.
The algorithm continues in this way and stops when an iteration produces no new frequent itemsets.
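The iterations described above can be put together in a short Python sketch. It is a simplified illustration under stated assumptions, not an optimized implementation: the function name `apriori`, the `min_count` parameter (minimum support as a number of transactions) and the `baskets` data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for every itemset meeting `min_count`."""
    transactions = [frozenset(t) for t in transactions]

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # Iteration 1: every single item is a candidate 1-itemset.
    current = {}
    for item in set().union(*transactions):
        single = frozenset([item])
        n = count(single)
        if n >= min_count:
            current[single] = n

    frequent = {}
    while current:
        frequent.update(current)
        # Join step: merge pairs of frequent k-itemsets that differ in one item.
        candidates = set()
        for a, b in combinations(current, 2):
            union = a | b
            if len(union) == len(a) + 1:
                candidates.add(union)
        # Antimonotone check: drop candidates that have an infrequent k-item subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, len(c) - 1))
        }
        # Prune step: count support and keep candidates that meet the threshold.
        current = {}
        for c in candidates:
            n = count(c)
            if n >= min_count:
                current[c] = n
    return frequent


# Hypothetical toy data: each basket is the set of items in one transaction.
baskets = [
    {"bread", "jam"},
    {"bread", "jam", "diapers"},
    {"diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"diapers", "beer", "jam"},
]
result = apriori(baskets, min_count=2)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

With a minimum support count of 2, this toy data yields four frequent 1-itemsets and four frequent 2-itemsets. The only 3-itemset candidate that passes the subset check, {bread, jam, diapers}, appears in a single transaction and is dropped, so the loop ends after the third iteration.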
Advantages of the Apriori algorithm
- It is easy to implement and easy to understand.
- It may be used on large item sets.
Disadvantages of the Apriori Algorithm
- It may need to generate a large number of candidate itemsets, which can be computationally expensive.
- Computing support is also expensive because each pass has to scan the entire database.