1 Association Rules

Association rule mining is a popular and well-researched method for discovering interesting relations between itemsets in large databases. We start by defining a few ways to measure association.

  1. Support: The support, \(\text{supp}(X)\), measures how popular an itemset (\(X\)) is. It is calculated as the proportion of transactions in the data set which contain the itemset.

    • We use Table 1 below (see footnote 1) to show that the support of {apple} is 4 out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of {apple, beer, rice} is 2 out of 8, or 25%. The support of {apple, beer} is 3 out of 8, or 37.5%.
Table 1
  2. Confidence: The confidence of a rule measures how likely item \(Y\) is to be purchased when item \(X\) is purchased, defined as \(\text{conf}( X\Rightarrow Y) = \text{supp}( X \cap Y )/\text{supp}(X)\). Here \(\text{supp}( X \cap Y )\) denotes the proportion of transactions that contain both \(X\) and \(Y\). Equivalently, confidence is the proportion of transactions with item \(X\) in which item \(Y\) also appears.

    • In Table 1, the confidence of {apple \(\Rightarrow\) beer} is 3 out of 4, or 75%. It means that for 75% of the transactions containing apple the rule is correct (you will see apple and beer appear together). Confidence can be interpreted as an estimate of the conditional probability P(Y|X), the probability of finding the RHS of the rule in transactions under the condition that these transactions also contain the LHS. Association rules are required to satisfy both a minimum support and a minimum confidence constraint at the same time.
Table 2
  3. Lift: Lift is a popular measure used to filter or rank the rules found. It measures how likely item \(Y\) is to be purchased when item \(X\) is purchased, while controlling for how popular item \(Y\) is, and is defined as \(\text{lift}(X \Rightarrow Y ) = \text{supp}(X \cap Y )/(\text{supp}(X)\text{supp}(Y))\). Lift can be interpreted as the deviation of the support of the whole rule from the support expected under independence given the supports of the LHS and the RHS. Lift values greater than 1 indicate a positive association, and larger values indicate stronger ones.

    • In Table 1, the lift of {apple \(\Rightarrow\) beer} is \(\frac{3/8}{(4/8)\cdot(6/8)}=1\), which implies no association between the two items. A lift value greater than 1 means that item \(Y\) is more likely to be bought if item \(X\) is bought, while a value less than 1 means that item \(Y\) is less likely to be bought if item \(X\) is bought.
Table 3
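
The three measures can be checked by hand in a few lines of base R. The sketch below is only illustrative: the eight transactions are hypothetical toy data chosen to reproduce the counts quoted above (they are not taken from Table 1 itself), and the helper names supp(), conf() and lift() are made up for this example.

```r
# Hypothetical toy data: eight transactions chosen to match the counts quoted
# above (apple 4/8, beer 6/8, {apple, beer} 3/8, {apple, beer, rice} 2/8).
toy_transactions <- list(
  c("apple", "beer", "rice", "chicken"),
  c("apple", "beer", "rice"),
  c("apple", "beer"),
  c("apple", "pear"),
  c("milk", "beer", "rice", "chicken"),
  c("milk", "beer", "rice"),
  c("milk", "beer"),
  c("milk", "pear")
)

# Support: share of transactions that contain every item in `items`.
supp <- function(items, trans = toy_transactions) {
  mean(vapply(trans, function(tr) all(items %in% tr), logical(1)))
}

# Confidence of lhs => rhs: support of both sides together over support of lhs.
conf <- function(lhs, rhs, trans = toy_transactions) {
  supp(c(lhs, rhs), trans) / supp(lhs, trans)
}

# Lift of lhs => rhs: joint support relative to what independence would give.
lift <- function(lhs, rhs, trans = toy_transactions) {
  supp(c(lhs, rhs), trans) / (supp(lhs, trans) * supp(rhs, trans))
}

supp("apple")                     # 0.5
supp(c("apple", "beer", "rice"))  # 0.25
conf("apple", "beer")             # 0.75
lift("apple", "beer")             # 1
```

Because beer appears in 6 of the 8 toy transactions, the 75% confidence of {apple ⇒ beer} is exactly what independence would predict, which is why the lift comes out as 1.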

For a fuller introduction, see Complete guide to Association Rules 1 and Complete guide to Association Rules 2.

1.1 Groceries example

1.1.1 Find association rules by arules package

The Groceries dataset contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.

The arules package in R provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.

library(arules)
data("Groceries")
#run summary report
summary(Groceries)
## transactions as itemMatrix in sparse format with
##  9835 rows (elements/itemsets/transactions) and
##  169 columns (items) and a density of 0.02609146 
## 
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda 
##             2513             1903             1809             1715 
##           yogurt          (Other) 
##             1372            34055 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55   46 
##   17   18   19   20   21   22   23   24   26   27   28   29   32 
##   29   14   14    9   11    4    6    1    1    1    1    3    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   3.000   4.409   6.000  32.000 
## 
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage

summary() displays the most frequent items in the data set, information about the transaction length distribution, and a note that the data set contains some extended item information. Here each item label comes with two category levels (level2 and level1). This additional information can be used for analyzing the data set.
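
The headline numbers in this summary are tied together by simple arithmetic, which the base R sketch below reproduces. It uses only figures copied from the output above; the variable names are made up for illustration.

```r
# Numbers copied from the summary(Groceries) output above.
n_transactions <- 9835
n_items <- 169
item_counts <- c(2513, 1903, 1809, 1715, 1372, 34055)  # top five items + "(Other)"

# Total number of item occurrences, i.e. filled cells of the sparse matrix.
total_occurrences <- sum(item_counts)

# Mean transaction length and density both follow from that total.
mean_length <- total_occurrences / n_transactions
density <- total_occurrences / (n_transactions * n_items)

round(mean_length, 3)  # 4.409
round(density, 8)      # 0.02609146
```

In other words, the density is the mean transaction length divided by the number of item categories: an average basket touches about 2.6% of the 169 categories.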

To find the very long transactions we can use size() to select those containing more than 30 items.

# size() returns the number of items in each transaction
x = Groceries[size(Groceries) > 30]
inspect(x)
##     items               
## [1] {frankfurter,       
##      sausage,           
##      liver loaf,        
##      ham,               
##      chicken,           
##      beef,              
##      citrus fruit,      
##      tropical fruit,    
##      root vegetables,   
##      other vegetables,  
##      whole milk,        
##      butter,            
##      curd,              
##      yogurt,            
##      whipped/sour cream,
##      beverages,         
##      soft cheese,       
##      hard cheese,       
##      cream cheese ,     
##      mayonnaise,        
##      domestic eggs,     
##      rolls/buns,        
##      roll products ,    
##      flour,             
##      pasta,             
##      margarine,         
##      specialty fat,     
##      sugar,             
##      soups,             
##      skin care,         
##      hygiene articles,  
##      candles}

To see which items are important in the data set we can use itemFrequencyPlot(). To reduce the number of items, we only plot the item frequency for items with a support greater than 10%. The label size is reduced with the parameter cex.names.

# itemFrequencyPlot() shows the frequency for items
itemFrequencyPlot(Groceries, support = 0.1, cex.names=0.8)
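
The values behind the plot can also be inspected as numbers. The sketch below assumes the arules function itemFrequency(), which returns the support of each single item; the names item_support and frequent_items are made up for this example.

```r
library(arules)
data("Groceries")

# Relative item frequency = support of each single-item itemset.
item_support <- itemFrequency(Groceries, type = "relative")

# Keep the items that clear the 10% threshold used in the plot,
# most frequent first.
frequent_items <- sort(item_support[item_support > 0.1], decreasing = TRUE)
round(frequent_items, 3)
```

The first entry should be whole milk, whose support is 2513/9835 according to the summary above.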

Use the apriori() function to find all rules (the default association type for apriori()) with a minimum support of 0.3% and a minimum confidence of 0.5.

# Run the apriori algorithm
basket_rules <- apriori(Groceries, parameter = list(support = 0.003, confidence = 0.5, target = "rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5   0.003      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 29 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [421 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(basket_rules)
## set of 421 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3   4   5 
##   5 281 128   7 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   3.000   3.000   3.325   4.000   5.000 
## 
## summary of quality measures:
##     support           confidence        coverage             lift      
##  Min.   :0.003050   Min.   :0.5000   Min.   :0.003559   Min.   :1.957  
##  1st Qu.:0.003355   1st Qu.:0.5238   1st Qu.:0.005999   1st Qu.:2.135  
##  Median :0.003965   Median :0.5556   Median :0.007016   Median :2.426  
##  Mean   :0.004754   Mean   :0.5715   Mean   :0.008477   Mean   :2.522  
##  3rd Qu.:0.005186   3rd Qu.:0.6094   3rd Qu.:0.009456   3rd Qu.:2.766  
##  Max.   :0.022267   Max.   :0.8857   Max.   :0.043416   Max.   :5.804  
##      count       
##  Min.   : 30.00  
##  1st Qu.: 33.00  
##  Median : 39.00  
##  Mean   : 46.75  
##  3rd Qu.: 51.00  
##  Max.   :219.00  
## 
## mining info:
##       data ntransactions support confidence
##  Groceries          9835   0.003        0.5
# Check the generated rules using inspect
inspect(head(basket_rules))
##     lhs                        rhs                support     confidence
## [1] {cereals}               => {whole milk}       0.003660397 0.6428571 
## [2] {specialty cheese}      => {other vegetables} 0.004270463 0.5000000 
## [3] {rice}                  => {other vegetables} 0.003965430 0.5200000 
## [4] {rice}                  => {whole milk}       0.004677173 0.6133333 
## [5] {baking powder}         => {whole milk}       0.009252669 0.5229885 
## [6] {root vegetables,herbs} => {other vegetables} 0.003863752 0.5507246 
##     coverage    lift     count
## [1] 0.005693950 2.515917 36   
## [2] 0.008540925 2.584078 42   
## [3] 0.007625826 2.687441 39   
## [4] 0.007625826 2.400371 46   
## [5] 0.017691917 2.046793 91   
## [6] 0.007015760 2.846231 38
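
The quality columns are all derived from a few transaction counts. As a rough check, the base R sketch below rebuilds the measures for rule [1], {cereals} => {whole milk}, using only numbers printed above; the variable names are illustrative.

```r
# Counts taken from the output above for rule [1] {cereals} => {whole milk}.
n_transactions <- 9835
count_rule <- 36   # transactions containing cereals and whole milk ("count")
count_rhs <- 2513  # transactions containing whole milk (from summary(Groceries))

# Support is the rule count as a share of all transactions.
support <- count_rule / n_transactions         # about 0.003660

# Coverage is supp(lhs); turning it back into a count gives 56 cereal baskets.
coverage <- 0.005693950
count_lhs <- round(coverage * n_transactions)  # 56

# Confidence and lift follow from the definitions in Section 1.
confidence <- count_rule / count_lhs                # about 0.6429
lift <- confidence / (count_rhs / n_transactions)   # about 2.516

# A minimum support of 0.003 is 29.505 transactions; rounding down gives the
# "Absolute minimum support count: 29" line in the log.
min_count <- floor(0.003 * n_transactions)          # 29
```

Since 29.505 is not a whole number, every reported rule needs at least 30 supporting transactions, which agrees with the minimum count of 30 in summary(basket_rules).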

As is typical for association rule mining, the number of rules found is large (421 here). To analyze these rules, subset() can be used to produce separate subsets of rules. Now find the subset of rules with more than 4 items in total (LHS + RHS), which here means the rules of length 5.

#Basket rules of size greater than 4
inspect(subset(basket_rules, size(basket_rules)>4))
##     lhs                   rhs                    support confidence    coverage     lift count
## [1] {citrus fruit,                                                                            
##      tropical fruit,                                                                          
##      root vegetables,                                                                         
##      other vegetables} => {whole milk}       0.003152008  0.7045455 0.004473818 2.757344    31
## [2] {citrus fruit,                                                                            
##      tropical fruit,                                                                          
##      root vegetables,                                                                         
##      whole milk}       => {other vegetables} 0.003152008  0.8857143 0.003558719 4.577509    31
## [3] {citrus fruit,                                                                            
##      tropical fruit,                                                                          
##      other vegetables,                                                                        
##      whole milk}       => {root vegetables}  0.003152008  0.6326531 0.004982206 5.804238    31
## [4] {citrus fruit,                                                                            
##      root vegetables,                                                                         
##      other vegetables,                                                                        
##      whole milk}       => {tropical fruit}   0.003152008  0.5438596 0.005795628 5.183004    31
## [5] {tropical fruit,                                                                          
##      root vegetables,                                                                         
##      other vegetables,                                                                        
##      yogurt}           => {whole milk}       0.003558719  0.7142857 0.004982206 2.795464    35
## [6] {tropical fruit,                                                                          
##      root vegetables,                                                                         
##      whole milk,                                                                              
##      yogurt}           => {other vegetables} 0.003558719  0.6250000 0.005693950 3.230097    35
## [7] {tropical fruit,                                                                          
##      root vegetables,                                                                         
##      other vegetables,                                                                        
##      whole milk}       => {yogurt}           0.003558719  0.5072464 0.007015760 3.636128    35

Find the subset of rules with lift greater than 5:

inspect(subset(basket_rules, lift>5))
##     lhs                   rhs                   support confidence    coverage     lift count
## [1] {citrus fruit,                                                                           
##      tropical fruit,                                                                         
##      other vegetables,                                                                       
##      whole milk}       => {root vegetables} 0.003152008  0.6326531 0.004982206 5.804238    31
## [2] {citrus fruit,                                                                           
##      root vegetables,                                                                        
##      other vegetables,                                                                       
##      whole milk}       => {tropical fruit}  0.003152008  0.5438596 0.005795628 5.183004    31
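
Instead of cutting at a fixed threshold, the rules can also be ranked and the top few kept. The sketch below is one way to do this with sort(), head() and quality() from arules; it re-runs apriori() with the same parameters so that it is self-contained, and the names top_by_lift and top_by_conf are made up for this example.

```r
library(arules)
data("Groceries")

# Same mining step as above, with the progress log switched off.
basket_rules <- apriori(Groceries,
                        parameter = list(support = 0.003, confidence = 0.5,
                                         target = "rules"),
                        control = list(verbose = FALSE))

# Rank by lift (highest first) and keep the top five rules.
top_by_lift <- head(sort(basket_rules, by = "lift"), 5)
inspect(top_by_lift)

# The same idea with confidence as the ranking measure; quality() returns
# the measures as a data frame.
top_by_conf <- head(sort(basket_rules, by = "confidence"), 5)
quality(top_by_conf)[, c("support", "confidence", "lift")]
```

Ranking avoids having to guess a cutoff in advance, at the cost of always returning a fixed number of rules whether or not they are interesting.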

Now find the subset of rules that have yogurt on the right-hand side. Here we also require the lift to exceed 3.5.

yogurt.rhs <- subset(basket_rules, subset = rhs %in% "yogurt" & lift>3.5)

Now inspect this subset of rules:

inspect(yogurt.rhs)
##     lhs                     rhs          support confidence    coverage     lift count
## [1] {whipped/sour cream,                                                              
##      cream cheese }      => {yogurt} 0.003355363  0.5238095 0.006405694 3.754859    33
## [2] {root vegetables,                                                                 
##      cream cheese }      => {yogurt} 0.003762074  0.5000000 0.007524148 3.584184    37
## [3] {tropical fruit,                                                                  
##      curd}               => {yogurt} 0.005287239  0.5148515 0.010269446 3.690645    52
## [4] {other vegetables,                                                                
##      whole milk,                                                                      
##      cream cheese }      => {yogurt} 0.003457041  0.5151515 0.006710727 3.692795    34
## [5] {tropical fruit,                                                                  
##      whole milk,                                                                      
##      curd}               => {yogurt} 0.003965430  0.6093750 0.006507372 4.368224    39
## [6] {tropical fruit,                                                                  
##      other vegetables,                                                                
##      butter}             => {yogurt} 0.003050330  0.5555556 0.005490595 3.982426    30
## [7] {tropical fruit,                                                                  
##      whole milk,                                                                      
##      butter}             => {yogurt} 0.003355363  0.5409836 0.006202339 3.877969    33
## [8] {tropical fruit,                                                                  
##      whole milk,                                                                      
##      whipped/sour cream} => {yogurt} 0.004372140  0.5512821 0.007930859 3.951792    43
## [9] {tropical fruit,                                                                  
##      root vegetables,                                                                 
##      other vegetables,                                                                
##      whole milk}         => {yogurt} 0.003558719  0.5072464 0.007015760 3.636128    35

Now find the subset of rules that have meat on the left-hand side. Here we also require the lift to exceed 2.

meat_lhs <- subset(basket_rules, subset = lhs %in% "meat" & lift>2)

Now inspect this subset of rules:

inspect(meat_lhs)
##     lhs                       rhs          support     confidence coverage   
## [1] {meat,root vegetables} => {whole milk} 0.003152008 0.62       0.005083884
##     lift     count
## [1] 2.426462 31
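
Besides %in%, arules provides matching operators that are useful inside subset(): %ain% requires all listed items to be present, and %pin% matches item labels partially. The sketch below is a hedged illustration that re-runs apriori() with the same parameters to stay self-contained; the object names and the extra confidence cutoff of 0.6 are arbitrary choices for this example.

```r
library(arules)
data("Groceries")

basket_rules <- apriori(Groceries,
                        parameter = list(support = 0.003, confidence = 0.5,
                                         target = "rules"),
                        control = list(verbose = FALSE))

# %ain% ("all in"): the LHS must contain every listed item.
both_lhs <- subset(basket_rules, lhs %ain% c("tropical fruit", "curd"))
inspect(both_lhs)

# %pin% ("partial in"): the RHS item label must contain the given string,
# so this matches "other vegetables", "root vegetables", and so on.
veg_rhs <- subset(basket_rules, rhs %pin% "vegetables" & confidence > 0.6)
length(veg_rhs)
```

With plain %in%, a vector of several items matches rules that contain any one of them, so %ain% is the operator to reach for when all items must appear together.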

1.1.2 Visualize the rules by arulesViz

We can use the arulesViz package to visualize the rules. For a more complete introduction, see the package manual.

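# Install arulesViz only if it is not already available
if (!requireNamespace('arulesViz', quietly = TRUE)) install.packages('arulesViz')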
library('arulesViz')
plot(basket_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

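The plot function has an interactive mode for inspecting individual rules. Recent arulesViz releases select interactive output through the engine argument (for example engine = "htmlwidget") and treat interactive as deprecated, so check the documentation of your installed version if the call below produces a warning: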

plot(basket_rules, interactive=TRUE)

Graph-based visualization works best for very small sets of rules. Here the items are drawn as vertices for the 10 rules with the highest lift:

plot(head(sort(basket_rules, by="lift"), 10), method = "graph")

The package comes with an approach to cluster association rules and itemsets:

plot(basket_rules, method="grouped")


2 Case Starter Code

For the Cincinnati Zoo data, use the following code to load the transaction data for association rule mining. The as() function coerces the data set into the transactions data type that arules expects. In the zoo data, the support for the rules is relatively low, with a maximum support of no more than 3%.

TransFood <- read.csv('https://yanyudm.github.io/Data-Mining-R/data/food_4_association.csv')
TransFood <- TransFood[, -1]
# Find out elements that are not equal to 0 or 1 and change them to 1.
Others <- which(!(as.matrix(TransFood) == 1 | as.matrix(TransFood) == 0), arr.ind = TRUE)
TransFood[Others] <- 1
TransFood <- as(as.matrix(TransFood), "transactions")
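
The recoding step is easier to follow on a tiny example. The sketch below uses a hypothetical 3 x 2 purchase table (the columns burger and soda are made up and are not taken from the zoo file) and applies the same "anything that is not 0 or 1 becomes 1" rule in base R.

```r
# Hypothetical purchase counts; the real columns come from the CSV above.
toy <- data.frame(burger = c(0, 2, 1), soda = c(1, 0, 3))
toy_mat <- as.matrix(toy)

# Any cell that is neither 0 nor 1 (here the counts 2 and 3) becomes 1,
# turning purchase counts into a 0/1 incidence matrix.
toy_mat[!(toy_mat == 0 | toy_mat == 1)] <- 1
toy_mat

# With arules loaded, as(toy_mat, "transactions") would then be expected to
# give three transactions: {soda}, {burger} and {burger, soda}.
```

Association rules only use whether an item was bought, not how many units, which is why the counts are collapsed before the coercion.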



  1. The figures above are borrowed from kdnuggets.com.