Association rule mining is a popular and well-researched method for discovering interesting relations between itemsets in large databases. We start by defining a few ways to measure association.
Support: The support, \(\text{supp}(X)\), measures how popular an itemset (\(X\)) is. It is calculated as the proportion of transactions in the data set which contain the itemset.
Confidence: The confidence of a rule measures how likely itemset \(Y\) is to be purchased when itemset \(X\) is purchased. It is defined as \(\text{conf}(X \Rightarrow Y) = \text{supp}(X \cup Y)/\text{supp}(X)\), where \(X \cup Y\) is the itemset containing all items of \(X\) and \(Y\), so \(\text{supp}(X \cup Y)\) is the proportion of transactions that contain both. In other words, confidence is the proportion of transactions containing \(X\) in which \(Y\) also appears.
Lift: Lift is a popular measure used to filter or rank the rules found. It measures how likely itemset \(Y\) is to be purchased when itemset \(X\) is purchased, while controlling for how popular \(Y\) is. It is defined as \(\text{lift}(X \Rightarrow Y) = \text{supp}(X \cup Y)/(\text{supp}(X)\,\text{supp}(Y))\). Lift can be interpreted as the deviation of the support of the whole rule from the support expected under independence, given the supports of the LHS and the RHS. A lift of 1 means \(X\) and \(Y\) co-occur exactly as often as independence would predict; values greater than 1 indicate a positive association, and larger values indicate a stronger one.
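As a small worked example of the three measures, the sketch below computes support, confidence, and lift by hand in base R. The five transactions and the helper `supp()` are made up for illustration and are not part of any package.

```r
# Five hypothetical transactions (invented for this illustration)
transactions <- list(
  c("bread", "milk"),
  c("bread", "butter"),
  c("milk", "butter", "bread"),
  c("milk"),
  c("bread", "milk", "eggs")
)

# Support: share of transactions that contain every item in `items`
supp <- function(items) {
  mean(sapply(transactions, function(tr) all(items %in% tr)))
}

x <- "bread"
y <- "milk"

supp_xy <- supp(c(x, y))                  # transactions containing both items
conf_xy <- supp_xy / supp(x)              # confidence of bread => milk
lift_xy <- supp_xy / (supp(x) * supp(y))  # lift of bread => milk

print(c(support = supp_xy, confidence = conf_xy, lift = lift_xy))
```

Here bread and milk each appear in 4 of 5 transactions and together in 3 of 5, so the confidence is 0.75 while the lift is 0.6 / 0.64 = 0.9375. The rule looks confident, yet the lift below 1 shows the pair co-occurs slightly less often than independence would predict.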
For further introductions, see Complete guide to Association Rules 1 and Complete guide to Association Rules 2.
The Groceries dataset contains 1 month (30 days) of real-world point-of-sale transaction data from a typical local grocery outlet. The data set contains 9835 transactions and the items are aggregated to 169 categories.
The arules package in R provides a basic infrastructure for creating and manipulating input data sets and for analyzing the resulting itemsets and rules.
For an introduction to arules and additional case studies, see Introduction to arules.
For the reference manual of the package, see arules package manual.
library(arules)
data("Groceries")
#run summary report
summary(Groceries)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55 46
## 17 18 19 20 21 22 23 24 26 27 28 29 32
## 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
##        labels  level2           level1
## 1 frankfurter sausage meat and sausage
## 2     sausage sausage meat and sausage
## 3  liver loaf sausage meat and sausage
summary() displays the most frequent items in the data set, information about the transaction length distribution, and a note that the data set contains extended item information. Here that information consists of the item labels together with two levels of a category hierarchy (level2 and level1). This additional information can be used for analyzing the data set.
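The extended item information listed at the end of the summary output can also be inspected directly. The sketch below uses `itemInfo()` from arules; the variable name `item_info` is only illustrative.

```r
library(arules)
data("Groceries")

# itemInfo() returns a data frame with one row per item
item_info <- itemInfo(Groceries)
head(item_info, 3)

# Number of distinct categories at each level of the hierarchy
length(unique(item_info$level2))
length(unique(item_info$level1))
```

The `level1` and `level2` columns give progressively finer groupings of the 169 item labels, which is useful when rules on individual items are too sparse.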
To find the very long transactions, we can use size() and select those containing more than 30 items.
# size() returns the number of items in each transaction
x <- Groceries[size(Groceries) > 30]
inspect(x)
## items
## [1] {frankfurter,
## sausage,
## liver loaf,
## ham,
## chicken,
## beef,
## citrus fruit,
## tropical fruit,
## root vegetables,
## other vegetables,
## whole milk,
## butter,
## curd,
## yogurt,
## whipped/sour cream,
## beverages,
## soft cheese,
## hard cheese,
## cream cheese ,
## mayonnaise,
## domestic eggs,
## rolls/buns,
## roll products ,
## flour,
## pasta,
## margarine,
## specialty fat,
## sugar,
## soups,
## skin care,
## hygiene articles,
## candles}
To see which items are important in the data set, we can use itemFrequencyPlot(). To reduce the number of items, we only plot the item frequency for items with a support greater than 10%. The label size is reduced with the parameter cex.names.
# itemFrequencyPlot() shows the frequency for items
itemFrequencyPlot(Groceries, support = 0.1, cex.names=0.8)
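The numbers behind the plot can be obtained with `itemFrequency()`, which returns the relative frequency (support) of each item by default. The following is a minimal sketch; the names `freq` and `frequent` are illustrative.

```r
library(arules)
data("Groceries")

# Relative frequency (support) of every item, as a named numeric vector
freq <- itemFrequency(Groceries)

# Keep only items with support above 10%, most frequent first
frequent <- sort(freq[freq > 0.1], decreasing = TRUE)
round(frequent, 3)
```

If a fixed number of bars is preferred over a support threshold, `itemFrequencyPlot()` also accepts a `topN` argument, for example `itemFrequencyPlot(Groceries, topN = 10)`.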
Use the apriori() algorithm to find all rules (the default association type for apriori()) with a minimum support of 0.3% and a minimum confidence of 0.5.
# Run the apriori algorithm
basket_rules <- apriori(Groceries, parameter = list(support = 0.003, confidence = 0.5, target = "rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.003 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 29
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [136 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 done [0.00s].
## writing ... [421 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
summary(basket_rules)
## set of 421 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5
## 5 281 128 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 3.000 3.000 3.325 4.000 5.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.003050 Min. :0.5000 Min. :0.003559 Min. :1.957
## 1st Qu.:0.003355 1st Qu.:0.5238 1st Qu.:0.005999 1st Qu.:2.135
## Median :0.003965 Median :0.5556 Median :0.007016 Median :2.426
## Mean :0.004754 Mean :0.5715 Mean :0.008477 Mean :2.522
## 3rd Qu.:0.005186 3rd Qu.:0.6094 3rd Qu.:0.009456 3rd Qu.:2.766
## Max. :0.022267 Max. :0.8857 Max. :0.043416 Max. :5.804
## count
## Min. : 30.00
## 1st Qu.: 33.00
## Median : 39.00
## Mean : 46.75
## 3rd Qu.: 51.00
## Max. :219.00
##
## mining info:
## data ntransactions support confidence
## Groceries 9835 0.003 0.5
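The mining log above reports an absolute minimum support count of 29, while the smallest count in the rule summary is 30. The short calculation below suggests how the two fit together. Reading the printed 29 as a truncated value is an assumption made here, not something taken from the arules documentation.

```r
n_transactions <- 9835
min_support <- 0.003

# Relative threshold expressed as a number of transactions
threshold <- min_support * n_transactions
threshold

floor(threshold)    # assumed origin of the count printed in the mining log
ceiling(threshold)  # smallest integer count that actually reaches the threshold

# Support of a rule that occurs in exactly 30 transactions
30 / n_transactions
```

The last value matches the minimum support of 0.003050 reported in the summary of quality measures, and 29 / 9835 would fall just below the 0.003 threshold.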
# Check the generated rules using inspect
inspect(head(basket_rules))
## lhs rhs support confidence
## [1] {cereals} => {whole milk} 0.003660397 0.6428571
## [2] {specialty cheese} => {other vegetables} 0.004270463 0.5000000
## [3] {rice} => {other vegetables} 0.003965430 0.5200000
## [4] {rice} => {whole milk} 0.004677173 0.6133333
## [5] {baking powder} => {whole milk} 0.009252669 0.5229885
## [6] {root vegetables,herbs} => {other vegetables} 0.003863752 0.5507246
## coverage lift count
## [1] 0.005693950 2.515917 36
## [2] 0.008540925 2.584078 42
## [3] 0.007625826 2.687441 39
## [4] 0.007625826 2.400371 46
## [5] 0.017691917 2.046793 91
## [6] 0.007015760 2.846231 38
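The quality measures in this table can be checked by hand. The sketch below recomputes them for the first rule, {cereals} => {whole milk}, using only counts that appear in the output above: 36 transactions contain both items, the coverage of 0.005693950 corresponds to 56 transactions containing cereals, and the summary lists 2513 transactions containing whole milk.

```r
n <- 9835          # total number of transactions
count_xy <- 36     # transactions with cereals and whole milk (count column)
count_x <- 56      # transactions with cereals (coverage * n)
count_y <- 2513    # transactions with whole milk (from summary(Groceries))

support <- count_xy / n
coverage <- count_x / n
confidence <- count_xy / count_x
lift <- confidence / (count_y / n)

round(c(support = support, coverage = coverage,
        confidence = confidence, lift = lift), 6)
```

The results agree with row [1] of the table: support 0.003660, coverage 0.005694, confidence 0.642857, and lift 2.515917.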
As is typical for association rule mining, the number of rules found is large (421 here). To analyze these rules, subset() can be used, for example, to produce separate subsets of rules. Now find the subset of rules whose length (LHS + RHS) is greater than 4:
#Basket rules of size greater than 4
inspect(subset(basket_rules, size(basket_rules)>4))
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## tropical fruit,
## root vegetables,
## other vegetables} => {whole milk} 0.003152008 0.7045455 0.004473818 2.757344 31
## [2] {citrus fruit,
## tropical fruit,
## root vegetables,
## whole milk} => {other vegetables} 0.003152008 0.8857143 0.003558719 4.577509 31
## [3] {citrus fruit,
## tropical fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.003152008 0.6326531 0.004982206 5.804238 31
## [4] {citrus fruit,
## root vegetables,
## other vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 0.005795628 5.183004 31
## [5] {tropical fruit,
## root vegetables,
## other vegetables,
## yogurt} => {whole milk} 0.003558719 0.7142857 0.004982206 2.795464 35
## [6] {tropical fruit,
## root vegetables,
## whole milk,
## yogurt} => {other vegetables} 0.003558719 0.6250000 0.005693950 3.230097 35
## [7] {tropical fruit,
## root vegetables,
## other vegetables,
## whole milk} => {yogurt} 0.003558719 0.5072464 0.007015760 3.636128 35
Find the subset of rules with lift greater than 5:
inspect(subset(basket_rules, lift>5))
## lhs rhs support confidence coverage lift count
## [1] {citrus fruit,
## tropical fruit,
## other vegetables,
## whole milk} => {root vegetables} 0.003152008 0.6326531 0.004982206 5.804238 31
## [2] {citrus fruit,
## root vegetables,
## other vegetables,
## whole milk} => {tropical fruit} 0.003152008 0.5438596 0.005795628 5.183004 31
Now find the subset of rules that have yogurt on the right-hand side. Here we also require the lift to exceed 3.5.
yogurt.rhs <- subset(basket_rules, subset = rhs %in% "yogurt" & lift>3.5)
Now inspect this subset of rules:
inspect(yogurt.rhs)
## lhs rhs support confidence coverage lift count
## [1] {whipped/sour cream,
## cream cheese } => {yogurt} 0.003355363 0.5238095 0.006405694 3.754859 33
## [2] {root vegetables,
## cream cheese } => {yogurt} 0.003762074 0.5000000 0.007524148 3.584184 37
## [3] {tropical fruit,
## curd} => {yogurt} 0.005287239 0.5148515 0.010269446 3.690645 52
## [4] {other vegetables,
## whole milk,
## cream cheese } => {yogurt} 0.003457041 0.5151515 0.006710727 3.692795 34
## [5] {tropical fruit,
## whole milk,
## curd} => {yogurt} 0.003965430 0.6093750 0.006507372 4.368224 39
## [6] {tropical fruit,
## other vegetables,
## butter} => {yogurt} 0.003050330 0.5555556 0.005490595 3.982426 30
## [7] {tropical fruit,
## whole milk,
## butter} => {yogurt} 0.003355363 0.5409836 0.006202339 3.877969 33
## [8] {tropical fruit,
## whole milk,
## whipped/sour cream} => {yogurt} 0.004372140 0.5512821 0.007930859 3.951792 43
## [9] {tropical fruit,
## root vegetables,
## other vegetables,
## whole milk} => {yogurt} 0.003558719 0.5072464 0.007015760 3.636128 35
Now find the subset of rules that have meat on the left-hand side. Here we also require the lift to exceed 2.
meat_lhs <- subset(basket_rules, subset = lhs %in% "meat" & lift>2)
Now inspect this subset of rules:
inspect(meat_lhs)
## lhs rhs support confidence coverage
## [1] {meat,root vegetables} => {whole milk} 0.003152008 0.62 0.005083884
## lift count
## [1] 2.426462 31
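The `%in%` operator used in these subsets matches rules containing any of the listed items. arules also provides `%ain%` (all listed items must be present) and `%pin%` (partial matching on item labels). The sketch below compares the three on the left-hand side; the chosen items are only an example, and the mining step is repeated so the block runs on its own.

```r
library(arules)
data("Groceries")

basket_rules <- apriori(Groceries,
                        parameter = list(support = 0.003, confidence = 0.5,
                                         target = "rules"),
                        control = list(verbose = FALSE))

# %in%: LHS contains at least one of the listed items
any_match <- subset(basket_rules, lhs %in% c("tropical fruit", "curd"))

# %ain%: LHS contains all of the listed items
all_match <- subset(basket_rules, lhs %ain% c("tropical fruit", "curd"))

# %pin%: LHS contains an item whose label partially matches "fruit"
partial_match <- subset(basket_rules, lhs %pin% "fruit")

c(any = length(any_match), all = length(all_match),
  partial = length(partial_match))
```

Because `%ain%` is the stricter condition, it can never return more rules than `%in%` for the same item list.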
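We can use the arulesViz package to visualize the rules. For a more complete introduction, see the package manual.
# Install arulesViz only if it is not already available
if (!requireNamespace('arulesViz', quietly = TRUE)) install.packages('arulesViz')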
library('arulesViz')
plot(basket_rules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The plot function has an interactive mode for you to inspect individual rules:
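# In recent arulesViz versions the `interactive` argument is deprecated;
# plot(basket_rules, engine = "htmlwidget") is the documented alternative.
plot(basket_rules, interactive=TRUE)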
Graph-based visualization is only practical for very small sets of rules. In the plot below, the vertices represent items, shown for the 10 rules with the highest lift:
plot(head(sort(basket_rules, by="lift"), 10), method = "graph")
The package comes with an approach to cluster association rules and itemsets:
plot(basket_rules, method="grouped")
For the Cincinnati Zoo data, use the following code to load the transaction data for association rule mining. The as() function coerces the data set into the transactions data type. In the zoo data, the support for the rules is relatively low, with a maximum support of no more than 3%.
TransFood <- read.csv('https://yanyudm.github.io/Data-Mining-R/data/food_4_association.csv')
TransFood <- TransFood[, -1]
# Find elements that are not equal to 0 or 1 and set them to 1.
Others <- which(!(as.matrix(TransFood) == 1 | as.matrix(TransFood) == 0), arr.ind = TRUE)
TransFood[Others] <- 1
TransFood <- as(as.matrix(TransFood), "transactions")
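The cleaning and coercion steps can be tried on a small matrix without downloading anything. In the sketch below, the matrix `m` and its item names are invented for illustration. It is converted to a logical matrix before coercion, which is a slight variation on the numeric 0/1 matrix used above.

```r
library(arules)

# Hypothetical purchase matrix: rows are transactions, columns are items.
# The value 2 stands in for a stray entry that is neither 0 nor 1.
m <- matrix(c(1, 0, 2,
              0, 1, 1,
              1, 1, 0),
            nrow = 3, byrow = TRUE,
            dimnames = list(NULL, c("hotdog", "soda", "fries")))

# Set every entry that is not 0 or 1 to 1
m[!(m %in% c(0, 1))] <- 1

# Coerce the incidence matrix to the transactions class
toy_trans <- as(m == 1, "transactions")
inspect(toy_trans)
size(toy_trans)
```

After cleaning, each of the three toy transactions contains two items, so `size(toy_trans)` returns 2 for every row.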
The figures above are borrowed from kdnuggets.com.