Posted under » Python Data Analysis on 13 June 2023
Initially I thought get_dummies was for generating permutations and combinations, e.g. this website.
I often create a list of labels prefixed with a, b and c, with the numbers zero-padded to three digits, like this for 200 data items.
questions = []
# a001 to a070
for i in range(1, 71):
    questions.append(f"a{i:03d}")
# b071 to b170
for i in range(71, 171):
    questions.append(f"b{i:03d}")
# c171 to c200
for i in range(171, 201):
    questions.append(f"c{i:03d}")
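The same list can be built in one pass. This is only a sketch of an alternative; the `ranges` name is my own and not part of the original snippet.

```python
# Each tuple is (prefix, first number, one past the last number).
# `ranges` is a hypothetical helper name used only for this illustration.
ranges = [("a", 1, 71), ("b", 71, 171), ("c", 171, 201)]

# :03d pads the number with zeroes to three digits, e.g. 7 becomes 007
questions = [
    f"{prefix}{i:03d}"
    for prefix, start, stop in ranges
    for i in range(start, stop)
]

print(len(questions), questions[0], questions[-1])
```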
The pandas documentation explains dummies in what reads like Greek: "each variable is converted in as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value". get_dummies is mainly used for machine learning.
If you don't understand it, you are not alone. Best to show you what it means.
>>> import pandas as pd
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False
The Series holds 4 items (a, b, c, a) but only 3 unique values (a, b and c), so the output has 3 columns and 4 rows. Row 0 is 'a', so column a is True there while b and c are False. Column a is True twice because 'a' appears twice in 'abca'. Real data often has many repetitions, so a column will contain several True values; in the extreme case a value appears only once and its column has a single True. Counting the True values in each column therefore shows which values are 'popular'.
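To make the 'popular' idea concrete, here is a small sketch that totals each dummy column. The variable names `dummies` and `counts` are my own.

```python
import pandas as pd

s = pd.Series(list("abca"))
dummies = pd.get_dummies(s)

# True counts as 1 when summed, so each column total is the number
# of times that value occurs in the original Series
counts = dummies.sum()
print(counts)

# value_counts() gives the same tallies straight from the Series
print(s.value_counts())
```

If all you need is the tallies, value_counts() is the shorter route; the dummy columns are more useful when each row has to stay separate.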
A slightly more complex example:
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'], 'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1    True   False   False    True   False
1  2   False    True    True   False   False
2  3    True   False   False   False    True
prefix renames the dummy columns: those made from column A start with col1 and those made from column B start with col2. get_dummies finds just 2 distinct values (a and b) in A, so col1 gets 2 columns. In B it finds 3 values (a, b and c), so col2 gets 3 columns. Column C holds numbers, so it is left as it is and takes no part in the True/False encoding.
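For comparison, a short sketch of what happens without prefix, and how to get 0/1 numbers instead of True/False. The names `default_names` and `as_ints` are mine; dtype is a documented get_dummies parameter.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["b", "a", "c"], "C": [1, 2, 3]})

# No prefix given: the original column names A and B are prepended
default_names = pd.get_dummies(df)
print(list(default_names.columns))

# dtype=int asks for 0/1 integers instead of True/False
as_ints = pd.get_dummies(df, prefix=["col1", "col2"], dtype=int)
print(as_ints)
```

The 0/1 form is the one the quoted definition describes, and it is usually what a machine learning library expects as input.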
pd.get_dummies can create dummy variables from a pandas Series, or from one or more columns of a pandas DataFrame. It is often used for creating a sample data frame.
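A minimal sketch of the three kinds of input, reusing the DataFrame from the earlier example. The variable names `only_a` and `just_b` are my own.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["b", "a", "c"], "C": [1, 2, 3]})

# columns= limits the encoding to the listed columns;
# B and C pass through unchanged
only_a = pd.get_dummies(df, columns=["A"])
print(list(only_a.columns))

# A single column is a Series, so the dummy columns are named
# after the values alone, with no prefix
just_b = pd.get_dummies(df["B"])
print(list(just_b.columns))
```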
You can also create sample data using ones.