Hi, I found a solution. The dataset here is 'mobile-price-prediction'.
df['os'].unique()
array(['Android 10', 'Android Pie 9.0', 'Android Pie 9',
       'Android Oreo 8.1', 'Android Pie 10', 'Android Nougat 7.1.1',
       'Android Oreo 8.0', 'Android Nougat 7.1.2', 'Android KitKat 4.4.2',
       'Android Marshmallow 6.0.1', 'Android Nougat 7.1',
       'Android Marshmallow 6', 'Android Nougat 7',
       'Android Lollipop 5.4.1', 'Android Oreo 8.1.0', 'Android Oreo 8',
       'Android Lollipop 5.1', 'Android Lollipop 5.1.1'], dtype=object)
For the analysis, we need only the major version number, i.e. from 'Android Nougat 7.1.1' we should get just 7, and likewise for the others. First, strip the 'Android' prefix and the codename:
df['os'] = df['os'].str.replace(r'^Android\s[a-zA-Z]*\s?', '', regex=True)
df['os'].unique()
array(['10', '9.0', '9', '8.1', '7.1.1', '8.0', '7.1.2', '4.4.2', '6.0.1',
       '7.1', '6', '7', '5.4.1', '8.1.0', '8', '5.1', '5.1.1'],
      dtype=object)
With the code above, we replaced 'Android' and the codename following it (if any) with the empty string '', i.e. we removed them. This is how the feature looks now:
df['os'].value_counts()
9 126
8.1 96
10 84
9.0 82
7.1.2 14
8.1.0 13
7.1 12
8 7
7 7
8.0 6
5.1 3
6 3
6.0.1 2
5.4.1 1
4.4.2 1
7.1.1 1
5.1.1 1
Name: os, dtype: int64
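To check the pattern on its own, here is a small self-contained sketch that applies the same regex to a hand-made Series. The sample values are copied from the unique() output above; the names `sample` and `stripped` are only for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical mini-sample; values taken from the unique() output above
sample = pd.Series(['Android 10', 'Android Pie 9.0', 'Android Nougat 7.1.1'])

# '^Android\s' matches the literal prefix and the space after it,
# '[a-zA-Z]*'  matches the optional codename (Pie, Nougat, ...),
# '\s?'        matches the space after the codename, if there is one
stripped = sample.str.replace(r'^Android\s[a-zA-Z]*\s?', '', regex=True)
print(stripped.tolist())  # ['10', '9.0', '7.1.1']
```

Because the codename part uses `*`, rows with no codename (like 'Android 10') are handled by the same pattern.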
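Now we need to drop everything after the major version, i.e. get just 5 from 5.1.1. str.partition('.') splits each value at the first '.' into three parts: the text before it, the separator itself, and the text after it.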
df[['os','hoax1','hoax2']]=df['os'].str.partition('.')
Above, we created two throwaway columns (hoax1 and hoax2) to hold the separator and the trailing part; now we drop them:
df.drop(['hoax1','hoax2'],axis=1,inplace=True)
df['os'].value_counts()
9 208
8 122
10 84
7 34
6 5
5 5
4 1
Name: os, dtype: int64
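As a possible alternative to the replace + partition + drop steps, the major version can also be pulled straight from the original strings with str.extract, which avoids the temporary columns. This is only a sketch under the assumption that the first run of digits in each value is always the major version (true for the values listed above); `raw` and `major` are made-up names.

```python
import pandas as pd

# Hypothetical sample of the original, unmodified 'os' values
raw = pd.Series(['Android 10', 'Android Pie 9.0', 'Android Nougat 7.1.1',
                 'Android Oreo 8.1.0', 'Android KitKat 4.4.2'])

# (\d+) captures the first run of digits in each string;
# expand=False returns a Series instead of a one-column DataFrame
major = raw.str.extract(r'(\d+)', expand=False).astype(int)
print(major.tolist())  # [10, 9, 7, 8, 4]
```

Casting to int is optional, but it makes the versions sort numerically (10 after 9) instead of as strings.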