Help me extract just a part of a string

For a feature 'os' in my df, how can I extract just a part of each value?
E.g. for Android 10.1, how do I get just 10 and put it back into the df?
Similarly, for Android 7.1.1, I want to extract just 7.

Below are the unique values in df['os']. The expected output, written back to df['os'], is [10, 9, 9, 8, 10, 7, …].
['Android 10', 'Android Pie 9.0', 'Android Pie 9',
'Android Oreo 8.1', 'Android Pie 10', 'Android Nougat 7.1.1',
'Android Oreo 8.0', 'Android Nougat 7.1.2', 'Android KitKat 4.4.2',
'Android Marshmallow 6.0.1', 'Android Nougat 7.1',
'Android Marshmallow 6', 'Android Nougat 7',
'Android Lollipop 5.4.1', 'Android Oreo 8.1.0', 'Android Oreo 8',
'Android Lollipop 5.1', 'Android Lollipop 5.1.1']

@niks.baldawa

Try extracting the data using any one of the following:
1. iloc
2. iat
3. loc
4. at
5. data_frame.values[]

Hi, I found a solution. The dataset here is 'mobile-price-prediction'.

df['os'].unique()

array(['Android 10', 'Android Pie 9.0', 'Android Pie 9',
'Android Oreo 8.1', 'Android Pie 10', 'Android Nougat 7.1.1',
'Android Oreo 8.0', 'Android Nougat 7.1.2', 'Android KitKat 4.4.2',
'Android Marshmallow 6.0.1', 'Android Nougat 7.1',
'Android Marshmallow 6', 'Android Nougat 7',
'Android Lollipop 5.4.1', 'Android Oreo 8.1.0', 'Android Oreo 8',
'Android Lollipop 5.1', 'Android Lollipop 5.1.1'], dtype=object)

For analysis, we need to extract just the major version, i.e. from Android 7.1.1 we should get only 7, and likewise for the other values.

df['os']=df['os'].str.replace(r'^Android\s[a-zA-Z]*\s?',r'',regex=True)
df['os'].unique()

array(['10', '9.0', '9', '8.1', '7.1.1', '8.0', '7.1.2', '4.4.2', '6.0.1',
'7.1', '6', '7', '5.4.1', '8.1.0', '8', '5.1', '5.1.1'],
dtype=object)

With the above code, we replaced 'Android' and the word following it (if any) with an empty string '', i.e. we removed them. This is how the feature looks now:

df['os'].value_counts()

9 126
8.1 96
10 84
9.0 82
7.1.2 14
8.1.0 13
7.1 12
8 7
7 7
8.0 6
5.1 3
6 3
6.0.1 2
5.4.1 1
4.4.2 1
7.1.1 1
5.1.1 1
Name: os, dtype: int64
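
To see what the prefix-stripping pattern above matches, here is a minimal sketch that applies the same regex to a few plain strings with Python's re module instead of pandas. The sample list and the name PREFIX are only for illustration.

```python
import re

# Same pattern as in the post: "Android", one whitespace character,
# an optional codename made of letters, then an optional whitespace character.
PREFIX = re.compile(r'^Android\s[a-zA-Z]*\s?')

# Illustrative sample values taken from the unique() output above.
samples = ['Android 10', 'Android Pie 9.0', 'Android Nougat 7.1.1']

# Replace the matched prefix with an empty string, leaving only the version.
stripped = [PREFIX.sub('', s) for s in samples]
print(stripped)  # ['10', '9.0', '7.1.1']
```

Because the codename part uses `*` and the second space uses `?`, values without a codename such as 'Android 10' are handled by the same pattern.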

Now we need to remove the trailing part of each value, i.e. get just 5 from 5.1.1.

df[['os','hoax1','hoax2']]=df['os'].str.partition('.')

Above, we created two dummy columns ('hoax1' and 'hoax2') to hold the separator and the trailing part; now we will drop them.

df.drop(['hoax1','hoax2'],axis=1,inplace=True)
df['os'].value_counts()

9 208
8 122
10 84
7 34
6 5
5 5
4 1
Name: os, dtype: int64
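
As a possible shortcut, the same result can probably be reached without the helper columns. The sketch below is not the approach used above; it assumes every value contains at least one digit, and it uses a small made-up sample in place of the real dataset. It shows a one-step str.extract, the two-step route with str.partition kept to its first piece, and a conversion to integers, since the question asked for numbers like [10, 9, 9, 8, …].

```python
import pandas as pd

# Hypothetical sample; the real column comes from the mobile-price-prediction dataset.
df = pd.DataFrame({'os': ['Android 10', 'Android Pie 9.0',
                          'Android Nougat 7.1.1', 'Android KitKat 4.4.2']})

# One step: capture the first run of digits in each value.
major = df['os'].str.extract(r'(\d+)', expand=False)

# Two-step route from the thread, without helper columns:
# strip the prefix, then keep only the piece before the first dot.
two_step = (df['os']
            .str.replace(r'^Android\s[a-zA-Z]*\s?', '', regex=True)
            .str.partition('.')[0])

# Store the major version back as integers (assumes no missing values).
df['os'] = major.astype(int)
print(df['os'].tolist())  # [10, 9, 7, 4]
```

str.partition returns a three-column frame (before, separator, after), so selecting column 0 keeps only the part before the first dot and nothing needs to be dropped afterwards.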

@niks.baldawa

Awesome attempt. This was a really interesting bit of data wrangling to learn and apply.