Finding emojis in text using regex

When using regex on normal text, we can build patterns from metacharacters, sets (character sets, not the data structure) and special sequences. For emojis, however, we need to know their code points in order to separate them from the rest of the data.
When it comes to emojis, we usually think only of the frequently used face emojis (emoticons), but those are just one category. Below are the emoji categories and their code point ranges (in Python source format):

Emoji categories:
  1. "\U0001F1E0-\U0001F1FF" # flags (iOS)
  2. "\U0001F300-\U0001F5FF" # symbols & pictographs
  3. "\U0001F600-\U0001F64F" # emoticons
  4. "\U0001F680-\U0001F6FF" # transport & map symbols
  5. "\U0001F700-\U0001F77F" # alchemical symbols
  6. "\U0001F780-\U0001F7FF" # Geometric Shapes Extended
  7. "\U0001F800-\U0001F8FF" # Supplemental Arrows-C
  8. "\U0001F900-\U0001F9FF" # Supplemental Symbols and Pictographs
  9. "\U0001FA00-\U0001FA6F" # Chess Symbols
  10. "\U0001FA70-\U0001FAFF" # Symbols and Pictographs Extended-A
  11. "\U00002702-\U000027B0" # Dingbats

You can use these ranges to find different symbols.
For example:

import re
s = "This line contains 😀"
# Match any single character in the emoticons range
res = re.findall("[\U0001F600-\U0001F64F]", s)
print(res)  # ['😀']
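
Building on the single-range search, here is a rough sketch that labels each match with the category it came from. The `CATEGORIES` dict and the `find_by_category` function are names made up for this illustration (they are not part of `re`), and only a few of the ranges from the list are included.

```python
import re

# A few of the ranges from the list above, keyed by category name.
# The dict name and its labels are illustrative, not from any library.
CATEGORIES = {
    "emoticons": "\U0001F600-\U0001F64F",
    "symbols & pictographs": "\U0001F300-\U0001F5FF",
    "transport & map symbols": "\U0001F680-\U0001F6FF",
    "dingbats": "\U00002702-\U000027B0",
}

def find_by_category(text):
    """Return {category: [matches]} for every category with at least one match."""
    result = {}
    for name, char_range in CATEGORIES.items():
        # Wrap the range in [] to turn it into a regex character set
        matches = re.findall(f"[{char_range}]", text)
        if matches:
            result[name] = matches
    return result

print(find_by_category("Launch 🚀 went well 😀"))
```

Categories with no match are left out of the result, so an empty dict means none of the listed ranges matched.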

You can also use "|" to search across multiple categories at once.

import re
s = "This line contains 😀 🌀"
# Match a character from either the emoticons or the symbols & pictographs range
res = re.findall("[\U0001F600-\U0001F64F]|[\U0001F300-\U0001F5FF]", s)
print(res)  # ['😀', '🌀']
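
As an alternative to joining sets with "|", several ranges can sit inside one character set. The sketch below assumes that approach, compiles the pattern once, and reuses it both to extract and to strip emojis. `EMOJI_PATTERN` is just a name chosen for this example, and it covers only three of the ranges listed above.

```python
import re

# One character set holding several ranges; adjacent string literals
# are concatenated by Python, which leaves room for a comment per range.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "]"
)

s = "This line contains 😀\t🌀 and 🚀"

found = EMOJI_PATTERN.findall(s)     # extract the emojis
stripped = EMOJI_PATTERN.sub("", s)  # remove them from the text

print(found)
print(repr(stripped))
```

More ranges from the list can be added as extra lines inside the brackets.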

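The "\UXXXXXXXX" values used here are Unicode code points written in Python's 8-digit string escape format (the same numeric value as in UTF-32). You should be able to find most of the frequently used emojis with these ranges, but if one is not matched, look up its code point and add the range it falls in, or check that the text was decoded correctly.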
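
To see which range an unmatched emoji belongs to, one option is to print its code points in the same "\U" format as the list. This is a minimal sketch using the built-in `ord`; the helper name `code_points` is made up for this example.

```python
def code_points(text):
    """Return each character's code point in Python's \\UXXXXXXXX escape form."""
    return [f"\\U{ord(ch):08X}" for ch in text]

print(code_points("😀"))
print(code_points("❤️"))
```

Some emojis are made of more than one code point (for example a base symbol followed by the variation selector U+FE0F), so the helper returns a list instead of a single value.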

@kharshavardhan31

Thank you for your valuable contribution.
Keep sharing more interesting things with us and others also.
Keep Learning and shining :star2:.