Regular Expressions 101

Community Library Entry

Regular Expression
ECMAScript (JavaScript)

/
((?![\u{23}-\u1F6F3]([^\u{FE0F}]|$))\p{Emoji}(?:(?!\u{200D})\p{EComp}|(?=\u{200D})\u{200D}\p{Emoji})*)
/
gmu
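
As a rough usage sketch in JavaScript, the pattern and flags above can be pasted into a regex literal exactly as printed. The helper name `findEmoji` and the sample strings are made up for illustration and are not part of the entry. Note that a `u`-mode engine reads `\u1F6F3` as `\u1F6F` followed by a literal `3`, so the behaviour shown is that of the pattern as written.

```javascript
// Pattern and flags copied verbatim from the entry above.
const emojiPattern = /((?![\u{23}-\u1F6F3]([^\u{FE0F}]|$))\p{Emoji}(?:(?!\u{200D})\p{EComp}|(?=\u{200D})\u{200D}\p{Emoji})*)/gmu;

// Hypothetical helper: collect every full match found in `text`.
function findEmoji(text) {
  // matchAll works on a clone of the regex, so emojiPattern.lastIndex is left alone.
  return Array.from(text.matchAll(emojiPattern), (m) => m[0]);
}

// Phoenix: bird (U+1F426) + ZWJ (U+200D) + fire (U+1F525).
const phoenix = "\u{1F426}\u200D\u{1F525}";

const inSentence = findEmoji(`rose from the ashes ${phoenix} today`);
console.log(inSentence.length === 1 && inSentence[0] === phoenix); // true

// Solitary digits and '#' are skipped even though \p{Emoji} includes them.
console.log(findEmoji("call 911 # now").length); // 0
```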

Description

Purpose

I wanted to make a regex that just works. This is it for ECMAScript engines.

Capabilities

Matches all 5024 emoji listed in the Unicode Consortium's emoji-test.txt as of 6/14/2024 (thank you not, Apple Intelligence).

This regex also rejects glyphs that must be part of a grapheme cluster when they appear on their own (more on this in the "Implementation" section).

-- These and similar fail
1 2 3 4 5 6 7 8 9 # * ‼ ↔

-- These succeed
Basic: 😀
Basic + Modifier: 🦸🏾
Basic + ZWJ + Basic: 🐦‍🔥
Basic + Modifier + ZWJ + Basic: ❤️‍🔥
Basic + ZWJ + Basic + Modifier: 🐻‍❄️
Basic + Modifier + ZWJ + Basic + Modifier + ZWJ + Basic + ZWJ + Basic + Modifier: 👩🏼‍❤️‍💋‍👩🏿

Here ZWJ means "Zero-Width Joiner", the Unicode character U+200D, which joins two separate emojis into a single one, e.g.:

😮‍💨 = 😮 (U+1F62E) + ZWJ (U+200D) + 💨 (U+1F4A8)
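
The composition above can be reproduced directly from the code points. This is a small sketch; the variable names are only for illustration.

```javascript
// Build "face exhaling" from its three code points: face, ZWJ, dashing-away puff.
const faceExhaling = String.fromCodePoint(0x1F62E, 0x200D, 0x1F4A8);

// Iterating a string walks it by code point, not by UTF-16 code unit.
const codePoints = Array.from(
  faceExhaling,
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase()
);

console.log(codePoints);          // [ 'U+1F62E', 'U+200D', 'U+1F4A8' ]
console.log(faceExhaling.length); // 5 (two surrogate pairs plus the ZWJ)
```

Whether the result renders as one glyph depends on the font; the string itself is always the three code points shown.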

Implementation

In order to make this expression robust against new emojis being created, I used the inherent Unicode structure of emojis to validate the string.

Emojis have the following structure:

-- BEGIN

\p{Emoji} -- Class of basic, single-character emoji

-- BEGIN Optional Section

-- Case 1: Any number of non-ZWJ modifiers (skin, hair, simple-grapheme modifier, etc.)
-- < negative look ahead for ZWJ >

\p{Emoji_Component}+

-- Case 2: ZWJ followed by basic emoji
-- < check for ZWJ >

\p{Emoji} -- We've composed a new emoji!

-- END Optional Section

-- * Repeat the optional section as many times as possible to get the longest chain of emojis joined by ZWJs

-- END
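
The structure above can be sketched as a regex assembled from named pieces. This is only the core, without the leading negative lookahead described further down, and the constant and function names are my own. Case 1 is written as a single `\p{Emoji_Component}` because the outer `*` already supplies the repetition.

```javascript
// Basic, single-character emoji.
const BASE = String.raw`\p{Emoji}`;
// Case 1: a component that is not the ZWJ (the ZWJ is itself an Emoji_Component).
const MODIFIER = String.raw`(?!\u{200D})\p{Emoji_Component}`;
// Case 2: a ZWJ followed by another basic emoji.
const JOINED = String.raw`\u{200D}\p{Emoji}`;

const structure = new RegExp(`${BASE}(?:${MODIFIER}|${JOINED})*`, "gu");

// Hypothetical helper: true when the whole string is exactly one match.
function isSingleMatch(text) {
  const found = text.match(structure);
  return found !== null && found.length === 1 && found[0] === text;
}

// Superhero + medium-dark skin tone.
console.log(isSingleMatch("\u{1F9B8}\u{1F3FE}")); // true
// Woman + tone, ZWJ, heart + FE0F, ZWJ, kiss mark, ZWJ, woman + tone.
const kiss = "\u{1F469}\u{1F3FC}\u200D\u2764\uFE0F\u200D\u{1F48B}\u200D\u{1F469}\u{1F3FF}";
console.log(isSingleMatch(kiss)); // true
// Without the lookahead, a lone digit also matches, because digits are in \p{Emoji}.
console.log(isSingleMatch("7")); // true
```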

* The characters defined by \p{Emoji} also include some that are not generally considered emojis, like ©, ❄, or ✔. These glyphs may even be used to compose new emojis, as in the case of

🏋‍♂ = 🏋 (U+1F3CB) + ZWJ (U+200D) + ♂ (U+2642)

When they are not part of a larger grapheme cluster, this regex rejects these glyphs. That's what the first negative lookahead checks: if the match starts with one of these glyphs, the code point that follows must be the variation selector (U+FE0F) that gives it emoji presentation.

This variation selector is what turns ✔ into ✔️.
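
The lookahead idea can be sketched on its own with a small stand-in character class. The class below is a hand-picked illustration (an assumption, not the range used in the entry), and the names `TEXT_DEFAULT`, `guarded` and `matchesOf` are hypothetical.

```javascript
// Illustrative set of text-default characters: # * 0-9, copyright, snowflake, check mark.
const TEXT_DEFAULT = String.raw`[#*0-9\u{A9}\u{2744}\u{2714}]`;

const guarded = new RegExp(
  // Reject a start character from the set unless U+FE0F comes right after it.
  String.raw`(?!` + TEXT_DEFAULT + String.raw`(?:[^\u{FE0F}]|$))` +
  // Then the same core structure: base emoji plus modifiers or ZWJ-joined emoji.
  String.raw`\p{Emoji}(?:(?!\u{200D})\p{Emoji_Component}|\u{200D}\p{Emoji})*`,
  "gmu"
);

// Hypothetical helper: all matches in `text`, or an empty array.
function matchesOf(text) {
  return text.match(guarded) || [];
}

console.log(matchesOf("\u2714").length);        // 0: bare check mark is rejected
console.log(matchesOf("\u2714\uFE0F").length);  // 1: check mark + U+FE0F is accepted
console.log(matchesOf("7").length);             // 0: bare digit is rejected
console.log(matchesOf("7\uFE0F\u20E3").length); // 1: keycap sequence is accepted
```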

Also of note, there are some glyphs in this range that do act as conventional emojis, like ✅ (U+2705). These can also be written as ✅‍ (U+2705 U+200D), with a ZWJ added at the end. If you continue adding ZWJs, the rendering doesn't change, but you will have more characters to backspace through (at least on my MacBook).

This logic only matters when the glyph is at the beginning of the match; otherwise it will be preceded by a ZWJ.

Longevity

So long as emojis are represented in the format specified above, this regex will be robust against new emojis being created because it uses character classes instead of fixed code point ranges.
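
The difference between a property class and a fixed range can be sketched as follows. The hard-coded Emoticons-block range is a hypothetical stand-in for the kind of pattern this entry avoids. Keep in mind that what `\p{Emoji}` covers depends on the Unicode version shipped with the engine, so the newest emojis are only recognized once the engine's data has been updated.

```javascript
// Hypothetical fixed range: the Emoticons block only (U+1F600 to U+1F64F).
const fixedRange = /[\u{1F600}-\u{1F64F}]/u;
// Property class: whatever the engine's Unicode data marks as Emoji.
const byProperty = /\p{Emoji}/u;

const grinning = "\u{1F600}";  // inside the Emoticons block
const superhero = "\u{1F9B8}"; // added later, outside that block

console.log(fixedRange.test(grinning), byProperty.test(grinning));   // true true
console.log(fixedRange.test(superhero), byProperty.test(superhero)); // false true
```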

Submitted by anonymous - 6 months ago