Regular Expressions 101

Community Library Entry

Regular Expression
ECMAScript (JavaScript)

((?![\u{23}-\u1F6F3]([^\u{FE0F}]|$))\p{Emoji}(?:(?!\u{200D})\p{EComp}|(?=\u{200D})\u{200D}\p{Emoji})*)

gmu

Description

Purpose

I wanted to make a regex that just works. This is it for ECMAScript engines.

Capabilities

Matches all 5024 emoji listed in the official Unicode website's emoji-test.txt as of 6/14/2024 (thank you not, Apple Intelligence).

This regex also rejects glyphs that must be part of a grapheme cluster but appear on their own (more on this in the "Implementation" section).

-- These and similar fail
1 2 3 4 5 6 7 8 9 # * ‼ ↔

-- These succeed
Basic: 😀
Basic + Modifier: 🦸🏾
Basic + ZWJ + Basic: 🐦‍🔥
Basic + Modifier + ZWJ + Basic: ❤️‍🔥
Basic + ZWJ + Basic + Modifier: 🐻‍❄️
Basic + Modifier + ZWJ + Basic + Modifier + ZWJ + Basic + ZWJ + Basic + Modifier: 👩🏼‍❤️‍💋‍👩🏿

Where ZWJ means "Zero-Width Joiner," the Unicode character U+200D, which composes two separate emojis into one, e.g.:

😮‍💨 = 😮 (U+1F62E) + (U+200D) + 💨 (U+1F4A8)
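A minimal usage sketch, assuming a JavaScript engine that supports Unicode property escapes under the `u` flag. The pattern is copied verbatim from this entry; the helper name `findEmojis` is only illustrative and not part of the submission.

```javascript
// Pattern and flags copied verbatim from this entry.
const EMOJI_RE = /((?![\u{23}-\u1F6F3]([^\u{FE0F}]|$))\p{Emoji}(?:(?!\u{200D})\p{EComp}|(?=\u{200D})\u{200D}\p{Emoji})*)/gmu;

// Hypothetical helper: collect every full match found in `text`.
function findEmojis(text) {
  return Array.from(text.matchAll(EMOJI_RE), (m) => m[0]);
}

// Build the ZWJ sequence shown above from its three code points.
const exhaling = String.fromCodePoint(0x1F62E, 0x200D, 0x1F4A8);

// The whole ZWJ sequence comes back as a single match.
console.log(findEmojis(`hi ${exhaling} there`));

// Plain text without emoji yields no matches.
console.log(findEmojis("no emoji here"));
```

`matchAll` is used here because it works on a copy of the regex, so the `g` flag's `lastIndex` state does not leak between calls.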

Implementation

In order to make this expression robust against new emojis being created, I used the inherent Unicode structure of emojis to validate the string.

Emojis have the following structure:

-- BEGIN

\p{Emoji} -- Class of basic, single-character emoji

-- BEGIN Optional Section

-- Case 1: Any number of non-ZWJ modifiers (skin, hair, simple-grapheme modifier, etc.)
-- < negative look ahead for ZWJ >

\p{Emoji_Component}+

-- Case 2: ZWJ followed by basic emoji
-- < check for ZWJ >

\p{Emoji} -- We've composed a new emoji!

-- END Optional Section

-- * Repeat the optional section as many times as possible to get the longest chain of emojis joined by ZWJs

-- END
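The structure above can be sketched by assembling the pattern from named pieces. This is only an illustration: the constant names are mine, and the sketch deliberately leaves out the leading negative lookahead of the submitted regex, so it shows the chaining logic alone.

```javascript
// Illustrative building blocks (names are not part of the entry).
const BASE = String.raw`\p{Emoji}`;                        // basic, single-character emoji
const ZWJ = String.raw`\u{200D}`;                          // Zero-Width Joiner
const MODIFIER = String.raw`(?!${ZWJ})\p{Emoji_Component}`; // Case 1: a non-ZWJ component
const JOINED = `${ZWJ}${BASE}`;                            // Case 2: ZWJ followed by a basic emoji

// BASE followed by the optional section, repeated as often as possible.
const CHAIN = new RegExp(`${BASE}(?:${MODIFIER}|${JOINED})*`, "gu");

const chainMatches = (text) => text.match(CHAIN) ?? [];

// The longest example from the list above, built from its code points.
const kiss = String.fromCodePoint(
  0x1F469, 0x1F3FC, 0x200D, 0x2764, 0xFE0F,
  0x200D, 0x1F48B, 0x200D, 0x1F469, 0x1F3FF
);

// The whole chain is consumed as one match.
console.log(chainMatches(`x${kiss}y`).length);

// Without the leading lookahead, a bare digit also matches,
// because digits carry the Emoji property.
console.log(chainMatches("7"));
```

The explicit `(?!\u{200D})` in `MODIFIER` is needed because the ZWJ itself belongs to `\p{Emoji_Component}`; without it, Case 1 could swallow a joiner that Case 2 should handle.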

* The class \p{Emoji} also contains characters that are not generally considered emojis, such as ©, ❄, or ✔. These glyphs may even be used to compose new emojis, as in the case of

🏋‍♂ = 🏋 (U+1F3CB) + (U+200D) + ♂ (U+2642)

When one of these glyphs is not part of a larger grapheme cluster, this regex rejects it. That's what the first negative lookahead checks: if the match starts on one of these glyphs, the code point that follows must be the variation selector (U+FE0F) that these glyphs require.

This variation is what turns ✔ into ✔️.
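A small sketch of the lookahead's effect, using keycap characters (digits and #) as the solitary glyphs. The pattern is copied verbatim from this entry; the names `matches`, `VS16`, and `KEYCAP` are illustrative assumptions, not part of the submission.

```javascript
// Pattern and flags copied verbatim from this entry.
const EMOJI_RE = /((?![\u{23}-\u1F6F3]([^\u{FE0F}]|$))\p{Emoji}(?:(?!\u{200D})\p{EComp}|(?=\u{200D})\u{200D}\p{Emoji})*)/gmu;

// String.prototype.match resets lastIndex for global regexes,
// so repeated calls are independent of each other.
const matches = (text) => text.match(EMOJI_RE) ?? [];

const VS16 = "\u{FE0F}";   // the variation selector discussed above
const KEYCAP = "\u{20E3}"; // combining enclosing keycap

// Solitary keycap bases are rejected by the lookahead.
console.log(matches("1").length);
console.log(matches("#").length);

// Followed by U+FE0F, the same characters start a valid match.
console.log(matches("1" + VS16 + KEYCAP).length);
console.log(matches("#" + VS16 + KEYCAP).length);

// The check mark plus U+FE0F is two code points, matched as one unit.
const heavyCheck = "\u{2714}" + VS16;
console.log([...heavyCheck].length, matches(heavyCheck).length);
```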

Also of note, there are some glyphs in this range that do act as conventional emojis, like ✅ (U+2705). These can also be written as ✅ (U+2705 U+200D), with a ZWJ added at the end. If you keep adding ZWJs, the rendered glyph doesn't change, but you will have more characters to backspace through (at least on my MacBook).

This logic only matters when the glyph is at the beginning of the match; otherwise it will be preceded by a ZWJ.

Longevity

So long as emojis are represented in the format specified above, this regex will be robust against new emojis being created, because it recognizes emojis through Unicode property classes instead of fixed code point ranges.

Submitted by anonymous - 6 months ago
