Give the hiii text meaning.
This is related to the question Neural networks to detect "spam"? I'm wondering how it would be possible to handle the emotion conveyed in text. In informal writing, especially among a juvenile audience, it's common to find emotion expressed as repetition of characters. For example, "Hi" doesn't mean the same as "Hiiiiiiiiiiiiiii", but "hiiiiii", "hiiiiiiiii", and "hiiiiiiiiii" do mean roughly the same as each other.
A naive solution would be to preprocess the input and truncate any run of a repeated character beyond a certain threshold, say, 4. This would probably reduce most long "hiiiii" variants to "hiiii" (four i's), giving a separate meaning (weight in a context?) to "hi" vs "long hi".
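A minimal sketch of that truncation step, assuming plain Python and the threshold of 4 mentioned above (the function name `cap_repeats` is made up for illustration):

```python
import re


def cap_repeats(text: str, max_run: int = 4) -> str:
    """Shorten any run of one repeated character to at most max_run copies."""
    # (.)\1{max_run,} matches a character followed by max_run or more copies
    # of itself, i.e. a run strictly longer than max_run.
    pattern = re.compile(r"(.)\1{%d,}" % max_run, flags=re.DOTALL)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


if __name__ == "__main__":
    for word in ["Hi", "hiiiiii", "hiiiiiiiii", "hiiiiiiiiii"]:
        print(word, "->", cap_repeats(word))
```

With this, "Hi" is left untouched while every long variant collapses to the same "hiiii" token, so a word-level model sees only two distinct words.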
The naivety of this solution shows when the repeated unit is a combination of characters, for example "haha" vs "hahaha" or "lol" vs "lololololol". Again, we could write a regex to reduce lolol[ol]+ to lolol. But then we run into the issue of hahahahahahaha where a typo broke the sequence.
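As a rough sketch of the multi-character case, the same idea can be generalized with a backreference to a short repeated unit. The function name, the unit length of 2 to 3 characters, and the typo example in the demo are all assumptions made for illustration, not part of the original question:

```python
import re


def cap_repeated_units(text: str, max_units: int = 2) -> str:
    """Shorten runs of a repeated 2-3 character unit to max_units copies."""
    # (\w{2,3}?) captures a short unit (shortest first); \1{max_units,}
    # requires at least max_units further copies directly after it.
    pattern = re.compile(r"(\w{2,3}?)\1{%d,}" % max_units)
    return pattern.sub(lambda m: m.group(1) * max_units, text)


if __name__ == "__main__":
    print(cap_repeated_units("lol"))          # too short to match, unchanged
    print(cap_repeated_units("lololololol"))  # collapses to "lolol"
    print(cap_repeated_units("hahahaha"))     # collapses to "haha"
    # Hypothetical typo (a doubled "h") that breaks the repeated unit:
    print(cap_repeated_units("hahahahhaha"))  # does NOT collapse to "haha"
```

The last call illustrates the weakness described above: once a typo interrupts the pattern, the regex only collapses the part before the typo and the result is a new, rare token.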
There is also the whole issue of Emoji. Emoji may seem daunting at first since they are special characters. But once understood, emoji may actually become helpful in this situation. For example, may mean a very different thing than , but may mean the same as and
The trick with emojis, to me, is that they might actually be easier to parse. Simply add spaces between to convert to in the text analysis. I would guess that repetition would play a role in training, but unlike "hi" and "hiiii", Word2Vec won't try to categorize and as different words (as I've now forced to be separate words, relying on frequency to detect the emotion of the phrase).
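A minimal sketch of the "add spaces between emoji" idea, assuming a simple regex over a few common emoji code point ranges. The ranges and the name `space_out_emoji` are illustrative assumptions; this does not cover flags, and it splits skin-tone modifiers and joined sequences into separate tokens:

```python
import re

# Assumed ranges: main emoji/pictograph blocks plus miscellaneous symbols
# and dingbats. This is a deliberate simplification, not a full emoji spec.
_EMOJI = re.compile("([\U0001F300-\U0001FAFF\u2600-\u27BF])")


def space_out_emoji(text: str) -> str:
    """Surround each emoji with spaces so each becomes its own token."""
    spaced = _EMOJI.sub(r" \1 ", text)
    # Normalize whitespace so repeated emoji become single-space separated.
    return " ".join(spaced.split())


if __name__ == "__main__":
    sample = "hi\U0001F600\U0001F600\U0001F600"
    print(space_out_emoji(sample).split())
```

After this step, three identical emoji in a row become three occurrences of the same token rather than one new word, so repetition shows up as frequency.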
Even more, this would help the detection of "playful" language such as , where the emoji might imply there is anger, but alongside and especially when repeating multiple times, it would be easier for a neural network to understand that the person isn't really angry.
Does any of this make sense, or am I going in the wrong direction?
These kinds of repetitions in text can place recurrence demands on learning algorithms that may or may not be handled without special encoding.
These have the same meaning on one level, but different emotional content and therefore different correlations to categories when detecting the value of an email, which in the simplest case is the placement of a message in one of two categories.
Pass to a recipient
Archive only
This is colloquially called spam detection, although not all useless emails are spam and some messages sent by organizations that broadcast spam may be useful, so technically the term spam is not particularly useful. The determinant should usually be the return on investment to the recipient or the organization receiving and categorizing the message.
Is reading the message and potentially responding likely of greater value than the cost of reading it?
That is a high-level paraphrase of what the value or cost function must represent when AI components are employed to learn, or in continuous learning to track closely, some business or personal optimum.
The question proposes a normalization scheme that truncates long repetitions of short patterns in characters, but truncation is necessarily destructive. Compression of some type that will both preserve nuance and work with the author's use of Word2Vec is a more flexible and comprehensive approach.
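One way such a compression could look is a run-length style encoding that keeps the base word and emits the repetition count as a separate token, using the bracketed "[9xi]" notation this answer suggests further down. This is only a sketch under assumptions: the function names are made up, and treating runs of three or more as "repetition" is an arbitrary choice so that ordinary double letters are left alone:

```python
import re

# A run is one word character followed by two or more copies of itself.
_RUN = re.compile(r"(\w)\1{2,}")


def encode_runs(word: str) -> str:
    """Collapse long runs to one character and append a count token per run."""
    tokens = []

    def _collapse(match: re.Match) -> str:
        ch = match.group(1)
        extra = len(match.group(0)) - 1  # copies beyond the one that is kept
        tokens.append(f"[{extra}x{ch}]")
        return ch

    base = _RUN.sub(_collapse, word)
    return " ".join([base] + tokens)


def encode_text(text: str) -> str:
    """Apply encode_runs to every whitespace-separated word."""
    return " ".join(encode_runs(w) for w in text.split())


if __name__ == "__main__":
    print(encode_text("H" + "i" * 10))  # Hi [9xi]
    print(encode_text("H" + "i" * 5))   # Hi [4xi]
    print(encode_text("hello there"))   # unchanged
```

Unlike truncation, the count survives as its own token, so an embedding model can learn from "Hi" and from the count token separately. The encoding is not strictly reversible when the same letter occurs more than once in a word, which a real implementation would need to address.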
In the case of playful sequences of characters, it is anthropomorphic to imagine that an artificial network will understand playfulness or anger. However, existing learning devices can certainly learn to use character sequences that humans would call playful or angry in the function that emerges to categorize the message containing them. Just remember that model-free learning is not at all like cognition, so the term understanding places an expectation on the mental capacities of the AI component that it may not possess.
Since there is no indication that a recurrent or recursive network will be used, and the entire message is instead represented in a fixed-width vector, the question becomes which of these two approaches will produce the better outcome after learning.
Leaving the text uncompressed, so that an 'H' character followed by ten 'i' characters is distinct as a word from an 'H' character followed by five 'i' characters
Compressing the text to "Hi [9xi]" and "Hi [4xi]" respectively, or some such word bifurcation
This second approach produces reasonable behavior with other cases mentioned, such as "