Tag Archives: Natural language processing


Further to a recent discussion on natural language processing on the comp.compilers usenet news group, I was reminded of how the following sentence is grammatically correct in American English:

Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo.

As explained in further detail on Wikipedia, this sentence is an example of how homonyms (words that share the same spelling and pronunciation but have different meanings) and homophones (words that are pronounced the same but differ in meaning, and may differ in spelling) can be used to create complicated linguistic constructs. The above sentence is unpunctuated and uses three different readings of the word “buffalo”; in order of their first use, these are:

  • (a) the city of Buffalo, New York, USA, which is used as a noun adjunct in the sentence;
  • (n) the noun “buffalo” (the American bison), an animal, here in the plural (equivalent to “buffaloes” or “buffalos”) to avoid the need for articles;
  • (v) the verb “buffalo”, meaning to outwit, confuse, deceive, intimidate or baffle.

While the above sentence is syntactically ambiguous, one possible parse would be as follows — a claim that bison who are intimidated or bullied by bison are themselves intimidating or bullying bison (at least in the city of Buffalo):

Buffalo(a) buffalo(n) Buffalo(a) buffalo(n) buffalo(v) buffalo(v) Buffalo(a) buffalo(n).

Finally, there is nothing special about eight “buffalos”: any sentence consisting solely of the word “buffalo” repeated any number of times is grammatically correct (and is also a useful mechanism for illustrating rewrite rules). The shortest is “Buffalo!”, which can be read as an imperative instruction (“[You] buffalo!”). Versions of this linguistic oddity can be constructed with other words that simultaneously serve as noun, adjective, and verb, some of which need no capitalisation (e.g. “police”).
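To see why, one can encode a toy grammar and check, for every length, whether a string of that many “buffalo”s parses. The grammar below is an illustrative sketch of my own (the symbols and rules are a simplification, not a full grammar of English), using a memoised recursive recogniser:

```python
from functools import lru_cache

# Toy context-free grammar for the "buffalo" sentence. Symbols:
# S sentence, NP noun phrase, VP verb phrase, RC reduced relative
# clause; N noun, V verb, A noun adjunct (the city name).
RULES = {
    "S":  [["NP", "VP"], ["VP"]],            # declarative or imperative
    "NP": [["N"], ["A", "N"], ["NP", "RC"]],  # "Buffalo buffalo [that] ..."
    "RC": [["NP", "V"]],                      # "... buffalo buffalo"
    "VP": [["V"], ["V", "NP"]],
}
LEXICON = {"N": "buffalo", "V": "buffalo", "A": "buffalo"}

def derives(symbol, length):
    """Can `symbol` derive a string of `length` copies of 'buffalo'?"""
    @lru_cache(maxsize=None)
    def go(sym, n):
        if sym in LEXICON:           # pre-terminal: covers exactly one word
            return n == 1
        for rhs in RULES[sym]:
            if len(rhs) == 1:        # unit rule
                if go(rhs[0], n):
                    return True
            else:                    # binary rule: try every split point
                left, right = rhs
                if any(go(left, k) and go(right, n - k) for k in range(1, n)):
                    return True
        return False
    return go(symbol, length)

# Every repetition count from 1 to 12 parses as a sentence.
print(all(derives("S", n) for n in range(1, 13)))  # True
```

The length-1 case comes out of the imperative rule S → VP → V, and arbitrarily long sentences come from stacking relative clauses (NP → NP RC).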


Paper at AI-2013: “‘The First Day of Summer’: Parsing Temporal Expressions with Distributed Semantics”

In December, my PhD student Benjamin Blamey presented a joint paper entitled: ‘The First Day of Summer’: Parsing Temporal Expressions with Distributed Semantics at AI-2013, the 33rd SGAI International Conference on Artificial Intelligence in Cambridge.

If you do not have institutional access to SpringerLink (specifically the Research and Development in Intelligent Systems series), you can download our pre-print. The abstract is as follows:

Detecting and understanding temporal expressions are key tasks in natural language processing (NLP), and are important for event detection and information retrieval. In the existing approaches, temporal semantics are typically represented as discrete ranges or specific dates, and the task is restricted to text that conforms to this representation. We propose an alternate paradigm: that of distributed temporal semantics, where a probability density function models relative probabilities of the various interpretations. We extend SUTime, a state-of-the-art NLP system, to incorporate our approach, and build definitions of new and existing temporal expressions. A worked example is used to demonstrate our approach: the estimation of the creation time of photos in online social networks (OSNs), with a brief discussion of how the proposed paradigm relates to the point- and interval-based systems of time. An interactive demonstration, along with source code and datasets, are available online.
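The core idea can be sketched numerically: instead of mapping an expression to a single date range, each day of the year gets a probability, and independent pieces of evidence combine by pointwise multiplication. Every shape and number below is an illustrative assumption of mine, not a definition from the paper or its SUTime extension:

```python
import numpy as np

DAYS = np.arange(365)  # day-of-year grid

def gaussian_days(mean_day, std_days):
    """Normalised bell curve over day-of-year (a convenient fuzzy shape)."""
    w = np.exp(-0.5 * ((DAYS - mean_day) / std_days) ** 2)
    return w / w.sum()

# "summer": fuzzy probability mass centred near late July (day ~200)
summer = gaussian_days(mean_day=200, std_days=30)

# "early summer": reweight summer towards its earlier days with a ramp
ramp = np.clip((250 - DAYS) / 100.0, 0.0, 1.0)
early_summer = summer * ramp
early_summer /= early_summer.sum()

# Combining evidence: a photo captioned "early summer" whose upload
# metadata suggests June (days 152-181) -> product, renormalised.
exif_prior = np.zeros(365)
exif_prior[152:182] = 1 / 30.0
posterior = early_summer * exif_prior
posterior /= posterior.sum()

print(int(posterior.argmax()))  # most probable creation day-of-year: 181
```

The posterior concentrates on the late-June days where the caption's mass and the metadata window overlap, which is the flavour of the photo-dating worked example in the paper.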

(see Publications)


Paper at AI-2012: “R U :-) or :-( ? Character- vs. Word-Gram Feature Selection for Sentiment Classification of OSN Corpora”

In December, my PhD student Benjamin Blamey presented a joint paper entitled: R U :-) or :-( ? Character- vs. Word-Gram Feature Selection for Sentiment Classification of OSN Corpora at AI-2012, the 32nd SGAI International Conference on Artificial Intelligence in Cambridge (for which he also won the best poster prize).

If you do not have institutional access to SpringerLink (specifically the Research and Development in Intelligent Systems series), you can download our pre-print. The abstract is as follows:

Binary sentiment classification, or sentiment analysis, is the task of computing the sentiment of a document, i.e. whether it contains broadly positive or negative opinions. The topic is well-studied, and the intuitive approach of using words as classification features is the basis of most techniques documented in the literature. The alternative character n-gram language model has been applied successfully to a range of NLP tasks, but its effectiveness at sentiment classification seems to be under-investigated, and results are mixed. We present an investigation of the application of the character n-gram model to text classification of corpora from online social networks, the first such documented study, where text is known to be rich in so-called unnatural language, also introducing a novel corpus of Facebook photo comments. Despite hoping that the flexibility of the character n-gram approach would be well-suited to unnatural language phenomena, we find little improvement over the baseline algorithms employing the word n-gram language model.
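The two feature families compared in the paper can be sketched in a few lines of plain Python. The example comment is invented, and a real classifier would count or hash these n-grams into feature vectors rather than print them:

```python
def ngrams(text, n, unit="char"):
    """Extract n-gram features from text.

    unit='char' slides a window over characters (robust to the creative
    spelling common in social-network text); unit='word' slides over
    whitespace-separated tokens, the classical word n-gram model.
    """
    tokens = list(text) if unit == "char" else text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

comment = "omg soooo cute!!"

# Word bigrams treat the elongated 'soooo' as an unseen token...
print(ngrams(comment, 2, unit="word"))
# ...while character trigrams such as ('s','o','o') still fire on it.
print(ngrams("soooo", 3, unit="char"))
```

The hope the abstract describes is exactly this: character n-grams overlap with distorted spellings where word features see only out-of-vocabulary tokens.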

(see Publications)


“That’s what she said”

Turning seemingly innocuous comments into sexual innuendo by adding the words “That’s what she said” (TWSS) has become a (chiefly American, occasionally annoying) cultural phenomenon. Unfortunately, identifying humour and double entendre through software is hard. This is interesting to me from a research perspective: I am interested in the wider area of knowledge representation and reasoning, particularly declarative problem-solving. It is hard to perform sentiment analysis and infer meaning from human language statements that can have non-standard structures, particularly if you want to do it with large-scale datasets (think Twitter, et al.).

For many years, artificial intelligence researchers have been trying to solve the natural language processing (NLP) problem. This field bridges computer science and linguistics and aims to build software that can analyse, understand and generate languages that humans use naturally, so that eventually you will be able to address your computer as though you were addressing another person. Natural language understanding is sometimes referred to as an AI-complete problem, because it requires extensive world knowledge and the ability to manipulate it; to call a problem AI-complete reflects an attitude that it cannot be solved by a simple algorithm. In NLP, the meaning of a sentence will often vary based on the context in which it is presented (since text can contain information at many different granularities), and this is something that is difficult to represent in software. When you add humour, puns and double entendre, this can get substantially harder.

But maybe the first steps have been made: a recent paper (That’s What She Said: Double Entendre Identification) by Chloe Kiddon and Yuriy Brun, computer scientists from the University of Washington, presents a software program capable of recognising a specific type of humour, the TWSS problem; their system is called “Double Entendre via Noun Transfer”, or DEviaNT for short.

Kiddon and Brun’s approach consists of three functions that score words based on sample sentences drawn from either an erotic corpus or the Brown corpus, a standard reference corpus in the field. And this was the part that caught my geek attention: the noun sexiness function, NS(n), rates nouns based on their relative frequencies and whether they are euphemisms for sexually explicit nouns. For example, words with high NS(n) scores include “rod” and “meat”. The two other functions are the adjective sexiness function, AS(a), which detects adjectives such as “hot” and “wet”, and the verb sexiness function, VS(v).
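The relative-frequency idea behind NS(n) can be sketched as a smoothed log frequency ratio between the two corpora. The ten-word corpora and the exact formula below are invented for illustration; the paper defines NS(n) over full corpus statistics:

```python
import math
from collections import Counter

# Tiny stand-in corpora (illustrative only; the paper uses an erotic
# corpus and the Brown corpus).
erotica = "the rod met the hot wet meat rod rod".split()
neutral = "the report met the board and the deadline report".split()

def ns(noun, sexy=Counter(erotica), plain=Counter(neutral)):
    """Add-one-smoothed log relative frequency of `noun`.

    Positive values mean the noun is more at home in the erotic corpus,
    flagging it as a euphemism candidate.
    """
    p_sexy = (sexy[noun] + 1) / (sum(sexy.values()) + len(sexy))
    p_plain = (plain[noun] + 1) / (sum(plain.values()) + len(plain))
    return math.log(p_sexy / p_plain)

print(ns("rod") > ns("report"))  # True
```

The add-one smoothing keeps the score finite for nouns that appear in only one of the two corpora, which is the common case for genuine euphemisms.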

These three functions are used to score sentences for noun euphemisms, i.e. whether a test sentence includes a word likely to be used in an erotic sentence. Other scoring elements include the presence of adjective and verb combinations more likely to be used in erotic literature, along with basic features such as the number of punctuation and non-punctuation items in a sentence. These scores were used as features to train a classifier using the WEKA toolkit, an open-source collection of machine learning algorithms for data mining tasks. With this approach they were able to identify sentences suitable for TWSS-style jokes with around 72% accuracy, while keeping false positives to a minimum: the authors note that making the joke when the sentence is not appropriate is much worse than not making the joke when it is appropriate.
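The asymmetric-cost decision at the end of that pipeline can be sketched with a single-threshold stump. WEKA is a Java toolkit, so this Python stand-in, over invented sentence scores, illustrates the preference for precision over recall rather than the classifier the authors actually trained:

```python
# Invented training data: (overall sentence score, is a TWSS set-up?)
train = [
    (2.1, True),
    (1.8, True),
    (0.2, False),
    (1.9, True),
    (0.5, False),
    (0.1, False),
]

def fit_stump(examples):
    """Set the firing threshold to the highest score among the negative
    examples: zero false positives on the training data, at the cost of
    possibly missing borderline positives (a mistimed joke is worse
    than a missed one)."""
    return max(score for score, label in examples if not label)

THRESHOLD = fit_stump(train)

def is_twss(score, threshold=THRESHOLD):
    """Only make the joke when the score clears every negative example."""
    return score > threshold

print(is_twss(2.0), is_twss(0.3))  # True False
```

A real classifier trades the threshold off against misclassification costs rather than pinning it to the training maximum, but the design pressure is the same one the authors describe.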

While this is preliminary work (the authors will be presenting it at the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies in June), the technique of metaphorical mapping may be generalised to identify other types of double entendres and other forms of humour.

Or maybe it’s just far too big to get a grip on…
