This paper presents a multimodal fuzzy inference system for emotion detection. The system extracts and merges visual, acoustic, and context-relevant features. The experiments were performed as part of the AVEC 2012 challenge. Facial expressions play an important role in emotion detection, but automatically detecting emotional facial expressions on unknown subjects remains a challenging problem. We propose a method that adapts to the morphology of the subject and is based on an invariant representation of facial expressions. The method relies on 8 key emotional expressions of the subject: each image of a video sequence is defined by its position relative to these 8 expressions. The 8 expressions are synthesized for each subject from plausible distortions learnt on other subjects and transferred onto the subject's neutral face. Expression recognition in a video sequence is performed in this space with a basic intensity-area detector. Emotion is described along 4 dimensions: valence, arousal, power, and expectancy. The results show that the duration of a high-intensity smile is a meaningful cue for continuous valence detection and can also be used to improve arousal detection. The main variations in power and expectancy are given by context data.