Predictive coding and how the dynamical Bayesian brain achieves specialization and integration

Author's note: this marks the first in a new series of journal-entry style posts in which I write freely about things I like to think about. The style is meant to be informal and off the cuff, building towards a sort of Socratic dialogue. Please feel free to argue or debate any point you like. These are meant to serve as exercises in writing and thinking, to improve the quality of both and lay groundwork for future papers.

My wife Francesca and I are spending the winter holidays vacationing in the north Italian countryside with her family. Today in our free time our discussions turned to how predictive coding and generative models can accomplish the multimodal perception that characterizes the brain. To this end Francesca asked a question we found particularly thought-provoking: if the brain at all levels is only communicating forward what is not predicted (prediction error), how can you explain the functional specialization that characterizes the different senses? For example, if each sensory hierarchy is only communicating prediction errors, what explains their unique specialization in terms of e.g. the frequency, intensity, or quality of sensory inputs? Put another way, how can the different sensations be represented if the entire brain is only communicating in one format?

We found this quite interesting, as it seems straightforward and yet the answer lies at the very basis of predictive coding schemes. To arrive at an answer we first had to lay a little groundwork in terms of information theory and basic neurobiology. What follows is a grossly oversimplified account of the basic neurobiology of perception, which serves only as a kind of philosopher’s toy example to consider the question. Please feel free to correct any gross misunderstandings.

To begin, it is clear, at least according to Shannon's theory of information, that any sensory property can be encoded in a simple system of ones and zeros (or nerve impulses). Frequency, time, intensity, and so on can all be re-described in terms of a simplistic encoding scheme. If this were not the case then modern television wouldn't work. Second, each sensory hierarchy presumably begins with a sensory receptor, which directly transduces physical fluctuations into a neuronal code. For example, in the auditory hierarchy the cochlea contains small hair cells that vibrate only to a particular frequency of sound wave. This vibration, through a complex neuro-mechanical relay, results in a tonotopic depolarization of first-order neurons in the spiral ganglion.

The human cochlea, a fascinating neuro-mechanical apparatus that directly transduces air vibrations into neural representations.

It is here at the first-order neuron where the hierarchy presumably begins, and also where functional specialization becomes possible. It seems to us that predictive coding should say that the first neuron is simply predicting a particular pattern of inputs, which correspond directly to an expected external physical property. To give a toy example, say we present the brain with a series of tones that reliably increase in frequency at 1 Hz intervals. At the lowest level the neuron will fire at a constant rate if the frequency at interval n is 1 Hz greater than at the previous interval, and will fire more or less if the frequency is greater or less than this basic expectation, creating a positive or negative prediction error (remember that the neuron should only alter its firing pattern if something unexpected happens). Since frequency here is signaled directly by the mechanical vibration of the cochlear hairs, the first-order neuron is simply predicting which frequency will be signaled. More realistically, each sensory neuron is probably only predicting whether or not a particular frequency will be signaled – we know from neurobiology that low-level neurons are basically tuned to a particular sensory feature, whereas higher-level neurons encode receptive fields across multiple neurons or features. All this is to say that the first-order neuron is specialized for frequency because all it can predict is frequency; the only afferent input is the direct result of sensory transduction. The point here is that specialization in each sensory system arises in virtue of the fact that its inputs correspond directly to a physical property.
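To make the toy example concrete, here is a minimal sketch in Python (all values invented for illustration): a unit whose model says "each tone is 1 Hz above the last" stays silent while the world obeys, and signals only the mismatch.

```python
import numpy as np

# Toy tone sequence (Hz): mostly rising by 1 Hz per interval, with violations
tones = np.array([100.0, 101.0, 102.0, 103.0, 105.0, 106.0, 104.0])

# The unit's generative model: "the next tone is 1 Hz above the last"
predicted = tones[:-1] + 1.0

# Prediction error is all that gets passed forward: zero while the world
# matches the expectation, positive or negative when it deviates
prediction_error = tones[1:] - predicted
print(prediction_error)  # [ 0.  0.  0.  1.  0. -3.]
```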

Presumably, first-order neurons predict the presence or absence of a particular, specialized sensory feature owing to their input. Credit: Wikipedia.

Now, as one ascends higher in the hierarchy, each subsequent level is predicting the activity of the previous. The first-order neuron predicts whether a given frequency is presented, the second perhaps predicts if a receptive field is activated across several similarly tuned neurons, the third predicts a particular temporal pattern across multiple receptive fields, and so on. Each subsequent level is predicting a "hyperprior" encoding a higher-order feature of the previous level. Eventually we get to a level where the prediction is no longer bound to a single sensory domain, but instead has to do with complex, non-linear interactions between multiple features. A parietal neuron thus might predict that an object in the world is a bird if it sings at a particular frequency and has a particular bodily shape.
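As a loose illustration of this stacking, here is a two-level sketch in the spirit of Rao and Ballard's hierarchical predictive coding (toy weights and input, gradient-descent settling; not a claim about real neural dynamics). Each level only "sees" the error from the level below, and the top level adjusts itself to better predict the level-1 state:

```python
import numpy as np

rng = np.random.default_rng(0)

sensory = np.array([1.0, 0.5, 0.0, -0.5])  # transduced input (e.g. tonotopic drive)
W1 = rng.normal(size=(4, 2)) * 0.1         # maps level-2 state to predicted level-1 state
x1 = np.zeros(4)                           # level-1 state
x2 = np.zeros(2)                           # level-2 state

lr = 0.1
for _ in range(500):
    e1 = sensory - x1        # level-1 error: input vs. level-1 state
    e2 = x1 - W1 @ x2        # level-2 error: level-1 state vs. top-down prediction
    x1 += lr * (e1 - e2)     # settle to explain the input while honoring the prediction
    x2 += lr * (W1.T @ e2)   # higher level updates to better predict level 1

# After settling, what remains is the error the two-level model cannot explain away
print(np.round(e1, 2), np.round(e2, 2))
```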

The motif of hierarchical message passing which encompasses the nervous system, according to the Free Energy Principle.

If this general scheme is correct, then according to hierarchical predictive coding functional specialization primarily arises in virtue of the fact that at the lowest level each hierarchy is receiving inputs that strictly correspond to a particular feature. The cochlea is picking up fluctuations in air vibration (sound), the retina is picking up fluctuations in light frequency (light), and the skin is picking up changes in thermal amplitude and tactile frequency (touch). The specialization of each system is due to the fact that each is attempting to predict higher and higher order properties of those low-level inputs, which are by definition particular to a given sensory domain. Any further specialization in the hierarchy must then arise from the fact that higher levels of the brain predict inputs from multiple sensory systems – we might find multimodal object-related areas simply because the best hyperprior governing nonlinear relationships between frequency and shape is an amodal or cross-modal object. The actual etiology of higher-level modules is a bit more complicated than this, and requires an appeal to evolution to explain in detail, but we felt this was a generally sufficient explanation of specialization.

Nonlinearity of the world and perception: prediction as integration

At this point, we felt like we had some insight into how predictive coding can explain functional specialization without needing to appeal to special classes of cortical neurons for each sensation. Beyond the sensory receptors, the function of each system can be realized simply by means of a canonical, hierarchical prediction of each layered input, right down to the point of neurons which predict which frequency will be signaled. However, something was still missing, prompting Francesca to ask – how can this scheme explain the coherent, multimodal, integrated perception that characterizes conscious experience?

Indeed, we certainly do not experience perception as a series of nested predictions. All of the aforementioned machinery functions seamlessly beyond the point of awareness. In phenomenology one way to describe such influences is as prenoetic (before knowing; see also prereflective); i.e. things that influence conscious experience without themselves appearing in experience. How then can predictive coding explain the transition from segregated, feature-specific predictions to the unified percept we experience?

When we arrange sensory hierarchies laterally, we see the "Markov blanket" structure of the brain emerge. Each level predicts the control parameters of subsequent levels. In this way integration arises naturally from the predictive brain.

As you might guess, we already hinted at part of the answer. Imagine if instead of picturing each sensory hierarchy as an isolated pyramid, we instead arrange them such that each level is parallel to its equivalent in the 'neighboring' hierarchy. On this view, we can see that relatively early in each hierarchy you arrive at multi-sensory neurons that are predicting conjoint expectations over multiple sensory inputs. Conveniently, this observation matches what we actually know about the brain; audition, touch, and vision all converge in temporo-parietal association areas.

Perceptual integration is thus achieved as easily as specialization; it arises from the fact that each level predicts a hyperprior on the previous level. As one moves upwards through the hierarchy, this means that each level predicts more integrated, abstract, amodal entities. Association areas don’t predict just that a certain sight or sound will appear, but instead encode a joint expectation across both (or all) modalities. Just like the fusiform face area predicts complex, nonlinear conjunctions of lower-level visual features, multimodal areas predict nonlinear interactions between the senses.
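A cartoon of this "joint expectation" idea (hypothetical numbers, nothing anatomical): a multimodal unit entertains candidate causes, each of which predicts features in two senses at once, and the cause with the smallest joint prediction error wins.

```python
import numpy as np

# Hypothetical candidate causes, each predicting [song frequency (kHz), shape score]
causes = {
    "bird":     np.array([4.0, 1.0]),  # sings at ~4 kHz and looks bird-shaped
    "ringtone": np.array([4.0, 0.0]),  # similar sound, no bird shape
}

observed = np.array([4.1, 0.9])  # what the lower, unimodal levels report

# Joint (cross-modal) prediction error for each hypothesis
errors = {name: np.sum((observed - pred) ** 2) for name, pred in causes.items()}
print(min(errors, key=errors.get))  # "bird": the conjoint hypothesis explains both senses
```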

A half-cat and half-post, or a cat behind a post? The deep convolutional nature of the brain helps us solve this and similar nonlinear problems.

It is this nonlinearity that makes predictive schemes so powerful and attractive. To understand why, consider the task the brain must solve to be useful. Sensory impressions are not generated by simple linear inputs; for perception to be useful to an organism it must process the world at a level that is relevant for that organism. This is the world of objects, persons, and things, not disjointed, individual sensory properties. When I watch a cat walk behind a fence, I don't perceive it as two halves of a cat and a fence post, but rather as a cat hidden behind a fence. These kinds of nonlinear interactions between objects and properties of the world are ubiquitous in perception; the brain must solve not for the immediately available sensory inputs but rather for the complex hidden causes underlying them. This is achieved in a similar manner to a deep convolutional network; each level performs the same canonical prediction, yet together the hierarchy will extract the hidden features that best explain the complex interactions producing physical sensations. In this way the predictive brain somersaults over the binding problem of perception; perception is integrated precisely because conjoint hypotheses are better, more useful explanations than discrete ones. As long as the network has sufficient hierarchical depth, it will always arrive at these complex representations. It's worth noting we can observe the flip side of this process in common visual illusions, where the higher-order percept or prior "fills in" our actual sensory experience (e.g. when we perceive a convex circle as being lit from above).

Our higher-level, integrative priors “fill in” our perception.

Beating the homunculus: the dynamic, enactive Bayesian brain

Feeling satisfied with this, Francesca and I concluded our fun holiday discussion by thinking about some common misunderstandings this scheme might lead one into. For example, the notion of hierarchical prediction explored above might lead one to expect that there has to be a “top” level, a kind of super-homunculus who sits in the prefrontal cortex, predicting the entire sensorium. This would be an impossible solution; how could any subsystem of the brain possibly predict the entire activity of the rest? And wouldn’t that level itself need to be predicted, to be realised in perception, leading to infinite regress? Luckily the intuition that these myriad hypotheses must “come together” fundamentally misunderstands the Bayesian brain.

Remember that each level is only predicting the activity of that before it. The integrative parietal neuron is not predicting the exact sensory input at the retina; rather it is only predicting what pattern of inputs it should receive if the sensory input is an apple, or a bat, or whatever. The entire scheme is linked up this way; the individual units are just stupid predictors of immediate input. It is only when you link them all up together in a deep network that the brain can recapitulate the complex web of causal interactions that make up the world.

This point cannot be stressed enough: predictive coding is not a localizationist enterprise. Perception does not come about because a magical brain area inverts an entire world model. It comes about in virtue of the distributed, dynamic activity of the entire brain as it constantly attempts to minimize prediction error across all levels. Ultimately the "model" is not contained "anywhere" in the brain; the entire brain itself, and the full network of connection weights, is the model of the world. The power to predict complex nonlinear sensory causes arises because the best overall pattern of interactions will be that which most accurately (or usefully) explains sensory inputs and the complex web of interactions which causes them. You might rephrase the famous saying as "the brain is its own best model of the world".

As a final consideration, it is worth noting some misconceptions may arise from the way we ourselves perform Bayesian statistics. As an experimenter, I formalize a discrete hypothesis (or set of hypotheses) about something and then invert that model to explain data in a single step. In the brain however the "inversion" is just the constant interplay of input and feedback across the nervous system at all levels. In fact, under this distributed view (at least according to the Free Energy Principle), neural computation is deeply embodied, as actions themselves complete the inferential flow to minimize error. Thus just like neural feedback, actions function as 'predictions', generated by the inferential mechanism to bring the world into line with our expectations. This ultimately minimises prediction error just as internal model updates do, albeit in a different 'direction of fit' (world to model, instead of model to world). In this way the 'model' is distributed across the brain and body; actions themselves are as much a part of the computation as the brain itself and constitute a form of "active inference". In fact, if one extends this view to evolution, the morphological shape of the organism is itself a kind of prior, predicting the kinds of sensations, environments, and actions the agent is likely to inhabit. This intriguing idea will be the subject of a future blog post.
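A throwaway sketch of that "direction of fit" point (invented quantities, obviously not the Free Energy Principle in full): instead of revising its prediction, the agent acts on the world until sensation matches the prediction.

```python
# Toy active inference: minimize prediction error by changing the world,
# not the model (world-to-model direction of fit).
predicted_temperature = 21.0   # the agent's prior expectation
world_temperature = 15.0       # actual hidden state of the world
action_gain = 0.3              # how strongly each action nudges the world

for _ in range(25):
    error = predicted_temperature - world_temperature
    world_temperature += action_gain * error   # act (e.g. turn up the heater)

print(round(world_temperature, 2))  # ~21.0: the world now fits the prediction
```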

Conclusion

We feel this is an extremely exciting view of the brain. The idea that an organism can achieve complex intelligence simply by embedding a simple repetitive motif within a dynamical body seems to us to be a fundamentally novel approach to the mind. In future posts and papers, we hope to further explore the notions introduced here, considering questions about “where” these embodied priors come from and what they mean for the brain, as well as the role of precision in integration.

Questions? Comments? Feel like I'm an idiot? Sound off in the comments!

Further Reading:

Brown, H., Adams, R. A., Parees, I., Edwards, M., & Friston, K. (2013). Active inference, sensory attenuation and illusions. Cognitive Processing, 14(4), 411–427. http://doi.org/10.1007/s10339-013-0571-3
Feldman, H., & Friston, K. J. (2010). Attention, Uncertainty, and Free-Energy. Frontiers in Human Neuroscience, 4. http://doi.org/10.3389/fnhum.2010.00215
Friston, K., Adams, R. A., Perrinet, L., & Breakspear, M. (2012). Perceptions as Hypotheses: Saccades as Experiments. Frontiers in Psychology, 3. http://doi.org/10.3389/fpsyg.2012.00151
Friston, K., & Kiebel, S. (2009). Predictive coding under the free-energy principle. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 364(1521), 1211–1221. http://doi.org/10.1098/rstb.2008.0300
Friston, K., Thornton, C., & Clark, A. (2012). Free-Energy Minimization and the Dark-Room Problem. Frontiers in Psychology, 3. http://doi.org/10.3389/fpsyg.2012.00130
Moran, R. J., Campo, P., Symmonds, M., Stephan, K. E., Dolan, R. J., & Friston, K. J. (2013). Free Energy, Precision and Learning: The Role of Cholinergic Neuromodulation. The Journal of Neuroscience, 33(19), 8227–8236. http://doi.org/10.1523/JNEUROSCI.4255-12.2013

 

#MethodsWeDontReport – brief thought on Jason Mitchell versus the replicators

This morning Jason Mitchell self-published an interesting essay espousing his views on why replication attempts are essentially worthless. At first I was merely intrigued by the fact that what would obviously become a topic of heated debate was self-published, rather than going through the long slog of a traditional academic medium. Score one for self-publication, I suppose. Jason's argument is essentially that null results don't yield anything of value and that we should be improving the way science is conducted and reported rather than publicising our nulls. I found particularly interesting his short example list of things that he sees as critical to experimental results which nevertheless go unreported:

These experimental events, and countless more like them, go unreported in our method section for the simple fact that they are part of the shared, tacit know-how of competent researchers in my field; we also fail to report that the experimenters wore clothes and refrained from smoking throughout the session.  Someone without full possession of such know-how—perhaps because he is globally incompetent, or new to science, or even just new to neuroimaging specifically—could well be expected to bungle one or more of these important, yet unstated, experimental details.

While I don’t agree with the overall logic or conclusion of Jason’s argument (I particularly like Chris Said’s Bayesian response), I do think it raises some important or at least interesting points for discussion. For example, I agree that there is loads of potentially important stuff that goes on in the lab, particularly with human subjects and large scanners, that isn’t reported. I’m not sure to what extent that stuff can or should be reported, and I think that’s one of the interesting and under-examined topics in the larger debate. I tend to lean towards the stance that we should report just about anything we can – but of course publication pressures and tacit norms means most of it won’t be published. And probably at least some of it doesn’t need to be? But which things exactly? And how do we go about reporting stuff like how we respond to random participant questions regarding our hypothesis?

To find out, I'd love to see a list of things you can't or don't regularly report using the #methodswedontreport hashtag. Quite a few are starting to show up – most are funny or outright snarky (as seems to be the general mood of the response to Jason's post), but I think a few are pretty common lab occurrences and are even thought-provoking in terms of their potentially serious experimental side-effects. Surely we don't want to report all of these 'tacit' skills in our burgeoning method sections; the question is which ones need to be reported, and why are they important in the first place?

Mind-wandering and metacognition: variation between internal and external thought predicts improved error awareness

Yesterday I published my first paper on mind-wandering and metacognition, with Jonny Smallwood, Antoine Lutz, and collaborators. This was a fun project for me as I spent much of my PhD exhaustively reading the literature on mind-wandering and default mode activity, resulting in a lot of intense debate at my research center. When we had Jonny over as an opponent at my PhD defense, the chance to collaborate was simply too good to pass up. Mind-wandering is super interesting precisely because we do it so often. One of my favourite anecdotes comes from around the time I was arguing heavily for the role of the default mode in spontaneous cognition to some very skeptical colleagues. The next day while waiting to cross the street, one such colleague rode up next to me on his bicycle and joked, "are you thinking about the default mode?" And indeed I was – meta-mind-wandering!

One thing that has really bothered me about much of the mind-wandering literature is how frequently it is presented as attention = good, mind-wandering = bad. Can you imagine how unpleasant it would be if we never mind-wandered? Just picture trying to solve a difficult task while being totally 100% focused. This kind of locked-in attention can easily become pathological, preventing us from altering course when our behaviour goes awry or when something internal needs to be adjusted. Mind-wandering serves many positive purposes, from stimulating our imaginations, to motivating us in boring situations with internal rewards (boring task… "ahhhh remember that nice mojito you had on the beach last year?"). Yet we largely see papers exploring the costs – mood deficits, cognitive control failure, and so on. In the meditation literature this has even been taken up to form the misguided idea that meditation should reduce or eliminate mind-wandering (even though there is almost zero evidence to this effect…).

Sometimes our theories end up reflecting our methodological apparatus, to the extent that they may not fully capture reality. I think this is part of what has happened with mind-wandering, which was originally defined in relation to difficult (and boring) attention tasks. Worse, mind-wandering is usually operationalized as a dichotomous state ("off-task" vs "on-task") when a little introspection strongly suggests it is much more of a fuzzy, dynamic transition between meta-cognitive and sensory processes. By studying mind-wandering just as the mean number of times you were "off-task", we're taking the stream of consciousness and acting as if the 'depth' at one point in the river is the entire story – but what about flow rate, tidal patterns, fishies, and all the dynamic variability that define the river? My idea was that one simple way to get at this is by looking at the within-subject variability of mind-wandering, rather than just the overall mean "rate". In this way we could get some idea of the extent to which a person's mind-wandering was fluctuating over time, rather than just categorising these events dichotomously.
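In code the distinction is trivial, which makes it all the more surprising how rarely it is used. With hypothetical probe ratings, two subjects can share a mean TUT "rate" while differing wildly in fluctuation:

```python
import numpy as np

# Hypothetical thought-probe ratings (1-7 task-unrelated thought scale)
ratings = {
    "s01": np.array([3, 4, 3, 4, 3, 4]),   # steady, moderate mind-wandering
    "s02": np.array([1, 7, 2, 6, 1, 4]),   # same mean, highly variable
}

for subject, r in ratings.items():
    # Mean = overall TUT rate; SD = fluctuation between internal and external focus
    print(subject, "mean:", round(r.mean(), 2), "variability:", round(r.std(ddof=1), 2))
```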

The EAT task used in my study, with thought probes.

To do this, we combined a classical meta-cognitive response inhibition paradigm, the "error awareness task" (pictured above), with standard interleaved "thought-probes" asking participants to rate on a scale of 1-7 the "subjective frequency" of task-unrelated thoughts in the task interval prior to the probe. We then examined the relationship between the ability to perform the task, or "stop accuracy", and each participant's mean task-unrelated thought (TUT) rating. Here we expected to replicate the well-established relationship between TUTs and attention decrements (after all, it's difficult to inhibit your behaviour if you are thinking about the hunky babe you saw at the beach last year!). We further examined whether the standard deviation of TUT (TUT variability) within each participant would predict error monitoring, reflecting a relationship between metacognition and increased fluctuation between internal and external cognition (after all, isn't that kind of the point of metacognition?). Of course for specificity and completeness, we conducted each multiple regression analysis with the contra-variable as a control predictor. Here is the key finding from the paper:

Regression analysis of TUT, TUT variability, stop accuracy, and error awareness.

As you can see in the bottom right, we clearly replicated the relationship of increased overall TUT predicting poorer stop performance. Individuals who report an overall high intensity/frequency of mind-wandering unsurprisingly commit more errors. What was really interesting, however, was that the more variable a participant's mind-wandering, the greater their error-monitoring capacity (top left). This suggests that individuals who show more fluctuation between internally and externally oriented attention may be able to better enjoy the benefits of mind-wandering while simultaneously limiting its costs. Of course, these are only individual differences (i.e. correlations) and should be treated as highly preliminary. It is possible for example that participants who use more of the TUT scale have higher meta-cognitive ability in general, rather than the two variables being causally linked in the way we suggest. We are careful to raise these and other limitations in the paper, but I do think this finding is a nice first step.
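For readers who want the shape of the analysis, here is a minimal sketch of that kind of multiple regression (simulated stand-in data, not our actual pipeline or values):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 30
mean_tut = rng.normal(4.0, 1.0, n)                       # per-subject mean TUT
tut_sd = rng.normal(1.5, 0.4, n)                         # per-subject TUT variability
error_awareness = 0.5 * tut_sd + rng.normal(0, 0.3, n)   # fabricated effect, illustration only

# Predict error awareness from TUT variability, with mean TUT as a control predictor
X = sm.add_constant(np.column_stack([tut_sd, mean_tut]))
print(sm.OLS(error_awareness, X).fit().summary())
```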

To ‘probe’ a bit further we looked at the BOLD responses to correct stops, and the parametric correlation of task-related BOLD with the TUT ratings:

Activations during correct stop trials.
Deactivations to stop trials (blue) and parametric correlation with TUT reports (red).

As you can see, correct stop trials elicit a rather canonical activation pattern in the motor-inhibition and salience networks, with concurrent deactivations in visual cortex and the default mode network (second figure, blue blobs). I think of this pattern a bit like the brain, on receiving the 'stop signal', going (à la Picard): "FULL STOP, MAIN VIEWER OFF, FIRE THE PHOTON TORPEDOES!", launching into full response recovery mode. Interestingly, while we replicated the finding of medial-prefrontal co-variation with TUTs (second figure, red blob), this area was substantially more rostral than the stop-related deactivations, supporting previous findings of some degree of functional segregation between the inhibitory and mind-wandering-related components of the DMN.

Finally, when examining the Aware > Unaware errors contrast, we replicated the typical salience network activations (mid-cingulate and anterior insula). Interestingly we also found strong bilateral activations in an area of the inferior parietal cortex also considered to be part of the default mode network. This finding further strengthens the link between mind-wandering and metacognition, indicating that the salience and default mode networks may work in concert during conscious error awareness:

Activations to the Aware > Unaware errors contrast.

In all, this was a very valuable and fun study for me. As a PhD student, being able to replicate the function of the classic "executive, salience, and default mode" 'resting state' networks with a basic task was a great experience, helping me place some confidence in these labels. I was also able to combine a classical behavioral metacognition task with some introspective thought probes, and show that they do indeed contain valuable information about task performance and related brain processes. Importantly though, we showed that the 'content' of the mind-wandering reports doesn't tell the whole story of spontaneous cognition. In the future I would like to explore this idea further, perhaps by taking a time series approach to probe the dynamics of mind-wandering, using a simple continuous feedback device that participants could use throughout an experiment. In the affect literature such devices have been used to probe the dynamics of valence-arousal when participants view naturalistic movies, and I believe such an approach could reveal even greater granularity in how the experience of mind-wandering (and its fluctuation) interacts with cognition. Our findings suggest that the relationship between mind-wandering and task performance may be more nuanced than mere antagonism, an important finding I hope to explore in future research.

Citation: Allen M, Smallwood J, Christensen J, Gramm D, Rasmussen B, Jensen CG, Roepstorff A and Lutz A (2013) The balanced mind: the variability of task-unrelated thoughts predicts error monitoring. Front. Hum. Neurosci. 7:743. doi: 10.3389/fnhum.2013.00743

Short post: why I share (and share often)

If you follow my social media activities I am sure by now that you know me as a compulsive share-addict. Over the past four years I have gradually increased both the amount of incoming and outgoing information I attempt to integrate on a daily basis. I start every day with a now routine ritual of scanning new publications from 60+ journals and blogs using my firehose RSS feed, as well as integrating new links from various Science sub-reddits, my curated twitter cogneuro list, my friends and colleagues on Facebook, and email lists. I then in turn curate the best, most relevant to my interests, or in some cases the most outrageous of these links and share them back to twitter, facebook, reddit, and colleagues.

Of course in doing so, a frequent response from (particularly more senior) colleagues is: why?! Why do I choose to spend the time to both take in all that information and to share it back to the world? The answer is quite simple: in sharing this stuff I get critical feedback from an ever-growing network of peers and collaborators. I can't even count the number of times someone has pointed out something (for better or worse) that I would have otherwise missed in an article or idea. That's right, I share it so I can see what you think of it! In this way I have been able not only to stay up to date with the latest research and concepts, but to receive constant invaluable feedback from all of you lovely brains :). In some sense I literally distribute my cognition throughout my network – thanks for the extra neurons!

From the beginning, I have been able not only to assess the impact of this stuff, but also to gain deeper and more varied insights into its meaning. When I began my PhD I had the moderate statistical training of a BSc in psychology, with little direct knowledge of neuroimaging methods or theory. Frankly it was bewildering. Just figuring out which methods to pay attention to, or what problems to look out for, was a headache-inducing nightmare. But I had to start somewhere and so I started by sharing, and sharing often. As a result almost every day I get amazing feedback pointing out critical insights or flaws in the things I share that I would have otherwise missed. In this way the entire world has become my interactive classroom! It is difficult to overstate the degree to which this interaction has enriched my abilities as a scientist and thinker.

It is only natural however for more senior investigators to worry about how much time one might spend on all this. I admit in the early days of my PhD I may have spent a bit too long lingering amongst the RSS trees and twitter swarms. But then again, it is difficult to place a price on the knowledge and know-how I garnered in this process (not to mention the invaluable social capital generated in building such a network!). I am a firm believer in “power procrastination”, which is just the process of regularly switching from more difficult but higher priority to more interesting but lower priority tasks. I believe that by spending my downtime taking in and sharing information, I’m letting my ‘default mode’ take a much needed rest, while still feeding it with inputs that will actually make the hard tasks easier.

In all, on a good day I'd say I spend about 20 minutes each morning taking in inputs and another 20 minutes throughout the day sharing them. Of course some days (looking at you, Fridays) I don't always adhere to that, and there are those times when I have to 'just say no' and wait until the evening to get into that workflow. Pomodoro-style productivity apps have helped make sure I respect the balance when particularly difficult tasks arise. All in all however, the time I spend sharing is paid back tenfold in new knowledge and deeper understanding.

Really I should be thanking all of you, the invaluable peers, friends, colleagues, followers, and readers who give me the feedback that is so totally essential to my cognitive evolution. So long as you keep reading- I’ll keep sharing! Thanks!!

Notes: I haven’t even touched on the value of blogging and post-publication peer review, which of course sums with the benefits mentioned here, but also has vastly improved my writing and comprehension skills! But that’s a topic for another post!

( don’t worry, the skim-share cycle is no replacement for deep individual learning, which I also spend plenty of time doing!)

“you are a von economo neuron!” – Francesca 🙂

Fun fact – I read the excellent scifi novel Accelerando just prior to beginning my PhD. In the novel the main character is an info-addict who integrates so much information he gains a "5 second" prescience on events as they unfold. He then shares these insights for free with anyone who wants them, generating billion-dollar companies (of which he owns no part) and gradually manipulating global events to bring about a technological singularity. I guess you could say I found this to be a pretty neat character 🙂 In a serious vein though, I am a firm believer in free and open science, self-publication, and sharing-based economies. Information deserves to be free!

When is expectation not a confound? On the necessity of active controls.

Learning and plasticity are hot topics in neuroscience. Whether exploring old world wisdom or new age science fiction, the possibility that playing videogames might turn us into attention superheroes or that practicing esoteric meditation techniques might heal troubled minds is an exciting avenue for research. Indeed findings suggesting that exotic behaviors or novel therapeutic treatments might radically alter our brain (and behavior) are ripe for sensational science-fiction headlines purporting vast brain benefits. For those of you not totally bored of methodological crises, here we have one brewing anew. You see, the standard recommendation for those interested in intervention research is the active-controlled experimental design. Unfortunately in both clinical research on psychotherapy (including meditation) and more sci-fi areas of brain training and gaming, use of active controls is rare at best when compared to the more convenient (but causally uninformative) passive control group. Now a new article in Perspectives on Psychological Science suggests that even standard active controls may not be sufficient to rule out confounds in the treatment effect of interest.

Why is that? And why exactly do we need active controls in the first place? As the authors clearly point out, what you want to show with such a study is the causal efficacy of the treatment of interest. Quite simply, that means the thing you think should have some interesting effect should actually be causally responsible for creating that effect. If you want to argue that standing upside down for twenty minutes a day will make me better at playing videogames in Australia, it must be shown that it is actually standing upside down that causes my increased performance down under. If my improved performance on Minecraft Australian Edition is simply a product of my belief in the power of standing upside down, or my expectation that standing upside down is a great way to best kangaroo-creepers, then we have no way of determining what actually produced that performance benefit. Research on placebos and the power of expectations shows that these kinds of subjective beliefs can have a big impact on everything from attentional performance to mortality rates.

Useful flowchart from Boot et al on whether or not a study can make causal claims for treatment.

Typically researchers attempt to control for such confounds through the use of a control group performing a task as similar as possible to the intervention of interest. But how do we know participants in the two groups don't end up with different expectations about how they should improve as a result of the training? Boot et al point out that without actually measuring these variables, we have no idea and no way of knowing for sure that expectation biases don't produce our observed improvements. They then provide a rather clever demonstration of their concern, in an experiment where participants viewed videos of various cognition tests as well as videos of a training task they might later receive, in this case either the first-person shooter Unreal Tournament or the spatial puzzle game Tetris. Finally they asked the participants in each group which tests they thought they'd do better on as a result of the training. Importantly the authors show that not only did UT and Tetris lead to significantly different expectations, but also that those expected benefits were specific to the modality of the trained and tested tasks. Thus participants who watched the action-intensive Unreal Tournament videos expected greater improvements on tests of reaction time and visual performance, whereas participants viewing Tetris rated themselves as likely to do better on tests of spatial memory.

This is a critically important finding for intervention research. Many researchers, myself included, have often thought of the expectation and demand characteristic confounds in a rather general way. Generally speaking until recently I wouldn’t have expected the expectation bias to go much beyond a general “I’m doing something effective” belief. Boot et al show that our participants are a good deal cleverer than that, forming expectations-for-improvement that map onto specific dimensions of training. This means that to the degree that an experimenter’s hypothesis can be discerned from either the training or the test, participants are likely to form unbalanced expectations.

The good news is that the authors provide several reasonable fixes for this dilemma. The first is simply to measure participants' expectations, specifically in relation to the measures of interest. Another useful suggestion is to run pilot studies ensuring that the two treatments do not evoke differential expectations, or similarly to check that your outcome measures are not subject to these biases. Boot and colleagues throw down the proverbial glove, daring readers to attempt experiments where the "control condition" actually elicits greater expectations yet the treatment effect is preserved. Further common concerns, such as worries about balancing false positives against false negatives, are addressed at length.
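That first fix is cheap to implement. Here is a minimal sketch with made-up pilot ratings (and note that a proper check would use equivalence testing rather than celebrating a non-significant p-value):

```python
import numpy as np
from scipy import stats

# Hypothetical expected-improvement ratings (1-7) gathered after each training video
treatment_expectations = np.array([5, 6, 5, 4, 6, 5, 5, 6])
control_expectations = np.array([5, 5, 6, 4, 5, 6, 5, 5])

# Quick check that the two conditions evoke comparable expectations
t, p = stats.ttest_ind(treatment_expectations, control_expectations)
print(f"t = {t:.2f}, p = {p:.3f}")
```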

The entire article is a great read, timely and full of excellent suggestions for caution in future research. It also brought something I've been chewing on for some time quite clearly into focus. From the general perspective of learning and plasticity, I have to ask at what point an expectation is no longer a confound. Boot et al give an interesting discussion on this point, in which they suggest that even in the case of balanced expectations and positive treatment effects, an expectation-dependent response (in which outcome correlates with expectation) may still give cause for concern as to the causal efficacy of the trained task. This is a difficult question that I believe ventures far into the territory of what exactly constitutes the minimal necessary features for learning. As the authors point out, placebo and expectation effects are "real" products of the brain, with serious consequences for behavior and treatment outcome. Yet even in the medical community there is a growing understanding that such effects may be essential parts of the causal machinery of healing.

Possible outcome of a training experiment, in which the control shows no dependence between expectation and outcome (top panel) and the treatment of interest shows dependence (bottom panel). Boot et al suggest that such a case may invalidate causal claims for treatment efficacy.

To what extent might this also be true of learning or cognitive training? For sure we can assume that expectations shape training outcomes, otherwise the whole point about active controls would be moot. But can one really have meaningful learning if there is no expectation to improve? I realize that from an experimental/clinical perspective, the question is not “is expectation important for this outcome” but “can we observe a treatment outcome when expectations are balanced”. Still when we begin to argue that the observation of expectation-dependent responses in a balanced design might invalidate our outcome findings, I have to wonder if we are at risk of valuing methodology over phenomena. If expectation is a powerful, potentially central mechanism in the causal apparatus of learning and plasticity, we shouldn’t be surprised when even efficacious treatments are modulated by such beliefs. In the end I am left wondering if this is simply an inherent limitation in our attempt to apply the reductive apparatus of science to increasingly holistic domains.

Please do read the paper, as it is an excellent treatment of a critically ignored issue in the cognitive and clinical sciences. Anyone undertaking related work should expect this reference to appear in reviewers' replies in the near future.

EDIT:
Professor Simons, a co-author of the paper, was nice enough to answer my question on Twitter. Simons pointed out that a study that balanced expectations, found group outcome differences, and further found correlations of those differences with expectation could conclude that the treatment was causally efficacious, but also that it depends on expectations (effect + expectation). This would obviously be superior to an unbalanced design or one without measurement of expectation, as it would actually tell us something about the importance of expectation in producing the causal outcome. Be sure to read through the very helpful FAQ they've posted as an addendum to the paper, which covers these questions and more in greater detail. Here is the answer to my specific question:

What if expectations are necessary for a treatment to work? Wouldn’t controlling for them eliminate the treatment effect?

No. We are not suggesting that expectations for improvement must be eliminated entirely. Rather, we are arguing for the need to equate such expectations across conditions. Expectations can still affect the treatment condition in a double-blind, placebo-controlled design. And, it is possible that some treatments will only have an effect when they interact with expectations. But, the key to that design is that the expectations are equated across the treatment and control conditions. If the treatment group outperforms the control group, and expectations are equated, then something about the treatment must have contributed to the improvement. The improvement could have resulted from the critical ingredients of the treatment alone or from some interaction between the treatment and expectations. It would be possible to isolate the treatment effect by eliminating expectations, but that is not essential in order to claim that the treatment had an effect.

In a typical psychology intervention, expectations are not equated between the treatment and control condition. If the treatment group improves more than the control group, we have no conclusive evidence that the ingredients of the treatment mattered. The improvement could have resulted from the treatment ingredients alone, from expectations alone, or from an interaction between the two. The results of any intervention that does not equate expectations across the treatment and control condition cannot provide conclusive evidence that the treatment was necessary for the improvement. It could be due to the difference in expectations alone. That is why double blind designs are ideal, and it is why psychology interventions must take steps to address the shortcomings that result from the impossibility of using a double blind design. It is possible to control for expectation differences without eliminating expectations altogether.

MOOC on non-linear approaches to social and cognitive sciences. Votes needed!

My colleagues at Aarhus University have put together a fascinating proposal for a Massive Online Open Course (MOOC) on “Analyzing Behavioral Dynamics: non-linear approaches to social and cognitive sciences”. I’ve worked with Riccardo and Kristian since my masters and I can promise you the course will be excellent. They’ve spent the past 5 years exhaustively pursuing methodology in non-linear dynamics, graph theoretical, and semantic/semiotic analyses and I think will have a lot of interesting practical insights to offer. Best of all the course is free to all, as long as it gets enough votes on the MPF website. I’ve been a bit on the fence regarding my feelings about MOOCs, but in this case I think it’s really a great opportunity to give novel methodologies more exposure. Check it out- if you like it, give them a vote and consider joining the course!

https://moocfellowship.org/submissions/analyzing-behavioral-dynamics-non-linear-approaches-to-social-and-cognitive-sciences

Course Description

In the last decades, the social sciences have come to confront the temporal nature of human behavior and cognition: How do changes of heartbeat underlie emotions? How do we regulate our voices in a conversation? How do groups develop coordinative strategies to solve complex problems together?
This course enables you to tackle this sort of question: it addresses methods of analysis from nonlinear dynamics and complexity theory, which are designed to find and characterize patterns in this kind of complicated data. Traditionally developed in fields like physics and biology, non-linear methods are often neglected in social and cognitive sciences.

The course consists of two parts:

  1. The dynamics of behavior and cognition
    In this part of the course you are introduced to some examples of human behavior that challenge the assumptions of linear statistics: reading times, voice dynamics in clinical populations, etc. You are then shown step by step how to characterize and quantify patterns and temporal dynamics in these behaviors using non-linear methods, such as recurrence quantification analysis (see the sketch after this section).
  2. The dynamics of interpersonal coordination
    In this second part of the course we focus on interpersonal coordination: how do people manage to coordinate action, emotion and cognition? We consider several real-world cases: heart beat synchronization during firewalking rituals, voice adaptation during conversations, joint problem solving in creative tasks – such as building Lego models together. You are then shown ways to analyze how two or more behaviors are coordinated and how to characterize their coupling – or lack thereof.

This course provides a theoretical and practical introduction to non-linear techniques for social and cognitive sciences. It presents concrete case studies from actual research projects on human behavior and cognition. It encourages you to put all this to practice via practical exercises and quizzes. By the end of this course you will be fully equipped to go out and do your own research projects applying non-linear methods on human behavior and coordination.
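To give a flavor of the methods on offer, here is a bare-bones sketch of the recurrence idea underlying recurrence quantification analysis (one-dimensional, skipping the delay embedding a real analysis would use):

```python
import numpy as np

# A noisy oscillation standing in for a behavioral time series
rng = np.random.default_rng(2)
x = np.sin(np.linspace(0, 6 * np.pi, 200)) + rng.normal(0, 0.1, 200)

# Point (i, j) "recurs" when the states at times i and j are within a radius
radius = 0.2
distances = np.abs(x[:, None] - x[None, :])
recurrence = (distances < radius).astype(int)

# Recurrence rate, the simplest RQA measure: how often does the system revisit itself?
print(recurrence.mean())
```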

Learning objectives

  • Given a timeseries (e.g. a speech recording, or a sequence of reaction times), characterize its patterns: does it contain repetitions? How stable? How complex?
  • Given a timeseries (e.g. a speech recording, or a sequence of reaction times), characterize how it changes over time.
  • Given two timeseries (e.g. the movements of two dancers) characterize their coupling: how do they coordinate? Do they become more similar over time? Can you identify who is leading and who is following?

MOOC relevance

Social and cognitive research is increasingly investigating phenomena that are temporally unfolding and non-linear. However, most educational institutions only offer courses in linear statistics for social scientists. Hence, there is a need for an easy-to-understand introduction to non-linear analytical tools specifically aimed at the social and cognitive sciences. The combination of actual cases and concrete tools to analyze them will give the course a wide appeal.
Additionally, methods oriented courses on MOOC platforms such as Coursera have generally proved very attractive for students.

Please spread the word about this interesting course!

Twitter Follow-up: Can MVPA Invalidate Simulation Theory?

Thanks to the wonders of social media, while I was out grocery shopping I received several interesting and useful responses to my previous post on the relationship between multivariate pattern analysis and simulation theory. Rather than try to fit my responses into 140 characters, I figured I'd take a bit more space here to hash them out. I think the idea is really enhanced by these responses, which point to several findings and features of which I was not aware. The short answer seems to be: no, MVPA does not invalidate simulation theory (ST), and it may even provide evidence for it in the realm of motor intentions, but we might be able to point towards a better standard of evidence for more exploratory applications of ST (e.g. empathy-for-pain). An important point to come out of these responses, as one might expect, is that the interpretation of these methodologies is not always straightforward.

I’ll start with Antonia Hamilton’s question, as it points to a bit of literature that speaks directly to the issue:

[Tweet from Antonia Hamilton]

Antonia is referring to this paper by Oosterhof and colleagues, where they directly compare passive viewing and active performance of the same paradigm using decoding techniques. I don’t read nearly as much social cognition literature as I used to, and wasn’t previously aware of this paper. It’s really a fascinating project and I suggest anyone interested in this issue read it at once (it’s open access, yay!). In the introduction the authors point out that spatial overlap alone cannot demonstrate equivalent mechanisms for viewing and performing the same action:

Numerous functional neuroimaging studies have identified brain regions that are active during both the observation and the execution of actions (e.g., Etzel et al. 2008; Iacoboni et al. 1999). Although these studies show spatial overlap of frontal and parietal activations elicited by action observation and execution, they do not demonstrate representational overlap between visual and motor action representations. That is, spatially overlapping activations could reflect different neural populations in the same broad brain regions (Gazzola and Keysers 2009; Morrison and Downing 2007; Peelen and Downing 2007b). Spatial overlap of activations per se cannot establish whether the patterns of neural response are similar for a given action (whether it is seen or performed) but different for different actions, an essential property of the “mirror system” hypothesis.”

They then go on to explain that while MVPA could conceivably demonstrate a simulation-like mechanism (i.e. a common neural representation for viewing/doing), several previous papers attempting to show just that failed to do so. The authors suggest that this may be due to a variety of methodological limitations, which they set out to correct in their JNPhys publication. Oosterhof et al show that clusters of voxels located primarily in the intraparietal and superior temporal sulci encode cross-modal information, that is, code similar information both when viewing and doing:

From Oosterhof et al, showing combined classification accuracy for (train see, test do; train do, test see).

Essentially, Oosterhof et al trained their classifier on one modality (see or do), tested the classifier on the opposite modality in another session, and then repeated this procedure for all possible combinations of session and modality (while appropriately correcting for multiple comparisons). The map above represents the combined classification accuracy from both train-test combinations; interestingly, the supplementary info shows that the maps differ slightly depending on what was trained:

From the supplementary info: A shows the classifier trained on see and tested on do; B shows the opposite.
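For concreteness, here is what that train-on-one-modality, test-on-the-other logic looks like in a toy sklearn sketch (simulated patterns with a shared gesture signal; the shapes and values are assumptions, not their pipeline):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(3)
n_trials, n_voxels = 40, 50
labels = np.repeat([0, 1], n_trials // 2)        # two gestures

# Simulated voxel patterns sharing a gesture signal across modalities
gesture_patterns = rng.normal(size=(2, n_voxels))
see = gesture_patterns[labels] + rng.normal(0, 2.0, (n_trials, n_voxels))
do = gesture_patterns[labels] + rng.normal(0, 2.0, (n_trials, n_voxels))

clf = LinearSVC()
acc_see_to_do = clf.fit(see, labels).score(do, labels)   # train see, test do
acc_do_to_see = clf.fit(do, labels).score(see, labels)   # train do, test see
print((acc_see_to_do + acc_do_to_see) / 2)  # above 0.5 suggests cross-modal information
```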

Oosterhof and colleagues also investigate the specificity of information for particular gestures in a second experiment, but for our purposes let's focus on just the first. My first thought is that this does actually provide some evidence for a simulation theory of understanding motor intentions. Clearly there is enough information in each modality to accurately decode the opposite modality: there are populations of neurons encoding similar information both for action execution and perception. Realistically I think this has to be the minimal burden of proof needed to consider an imaging finding to be evidence for simulation theory. So the results of Oosterhof et al do provide supporting evidence for simulation theory in the domain of motor intentions.

Nonetheless, the results also strengthen the argument that more exploratory extensions of ST (like empathy-for-pain) must be held to a similar burden of proof before generalization in these domains is supported. Simply showing spatial overlap is not evidence of simulation, as Oosterhof et al themselves argue. I think it is interesting to note the slight spatial divergence between the two train-test maps (see on do, do on see). While we can obviously identify voxels encoding cross-modality information, it is interesting that those voxels do not subsume the entirety of whatever neural computation relates these two modalities; each has something unique to predict in the other. I don't think that observation invalidates simulation theory, but it might suggest an interesting mechanism not specified in the 'vanilla' flavor of ST. To be extra boring, it would be really nice to see an independent replication of this finding, since as Oosterhof et al themselves point out, the evidence for cross-modal information is inconsistent across studies. Even though the classifier performs well above chance in this study, it is also worth noting that the majority of surviving voxels show somewhere around 40-50% classification accuracy, not exactly gangbusters. It would be interesting to see if they could identify voxels within these regions that selectively encode only viewing or performing; this might be evidence for a hybrid-theory account of motor intentions.

[Tweet from Leonhard]

Leonhard’s question is an interesting one that I don’t have a ready response for. As I understand it, the idea is that demonstrating no difference of patterns between a self and other-related condition (e.g. performing an action vs watching someone else do it) might actually be an argument for simulation, since this could be caused by that region using isomorphic computations for both conditions. This an interesting point – i’m not sure what the status of null findings is in the decoding literature, but this merits further thought.

The next two came from James Kilner and Tal Yarkoni. I've put them together as I think they fall under a more methodological class of questions/comments and I don't feel quite experienced enough to answer them – but I'd love to hear from someone with more experience in multivariate/multivoxel techniques:

[Tweet from James Kilner]

[Tweet from Tal Yarkoni]

James Kilner asks about the performance of MVPA in the case that the pattern might be spatially overlapping but not identical for two conditions. This is an interesting question and I'm not sure I know the correct answer; my intuition is that you could accurately discriminate both conditions using the same voxels, and that this would be strong evidence against a simple simulation theory account (spatial overlap but representational heterogeneity).

Here is more precise answer to James’ question from Sam Schwarzkopf, posted in the comments of the original post:

2. The multivariate aspect obviously adds sensitivity by looking at pattern information, or generally any information of more than one variable (e.g. voxels in a region). As such it is more sensitive to the information content in a region than just looking at the average response from that region. Such an approach can reveal that region A contains some diagnostic information about an experimental variable while region B does not, even though they both show the same mean activation. This is certainly useful knowledge that can help us advance our understanding of the brain – but in the end it is still only one small piece in the puzzle. And as both Tal and James pointed out (in their own ways) and as you discussed as well, you can’t really tell what the diagnostic information actually represents.
Conversely, you can’t be sure that just because MVPA does not pick up diagnostic information from a region that it therefore doesn’t contain any information about the variable of interest. MVPA can only work as long as there is a pattern of information within the features you used.

This last point is most relevant to James’ comment. Say you are using voxels as features to decode some experimental variable. If all the neurons with different tuning characteristics in an area are completely intermingled (like orientation-preference in mouse visual cortex for instance) you should not really see any decoding – even if the neurons in that area are demonstrably selective to the experimental variable.

In general it is clear that the interpretation of decoded patterns is not straightforward – it isn't clear precisely what information they reflect, and it seems that if a region contained a totally heterogeneous population of neurons you wouldn't pick up any decoding at all. With respect to ST, I don't know if this completely invalidates our ability to test predictions – I don't think one would expect such radical heterogeneity in a region like STS, but rather a few sub-populations responding selectively to self and other, which MVPA might be able to reveal. It's an important point to consider though.

Tal’s point is an important one regarding the different sources of information that GLM and MVPA techniques pick up. The paper he refers to by Jimura and Poldrack set out to investigate exactly this by comparing the spatial conjunction and divergent sensitivity of each method. Importantly they subtracted the mean of each beta-coefficient from the multivariate analysis to insure that the analysis contained only information not in the GLM:

[Figure from Jimura and Poldrack]

As you can see above, Jimura and Poldrack show that MVPA picks up a large number of voxels not found in the GLM analysis. Their interpretation is that the GLM is designed to pick up regions responding globally, or in most cases, to stimulation, whereas MVPA likely picks up globally distributed responses that show variance in their response. This is a bit like the difference between functional integration and localization; both are complementary to the understanding of some cognitive function. I take Tal's point to be that MVPA and GLM are sensitive to different sources of information and that this blurs the ability of the technique to evaluate simulation theory – you might observe differences between the two that would resemble evidence against ST (different information in different areas) when in reality you would be modelling altogether different aspects of cognition. Edit: after more discussion with Tal on Twitter, it's clear that he meant to point out the ambiguity inherent in interpreting the predictive power of MVPA; by nature these analyses will pick up a lot of confounding, non-causal noise – arousal, reaction time, respiration, etc. – which would be excluded in a GLM analysis. So these are not necessarily, or even likely to be, "direct read-outs" of representations, particularly to the extent that such confounds correlate with the task. See this helpful post by Neuroskeptic for an overview of one recent paper examining this issue, and see here for a study investigating the complex neurovascular origins of MVPA for fMRI.
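The demeaning step is worth seeing in code, since it is what licenses the "information not in the GLM" claim. A sketch on a fake beta matrix (this is my reading of the procedure, not their exact implementation):

```python
import numpy as np

betas = np.random.default_rng(4).normal(size=(40, 50))  # trials x voxels, fake data

# Remove each trial's spatial mean so the region-average response (what a
# univariate GLM contrast tracks) can no longer drive the classifier
betas_demeaned = betas - betas.mean(axis=1, keepdims=True)
print(np.allclose(betas_demeaned.mean(axis=1), 0))  # True: mean signal removed
```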

Thanks sincerely for these responses, as it's been really interesting and instructive for me to go through these papers and think about their implications. I'm still new to these techniques and it is exciting to gain a deeper appreciation of the subtleties involved in their interpretation. On that note, I must direct you to check out Sam Schwarzkopf's excellent reply to my original post. Sam points out some common misunderstandings regarding the interpretation of MVPA/decoding versus GLM techniques (several of which I am perhaps guilty of myself), arguing essentially that they pick up much of the same information and can both be considered 'decoding' in some sense, further muddying their ability to resolve debates like that surrounding simulation theory.