Blog Post
Tolya Evdokimov
Author
#data_pipeline #data_analysis #eye_movements #data_science #R
Clear data pipelines are essential for getting consistent results from analyses in both research and industry settings. Sometimes data collected through proprietary software locks you into that software's ecosystem. I thought it would be interesting to pull together some pieces of code I used to mechanically clean such data and put them in a "semi-tidy" format: just enough to conduct the necessary tests.
Specifically, the notebook I'm about to walk you through comes from a visual search project conducted by Dr. Arryn Robbins. I primarily used R and tidyverse for cleaning, along with some regex and string manipulation to get the feature names in order. Hopefully, this will be interesting and will highlight some of the difficulties of using proprietary software for stimulus presentation and data collection.
I will start by cleaning and preprocessing the behavioral dataset, which mostly contains responses to questions inside the experiment and the reaction times for every trial.
I read in the data and examine the dataset. The first problem is obvious: the column names. Many of them contain symbols that are special in R and many other programming languages, like square brackets in BlockNum[Session] or dots in PracticeList.Sample. There are also combinations of both, as in Probe.Device[Block]. These names have to be fixed before any kind of data analysis can start, so I rename everything with some regex and a small function.
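To make the renaming step concrete, here is a minimal sketch in base R. It is not the exact function from my notebook: the helper name `clean_names` and the choice of underscores as the replacement character are assumptions for illustration. Only the three example column names come from the dataset described above.

```r
# Hypothetical helper: turn E-Prime-style column names into plain,
# underscore-separated names that are easy to reference in R.
clean_names <- function(x) {
  # Replace any run of square brackets or dots with a single underscore
  x <- gsub("(\\[|\\]|\\.)+", "_", x)
  # Drop a leading or trailing underscore left over from the step above
  x <- gsub("^_|_$", "", x)
  x
}

raw_names <- c("BlockNum[Session]", "PracticeList.Sample", "Probe.Device[Block]")
cleaned <- clean_names(raw_names)
print(cleaned)
```

On a real data frame, the same helper would be applied with something like `names(df) <- clean_names(names(df))`.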
Before moving on, I will examine the response accuracy rate of every participant and manually remove participants whose accuracy was below a certain level. Here I end up not removing any participants.
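A sketch of what this accuracy check can look like, using base R and toy data. The column names (`subject`, `correct`) and the 0.7 cutoff are assumptions made up for this example, not the values used in the project.

```r
# Toy behavioral data: one row per trial, correct = 1 for a right answer.
# Column names and values are hypothetical.
behavioral <- data.frame(
  subject = c(1, 1, 1, 1, 2, 2, 2, 2),
  correct = c(1, 1, 1, 0, 1, 0, 0, 0)
)

# Proportion of correct responses per participant
accuracy <- aggregate(correct ~ subject, data = behavioral, FUN = mean)
names(accuracy)[2] <- "accuracy"

# Assumed cutoff, for illustration only
min_accuracy <- 0.7
low_performers <- accuracy$subject[accuracy$accuracy < min_accuracy]

# Keep only trials from participants at or above the cutoff
behavioral_kept <- behavioral[!behavioral$subject %in% low_performers, ]
print(accuracy)
```

In this toy example, one participant falls below the cutoff and is dropped. In the real data, nobody did.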
Now I will do some work with the eye movement dataset. First, load the CSV.
In a visual search experiment, the objects that participants are asked to find are called targets, and all other objects are distractors. By default, the program adds T or D in front of each stimulus file name to mark whether it is a target or a distractor, so to make future analyses easier, I will create another feature that marks this explicitly. I will also mark whether a particular stimulus was fixated or not with either 1 or 0.
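The two new features can be sketched like this. The file names, the `fixation_count` column, and the rule "fixated means at least one fixation" are all assumptions for the sake of the example; the real export may encode fixations differently.

```r
# Toy eye movement data; file names and the fixation_count column are hypothetical.
eye <- data.frame(
  stimulus = c("T_mug01.png", "D_mug02.png", "D_lamp07.png", "T_lamp03.png"),
  fixation_count = c(2, 0, 1, 0),
  stringsAsFactors = FALSE
)

# The first character of the file name encodes the stimulus type
prefix <- substr(eye$stimulus, 1, 1)
eye$stim_type <- ifelse(prefix == "T", "target",
                        ifelse(prefix == "D", "distractor", NA_character_))

# 1 if the stimulus received at least one fixation, 0 otherwise (assumed rule)
eye$fixated <- as.integer(eye$fixation_count > 0)
print(eye)
```

Anything that starts with neither T nor D ends up as `NA`, which makes unexpected file names easy to spot.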
I will now use reaction times to filter out outlier trials. I calculate an upper cutoff of 2.5 standard deviations above the mean reaction time across the whole dataset, and then keep only trials with reaction times between 200 ms and that cutoff.
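Here is a small sketch of that trimming rule in base R. The toy reaction times and the column name `rt` are made up. The 200 ms floor and the 2.5 SD cutoff are the ones described above.

```r
# Toy reaction times in milliseconds; values are hypothetical.
trials <- data.frame(
  trial = 1:10,
  rt = c(150, 620, 700, 580, 640, 710, 690, 655, 600, 5000)
)

lower <- 200                                       # fixed floor in ms
upper <- mean(trials$rt) + 2.5 * sd(trials$rt)     # mean + 2.5 SD over the whole dataset

# Keep trials whose reaction time lies inside [lower, upper]
trials_trimmed <- trials[trials$rt >= lower & trials$rt <= upper, ]
print(trials_trimmed)
```

Note that the mean and SD are computed once over all trials, before any filtering. Computing them per participant would be a different (also common) choice.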
In this experiment, we were analyzing targets only, so I will only include the targets that got a fixation. I will also filter out undefined trial blocks.
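A sketch of this filter, with the caveat that how "undefined" blocks show up in the export is an assumption here: the example treats both `NA` and the literal string "UNDEFINED" as undefined. The column names are hypothetical as well.

```r
# Toy data; column names and the "UNDEFINED" marker are assumptions.
eye <- data.frame(
  stim_type = c("target", "target", "distractor", "target", "target"),
  fixated = c(1L, 0L, 1L, 1L, 1L),
  block = c("1", "2", "1", "UNDEFINED", NA),
  stringsAsFactors = FALSE
)

# A block counts as defined if it is neither missing nor the placeholder string
is_defined <- !is.na(eye$block) & eye$block != "UNDEFINED"

# Keep fixated targets from defined blocks only
targets_fixated <- eye[eye$stim_type == "target" & eye$fixated == 1L & is_defined, ]
print(targets_fixated)
```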
It is time to join the behavioral and eye movement data. I join the tables on 3 features, but first I need to convert the block number to an integer so that it matches the other dataset. After the join, I convert a number of important features to numeric values; I was quite surprised to find that they were not numeric in the first place. Then I create a new feature, time to press, and filter out some invalid values. Finally, I convert the column names to lowercase.
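The whole sequence can be sketched as follows. Several things here are assumptions for illustration: the three join keys (`Subject`, `Block`, `Trial`), the column `FirstFixTime`, the definition of time to press as reaction time minus the time of the first fixation on the target, and the rule that negative values are invalid. The notebook's actual definitions may differ.

```r
# Toy tables; all column names and values are hypothetical.
behavioral <- data.frame(
  Subject = c(1L, 1L, 2L),
  Block = c(1L, 1L, 2L),
  Trial = c(1L, 2L, 1L),
  RT = c("812", "1040", "655"),          # numbers stored as text
  stringsAsFactors = FALSE
)
eye <- data.frame(
  Subject = c(1L, 1L, 2L),
  Block = c("1", "1", "2"),              # block number stored as text
  Trial = c(1L, 2L, 1L),
  FirstFixTime = c("300", "1100", "410"),
  stringsAsFactors = FALSE
)

# Make the key types match before joining
eye$Block <- as.integer(eye$Block)

# Inner join on the three (assumed) key columns
joined <- merge(behavioral, eye, by = c("Subject", "Block", "Trial"))

# Convert text columns that should be numeric
num_cols <- c("RT", "FirstFixTime")
joined[num_cols] <- lapply(joined[num_cols], as.numeric)

# Assumed definition: time from first fixation on the target to the key press
joined$TimeToPress <- joined$RT - joined$FirstFixTime

# Assumed validity rule: drop missing or negative values
joined <- joined[!is.na(joined$TimeToPress) & joined$TimeToPress >= 0, ]

# Lowercase column names for consistency
names(joined) <- tolower(names(joined))
print(joined)
```

Converting the key type before the join matters: depending on the tool, mismatched key types either raise an error or get silently coerced, and neither is something I want to discover later in the analysis.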
Finally, I will add some more features from other tables to prepare the dataset for the analyses. These are the labels for the object categories and the image MDS distance labels.
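Adding labels from lookup tables is a left join per table. In the sketch below, the key column `image`, the label values, and the table names are all invented for the example. The row-count check at the end is the part worth keeping: it catches lookup tables with duplicate keys.

```r
# Toy trial data and lookup tables; names and label values are hypothetical.
trials <- data.frame(
  image = c("mug01", "lamp03", "mug02"),
  rt = c(512, 245, 630),
  stringsAsFactors = FALSE
)
categories <- data.frame(
  image = c("mug01", "mug02", "lamp03"),
  category = c("mug", "mug", "lamp"),
  stringsAsFactors = FALSE
)
mds <- data.frame(
  image = c("mug01", "mug02", "lamp03"),
  mds_label = c("near", "far", "near"),
  stringsAsFactors = FALSE
)

# Left joins: keep every trial, attach labels where a match exists
labelled <- merge(trials, categories, by = "image", all.x = TRUE)
labelled <- merge(labelled, mds, by = "image", all.x = TRUE)

# A lookup join should never change the number of trials
stopifnot(nrow(labelled) == nrow(trials))
print(labelled)
```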
This short notebook showed most of the steps I used to preprocess the eye movement data. Most of the code makes the data analyzable by converting it to data types compatible with the standard types in R. Mainly, I wanted this blog post to summarize the standard steps I usually perform to clean data from E-Prime. Hopefully, this notebook will help someone dealing with similar data outputs.
GitHub Repo