Human activity recognition is typically addressed by detecting key cues such as global and local motion, features related to the object classes present in the scene, and features describing the global context.
Disclaimer: The work described in this post was done by me and my classmate at IIT-Kanpur, Ankit Goyal. Here is a link to the presentation that we gave.
This is a follow-up to my earlier post, in which I explored temporal models that can be applied to things like part-of-speech tagging, gesture recognition, and sequential or temporal sources of data in general. In this post, I will describe in more detail the implementation of our project, which classified RGBD videos according to the activity being performed in them.
Dataset
Quite a few RGBD datasets are available for human activity detection/classification, and we chose to use the MSR Daily Activity 3D dataset. Since we had limited computational resources (the math server of IITK) and limited time before the submission deadline, we used a subset of the dataset and worked with only 6 activities. Our problem was thus reduced to 6-class classification.
Features
In any machine learning problem, your model or learning algorithm is useless without a good set of features. I read a recent paper which had a decent review of the various features used for this task.
The features that we ultimately went ahead with were the skeletal joints. The MSR Daily Activity 3D dataset already provides the skeletal joint coordinates, so all we had to do was take that data and do some basic pre-processing on it.
Preprocessing the features
The dataset provides us with the 3D coordinates of 15 human body joints. These coordinates are in the frame of reference of the Kinect. The first operation that we perform on them is to transform the points from the Kinect reference frame to the frame of the person. By the frame of the person, we mean a frame centred at the joint corresponding to the torso.
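A minimal sketch of this step in MATLAB, assuming the joints of one frame are stored as a 15x3 matrix with the torso as joint 1 (the indexing is illustrative, not necessarily the MSR file layout):

```matlab
% joints: 15x3 matrix of joint coordinates for one frame, in the Kinect frame.
torso = joints(1, :);                            % assumed torso joint (illustrative index)
jointsPerson = bsxfun(@minus, joints, torso);    % express every joint relative to the torso
```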
The next thing that we do is what we call “body size normalization”: all the body lengths, such as the distance between the elbow and the hand, are scaled up or down to a standard body size. This ensures that the variation in body sizes is captured at the feature level itself, and our model does not have to worry about it anymore.
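A rough sketch of the idea, assuming a hypothetical kinematic tree where parent(j) gives the parent joint of joint j (with joints ordered so that parents come before children) and refLen(j) is the standard length of the bone ending at joint j:

```matlab
% jointsPerson: 15x3 torso-centred joint coordinates for one frame (joint 1 = torso/root).
normJoints = jointsPerson;
for j = 2:size(jointsPerson, 1)
    bone = jointsPerson(j, :) - jointsPerson(parent(j), :);   % original bone vector
    bone = bone / norm(bone) * refLen(j);                     % rescale to the standard bone length
    normJoints(j, :) = normJoints(parent(j), :) + bone;       % rebuild the joint position
end
```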
Click here to get the MATLAB code that does the feature extraction from the skeleton files obtained from the MSR dataset.
Model
Now, as I discussed in my previous post, Hidden Conditional Random Fields (HCRFs) were the model that we finally selected. The original authors had released a well-documented toolbox, to which we directly fed the features computed above.
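Roughly, the features were arranged as one sequence per video before being handed to the toolbox (the helper names below are illustrative, and the exact training call depends on the toolbox version):

```matlab
% One sequence per video: a D x T matrix of per-frame skeletal features,
% plus a single activity label (1-6) for the whole sequence.
numVideos = numel(skeletonFiles);                             % illustrative list of skeleton files
seqs   = cell(1, numVideos);
labels = zeros(1, numVideos);
for i = 1:numVideos
    seqs{i}   = extractSkeletonFeatures(skeletonFiles{i});    % illustrative helper, returns D x T
    labels(i) = activityLabelOf(skeletonFiles{i});            % illustrative helper, returns 1-6
end
% seqs and labels are then passed to the HCRF toolbox's training routine.
```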
Results
Five-fold cross-validation without any hyper-parameter tuning yielded a precision of 71%. These results do not seem impressive at first glance, but it must be noted that all our experiments were performed in the “new person” setting, i.e., the person in the test set did not appear in the training set (a sketch of how such person-disjoint folds can be constructed appears below). Our results can be summarised in the following heatmap:
The heatmap made one thing clear: accuracy is being seriously hurt by the algorithm’s inability to correctly distinguish between drinking and talking on the phone. The reason for this is relatively simple. The features that we are using are skeletal features, so we do not pay any attention to the objects the human is interacting with. If you look only at the skeletal stream, talking on the phone and drinking water seem extremely similar! In both cases, the human raises a hand and brings it near his head. Thus, in order to build a truly useful activity detection system, it is important to model these interactions explicitly.
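For completeness, here is the promised sketch of the person-disjoint ("new person") folds (variable names follow the sketch in the Model section and are illustrative):

```matlab
% subjectID(i): ID of the person performing sequence i (illustrative).
subjects   = unique(subjectID);
numFolds   = 5;
foldOfSubj = mod(0:numel(subjects)-1, numFolds) + 1;        % spread subjects across 5 folds
for k = 1:numFolds
    isTest      = ismember(subjectID, subjects(foldOfSubj == k));
    trainSeqs   = seqs(~isTest);    trainLabels = labels(~isTest);
    testSeqs    = seqs(isTest);     testLabels  = labels(isTest);
    % ... train the HCRF on trainSeqs/trainLabels, evaluate on testSeqs/testLabels ...
end
```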
If we do get around to improving this model, I will post it here.
The data set contains the following variables:

- actid — Response vector containing the activity IDs as integers: 1, 2, 3, 4, and 5, representing Sitting, Standing, Walking, Running, and Dancing, respectively.
- actnames — Activity names corresponding to the integer activity IDs.
- feat — Feature matrix of 60 features for 24,075 observations.
- featlabels — Labels of the 60 features.

The Sensor HAR (human activity recognition) app was used to create the humanactivity data set. When measuring the raw acceleration data with this app, a person placed a smartphone in a pocket so that the smartphone was upside down and the screen faced toward the person.
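Assuming the data set ships as humanactivity.mat with Statistics and Machine Learning Toolbox (as it does in recent MATLAB releases), it can be loaded and inspected like this:

```matlab
load humanactivity                      % loads actid, actnames, feat, featlabels
whos actid actnames feat featlabels     % check the sizes of the loaded variables
tabulate(actid)                         % class balance across the five activities
```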
The software then calibrated the measured raw data and extracted the 60 features from the calibrated data. For details about the calibration and feature extraction, see the corresponding sections of the MATLAB documentation. The Simulink models described later also use the raw acceleration data and include blocks for calibration and feature extraction.

Prepare Data

This example uses 90% of the observations to train a model that classifies the five types of human activities and 10% of the observations to validate the trained model. Use cvpartition to specify a 10% holdout for the test set.
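A minimal sketch of this partition, assuming the variables listed above are in the workspace:

```matlab
rng('default')                                % for reproducibility
c = cvpartition(actid, 'Holdout', 0.10);      % 10% holdout for the test set
XTrain = feat(training(c), :);                % 90% of observations for training
YTrain = actid(training(c));
XTest  = feat(test(c), :);                    % 10% of observations for validation
YTest  = actid(test(c));
```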
tTrain.Properties.VariableNames = [featlabels' {'Activities'}];

Train Boosted Tree Ensemble Using Classification Learner App

Train a classification model by using the Classification Learner app. To open the Classification Learner app, enter classificationLearner at the command line. Alternatively, click the Apps tab, and click the arrow at the right of the Apps section to open the gallery.
Then, under Machine Learning, click Classification Learner. On the Classification Learner tab, in the File section, click New Session and select From Workspace. In the New Session dialog box, click the arrow for Workspace Variable, and then select the table tTrain. Classification Learner detects the predictors and the response from the table.

saveLearnerForCoder(classificationEnsemble, 'EnsembleModel.mat');

The function block predictActivity in the Simulink models loads the trained model by using loadLearnerForCoder and uses the trained model to classify new data.

Deploy Simulink Model to Device

Now that you have prepared a classification model, you can open the Simulink model, depending on which type of smartphone you have, and deploy the model to your device. Note that the Simulink model requires the EnsembleModel.mat file and the calibration matrix file slexHARAndroidCalibrationMatrix.mat or slexHARiOSCalibrationMatrix.mat; the calibration matrix files are included with the MATLAB example. Type slexHARAndroidExample to open the Simulink model for Android deployment. The Accelerometer block receives raw acceleration data from accelerometer sensors on the device. The calibrate block is a MATLAB Function block that calibrates the raw acceleration data.
This block uses the calibration matrix in the slexHARAndroidCalibrationMatrix.mat file or the slexHARiOSCalibrationMatrix.mat file. The display blocks Acc X, Acc Y, and Acc Z are connected to the calibrate block and display calibrated data points for each axis on the device. Each of the Buffer blocks, X Buffer, Y Buffer, and Z Buffer, buffers 32 samples of an accelerometer axis with 12 samples of overlap between buffered frames.
After collecting 20 samples, each Buffer block joins the 20 samples with 12 samples from the previous frame and passes the total 32 samples to the extractFeatures block. Each Buffer block receives an input sample every 0.1 second and outputs a buffered frame including 32 samples every 2 seconds.The extractFeatures block is a MATLAB Function block that extracts 60 features from a buffered frame of 32 accelerometer samples.
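The exact 60 features are defined by the example's extractFeatures function. As a rough illustration only (these are hypothetical time-domain features, not the example's actual feature set), such a function block computes per-axis statistics over each 32-sample frame along these lines:

```matlab
function f = extractFrameFeatures(frame)
% frame: 32x3 matrix of calibrated accelerometer samples (x, y, z columns).
% Illustrative time-domain features only; the real block also uses spectral
% features computed with DSP System Toolbox and Signal Processing Toolbox.
mu    = mean(frame);                    % mean per axis
sd    = std(frame);                     % standard deviation per axis
rmsv  = sqrt(mean(frame.^2));           % RMS per axis
rangv = max(frame) - min(frame);        % range per axis
f = [mu sd rmsv rangv];                 % 12 features in this sketch
end
```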
The extractFeatures block uses DSP System Toolbox™ and Signal Processing Toolbox™. The predictActivity block is a MATLAB Function block that loads the trained model from the EnsembleModel.mat file by using loadLearnerForCoder and classifies the user activity using the extracted features (a sketch of such a function block appears at the end of this section). The output is an integer between 1 and 5, corresponding to Sitting, Standing, Walking, Running, and Dancing, respectively. The Predicted Activity block displays the classified user activity values on the device. The Video Output subsystem uses a multiport switch block to choose the corresponding user activity image data to display on the device. The Convert to RGB block decomposes the selected image into separate RGB vectors and passes the image to the Activity Display block. To deploy the Simulink model to your device, follow the steps in the documentation for the Simulink support package for your device. Run the model on your device, place the device in the same way as described earlier for collecting the training data, and try the five activities.
The model displays the classified activity accordingly. To ensure the accuracy of the model, you need to place your device in the same way as described for collecting the training data. If you want to place your device in a different location or orientation, then collect the data in your own way and use your data to train the classification model. The accuracy of the model on the device can differ from the accuracy on the test data set (testaccuracy), depending on the device. To improve the model, you can consider using additional sensors and updating the calibration matrix.
Also, you can add another output block for audio feedback to the output subsystem using Audio Toolbox™, and you can use a ThingSpeak™ write block to publish classified activities and acceleration data from your device to the Internet of Things. For details, see the ThingSpeak documentation.
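For reference, here is a minimal sketch of what the predictActivity MATLAB Function block described above might look like. The persistent-variable pattern avoids reloading the model on every simulation step; this is an illustration under those assumptions, not the example's exact code:

```matlab
function activity = predictActivity(features)
%#codegen
% features: 1x60 row vector produced by the extractFeatures block.
persistent mdl
if isempty(mdl)
    mdl = loadLearnerForCoder('EnsembleModel');   % load the trained ensemble once
end
activity = predict(mdl, features);                % integer 1-5 (Sitting ... Dancing)
end
```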