VirusTotal Chi2 | Malware News

VirusTotal Chi2
Hi all,

I'm currently teaching myself the basics of malware analysis for my final project at university and have been working on a script to automate some static analysis. In doing so, I've been using the VT API and noticed that some objects contain a Chi2 value. I think Chi2 measures the difference between the distributions of elements in two datasets, but I'm unsure which distributions are being compared here. To be specific, I am referring to the Chi2 value in the PEInfo Sections objects. I appreciate any help :)
Origin144

So I think there are two different uses of the chi-squared approximation algorithm here. I'll talk about each of them separately.

For the case of VirusTotal, it looks like they're applying the chi-squared approximation algorithm to the entire file stream. The purpose of this calculation is similar to that of entropy, in that it should help you determine whether or not a file is packed, encrypted, encoded, or obfuscated. The calculation is a little different from entropy, so it may help some machine learning models differentiate between specific packing, encryption, encoding, or obfuscation techniques. I don't have an intuitive sense of which chi-squared values are more or less indicative of malware, the way I do with entropy.
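As a rough illustration of that kind of calculation, here is a minimal sketch of a Pearson chi-squared statistic over byte counts, compared against a uniform distribution across all 256 byte values. Whether VirusTotal uses exactly this formulation is an assumption; their documentation only describes the field as a chi-squared test value of stream data. The function name `byte_chi2` is made up for this example.

```python
from collections import Counter


def byte_chi2(data: bytes) -> float:
    """Pearson chi-squared statistic of byte counts vs. a uniform distribution.

    Assumption: the reference distribution is "every byte value is equally
    likely". This is not confirmed to be what VirusTotal computes.
    """
    if not data:
        return 0.0
    expected = len(data) / 256  # expected count per byte value if uniform
    counts = Counter(data)      # observed count per byte value
    return sum(
        (counts.get(value, 0) - expected) ** 2 / expected
        for value in range(256)
    )


if __name__ == "__main__":
    # Perfectly flat distribution: every byte value appears 4 times.
    print(byte_chi2(bytes(range(256)) * 4))  # 0.0
    # Highly skewed distribution: a run of 256 zero bytes.
    print(byte_chi2(b"\x00" * 256))          # 65280.0
```

Under this formulation, data that looks uniformly random (as encrypted or well-compressed data tends to) should land somewhere around 255, the number of degrees of freedom, while plain text or ordinary code should score far higher. That is a property of the statistic itself, not a claim about any malware threshold.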

There was another research article posted a while back that used the chi-squared approximation calculation to measure the distance between the PE header fields expected of legitimate files and those of the file currently being examined. The assumption is that the greater the distance between the two data sets, as represented by the chi-squared approximation value, the more likely the file is to be malicious.

From a machine learning perspective, the chi-squared approximation almost seems to be a way of compressing the initial feature set. Instead of having a separate feature for each PE header field, the features are compressed into a single chi-squared approximation value, and that's what is fed into the model. The purpose would be to reduce the total number of calculations, and thus the time, required to classify an individual file. Real-time malware detection requires extremely short analysis times.
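To make the "many fields collapsed into one number" idea concrete, here is a small sketch of a chi-squared distance between a baseline feature vector and the vector for a file under inspection. This is not the method from the linked article, which I have not reproduced. The function name `chi2_distance` and all the numbers are hypothetical and stand in for whatever numeric PE header fields a model might use.

```python
from typing import Sequence


def chi2_distance(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Sum of (o - e)^2 / e over all features.

    Features whose expected value is 0 are skipped to avoid dividing by zero
    (a simplification chosen for this sketch).
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum(
        (o - e) ** 2 / e
        for o, e in zip(observed, expected)
        if e != 0
    )


if __name__ == "__main__":
    # Hypothetical baseline for three header-derived features of benign files.
    baseline = [4.0, 10.0, 2.0]
    # Hypothetical values for the file being classified.
    sample = [6.0, 10.0, 1.0]
    # (6-4)^2/4 + (10-10)^2/10 + (1-2)^2/2 = 1.0 + 0.0 + 0.5
    print(chi2_distance(sample, baseline))  # 1.5
```

The single returned value is what would be passed to the classifier in place of the individual fields, which is the trade-off described above: fewer inputs per file, at the cost of losing which field caused the deviation.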

Link: https://link.springer.com/chapter/10.1007/978-3-319-19578-034
*FusionCarcass*

[https://developers.virustotal.com/v3.0/reference#dot\net_assembly](https://developers.virustotal.com/v3.0/reference#dotnetassembly)

> chi2: <float\> chi-squared test value of stream data.
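For the PE sections field the original question asks about, a small sketch of pulling the per-section `chi2` values out of an already-downloaded v3 file report might look like the following. It assumes the report JSON nests the sections under `data.attributes.pe_info.sections`; the helper name `section_chi2` and the sample values are invented for illustration and do not come from a real file.

```python
def section_chi2(report: dict) -> dict:
    """Map PE section names to their chi2 values.

    Assumed layout: report["data"]["attributes"]["pe_info"]["sections"] is a
    list of dicts that may each carry "name" and "chi2" keys. Sections that
    lack a chi2 value are skipped.
    """
    sections = (
        report.get("data", {})
        .get("attributes", {})
        .get("pe_info", {})
        .get("sections", [])
    )
    return {
        section.get("name", f"<section {index}>"): section["chi2"]
        for index, section in enumerate(sections)
        if "chi2" in section
    }


if __name__ == "__main__":
    # Hand-written stand-in for a parsed API response; the values are made up.
    sample_report = {
        "data": {
            "attributes": {
                "pe_info": {
                    "sections": [
                        {"name": ".text", "entropy": 6.4, "chi2": 812345.5},
                        {"name": ".rsrc", "entropy": 7.9, "chi2": 301.2},
                    ]
                }
            }
        }
    }
    print(section_chi2(sample_report))
```

Reading the values per section rather than for the whole file makes it easier to compare them side by side with each section's entropy.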
eclairum115


@malwr