Big Data Science

Categories: Technologies

Language: English

Subscribers: 1.44K

Description from channel

Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼 — https://t.me/bds_job — channel about Data Science jobs and career
💻 — https://t.me/bdscience_ru — Big Data Science [RU]

Ratings & Reviews

1.67

3 reviews

The latest Messages 17

2021-09-10 06:26:50 Adversarial attacks to refine molecular energy predictions
Researchers at MIT have developed a new way to quantify the uncertainty of molecular energies predicted by neural networks. Neural networks are often used to predict the properties of new materials and chemical reactions orders of magnitude faster than traditional methods such as quantum mechanical simulation. However, the results can be unreliable: because ML models interpolate, they may fail when applied to data outside the domain of their training set. This is especially true for predicting the potential energy surface (PES), the map of a molecule's energy across all its configurations. To address this, the scientists propose exploring the safe zone of a neural network using adversarial attacks. The actual simulation is performed only for a small fraction of the molecules, and the data is fed into the neural network, which learns to predict the same properties for the rest of the molecules. Such methods have been successfully tested on new materials, including catalysts for producing hydrogen from water, cheaper polymer electrolytes for electric vehicles, and magnets. However, the accuracy of neural networks depends on the quality of the training data, and incorrect predictions can have disastrous consequences.
One way to estimate a model's uncertainty is to run the same data through several versions of it. To do this, the researchers had several neural networks predict the potential energy surface from the same data. If the networks are confident in the prediction, the differences between their outputs are minimal and the surfaces largely converge. Otherwise, the predictions of the different models vary widely, producing a range of outputs, any of which may be the correct surface.
The spread of the predictions represents the uncertainty at a particular point. A good ML model should report not only its best prediction but also the uncertainty of each prediction. However, each simulation can take tens to thousands of CPU hours, and to get meaningful results you need to run multiple models at a sufficient number of points.
Therefore, the new approach selects only data points with low prediction confidence. These molecules are then modified slightly to increase the uncertainty further. Additional data is computed for these molecules through simulation and added to the original training pool. The neural networks are trained again, and a new set of uncertainties is calculated. This process is repeated until the uncertainty associated with various points on the surface becomes well defined and cannot be reduced further.
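The loop described above can be sketched on a toy problem. This is only an illustration under loud assumptions, not the MIT team's code: a 1D function `simulate` stands in for the expensive simulation, small polynomial fits stand in for the neural networks, and all names and hyperparameters are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(x):
    # Hypothetical stand-in for an expensive quantum mechanical simulation.
    return np.sin(3 * x) + 0.5 * x ** 2

def fit_ensemble(x, y, n_models=5, degree=5):
    # Each ensemble member is fit on a random subset (two points left out).
    models = []
    for _ in range(n_models):
        idx = rng.choice(len(x), size=len(x) - 2, replace=False)
        models.append(np.polyfit(x[idx], y[idx], degree))
    return models

def ensemble_std(models, x):
    # Disagreement between members is used as the uncertainty estimate.
    preds = np.stack([np.polyval(m, x) for m in models])
    return preds.std(axis=0)

def adversarial_step(models, x, step=0.05):
    # Nudge each point in the direction that increases the disagreement.
    up = ensemble_std(models, x + step)
    down = ensemble_std(models, x - step)
    return np.where(up >= down, x + step, x - step)

x_train = rng.uniform(-1.0, 1.0, size=12)
y_train = simulate(x_train)
candidates = np.linspace(-2.0, 2.0, 200)

history = []
for _ in range(4):
    models = fit_ensemble(x_train, y_train)
    std = ensemble_std(models, candidates)
    history.append(float(std.max()))
    worst = candidates[np.argsort(std)[-3:]]   # least confident points
    for _ in range(5):
        worst = adversarial_step(models, worst)
    # "Simulate" the new points and add them to the training pool.
    x_train = np.concatenate([x_train, worst])
    y_train = np.concatenate([y_train, simulate(worst)])

print("max ensemble std per round:", history)
```

In a real setting the stopping rule would watch the uncertainty until it no longer decreases, rather than running a fixed number of rounds.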
The proposed approach has been tested on zeolites, cavernous crystals with shape selectivity that are used in catalysis, gas separation, and ion exchange. Modeling large zeolite structures is very expensive, and the researchers show how their method can provide significant savings in computer simulations. The adversarial approach to retraining neural networks improves performance without significant computational cost.
https://news.mit.edu/2021/using-adversarial-attacks-refine-molecular-energy-predictions-0901

2021-09-08 09:57:16 Real-Time ML Predictions with Google's Vertex AI
One of the biggest challenges in serving ML models is providing near real-time predictions. Some business scenarios are especially sensitive to latency, for example, recommendation systems for online stores or delivery time estimates for food tech companies. On August 25, 2021, Google announced that you can interact with Vertex AI, its unified ML platform, directly through private endpoints. Vertex AI lets you quickly connect a trained and tested ML model to a production application, deploy it to a managed server in Google Cloud, or export it to the desired format.
Vertex Predictions is a serverless way to serve ML models: you host a model in the cloud and request predictions via a REST API. For online predictions, you deploy the model to an endpoint, which associates it with physical computing resources so that it can serve predictions in near real time. With VPC Peering, you can configure a private connection to the endpoint. User data then does not pass through the public Internet, which reduces the latency of online predictions and improves security.
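A rough sketch of what an online prediction request looks like is given below. It assumes the public Vertex AI REST method `projects.locations.endpoints.predict`; a private endpoint is reached through its own address inside the peered VPC, which is not shown here. The project, region, endpoint ID, token, and instance fields are placeholders, and the request is only built, not sent.

```python
import json
import urllib.request

def build_predict_request(project, region, endpoint_id, instances, token):
    """Build (but do not send) an online prediction request."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project}"
        f"/locations/{region}/endpoints/{endpoint_id}:predict"
    )
    body = json.dumps({"instances": instances}).encode("utf-8")
    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")

# All values below are hypothetical placeholders.
req = build_predict_request(
    project="my-project",
    region="us-central1",
    endpoint_id="1234567890",
    instances=[{"feature_a": 1.0, "feature_b": "x"}],
    token="ACCESS_TOKEN",
)
print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing `req` to `urllib.request.urlopen` with a valid access token; the shape of each instance depends on the deployed model.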
https://cloud.google.com/blog/products/ai-machine-learning/creating-a-private-endpoint-on-vertex-ai

2021-09-06 09:14:56 TOP 5 useful Python tools for data engineers and web developers
• Requests - an easy-to-use HTTP library for Python that lets you send HTTP requests and interact with web APIs https://docs.python-requests.org/en/master/
• Advanced Python Scheduler (APScheduler) - a library for deferred execution of Python code, either once or periodically. If jobs are stored in a database, they survive scheduler restarts and keep their state. APScheduler can also be used as a cross-platform, application-specific replacement for platform-specific schedulers such as the cron daemon or the Windows task scheduler. However, APScheduler is not a daemon or service itself and does not come with command line tools; it is intended to run inside existing applications. The library does provide ready-made building blocks for creating a scheduler service or for running it in a separate process. https://apscheduler.readthedocs.io/en/stable/userguide.html
• Watchdog - a module for monitoring filesystem events through a Python API and shell utilities https://pypi.org/project/watchdog/
• Twilio - a library for automating text messages and phone calls. It is convenient for automated monitoring of events on third-party sites, for example, promptly tracking discounts on products you want or the appearance of new products https://pypi.org/project/twilio/
• Random User Agent - a library for adding random user agents to requests, which is useful when scraping web data or sending a large number of requests https://pypi.org/project/random-user-agent/
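
As a small illustration of the first and last items, the sketch below prepares a Requests call with a randomly chosen User-Agent header. The URL, the query parameter, and the user-agent strings are made up for the example, and the standard `random` module stands in for the Random User Agent library, whose own API is not shown here. The request is only prepared, not sent.

```python
import random

import requests

# Hypothetical user-agent strings, used here only for illustration.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]

def prepare_request(url, params):
    """Build a GET request with a random User-Agent without sending it."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.Request("GET", url, params=params, headers=headers).prepare()

prepared = prepare_request("https://example.com/api", {"q": "data"})
print(prepared.method, prepared.url)
# To actually send it: requests.Session().send(prepared, timeout=10)
```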

2021-09-03 07:02:18 What is anomaly detection and how does it work
Anomaly detection is the mathematical search for deviations in labeled or unlabeled numerical data, based on how much a particular value differs from the other values in a sample, for example by how many standard deviations it lies from the mean. There are many methods for detecting anomalies, called outlier detection algorithms; each uses different criteria and therefore suits different scenarios. The most common methods are:
• General density-based methods: K-Nearest Neighbor (KNN), Local Outlier Factor (LOF), Isolation Forests, and other algorithms that can be applied in regression or classification scenarios. Each of these models the expected behavior by following the regions with the highest density of data points. Points that fall a statistically significant distance outside these dense zones are flagged as anomalies. Most of these methods rely on the distance between points, so it is important to normalize the units and scale of the dataset to get accurate results. For example, in KNN, data points are weighted by 1 / k, where k is the distance to the nearest neighbor. Points that are close to each other therefore carry a lot of weight and influence what counts as normal more than distant points do. The algorithm marks points with a low 1 / k value as outliers. This approach suits normalized, unlabeled data when you do not want, or are not able, to use more computationally complex algorithms.
• Support vector machine (SVM) is a supervised learning algorithm that builds a robust prediction model and is often used for classification. Given a training set of examples, each labeled as belonging to one of two categories, the system maps the examples to points in space and learns criteria that separate the two categories as widely as possible. A new example is flagged as an outlier if it falls outside both categories. In the absence of labeled data, you can use unsupervised learning, which looks for clusters among the examples to define the categories; the one-class SVM variant, for instance, is trained on a single class of normal examples. This approach suits data with 2 categories, when you need to find which data points lie outside each of them.
• K-means clustering combines ideas from KNN, since it relies on the proximity of each data point to other nearby points, and from SVM, since it focuses on classification into categories. Each data point is assigned to a category based on its features. Every category has a center point (centroid) that serves as the prototype for all other data points in the cluster. Each point is compared with these prototypes, and its distance to a prototype measures how much it differs from that cluster. Points that are close to a prototype form its cluster. K-means clustering can detect anomalies by marking points that do not fit any of the established categories. This approach suits scenarios with unlabeled data of many different types that needs to be organized around the learned prototypes.
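
The 1 / k weighting from the first item can be sketched with NumPy as follows. This is a minimal illustration on synthetic data, not a production detector; the function name and the `neighbor_rank` parameter are hypothetical.

```python
import numpy as np

def knn_weights(points, neighbor_rank=1):
    """Return 1 / k for each point, where k is the distance to a near neighbor."""
    # Normalize each feature so that no unit or scale dominates the distance.
    z = (points - points.mean(axis=0)) / points.std(axis=0)
    # Pairwise Euclidean distances; a point is not its own neighbor.
    dists = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    k = np.sort(dists, axis=1)[:, neighbor_rank - 1]
    return 1.0 / k

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(50, 2))       # dense "normal" region
data = np.vstack([cluster, [[8.0, 8.0]]])          # one planted outlier (index 50)

weights = knn_weights(data)
outlier_index = int(np.argmin(weights))            # lowest 1 / k value
print("most isolated point:", outlier_index, data[outlier_index])
```

With `neighbor_rank=1` the weight uses the nearest neighbor, as in the description; a larger rank makes the score less sensitive to small groups of outliers that sit close to each other.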
There are other, more sophisticated algorithms for unsupervised anomaly detection on multidimensional datasets. For example, Gaussian mixture models are an alternative to K-means that describe each cluster with a Gaussian distribution, and Bayesian methods use Bayesian probability to detect anomalies. Autoencoders can also be used: these are neural networks that learn a compressed encoding of the input and reconstruct the expected output from it. Inputs that the network reconstructs poorly are considered anomalies, which makes autoencoders well suited to high-dimensional detection tasks.
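As a simplified illustration of the Gaussian idea, the sketch below fits a single Gaussian (one mean and one covariance matrix, rather than a full mixture) to normal data and flags query points whose Mahalanobis distance exceeds a threshold. The threshold of 3.0 and the synthetic data are assumptions made for the example.

```python
import numpy as np

def mahalanobis_scores(train, points):
    """Distance of each point from the Gaussian fitted to the training data."""
    mean = train.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(train, rowvar=False))
    diff = points - mean
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

rng = np.random.default_rng(1)
normal_data = rng.normal(0.0, 1.0, size=(200, 2))
queries = np.array([[0.1, -0.2], [6.0, 6.0]])

scores = mahalanobis_scores(normal_data, queries)
flags = scores > 3.0   # hypothetical threshold
print(scores, flags)
```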

2021-09-01 07:25:26 What is AIOps
Just as we got used to MLOps, a new Ops practice has appeared in IT, although the need for it arose a long time ago. Meet AIOps: the use of AI to simplify IT operations management and to accelerate and automate problem solving in today's complex IT environments. AIOps leverages big data, analytics, and machine learning for the following purposes:
• Collecting and aggregating huge and ever-growing volumes of operational data generated by many IT infrastructure components, applications and performance monitoring tools;
• Filtering useful signals from noise to reveal really important events and patterns related to the performance and availability of systems;
• Identifying root causes and responding quickly to problems, sometimes automatically without human intervention.
By replacing many separate tools for manual IT operations with a single intelligent and automated platform, AIOps enables you to respond quickly and even proactively to slowdowns and system failures with much less effort. AIOps bridges the gap between an increasingly diverse, dynamic, and complex IT landscape and the expectation of uninterrupted application performance and availability. With more companies moving from traditional IT infrastructure to a dynamic mix of on-premises clusters, private clouds, and public clouds, AIOps is relevant for many enterprises.
https://medium.com/geekculture/aiops-6e463cbe617a

2021-08-31 09:33:27 TOP-15 most interesting DS conferences around the world in September 2021
- 6-7.09 (offline) and 13-15.09 (online) - AI & Big Data Expo Global, the leading Artificial Intelligence & Big Data Conference & Exhibition, at the Business Design Centre, London https://www.ai-expo.net/global/
- 9-10.09 – R Conference, New York, Online https://rstats.ai/nyr/
- 13-17.09 – Data Science Salon Miami Machine Learning & AI Meetup Week. Miami, FL, USA https://www.datascience.salon/miami-ml-meetup-week
- 14-16.09 - Insurance AI and Innovative Tech USA 2021 – Online Conference by Reuters https://reutersevents.com/events/analyticsusa/
- 15-16.09 - DATA festival #online https://datafestival.de/
- 15-16.09 - Open Data Science Conference, Online https://odsc.com/apac
- 20.09 – 1st Citizen Data Science Summit, Boston https://www.citizen-data-science.org/
- 20-21.09 - International Conference on Advances in Big Data and Data Sciences, Toronto, Canada https://waset.org/advances-in-big-data-and-data-sciences-conference-in-september-2021-in-toronto
- 21.09 – Data Champions Online, Canada https://dco-canada.coriniumintelligence.com/
- 22-23.09 - Big Data LDN, the UK's largest data & analytics event, Olympia London, UK https://bigdataldn.com/
- 22-23.09 - RE.WORK Deep Learning Summit https://www.re-work.co/events/deep-learning-summit-research and https://www.re-work.co/events/deep-learning-summit-applications
- 28-29.09 – Chief Data & Analytics Officer, Financial Services, Online https://cdao-fs-eu.coriniumintelligence.com/
- 28-30.09 – DataOps Summit Online https://www.dataopssummit-sf.com/about/
- 30.09 - Web Data Extraction Summit 2021 by Zyte https://www.extractsummit.io/

2021-08-30 05:22:02 BYOL - Bootstrap Your Own Latent
BYOL is a new approach to self-supervised learning of image representations with 2 neural networks that interact and learn from each other. The online network learns from the representation that the target network produces for the same image under a different augmentation. The underlying architecture is an existing ResNet50 or a similar network. Two augmentations, t and t', are applied to the input x, and the resulting views are passed through the online and target networks separately.
The difference between the online and target networks is that the former has an additional MLP head with two fully connected layers, with batch normalization and ReLU in between. The online network learns to predict the representation generated by the target network: it is updated with a regression loss whose targets are set by the target network. The parameters of the target network are in turn updated as an exponential moving average of the online network, which encourages the model to encode more information and helps avoid collapsed solutions.
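The two update rules described above can be sketched in NumPy. This is an illustration only, not the reference implementation: the networks themselves are omitted, parameters are plain arrays in a dictionary, and the decay rate `tau=0.99` is an assumed value.

```python
import numpy as np

def byol_loss(online_pred, target_proj):
    """Regression loss between L2-normalized online predictions and target projections."""
    p = online_pred / np.linalg.norm(online_pred, axis=1, keepdims=True)
    z = target_proj / np.linalg.norm(target_proj, axis=1, keepdims=True)
    # Equals the squared distance between the normalized vectors.
    return 2.0 - 2.0 * np.sum(p * z, axis=1)

def ema_update(target_params, online_params, tau=0.99):
    """Move the target parameters slowly toward the online parameters."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }

# Toy check: identical vectors give zero loss, opposite vectors the maximum.
v = np.array([[1.0, 2.0, 2.0]])
same_loss = byol_loss(v, v)
opposite_loss = byol_loss(v, -v)

target = {"w": np.zeros(3)}
online = {"w": np.ones(3)}
target = ema_update(target, online, tau=0.9)
print(same_loss, opposite_loss, target["w"])
```

In training, gradients flow only through the online network; the target network receives no gradient and changes only through the moving average.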
BYOL's performance is close to that of state-of-the-art supervised learning architectures. There is a slight performance degradation when only random cropping is used as image augmentation, but under the linear evaluation protocol BYOL outperforms SimCLR by iteratively learning from previous versions of its own output, without using negative pairs. However, the BYOL approach is not yet applicable to text, video, and audio processing tasks.

https://ai.plainenglish.io/byol-bootstrap-your-own-latent-dacee62a3dc8
https://arxiv.org/abs/2006.07733
https://arxiv.org/abs/2010.10241
https://github.com/lucidrains/byol-pytorch

2021-08-27 19:19:47 What is AIOps and how it differs from MLOps
MLOps is an interdisciplinary approach to managing machine learning models as standalone products with their own life cycle, with a focus on developing, scaling, and applying ML algorithms on an ongoing basis.
MLOps aims to bridge the gap between creating ML models and maintaining them, while AIOps focuses on automating incident management and intelligent root cause analysis.
AIOps solutions use monitoring data, reports, and logs to detect events, and apply machine learning and deep learning to notify IT operations teams of any issues or disruptions.
The goal of AIOps is to improve the efficiency of IT operations by automating the diagnosis of events and using machine learning to pinpoint root causes. These solutions give technical teams high-quality, easy-to-understand data by filtering the noise generated by monitoring tools and reducing false positives, which supports decision making. AIOps goes beyond preventing downtime to include cost containment, security, and AI-powered policy compliance.
MLOps helps teams choose the tools, methodologies, and documentation that will take their ML models into production, and AIOps helps teams automate their technology life cycles.
The greatest effect comes from using MLOps and AIOps together.
https://ai.plainenglish.io/whats-the-difference-between-aiops-and-mlops-15316cfa803d

2021-08-25 18:17:16 News from MIT: A New AI-Powered Probabilistic Programming Language
The language can assess the "fairness" of AI algorithms impartially, and does so more accurately and faster than existing alternatives. The Sum-Product Probabilistic Language (SPPL) is a probabilistic programming system. Probabilistic programming is an emerging field at the intersection of programming languages and AI that simplifies the development of AI solutions by using probabilistic models to explain observed data.
SPPL offers improved flexibility and robustness through the expressiveness of the language, its precise and simple semantics, and the speed and reliability of its exact symbolic inference engine. It avoids pitfalls by restricting itself to a carefully designed class of AI models, which includes decision tree classifiers. SPPL works by compiling probabilistic programs into a specialized data structure called a sum-product expression. This approach works faster than other similar solutions, but it cannot analyze neural networks. SPPL is a Python-based open-source project.
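To give an intuition for why a sum-product structure allows exact answers, here is a toy sketch. It is not SPPL's API or its compiler: the classes `Leaf`, `Product`, and `Sum`, the variables, and the probabilities are all hypothetical. Sums are weighted mixtures, products combine independent variables, and the probability of an event is computed exactly by recursion.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Event = Dict[str, Set[str]]   # variable name -> set of allowed values

@dataclass
class Leaf:
    var: str
    dist: Dict[str, float]    # value -> probability

    def prob(self, event: Event) -> float:
        allowed = event.get(self.var)
        if allowed is None:   # the event does not constrain this variable
            return 1.0
        return sum(p for value, p in self.dist.items() if value in allowed)

@dataclass
class Product:
    children: List[object]    # children over disjoint, independent variables

    def prob(self, event: Event) -> float:
        result = 1.0
        for child in self.children:
            result *= child.prob(event)
        return result

@dataclass
class Sum:
    weighted: List[Tuple[float, object]]   # (mixture weight, child)

    def prob(self, event: Event) -> float:
        return sum(w * child.prob(event) for w, child in self.weighted)

# Hypothetical model: a decision whose rate depends on group membership.
model = Sum([
    (0.6, Product([Leaf("group", {"a": 1.0}), Leaf("hired", {"yes": 0.5, "no": 0.5})])),
    (0.4, Product([Leaf("group", {"b": 1.0}), Leaf("hired", {"yes": 0.25, "no": 0.75})])),
])

p_hired = model.prob({"hired": {"yes"}})
p_hired_given_b = model.prob({"hired": {"yes"}, "group": {"b"}}) / model.prob({"group": {"b"}})
print(p_hired, p_hired_given_b)
```

Comparing conditional probabilities such as `p_hired_given_b` across groups is the kind of exact query a fairness analysis relies on.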
https://news.mit.edu/2021/exact-symbolic-artificial-intelligence-faster-better-assessment-ai-fairness-0809
https://github.com/probcomp/sppl
