Big Data Science

Channel address:

Categories: Technologies

Language: English

Subscribers: 1.44K

Description from channel

Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼 — https://t.me/bds_job — channel about Data Science jobs and career
💻 — https://t.me/bdscience_ru — Big Data Science [RU]

▲ Vote (1)

Ratings & Reviews

1.67

3 reviews

Reviews can be left only by registered users. All reviews are moderated by admins.

5 stars

4 stars

3 stars

2 stars

1 stars

The latest Messages 22

2021-06-03 07:14:13 What is your salary? Anonymous report and open datasets about AI/ML and Big Data salaries
You can share your data and download CSV/JSON-datasets with answers of your colleagues from all over the world https://salaries.ai-jobs.net/download/

699 viewsedited 04:14

Open / Comment

2021-06-01 06:23:55 How to get rid of lidar sensors and improve the quality of self-driving cars - research by ML specialists from MIT
Modern self-driving cars are driven with a giant rotating cylinder on the roof. It is a lidar sensor that sends pulses of infrared light and measures the time it bounces off objects to create a 3D map of points around the vehicle. However, this 3D data is huge and computationally intensive. For example, a typical 64-channel sensor delivers over 2 million points per second. Due to the extra spatial dimension, modern 3D models require 14 times more computation during output than their 2D counterparts. Therefore, for efficient navigation, engineers have to convert data to 2D, as a result of which some information is lost.
A team of DS specialists from MIT is creating a new ML automatic driving system that will do autonomous navigation using only raw 3D point cloud data and low-resolution GPS maps like smartphones. They even had to develop new deep learning components to use the GPU more efficiently and drive cars in real time. During testing, the system reduced the frequency of transmission of control of the machine to the human driver and could even withstand severe sensor failures. This hybrid evidence-based approach, which combines various control predictions together to arrive at the optimal choice of motion planning, performed better than traditional 3D lidar. And by combining control predictions according to model uncertainty, the system can adapt to unexpected events. The main components of the system are a driving platform without high definition 3D maps, an ML system and a deep 3D learning solution that optimizes neural architecture and inference library. Further, the team plans to develop the project, working out unfavorable weather conditions and dynamic interaction with other vehicles.
https://www.csail.mit.edu/news/more-efficient-lidar-sensing-self-driving-cars

189 views03:23

Open / Comment

2021-05-30 12:15:12 Looking for robot’s simulation? Try MuJoCo - Multi-Joint dynamics with Contact. Initially it was used at the Movement Control Laboratory, University of Washington, and has now been adopted by a wide community of researchers and developers. MuJoCo is a physics engine to facilitate research and development in robotics, biomechanics, graphics and animation, and other AI/ML/DS-areas. It offers a unique combination of speed, accuracy and power and model-based optimization through contacts. MuJoCo makes it possible to scale up computationally-intensive techniques and apply them to complex dynamical systems in contact-rich behaviors. It also provides testing and validation of control schemes before deployment on physical robots, interactive scientific visualization, virtual environments, animation and gaming. Core module (MuJoCo 2.0 - dynamic library with C/C++ API, includes an XML parser, model compiler, simulator, and interactive OpenGL visualizer, compatible with 64-bit Windows, Linux and macOS) distributes under Trial License 30 days is free, then you can buy Individual or Institutional License. Other modules you can download free https://www.roboti.us/index.html

127 views09:15

Open / Comment

2021-05-28 08:05:03 A new deep learning engine from NVIDIA Research to create 3D object models from standard 2D images based on GAN neural networks and the NVIDIA Omniverse platform.
Developed by the NVIDIA AI Research Lab in Toronto, GANverse3D transforms flat images into realistic 3D models. Their rendered results can be used in virtual environments, allowing game developers and designers to add new objects to their layouts easy without 3D modeling expertise and large rendering budgets. For example, one photo of a car can be turned into a 3D model that can drive around a virtual scene with realistic headlights, taillights and turn signals. The training dataset is created using GAN neural networks that synthesize images of the same object from different points of the angles.
Previous inverse graphics models relied on 3D shapes as training data. A new approach without 3D resources turns the GAN model into an efficient data generator for creating a 3D-object from 2D-images. Trained on real images rather than typical synthetic data, this AI model generalizes better to real-world applications, saving time and budget for modeling complex virtual objects. In particular, with the trained GANverse3D app, real photos of cars, buildings, or even people and animals can be transformed into 3D shapes to be customized and animated in Omniverse.
To visualize the same object from different viewpoints, the neural network has the following structure: the first 4 layers are open, and the remaining 12 are frozen. Conversely, if you freeze the first 4 layers and variable the remaining 12, the neural network generated different images from the same viewpoint. By manually assigning standard viewpoints (height and distance from the camera), the researchers were able to quickly create a multi-angle dataset from separate 2D images.
These multi-view images are included in the inverse graphics rendering framework to produce 3D mesh models from 2D images. After training on multi-view 2D images, GANverse3D only needs one 2D image to form the mesh of the 3D model. This 3D model can be used with a 3D neural renderer, allowing developers to customize objects and change backgrounds. And importing as an extension to the NVIDIA Omniverse platform and running on NVIDIA RTX GPUs, GANverse3D comes in handy for recreating any 2D image in 3D.
The results of testing the latest GAN model from NVIDIA, trained on 55,000 car images, outperformed an inverse graphics neural network trained on the popular Pascal3D dataset.
https://blogs.nvidia.com/blog/2021/04/16/gan-research-knight-rider-ai-omniverse/

202 views05:05

Open / Comment

2021-05-26 07:04:59 Easy Python: how to create your own class in a couple of lines with the dataclass module
Data analysts and Data Scientists like Python - an easy to-learn and to-use object-oriented language that contains many built-in functions and allows you to create your own. According to the DRY (Don’t Repeat Yourself) principle, the current development standard is to package code into classes that provide APIs for using their methods. In practice, you have to create your own classes in Python quite often, with the subsequent addition and expansion. You can do it faster with the standard module dataclasses, which should be imported first. Then you need to decorate the custom class with a special function dataclass, and list the necessary attributes inside this class, annotating them beforehand, i.e. by specifying the types. For example, for a custom class Person with an integer ID and string name, it would look like this:
from dataclass import dataclass
@dataclass
class Person:
id: int
name: str
The dataclasses module provides a decorator and functions to automatically add generated custom methods such as init () and repr () to user-defined classes. However, you need to use the dataclasses module wisely and be aware of the following features:
• by default, the module does not check the type specified in the variable annotation;
• the dataclass() decorator adds some methods to the custom class that may be redundant and duplicate the behavior of those already existing in the class;
• when using the dataclass() decorator, all base classes are viewed in reverse order, starting from the object, and for each class found, the attributes of its parent are added to the ordered display of the fields. All generated methods will use this combined computed ordered field mapping. And because the fields are in the order they are inserted, derived classes override base ones.

214 views04:04

Open / Comment

2021-05-24 07:38:31 Reinforcement learning (RL) is great for tasks with a well-defined reward function, as evidenced by the successful experiences of AlphaZero for Go, OpenAI Five for Dota, and AlphaStar for StarCraft. But in practice, it is not always possible to clearly define the reward function. For example, in a simple room cleaning case, an old business card found under the bed or a used concert ticket may be evaluated as trash, but if they are valuable for host they should not be thrown away. However, even if you set clear criteria for evaluating the analyzed object, converting them into rewards is not easy. If you give the agent a reward that reinforces his behavior every time when it collects the garbage, it can throw it back to collect again and receive reinforcement.
This behavior of the AI system can be prevented by forming a reward function based on feedback on the agent's behavior. But this approach requires a lot of resources: in particular, training the Deep RL Cheetah model from OpenAI Gym and MujoCo requires about 700+ human comparisons.
Therefore, researchers at the Berkeley University of California David Lindner and Rohin Shah proposed an RL-algorithm without human supervision or an explicitly assigned function to form a reward policy based on implicit information. They named it RLSP (Reward Learning by Simulating the Past) because RL is formed by modeling the past, based on judgments that allow the agent to draw inferences about human preferences without explicit feedback. The main difficulty with scaling RLSP is how to reason about previous experience in the case of a big amount of data. The authors propose to choose the most probable past trajectories of the development of events instead of their full enumeration, alternating the prediction of past actions with the prediction of the past states from which these actions were taken.
The RLSP algorithm uses gradient lifting to continuously update a linear reward function to explain the observed state. Scaling this idea is possible through a functional representation of each state and modeling a linear reward function for these characteristics, followed by an approximation of the RLSP gradient by sampling more likely past trajectories. The gradient encourages the reward function so that backward trajectories (which should have been done in the past) and forward trajectories (which the agent would have done using the current reward) are consistent with each other. Once the trajectories are consisted, the gradient becomes zero, and the reward function that is most likely to cause the observed state is known. The essence of the RLSP algorithm is to perform a gradient lift using this gradient. The algorithm was tested in the MujoCo simulator, an environment for testing RL algorithms on the problem of training simulated robots to move along the optimal trajectory or in the best possible way. The results showed that RLSP-generated reinforcement policies perform as well as those directly trained in the true reward function.
https://bair.berkeley.edu/blog/2021/05/03/rlsp/

175 views04:38

Open / Comment

2021-05-22 14:01:22 Data cleansing is too hard and too long? Try PClean, a new AI-system form MIT researchers written in a domain-specific probabilistic programming language for automatic data cleansing. It removes typos, duplicates, missing values, spelling errors and inconsistencies, making it easier to prepare a dataset for analysis and ML-modeling. Notably, PClean does not just mechanically cleanse data, but takes into account its semantics using generalized common sense models for judgments that can be customized for specific underlying data and error types.
The idea of probabilistic data cleansing based on declarative generalized knowledge of the research context is not new. It was published in a 2003 article by researches of the Berkley University of California. PClean develops this idea according to the trend of "explainable AI" with the realistic models of human knowledge to interpret data. Corrections in PClean based on Bayesian reasoning, whereby each alternative explanation for ambiguous data is assigned some weight to the existing probability data based on prior knowledge. An additional advantage of PClean is the ability to clean really large amounts of data, and relatively quickly. For example, in a recent 2021 study on table with 2.2 million rows of medical data, PClean found over 8,000 errors in just 7.5 hours. Finally, thanks to the principle of Bayesian probability, PClean give calibrated estimates of its uncertainty, which can be manually corrected and train the AI system.
https://news.mit.edu/2021/system-cleans-messy-data-tables-automatically-0511

208 views11:01

Open / Comment

2021-05-19 18:37:58 Since May 18, 2021, Google integrates all its ML cloud services into a single interface and API as part of the public accessible Vertex AI, a managed MLOps-platform for deploying and serving AI models. Vertex AI integrates AutoML and the AI platform into a single API, client library, and user interface. Users can independently manage data and prototypes, deploy and interpret their ML-models using Vertex tools: Vizier, Feature Store, Experiments, Continuous Monitoring and Pipelines.
Vertex AI integrates with many open-source frameworks (TensorFlow, PyTorch, and scikit-learn) and also supports all ML-frameworks with custom training and prediction containers. Also you can connect BigQuery, use standard SQL-queries in existing business intelligence and spreadsheet tools, and export datasets from BigQuery to Vertex AI. Vertex Data Labeling allows your mark your data accurately. Learn more and try the new DS-platform from the corporation of good here: https://cloud.google.com/vertex-ai

240 views15:37

Open / Comment

2021-05-12 19:14:24 YOLO is the first neural network recognized objects in real time on mobile devices. Due to the absence of “for”cycles in layers architecture, it provides high speed and accuracy of recognition in one pass.
The first version of YOLO was offered in 2016, and today, in May 2021, the 5th version has already been released. At the moment of the YOLO family provide the best results of real-time object detection.
YOLO works faster than R-CNN because it splits the image into a constant number of cells, instead of highlighting regions and calculating a solution for each of them. Now YOLO is not so good in recognition of objects with complex shapes or a group of small objects due to the insufficient number of candidates for the margins.
Nevertheless, in December 2020, Scaled YOLO v4 showed the best results (55.8% AP) on the Microsoft COCO dataset among peers, overtaking the Google EfficientDet D7x / DetectoRS neural network or SpineNet-190 (self-learning on additional data), Amazon Cascade in accuracy -RCNN, ResNest200 Microsoft RepPoints v2 and Facebook RetinaNet SpineNet-190. These results were achieved in the conditions of an optimal ratio of speed and accuracy from 15 FPS to 1774 FPS.

233 viewsedited 16:14

Open / Comment

2021-05-10 08:20:11 Apache Spark for Data Scientist: a short overview of ML packages
Data Scientists like the Apache Spark not only for ability to process really large datasets very quickly, but also for the presence of popular machine learning algorithms (classification, regression, clustering, filtering) and tools for preparing data for modeling (cleaning, feature extraction, transformation etc.), as well as algebraic and statistical functions. All this is packaged in special packages: MLLib (spark.mllib) and ML (spark.ml).
However "Spark ML" is not the official library name in the spark.ml package, but is often used to refer to the DataFrame-based MLlib API, unlike spark.mllib, which works with lower-level data structures - RDD (Resilient Distributed Dataset, a reliable distributed table-type collection). That official Apache Spark documentation emphasizes that both APIs are supported and neither is deprecated. In practice, most modern Spark apps developers, data analysts and Data Scientists work with the spark.ml package because of the flexible and convenient DataFrame API.

231 views05:20

Open / Comment

Big Data Science

Ratings & Reviews

The latest Messages 22

Popular Channels

Related Chats

Popular Channels

Login