
Big Data Science

Channel address: @bdscience
Categories: Technologies
Language: English
Subscribers: 1.44K
Description from channel

Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼 — https://t.me/bds_job — channel about Data Science jobs and career
💻 — https://t.me/bdscience_ru — Big Data Science [RU]

Ratings & Reviews

1.67 (3 reviews)


5 stars: 0
4 stars: 0
3 stars: 1
2 stars: 0
1 star: 2


The latest Messages

2022-07-18 07:53:58 7 Platforms of Federated ML
Federated learning is also called collaborative learning because ML models are trained across multiple decentralized edge devices or servers that hold local data samples, without exchanging those samples. This differs both from traditional centralized ML, where all local datasets are uploaded to a single server, and from classical decentralized approaches that assume identically distributed local data. Today, federated learning is actively used in defense, telecommunications, pharmaceuticals, and IoT platforms.
Federated ML was first introduced by Google in 2017 to improve mobile keyboard text prediction using models trained on data from multiple devices. In federated ML, models are trained on multiple local datasets at local nodes without explicit data exchange; instead, the nodes periodically exchange parameters, such as the weights and biases of a deep neural network, to build a common global model. Unlike distributed learning, which was originally aimed at parallelizing computation, federated learning targets heterogeneous datasets: local datasets usually differ greatly in size, and the clients, i.e. the end devices where local models are trained, can be unreliable and more failure-prone than the data-center nodes of distributed learning systems. To coordinate the distributed computation and synchronize its results, federated ML therefore requires frequent communication between nodes.
Due to its architectural features, federated ML has a number of disadvantages:
• statistical heterogeneity between local datasets: each node's sample is biased relative to the overall population, and sample sizes can vary significantly;
• temporal heterogeneity: the distribution of each local dataset changes over time;
• dataset compatibility must be ensured across all nodes;
• keeping training datasets hidden carries the risk of introducing vulnerabilities into the global model;
• lack of access to the global training data makes it difficult to identify unwanted biases in the training inputs;
• updates from local ML models can be lost due to failures at individual nodes, which may affect the global model.
Today, federated ML is supported by the following platforms:
• FATE (Federated AI Technology Enabler) https://fate.fedai.org/
• Substra https://www.substra.ai/
• Python libraries PySyft and PyGrid https://github.com/OpenMined/PySyft, https://github.com/OpenMined/PyGrid, https://github.com/OpenMined/pygrid-admin
• OpenFL https://github.com/intel/openfl
• TensorFlow Federated (TFF) https://www.tensorflow.org/federated
• IBM Federated Learning https://ibmfl.mybluemix.net/
• NVIDIA CLARA https://developer.nvidia.com/clara
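The parameter-averaging scheme described above can be sketched in plain NumPy, in the style of federated averaging (FedAvg). The model, client data, and update rule below are toy assumptions for illustration, not the implementation of any platform listed here: only weights, never raw data, leave a client, and the server aggregates them weighted by local sample counts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model y = x @ w, trained by gradient descent on each client's
# private data; only the weights ever leave a client.
def local_update(w_global, x, y, lr=0.1, epochs=5):
    w = w_global.copy()
    for _ in range(epochs):
        grad = 2 * x.T @ (x @ w - y) / len(x)  # MSE gradient
        w -= lr * grad
    return w

# Three clients with private datasets of different sizes (heterogeneous).
true_w = np.array([2.0, -1.0])
clients = []
for n in (20, 50, 100):
    x = rng.normal(size=(n, 2))
    y = x @ true_w + rng.normal(scale=0.1, size=n)
    clients.append((x, y))

# Federated averaging: broadcast global weights, train locally,
# then aggregate the local weights, weighted by client sample count.
w_global = np.zeros(2)
for _ in range(20):
    local_ws = [local_update(w_global, x, y) for x, y in clients]
    sizes = [len(x) for x, _ in clients]
    w_global = np.average(local_ws, axis=0, weights=sizes)

print(w_global)  # converges near [2, -1] without raw data leaving a client
```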
2022-06-24 08:36:25
#test
The statistical power does NOT depend on:
Anonymous Quiz
19% - the magnitude of the effect of interest in the population
56% - the expected value
5% - the sample size used to detect the effect
21% - the statistical significance criterion used in the test
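Power does depend on the effect size, the sample size, and the significance criterion, which a quick Monte-Carlo check can confirm. The setup below (a two-sample t-test with unit-variance normal data) and all numbers are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def power(effect, n, alpha=0.05, sims=2000, seed=0):
    """Monte-Carlo estimate of the power of a two-sample t-test."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / sims

# Power grows with effect size and with sample size:
print(power(0.5, 30), power(0.8, 30))   # larger effect -> more power
print(power(0.5, 30), power(0.5, 100))  # larger n -> more power
```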
2022-06-22 08:33:04 Best of May Airflow Summit 2022!
Top talks from data engineers for data engineers: the most interesting sessions, from the internals of the batch orchestrator to best practices for deployment and data management.
https://medium.com/apache-airflow/airflow-summit-2022-the-best-of-373bee2527fa
2022-06-21 07:20:54
New Python: now much faster!
2022-06-21 07:20:28 New Python: up to 2x faster!
The alpha of Python 3.11, released in April 2022, can run up to 60% faster than the previous version in some cases. Benchmarks by Phoronix, run on Ubuntu Linux with the interpreter compiled with GCC, showed Python 3.11 scripts running on average 25% faster than on Python 3.10 without any code changes. This became possible because the interpreter now handles the static allocation of its code objects, speeding up execution. Every time Python calls one of its own functions, a new frame is created; the frame's internal structure has been streamlined so that it keeps only the most essential information, dropping the extra memory-management and debugging data.
Also, starting with release 3.11, when CPython encounters a Python function that calls another Python function, it sets up a new frame and jumps directly into the new code it contains. This avoids the C-level call that previously interpreted every Python function call, further speeding up the execution of Python scripts.
https://levelup.gitconnected.com/the-fastest-python-yet-up-to-60-faster-2eeb3d9a99d0
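One way to check the speedup yourself is to run the same CPU-bound snippet under both interpreters with the standard `timeit` module. The benchmark function below is an illustrative choice: recursive Fibonacci is dominated by pure-Python function calls, which is exactly what the 3.11 frame and call optimizations target. Absolute times depend on your hardware.

```python
import timeit

def fib(n):
    # Recursive Fibonacci: almost all time is spent on Python-level calls.
    return n if n < 2 else fib(n - 1) + fib(n - 2)

# Save as bench.py and run with both interpreters to compare, e.g.:
#   python3.10 bench.py
#   python3.11 bench.py
t = timeit.timeit("fib(20)", globals=globals(), number=200)
print(f"{t:.3f} s for 200 x fib(20)")
```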
2022-06-18 18:40:21
#test
A false alarm from a car alarm sensor (with no real threat) is an error of
Anonymous Quiz
38% - type II
53% - type I
3% - depends on the statistical significance level
6% - it is not an error
2022-06-17 10:04:20
Data analytics: the blog of a leading Data Scientist working at Uber, one of the authors of Machine Learning. The channel's material will help you truly grow into a data professional.

One channel instead of thousands of textbooks and courses. Subscribe:

@data_analysis_ml
2022-06-16 06:26:48 GATO: the new SOTA from DeepMind
On May 19, 2022, DeepMind published a paper on a new generalist agent that goes beyond text outputs. GATO works as a multi-modal, multi-task, multi-embodiment generalist policy: the same network with the same weights can play Atari, caption images, chat, manipulate blocks, and perform other tasks, deciding from its context whether to output text, joint torques, button presses, or other tokens.
GATO was trained on a large number of datasets comprising agent experience in both simulated and real environments, in addition to a variety of natural-language and image datasets. During training, data from the different tasks and modalities are serialized into a flat sequence of tokens, batched, and processed by a transformer neural network similar to a large language model. The loss is masked so that GATO only predicts action and text targets.
When GATO is deployed, a demonstration prompt is tokenized, forming the initial sequence. The environment then emits the first observation, which is also tokenized and appended to the sequence. GATO autoregressively samples the action vector one token at a time; once all tokens of the action vector have been sampled (their number is defined by the environment's action specification), the action is decoded and sent to the environment, which steps forward and produces a new observation. The procedure then repeats. The model always sees the observations and actions within its context window of 1024 tokens.
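The deployment loop described above can be sketched as toy Python. Everything here is a stand-in assumption, not DeepMind's implementation: the tokenizer, the "model", and the environment are dummies, and only the control flow (tokenize observation, sample action tokens autoregressively, decode, step, repeat within a 1024-token window) mirrors the description.

```python
import random

CONTEXT_LEN = 1024       # GATO's context window, per the paper
ACTION_TOKENS = 3        # tokens per action: set by the env spec (toy value)
VOCAB = list(range(32))  # stand-in token vocabulary

def tokenize(obs):
    # Stand-in: real GATO serializes images/text/proprioception into tokens.
    return [hash(str(obs)) % len(VOCAB)]

def model_next_token(context):
    # Stand-in for the transformer's autoregressive sampling.
    return random.choice(VOCAB)

def decode_action(tokens):
    # Stand-in for de-tokenizing an action vector.
    return tuple(tokens)

class ToyEnv:
    def reset(self):
        return 0.0
    def step(self, action):
        return sum(action) * 0.01  # emits the next observation

env = ToyEnv()
context = tokenize("demo prompt")  # the prompt forms the initial sequence
obs = env.reset()
for _ in range(5):                 # control loop
    context += tokenize(obs)       # observation tokenized, appended
    action_tokens = [model_next_token(context) for _ in range(ACTION_TOKENS)]
    context += action_tokens
    obs = env.step(decode_action(action_tokens))  # env steps, new observation
    context = context[-CONTEXT_LEN:]  # model only sees the last 1024 tokens
print(len(context), obs)
```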
https://www.deepmind.com/publications/a-generalist-agent
2022-06-14 09:19:08 LAION-5B: an open dataset of 5+ billion text-image pairs for multi-modal ML
On May 31, 2022, the non-profit organization of AI researchers LAION presented the largest dataset of 5.85 billion image-text pairs filtered using CLIP. LAION-5B is 14 times larger than its predecessor LAION-400M, previously the world's largest open image-text dataset.
2.3 billion pairs are in English, while the rest of the dataset contains samples from over 100 other languages. The dataset also includes several nearest-neighbor indices, an improved web interface for exploration and subsetting, and watermark-detection and NSFW scores. It is recommended for research purposes and is not curated.
The full 5-billion collection is split into 3 datasets, each of which can be downloaded separately. All of them share the following column structure:
• URL - image URL
• TEXT - caption, in English for en, in other languages for multi and nolang
• WIDTH - image width
• HEIGHT - image height
• LANGUAGE - sample language, laion2B-multi only, computed with cld3
• similarity - cosine similarity between the text and image ViT-B/32 embeddings: clip for en, mclip for multi and nolang
• pwatermark - probability that the image is watermarked, computed with the LAION watermark detector
• punsafe - probability that the image is unsafe, computed with the LAION CLIP-based detector
pwatermark and punsafe are also available as separate collections that must be joined via a URL+text hash.
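The metadata columns above make it easy to subset the collection before downloading any images. A plain-Python sketch over hypothetical metadata records (the rows, field values, and thresholds are all made up for illustration; real metadata ships as parquet files):

```python
# Hypothetical metadata rows following the column structure listed above.
rows = [
    {"URL": "https://example.org/a.jpg", "TEXT": "a red bicycle",
     "WIDTH": 640, "HEIGHT": 480, "LANGUAGE": "en",
     "similarity": 0.31, "pwatermark": 0.02, "punsafe": 0.001},
    {"URL": "https://example.org/b.jpg", "TEXT": "stock photo",
     "WIDTH": 200, "HEIGHT": 200, "LANGUAGE": "en",
     "similarity": 0.22, "pwatermark": 0.93, "punsafe": 0.002},
]

def keep(row, min_side=256, max_watermark=0.5, max_unsafe=0.1):
    """Filter on resolution, watermark probability, and NSFW score."""
    return (min(row["WIDTH"], row["HEIGHT"]) >= min_side
            and row["pwatermark"] < max_watermark
            and row["punsafe"] < max_unsafe)

filtered = [r for r in rows if keep(r)]
print([r["URL"] for r in filtered])  # only the clean, large-enough pair survives
```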
Details and links to download: https://laion.ai/laion-5b-a-new-era-of-open-large-scale-multi-modal-datasets/
2022-06-10 08:31:36
#test
What is the difference between a projection and a view in relational databases?
Anonymous Quiz
5% - the terms are synonymous
9% - the terms apply to different contexts
75% - a projection is an operation of relational algebra, a view is the result of query execution
11% - a view is an operation of relational algebra, a projection is the result of query execution
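The distinction can be shown with SQLite (the table and data below are made up): projection is the relational-algebra operation of keeping only certain columns, while a view is a named, stored query whose result is produced each time it is queried.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary REAL)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("Ann", "DS", 100.0), ("Bob", "DE", 90.0)])

# Projection: a relational-algebra operation -- keep only some columns.
projection = conn.execute("SELECT name, dept FROM employees").fetchall()

# View: a named, stored query; selecting from it re-executes the query.
conn.execute(
    "CREATE VIEW ds_staff AS SELECT name FROM employees WHERE dept = 'DS'")
view_result = conn.execute("SELECT name FROM ds_staff").fetchall()

print(projection)   # [('Ann', 'DS'), ('Bob', 'DE')]
print(view_result)  # [('Ann',)]
```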