
Big Data Science

Logo of telegram channel bdscience — Big Data Science
Channel address: @bdscience
Categories: Technologies
Language: English
Subscribers: 1.44K
Description from channel

Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼 — https://t.me/bds_job — channel about Data Science jobs and career
💻 — https://t.me/bdscience_ru — Big Data Science [RU]

Ratings & Reviews

1.67 (3 reviews)

Reviews can be left only by registered users. All reviews are moderated by admins.

5 stars: 0
4 stars: 0
3 stars: 1
2 stars: 0
1 star: 2

The latest messages (13)

2021-11-17 06:44:49 Digital Twin of Earth from NVIDIA
To prevent climate disasters, NVIDIA is set to build the world's most powerful AI supercomputer for predicting climate change. This system will create a digital twin of the Earth in Omniverse and will become an analogue of Cambridge-1, the UK's most powerful AI supercomputer for medical research. By combining three technologies (GPU-accelerated computing, deep learning on neural networks, and AI supercomputing) with large volumes of observational and model data, scientists and engineers aim to achieve accurate simulations of the physical, biological and chemical processes on Earth. This will help provide early warnings for the adaptation and resilience of urban infrastructure, so that people and countries can act quickly to prevent climate disasters.
https://blogs.nvidia.com/blog/2021/11/12/earth-2-supercomputer/
161 views, 03:44
2021-11-15 07:22:42 Fastparquet: Reading Parquet Files with Python
Apache Parquet is a binary column-oriented storage format originally created for the Hadoop ecosystem. Thanks to its concise and efficient column-wise representation of data, it is very popular in the Big Data world. However, reading data in Parquet format is not an easy task. PySpark can handle this, of course, but not every Data Scientist works with data in Apache Spark. This is where fastparquet comes in: a Python implementation of the Parquet format used by Dask, Pandas, and others to deliver high performance with a small distribution size and small codebase. Fastparquet depends on a set of Python libraries (numpy, pandas, cramjam, fsspec), so they should be installed beforehand.
After installation via the pip package manager (pip install fastparquet) or from GitHub (pip install git+https://github.com/dask/fastparquet), the contents of a Parquet file can easily be loaded into a DataFrame in your usual DS IDE, such as Jupyter Notebook:
from fastparquet import ParquetFile

# open the Parquet file and load it into a pandas DataFrame
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()
# load only selected columns, treating col1 as a pandas category
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])

Or write a DataFrame to a Parquet file, specifying the row-group boundaries, compression codec and file scheme:
from fastparquet import write

# simple write with default settings
write('outfile.parq', df)
# split rows into groups at the given offsets, compress with GZIP,
# and store as a hive-style directory of files
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')

https://github.com/dask/fastparquet
https://www.anaconda.com/blog/whats-new-with-fastparquet
https://blog.datasyndrome.com/using-the-python-ml-stack-inside-pyspark-de1223942c32
https://fastparquet.readthedocs.io/en/latest/
215 views, 04:22
2021-11-12 07:20:22 ML for child protection
In the United States, one in seven children has been abused or neglected in the past year. US Child Protection Agencies receive several million reports of alleged neglect or abuse each year. Therefore, some agencies are implementing ML to help professionals review cases and determine which ones should be investigated next. But these ML models are useless if users don't understand and trust their results.
So researchers from MIT and elsewhere have developed a visual analytics tool that uses bar charts to show how specific factors in a case affect the predicted risk of a child being homeless over the next two years. This risk assessment is based on over 100 demographic and historical factors, including parental age and criminal record. After testing the tool, the developers concluded that the visualization and interpretability of predictions need to be improved in order to avoid dangerous distortions in important decisions.
https://news.mit.edu/2021/machine-learning-high-stakes-1028
179 views, 04:20
2021-11-10 07:23:16 AI predicts how old children are. How safe is it?
Today, any content is usually age-restricted, with a label indicating the minimum age of its intended consumer. Everyone knows the 18+ labeling of films, and in most social networks users over the age of 13 can create accounts on their own. Unsurprisingly, facial recognition technologies are being used to determine the age of content consumers or users. Going further, AI can predict what a person will look like in old age: remember the recent FaceApp boom.
Likewise, Yoti's age-estimation solutions have a margin of error of less than 3 years across the 6-to-60 age range; for users under 25, the margin of error is less than 1.5 years. In the next few weeks, they will launch in major UK supermarket chains, for example, to prevent the sale of alcohol to minors. Yoti trained its neural networks on hundreds of thousands of images of people's faces from official documents (passports and driver's licenses). Given the high risk of leaks of such confidential data, other players in the face recognition market say that age verification should, where possible, be done without analyzing the face itself or other biometric data.
https://www.wired.com/story/ai-predicts-how-old-children-are/
221 views, 04:23
2021-11-08 06:26:14 Self-Supervised Reversible Reinforcement Learning: A New Approach from Google AI
Reinforcement learning (RL) is great at solving problems from scratch, but it is not easy to train an agent to understand the reversibility of its actions. For example, robots should avoid activities that could damage them. To evaluate the reversibility of an action, one needs practical knowledge and an understanding of the physics of the environment in which the RL agent exists. Therefore, Google AI researchers presented at NeurIPS 2021 a new way to approximate the reversibility of RL agents' actions. The approach adds a separate reversibility-estimation component to self-supervised reinforcement learning, trained on unlabeled data collected by agents. The model can be trained online (alongside the RL agent) or offline (from a dataset of interactions) to guide RL policies towards reversible behavior, which can significantly improve the performance of RL agents across multiple tasks.
The reversibility component added to the RL procedure is learned from interactions and is a model that can be trained separately from the agent itself. It is trained in a self-supervised way and does not require labels indicating the reversibility of actions: the model itself learns which types of actions tend to be reversible from the context of the training data. It uses the probability of temporal precedence of events as a proxy measure of true reversibility, which can be learned from a dataset of interactions even without rewarding the RL agent.
This method allows RL agents to predict the reversibility of an action by learning to model the temporal order of randomly sampled trajectory events, resulting in better exploration and control. The method is self-supervised, i.e. it does not require prior knowledge about reversibility, which makes it suitable for different environments.
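The precedence idea can be sketched in a few lines: from a set of trajectories, estimate how often state a is observed before state b. A pair whose transitions run almost exclusively in one temporal direction is likely irreversible. This is a simplified counting illustration of the concept, not Google's actual model (which trains a neural network on shuffled event pairs); all names and data here are made up:

```python
from collections import Counter
from itertools import combinations

def precedence(trajectories):
    """Estimate P(a observed before b) for every ordered state pair in the data."""
    before = Counter()   # (a, b) -> times a appeared earlier than b
    total = Counter()    # unordered pair -> total comparisons
    for traj in trajectories:
        for i, j in combinations(range(len(traj)), 2):
            a, b = traj[i], traj[j]
            if a == b:
                continue
            before[(a, b)] += 1
            total[frozenset((a, b))] += 1
    return {pair: before[pair] / total[frozenset(pair)] for pair in before}

# Toy log: a vase can go from "intact" to "broken", never back.
trajs = [["intact", "broken"],
         ["intact", "intact", "broken"],
         ["intact", "broken", "broken"]]
p = precedence(trajs)
# "intact" always precedes "broken": strong evidence the action is irreversible
print(p[("intact", "broken")])   # 1.0
```

A real agent would use such scores to penalize (or gate) actions whose estimated precedence probability is close to 1, i.e. transitions that the data suggests cannot be undone.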
https://ai.googleblog.com/2021/11/self-supervised-reversibility-aware.html
271 views, 03:26
2021-11-05 07:58:23 Optuna: How to Automate Hyperparameter Tuning
Tuning the hyperparameters of an ML model takes a lot of time and effort. To simplify this task, you can use special frameworks, one of which is Optuna. Launched in 2019, this platform has the following advantages:
• compatibility with PyTorch, Chainer, TensorFlow, Keras, MXNet, Scikit-Learn, XGBoost, LightGBM and other ML frameworks;
• search spaces defined in plain Python, using conditional expressions, loops and ordinary Python syntax;
• the ability to handle continuous hyperparameter values, e.g. tuning alpha and lambda regularization to any floating-point value within a given range;
• Bayesian sampling algorithms with pruning, which removes obviously unpromising regions of the hyperparameter space from the search to speed up optimization;
• parallelization of the hyperparameter search over several threads or processes without changing the code;
• faster than alternatives (RandomSearch, GridSearch, hyperopt, scikit-optimize);
• detailed documentation.
https://optuna.org/
https://towardsdatascience.com/kagglers-guide-to-lightgbm-hyperparameter-tuning-with-optuna-in-2021-ed048d9838b5
69 views, 04:58
2021-11-03 09:17:46 Simple interpolation in Scipy instead of complex optimization
SciPy is a set of mathematical algorithms and helper functions built on the NumPy extension of Python. It adds many high-level commands and classes for manipulating and visualizing data, allowing a DS specialist to stay in a regular Python development environment instead of specialized math systems like MATLAB, IDL, Octave, R-Lab, and SciLab.
Take, for example, interpolation of experimental data in complex scientific or business research. Having obtained an interpolation function from SciPy, you can use it in further calculations. This is useful when additional data collection and experimentation are expensive or time-consuming, as in semiconductor development, chemical process optimization, production planning, etc.
Interpolation will help to conduct simulations with datasets where data points are collected at a large interval. For example, you can create an interpolation function using linear, quadratic, or cubic splines and run the interpolation function to evaluate the results of an experiment or simulation on a dense mesh.
This method does not guarantee the best results in all situations, but it is suitable for most real-life cases. Although interpolation assumes the function is smooth (continuous), this assumption holds for most real-world functions, which are not too jumpy and are smooth enough for interpolation methods.
Scipy interpolation routines work in both 2D and 1D cases. For example, you can get a smooth interpolated 2D surface from sparse data using Scipy interpolation, creating a 4900-point matrix from 400 actual data points.
In Scipy, the scipy.interpolate package is responsible for interpolation, which contains spline functions and classes, one-dimensional and multidimensional interpolation classes, Lagrange and Taylor polynomial interpolators, and wrappers for the FITPACK and DFITPACK functions.
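A minimal sketch of the dense-mesh idea with scipy.interpolate (the function names are the real SciPy API; the sampled data is made up): fit a cubic interpolator to a coarsely sampled curve, then evaluate it on a fine grid:

```python
import numpy as np
from scipy.interpolate import interp1d

# coarse "experimental" samples of a smooth process
x = np.linspace(0, 10, 11)          # 11 points, step 1.0
y = np.sin(x)

f = interp1d(x, y, kind="cubic")    # build the interpolation function

x_dense = np.linspace(0, 10, 101)   # dense evaluation mesh, step 0.1
y_dense = f(x_dense)

# the interpolant passes through the original data points
print(float(abs(f(5.0) - np.sin(5.0))))   # ~0, since 5.0 is a sample point
```

Between the sample points the cubic interpolant tracks the true curve closely, which is exactly the cheap stand-in for extra experiments that the post describes.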
https://towardsdatascience.com/optimizing-complex-simulations-use-scipy-interpolation-dc782c27dcd2
https://docs.scipy.org/doc/scipy/reference/tutorial/interpolate.html
199 views, 06:17
2021-11-01 17:38:13 Not only CoPilot: Anomaly Detection in Software Development with ControlFlag by IntelLabs
Intel provided its response to OpenAI's CoPilot. Meet ControlFlag, a self-supervised pattern-detection system that learns typical patterns in the control structures of high-level programming languages like C/C++ by mining them from GitHub repositories and other sources. These patterns are then used to detect anomalies in user code. ControlFlag can be used to detect typos, missed NULL checks, and more.
Having first introduced the product at the end of 2020, Intel open-sourced ControlFlag in 2021; it is based on unsupervised learning. After analyzing more than a billion lines of code, the system finds errors with high precision and can adapt to the style of individual developers in order to distinguish genuine anomalies from coding idiosyncrasies.
https://github.com/IntelLabs/control-flag
171 views, 14:38
2021-10-30 08:00:31 TOP-15 most interesting DS conferences around the world in November 2021
1. 01-02 Nov – 15th International Conference on Big Data Analytics and Big Data Science https://waset.org/big-data-analytics-and-big-data-science-conference-in-november-2021-in-san-francisco
2. 2 Nov - OSA Con, Open Source Analytics Conference, a free virtual event https://altinity.com/osa-con-2021/
3. 2-3 Nov - Chief Data & Analytics Officer, Europe – annual meeting https://cdao-eu.coriniumintelligence.com/
4. 3-4 Nov – 3rd International Conference on Big Data Analytics and Data Science https://crgconferences.com/datascience/
5. 3-4 Nov - Ai4 2021 Enterprise Summit, Exploring AI Across Industry https://ai4.io/enterprise-ai/
6. 4-6 Nov - AAAI 2021 Fall Symposium on Science Guided Machine Learning https://sites.google.com/vt.edu/sgai-aaai-21
7. 8-11 Nov - NVIDIA GTC, the Conference for AI Innovators, Technologists, and Creators https://www.nvidia.com/gtc/
8. 8-19 Nov - PRODUCT LEADERSHIP FESTIVAL 2021 - Product, Design & Data https://www.productleadership.com/events/product-leadership-festival
9. 9 Nov - RE.WORK Women in AI https://www.re-work.co/events/women-in-ai-virtual-2021
10. 11-12 Nov – DACH AI, Data Analytics and Insights Summit https://berryprofessionals.com/ai-data-analytics-and-insights-summit-dach/
11. 15-18 Nov - Toronto Machine Learning Summit (TMLS) 2021 https://www.torontomachinelearning.com/
12. 15-17 Nov - Marketing Analytics and Data Science https://informaconnect.com/marketing-analytics-data-science
13. 16-18 Nov – Open Data Science Conference https://odsc.com/
14. 18 Nov - SAS Global Learning Conference https://www.sas.com/content/sascom/sas/events/global-learning-conference.html
15. 21-25 Nov – Data Science Conference Europe 2021 https://datasciconference.com/
130 views, 05:00
2021-10-29 06:26:49 Simple combo for data analyst: 3 frameworks joining Python + spreadsheets
In practice, any data analyst works with datasets not only in Jupyter Notebook or Google Colab. Sometimes you have to open Excel and Google Sheets spreadsheet files, so there is a need to combine Python scripts with built-in spreadsheet tools. The following frameworks come in handy for this:
• XLWings is a Python package that comes preinstalled with Anaconda and is most often used to automate Excel processes. It is similar to Openpyxl, but more reliable and user-friendly. For example, you can write your own UDFs in Python to parse web pages, run machine learning, or solve NLP problems on data in a spreadsheet. https://www.xlwings.org/tutorials/
• Mito is a spreadsheet interface for Python: a spreadsheet within Jupyter that generates code. Mito supports basic operations such as merge, join, pivot, filtering, sorting, visualization, adding columns, using spreadsheet formulas, etc. https://docs.trymito.io/
• Openpyxl is a Python package for reading from and writing to Excel files. For example, you can connect to a local Excel file and access a specific cell or group of cells, fetching the data into a DataFrame. After processing, you can send the data back to the Excel file. In practice, this package is most often used in the financial sector, since processing large datasets directly in Excel is too slow. https://foss.heptapod.net/openpyxl/openpyxl
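A minimal openpyxl sketch (real API; the filename and cell contents are made up): write a few cells, save the workbook, then reopen it and read a specific cell back:

```python
from openpyxl import Workbook, load_workbook

# create a workbook and write a header plus one data row
wb = Workbook()
ws = wb.active
ws["A1"] = "revenue"
ws.append([12500])          # appended into the next free row, here A2
wb.save("report.xlsx")

# reopen the file and access a specific cell
wb2 = load_workbook("report.xlsx")
print(wb2.active["A2"].value)   # 12500
```

From here the cell values can be fed into a pandas DataFrame for processing and written back the same way, which is the round trip the post describes.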
168 views, 03:26