
Big Data Science

Channel address: @bdscience
Categories: Technologies
Language: English
Subscribers: 1.44K
Description from channel

Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼 — https://t.me/bds_job — channel about Data Science jobs and career
💻 — https://t.me/bdscience_ru — Big Data Science [RU]

Ratings & Reviews

1.67

3 reviews

Reviews can be left only by registered users. All reviews are moderated by admins.

5 stars: 0
4 stars: 0
3 stars: 1
2 stars: 0
1 star: 2


The latest Messages 11

2021-12-27 09:19:12 On the eve of the New Year, speeding up DS: meet Polars
Polars is a fast data-preparation library for ML modeling, available for Python and Rust. It is 15 times faster than Pandas, parallelizing the processing of dataframes and queries in memory. Written in Rust, Polars uses all the cores of the machine, and the library is optimized for the specifics of data processing while exposing a Python API. The rich API lets you not only work with huge amounts of data at the pre-processing stage, but also build working pipelines. Benchmarks show that Polars outperforms not only Pandas but also other tools, including computing engines popular in Big Data such as Apache Spark, Dask, etc.
Installing and trying Polars is very easy with the pip package manager:
pip install polars
import polars as pl
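For a quick taste, here is a minimal sketch of a lazy Polars query; the column names and values are made up for illustration:
import polars as pl

# build a small dataframe and run a lazy filter/select pipeline;
# nothing is executed until collect() is called
df = pl.DataFrame({"city": ["NY", "LA", "NY"], "sales": [100, 200, 150]})
result = (
    df.lazy()
      .filter(pl.col("sales") > 100)
      .select([pl.col("city"), (pl.col("sales") * 2).alias("sales_x2")])
      .collect()
)
print(result)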
https://www.pola.rs/
https://betterprogramming.pub/this-library-is-15-times-faster-than-pandas-7e49c0a17adc
160 views · 06:19
2021-12-24 09:48:25 AutoML and more with PyCaret
PyCaret is an open-source, low-code AutoML library in Python that automates most MLOps tasks. PyCaret has special features for analyzing, deploying, and combining models that many other ML frameworks lack. It lets you go from preparing data to deploying an ML model in minutes, in the development environment of your choice.
In fact, PyCaret is a Python wrapper around several libraries and ML frameworks: scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, etc. Its simplicity allows it to be used not only by experienced DS specialists but also by ordinary users who need to perform complex analytical tasks with simple commands. The library is available for free download and use under the MIT license. The package contains several modules whose functions are grouped by the main use cases: from simple classification to NLP and anomaly detection.
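As an illustration, here is a minimal classification sketch, assuming the 'juice' sample dataset that ships with PyCaret and its 'Purchase' target column:
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, predict_model

data = get_data("juice")                        # sample dataset bundled with PyCaret
setup(data, target="Purchase", session_id=123)  # prepare the experiment
best = compare_models()                         # train and rank several models automatically
holdout_predictions = predict_model(best)       # score the best model on the hold-out set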
https://pycaret.org/
https://github.com/pycaret/pycaret
251 views · 06:48
2021-12-22 09:56:57 4 simple tips for effective data engineering
To prevent data engineering projects with hundreds of artifacts (dependency files, jobs, unit tests, shell files, and Jupyter notebooks) from descending into chaos, follow these guidelines:
• manage dependencies, for example through a dependency manager like Poetry;
• remember about unit tests: introducing unit tests into the project will save you from trouble and improve the quality of your code (see the sketch after this list);
• divide and conquer: store all data transformations in a separate module;
• document your work, both to remember the code and the business problem it solves and to share knowledge with colleagues.
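A minimal sketch combining the unit-test and separate-module tips: a transformation kept in its own module plus a pytest-style test (function name, columns, and values are hypothetical):
# transformations.py: keep data transformations in a dedicated module
import pandas as pd

def add_total_price(df: pd.DataFrame) -> pd.DataFrame:
    """Add a total_price column computed from quantity and unit_price."""
    out = df.copy()
    out["total_price"] = out["quantity"] * out["unit_price"]
    return out

# test_transformations.py: a small unit test picked up by pytest
def test_add_total_price():
    df = pd.DataFrame({"quantity": [2, 3], "unit_price": [5.0, 1.5]})
    result = add_total_price(df)
    assert result["total_price"].tolist() == [10.0, 4.5]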
https://blog.devgenius.io/keeping-your-data-pipelines-organized-fa387247d59e
167 views · 06:56
2021-12-20 05:59:37 Meet Gallia: a new library for data transformation
This schema-aware Scala library comes in handy for practical data transformation, including ETL processes, feature engineering, HTTP responses, and more. Highly scalable, it is designed to bridge the gap between Pandas and Spark SQL. Gallia is useful both for those who appreciate Scala's powerful type system and for those who find overly fancy SQL queries hard to follow. Essentially, Gallia implements a one-stop-shop paradigm for most or all of your data transformation needs in a single application. The library supports all kinds of data manipulation, from aggregations to pivoting, including processing individual and nested objects, not just collections. For scaling, Gallia integrates seamlessly with the Spark RDD API.
https://cros-anthony.medium.com/gallia-a-library-for-data-transformation-3fafaaa2d8b9
https://github.com/galliaproject/gallia-core/blob/master/README.md
168 views · 02:59
2021-12-17 10:46:30 Introducing useful DS tools: meet Streamlit
Streamlit is an open-source Python library that makes it easy to create and publish beautiful custom web applications for ML and DS, letting you build and deploy powerful applications in just a couple of minutes. Comparing Streamlit to Dash is similar to comparing Python to C#: Streamlit makes it easy to build web data applications in pure Python, often in a few lines of code. For example, single commands display interactive Plotly, Bokeh, and Altair visuals, Pandas DataFrames, and more. Streamlit is backed by a huge open-source developer community, and you can add your own components to the library using JavaScript. Cloud hosting of Streamlit apps is open to everyone: you can create and host up to three applications for free.
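A minimal sketch of a Streamlit app (save it as app.py and start it with streamlit run app.py; the data is made up for illustration):
import pandas as pd
import streamlit as st

st.title("Demo dashboard")
df = pd.DataFrame({"day": list(range(1, 11)), "sales": [d ** 2 for d in range(1, 11)]})
st.line_chart(df.set_index("day"))   # one-line interactive chart
st.dataframe(df)                     # interactive table view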
https://streamlit.io/
147 views · edited 07:46
2021-12-15 10:08:31 Speed up DS with big data: Pandas API right in Apache Spark
The popular computing framework Apache Spark lets you write programs in Python, a language familiar to every DS specialist. PySpark now includes a pandas API that can be imported with just one line: import pyspark.pandas as ps.
This provides the following benefits:
• lowers the threshold for entering Spark;
• unifies the codebase for small and big data, local machines and distributed clusters;
• speeds up Pandas code.
By the way, Pandas on Spark is even faster than the other popular Python engine, Dask!
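A minimal sketch of the pandas API on Spark (requires pyspark 3.2 or later; the column names and values are made up):
import pyspark.pandas as ps

# the familiar pandas syntax, but executed by Spark under the hood
psdf = ps.DataFrame({"region": ["EU", "US", "EU"], "amount": [10, 20, 30]})
totals = psdf.groupby("region")["amount"].sum()
print(totals.head())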
https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html
https://towardsdatascience.com/run-pandas-as-fast-as-spark-f5eefe780c45
https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
137 views · 07:08
2021-12-13 06:37:21 What is UMAP and why is it useful for Data Scientists?
UMAP (Uniform Manifold Approximation and Projection) is a general-purpose manifold learning and dimensionality reduction algorithm. It is designed to be compatible with scikit-learn, uses the same API, and can be added to sklearn pipelines. As a stochastic algorithm, UMAP uses randomization to speed up the approximation and optimization steps, which means that different UMAP runs may produce different results. Although UMAP is relatively stable and the difference between runs is usually small, it is never exactly zero. To ensure exact reproduction of results, UMAP allows the user to set a fixed random state.
Since version 0.4, UMAP also supports multithreading to improve performance, and during optimization race conditions between threads are allowed at certain stages. The randomness of UMAP's output in the multithreaded case therefore depends not only on the input random seed but also on race conditions between threads during optimization, which cannot be controlled. As a result, multithreaded UMAP results cannot be exactly reproduced.
UMAP can be used as an efficient preprocessing step to improve the performance of density-based clustering. But UMAP, like t-SNE, does not completely preserve density and can create false discontinuities in clusters. Compared to t-SNE, UMAP preserves more of the global structure, creating more meaningful clusters. And thanks to support for arbitrary embedding dimensions, UMAP is not limited to 2D or 3D projections.
Due to its heavy use of nearest-neighbor search, UMAP can consume excessive memory on some datasets. Setting low_memory to True switches to a slower but less memory-intensive approach to computing nearest neighbors. It is also important to know that, when run without a fixed random seed, UMAP uses Numba's parallel implementation to spread work across CPU cores. By default it uses as many cores as are available; you can limit the number of threads Numba uses with the NUMBA_NUM_THREADS environment variable. Also, due to the nature of Numba, UMAP does not support 32-bit Windows.
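A minimal sketch of a reproducible UMAP run on the scikit-learn digits dataset, using a fixed random state and the low-memory nearest-neighbor mode:
import umap
from sklearn.datasets import load_digits

digits = load_digits()
reducer = umap.UMAP(n_components=2, random_state=42, low_memory=True)
embedding = reducer.fit_transform(digits.data)   # one 2D point per input sample
print(embedding.shape)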
Despite some disadvantages, UMAP can be used in the following cases:
• exploratory data analysis (EDA);
• interactive visualization of the analysis results;
• processing of sparse matrices;
• detection of malicious programs based on behavioral data;
• preprocessing vectors of phrases for clustering;
• preprocessing of image embeddings (Inception) for clustering.
https://github.com/lmcinnes/umap
https://umap-learn.readthedocs.io/en/latest/index.html
198 views · 03:37
2021-12-10 08:35:54 Python and SQL combo with FugueSQL: a single SQL interface for Pandas, Spark and Dask dataframes
FugueSQL is an open-source Python library that lets you combine Python code with SQL commands, switching between them in a Jupyter notebook or a Python script. FugueSQL supports distributed computing and provides a unified API to run the same SQL code on Pandas, Dask, and Apache Spark.
Unlike PandaSQL, which relies on a single SQLite backend and therefore incurs a lot of overhead transferring data between Pandas and the database, FugueSQL supports multiple local backends: Pandas, DuckDB, and SQLite.
When using the Pandas backend, Fugue translates SQL directly into Pandas operations, eliminating data transfers. DuckDB has excellent Pandas support, so its data-transfer overhead is negligible. Both Pandas and DuckDB are therefore the preferred FugueSQL backends for local data processing. Fugue also supports Spark, Dask, and cuDF (via blazingSQL) as backends.
FugueSQL code is parsed using ANTLR and mapped to equivalent functions in the Fugue API. FugueSQL has many features built in, and the code is extensible with Python. Out of the box it supports the most common operations: filling nulls, dropping nulls, renaming columns, changing the schema, and more. Fugue also adds some enhancements over standard SQL to handle end-to-end data workflows gracefully, for example creating intermediate tables by assigning them to variables.
On Pandas, %%fsql uses the NativeExecutionEngine as the default. On Dask, FugueSQL is slightly slower than the native engine but more complete in terms of implemented SQL keywords. FugueSQL also runs on Spark, mapping %%fsql operations to Spark and Spark SQL operations. This allows you to develop distributed applications quickly: create a local prototype using the NativeExecutionEngine, test it, and deploy it to a Spark cluster just by changing the execution engine.
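A minimal sketch of the fsql entry point running SQL over a local Pandas dataframe (column names and values are made up; the exact import path may differ between Fugue versions):
import pandas as pd
from fugue_sql import fsql

df = pd.DataFrame({"region": ["EU", "US", "EU"], "amount": [10, 20, 30]})

query = """
SELECT region, SUM(amount) AS total
FROM df
GROUP BY region
PRINT
"""
fsql(query, df=df).run()            # default: local NativeExecutionEngine (Pandas)
# fsql(query, df=df).run("spark")   # the same query, executed on a Spark cluster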
https://towardsdatascience.com/introducing-fuguesql-sql-for-pandas-spark-and-dask-dataframes-63d461a16b27
https://fugue-tutorials.readthedocs.io/tutorials/fugue_sql/index.html
218 views · 05:35
2021-12-08 07:14:44 What is a CDC system and why you need it
To reduce the amount of data read from a corporate warehouse or lake while always staying on top of the latest changes, a CDC approach is used: Capture Changed Data, or Change Data Capture. There are ready-made CDC tools (Oracle GoldenGate, Qlik Replicate, and HVR) that are best suited for pulling data from frequently updated relational DBMSs. Data engineers also build their own solutions:
• CDC calculation using timestamps that mark creation, update, and expiration in the source tables. Any process that inserts, updates, or deletes a row must also update the corresponding timestamp column, and hard deletion is not allowed. The drawbacks of this method are that you need to redesign the database structure to add the timestamp columns, and it tightly couples the source table to the ETL process code (a minimal sketch of this approach follows below).
• CDC calculation using a negative query, where a link is created between the source and the target sink and a MINUS SQL query is executed to compute the change log. This method is more of an anti-pattern, since it only works if the source and destination databases are of the same type, and it also increases the amount of data moved.
Both methods of hand-written CDC impose significant overhead on the source database. Dedicated CDC tools reduce the load on the network and the source by analyzing the database logs to compute changes. However, the main disadvantage of off-the-shelf CDC solutions is their high cost. In addition, the source DBMS administrator must grant the CDC tool privileged access to the database log, which is often undesirable for security reasons.
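A minimal sketch of the timestamp-based variant mentioned above (the table and column names are hypothetical):
import sqlite3
from datetime import datetime

def extract_changes(conn: sqlite3.Connection, last_run: datetime) -> list:
    """Fetch rows inserted or updated in the source table since the previous ETL run."""
    query = """
        SELECT id, payload, updated_at
        FROM orders
        WHERE updated_at > ?
    """
    return conn.execute(query, (last_run.isoformat(),)).fetchall()

# example usage:
# changes = extract_changes(sqlite3.connect("source.db"), datetime(2021, 12, 1))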
https://towardsdatascience.com/change-data-capture-cdc-for-data-ingestion-ca81ff5934d2
162 views · 04:14