
Big Data Science

Channel address: @bdscience
Categories: Technologies
Language: English
Subscribers: 1.44K
Description from channel

The Big Data Science channel gathers interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼 — https://t.me/bds_job — channel about Data Science jobs and career
💻 — https://t.me/bdscience_ru — Big Data Science [RU]

Ratings & Reviews

1.67 (3 reviews)

5 stars: 0
4 stars: 0
3 stars: 1
2 stars: 0
1 star: 2


The latest Messages

2022-08-09 10:33:18
On August 11, Alfa Data Science MeetUp #2 will take place.

Participation is free; register on the website to get a link to the online broadcast.

Topics and speakers:
Growing the customer base: LTV modeling and forecasting future revenue
- Sergey Korolev, Middle Data Scientist, Alfa-Bank
Uplift modeling in credit product pricing
- Maxim Komatovsky, Junior Data Scientist, Alfa-Bank
Perfect calculation code
- Maxim Statsenko, Team Lead / Senior DWH Developer at Yandex
Beating distribution shift in neural-network credit scoring
- Alexey Firstov, Senior Data Scientist, Alfa-Bank

The meetup will be held in an interactive format: questions to the speakers are welcome, and the authors of the best questions will receive prizes from Alfa Digital.
596 views · 07:33
2022-08-08 12:53:58 Looking for data to train ML models? Generate it yourself: 3 Python packages for generating synthetic data
Synthetic data is artificially generated rather than collected from real events; it is used for training ML models or practicing analysis techniques. You can create it yourself with these Python packages:
• Faker is a very simple and efficient Python package for creating fake data. It's great when you need to seed a database, create XML documents, prepare for load testing, or anonymize data retrieved from production services (see the sketch after this list). https://github.com/joke2k/faker
• SDV (Synthetic Data Vault) is a library for creating synthetic data based on a given dataset. The generated data can be a single table, multiple related tables, or a time series, and it has the same properties and statistics as the original dataset. SDV generates synthetic data with DL models and copes even when the original dataset contains multiple data types and gaps. https://sdv.dev/SDV/
• Gretel Synthetics is an open-source package based on recurrent neural networks for generating structured and unstructured data. Its batch approach treats a dataset as text and trains a model on it; the model then creates synthetic data as text. Since Gretel is based on RNNs, it needs more computing power, so it is better to run it in Google Colab rather than loading your personal computer. https://synthetics.docs.gretel.ai/en/stable/
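
As a quick illustration of the first option, here is a minimal Faker sketch; the field choices and record count are arbitrary, just to show how little code a fake dataset takes:

```python
from faker import Faker  # pip install Faker

fake = Faker()
Faker.seed(42)  # make the generated records reproducible

# Generate a small synthetic "customers" dataset.
customers = [
    {
        "name": fake.name(),
        "email": fake.email(),
        "address": fake.address().replace("\n", ", "),
        "signup": fake.date_between(start_date="-2y", end_date="today"),
    }
    for _ in range(5)
]

for row in customers:
    print(row)
```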
501 views · 09:53
2022-08-04 07:17:35 4 utilities for working with JSON files
Hadoop and Spark are the most popular big data frameworks, built for large files. But often you need to process many small files, for example in JSON format, which in Hadoop HDFS end up spread over many data blocks and partitions. The number of partitions determines the number of tasks, since one task can deal with only one partition at a time. This creates a high load on the Application Master and reduces the productivity of the entire cluster. Besides, most of the time is spent on opening and closing files rather than on reading data.
Therefore, it is worth combining many small files into one large file, which Hadoop and Spark can process very quickly. For JSON files, such a merge into an array of records can be done with these tools (a Python sketch of the same merge follows the link below):
• jq – filters and processes incoming JSON data; great for parsing and processing data streams https://stedolan.github.io/jq/
• jo – creates JSON data structures https://github.com/jpmens/jo
• json_pp – prints JSON objects in a more convenient format and can convert them between formats https://github.com/deftek/json_pp
• jshon – a JSON parser with fast evaluation of large amounts of data http://kmkeen.com/jshon/
https://sidk17.medium.com/boss-we-have-a-large-number-of-small-files-now-how-to-process-these-files-ee27f67dc461
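
For illustration, a minimal Python sketch of the same merge; the parts/ directory and the one-object-per-file layout are assumptions made for the example:

```python
import json
from pathlib import Path

# Merge many small JSON files (one object per file) into a single
# array-of-records file that Hadoop/Spark can process efficiently.
# jq's slurp mode does the same in one line:
#   jq -s '.' parts/*.json > merged.json
records = [json.loads(p.read_text()) for p in sorted(Path("parts").glob("*.json"))]

Path("merged.json").write_text(json.dumps(records, indent=2))
print(f"merged {len(records)} files into merged.json")
```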
552 views · 04:17
2022-08-02 09:24:45 Data analysts speak SQL. How can they understand each other?
Every analyst knows 5 rules of SQL query formatting to make them easy to read:
• Place Key Words (SELECT, FROM and WHERE) On New Lines
• List Column Names after SELECT On New Lines
• Indent Sub-Elements On New Line
• Add Subquery Parenthesis On Their Own Lines
• Place Case Statement Conditions On New Lines
However, not all analysts apply these rules in practice. Of course, specialized IDEs take over the formatting: for example, Visual Studio Code has built-in document formatting as well as external extensions such as SQLTools or SqlBeautifier. If you need to read a very large SQL query from colleagues, sent as flat text, use an online formatter to convert it to a readable form (a programmatic sketch follows the list of services below):
• https://codebeautify.org/sqlformatter
• https://www.freeformatter.com/sql-formatter.html
• https://sqlformat.org/
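
The same reformatting can be scripted. A minimal sketch with the python-sqlparse library, which also powers sqlformat.org; the sample query is made up:

```python
import sqlparse  # pip install sqlparse

raw = ("select id, name, total from orders o "
       "join users u on u.id=o.user_id where total > 100 order by total desc")

pretty = sqlparse.format(
    raw,
    reindent=True,         # put keywords, columns, and sub-elements on new lines
    keyword_case="upper",  # SELECT, FROM, WHERE ...
)
print(pretty)
```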
426 views · 06:24
2022-07-30 07:58:14 TOP-10 DS events in August 2022 around the world:
• Aug 5, Bayesian Modelling Applications Workshop. Eindhoven, The Netherlands + Virtual. http://abnms.org/uai2022-apps-workshop/
• Aug 8-12, Data Matters. Virtual. https://datamatters.org/
• Aug 11, Subsurface Community Meetup: Why Apache Arrow is the industry-standard for columnar data processing and transport. Virtual. https://subsurfacemeetupaugust2022.splashthat.com/
• Aug 14-18, KDD 2022: ACM SIGKDD 2022. Washington, DC, USA. https://kdd.org/kdd2022/
• Aug 15, 1st ACM SIGKDD Workshop on Content Understanding and Generation for E-commerce. Washington, DC, USA. https://content-generation.github.io/workshop/
• Aug 15-17, TDWI Data Literacy Bootcamp. Virtual. https://tdwi.org/events/seminars/august/tdwi-data-literacy-bootcamp/home.aspx
• Aug 15-17, Disney Data & Analytics Conference. Orlando, FL, USA. https://disneydataconference.com/
• Aug 16, StateOfTheArt() - Free AI Conference with Top AI/ML Influencers. Virtual. https://www.eventbrite.com/e/stateoftheart-free-ai-conference-with-top-aiml-influencers-tickets-379160628647
• Aug 16-18, Ai4 2022, the industry's leading AI event. Las Vegas, NV, USA. https://ai4.io/usa/application-attendee/
• Aug 23, The data dividend: Mumbai. Mumbai, India. https://events.economist.com/custom-events/the-data-dividend-mumbai
• Aug 23-24, Ray Summit. San Francisco, CA, USA.
473 views · 04:58
2022-07-27 07:34:45 PyMLPipe: A lightweight MLOps Python Package
PyMLPipe is a lightweight Python package for MLOps processes. It helps to automate:
• Monitoring of models and data schemas
• Versioning of ML models and data
• Model performance comparison
• API deployment in one click
This open-source library supports Scikit-learn, XGBoost, LightGBM, and PyTorch. It has a modular structure: a set of Python functions combined into an API plus a visual graphical interface. PyMLPipe is great for working with tabular data; a tentative usage sketch follows the link below.
https://neelindresh.github.io/pymlpipe.documentation.io/
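
A rough usage sketch, assuming the API shown in the PyMLPipe docs; the method names are assumptions and may differ between versions:

```python
# Sketch based on the PyMLPipe documentation; treat the method names
# (set_experiment, set_version, run, log_metric, register_model) as
# assumptions, not a verified API reference.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from pymlpipe.tabular import PyMLPipe  # pip install pymlpipe

mlp = PyMLPipe()
mlp.set_experiment("iris-demo")  # experiment name (made up for the example)
mlp.set_version(0.1)

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

with mlp.run():  # track this training run for monitoring/comparison
    mlp.log_metric("accuracy", model.score(X, y))
    mlp.scikit_learn.register_model("logreg", model)  # version the model
```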
565 views · 04:34
2022-07-25 11:09:54 TOP 4 dbt tips for data analysts and data engineers
dbt (data build tool) is an open-source framework for executing, testing, and documenting SQL queries: it lets you structure and describe data models, search them, use nested references, trigger rules, and generate documentation and tests. For example, you can use the dbt CLI or dbt Cloud to ingest, transform, and load data into a warehouse on a schedule. To use dbt more effectively with schemas, sources, and models, keep these tips in mind:
• The schema.yml file lives only in the dbt models folder. It lets you define a simple test, such as checking a column for nulls.
• dbt data tests have a strict rule: they must return zero rows to pass. Instead of checking for a specific value, such as the count of a particular set of rows, a data test should be written so that it returns zero rows when the results are correct. So when developing a data test, think about how to make the query return 0 rows in the expected case while still checking the numbers; the != or <= operators are handy for such validation (see the sketch after this list).
• To speed up testing, increase the number of threads in the project profile, in the profiles.yml file. With a higher thread count set there, a suite of about 30 data and schema tests can finish in roughly 4 seconds.
• Give each test a meaningful name. Although dbt names tests automatically, it is recommended to label them yourself. dbt doesn't offer much control over running small test suites: it sees all tests in the project, and when it reports a failure, all schema and data tests have been run together. Just as developers are encouraged to use semantically named functions and variables, tests should get meaningful names; otherwise it is hard to tell which test passed or failed during a run. You can't easily run a single directory inside the data tests folder, but commands like `dbt test --schema` or `dbt test --data` let you quickly select which kind of tests to run.
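
To make the zero-rows rule concrete outside dbt, here is a minimal Python sketch using sqlite3; the table and the business rule are invented for illustration. In a real project, the SELECT would live as a .sql file in the dbt tests folder:

```python
import sqlite3

# Toy table with a business rule: payment amounts must be positive.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)",
                 [(1, 10.0), (2, 25.5), (3, 4.0)])

# A dbt-style data test: SELECT the rows that VIOLATE the rule.
# The test passes only when the query returns zero rows.
test_sql = "SELECT id, amount FROM payments WHERE amount <= 0"

bad_rows = conn.execute(test_sql).fetchall()
assert len(bad_rows) == 0, f"data test failed, violating rows: {bad_rows}"
print("data test passed: 0 rows returned")
```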

https://corissa-haury.medium.com/4-quick-facts-about-dbt-testing-5c32b487b8cd
618 views · 08:09
2022-07-22 07:18:42
#test
Which of the following methods helps avoid overfitting an ML model on a large volume of highly noisy input data by highlighting the most significant features?
Anonymous Quiz
15%
filtration
47%
L1 regularization
27%
L2 regularization
11%
normalization
74 voters · 566 views · 04:18
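
The most-voted answer, L1 regularization, is indeed the method that performs feature selection: the penalty drives the weights of uninformative features to exactly zero. A minimal scikit-learn sketch on synthetic noisy data; the sizes and alpha are arbitrary:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# 200 samples, 50 features, but only the first 3 actually drive the target;
# the other 47 are pure noise.
X = rng.normal(size=(200, 50))
y = 3 * X[:, 0] - 2 * X[:, 1] + X[:, 2] + rng.normal(scale=0.5, size=200)

# The L1 penalty pushes weights of uninformative features to exactly zero.
model = Lasso(alpha=0.1).fit(X, y)
kept = np.flatnonzero(model.coef_)
print("features kept by L1:", kept)  # typically [0, 1, 2]
```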
2022-07-20 18:54:24
Computational Complexity of Machine Learning Algorithms
1.5K views · 15:54
2022-07-20 18:53:54 Computational complexity of ML algorithms
When the amount of data is small, almost any ML algorithm gives acceptable accuracy and is suitable for solving the task. But when the volume and dimensionality of the data become large, you have to choose a training algorithm that does not demand too many computing resources. It is better to pick a simple or computationally cheaper algorithm over an expensive one when the prediction accuracy is similar or even only slightly worse.
The choice of algorithm depends on the following factors:
• time complexity – the order of time required to run the algorithm; a function of the algorithm itself, the data volume, and the number of features;
• space complexity – the order of space required while the algorithm runs; a function of the algorithm, such as the number of features, coefficients, or hidden layers of a neural network. Space complexity includes both the size of the input data and the auxiliary space the algorithm uses during execution.
For example, Mergesort uses O(n) auxiliary space and has O(n) space complexity, while Quicksort uses O(1) auxiliary space and has O(n) space complexity. At the same time, both merge sort and quicksort have O(n log n) time complexity; the sketch below makes these costs concrete.
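
A minimal merge sort sketch showing where both costs come from: the merged buffer is the O(n) auxiliary space, and log n levels of splitting with O(n) merging work per level give the O(n log n) time:

```python
def merge_sort(a):
    """O(n log n) time: log n levels of recursion, O(n) merging per level."""
    if len(a) <= 1:
        return a
    mid = len(a) // 2
    left, right = merge_sort(a[:mid]), merge_sort(a[mid:])
    # This buffer is the O(n) auxiliary space of merge sort.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 5, 6]))  # [1, 2, 5, 5, 6, 9]
```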
https://medium.com/datadailyread/computational-complexity-of-machine-learning-algorithms-16e7ffcafa7d
535 views · edited · 15:53