
Big Data Science

Channel address: @bdscience
Categories: Technologies
Language: English
Subscribers: 1.44K
Description from channel

Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼 — https://t.me/bds_job — channel about Data Science jobs and career
💻 — https://t.me/bdscience_ru — Big Data Science [RU]


The latest Messages 15

2021-10-13 07:44:25 FLAN by Google AI: generalizable Language Models with Instruction Fine-Tuning
To generate meaningful text, an ML model must have a large amount of knowledge about the world and the ability to abstract. Language models trained for this acquire more of that knowledge automatically as they scale, but it is still unclear how best to unlock this knowledge and apply it to specific real-world problems.
One popular recent technique for solving problems with language models is zero-shot or few-shot prompting. This method phrases a task as text the language model could have seen during training, so that the model produces the answer by completing the text. It performs well on some tasks, but it requires careful prompt design to make each task look like the training data, which is not an intuitive way to interact with the model in practice. For example, the creators of GPT-3 found that such prompting methods do not perform well on natural language inference (NLI) tasks.
Instead, FLAN fine-tunes the model on a wide variety of instructions that describe the task simply and intuitively, such as “Classify this movie review as positive or negative” or “Translate this sentence into Danish”. Creating an instruction dataset from scratch would be resource-intensive, so templates are used to convert existing datasets into the instruction format. Experiments by Google AI researchers, who compared FLAN and GPT-3 on 25 tasks, showed that this approach works.
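The template idea can be sketched in a few lines of Python. This is only an illustration, not the FLAN code: the template wording, the toy dataset, and the names SENTIMENT_TEMPLATES, LABEL_NAMES, and to_instruction_example are all made up for the example.

```python
import random

# Hypothetical templates: each one phrases the same sentiment task
# as a different natural-language instruction.
SENTIMENT_TEMPLATES = [
    "Classify this movie review as positive or negative.\n\nReview: {text}\nAnswer:",
    "Review: {text}\n\nIs the sentiment of this review positive or negative?",
    "Did the author of this review like the movie? Answer positive or negative.\n\n{text}",
]

LABEL_NAMES = {0: "negative", 1: "positive"}


def to_instruction_example(text, label, rng):
    """Turn one (text, label) record into an (input, target) training pair."""
    template = rng.choice(SENTIMENT_TEMPLATES)
    return {"input": template.format(text=text), "target": LABEL_NAMES[label]}


# Toy labeled dataset standing in for an existing benchmark.
dataset = [
    ("A warm, funny film with a great cast.", 1),
    ("Two hours I will never get back.", 0),
]

rng = random.Random(0)  # fixed seed so the template choice is repeatable
examples = [to_instruction_example(text, label, rng) for text, label in dataset]
for example in examples:
    print(example["input"], "->", example["target"])
```

Using several templates per dataset means the model sees the same task phrased in different ways, which is the point of tuning on instructions and not on a single fixed prompt.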
Notably, at small scale the FLAN method actually degrades performance; only at larger scale does the model become able to generalize from the instructions in the training data to unseen tasks. A likely reason is that small models do not have enough parameters to perform a large number of tasks.
https://ai.googleblog.com/2021/10/introducing-flan-more-generalizable.html
2021-10-11 08:39:29 Math for Data Scientist: 3 Distance Measures, Part 1
• Euclidean distance - the length of the straight line that connects two points. The most common measure, but not scale-invariant: the computed distances can be skewed by the units of the features, so normalize the data before using it. Its usefulness decreases as the dimensionality of the data grows, but it works well for low-dimensional data; for example, the kNN and HDBSCAN methods show good results with it. Finally, Euclidean distance is intuitive to use and easy to implement.
• Cosine similarity - the cosine of the angle between two vectors. It helps to avoid the problems Euclidean distance has in high dimensions. Two vectors with the same orientation have a cosine similarity of 1, and vectors that are diametrically opposed have a similarity of -1. Because it measures only orientation, the magnitude of the vectors is ignored. This makes it a poor fit for recommendation systems, where it cannot account for differences in rating scale between users. Nevertheless, cosine similarity is useful for high-dimensional data where the magnitude of the vectors does not matter, for example in text analysis.
• Hamming distance - the number of positions at which two vectors differ. Typically used to compare two binary strings of the same length by counting the characters that differ, for example to detect or correct errors in data transmitted over computer networks, where the number of corrupted bits in a binary word estimates the error. You can also use Hamming distance to measure the distance between categorical variables. It is difficult to apply when the two vectors have different lengths.
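The three measures above can be written out directly in plain Python. This is a minimal sketch for illustration; the function names are our own, and in practice you would more likely use NumPy or SciPy.

```python
import math


def euclidean_distance(a, b):
    """Length of the straight line between points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (both must be non-zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(cosine_similarity([1, 0], [-1, 0]))   # -1.0 (opposite directions)
print(cosine_similarity([1, 2], [2, 4]))    # close to 1: same direction, different magnitude
print(hamming_distance("10110", "10011"))   # 2
```

Note how [1, 2] and [2, 4] are far apart by Euclidean distance but almost identical by cosine similarity, because only their magnitudes differ.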
2021-10-08 07:26:04 5 Scikit-learn tips from Data Scientist
1. Fill the gaps with IterativeImputer, which models each feature with missing values as a function of the other features and refines the filled-in values with each iteration. To use it, first import enable_iterative_imputer from the sklearn.experimental package, then import IterativeImputer from sklearn.impute.
2. Generate random dummy data to hold the place where real or useful data should be present. The dummy data is needed for testing, so it must be reliable. To do this, you can use make_classification() for a classification task or make_regression() for a regression task. You can also set the number of samples and features to control the behavior of the data in debugging and testing.
3. Save ML models for reuse without retraining. To serialize your fitted estimators and save them, the pickle and joblib Python libraries come in handy.
4. Plot a confusion matrix with the plot_confusion_matrix function, which displays the true positive, false positive, false negative and true negative counts. Note that this function was deprecated in scikit-learn 1.0 and removed in 1.2; newer versions use ConfusionMatrixDisplay.from_estimator.
5. Visualize decision trees with the tree.plot_tree function, which draws with matplotlib, so simple visualizations need no manually installed dependencies. You can also save the tree as a PNG file.
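Tips 1 to 3 can be combined in one short sketch. It assumes scikit-learn and NumPy are installed; the dataset size, the 10% missing rate, and the choice of LogisticRegression are arbitrary choices for the example, not part of the original tips.

```python
import pickle

import numpy as np
from sklearn.datasets import make_classification
# This import enables the experimental estimator and must come first.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LogisticRegression

# Tip 2: synthetic classification data with a known shape.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Knock out roughly 10% of the values to simulate gaps.
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.1] = np.nan

# Tip 1: fill the gaps iteratively.
imputer = IterativeImputer(random_state=0)
X_filled = imputer.fit_transform(X_missing)

# Tip 3: serialize the fitted model and restore it without retraining.
model = LogisticRegression().fit(X_filled, y)
restored = pickle.loads(pickle.dumps(model))
same_predictions = bool((restored.predict(X_filled) == model.predict(X_filled)).all())
print(same_predictions)
```

To persist a model to disk, joblib.dump and joblib.load can be used in place of the pickle round trip shown here.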
https://www.educative.io/blog/scikit-learn-tricks-tips
2021-10-07 14:21:55 How to raise the quality of data?

You can have perfect outcomes at every stage of product promotion, but without quality data they will not be reliable and will not bring useful results. What matters most about data is consistency, especially for product analytics. Data quality depends on it heavily.

At Matemarketing, Vlad Kharitonov and Oleg Khomyuk will explain how to achieve consistency in all cases, including scaling. Their talk covers strict contract-based categorization, versioning, cross-platform cases, and using legacy for scaling.

Matemarketing is the biggest conference on Marketing and Product Analytics, Monetization and Data-Driven Solutions in Russia and CIS.

- - - -
Matemarketing-21 will take place on November 18-19 in Moscow and will be available online.
The full program and all details are available on our website.
- - - -

And now we want to share a talk from Matemarketing by Jordi Roura (InfoTrust Barcelona). You will learn how to ensure data quality in theory and see examples of how this knowledge is applied in specific cases.
2021-10-06 08:44:42 Reverse ETL: what it is and how to use it
Reverse ETL is the process of copying data from a data warehouse to operational systems, including SaaS tools for marketing, sales, and support. This allows any team of professionals, from salespeople to engineers, to access the data they need in the systems they already use. There are 3 main use cases for reverse ETL:
• Operational analytics - providing insights to business teams in their normal workflows and tools so they can make data-driven decisions
• Data automation - automating ad hoc requests for data from other teams, for example, when the finance team requests product usage data for billing
• Personalization - tailoring interaction with customers across different applications
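The core loop of reverse ETL (query the warehouse, then push the result into an operational tool) can be sketched as follows. This is a toy under stated assumptions: an in-memory SQLite database stands in for the warehouse, and FakeCrm, the product_usage table, and sync_usage_to_crm are hypothetical names invented for the example, not the API of any tool listed in this post.

```python
import sqlite3

# In-memory SQLite as a stand-in for the data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE product_usage (customer_id TEXT, month TEXT, api_calls INTEGER)"
)
warehouse.executemany(
    "INSERT INTO product_usage VALUES (?, ?, ?)",
    [("c1", "2021-09", 1200), ("c1", "2021-10", 800), ("c2", "2021-10", 50)],
)


class FakeCrm:
    """Hypothetical operational tool; a real one would be reached through its API."""

    def __init__(self):
        self.records = {}

    def upsert(self, customer_id, fields):
        # Create the customer record if needed, then merge in the new fields.
        self.records.setdefault(customer_id, {}).update(fields)


def sync_usage_to_crm(conn, crm):
    """Aggregate usage in the warehouse and copy it onto CRM customer records."""
    rows = conn.execute(
        "SELECT customer_id, SUM(api_calls) FROM product_usage GROUP BY customer_id"
    ).fetchall()
    for customer_id, total_calls in rows:
        crm.upsert(customer_id, {"total_api_calls": total_calls})
    return len(rows)


crm = FakeCrm()
synced = sync_usage_to_crm(warehouse, crm)
print(synced, crm.records)
```

A production sync would also need scheduling, change detection, and rate-limit handling, which is a large part of what dedicated reverse ETL tools provide.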

The most popular reverse ETL tools today are:
• Hightouch - a data platform that synchronizes data from warehouses with CRM, marketing and customer support tools https://hightouch.io/docs/
• Census - an operational analytics platform that synchronizes the data warehouse with different applications https://www.getcensus.com/
• Octolis - a cloud service that lets marketing and sales teams easily deploy use cases by activating their data in operational tools such as CRM or marketing automation software https://octolis.com/
• Grouparoo - an open source reverse ETL tool that runs easily on a laptop or in the cloud, so you can develop locally, commit changes, and deploy https://www.grouparoo.com/docs/config
• Polytomic - an ETL solution that delivers all the necessary customer data to Marketo, Salesforce, HubSpot and other business systems in real time, with setup taking a couple of minutes https://www.polytomic.com/
• RudderStack - a customer data platform for developers whose reverse ETL tools make it easy to deploy pipelines that collect customer data from each application, website, and SaaS platform and activate it in the DWH and in business applications https://rudderstack.com/
• Workato - a tool for automating business processes in cloud and local applications https://www.workato.com/
• Omnata - a data integration tool for modern architectures https://omnata.com/
• Smart ETL Tool from Rivery - a platform for automating ETL processes using any cloud-based DBMS, including Redshift, Oracle, BigQuery, Azure and Snowflake https://rivery.io/
2021-10-04 11:33:10 Luxury EDA with Lux
Some DS tools come in handy in daily work. One example is Lux, a Python library that simplifies and accelerates data exploration by automating the visualization and analysis of data. When a dataframe is displayed in a Jupyter Notebook, Lux recommends a set of visualizations that highlight interesting trends and patterns in the dataset. The visualizations are shown in an interactive widget that lets users quickly browse large collections of visualizations and understand the data. Deeply integrated with Pandas, Lux supports the various geographic and temporal data types in the library, as well as SQL queries against Postgres.
Lux consists of several layers, each with its own responsibility:
• user interface layer;
• user input validation and parsing layer;
• intent processing layer;
• data execution layer;
• analytics layer.
https://github.com/lux-org/lux
https://lux-api.readthedocs.io/en/latest/source/getting_started/overview.html
2021-10-01 09:44:49 Machine unlearning is a new challenge in ML
Sometimes ML algorithms have to forget what they have learned. One reason is privacy, which artificial intelligence can undermine. Regulators around the world have the right to compel companies to remove inappropriate information. EU and California citizens may require a company to delete their data. Recently, regulators in the US and Europe have said that owners of artificial intelligence systems must sometimes remove systems trained on sensitive data. In 2020, the UK data regulator warned companies that some ML programs may be subject to GDPR rights because they contain personal data. In early 2021, the FTC forced facial recognition startup Paravision to delete a collection of improperly collected photographs of faces and the ML algorithms trained on them.
Thus, we come to a new area of DS called machine unlearning, which seeks ways to induce selective amnesia in AI: removing all traces of a particular person or data point from an ML system without affecting its performance. Some studies have shown that under certain conditions it is possible to make ML algorithms forget something, but the method is not yet ready for use in production. Specifically, in 2019, scientists from the Universities of Toronto and Wisconsin-Madison proposed splitting the raw training data into multiple parts, each of which is processed separately before the results are combined into the final ML model. If you later need to forget one data point, you only need to reprocess the part of the original dataset that contained it. Testing has shown that the approach works with online shopping data and a collection of over a million photographs. However, the unlearning system breaks down if deletion requests arrive in a specific sequence. Researchers are now looking for ways to solve this problem. For now, machine unlearning techniques are more of a demonstration of technical acumen than a major shift in data protection. After all, even if machines learn to forget, users will have to remember who they are sharing their data with.
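The split-and-retrain idea can be illustrated with a toy sketch. It is not the researchers' implementation: the per-class mean "model", the round-robin split, the majority vote, and the names train_centroids, predict_one, and ShardedEnsemble are simplifying assumptions made for this example.

```python
from collections import Counter


def train_centroids(points):
    """Tiny stand-in model: the mean feature value for each class label."""
    sums, counts = {}, Counter()
    for x, label in points:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}


def predict_one(model, x):
    """Return the label whose mean is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))


class ShardedEnsemble:
    def __init__(self, data, n_shards):
        # Round-robin split: every record lives in exactly one shard.
        self.shards = [data[i::n_shards] for i in range(n_shards)]
        self.models = [train_centroids(shard) for shard in self.shards]
        self.retrain_count = 0

    def predict(self, x):
        # Combine the per-shard models by majority vote.
        votes = Counter(predict_one(model, x) for model in self.models)
        return votes.most_common(1)[0][0]

    def forget(self, record):
        # Only the shard that held the record is retrained.
        for i, shard in enumerate(self.shards):
            if record in shard:
                shard.remove(record)
                self.models[i] = train_centroids(shard)
                self.retrain_count += 1
                return i
        raise ValueError("record not found in any shard")


data = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (5.0, "b"), (5.2, "b"), (4.8, "b")]
ensemble = ShardedEnsemble(data, n_shards=3)
before = ensemble.predict(1.1)
shard_index = ensemble.forget((1.2, "a"))
after = ensemble.predict(1.1)
print(before, shard_index, after, ensemble.retrain_count)
```

The cost of forgetting one record is a single shard retrain, not a full retrain, which is the trade-off the approach relies on.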
https://www.wired.com/story/machines-can-learn-can-they-unlearn/
2021-09-29 10:39:29 Analysis of the US Data Science job market in 2021: a Selenium web scraping project on open vacancies, with visualized results and conclusions. The review also covers the popularity of programming languages and ML frameworks among US employers.
https://pub.towardsai.net/current-data-science-job-market-trend-analysis-future-4184f03a04ca