Big Data Science

Categories: Technologies

Language: English

Subscribers: 1.44K

Description from channel

Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼 — https://t.me/bds_job — channel about Data Science jobs and career
💻 — https://t.me/bdscience_ru — Big Data Science [RU]

Rating: 1.67 (3 reviews)

The latest Messages 9

2022-02-11 05:57:44 How to create an exe-file from a py-script
Although you can run a Python script from a terminal or an editor, sometimes you need to wrap all the code of a .py file inside an executable (.exe) file, for example to schedule a job that runs an executable at a specific time. This can be done in two ways:
• via the GUI of the auto-py-to-exe package (https://pypi.org/project/auto-py-to-exe/): install it with pip install auto-py-to-exe, launch it, and follow its 4 steps in sequence;
• in the terminal with the PyInstaller library: install it with pip install pyinstaller, go to the directory containing the .py file, and build the executable with the command pyinstaller --onefile name_of_script.py
In fact, the first method is a graphical wrapper around the second. Those who are comfortable with the command line can use PyInstaller directly, without any additional wrappers.
https://towardsdatascience.com/how-to-easily-convert-a-python-script-to-an-executable-file-exe-4966e253c7e9
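The terminal route can also be scripted, which is handy when the build is part of a scheduled job. The sketch below only assembles and runs the same pyinstaller --onefile command from Python. The helper names build_pyinstaller_command and build_executable are hypothetical, and the sketch assumes PyInstaller is already installed for the current interpreter.

```python
import subprocess
import sys


def build_pyinstaller_command(script_path, onefile=True):
    """Return the argument list for a PyInstaller build of script_path."""
    # "python -m PyInstaller" runs PyInstaller with the current interpreter.
    command = [sys.executable, "-m", "PyInstaller"]
    if onefile:
        command.append("--onefile")  # bundle everything into one executable
    command.append(script_path)
    return command


def build_executable(script_path):
    """Run the build; raises CalledProcessError if PyInstaller fails."""
    subprocess.run(build_pyinstaller_command(script_path), check=True)


if __name__ == "__main__":
    # name_of_script.py is the placeholder file name used in the post.
    print(build_pyinstaller_command("name_of_script.py"))
```

By default PyInstaller places the resulting executable in a dist subdirectory next to the script.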


2022-02-07 07:21:21 Undouble - Python library for detecting duplicate images using hash functions
Finding identical or similar photos manually is a long and tedious task. It cannot be solved simply by comparing file sizes and names, because photos come from different sources (mobile devices, social networking applications, etc.), which leads to differences in these attributes as well as in resolution, scaling, compression, and brightness. Image hash functions are well suited to detecting identical and similar photos because they are robust to minor changes. This idea is the basis of Undouble, a Python library that works in several stages: image preprocessing (grayscale conversion, normalization, and scaling), image hash calculation, and image grouping. A threshold of 0 groups only images with an identical image hash. The results can be easily examined using the plotting function, and the images can be moved using the move function. When moving images, the image with the highest resolution in each group is copied, and all other images are moved to the "undouble" subdirectory.
To try this open-source library (https://github.com/erdogant/undouble), first install it with pip install undouble, then import it into your project: from undouble import Undouble. After you set the hash method and hash size, Undouble detects duplicates in the following steps: it recursively reads all images with the specified extensions from the directory, computes their hashes, and groups similar images.
See an example with explanations here:
https://towardsdatascience.com/detection-of-duplicate-images-using-image-hash-functions-4d9c53f04a75
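To illustrate why an image hash tolerates changes such as brightness, here is a minimal pure-Python sketch of one simple hash function (an average hash) with grouping by Hamming distance. It is not Undouble's implementation: the functions and the tiny 2x2 "images" are hypothetical, and real images would first be converted to grayscale and scaled down as described above.

```python
def average_hash(pixels):
    """Hash a small grayscale image given as rows of 0-255 intensities."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    # One bit per pixel: 1 if the pixel is brighter than the image mean.
    return tuple(int(value > mean) for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the bit positions in which two hashes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(hash_a, hash_b))


def group_images(images, threshold=0):
    """Group image names whose hashes differ by at most threshold bits."""
    groups = []  # each entry: (hash of the first member, [names])
    for name, pixels in images.items():
        image_hash = average_hash(pixels)
        for representative, names in groups:
            if hamming_distance(representative, image_hash) <= threshold:
                names.append(name)
                break
        else:
            groups.append((image_hash, [name]))
    # Only groups with more than one member contain duplicates.
    return [names for _, names in groups if len(names) > 1]


# Hypothetical 2x2 grayscale images; the second is a brighter copy of the first.
images = {
    "original.jpg": [[10, 200], [220, 30]],
    "brighter_copy.jpg": [[30, 220], [240, 50]],
    "other.jpg": [[200, 10], [30, 220]],
}
duplicate_groups = group_images(images, threshold=0)
print(duplicate_groups)
```

With threshold=0 only images with identical hashes end up in the same group. A larger threshold would also group images whose hashes differ in a few bits.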


2022-02-04 06:40:59 Terality - super fast serverless engine instead of slow Pandas
Terality is a serverless data processing engine that runs on large clusters and works with datasets of any size. Thanks to the serverless model, you don't have to worry about scaling cluster resources or other infrastructure: there are practically no limits on memory, and therefore on dataset size. All you need is a good internet connection to process hundreds of GB, even on a simple office laptop with 4GB of RAM. Terality lets you run Pandas code 10 times faster, and its syntax is similar to that of Pandas, so switching takes only one line of code: import terality as te. When you call Pandas functions, the Python package sends HTTPS requests to the Terality engine, which processes the data and the command and returns the result. However, Terality is not just a Python package but freemium software with a free 1TB plan. Every API call counts toward this quota, not just data reads.
https://docs.terality.com/
https://towardsdatascience.com/good-bye-pandas-meet-terality-its-evil-twin-with-identical-syntax-455b42f33a6d


2022-02-02 05:36:21 Yandex Courier Robots in Seoul
As early as last year, Yandex's autonomous courier robots began delivering orders in Russia, as well as food from restaurants in Ann Arbor, Michigan, and on US student campuses. In January 2022, Yandex signed an agreement of intent with KT Corporation, a large South Korean telecommunications company, on delivery by autonomous robots in Seoul. As a result, South Korea will become, as early as this year, the first country in East Asia where Yandex rovers operate. The company is also preparing to launch this technology in Dubai.
http://www.koreaherald.com/view.php?ud=20220118000709


2022-01-31 17:41:24 TOP-10 Data Science conferences in February 2022:
1. 02 Feb - Virtual conference DataOps Unleashed https://dataopsunleashed.com/
2. 03 Feb - Beyond Big Data: AI/Machine Learning Summit 2022, Pittsburgh, USA https://www.pghtech.org/events/BeyondBigData2022
3. 10 Feb - Online-summit AICamp ML Data Engineering https://www.aicamp.ai/event/eventdetails/W2022021009
4. 12-13 Feb - IAET International Conference on Machine Learning, Smart & Nanomaterials, Design Engineering, Information Technology & Signal Processing. Budapest, Hungary https://institute-aet.com/mns-22/
5. 16 Feb - DSS Hybrid Miami: AI & ML in the Enterprise. Miami, FL, USA & Virtual https://www.datascience.salon/miami/
6. 17-18 Feb - RE.WORK. San Francisco, CA, USA and Online
• Reinforcement Learning Summit: https://www.re-work.co/events/reinforcement-learning-summit-2022
• Deep Learning Summit: https://www.re-work.co/events/deep-learning-summit-2022
• Enterprise AI Summit: https://www.re-work.co/events/enterprise-ai-summit-2022
7. 18-20 Feb - International Conference on Compute and Data Analysis (ICCDA 2022). Sanya, China http://iccda.org/
8. 21-25 Feb - WSDM'22, The 15th ACM International WSDM Conference. Online. http://www.wsdm-conference.org/2022/
9. 22-23 Feb - AI & ML Developers Conference. Virtual. https://cnvrg.io/mlcon
10. 26-27 Feb - 9th International Conference on Data Mining and Database (DMDB 2022). Vancouver, Canada https://ccseit2022.org/dmdb/


2022-01-28 08:11:41 Upscaling video games with NVIDIA's DLDSR
DLDSR (Deep Learning Dynamic Super Resolution) is an image enhancement technology for video games. It uses a multilayer neural network that needs fewer rendered pixels to reach the same image quality: DLDSR at 2.25X is comparable in quality to the previous-generation DSR technology at 4X. At the same time, DLDSR performs much better thanks to the Tensor Cores of RTX graphics cards, which accelerate neural networks several times over. You can try DLDSR on your gaming computer by updating your graphics driver and selecting the desired settings.
https://www.rockpapershotgun.com/nvidias-deep-learning-dynamic-super-resolution-tech-is-out-now-heres-how-to-enable-it
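To make the 2.25X and 4X factors concrete, the sketch below computes the render resolution for each. It assumes that the factor multiplies the total pixel count, so each axis scales by the square root of the factor, and it uses a hypothetical 1920x1080 display. The function name render_resolution is illustrative only.

```python
import math


def render_resolution(width, height, factor):
    """Return the render resolution for a factor applied to the pixel count."""
    # A factor on total pixels means each axis scales by sqrt(factor).
    scale = math.sqrt(factor)
    return round(width * scale), round(height * scale)


# Hypothetical 1920x1080 display.
dldsr_width, dldsr_height = render_resolution(1920, 1080, 2.25)
dsr_width, dsr_height = render_resolution(1920, 1080, 4.0)

# Share of pixels rendered at 2.25X relative to 4X.
pixel_ratio = (dldsr_width * dldsr_height) / (dsr_width * dsr_height)
print((dldsr_width, dldsr_height), (dsr_width, dsr_height), pixel_ratio)
```

Under these assumptions the 2.25X mode renders 2880x1620, which is a little over half the pixels of the 3840x2160 frame rendered at 4X.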


2022-01-26 05:31:34

Comparison of LaMDA metrics with human ratings


2022-01-26 05:31:25 LaMDA: Safe, Grounded, and High-Quality Dialog Model from Google AI
LaMDA is created by fine-tuning a family of Transformer-based neural language models specialized for dialogue, with up to 137B parameters, and by teaching the models to use external knowledge sources. LaMDA has three key objectives:
• Quality, which is measured in terms of Sensibleness, Specificity, and Interestingness. These metrics are rated by people. Sensibleness indicates whether a response makes sense in the context of the dialogue, for example that the ML model gives no absurd answers and does not contradict its earlier answers. Specificity indicates whether the response is specific to the preceding dialogue context. Interestingness measures the emotional reaction of the interlocutor to the answers of the ML model.
• Safety, so that the model's responses do not contain offensive or dangerous statements.
• Groundedness. Modern language models often generate statements that seem plausible but in fact contradict facts established in external sources. Groundedness is defined as the percentage of responses with statements about the outside world that can be verified by reputable external sources, as a share of all responses that contain such statements. A related metric, Informativeness, is defined as the percentage of responses with information about the outside world that can be confirmed by known sources, as a share of all responses.
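The difference between Groundedness and Informativeness lies in the denominator, which a small worked example makes visible. The sketch below assumes, as in the linked Google AI post, that Groundedness is taken over responses containing claims about the outside world and Informativeness over all responses. The has_claim and supported labels and the sample data are hypothetical stand-ins for human annotations.

```python
def groundedness(responses):
    """Percent of claim-bearing responses whose claims are supported."""
    with_claims = [r for r in responses if r["has_claim"]]
    if not with_claims:
        return 0.0
    supported = sum(1 for r in with_claims if r["supported"])
    return 100.0 * supported / len(with_claims)


def informativeness(responses):
    """Percent of all responses that carry supported external information."""
    if not responses:
        return 0.0
    supported = sum(1 for r in responses if r["has_claim"] and r["supported"])
    return 100.0 * supported / len(responses)


# Hypothetical annotations for five model responses.
annotated = [
    {"has_claim": True, "supported": True},
    {"has_claim": True, "supported": True},
    {"has_claim": True, "supported": True},
    {"has_claim": True, "supported": False},
    {"has_claim": False, "supported": False},  # small talk, no external claim
]
print(groundedness(annotated), informativeness(annotated))
```

Here three of the four claim-bearing responses are supported, so Groundedness is 75%, while Informativeness is 60% because the same three responses are divided by all five.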
LaMDA models are trained in two stages: pre-training and fine-tuning. The first stage used a dataset of 1.56T words from publicly available dialogue data and public web documents. After the dataset was tokenized into 2.81T tokens, the model was trained to predict each next token in a sentence, given the previous ones. The pre-trained LaMDA model has also been widely used for NLP research at Google, including program synthesis, zero-shot learning, and more.
In the fine-tuning stage, LaMDA is trained on a mix of generative tasks, which produce natural-language responses to given contexts, and classification tasks, which rate the safety and quality of a response. The result is a single multitask model: the LaMDA generator is trained to predict the next token on a dialogue dataset, and the classifiers are trained to predict the safety and quality ratings of a response in context using annotated data.
The test results showed that LaMDA significantly outperforms the pre-trained model in every dimension and at every model size. Quality metrics improve as the number of model parameters increases, with or without fine-tuning. Safety does not improve from scaling the model alone, but it does improve with fine-tuning. Groundedness improves as the model grows, because larger models can memorize more uncommon knowledge. Fine-tuning also lets the model access external sources and effectively shift part of the burden of remembering knowledge to them. Fine-tuning narrows the gap to human-level quality, although the model still performs below human level in Safety and Groundedness.
https://ai.googleblog.com/2022/01/lamda-towards-safe-grounded-and-high.html


2022-01-24 06:52:40 Zingg + TigerGraph combo for deduplication and big data graph analytics
Graph databases with built-in relationship patterns are well suited to record disambiguation and entity resolution. For example, TigerGraph is a powerful graph analytics system. If you supplement it with the open-source ML tool Zingg (https://github.com/zinggAI/zingg), you can find duplicate and ambiguous records even faster.
Imagine that the same person is recorded differently in different systems. This makes it very difficult to analyze that person's behavior, for example to generate a personal marketing offer or to enroll them in a loyalty program. Zingg has built-in blocking mechanisms, so pairwise similarity is calculated only for selected records. This reduces computation time and helps scale to large datasets. You don't have to worry about manually linking or grouping records: the internal entity resolution framework takes care of that. So with Zingg and TigerGraph you can combine simple, scalable entity resolution with subsequent graph analysis.
https://towardsdatascience.com/entity-resolution-with-tigergraph-add-zingg-to-the-mix-95009471ca02
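The idea of computing pairwise similarity only for selected records can be sketched in a few lines of plain Python. This is not Zingg's algorithm: the fixed blocking key (first letter of the last name), the 0.85 similarity threshold, and the sample names are all hypothetical and chosen only to show how blocking reduces the number of comparisons.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer names collected from different systems.
records = [
    "John Smith",
    "Jon Smith",
    "Maria Garcia",
    "Maria Garcya",
    "Peter Brown",
    "Anna Lee",
]


def blocking_key(name):
    """Illustrative blocking key: first letter of the last name."""
    return name.split()[-1][0].lower()


def candidate_pairs(names):
    """Return only the pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def similarity(name_a, name_b):
    """Character-level similarity in the range 0..1."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


all_pairs = list(combinations(records, 2))
blocked_pairs = candidate_pairs(records)
matches = [pair for pair in blocked_pairs if similarity(*pair) >= 0.85]
print(len(all_pairs), len(blocked_pairs), matches)
```

Without blocking, six records require 15 comparisons. With this key only 2 pairs are compared, and the saving grows quickly with the number of records. The trade-off is that a poorly chosen key can place true duplicates in different blocks, where they are never compared.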


2022-01-21 05:48:41 5 YouTube channels for data engineers from popular DS bloggers
• Ken Jee https://www.youtube.com/c/KenJee1/videos - 183 thousand subscribers and about 200 videos about Data Science, big data engineering, ML and sports analytics;
• Karolina Sowinska https://www.youtube.com/c/KarolinaSowinska/videos - 30+ thousand subscribers and almost 60 great videos about AirFlow, AI, ETL and the career of a data engineer;
• Shashank Mishra https://www.youtube.com/c/LearningBridge/video - 40+ thousand subscribers and more than 150 videos about the everyday life of data engineers, DS course reviews, interview recommendations and the personal experience of the author, who has worked at Amazon, McKinsey&Company, PayTm and other large corporations, as well as at startups;
• Seattle Data Guy https://www.youtube.com/c/SeattleDataGuy/videos - almost 20 thousand subscribers and more than 100 videos about the soft and hard skills of a data engineer, life hacks for everyday data collection and aggregation tasks in Python and other tools, SQL best practices, an introduction to R and much more;
• Andreas Kretz https://www.youtube.com/c/andreaskayy/videos - about 27 thousand subscribers and more than 500 videos about vanilla and proprietary Hadoop, Spark, Kafka, AWS services and other cloud platforms, ETL basics, installation details and practical use of different Big Data technologies, and the specifics of the data engineer profession.
