
Big Data Science

Channel address: @bdscience
Categories: Technologies
Language: English
Subscribers: 1.44K
Description from channel

The Big Data Science channel gathers together interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼 — https://t.me/bds_job — channel about Data Science jobs and career
💻 — https://t.me/bdscience_ru — Big Data Science [RU]

Ratings & Reviews

1.67 (3 reviews)

Reviews can be left only by registered users. All reviews are moderated by admins.

5 stars: 0
4 stars: 0
3 stars: 1
2 stars: 0
1 star: 2

The latest messages (12)

2021-12-06 10:45:21 How to read Parquet files
The Apache Parquet format is widely used in Big Data thanks to its column-oriented storage and efficient compression. It lets you quickly read just the columns you need instead of scanning full rows, which saves time. However, not every application can read binary Parquet files, so it is often necessary to convert them to CSV or TXT before opening them in, for example, MS Excel or Power BI, tools frequently used in a DS specialist's work. Here the Python pandas library helps with its built-in read_parquet() function, which reads data from a Parquet file into a dataframe. The dataframe can then be saved to a CSV file using the to_csv() method and opened in almost any office spreadsheet editor.
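A minimal sketch of that conversion (assuming pandas with a Parquet engine such as pyarrow installed; the file name and column names are placeholders):

import pandas as pd

# read only the columns you need straight from the Parquet file
df = pd.read_parquet("data.parquet", columns=["user_id", "amount"])
# save to CSV so Excel, Power BI, or any spreadsheet editor can open it
df.to_csv("data.csv", index=False)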
https://medium.com/@i.m.mak/workaround-for-reading-parquet-files-in-power-bi-e2d060abcb80
138 views · 07:45
2021-12-03 07:47:29 Visual Genome: the most richly annotated dataset
Scientists at Stanford University have collected one of the most richly annotated datasets: over 100,000 images containing almost 5.5 million object descriptions, attributes, and relationships in total. You don't even have to download the dataset: you can fetch the data you need from the RESTful API endpoints with GET requests. Although the latest updates to the dataset date from 2017, it remains an excellent data set for training models on typical ML tasks, from object recognition to modeling relationships with graph algorithms.
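For example, a hedged sketch of one GET request with the requests library; the exact endpoint path below is an assumption based on the API home page linked underneath:

import requests

# fetch metadata for image 1 (endpoint path is an assumption, see the API docs)
resp = requests.get("https://visualgenome.org/api/v0/images/1")
resp.raise_for_status()
print(resp.json())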
https://visualgenome.org/api/v0/api_home.html
5.9K views · 04:47
2021-12-01 06:02:20 XLS-R is a new set of ML models from Facebook AI
The PyTorch team at Facebook has published XLS-R, a set of large-scale models for self-supervised cross-lingual speech representation learning based on wav2vec 2.0. The models were trained on over 400 thousand hours of unlabeled speech in 128 languages. Thanks to fine-tuning, the models show a high level of speech recognition quality and are excellent for translation, understanding, and language identification tasks. The training data was taken from a variety of sources, including audiobooks and court records. The XLS-R neural network models contain more than 2 billion parameters and are multilingual. Moreover, testing has shown that training on several languages at once increases the efficiency of the neural networks. You can download XLS-R from GitHub right now.
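The post links the fairseq release below; as a hedged alternative sketch, the checkpoints are also ported to Hugging Face transformers (the model name refers to the 300M-parameter variant and is an assumption, check the hub for the full list):

import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "facebook/wav2vec2-xls-r-300m"  # assumed HF port of the smallest XLS-R variant
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name)

waveform = np.zeros(16000, dtype=np.float32)  # 1 second of silent 16 kHz audio as a stand-in
inputs = extractor(waveform, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(**inputs).last_hidden_state  # shape: (1, frames, hidden_size)
print(features.shape)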
https://github.com/pytorch/fairseq/tree/main/examples/wav2vec/xlsr
70 views · 03:02
2021-11-30 07:10:23 TOP-10 most interesting DS conferences around the world in December 2021
1. 2 Dec – TechCrunch and iMerit ML DataOps Summit https://techcrunch.com/events/imerit-ml-dataops-summit/
2. 6-7 Dec – Scientific Data Analysis at Scale (SciDAS) Cloud Computing Workshop. Chapel Hill, NC, USA & Virtual https://renci.github.io/sbdh-scidas-workshop/
3. 6-10 Dec – The Analytics Engineering Conference https://coalesce.getdbt.com/
4. 7-10 Dec - IEEE ICDM 2021: 21st IEEE Int. Conference on Data Mining, Auckland, New Zealand https://icdm2021.auckland.ac.nz/
5. 7 Dec - Chief Data & Analytics Officer, Nordics - Think Tanks, by Corinium. Join The Nordic Region's Most Innovative Data & Analytics Leaders. Online https://cdao-nordics.coriniumintelligence.com/
6. 7-8 Dec - MENA Conversational AI Summit 2021, Virtual https://menaconversationalai.com/
7. 8 Dec - Data Points Summit | Manufacturing, Retail & CPG by Grid Dynamics. Online https://datapoints.griddynamics.com/
8. 9 Dec – Augment - Cloud Data Warehousing Summit, free virtual event https://hevodata.com/events/summit/how-to-migrate-setup-and-scale-a-cloud-data-warehouse
9. 14 Dec - Data Reliability Engineering Conference 2021, One full day of reliability standards and innovation, Free Registration, online https://drecon.org/
10. 15-18 Dec - IEEE Int. Conf. on Big Data (IEEE BigData 2021). Orlando, FL, USA https://bigdataieee.org/BigData2021/index.html
86 views · 04:10
2021-11-29 07:48:53 Data engineering for Data Science: SynapseML by Microsoft
Microsoft has adapted Apache Spark for the tasks of data engineers and DS specialists by releasing SynapseML, a framework for creating scalable ML pipelines. This open-source library was previously called MMLSpark. Built on SparkML, SynapseML brings to the Spark ecosystem deep learning and data analysis tools, plus seamless ML pipeline integration with Open Neural Network Exchange (ONNX), LightGBM, Cognitive Services, Vowpal Wabbit, and OpenCV. This allows you to create powerful, highly scalable predictive and analytical models for a variety of data sources.
Notably, SynapseML can work with unlabeled datasets thanks to API methods of ready-made AI services for quickly solving typical ML tasks. SynapseML requires Scala 2.12, Spark 3.0+, and Python 3.6+. The framework lets you write code in any Spark-compatible language: Python, Scala, R, Java, .NET, and C#. Over the HTTP protocol, users can embed any web service into their SparkML models, and the clustered nature of Spark allows ML projects to scale.
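A minimal sketch of a LightGBM pipeline on Spark via SynapseML, assuming its JARs are attached to the session (the package coordinates and version below are assumptions; the toy data is made up):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from synapse.ml.lightgbm import LightGBMClassifier

# assumes the session was started with e.g.
# --packages com.microsoft.azure:synapseml_2.12:0.9.4 (version is an assumption)
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0.0, 1.2, 0), (1.5, 0.3, 1), (0.7, 0.9, 0)],
    ["f1", "f2", "label"],
)
assembled = VectorAssembler(inputCols=["f1", "f2"], outputCol="features").transform(df)
model = LightGBMClassifier(labelCol="label", featuresCol="features").fit(assembled)
model.transform(assembled).select("prediction").show()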
https://microsoft.github.io/SynapseML/
https://github.com/microsoft/SynapseML
150 views · 04:48
2021-11-26 07:45:35 Apache Spark on Google Colab? Installing PySpark in a DS cloud
Apache Spark is one of the most in-demand computational frameworks in the Big Data field. With its cluster architecture and in-memory MapReduce jobs, it quickly processes huge amounts of data. Spark has an API for Python, the most popular DS language, called PySpark, and it automatically parallelizes code written on the local machine across all nodes in the cluster. But what if you don't have a cluster and need to process a lot of data?
Cloud solutions such as Google Colab, where you can install Spark, will help. First you need to install Java, because the framework is written in Scala and runs on the JVM:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
Then download the Spark framework itself, bundled with Hadoop, from the official website of the Apache Software Foundation:
!wget -q https://www-us.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
Extract the downloaded .tgz file:
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
The next step is to install the findspark library, which locates the Spark installation on the system:
!pip install -q findspark
Then set the environment variables in Colab so that PySpark can run there:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"
To find Spark on the system, import findspark and use the findspark.init() method. The findspark.find() method will help you find out where Spark is installed:
import findspark
findspark.init()
findspark.find()
Finally, import SparkSession from pyspark.sql and create an entry point to the framework, optionally specifying the application name in the appName("") configuration parameter:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Spark Application Name") \
    .getOrCreate()
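If everything is set up correctly, a quick hypothetical check like this will print a small table:

# hypothetical sanity check: creates a two-row DataFrame on the new session
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()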
https://medium.com/geekculture/how-to-get-your-spark-installation-right-every-time-on-colab-218d57b6091d
99 views · 04:45
2021-11-24 07:26:59 Visualizing Temporal Changes of Categorical Data: PyCatFlow vs RankFlow
Sometimes a Data Scientist needs to visualize ranked lists over time, such as changes in search results for queries on Google or YouTube. For this you can use RankFlow, a useful tool with a minimalistic UI and a rather cumbersome data preparation process. RankFlow compares ranked lists over time. It requires the input tabular data to be organized so that each column represents a ranked list. Each ranked list can be supplemented with weights, adding another level of information to the data: for YouTube search results, for example, you can take views, upvotes, or the upvote ratio. Each column of the data table is rendered as a stack of nodes, ordered according to rank in the given dataset, and identical nodes are connected between columns. This highlights continuity and change in the data, enabling pattern analysis.
Building a RankFlow visualization from such data requires modifying the dataset: each column must contain a ranked list, even when the items (in the linked article's example, permissions per API version) are not ordered by any relevance metric. Ordering a RankFlow chart is therefore a design decision, meaning items can be sorted alphabetically, by frequency in the dataset, or based on additional data.
In practice, adapting the data to the required RankFlow structure is quite tedious. To speed up pre- and post-processing of the charts, you can write your own Python script that processes the XML data in the SVG file generated by RankFlow. An alternative is PyCatFlow, a visualization tool similar to RankFlow that works well for temporal data without explicit ranking information but with additional categorical data. PyCatFlow is an open-source Python package that can be downloaded freely from GitHub.
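A hedged sketch based on the PyCatFlow README; the CSV layout and the keyword arguments below are assumptions, check the repo linked underneath for the exact API:

import pycatflow as pcf

# each CSV column is a point in time; each cell holds an item with a category
data = pcf.read_file("sample_data.csv", columns="column", nodes="items",
                     categories="category")
viz = pcf.visualize(data, spacing=20, width=800, maxValue=20)
viz.saveSvg("flow.svg")  # the drawing object can be exported as SVG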
https://medium.com/@bumatic/pycatflow-visualizing-categorical-data-over-time-b344102bcce2
https://github.com/bumatic/PyCatFlow
89 views · 04:26
• Non-negative Matrix Factorization (NMF) - an alternative approach to decomposition that assumes the data and the components are non-negative. NMF is an unsupervised linear dimensionality reduction technique: the original data (the feature matrix) is split into several matrices (i.e., factorized) that represent the hidden relationships between observations and their features. NMF can be used instead of PCA when the data matrix contains no negative values. NMF does not provide an explained variance the way PCA and other methods do, so the best way to find the optimal n_components is to try a range of values.
• Truncated Singular Value Decomposition (TSVD) - similar to PCA: the method performs linear dimensionality reduction by means of a truncated singular value decomposition. Unlike PCA, this estimator does not center the data before computing the decomposition, so it can work efficiently with sparse matrices.
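A minimal sketch of both estimators on a small non-negative sparse matrix (the toy documents are made up for illustration):

from sklearn.decomposition import NMF, TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["big data science", "data pipelines at scale", "machine learning on spark"]
X = TfidfVectorizer().fit_transform(docs)  # sparse, non-negative TF-IDF matrix

nmf = NMF(n_components=2, init="nndsvd", random_state=0)
W = nmf.fit_transform(X)  # observation-to-component matrix
print(W.shape)  # (3, 2)

svd = TruncatedSVD(n_components=2, random_state=0)  # no centering, sparse-friendly
print(svd.fit_transform(X).shape)  # (3, 2)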
https://medium.com/@deepak.engg.phd/dimensionality-reduction-with-scikit-learn-ee5d2b69225b
128 views · 05:52
2021-11-22 08:52:57 Principal Component Analysis: 7 Methods for Dimension Reduction in Scikit-Learn
One of the main problems of machine learning on large datasets is the huge size of the feature vectors, so dimensionality reduction methods that cut the number of variables are very relevant. The best-known such method is Principal Component Analysis (PCA), whose essence is to reduce the dimensionality of the dataset while retaining as much "variability", i.e. statistical information, as possible.
PCA is a statistical method for converting high-dimensional data into low-dimensional data by choosing the most important features, those that capture as much information about the dataset as possible. Features are selected based on the variance they explain in the output: the feature that explains the most variance is the first principal component, the feature responsible for the second-largest variance is the second principal component, and so on. Importantly, the principal components are uncorrelated with each other. Besides speeding up ML algorithms, PCA lets you visualize data by projecting it into a lower dimension for display in 2D or 3D space.
The popular Python library Scikit-learn includes the sklearn.decomposition.PCA module, implemented as a transformer object that learns n components in its fit() method and can then project new data onto these components. Using the PCA method in the Scikit-Learn library takes 2 steps:
1. Initialize the PCA class by passing the required number of components to the constructor;
2. Call fit() and then transform(), passing in the feature set; transform() returns the specified number of principal components.
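A minimal sketch of these two steps on a toy dataset (iris is used purely for illustration):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)    # step 1: pass the desired number of components
X_2d = pca.fit_transform(X)  # step 2: fit, then transform the feature set
print(X_2d.shape)            # (150, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component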
Scikit-learn supports several variations of the PCA method (a short sketch of two of them follows the list):
• Kernel Principal Component Analysis (KPCA) - a nonlinear dimensionality reduction method using kernels. Kernel PCA was developed to help classify data whose decision boundaries are described by a non-linear function. The idea is to move into a higher-dimensional space in which the decision boundary becomes linear. The sklearn.decomposition module offers different kernels: linear, polynomial (poly), Gaussian radial basis function (rbf), sigmoid, etc. The default is linear, which is suitable if the data is linearly separable.
• Sparse PCA - a sparse version of PCA whose purpose is to extract a set of sparse components that best reconstructs the data. Typically, PCA-extracted components have extremely dense expansions, i.e. nonzero coefficients on all the original variables when written as linear combinations, which makes the results hard to interpret. In practice, real principal components are often more naturally represented as sparse vectors; in face recognition, for example, they can correspond to parts of faces.
• Incremental Principal Component Analysis (IPCA) - an incremental PCA method for when the dataset to be decomposed is too large to fit in memory. IPCA constructs a low-rank approximation of the input data using an amount of memory that is independent of the number of input samples. It still depends on the number of input features, but changing the batch size lets you control memory usage.
• Fast Independent Component Analysis (FastICA) - fast ICA is used to estimate sources from noisy measurements and reconstruct them, since classic PCA does not work for non-Gaussian processes.
• Linear Discriminant Analysis (LDA) - like classical PCA, a linear transformation method. But PCA is unsupervised, i.e. it ignores class labels, while LDA is a supervised machine learning method used to distinguish between two or more classes or groups. LDA is suitable for supervised dimensionality reduction, projecting the input data into a linear subspace along the directions that maximize separation between classes. The dimensionality of the output is necessarily less than the number of classes. In scikit-learn, LDA is implemented in LinearDiscriminantAnalysis, where the n_components parameter specifies the number of components to return.
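As promised above, a minimal sketch of two of these variants, KernelPCA and LDA, on the same toy dataset:

from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
X_kpca = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)           # nonlinear, unsupervised
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)      # supervised, <= n_classes - 1
print(X_kpca.shape, X_lda.shape)  # (150, 2) (150, 2)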
119 views · 05:52
2021-11-19 06:05:45 Over 50 New Graph Algorithms in TigerGraph: Fall 2021 Release
TigerGraph is a popular, fast, and scalable graph database with massively parallel processing and ACID transaction support, positioned by its developers as the fastest and most scalable graph platform. Thanks to efficient data compression and the MPP architecture, it can analyze huge amounts of information in real time, and its internal query language, GSQL, is very similar to standard SQL, familiar to every analyst.
The October 2021 release includes 50 new algorithms, for example the graph embeddings node2vec and FastRP, similarity algorithms ("Nearest Neighbor Approximation", "Euclidean Similarity", "Overlap Similarity", and "Pearson Similarity"), structural similarity algorithms for predicting topological relationships, and random walk algorithms. In the first half of 2022, the developers promise to add neural networks and other ML methods for building analytical pipelines on graphs. Although TigerGraph is positioned as a powerful enterprise solution, the system's source code is open and available for free download from GitHub.
https://www.tigergraph.com/blogs/about-tigergraph/graph-data-science-library/
https://github.com/tigergraph
87 views · 03:05