tags: Data Analysis, Software Tools, Applications, Python, Data Processing, Missing Values, Machine Learning
In day-to-day data analysis, missing values are one of the biggest headaches.
For a variety of reasons, many real-world datasets contain missing data, typically encoded as blanks, NaNs, or other placeholders. Such datasets are incompatible with scikit-learn estimators, which assume that every element of the input array is numeric and that every value carries meaning. The same holds for most other machine learning and deep learning models: every input variable must hold a concrete value.
This article walks through routine ways to visualize and handle missing values in everyday work.
The examples use data from a public competition case; the download address is the Machine Learning Repository.
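Before any of the steps below, the various placeholders need to be turned into real NaN values so that pandas can recognize them as missing. The following is a minimal sketch of that idea; the column names, the sample values, and the placeholder list are all made up for illustration and should be adapted to your own data source.

```python
import numpy as np
import pandas as pd

# Hypothetical raw data: missing entries show up as "", " ", "?" and "N/A".
raw = pd.DataFrame({
    "price": ["200", " ", "184.5", "?", "270"],
    "area": ["105", "80", "", "70.8", "N/A"],
})

# Assumed placeholder list; extend it with whatever your source uses.
placeholders = ["", " ", "?", "N/A"]

# Replace every placeholder with NaN, then convert each column to numbers.
cleaned = raw.replace(placeholders, np.nan).apply(pd.to_numeric)

print(cleaned)
print(cleaned.dtypes)
```

When reading from a CSV file, `pd.read_csv` also accepts an `na_values` argument that can do the same replacement at load time.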

Missing data comes in two forms. Missing records, also called data record loss, means that entire rows were lost. Missing fields (columns) means that, for various reasons, some values within a column are absent.
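To see how much is missing in each field, and which records are affected, a quick count with pandas is usually enough. This is a small sketch on a made-up table; the column names and values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical table with gaps in two fields.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [200.0, np.nan, 184.5, 116.0],
    "bedroom": [3, 2, np.nan, np.nan],
})

missing_per_column = df.isna().sum()        # number of missing values in each field
missing_share = df.isna().mean()            # fraction of missing values in each field
rows_with_gaps = df[df.isna().any(axis=1)]  # records with at least one missing field

print(missing_per_column)
print(missing_share)
print(rows_with_gaps)
```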
# Install the third-party library for visualizing missing values
pip install missingno
# Import missingno
import missingno as msno
# Matrix plot of the loaded data (data is the DataFrame read from the dataset);
# blank gaps mark the missing values in each column
msno.matrix(data, labels=True)

# Bar chart: shows how many values are present in each column
msno.bar(data, labels=True)

# Heatmap: shows how strongly missingness in one field is correlated with missingness in another
msno.heatmap(data, labels=True)

# Dendrogram: clusters the fields by how similarly their values are missing
msno.dendrogram(recent)
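To make the idea of "correlation between missing fields" concrete, the sketch below computes a nullity correlation by hand with pandas only. It is meant as an approximation of the kind of number the heatmap displays, not as missingno's exact implementation, and the toy DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "b" is missing in the same rows as "a",
# while "c" is missing exactly where "a" is present.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, 6.0],
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, np.nan],
})

# 1 where a value is missing, 0 where it is present.
nullity = df.isna().astype(int)

# Pairwise correlation of the missingness indicators.
nullity_corr = nullity.corr()
print(nullity_corr)
```

A value near 1 means two fields tend to be missing together; a value near -1 means that when one is missing the other tends to be present.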
# Delete every row that contains a missing value, modifying the original DataFrame in place
recent.dropna(inplace=True)
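Dropping every row with any gap can throw away a lot of data. `dropna` has a few parameters that make the deletion more selective; the sketch below shows them on a made-up table (the column names and values are assumptions for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical table: row 0 is complete, rows 1 and 2 each miss one field,
# and row 3 is entirely empty.
df = pd.DataFrame({
    "price": [200.0, np.nan, 184.5, np.nan],
    "area": [105.0, 80.0, np.nan, np.nan],
    "bedroom": [3.0, 2.0, 2.0, np.nan],
})

drop_any = df.dropna()                     # keep only fully populated rows
drop_all_empty = df.dropna(how="all")      # drop rows where every field is missing
drop_by_key = df.dropna(subset=["price"])  # drop rows missing a required field
keep_mostly_full = df.dropna(thresh=2)     # keep rows with at least 2 non-missing values

print(len(drop_any), len(drop_all_empty), len(drop_by_key), len(keep_mostly_full))
```

None of these calls uses `inplace=True`, so `df` itself is left unchanged and the variants can be compared side by side.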
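# Fill missing values in the column named 'col' with the mean
# (fillna returns a new Series, so assign the result back)
recent['col'] = recent['col'].fillna(recent['col'].mean())
# Alternative: fill missing values with the median
recent['col'] = recent['col'].fillna(recent['col'].median())
# Alternative: fill missing values with the mode
# (mode() returns a Series, so take its first entry)
recent['col'] = recent['col'].fillna(recent['col'].mode().iloc[0])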
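The same fills can also be done with scikit-learn's `SimpleImputer`, which is convenient when the imputation has to be learned on training data and reapplied to new data. This is a sketch under assumptions: the DataFrame, its column names, and the choice of strategy per column are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical table with one gap in each column.
df = pd.DataFrame({
    "price": [200.0, np.nan, 184.0, 116.0],
    "bedroom": [3.0, 2.0, np.nan, 2.0],
})

mean_imputer = SimpleImputer(strategy="mean")           # fill with the column mean
mode_imputer = SimpleImputer(strategy="most_frequent")  # fill with the column mode

filled = df.copy()
filled[["price"]] = mean_imputer.fit_transform(df[["price"]])
filled[["bedroom"]] = mode_imputer.fit_transform(df[["bedroom"]])

print(filled)
```

After fitting, `mean_imputer.transform(...)` can be called on new data to reuse the mean learned here instead of recomputing it.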
# Model-based imputation: treat the field with missing values as the target variable
# and the other informative fields as predictors, for example in a simple regression
# Numeric prediction (regression)
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
# y = 1 * x_0 + 2 * x_1 + 3
y = np.dot(X, np.array([1, 2])) + 3
reg = LinearRegression().fit(X, y)
reg.predict(np.array([[3, 5]]))
# Category prediction with an SVM classifier
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
X, y = make_classification(n_features=4, random_state=0)
clf = make_pipeline(StandardScaler(),
                    LinearSVC(random_state=0, tol=1e-5))
clf.fit(X, y)
clf.predict([[0, 0, 0, 0]])
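The two snippets above show the models on their own. To connect them back to missing values, here is a minimal sketch of regression-based imputation: fit on the rows where the field is known, then predict it for the rows where it is missing. The table, the column names, and the exact linear relationship are all made up so that the result is easy to check by hand.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data where price = 2 * area + 10 * bedroom exactly.
df = pd.DataFrame({
    "area": [50.0, 60.0, 80.0, 100.0, 70.0, 90.0],
    "bedroom": [1.0, 2.0, 2.0, 3.0, 2.0, 3.0],
})
df["price"] = 2 * df["area"] + 10 * df["bedroom"]
df.loc[[4, 5], "price"] = np.nan  # pretend these two prices were never recorded

features = ["area", "bedroom"]
known = df[df["price"].notna()]    # rows usable for training
unknown = df[df["price"].isna()]   # rows whose price must be imputed

# Fit on the known rows, then write the predictions into the gaps.
model = LinearRegression().fit(known[features], known["price"])
df.loc[unknown.index, "price"] = model.predict(unknown[features])

print(df)
```

For a categorical field, the same pattern should work with a classifier such as the `LinearSVC` pipeline in place of `LinearRegression`.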
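# Convert string categories to integers with a dictionary mapping
# Option 1: write the mapping by hand
size_mapping = {'str1': 1, 'str2': 2, 'str3': 3}
# Option 2: build the mapping from the unique labels in the column
import numpy as np
class_mapping = {label: idx for idx, label in enumerate(np.unique(recent['classlabel']))}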
recent['classlabel'] = recent['classlabel'].map(class_mapping)
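As a self-contained illustration of the mapping above, the sketch below encodes a made-up label column and then inverts the dictionary to recover the original strings. The labels are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical label column.
df = pd.DataFrame({"classlabel": ["cat", "dog", "cat", "bird"]})

# np.unique returns the distinct labels in sorted order: bird, cat, dog.
class_mapping = {label: idx for idx, label in enumerate(np.unique(df["classlabel"]))}
df["classlabel"] = df["classlabel"].map(class_mapping)

# Invert the dictionary to translate the integers back to strings later.
inverse_mapping = {idx: label for label, idx in class_mapping.items()}
restored = df["classlabel"].map(inverse_mapping)

print(class_mapping)
print(df["classlabel"].tolist())
print(restored.tolist())
```

Note that `map` returns NaN for any value that is not a key of the dictionary, so an incomplete mapping quietly introduces new missing values.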
# Reduce dimensionality with PCA; the transformed features can no longer be interpreted as the original fields
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2)
pca.fit(X)
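Fitting alone does not change the data. The sketch below reuses the same toy array, keeps only one component (an arbitrary choice for illustration), and shows how to obtain the reduced data and how much variance it retains.

```python
import numpy as np
from sklearn.decomposition import PCA

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])

# Keep a single component; n_components=1 is chosen only for this example.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)               # project the data onto the component
X_restored = pca.inverse_transform(X_reduced)  # map back to the original 2-D space

print(X_reduced.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by the component
```

The columns of `X_reduced` are linear combinations of the original fields, which is why they lose their direct interpretation.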