How to sample data in Python, NumPy, Pandas, and Scikit-learn
Published on Aug. 22, 2023, 12:17 p.m.
Python's standard library, NumPy, Pandas, and Scikit-learn each provide their own functions and methods for drawing random samples from data. Here are some examples:
- Sampling in Python
To sample data in Python, you can use the random.sample() function provided by the random module:
import random
data = [1, 2, 3, 4, 5]
sampled_data = random.sample(data, k=3)
print("Sampled data:", sampled_data)
This code defines a list of data and uses the random.sample() function to draw k=3 elements from the list without replacement, so no position in the list is picked twice. The sampled data is then printed to the console.
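If you need repeatable results, or sampling with replacement, the standard library covers both. The following is a minimal sketch: the seed value 42 and the variable names are arbitrary choices for illustration, not part of the example above.

```python
import random

data = [1, 2, 3, 4, 5]

# A dedicated generator with a fixed seed (42 is an arbitrary choice)
# produces the same draws on every run without touching the global state.
rng = random.Random(42)

# Without replacement: each selected element comes from a distinct position.
without_replacement = rng.sample(data, k=3)

# With replacement: the same element may be drawn more than once,
# so k is allowed to exceed len(data).
with_replacement = rng.choices(data, k=8)

print("Without replacement:", without_replacement)
print("With replacement:", with_replacement)
```

random.choices() also accepts a weights argument if some elements should be drawn more often than others.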
- Sampling in NumPy
To sample data in NumPy, you can use the np.random.choice() function or the np.random.shuffle() function provided by the numpy.random module:
import numpy as np
data = np.array([1, 2, 3, 4, 5])
sampled_data = np.random.choice(data, size=3, replace=False)
print("Sampled data:", sampled_data)
This code creates a NumPy array and uses the np.random.choice() function to randomly sample size=3 elements from the array without replacement. The sampled data is then printed to the console.
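NumPy also offers a Generator interface, created with np.random.default_rng(), which has its own choice() method. The sketch below assumes a seed of 0 and a made-up set of weights purely for illustration.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])

# Seeded generator (the seed 0 is an arbitrary choice) for repeatable draws.
rng = np.random.default_rng(seed=0)

# Uniform sampling without replacement, as in the example above.
sampled = rng.choice(data, size=3, replace=False)

# Weighted sampling with replacement. The weights are illustrative
# and must sum to 1.
weights = np.array([0.1, 0.1, 0.2, 0.3, 0.3])
weighted = rng.choice(data, size=10, replace=True, p=weights)

print("Uniform sample:", sampled)
print("Weighted sample:", weighted)
```

Passing the generator around explicitly keeps the randomness local to your code instead of relying on NumPy's global random state.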
Alternatively, you can use the np.random.shuffle() function to shuffle the array in place and then select the first k elements as the sampled data:
import numpy as np
data = np.array([1, 2, 3, 4, 5])
np.random.shuffle(data)
sampled_data = data[:3]
print("Sampled data:", sampled_data)
This code creates a NumPy array and uses the np.random.shuffle() function to shuffle the array in place. The first k=3 elements of the shuffled array are then selected as the sampled data, which is printed to the console.
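Because np.random.shuffle() reorders the array itself, the original order is lost. If you want to keep it, one option is permutation(), which returns a shuffled copy. This is a small sketch with an arbitrary seed of 0.

```python
import numpy as np

data = np.array([1, 2, 3, 4, 5])
rng = np.random.default_rng(seed=0)  # arbitrary seed for repeatability

# permutation() returns a shuffled copy and leaves `data` unchanged.
shuffled = rng.permutation(data)
sampled_data = shuffled[:3]

print("Original data:", data)
print("Sampled data:", sampled_data)
```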
- Sampling in Pandas
To sample data in Pandas, you can use the sample() method provided by the Pandas DataFrame or Series object. Here is an example:
import pandas as pd
data = pd.DataFrame({'A': [1, 2, 3, 4, 5]})
sampled_data = data.sample(n=3)
# Print the DataFrame on its own line so the column header stays aligned
print("Sampled data:")
print(sampled_data)
This code creates a Pandas DataFrame object and uses the sample() method to sample n=3 rows from the dataframe. By default, sample() samples rows without replacement, so each row appears at most once in the result.
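The sample() method takes a few other useful parameters: frac to request a fraction of the rows instead of a fixed count, replace to allow repeated rows, and random_state to make the draw repeatable. The values used below are illustrative only.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, 3, 4, 5]})

# Sample 40% of the rows (2 of 5) without replacement.
# random_state=0 is an arbitrary seed that makes the result repeatable.
fraction_sample = data.sample(frac=0.4, random_state=0)

# Bootstrap-style sample: as many rows as the original, drawn with
# replacement, so some rows may repeat and others may be missing.
bootstrap_sample = data.sample(n=5, replace=True, random_state=0)

print(fraction_sample)
print(bootstrap_sample)
```

The sampled rows keep their original index labels; call reset_index(drop=True) on the result if you want a fresh 0-based index.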
- Sampling in Scikit-learn
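To sample data in Scikit-learn, you can use the helpers in the sklearn.model_selection module for splitting data and those in the sklearn.utils module, such as resample() and shuffle(), for resampling. Here is an example of how to use the train_test_split() function from sklearn.model_selection to split a dataset into training and testing sets: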
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)
print("Number of training examples:", len(X_train))
print("Number of testing examples:", len(X_test))
This code uses the Iris dataset provided by Scikit-learn and the train_test_split() function to split the dataset into a training set and a test set, with 80% of the data used for training and 20% used for testing. The sizes of the resulting training and testing sets (120 and 30 of the 150 samples) are then printed to the console.
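Two refinements are often useful here: passing stratify so that each class keeps its share in both sets, and using resample() from sklearn.utils to draw a bootstrap sample. The sketch below assumes random_state=0 as an arbitrary seed, and the variable names X_boot and y_boot are made up for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

iris = load_iris()

# stratify=iris.target keeps the class proportions the same in both sets;
# random_state=0 (arbitrary) makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0, stratify=iris.target
)
print("Test set class counts:", Counter(y_test.tolist()))

# Bootstrap sample of the training set: same size, drawn with replacement.
# Features and labels are resampled together so the rows stay aligned.
X_boot, y_boot = resample(
    X_train, y_train, replace=True, n_samples=len(X_train), random_state=0
)
print("Bootstrap sample size:", len(X_boot))
```

Since Iris has 50 samples per class, a stratified 20% test set should contain the three classes in equal numbers.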