Scikit-learn

Scikit-learn Overview

Scikit-learn is a widely used open-source Python library for machine learning, providing simple and efficient tools for data analysis and modeling. It is built on top of popular libraries like NumPy, SciPy, and Matplotlib, and offers a wide range of algorithms for supervised and unsupervised learning.

Key Features

Preprocessing: Tools for scaling, normalization, encoding categorical variables, and handling missing data.
Model Selection: Cross-validation and hyperparameter search (such as grid search) for comparing and tuning models.
Supervised Learning: Algorithms like linear regression, support vector machines (SVM), decision trees, and random forests.
Unsupervised Learning: Algorithms for clustering (K-means, DBSCAN) and dimensionality reduction (PCA, t-SNE).
Metrics: A variety of metrics for evaluating model performance, such as accuracy, precision, recall, and AUC-ROC.
Model Persistence: Allows saving trained models for later use via joblib or pickle.
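
Several of the features above can be combined in one short script. The following is a minimal sketch, not a recommended configuration: the choice of an SVM, the candidate values for C, and the file name iris_svm.joblib are illustrative assumptions. It scales the features, tunes one hyperparameter with cross-validated grid search, and saves and reloads the best model with joblib.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Preprocessing and the model live in one pipeline so that the scaler
# is re-fit on the training portion of each cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC()),
])

# Model selection: 5-fold cross-validated grid search.
# The candidate values of C are illustrative, not tuned recommendations.
param_grid = {"svm__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(f"Held-out accuracy: {search.score(X_test, y_test):.2f}")

# Model persistence: save the best fitted pipeline and load it back.
# The file name is hypothetical; a temporary directory keeps the example tidy.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "iris_svm.joblib")
    joblib.dump(search.best_estimator_, model_path)
    restored = joblib.load(model_path)

agrees = (restored.predict(X_test) == search.predict(X_test)).all()
print("Restored model gives the same predictions:", agrees)
```

Keeping the scaler inside the pipeline means the test data never influences the scaling parameters, which is the main reason to prefer a pipeline over scaling the full dataset up front.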

Scikit-learn is highly accessible for both beginners and professionals, with well-documented APIs and good integration with other data science tools like Pandas.
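
As a rough illustration of the Pandas integration, the sketch below assumes that pandas is installed and that the installed Scikit-learn version supports the as_frame option of load_iris(). It loads the data as a DataFrame, fits a model on it directly, and pairs each feature importance with its column name.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# as_frame=True returns the features as a pandas DataFrame
# and the target as a pandas Series.
iris = load_iris(as_frame=True)
features = iris.data
labels = iris.target

print(features.head())

# Estimators accept DataFrames directly; no manual conversion is needed.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(features, labels)

# Feature importances are returned in the same order as the DataFrame columns.
for name, importance in zip(features.columns, clf.feature_importances_):
    print(f"{name}: {importance:.3f}")
```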

from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Load the Iris dataset
data = load_iris()
X = data.data  # Features
y = data.target  # Target labels

# Split the dataset into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy
print(f"Accuracy: {accuracy:.2f}")

# Print a comparison of actual vs predicted values
print("\nActual vs Predicted:")
for actual, predicted in zip(y_test, y_pred):
    print(f"Actual: {actual}, Predicted: {predicted}")

# Print some sample data points from the test set to explain the results
print("\nSample Data Points (Features) from the Test Set:")
for i in range(5):
    print(f"Test Sample {i+1}: {X_test[i]}, Predicted Label: {y_pred[i]}, Actual Label: {y_test[i]}")

#------------------------------------------------------------------------------------#
# Output
# Accuracy: 1.00
#
# Actual vs Predicted:
# Actual: 1, Predicted: 1
# Actual: 0, Predicted: 0
# Actual: 2, Predicted: 2
# Actual: 1, Predicted: 1
# Actual: 1, Predicted: 1
# Actual: 0, Predicted: 0
# Actual: 1, Predicted: 1
# Actual: 2, Predicted: 2
# Actual: 1, Predicted: 1
# Actual: 1, Predicted: 1
# Actual: 2, Predicted: 2
# Actual: 0, Predicted: 0
# Actual: 0, Predicted: 0
# Actual: 0, Predicted: 0
# Actual: 0, Predicted: 0
# Actual: 1, Predicted: 1
# Actual: 2, Predicted: 2
# Actual: 1, Predicted: 1
# Actual: 1, Predicted: 1
# Actual: 2, Predicted: 2
# Actual: 0, Predicted: 0
# Actual: 2, Predicted: 2
# Actual: 0, Predicted: 0
# Actual: 2, Predicted: 2
# Actual: 2, Predicted: 2
# Actual: 2, Predicted: 2
# Actual: 2, Predicted: 2
# Actual: 2, Predicted: 2
# Actual: 0, Predicted: 0
# Actual: 0, Predicted: 0
#
# Sample Data Points (Features) from the Test Set:
# Test Sample 1: [6.1 2.8 4.7 1.2], Predicted Label: 1, Actual Label: 1
# Test Sample 2: [5.7 3.8 1.7 0.3], Predicted Label: 0, Actual Label: 0
# Test Sample 3: [7.7 2.6 6.9 2.3], Predicted Label: 2, Actual Label: 2
# Test Sample 4: [6. 2.9 4.5 1.5], Predicted Label: 1, Actual Label: 1
# Test Sample 5: [6.8 2.8 4.8 1.4], Predicted Label: 1, Actual Label: 1
#------------------------------------------------------------------------------------#

Scikit-learn Code Summary

About the Iris Dataset

The code loads the Iris dataset with the load_iris() function provided by Scikit-learn. Iris is one of several built-in toy datasets bundled with the library, so you don't need to download or load any external files.
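
To see what load_iris() actually returns, a short inspection script such as the sketch below can help. It prints the shape of the feature matrix, the feature and class names, and the number of samples per class, which also shows how the numeric labels 0, 1, and 2 in the output map to species names.

```python
import numpy as np
from sklearn.datasets import load_iris

data = load_iris()

print("Feature matrix shape:", data.data.shape)
print("Feature names:", data.feature_names)
print("Class names:", list(data.target_names))

# Count how many samples belong to each numeric class label.
labels, counts = np.unique(data.target, return_counts=True)
for label, count in zip(labels, counts):
    print(f"Label {label} ({data.target_names[label]}): {count} samples")
```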

Overview of How it Works

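The script loads the Iris data with load_iris(), holds out 20% of the samples as a test set with train_test_split(), and fits a RandomForestClassifier with 100 trees on the remaining 80%. It then predicts labels for the test set and prints the accuracy alongside the actual and predicted labels. No external setup is needed to use the dataset; it is available out of the box in Scikit-learn.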

Scikit-learn Comparison

An accuracy of 1.00 means the predicted label matches the actual label for every one of the 30 test samples, so each pair in the comparison output shows identical values.
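
The relationship between accuracy and the label comparison can be checked directly. The sketch below repeats the same split and model settings as the main example, counts the matching predictions, and prints a confusion matrix. The exact numbers may differ across Scikit-learn versions, so no output is shown here.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Accuracy is the fraction of test samples whose prediction equals the true label.
matches = y_pred == y_test
accuracy = accuracy_score(y_test, y_pred)
print(f"{matches.sum()} of {len(y_test)} predictions match (accuracy {accuracy:.2f})")

# Rows are actual classes, columns are predicted classes.
# Any count off the diagonal is a misclassification.
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

When the accuracy is below 1.00, the confusion matrix shows which classes were mistaken for one another, which the flat list of actual and predicted pairs makes harder to see.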