Igor Kołodziej - Data Science

Projects

Real-Time Finance - Big Data Pipeline

Spark · Hive · HBase · HDFS · Docker

Two-person team project. Stack: NiFi/Kafka → HDFS → Spark → Hive → HBase. My work was Spark/Hive/HBase (processing, analytics, serving).

Spark ETL for curated datasets + analytics (OHLC/volume, returns, rolling BTC–USD/PLN correlation).
Hive external tables/views and sanity checks for analytics outputs.
Serving layer: loading analytical facts into HBase for fast key-based reads.

GitHub

NMAR - R package for estimation under nonignorable nonresponse

CRAN · R · CI/tests · Docs · Simulation studies

CRAN R package: unified nmar() API and method comparisons in simulation studies.

Implemented estimators from the literature behind a unified nmar() API.
Reproducible simulation studies for comparisons + validation.
Packaged for CRAN with documentation and CI/testing.

CRAN · Docs · GitHub

Mamut - AutoML toolkit for tabular classification

Python · PyPI · scikit-learn · Optuna · Ensembles · Reports

AutoML workflow for tabular classification (binary + multi-class): preprocessing, HPO, model comparison, and ensemble search, with reports and plots.

Preprocessing pipeline (imputation, scaling, encoding, skew correction, outliers; optional PCA/feature selection).
Model search across common classifiers (LogReg, RF, SVC, XGBoost, MLP, NB, KNN) with Bayesian or grid search.
Dynamic ensemble search with majority voting (hard/soft) + HTML report, notebook plots, and optional SHAP.

PyPI · Docs · GitHub

QuantumRAG - Grover-inspired top-k selection for RAG

RAG · FAISS · Embeddings · Streamlit · Qiskit · Benchmarking

Dense retrieval (FAISS) + GroverTopK selector (Qiskit Aer), with an evaluation harness on SQuAD 1.1 and end-to-end latency profiling.

Multi-model comparisons via Hugging Face Inference (llama-3-8b, mixtral-8x7b, phi-3.5), contrasting answers generated with and without retrieved context.
Benchmark artifacts exported to CSV/JSON + plots; Streamlit demo for context inspection and model comparison.
Key finding: Grover-based and classical selection give nearly identical results (≈99% context agreement) at ~30 ms overhead; top-3 contexts outperform both top-1 and no-context answers.

GitHub

Other projects

DermNet - DINOv2 embeddings for clustering (GitHub)

DoomRL - PPO/A2C agents for ViZDoom (GitHub)

Research

Research Software Engineer

Poznań University of Economics and Business · project: “Towards census-like statistics for foreign-born populations - quality, data integration and estimation” (2020/39/B/HS4/00941)

03/2025 - Present

Implemented NMAR estimators from the literature behind a unified API.
Built reproducible simulation studies to compare methods and validate behavior.
Maintained engineering quality: documentation/vignettes, CI, tests.
Talks: uRos (Romanian NSI, 2025) and ElementsX (AGH, 2025).

Leadership

President, Data Science Club (WUT)

2024–2025

Organized talks/workshops; hosted guests from Google, ING, Allegro.
Worked with a student team on outreach and events.

Co-organizer, ensembleAI hackathon

2024, 2025 (preparing 2026)

Sponsors, logistics, venue coordination, on-site operations.

Capitalize (student venture, Enactus WUT)

Demo app shipped to Google Play (testing track)

Backend features/APIs (FastAPI).
Python scripts for basic telemetry analysis from Amplitude exports.

Awards

2nd place - Enactus Poland National Competition (Capitalize), 2023
Finalist - Consult IT business/technology hackathon (SGH Warsaw School of Economics), 2023
Laureate - AGH “Diamond Index” Olympiad in Physics, 2022
Finalist - National Technical Knowledge Olympiad (OWT), 2022

Skills

Data / Systems

SQL, Spark
Hive, HDFS, HBase
NiFi, Docker

Software Engineering

Python, Java, R
Git, Linux, CI/testing (GitHub Actions)
Backend: FastAPI, Spring Boot

ML / Evaluation

PyTorch, scikit-learn, Transformers
Optuna, NumPy, Pandas

Languages

Polish - native
English - C2 (CAE Grade A)
German - basic

Contact

Open to junior and intern roles in data engineering, data-heavy backend, and ML systems.

Email · CV (PDF) · GitHub · LinkedIn