Real-Time Finance - Big Data Pipeline
Spark, Hive, HBase, HDFS, Docker
Two-person team project. Stack: NiFi/Kafka → HDFS → Spark → Hive → HBase.
My work was Spark/Hive/HBase (processing, analytics, serving).
- Spark ETL for curated datasets + analytics (OHLC/volume, returns, rolling BTC–USD/PLN correlation).
- Hive external tables/views and sanity checks for analytics outputs.
- Serving layer: loading analytical facts into HBase for fast key-based reads.
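The rolling correlation named in the analytics bullet can be illustrated with a small sketch. This is plain Python for readability, not the project's Spark code; the function names, the window size, and the sample values are hypothetical.

```python
from math import sqrt


def simple_returns(prices):
    """Period-over-period simple returns: p[t] / p[t-1] - 1."""
    return [curr / prev - 1.0 for prev, curr in zip(prices, prices[1:])]


def pearson(xs, ys):
    """Pearson correlation of two equal-length series, or None if undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant window has no defined correlation
    return cov / sqrt(var_x * var_y)


def rolling_correlation(xs, ys, window):
    """Correlation over each trailing window; one value per full window."""
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    if window < 2:
        raise ValueError("window must be at least 2")
    return [
        pearson(xs[end - window:end], ys[end - window:end])
        for end in range(window, len(xs) + 1)
    ]


# Hypothetical price series, for illustration only.
btc_returns = simple_returns([100.0, 110.0, 99.0, 108.9, 120.0])
fx_returns = simple_returns([4.00, 3.90, 4.05, 3.95, 3.80])
correlations = rolling_correlation(btc_returns, fx_returns, window=3)
```

In a Spark job the same idea would typically be expressed with window functions over a timestamp-ordered frame; the sketch only shows the arithmetic.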
Mamut - AutoML toolkit for tabular classification
Python, PyPI, scikit-learn, Optuna, Ensembles, Reports
AutoML workflow for tabular classification (binary + multi-class): preprocessing, HPO, model comparison, and ensemble search,
with reports and plots.
- Preprocessing pipeline (imputation, scaling, encoding, skew correction, outlier handling; optional PCA/feature selection).
- Model search across common classifiers (LogReg, RF, SVC, XGBoost, MLP, NB, KNN) with Bayesian or grid search.
- Dynamic ensemble search with majority voting (hard/soft) + HTML report, notebook plots, and optional SHAP.
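The difference between hard and soft majority voting can be shown with a minimal sketch. It is a generic illustration, not Mamut's implementation; the function names and the sample probabilities are hypothetical.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority label per sample.

    predictions: one list of predicted labels per model, all the same length.
    Ties go to the label seen first, i.e. the earliest model in the list.
    """
    return [
        Counter(sample_labels).most_common(1)[0][0]
        for sample_labels in zip(*predictions)
    ]


def soft_vote(probabilities, classes):
    """Label with the highest mean predicted probability per sample.

    probabilities: one list per model, holding one probability vector per
    sample, ordered like `classes`.
    """
    result = []
    for sample_probs in zip(*probabilities):
        n_models = len(sample_probs)
        mean_probs = [
            sum(model_probs[i] for model_probs in sample_probs) / n_models
            for i in range(len(classes))
        ]
        best = max(range(len(classes)), key=mean_probs.__getitem__)
        result.append(classes[best])
    return result


# Hypothetical outputs of three models for a single sample.
classes = ["neg", "pos"]
probs = [[[0.45, 0.55]], [[0.45, 0.55]], [[0.90, 0.10]]]
labels = [[classes[max(range(2), key=p[0].__getitem__)]] for p in probs]

hard_result = hard_vote(labels)          # two of three models say "pos"
soft_result = soft_vote(probs, classes)  # mean probabilities favour "neg"
```

The example is chosen so the two schemes disagree: one confident model outweighs two marginal ones under soft voting but not under hard voting.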
QuantumRAG - Grover-inspired top-k selection for RAG
RAG, FAISS, Embeddings, Streamlit, Qiskit, Benchmarking
Dense retrieval (FAISS) + GroverTopK selector (Qiskit Aer), with an evaluation harness on SQuAD 1.1 and end-to-end latency profiling.
- Multi-model comparisons via Hugging Face Inference (llama-3-8b, mixtral-8x7b, phi-3.5), contrasting answers generated with and without retrieved context.
- Benchmark artifacts exported to CSV/JSON + plots; Streamlit demo for context inspection and model comparison.
- Key finding: Grover and classical selection were nearly identical (≈99% context agreement), with ~30 ms overhead; top-3 contexts outperformed top-1 and no-context.
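One plausible way to compute a context-agreement figure like the one above is the share of queries for which two selectors return the same set of contexts. The sketch below assumes that definition; the function names and sample scores are hypothetical, and the classical top-k helper stands in for whatever baseline the benchmark actually used.

```python
import heapq


def classic_top_k(scores, k):
    """Indices of the k highest-scoring contexts, best first."""
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


def context_agreement(selected_a, selected_b):
    """Fraction of queries where both selectors chose the same context set.

    selected_a, selected_b: one list of context indices per query.
    Order within a query is ignored.
    """
    if len(selected_a) != len(selected_b):
        raise ValueError("both selectors must cover the same queries")
    if not selected_a:
        raise ValueError("at least one query is required")
    matches = sum(set(a) == set(b) for a, b in zip(selected_a, selected_b))
    return matches / len(selected_a)


# Hypothetical similarity scores for two queries over four contexts.
scores_per_query = [[0.1, 0.9, 0.5, 0.7], [0.8, 0.2, 0.6, 0.3]]
baseline = [classic_top_k(scores, 2) for scores in scores_per_query]

# Stand-in for a second selector's output; the second query differs.
other = [[3, 1], [0, 1]]
agreement = context_agreement(baseline, other)
```

Comparing sets rather than ordered lists treats a reordering of the same contexts as agreement; a stricter metric would compare the ranked lists directly.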