Istanbul StreetCLIP
A LoRA fine-tuned StreetCLIP model for predicting Istanbul district locations from street-level photos.
Model Description
This model uses LoRA (Low-Rank Adaptation) to fine-tune StreetCLIP for Istanbul district classification. Given a street-level photograph, it predicts which of 10 Istanbul districts the photo was taken in by scoring the image against one text caption per district with CLIP image-text similarity, the same mechanism used for zero-shot classification.
Districts: Beyoglu, Kadikoy, Besiktas, Uskudar, Fatih, Sisli, Bakirkoy, Maltepe, Sariyer, Atasehir
Performance
| Metric | Score |
|---|---|
| Top-1 Accuracy | 63.51% |
| Top-3 Accuracy | 83.16% |
| Test Samples | 570 |
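For readers unfamiliar with the two metrics, the following is a minimal sketch of how top-k accuracy is commonly computed from per-class scores. The function name `top_k_accuracy` and the toy numbers are illustrative assumptions; this is not the evaluation script used for this model.

```python
def top_k_accuracy(scores, labels, k=1):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores in this row.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example with 3 classes and 4 samples (made-up numbers, not model output).
scores = [
    [0.7, 0.2, 0.1],  # true class 0: ranked first
    [0.1, 0.3, 0.6],  # true class 1: ranked second
    [0.5, 0.4, 0.1],  # true class 2: ranked last
    [0.2, 0.5, 0.3],  # true class 1: ranked first
]
labels = [0, 1, 2, 1]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")  # top-1: 0.50, top-2: 0.75
```

With 10 districts, Top-3 accuracy counts a prediction as correct when the true district is among the three highest-scoring captions.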
Per-District Accuracy
| District | Accuracy |
|---|---|
| Sariyer | 93.33% |
| Sisli | 84.44% |
| Beyoglu | 77.33% |
| Besiktas | 73.33% |
| Kadikoy | 69.33% |
| Uskudar | 61.33% |
| Atasehir | 55.56% |
| Bakirkoy | 44.44% |
| Maltepe | 35.56% |
| Fatih | 22.22% |
Usage
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from peft import PeftModel

# Load model
base_model = CLIPModel.from_pretrained("geolocal/StreetCLIP")
model = PeftModel.from_pretrained(base_model, "sibernetik/istanbul-streetclip")
model.eval()
processor = CLIPProcessor.from_pretrained("geolocal/StreetCLIP")

# Districts
districts = [
    "Beyoglu", "Kadikoy", "Besiktas", "Uskudar", "Fatih",
    "Sisli", "Bakirkoy", "Maltepe", "Sariyer", "Atasehir"
]
captions = [f"A street-level photo of {d}, Istanbul, Turkey" for d in districts]

# Predict
image = Image.open("your_photo.jpg")
inputs = processor(images=image, text=captions, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)[0]

for district, prob in sorted(zip(districts, probs), key=lambda x: x[1], reverse=True):
    print(f"{district}: {prob.item()*100:.2f}%")
```
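Since Top-3 accuracy (83.16%) is well above Top-1 (63.51%), it can be more useful to report a short list of candidate districts than a single guess. The helper below is a small sketch of that idea; `top_k_districts` is a hypothetical name, and the probabilities are made up for illustration.

```python
def top_k_districts(districts, probs, k=3):
    """Return the k (district, probability) pairs with the highest probability."""
    ranked = sorted(zip(districts, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Made-up probabilities for four districts (not real model output).
example_districts = ["Beyoglu", "Kadikoy", "Besiktas", "Uskudar"]
example_probs = [0.10, 0.55, 0.05, 0.30]

top3 = top_k_districts(example_districts, example_probs, k=3)
for name, p in top3:
    print(f"{name}: {p * 100:.2f}%")
```

With the prediction code above, the same helper could be called as `top_k_districts(districts, probs.tolist())`.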
Training Details
- Base Model: geolocal/StreetCLIP (~432M params)
- Method: LoRA (r=16, alpha=32, dropout=0.1)
- Target Modules: q_proj, k_proj, v_proj, out_proj
- Trainable Parameters: 4,325,376 (1.0% of total)
- Training Data: 3,800 street-level images from Mapillary across 10 Istanbul districts
- Train/Val/Test Split: 70/15/15
- Epochs: 5
- Batch Size: 4 (effective 32 with gradient accumulation)
- Learning Rate: 5e-6 with cosine schedule
- Hardware: Apple Silicon (MPS)
- Loss: CLIP contrastive loss
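The trainable parameter count above can be checked by hand. The sketch below assumes StreetCLIP uses the CLIP ViT-L/14 layout (text tower: 12 layers, hidden size 768; vision tower: 24 layers, hidden size 1024). Those dimensions are not stated in this card and are an assumption. Each adapted square projection of width d receives two LoRA matrices, A (r × d) and B (d × r).

```python
# Assumed CLIP ViT-L/14 dimensions (an assumption, not stated in this card).
TEXT_HIDDEN, TEXT_LAYERS = 768, 12
VISION_HIDDEN, VISION_LAYERS = 1024, 24

RANK = 16               # LoRA r, from Training Details
TARGETS_PER_LAYER = 4   # q_proj, k_proj, v_proj, out_proj


def lora_params(hidden, layers, rank=RANK, targets=TARGETS_PER_LAYER):
    """LoRA parameters added to one tower of square (hidden x hidden) projections."""
    # A is (rank x hidden), B is (hidden x rank).
    per_module = rank * hidden + hidden * rank
    return layers * targets * per_module


text_params = lora_params(TEXT_HIDDEN, TEXT_LAYERS)
vision_params = lora_params(VISION_HIDDEN, VISION_LAYERS)
total = text_params + vision_params

print(f"text tower:   {text_params:,}")    # 1,179,648
print(f"vision tower: {vision_params:,}")  # 3,145,728
print(f"total:        {total:,}")          # 4,325,376
```

Under these assumptions the total matches the 4,325,376 trainable parameters reported above, which is about 1.0% of roughly 432M.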
Training Progress
| Epoch | Train Loss | Val Accuracy |
|---|---|---|
| 1 | 1.9099 | 27.37% |
| 2 | 1.3854 | 36.84% |
| 3 | 1.0433 | 47.89% |
| 4 | 0.8700 | 58.25% |
| 5 | 0.7431 | 64.91% |
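The step counts behind these five epochs can be estimated from the figures in Training Details. The sketch below is a rough reconstruction, not the actual training script. It assumes the 70% split is by image count, that a partial final batch still triggers an optimizer step, and that the cosine schedule has no warmup and decays to zero. None of these details are stated in this card.

```python
import math

TOTAL_IMAGES = 3800
TRAIN_FRACTION = 0.70
MICRO_BATCH = 4
EFFECTIVE_BATCH = 32
EPOCHS = 5
PEAK_LR = 5e-6

train_images = round(TOTAL_IMAGES * TRAIN_FRACTION)          # 2660
accum_steps = EFFECTIVE_BATCH // MICRO_BATCH                 # 8 micro-batches per update
steps_per_epoch = math.ceil(train_images / EFFECTIVE_BATCH)  # 84 (assumes partial last batch)
total_steps = steps_per_epoch * EPOCHS                       # 420


def cosine_lr(step, total=total_steps, peak=PEAK_LR):
    """Cosine decay from peak to zero, with no warmup (assumed)."""
    progress = min(step, total) / total
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, total_steps // 2, total_steps):
    print(f"step {step:3d}: lr = {cosine_lr(step):.2e}")
```

Under these assumptions the learning rate starts at 5e-6, is halved around the midpoint of epoch 3, and approaches zero by the end of epoch 5.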
Limitations
- Fine-tuned for Istanbul only; not expected to generalize to other cities
- Performance varies widely by district, from 93.33% for Sariyer down to 22.22% for Fatih
- Trained on Mapillary street-level imagery; may not work well on aerial/satellite photos
- Small dataset (3,800 images) limits generalization
Framework Versions
- PEFT 0.18.1
- Transformers 4.x
- PyTorch 2.x