@arm-diaz commented Dec 10, 2025


Issue #, if available:
N/A (New contribution)

Description of changes:

Added two new tutorials for HuggingFace text classification on Amazon SageMaker:

generative_ai/sm-huggingface_text_classification.ipynb - Complete end-to-end tutorial covering:

  • Environment setup with latest DLC versions (PyTorch 2.5.1, Transformers 4.49.0)
  • Data preparation with HuggingFace datasets (IMDB, AG News, SST-2, Emotion)
  • Tokenization and Arrow format explanation (why save_to_disk() and load_from_disk() are used)
  • Custom training script with early stopping and metrics
  • Deployment with correct inference container versions (PyTorch 2.6.0)
  • Multi-class classification support with dataset-specific labels and test samples
  • Cleanup instructions with cost warnings
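
For reviewers who want the early-stopping step spelled out, the idea can be sketched in plain Python as below. This is an illustrative sketch only, not the training script from the notebook: the `EarlyStopper` class, its `patience` and `min_delta` parameters, and the `epochs_run` helper are hypothetical names, and the script itself may rely on a library callback instead.

```python
from dataclasses import dataclass


@dataclass
class EarlyStopper:
    """Patience-based early stopping on a metric where lower is better."""

    patience: int = 2             # evaluations to wait without improvement
    min_delta: float = 0.0        # minimum decrease that counts as improvement
    best: float = float("inf")    # best evaluation loss seen so far
    bad_evals: int = 0            # consecutive evaluations without improvement

    def should_stop(self, eval_loss: float) -> bool:
        """Record one evaluation result and report whether to stop training."""
        if eval_loss < self.best - self.min_delta:
            self.best = eval_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def epochs_run(losses, patience=2):
    """Return how many evaluations run before early stopping triggers."""
    stopper = EarlyStopper(patience=patience)
    for epoch, loss in enumerate(losses, start=1):
        if stopper.should_stop(loss):
            return epoch
    return len(losses)
```

With `patience=2`, training halts after the second consecutive evaluation that fails to beat the best loss seen so far, which keeps a long job from continuing to bill once the model has stopped improving.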

Key features:

  • ASCII diagrams explaining data flow, architecture, and format comparisons
  • Hyperlinks to official AWS and HuggingFace documentation throughout
  • Version compatibility tables for DLC containers
  • Cost estimates and cleanup reminders

Testing done:

  • Trained BERT-base-uncased on AG News dataset (4-class) successfully
  • Training job completed: hf-bert-base-2025-12-07-22-12-33-733
  • Model artifacts saved to S3
  • Deployed endpoint and tested inference with multi-class predictions
  • Confirmed training/inference container version differences (PyTorch 2.5.1 vs 2.6.0)
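
A minimal sketch of how the multi-class predictions from the endpoint test could be made readable is shown below. It assumes the endpoint returns a JSON list of `{"label": "LABEL_<id>", "score": ...}` entries, which is what a text-classification model typically emits when no `id2label` mapping was saved with it; the actual response shape in the notebook may differ. The `readable_prediction` helper is a hypothetical name, and the class names follow the AG News label order.

```python
import json

# AG News class names in dataset label order (ids 0-3).
AG_NEWS_LABELS = ["World", "Sports", "Business", "Sci/Tech"]


def readable_prediction(response_body: str, label_names=AG_NEWS_LABELS) -> dict:
    """Convert a raw endpoint response into a named label and score.

    Assumes a JSON list such as [{"label": "LABEL_2", "score": 0.97}].
    If several candidates are returned, the highest-scoring one is used.
    """
    predictions = json.loads(response_body)
    top = max(predictions, key=lambda p: p["score"])
    # "LABEL_2" -> 2; the generic name carries the class id as its suffix.
    class_id = int(top["label"].rsplit("_", 1)[1])
    return {"label": label_names[class_id], "score": top["score"]}
```

Passing a different `label_names` list adapts the same helper to the other datasets mentioned above, provided the list matches that dataset's label order.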

Merge Checklist

Put an x in the boxes that apply.

  • I have verified that my PR does not contain any new notebook/s which demonstrate a SageMaker functionality already showcased by another existing notebook in the repository
  • I have read the [CONTRIBUTING](https://github.com/aws/amazon-sagemaker-examples/blob/default/CONTRIBUTING.md) doc and adhered to the guidelines regarding folder placement, notebook naming convention and example notebook best practices
  • I have updated the necessary documentation, including the README of the appropriate folder as well as the index.rst file
  • I have tested my notebook(s) and ensured it runs end-to-end
  • I have linted my notebook(s) and code using python3 -m black -l 100 generative_ai/sm-huggingface_text_classification.ipynb

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@review-notebook-app
Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

"│ 1. Load & Tokenize 2. Save (Arrow) 3. Upload to S3 │\n",
"│ ────────────────── ─────────────── ───────────────── │\n",
"│ load_dataset() → save_to_disk() → aws s3 sync │\n",
"│ AutoTokenizer() train_data/ s3://bucket/train/ │\n",
Recommendation generated by Amazon CodeGuru Reviewer. Leave feedback on this recommendation by replying to the comment or by reacting to the comment using emoji.

Potential S3 bucket sniping vulnerability detected. This rule has identified S3 bucket references that could be vulnerable to bucket sniping attacks. Bucket sniping occurs when an attacker registers an S3 bucket name after finding it referenced in code but not yet created. This can lead to data exposure, malicious content hosting, or service disruption.

Recommendations:

  1. Create all referenced S3 buckets immediately
  2. Use organization-specific prefixes for bucket names
  3. Verify bucket ownership before use
  4. Consider using AWS Organizations S3 bucket naming rules

Similar issue at line numbers 400 and 403.

Discovered: bucket
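
Recommendations 2 and 3 above could be addressed along the lines of the sketch below. It is an assumption-laden illustration, not the notebook's code: `scoped_bucket_name` and `is_owned_by` are hypothetical helpers, the account ID and region in the usage note are placeholders, and the naming check is a simplified subset of the S3 general-purpose bucket rules. The ownership check expects a boto3 S3 client to be passed in and relies on the `ExpectedBucketOwner` parameter of `head_bucket`.

```python
import re

# Simplified S3 naming rule: 3-63 characters, lowercase letters, digits,
# hyphens and dots, starting and ending with a letter or digit.
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$")


def scoped_bucket_name(prefix: str, account_id: str, region: str) -> str:
    """Build a bucket name that embeds the owning account and region."""
    name = f"{prefix}-{region}-{account_id}"
    if not _BUCKET_RE.fullmatch(name):
        raise ValueError(f"invalid bucket name: {name!r}")
    return name


def is_owned_by(s3_client, bucket: str, account_id: str) -> bool:
    """Return True only if the bucket exists and belongs to account_id.

    s3_client is expected to be a boto3 S3 client. With boto3 a failed
    check raises botocore.exceptions.ClientError; the broad except here
    only keeps this sketch free of third-party imports.
    """
    try:
        s3_client.head_bucket(Bucket=bucket, ExpectedBucketOwner=account_id)
    except Exception:
        return False
    return True
```

For example, `scoped_bucket_name("sagemaker", "123456789012", "us-east-1")` yields `sagemaker-us-east-1-123456789012`, and calling `is_owned_by` before `aws s3 sync` would let the notebook refuse to upload to a bucket that another account controls.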
