Analyze & Summarize Text using Transformer + Generative AI

Author

Gunardi

Published

November 2, 2023

This notebook shows how to perform sentiment analysis and summarization on a long document. The document used for these NLP tasks is Apple’s 2022 annual financial statement (specifically Item 7, Management’s Discussion).

1 Install and Import Prerequisite Libraries

!pip install transformers
import json
import nltk
import pandas as pd
from transformers import pipeline
nltk.download('punkt')

2 Load Data

FILE_PATH = "/content/drive/MyDrive/320193_10K_2022_0000320193-22-000108.json"
with open(FILE_PATH) as file:
    content = json.load(file)

item_7 = content["item_7"]

3 Perform Sentiment Analysis

def find_emotional_sentences(text:str, emotions:list, threshold:float):
    # Inputs: 
    # 1. text -> input text as string
    # 2. emotions -> list of desired emotions to be analyzed
    # 3. threshold -> minimum confidence level
    sentences_by_emotion = {}
    for e in emotions:
        sentences_by_emotion[e]=[]
        
    # Split the input text into a list of sentences
    sentences = nltk.sent_tokenize(text)
    print(f'Document has {len(text)} characters and {len(sentences)} sentences.')
    for s in sentences:
        # Use the classifier (a global pipeline created before this function is called)
        # to get the top emotion of the sentence
        prediction = classifier(s)
        label, score = prediction[0]['label'], prediction[0]['score']
        # Keep confident, non-neutral predictions; skip labels that are not in `emotions`
        if label != 'neutral' and label in sentences_by_emotion and score > threshold:
            sentences_by_emotion[label].append(s)
            
    # Print total number of sentences with specific emotions
    for e in emotions:
        print(f'{e}: {len(sentences_by_emotion[e])} sentences')
    return sentences_by_emotion
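
The filtering rule used by the function above can be tried out without downloading a model. The following is a minimal sketch, not the notebook's actual classifier: `filter_by_label` and `stub_classifier` are hypothetical names, and the stub returns made-up scores in the same shape as a text-classification pipeline result.

```python
from typing import Callable, Dict, List


def filter_by_label(sentences: List[str],
                    classify: Callable[[str], List[dict]],
                    emotions: List[str],
                    threshold: float) -> Dict[str, List[str]]:
    """Group sentences by predicted label, keeping confident, non-neutral hits."""
    grouped = {e: [] for e in emotions}
    for s in sentences:
        top = classify(s)[0]  # same shape as a pipeline result: [{'label', 'score'}]
        if (top["label"] != "neutral"
                and top["label"] in grouped
                and top["score"] > threshold):
            grouped[top["label"]].append(s)
    return grouped


def stub_classifier(sentence: str) -> List[dict]:
    """Hypothetical stand-in for the model: keyword rules with made-up scores."""
    if "increased" in sentence:
        return [{"label": "positive", "score": 0.95}]
    if "decreased" in sentence:
        return [{"label": "negative", "score": 0.65}]
    return [{"label": "neutral", "score": 0.90}]


demo = filter_by_label(
    ["Net sales increased.", "Net sales decreased.", "The table follows."],
    stub_classifier,
    ["positive", "neutral", "negative"],
    0.7,
)
print(demo)
```

In this sketch the "decreased" sentence is dropped because its made-up score of 0.65 is below the 0.7 threshold, and the neutral sentence is always excluded, which is why the neutral list stays empty.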

3.1 Use FinBERT Model

FinBERT is a pre-trained NLP model to analyze sentiment of financial text. It is built by further training the BERT language model in the finance domain, using a large financial corpus and thereby fine-tuning it for financial sentiment classification.

The model classifies each input as one of three sentiments: positive, negative, or neutral.

classifier_model_name = 'ProsusAI/finbert'
classifier_emotions = ['positive', 'neutral', 'negative']

classifier = pipeline('text-classification', model=classifier_model_name)
sentences_by_emotion = find_emotional_sentences(item_7, classifier_emotions, 0.7)
Document has 15613 characters and 84 sentences.
positive: 20 sentences
neutral: 0 sentences
negative: 14 sentences

3.1.1 Get Sentences Based On Sentiments

sentences_by_emotion['positive']
['Fiscal Year Highlights\nFiscal 2022 Highlights\nTotal net sales increased 8% or $28.5 billion during 2022 compared to 2021, driven primarily by higher net sales of iPhone, Services and Mac.',
 'In April 2022, the Company announced an increase to its Program authorization from $315 billion to $405 billion and raised its quarterly dividend from $0.22 to $0.23 per share beginning in May 2022.',
 'iPhone\niPhone net sales increased during 2022 compared to 2021 due primarily to higher net sales from the Company’s new iPhone models released since the beginning of the fourth quarter of 2021.',
 'Mac\nMac net sales increased during 2022 compared to 2021 due primarily to higher net sales of laptops.',
 'Wearables, Home and Accessories\nWearables, Home and Accessories net sales increased during 2022 compared to 2021 due primarily to higher net sales of Apple Watch and AirPods.',
 'Services\nServices net sales increased during 2022 compared to 2021 due primarily to higher net sales from advertising, cloud services and the App Store.',
 'Further information regarding the Company’s reportable segments can be found in Part II, Item 8 of this Form 10-K in the Notes to Consolidated Financial Statements in Note 11, “Segment Information and Geographic Data.”\nThe following table shows net sales by reportable segment for 2022, 2021 and 2020 (dollars in millions):\nAmericas\nAmericas net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone, Services and Mac.',
 'Europe\nEurope net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone and Services.',
 'Greater China\nGreater China net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone and Services.',
 'The strength of the renminbi relative to the U.S. dollar had a favorable year-over-year impact on Greater China net sales during 2022.',
 'Rest of Asia Pacific\nRest of Asia Pacific net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone, Mac and Services.',
 'Gross Margin\nProducts and Services gross margin and gross margin percentage for 2022, 2021 and 2020 were as follows (dollars in millions):\nProducts Gross Margin\nProducts gross margin increased during 2022 compared to 2021 due primarily to a different Products mix and higher Products volume, partially offset by the weakness in foreign currencies relative to the U.S. dollar.',
 'Products gross margin percentage increased during 2022 compared to 2021 due primarily to a different Products mix, partially offset by the weakness in foreign currencies relative to the U.S. dollar.',
 'Services Gross Margin\nServices gross margin increased during 2022 compared to 2021 due primarily to higher Services net sales, partially offset by the weakness in foreign currencies relative to the U.S. dollar.',
 'Services gross margin percentage increased during 2022 compared to 2021 due primarily to improved leverage and a different Services mix, partially offset by the weakness in foreign currencies relative to the U.S. dollar.',
 'Selling, General and Administrative\nThe year-over-year growth in selling, general and administrative expense in 2022 was driven primarily by increases in headcount-related expenses, advertising and professional services.',
 'The Company’s effective tax rate for 2022 was higher compared to 2021 due primarily to a higher effective tax rate on foreign earnings, including the impact to U.S. foreign tax credits as a result of regulatory guidance issued by the U.S. Department of the Treasury in 2022, and lower tax benefits from foreign-derived intangible income deductions and share-based compensation.',
 'Liquidity and Capital Resources\nThe Company believes its balances of cash, cash equivalents and unrestricted marketable securities, which totaled $156.4 billion as of September 24, 2022, along with cash generated by ongoing operations and continued access to debt markets, will be sufficient to satisfy its cash requirements and capital return program over the next 12 months and beyond.',
 'The Company intends to increase its dividend on an annual basis, subject to declaration by the Board of Directors.',
 'Although management believes the Company’s reserves are reasonable, no assurance can be given that the final outcome of these uncertainties will not be different from that which is reflected in the Company’s reserves.']
sentences_by_emotion['neutral']
[]
sentences_by_emotion['negative']
['The weakness in foreign currencies relative to the U.S. dollar had an unfavorable year-over-year impact on all Products and Services net sales during 2022.',
 'COVID-19\nThe COVID-19 pandemic has had, and continues to have, a significant impact around the world, prompting governments and businesses to take unprecedented measures, such as restrictions on travel and business operations, temporary closures of businesses, and quarantine and shelter-in-place orders.',
 'The COVID-19 pandemic has at times significantly curtailed global economic activity and caused significant volatility and disruption in global financial markets.',
 'The COVID-19 pandemic and the measures taken by many countries in response have affected and could in the future materially impact the Company’s business, results of operations and financial condition.',
 'Certain of the Company’s outsourcing partners, component suppliers and logistical service providers have experienced disruptions during the COVID-19 pandemic, resulting in supply shortages.',
 'iPad\niPad net sales decreased during 2022 compared to 2021 due primarily to lower net sales of iPad Pro.',
 'The weakness in foreign currencies relative to the U.S. dollar had a net unfavorable year-over-year impact on Europe net sales during 2022.',
 'Japan\nJapan net sales decreased during 2022 compared to 2021 due to the weakness of the yen relative to the U.S. dollar.',
 'The weakness in foreign currencies relative to the U.S. dollar had an unfavorable year-over-year impact on Rest of Asia Pacific net sales during 2022.',
 'Other Income/(Expense), Net\nOther income/(expense), net (“OI&E”) for 2022, 2021 and 2020 was as follows (dollars in millions):\nThe decrease in OI&E during 2022 compared to 2021 was due primarily to higher realized losses on debt securities, unfavorable fair value adjustments on equity securities and higher interest expense, partially offset by higher foreign exchange gains.',
 'Provision for Income Taxes\nProvision for income taxes, effective tax rate and statutory federal income tax rate for 2022, 2021 and 2020 were as follows (dollars in millions):\nThe Company’s effective tax rate for 2022 was lower than the statutory federal income tax rate due primarily to a lower effective tax rate on foreign earnings, tax benefits from share-based compensation and the impact of the U.S. federal R&D credit, partially offset by state income taxes.',
 'The Company’s effective tax rate for 2021 was lower than the statutory federal income tax rate due primarily to a lower effective tax rate on foreign earnings, tax benefits from share-based compensation and foreign-derived intangible income deductions.',
 'Resolution of these uncertainties in a manner inconsistent with management’s expectations could have a material impact on the Company’s financial condition and operating results.',
 'Resolution of legal matters in a manner inconsistent with management’s expectations could have a material impact on the Company’s financial condition and operating results.']

3.2 Use DistilBERT Model

DistilBERT is created by applying knowledge distillation during the pre-training phase, which reduces the size of a BERT model by 40% while retaining 97% of its language understanding. It is smaller and faster than BERT. The checkpoint used here is a DistilBERT model fine-tuned for emotion classification.

classifier_model_name = 'bhadresh-savani/distilbert-base-uncased-emotion'
classifier_emotions = ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise']

classifier = pipeline('text-classification', model=classifier_model_name)
sentences_by_emotion = find_emotional_sentences(item_7, classifier_emotions, 0.7)
Document has 15613 characters and 84 sentences.
anger: 0 sentences
disgust: 0 sentences
fear: 4 sentences
joy: 55 sentences
sadness: 8 sentences
surprise: 0 sentences
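
A count of zero can mean that no sentence passed the threshold, or that the model has no such label at all. A minimal sketch for telling the two cases apart is to compare the requested emotions with the model's label map. The helper name `unreachable_emotions` is hypothetical, and `example_id2label` is an assumed example map for illustration only; in the notebook, the real map should be read from `classifier.model.config.id2label`.

```python
def unreachable_emotions(requested, id2label):
    """Return (requested emotions the model cannot predict,
    model labels that were not requested)."""
    model_labels = set(id2label.values())
    missing = [e for e in requested if e not in model_labels]
    unrequested = sorted(model_labels - set(requested))
    return missing, unrequested


# Assumed label map for illustration; replace it with
# classifier.model.config.id2label to check the actual model.
example_id2label = {0: "sadness", 1: "joy", 2: "love",
                    3: "anger", 4: "fear", 5: "surprise"}
requested = ['anger', 'disgust', 'fear', 'joy', 'sadness', 'surprise']

missing, unrequested = unreachable_emotions(requested, example_id2label)
print("Requested but not a model label:", missing)
print("Model labels not requested:", unrequested)
```

If the check against the real label map reports a requested emotion as missing, its count will always be zero regardless of the threshold.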

3.2.1 Get Sentences Based On Sentiments

sentences_by_emotion['anger']
[]
sentences_by_emotion['disgust']
[]
sentences_by_emotion['fear']
['Uncertain Tax Positions\nThe Company is subject to income taxes in the U.S. and numerous foreign jurisdictions.',
 'The evaluation of the Company’s uncertain tax positions involves significant judgment in the interpretation and application of GAAP and complex domestic and international tax laws, including the Act and matters related to the allocation of international taxation rights between countries.',
 'Resolution of these uncertainties in a manner inconsistent with management’s expectations could have a material impact on the Company’s financial condition and operating results.',
 'Legal and Other Contingencies\nThe Company is subject to various legal proceedings and claims that arise in the ordinary course of business, the outcomes of which are inherently uncertain.']
sentences_by_emotion['joy']
['Management’s Discussion and Analysis of Financial Condition and Results of Operations\nThe following discussion should be read in conjunction with the consolidated financial statements and accompanying notes included in Part II, Item 8 of this Form 10-K.',
 'Fiscal Year Highlights\nFiscal 2022 Highlights\nTotal net sales increased 8% or $28.5 billion during 2022 compared to 2021, driven primarily by higher net sales of iPhone, Services and Mac.',
 'The Company announces new product, service and software offerings at various times during the year.',
 'Significant announcements during fiscal 2022 included the following:\nFirst Quarter 2022:\n•Updated MacBook Pro 14” and MacBook Pro 16”, powered by the Apple M1 Pro or M1 Max chip; and\n•Third generation of AirPods.',
 'Second Quarter 2022:\n•Updated iPhone SE with 5G technology;\n•All-new Mac Studio, powered by the Apple M1 Max or M1 Ultra chip;\n•All-new Studio Display™; and\n•Updated iPad Air with 5G technology, powered by the Apple M1 chip.',
 'Third Quarter 2022:\n•Updated MacBook Air and MacBook Pro 13”, both powered by the Apple M2 chip;\n•iOS 16, macOS Ventura, iPadOS 16 and watchOS 9, updates to the Company’s operating systems; and\n•Apple Pay Later, a buy now, pay later service.',
 'Fourth Quarter 2022:\n•iPhone 14, iPhone 14 Plus, iPhone 14 Pro and iPhone 14 Pro Max;\n•Second generation of AirPods Pro; and\n•Apple Watch Series 8, updated Apple Watch SE and all-new Apple Watch Ultra.',
 'In April 2022, the Company announced an increase to its Program authorization from $315 billion to $405 billion and raised its quarterly dividend from $0.22 to $0.23 per share beginning in May 2022.',
 'During 2022, the Company repurchased $90.2 billion of its common stock and paid dividends and dividend equivalents of $14.8 billion.',
 'Products and Services Performance\nThe following table shows net sales by category for 2022, 2021 and 2020 (dollars in millions):\n(1)Products net sales include amortization of the deferred value of unspecified software upgrade rights, which are bundled in the sales price of the respective product.',
 '(2)Wearables, Home and Accessories net sales include sales of AirPods, Apple TV, Apple Watch, Beats products, HomePod mini and accessories.',
 '(3)Services net sales include sales from the Company’s advertising, AppleCare, cloud, digital content, payment and other services.',
 'iPhone\niPhone net sales increased during 2022 compared to 2021 due primarily to higher net sales from the Company’s new iPhone models released since the beginning of the fourth quarter of 2021.',
 'Mac\nMac net sales increased during 2022 compared to 2021 due primarily to higher net sales of laptops.',
 'Wearables, Home and Accessories\nWearables, Home and Accessories net sales increased during 2022 compared to 2021 due primarily to higher net sales of Apple Watch and AirPods.',
 'Services\nServices net sales increased during 2022 compared to 2021 due primarily to higher net sales from advertising, cloud services and the App Store.',
 'Segment Operating Performance\nThe Company manages its business primarily on a geographic basis.',
 'The Company’s reportable segments consist of the Americas, Europe, Greater China, Japan and Rest of Asia Pacific.',
 'Europe includes European countries, as well as India, the Middle East and Africa.',
 'Rest of Asia Pacific includes Australia and those Asian countries not included in the Company’s other reportable segments.',
 'Although the reportable segments provide similar hardware and software products and similar services, each one is managed separately to better align with the location of the Company’s customers and distribution partners and the unique market dynamics of each geographic region.',
 'Further information regarding the Company’s reportable segments can be found in Part II, Item 8 of this Form 10-K in the Notes to Consolidated Financial Statements in Note 11, “Segment Information and Geographic Data.”\nThe following table shows net sales by reportable segment for 2022, 2021 and 2020 (dollars in millions):\nAmericas\nAmericas net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone, Services and Mac.',
 'Europe\nEurope net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone and Services.',
 'Greater China\nGreater China net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone and Services.',
 'The strength of the renminbi relative to the U.S. dollar had a favorable year-over-year impact on Greater China net sales during 2022.',
 'Rest of Asia Pacific\nRest of Asia Pacific net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone, Mac and Services.',
 'Gross Margin\nProducts and Services gross margin and gross margin percentage for 2022, 2021 and 2020 were as follows (dollars in millions):\nProducts Gross Margin\nProducts gross margin increased during 2022 compared to 2021 due primarily to a different Products mix and higher Products volume, partially offset by the weakness in foreign currencies relative to the U.S. dollar.',
 'Products gross margin percentage increased during 2022 compared to 2021 due primarily to a different Products mix, partially offset by the weakness in foreign currencies relative to the U.S. dollar.',
 'Services gross margin percentage increased during 2022 compared to 2021 due primarily to improved leverage and a different Services mix, partially offset by the weakness in foreign currencies relative to the U.S. dollar.',
 'Operating Expenses\nOperating expenses for 2022, 2021 and 2020 were as follows (dollars in millions):\nResearch and Development\nThe year-over-year growth in R&D expense in 2022 was driven primarily by increases in headcount-related expenses and engineering program costs.',
 'Selling, General and Administrative\nThe year-over-year growth in selling, general and administrative expense in 2022 was driven primarily by increases in headcount-related expenses, advertising and professional services.',
 'Provision for Income Taxes\nProvision for income taxes, effective tax rate and statutory federal income tax rate for 2022, 2021 and 2020 were as follows (dollars in millions):\nThe Company’s effective tax rate for 2022 was lower than the statutory federal income tax rate due primarily to a lower effective tax rate on foreign earnings, tax benefits from share-based compensation and the impact of the U.S. federal R&D credit, partially offset by state income taxes.',
 'The Company’s effective tax rate for 2021 was lower than the statutory federal income tax rate due primarily to a lower effective tax rate on foreign earnings, tax benefits from share-based compensation and foreign-derived intangible income deductions.',
 'The Company’s effective tax rate for 2022 was higher compared to 2021 due primarily to a higher effective tax rate on foreign earnings, including the impact to U.S. foreign tax credits as a result of regulatory guidance issued by the U.S. Department of the Treasury in 2022, and lower tax benefits from foreign-derived intangible income deductions and share-based compensation.',
 'Liquidity and Capital Resources\nThe Company believes its balances of cash, cash equivalents and unrestricted marketable securities, which totaled $156.4 billion as of September 24, 2022, along with cash generated by ongoing operations and continued access to debt markets, will be sufficient to satisfy its cash requirements and capital return program over the next 12 months and beyond.',
 'Debt\nAs of September 24, 2022, the Company had outstanding fixed-rate notes with varying maturities for an aggregate principal amount of $111.8 billion (collectively the “Notes”), with $11.1 billion payable within 12 months.',
 'Future interest payments associated with the Notes total $41.3 billion, with $2.9 billion payable within 12 months.',
 'As of September 24, 2022, the Company had $10.0 billion of Commercial Paper outstanding, all of which was payable within 12 months.',
 'As of September 24, 2022, the Company had fixed lease payment obligations of $15.3 billion, with $2.0 billion payable within 12 months.',
 'Manufacturing Purchase Obligations\nThe Company utilizes several outsourcing partners to manufacture subassemblies for the Company’s products and to perform final assembly and testing of finished products.',
 'The Company also obtains individual components for its products from a wide variety of individual suppliers.',
 'Outsourcing partners acquire components and build product based on demand information supplied by the Company, which typically covers periods up to 150 days.',
 'As of September 24, 2022, the Company had manufacturing purchase obligations of $71.1 billion, with $68.4 billion payable within 12 months.',
 'Other Purchase Obligations\nThe Company’s other purchase obligations primarily consist of noncancelable obligations to acquire capital assets, including assets related to product manufacturing, and noncancelable obligations related to internet services and content creation.',
 'As of September 24, 2022, the Company had other purchase obligations of $17.8 billion, with $6.8 billion payable within 12 months.',
 'Deemed Repatriation Tax Payable\nAs of September 24, 2022, the balance of the deemed repatriation tax payable imposed by the U.S. Tax Cuts and Jobs Act of 2017 (the “Act”) was $22.0 billion, with $5.3 billion expected to be paid within 12 months.',
 'In addition to its contractual cash requirements, the Company has a capital return program authorized by the Board of Directors.',
 'As of September 24, 2022, the Company’s quarterly cash dividend was $0.23 per share.',
 'The Company intends to increase its dividend on an annual basis, subject to declaration by the Board of Directors.',
 'Critical Accounting Estimates\nThe preparation of financial statements and related disclosures in conformity with U.S. generally accepted accounting principles (“GAAP”) and the Company’s discussion and analysis of its financial condition and operating results require the Company’s management to make judgments, assumptions and estimates that affect the amounts reported.',
 'Note 1, “Summary of Significant Accounting Policies” of the Notes to Consolidated Financial Statements in Part II, Item 8 of this Form 10-K describes the significant accounting policies and methods used in the preparation of the Company’s consolidated financial statements.',
 'Management bases its estimates on historical experience and on various other assumptions it believes to be reasonable under the circumstances, the results of which form the basis for making judgments about the carrying values of assets and liabilities.',
 'Although management believes the Company’s reserves are reasonable, no assurance can be given that the final outcome of these uncertainties will not be different from that which is reflected in the Company’s reserves.',
 'Reserves are adjusted considering changing facts and circumstances, such as the closing of a tax examination.',
 'The Company records a liability when it is probable that a loss has been incurred and the amount is reasonably estimable, the determination of which requires significant judgment.']
sentences_by_emotion['sadness']
['The weakness in foreign currencies relative to the U.S. dollar had an unfavorable year-over-year impact on all Products and Services net sales during 2022.',
 'The COVID-19 pandemic has at times significantly curtailed global economic activity and caused significant volatility and disruption in global financial markets.',
 'The COVID-19 pandemic and the measures taken by many countries in response have affected and could in the future materially impact the Company’s business, results of operations and financial condition.',
 'iPad\niPad net sales decreased during 2022 compared to 2021 due primarily to lower net sales of iPad Pro.',
 'The weakness in foreign currencies relative to the U.S. dollar had a net unfavorable year-over-year impact on Europe net sales during 2022.',
 'Japan\nJapan net sales decreased during 2022 compared to 2021 due to the weakness of the yen relative to the U.S. dollar.',
 'The weakness in foreign currencies relative to the U.S. dollar had an unfavorable year-over-year impact on Rest of Asia Pacific net sales during 2022.',
 'Other Income/(Expense), Net\nOther income/(expense), net (“OI&E”) for 2022, 2021 and 2020 was as follows (dollars in millions):\nThe decrease in OI&E during 2022 compared to 2021 was due primarily to higher realized losses on debt securities, unfavorable fair value adjustments on equity securities and higher interest expense, partially offset by higher foreign exchange gains.']
sentences_by_emotion['surprise']
[]

4 Summarize Sentences With Emotions

To generate summaries, we use the T5-base model. T5 is a transformer-based language model family that achieved state-of-the-art results on various NLP tasks, including text summarization, at the time of its release.

4.1 First Approach to Summarize

This section performs summarization with the built-in summarization pipeline of the Hugging Face transformers library, using the T5-base model.

def summarize_sentences(sentences_by_emotion, min_length, max_length):
    for k in sentences_by_emotion.keys():
        if (len(sentences_by_emotion[k])!=0):
            text = ' '.join(sentences_by_emotion[k])
            summary = summarizer(text, min_length=min_length, max_length=max_length)
            print(f"{k.upper()}: {summary[0]['summary_text']}\n")
summarizer = pipeline('summarization', model='t5-base', max_length=1000)
/usr/local/lib/python3.10/dist-packages/transformers/models/t5/tokenization_t5_fast.py:160: FutureWarning: This tokenizer was incorrectly instantiated with a model max length of 512 which will be corrected in Transformers v5.
For now, this behavior is kept to avoid breaking backwards compatibility when padding/encoding with `truncation is True`.
- Be aware that you SHOULD NOT rely on t5-base automatically truncating your input to 512 when padding/encoding.
- If you want to encode/pad to sequences longer than 512 you can either instantiate this tokenizer with `model_max_length` or pass `max_length` when encoding/padding.
- To avoid this warning, please instantiate this tokenizer with `model_max_length` set to your preferred value.
  warnings.warn(

The warning indicates that t5-base has a nominal limit of 512 input tokens, so text longer than that may be truncated or handled poorly. Therefore, we will use the map-reduce technique described in 1 and 2 in the next section.

For now, let’s check what happens if we ignore this warning. After executing section 4.1.1, the reader can compare the two results.

summarize_sentences(sentences_by_emotion, min_length=60, max_length=90)
Token indices sequence length is longer than the specified maximum sequence length for this model (2303 > 512). Running this sequence through the model will result in indexing errors
FEAR: the Company is subject to income taxes in the U.S. and numerous foreign jurisdictions . resolution of these uncertainties in a manner inconsistent with management’s expectations could have a material impact on the Company’s financial condition and operating results . legal and other contingencies The Company may be subject to various legal proceedings and claims that arise in the ordinary course of business .

JOY: the Company’s reportable segments consist of the Americas, Europe, Greater China, Japan and Rest of Asia Pacific . as of September 24, 2022, the Company had outstanding fixed-rate notes with varying maturities for an aggregate principal amount of $111.8 billion, with $2.9 billion payable within 12 months . the Company believes its balances of cash, cash equivalents and unrestricted marketable securities will

SADNESS: weakness in foreign currencies relative to the U.S. dollar had an unfavorable year-over-year impact on all Products and Services net sales during 2022 . the COVID-19 pandemic has at times significantly curtailed global economic activity and caused significant volatility .
text = ' '.join(sentences_by_emotion["sadness"])
summary = summarizer(text, min_length=60, max_length=90)
summary[0]['summary_text']
'weakness in foreign currencies relative to the U.S. dollar had an unfavorable year-over-year impact on all Products and Services net sales during 2022 . the COVID-19 pandemic has at times significantly curtailed global economic activity and caused significant volatility .'
text = ' '.join(sentences_by_emotion["fear"])
summary = summarizer(text, min_length=60, max_length=90)
summary[0]['summary_text']
'the Company is subject to income taxes in the U.S. and numerous foreign jurisdictions . resolution of these uncertainties in a manner inconsistent with management’s expectations could have a material impact on the Company’s financial condition and operating results . legal and other contingencies The Company may be subject to various legal proceedings and claims that arise in the ordinary course of business .'
text = ' '.join(sentences_by_emotion["joy"])
summary = summarizer(text, min_length=60, max_length=150)
summary[0]['summary_text']
'the Company’s reportable segments consist of the Americas, Europe, Greater China, Japan and Rest of Asia Pacific . as of September 24, 2022, the Company had outstanding fixed-rate notes with varying maturities for an aggregate principal amount of $111.8 billion, with $2.9 billion payable within 12 months . the Company believes its balances of cash, cash equivalents and unrestricted marketable securities will be sufficient to satisfy its cash requirements and capital return program .'

4.1.1 Chunk long documents to no longer than 512 words

The following two functions are helper functions that split a long text into several shorter text chunks (which are stored in a list). The first function uses the number of characters as the limit, whereas the second uses the number of whitespace-separated words as a rough proxy for the token count.

def join_50_chars_or_less(lst, limit=50):
    """
    Takes in lst of strings and greedily joins consecutive strings
    into chunks of up to `limit` characters (no substrings)

    :param lst: (list)
        list of strings to join
    :param limit: (int)
        optional limit on number of chars per chunk, default 50
    :return: (list)
        list of chunks, each a join of whole elements of lst.
        No partial-strings of elements allowed.
        Note: the chunk still being built when the loop ends
        is not appended, so the trailing chunk is dropped.
    """
    chunk_list = []
    temp_chunk = ""
    for i in range(len(lst)):
      if temp_chunk == "":
        current_chunk = lst[i]
      else:
        current_chunk = " ".join([temp_chunk, lst[i]])
      if len(current_chunk) <= limit:
        temp_chunk = current_chunk
      else:
        chunk_list.append(temp_chunk)
        temp_chunk = lst[i]
    return chunk_list
input_text_chunks = join_50_chars_or_less(sentences_by_emotion["joy"][:10], 512)
len(input_text_chunks)
4
def join_50_token_or_less(lst, limit=50):
    """
    Takes in lst of strings and greedily joins consecutive strings
    into chunks of up to `limit` whitespace-separated words (no substrings)

    :param lst: (list)
        list of strings to join
    :param limit: (int)
        optional limit on number of words per chunk, default 50
    :return: (list)
        list of chunks, each a join of whole elements of lst.
        No partial-strings of elements allowed.
        Note: the trailing chunk is appended only when no other
        chunk has been produced; otherwise it is dropped.
    """
    chunk_list = []
    temp_chunk = ""
    for i in range(len(lst)):
      if temp_chunk == "":
        current_chunk = lst[i]
      else:
        current_chunk = " ".join([temp_chunk, lst[i]])
      if len(current_chunk.split()) <= limit:
        temp_chunk = current_chunk
      else:
        chunk_list.append(temp_chunk)
        temp_chunk = lst[i]
    if not chunk_list:
      chunk_list.append(temp_chunk)
    return chunk_list
input_text_chunks = join_50_token_or_less(sentences_by_emotion["joy"], 512)
len(input_text_chunks)
3
text_chunks = input_text_chunks
# Summarize each chunk, re-pack the summaries into chunks of at most 512 tokens,
# and repeat until a single chunk remains.
while len(text_chunks) > 1:
  print("Total number of text chunks: {}".format(len(text_chunks)))
  new_text_chunks = []
  for counter, chunk in enumerate(text_chunks, start=1):
    print("Counter: {}/{}".format(counter, len(text_chunks)))
    summary = summarizer(chunk, min_length=60, max_length=200)
    new_text_chunks.append(summary[0]['summary_text'])
  text_chunks = join_50_token_or_less(new_text_chunks, 512)
summary_1 = text_chunks[0]
Total number of text chunks: 3
Counter: 1/3
Counter: 2/3
Counter: 3/3
summary_1
"net sales increased 8% or $28.5 billion during fiscal 2022 compared to 2021 . the company announces new product, service and software offerings at various times during the year . in April 2022, the Company announced an increase to its Program authorization from $315 billion to $405 billion . rest of Asia Pacific includes Australia and those Asian countries not included in the Company’s other reportable segments . each one is managed separately to better align with the location of the Company's customers and distribution partners and the unique market dynamics of each geographic region . Americas Americas net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone, Services and Mac . the effective tax rate for 2021 was lower than the statutory federal income tax rate due primarily to lower effective tax rates on foreign earnings . lower tax benefits from foreign-derived intangible income deductions and share-based compensation . as of September 24, 2022, the Company had $10.0 billion of Commercial Paper outstanding, all of which was payable within 12 months ."
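# Count whitespace-separated tokens per input chunk
# (named `total` to avoid shadowing the built-in sum()).
total = 0
print("Total token number of input texts:")
for chunk in input_text_chunks:
  print(len(chunk.split()))
  total += len(chunk.split())
print("--- +")
print(total)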
Total token number of input texts:
510
482
484
--- +
1476
print("The summary_1 has {} tokens".format(len(summary_1.split())))
The summary_1 has 179 tokens
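
The loop in this section summarizes every chunk, re-packs the summaries, and repeats until one chunk is left. The sketch below isolates that control flow so it can be run without downloading a model. `first_words` is a hypothetical stand-in for the T5 summarizer, `pack_words` is a simplified packer, and the `max_rounds` guard is our own addition (an assumption, not part of the notebook) to stop the loop if the summaries ever fail to shrink.

```python
from typing import Callable, List, Tuple


def pack_words(texts: List[str], limit: int) -> List[str]:
    """Greedily join texts into chunks of at most `limit` whitespace-separated words."""
    chunks: List[str] = []
    current = ""
    for text in texts:
        candidate = (current + " " + text).strip()
        if current and len(candidate.split()) > limit:
            chunks.append(current)
            current = text
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def summarize_hierarchically(
    chunks: List[str],
    summarize: Callable[[str], str],
    limit: int,
    max_rounds: int = 10,
) -> Tuple[List[str], int]:
    """Summarize each chunk, re-pack the summaries, repeat until one chunk remains."""
    rounds = 0
    while len(chunks) > 1 and rounds < max_rounds:
        summaries = [summarize(chunk) for chunk in chunks]
        chunks = pack_words(summaries, limit)
        rounds += 1
    return chunks, rounds


def first_words(text: str, n: int = 3) -> str:
    # Stand-in "summarizer": keeps only the first n words of the input.
    return " ".join(text.split()[:n])


chunks = [
    "one two three four five",
    "six seven eight nine ten",
    "eleven twelve thirteen fourteen",
]
final, rounds = summarize_hierarchically(chunks, first_words, limit=6)
print(rounds, final)
# → 2 ['one two three eleven twelve thirteen']
```

With a real summarizer, `summarize` could wrap the pipeline call, for example `lambda text: summarizer(text, min_length=60, max_length=200)[0]['summary_text']`.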

4.2 Second Approach to Summarize

In this section, we apply the same t5-base model again, but this time we state the NLP task in the prompt itself, as a prefix to the input text.

T5 is trained to handle several tasks with one model, and it distinguishes between them by the prefix prepended to the input. For example, to use T5 for translation, one would input “translate English to German: …” whereas for summarization, one would input “summarize: …”.
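
As a small illustration of the prefix convention, the sketch below builds the prefixed input string for the two tasks mentioned above. The helper `with_task_prefix` and the dictionary keys are hypothetical names introduced for this example; only the two prefix strings come from the text.

```python
# Task prefixes quoted in the text above; the dictionary keys are our own labels.
T5_PREFIXES = {
    "summarize": "summarize: ",
    "translate_en_de": "translate English to German: ",
}


def with_task_prefix(task: str, text: str) -> str:
    """Return the input string with the prefix for `task` prepended."""
    return T5_PREFIXES[task] + text.strip()


print(with_task_prefix("summarize", "Net sales increased 8% during 2022."))
# → summarize: Net sales increased 8% during 2022.
```

The resulting string is what gets passed to the tokenizer before calling `model.generate`.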

!pip install torch transformers
import re
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("t5-base")
# T5 is an encoder-decoder model, so it is loaded with the seq2seq auto class.
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base", return_dict=True)
text_chunks = input_text_chunks
while len(text_chunks) > 1:
  print("Total number of text chunks: {}".format(len(text_chunks)))
  new_text_chunks = []
  for counter, chunk in enumerate(text_chunks, start=1):
    print("Counter: {}/{}".format(counter, len(text_chunks)))

    # Prepend the task prefix and encode; inputs longer than 512 model tokens are truncated.
    inputs = tokenizer.encode("summarize: " + chunk, return_tensors="pt", max_length=512, truncation=True)
    # Generate a summary of between 80 and 100 tokens.
    output = model.generate(inputs, min_length=80, max_length=100)
    current_summary = tokenizer.decode(output[0])

    new_text_chunks.append(current_summary)
  text_chunks = join_50_token_or_less(new_text_chunks, 512)

# Remove special tokens left in the decoded text:
# 1. <pad> is the padding token, which T5 also uses as the decoder start token,
# 2. </s> marks the end of the sequence,
# 3. <unk> stands for a token that is not in the model's vocabulary.
summary_2 = text_chunks[0].replace("<pad>", "").replace("</s>", "").replace("<unk>", "").strip()

# Remove multiple spaces:
summary_2 = re.sub(' +', ' ', summary_2)
Total number of text chunks: 3
Counter: 1/3
Counter: 2/3
Counter: 3/3
summary_2
'net sales of iPhone, Services and Mac increased 8% or $28.5 billion during 2022 compared to 2021. the company announces new product, service and software offerings at various times during the year. During 2022, the Company repurchased $90.2 billion of its common stock and paid dividends and dividend equivalents of $14.8 billion. a total of 88 million shares of common stock were redeemed in the fourth quarter. rest of Asia Pacific includes Australia and those Asian countries not included in the Company’s other reportable segments. although the reportable segments provide similar hardware and software products and similar services, each one is managed separately to better align with the location of the Company’s customers and distribution partners. Americas Americas net sales increased during 2022 compared to 2021 due primarily to higher net sales of iPhone, Services and Mac. Europe Europe net sales increased during 2022 the effective tax rate for 2021 was lower than the statutory federal income tax rate due primarily to a lower effective tax rate on foreign earnings. the company’s effective tax rate for 2022 was higher compared to 2021 due primarily to a lower effective tax rate on foreign earnings. the company’s balances of cash, cash equivalents and unrestricted marketable securities, which totaled $156.4 billion as of September 24, 2022'
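The clean-up step above strips special tokens and collapses repeated spaces. The sketch below wraps the same idea in one reusable function. `clean_generated_text` is a hypothetical helper name, and the sample string is made up for illustration rather than taken from model output.

```python
import re

# Special tokens that can appear in text decoded from T5 output.
SPECIAL_TOKENS = ("<pad>", "</s>", "<unk>")


def clean_generated_text(text: str) -> str:
    """Strip T5 special tokens and collapse runs of whitespace."""
    for token in SPECIAL_TOKENS:
        # Replace with a space so neighbouring words are not glued together.
        text = text.replace(token, " ")
    return re.sub(r"\s+", " ", text).strip()


raw = "<pad> net sales increased 8%.</s> <pad> the company announces new products.</s>"
print(clean_generated_text(raw))
# → net sales increased 8%. the company announces new products.
```

An alternative is to pass `skip_special_tokens=True` to `tokenizer.decode`, which drops these tokens at decoding time.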
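# Count whitespace-separated tokens per input chunk
# (named `total` to avoid shadowing the built-in sum()).
total = 0
print("Total token number of input texts:")
for chunk in input_text_chunks:
  print(len(chunk.split()))
  total += len(chunk.split())
print("--- +")
print(total)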
Total token number of input texts:
510
482
484
--- +
1476
print("The summary_2 has {} tokens".format(len(summary_2.split())))
The summary_2 has 213 tokens

5 Summary

In this notebook we have performed:

  • Sentiment analysis using

    • FinBERT model to detect sentiments: positive, neutral and negative

    • Distilbert model to detect emotions: anger, disgust, fear, joy, sadness, surprise

  • Summarization of emotional sentences using two different approaches with the T5-base model

    • Both approaches take input text totaling 1476 whitespace-separated tokens; the first approach condenses it to 179 tokens and the second to 213 tokens.
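
To put the two summary lengths side by side, the short calculation below expresses each as a fraction of the input. The counts are the whitespace-separated token counts printed earlier in the notebook; the variable names are our own.

```python
# Token counts as printed in the notebook (whitespace-separated words).
input_tokens = 1476
summary_tokens = {"summary_1": 179, "summary_2": 213}

# Fraction of the input length that each summary retains.
ratios = {name: n / input_tokens for name, n in summary_tokens.items()}
for name, ratio in ratios.items():
    print(f"{name}: {summary_tokens[name]}/{input_tokens} = {ratio:.1%} of the input")
# → summary_1: 179/1476 = 12.1% of the input
# → summary_2: 213/1476 = 14.4% of the input
```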