How to Improve SynTex Quality

SynTex Generation produces Phazed text that formally mimics English (or any written human language), yet is obviously artificial (Phazed) in many unforeseen ways. SynTexGen brings to light many linguistic issues that might not have been addressed otherwise. Several major issues are outlined below.

Real World Knowledge (RWK)

SynTexGen only knows what it learned from its Training Text (TT), which discusses just a tiny sliver of total human RWK. And of course SynTexGen doesn't actually 'know' anything at all except for Extension Probabilities in the context of a given SynTex and given TT.
That's not enough. It's only Text Knowledge. There's no smell, touch, vision, hearing, etc. involved. Authentic human text flows in part from the author's RWK (we hope). So how can SynTexGen internalize RWK? Even the complete English Wikipedia only contains a tiny fraction of RWK. Further, Wikipedia contains vast numbers of discrete articles, each one focused on a different topic. So article #7295 is unlikely to share much RWK with article #9135, although they DO probably share much Language Knowledge.

There's a Topicality issue at play here. There are countless possible Topics a Text could discuss. And even though Wikipedia is vast, any given Topic is directly addressed in just a few articles that total very little text, not enough to produce a topically coherent SynTex.
At present, SynTexGen has little idea how to TopicFocus, except to prefer Extensions that share recent vocabulary in the Context Window. If SynTexGen used Wikipedia as the Training Text (TT), then there would be equal Extension probability for each of 500,000 articles. And that produces notoriously Phazed SynTex.
What happens if you mix 100 various paint colors together? The result is a Phazed grey color. Instead we want to mix just several shades of green, for example, nuanced enough to yield a convincing, topically coherent SynTex with good TopicFocus.
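
The one TopicFocus heuristic mentioned above, preferring Extensions that share recent vocabulary with the Context Window, might be sketched roughly as follows. The function name, the toy sentences and the scoring rule (fraction of shared tokens) are illustrative assumptions, not SynTexGen's actual code.

```python
def topic_focus_score(context_window, candidate):
    """Fraction of candidate tokens that already occur in the Context Window.

    Hypothetical scoring rule: the text only says that shared recent
    vocabulary is preferred, not how the preference is computed.
    """
    if not candidate:
        return 0.0
    recent = {tok.lower() for tok in context_window}
    shared = sum(1 for tok in candidate if tok.lower() in recent)
    return shared / len(candidate)


context = "ice cubes melt quickly in warm lemonade".split()
candidates = [
    "the lemonade was full of ice".split(),
    "the senate passed the budget".split(),
]

# Pick the candidate that shares the most vocabulary with the context.
best = max(candidates, key=lambda c: topic_focus_score(context, c))
print(" ".join(best))
```

A score like this only looks at the last few tokens, which is why it cannot keep a whole page on Topic by itself.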

Language Knowledge (LK)

            Henry smoked a beer.
            Yuki drank a cigarette.

Neither example is wrong, but they are both so implausible as to impact SynTex quality. People know what can complement 'to smoke' or 'to drink'. But how can SynTexGen mimic LK?

Perhaps anytime 'to smoke' is in the Context, Extensions should rely on LK Dictionary information to prefer certain continuations over others. The dictionary entry for 'to drink' would include a statistical tabulation of which complements most frequently follow. Of course, WITHIN an 8gram (for example) Extension, this situation never arises, since Candidates are Training Text (TT) Parasites. At some point in the TT an actual occurrence of the 8gram exists, and SynTexGen merely re-purposes it in a similar context, as a new instance.
It's BETWEEN Extensions where Phazing occurs. A new Extension (En) is chosen based on very limited contextual criteria (L/R context tokens, QuoteStatus, length), so discrimination is poor and implausible Candidates are often Extended.
Perhaps there should be an Active Zone in the last N tokens of the SynTex, 8 tokens for example. Then the Language Knowledge Dictionary should be consulted to select the optimal Extension.
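
As a rough illustration of the LK Dictionary idea, the sketch below tabulates which tokens follow a verb in a toy Training Text and uses the counts to compare complements. The three-token look-ahead, the function names and the four training sentences are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def build_lk_dictionary(sentences, verbs, lookahead=3):
    """Tabulate the tokens seen within `lookahead` positions after each verb."""
    lk = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok in verbs:
                for following in tokens[i + 1 : i + 1 + lookahead]:
                    lk[tok][following] += 1
    return lk


def complement_score(lk, verb, complement):
    """Relative frequency of `complement` after `verb` (0.0 if never seen)."""
    total = sum(lk[verb].values())
    return lk[verb][complement] / total if total else 0.0


# Toy Training Text (hypothetical).
training_text = [
    "Henry smoked a cigarette",
    "Yuki drank a beer",
    "Henry drank a coffee",
    "Yuki smoked a cigar",
]
lk = build_lk_dictionary(training_text, verbs={"smoked", "drank"})

print(complement_score(lk, "smoked", "cigarette"))  # seen in the toy TT
print(complement_score(lk, "smoked", "beer"))       # never seen after 'smoked'
```

A real dictionary would need lemmatization (smoke/smoked/smoking) and far more text. The point here is only that an unseen verb-complement pair scores lower than a seen one.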

Cast

It's important for SynTexGen to have an idea of what the participants are for a given SynTex. For example, in one page of SynTex, Topical Cohesion (TC) is achieved in part by limiting Extensions to those involving the Participants. If the Topic is ice_cubes, for example, then the Cast might include {ice_cubes, water, cold, refrigerators, lemonade, etc.}, perhaps totalling 10-15 Participants, but not 100 Participants, since Topicality would defocus.
For example, a page of SynTex might have a Context Window (CW) only a few tokens long. The CW influences Extension much more than earlier, more remote content in distant parts of the SynTex. The scope of the Cast, by contrast, is the whole SynTex, including the CW. Cast is the set of items the SynTex discusses. An Extension will seem more appropriate if it observes the Cast as well as the CW.
Some languages (Bahasa Indonesia) use numerous Titles in conversation. Participating speakers have to know the social relationships between themselves to choose appropriate Titles. So much so that ignoring Cast produces extremely Phazed SynTex, borderline incorrect.
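
One way to picture an Extension that observes both the Cast and the CW is a combined score, sketched below. The weights, the function names and the candidate phrases are invented for illustration. The text does not say how the two influences should be balanced, only that the CW counts for more.

```python
def overlap(candidate, vocabulary):
    """Number of candidate tokens found in the given vocabulary set."""
    return sum(1 for tok in candidate if tok.lower() in vocabulary)


def extension_score(candidate, context_window, cast, cw_weight=2.0, cast_weight=1.0):
    """Weighted sum of CW overlap and Cast overlap (weights are assumptions)."""
    cw_vocab = {tok.lower() for tok in context_window}
    return cw_weight * overlap(candidate, cw_vocab) + cast_weight * overlap(candidate, cast)


cast = {"ice_cubes", "water", "cold", "refrigerators", "lemonade"}
context_window = ["the", "cold", "lemonade"]
candidates = [
    "lemonade with ice_cubes".split(),
    "the election results".split(),
    "water from refrigerators".split(),
]

best = max(candidates, key=lambda c: extension_score(c, context_window, cast))
print(" ".join(best))
```

Note that a function word like 'the' still earns CW credit in this sketch, so a practical version would probably ignore Prolific Unigrams.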

MultiToken Molecules (MTM)

MTMs include any Prolific Slices, commonly: Collocations, Named Entities, Phrasal Verbs and Anonyms. For example, we might find that the 5gram Slice The United States of America is Prolific (occurs frequently enough) in a TT to consider it an MTM (and also a Named Entity (NE)).
MTMs matter since Fracturing MTMs risks Phazing SynTex. Fracturing means truncating an MTM so that it doesn't Extend to completion, because it wasn't correctly seen as a cohesive unit, and instead grows an Extraneous Extension. An Extraneous Extension to SynTex is any Extension other than the identified completions for the Prolific MTMs. SynTexGen can avoid MTM Fracturing by excluding Fractured Wafers.

We identify MTMs by frequency analysis over a Training Text SliceDeck (TTSD). The TTSD is all the Slices in the TT of given lengths (so similar to the PlageDicts, except PlageDicts additionally contain Fractured MTMs). For example, we find the frequency of every 4gram Slice, 5gram Slice, and so on up to 10gram Slice for a certain TT. The result is a long list of Prolific MTMs sorted by frequency of occurrence, including many that humans wouldn't find. At Wafer Dictionary compilation time, Fractured MTM Slices are excluded. Then they cannot become Candidate Extensions, thus reducing Phazing by reducing MTM Fracturing.
Discontiguous MTMs, typically Phrasal Verbs, are harder to identify for two reasons. First, albeit unlikely, they can be longer than the longest stipulated Ngram length. Secondly, there are intervening tokens between the two halves of a discontiguous MTM. Fortunately discontiguous MTMs are infrequent anyway, so we ignore them.
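
The SliceDeck frequency analysis described above could look roughly like this minimal sketch. The names slice_deck and prolific_mtms, the min_count threshold and the toy TT are assumptions, and only contiguous Slices are counted.

```python
from collections import Counter


def slice_deck(tokens, min_n=4, max_n=10):
    """Count every contiguous Slice of length min_n..max_n in the token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i : i + n])] += 1
    return counts


def prolific_mtms(tokens, min_count=2, min_n=4, max_n=10):
    """Return (Slice, count) pairs seen at least min_count times, most frequent first."""
    deck = slice_deck(tokens, min_n, max_n)
    return [(s, c) for s, c in deck.most_common() if c >= min_count]


# Toy Training Text (hypothetical).
tt = (
    "the United States of America is large . "
    "many people visit the United States of America each year . "
    "the United States of America has fifty states ."
).split()

mtms = prolific_mtms(tt, min_count=3, min_n=4, max_n=5)
for slice_, count in mtms:
    print(count, " ".join(slice_))
```

On this toy TT the 5gram surfaces together with its two 4gram sub-Slices, so a further step (not shown) would be needed to keep only the longest Prolific Slice and treat the shorter ones as Fractured.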

Some Extensions are less Phazed than others because they avoid Fractured MTMs:

                     best:          well it was often said that the United States of America is the only
                     plausible:     well it was often said that the United States of Mexico is the only
                     phazed:        well it was often said that the United States of water
                     wrong:         well it was often said that the United States of are abundant
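
A Fracture check for a single known MTM might be sketched as below: the text starts the MTM, matches at least min_prefix tokens, and then diverges before completion. The function name and the min_prefix parameter are assumptions. The check only separates the 'best' line from the other three; it says nothing about why 'Mexico' reads better than 'water'.

```python
def fractures_mtm(tokens, mtm, min_prefix=2):
    """True if `tokens` begins the MTM somewhere but diverges before completing it."""
    n = len(tokens)
    for start in range(n):
        matched = 0
        while (
            matched < len(mtm)
            and start + matched < n
            and tokens[start + matched] == mtm[matched]
        ):
            matched += 1
        # Fractured: enough of the MTM matched, it is incomplete, and another
        # token follows (text that simply stops mid-MTM can still be completed).
        if min_prefix <= matched < len(mtm) and start + matched < n:
            return True
    return False


MTM = "the United States of America".split()
base = "well it was often said that".split()

examples = {
    "best": base + "the United States of America is the only".split(),
    "plausible": base + "the United States of Mexico is the only".split(),
    "phazed": base + "the United States of water".split(),
    "wrong": base + "the United States of are abundant".split(),
}
for label, tokens in examples.items():
    print(label, fractures_mtm(tokens, MTM))
```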

Some MTMs are pseudomorphs, variants of one another that differ only in conjugation/number/person agreement, greatly complicating detection. For now we ignore this situation, crudely considering these MTMs to be unrelated:

                        the United States
                        The US
                        the United States of America
                        let's get on with it
                        we got on with it
                        we're gonna get on with it
                        she always wraps it up by sundown
                        she wrapped it up by sundown

Anonyms

Anonyms are MTMs that aren't NEs or PVs, but are still Prolific Slices. They often have Prolific constituents, accounting for the Proliferation that qualifies them as Anonyms. Since (for English) some Prolific Unigrams include ('the','of','.',','), some common Anonyms include:

             of the          lord of the flies
             . the           got back yesterday. the door was
             , the           out of the deck, without
             . of            died yesterday. of lung cancer

Fracturing Anonyms doesn't Phaze SynTex. We don't try to detect Anonyms. However, SliceDeck Frequency Analysis discovers numerous Anonyms anyway, which amount to false positives. But there's no need to exclude them from MTMs, and it would be extra effort to isolate them.

Named Entities (NE)

NE are MTMs that correspond to real-world objects like people, places, products, etc. that have specific names. Some NE examples:

          Donald Trump
          Union City
          Wet Wipes
          1425 Las Positas Blvd.

Similar to collocations, NEs will need an NE dictionary, and NEs shouldn't be split at Wafer Compile time. Often NEs involve Upper Case, with all the complications Upper/Lower Case handling entails. Information Extraction uses Named Entities. Temporal and numeric expressions, amounts of money and measurements may require similar treatment.
How can we decide if "Donald Trump" is an NE or just another pair of tokens in text? In fact there are many OTHER token pairs that co-occur more frequently than random, but probably shouldn't be considered Collocations or Named Entities:

       of the
       in the
       . The

Phrasal Verbs (PV)

In English, PVs (also: prepositional verbs or particle verbs) can further complicate convincing SynTex Generation. They're another instance of multiTokens, sequences that occur together more frequently than random, in this case since they are integral English verbs.

              They refused to give up until she decided to begin
              There's no way to get ahold of us after dark
              This song deserves listening to

Some PVs can have remote parts:

               He brought that same excuse up again
               She handed the first of the chapters in

Collocations

For SynTexGen, the term collocations is taken to mean MTMs that aren't Phrasal Verbs or Anonyms. We prefer the term Named Entity.

Proliferation: Prolific Ngrams

Ngrams display a Zipfian distribution in Text, following Zipf's Law. Very few Ngrams are Prolific, appearing very frequently, while very many Ngrams are Hapax, appearing only once. The Prolifics amount to MTMs. SynTexGen avoids Fracturing Prolifics since completing a Prolific is less likely to Phaze the Extension.
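
The split between Prolific and Hapax Ngrams can be illustrated on a toy token list. In this sketch the prolific_min threshold is an arbitrary assumption, since the text does not say how frequent an Ngram must be to count as Prolific.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Frequency of every contiguous Ngram of length n."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def split_prolific_hapax(counts, prolific_min=3):
    """Separate Ngrams seen at least prolific_min times from those seen exactly once."""
    prolific = {gram for gram, c in counts.items() if c >= prolific_min}
    hapax = {gram for gram, c in counts.items() if c == 1}
    return prolific, hapax


tokens = "of the people by the people for the people of the world".split()
counts = ngram_counts(tokens, 2)
prolific, hapax = split_prolific_hapax(counts)

print("Prolific:", sorted(prolific))
print("Hapax count:", len(hapax))
```

Even in twelve tokens the shape is visible: one bigram is Prolific while most of the rest occur only once.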

Coextension

Wafers are Coextensive when they have the same length and same Left/Right Edges. They are Candidates to Extend a Gap because they are Coextensive with the Gap. Coextensive Candidates are less likely to Phaze a SynTex since the Coextension increases the chance they are Congruent to the Training Text the Skeleton was derived from.
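
A Coextension test can be written directly from this definition. The sketch below assumes that the Left/Right Edges are the first and last tokens of the Wafer itself, and the function names and toy Wafers are hypothetical.

```python
def is_coextensive(wafer, gap_length, left_edge, right_edge):
    """True if the Wafer has the Gap's length and the same Left/Right Edges."""
    return (
        len(wafer) > 0
        and len(wafer) == gap_length
        and wafer[0] == left_edge
        and wafer[-1] == right_edge
    )


def candidates_for_gap(wafers, gap_length, left_edge, right_edge):
    """Keep only the Wafers that are Coextensive with the Gap."""
    return [w for w in wafers if is_coextensive(w, gap_length, left_edge, right_edge)]


wafers = [
    list("defg"),
    list("dcag"),
    list("defghg"),  # right Edges, wrong length
    list("dpcrg"),   # right Edges, wrong length
    list("acag"),    # right length, wrong Left Edge
]
print(candidates_for_gap(wafers, gap_length=4, left_edge="d", right_edge="g"))
```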

Congruent Slices

In a Text [a,b,c,d,e,f,g,h,i] there is a Slice [d,e,f,g]. Replacement Slices are Congruent to [d,e,f,g] to the extent that the replacement is not Phazed, so that human readers consider it plausible and natural. Some candidate Replacements might be: [d,c,a,g], [d,e,f,g,h,g], [d,p,c,r,g]. The Replaced Texts would then be: [a,b,c,d,c,a,g,h,i], [a,b,c,d,e,f,g,h,g,h,i], [a,b,c,d,p,c,r,g,h,i]. Human readers might consider some of them Phazed, but others not. The Replacement Slices that don't produce Phazed results are more Congruent.
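
The mechanical part of this, swapping a Slice for a Replacement, is easy to sketch. Judging Congruence is the hard part and is left to the human reader. The helper name replace_slice is an assumption.

```python
def replace_slice(text, old_slice, new_slice):
    """Return a copy of `text` with the first occurrence of old_slice replaced."""
    n = len(old_slice)
    for i in range(len(text) - n + 1):
        if text[i : i + n] == old_slice:
            return text[:i] + new_slice + text[i + n :]
    raise ValueError("old_slice does not occur in text")


text = list("abcdefghi")
old = list("defg")

replaced_texts = [
    replace_slice(text, old, list(replacement))
    for replacement in ("dcag", "defghg", "dpcrg")
]
for replaced in replaced_texts:
    print(",".join(replaced))
```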

Dynamic Extension (DE)

Starting with SynTex [a,b,c,d,e] we can extend it with [e,f,g] to become [a,b,c,d,e,f,g]. The [e,f,g] Candidate was previously discovered in a Training Text (TT) in a context where its continuation was [g,a,c,p,q]. So the entire discovered segment was [e,f,g,a,c,p,q]. SynTexGen only uses the front part of it to extend the SynTex to [a,b,c,d,e,f,g]. But in addition, the next Left Edge is now g and the next Right Edge is q, and a 5gram Slice with those Edges would be Coextensive with the original, and thus a good Extension Candidate, less likely to Phaze the SynTex than less Coextensive alternatives.
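
The worked example above can be traced in a few lines. This is a sketch under the assumption that a Candidate overlaps the SynTex by exactly one Edge token and that its stored Completion begins with the Candidate's last token. The function and dictionary key names are invented.

```python
def dynamic_extend(syntex, candidate, completion):
    """Extend the SynTex with `candidate` and derive the next Gap from `completion`."""
    if syntex[-1] != candidate[0]:
        raise ValueError("Candidate's Left Edge must match the end of the SynTex")
    if candidate[-1] != completion[0]:
        raise ValueError("Completion must start at the Candidate's Right Edge")
    extended = syntex + candidate[1:]  # skip the shared Edge token
    next_gap = {
        "left_edge": completion[0],
        "right_edge": completion[-1],
        "length": len(completion),
    }
    return extended, next_gap


syntex = list("abcde")
candidate = list("efg")
completion = list("gacpq")  # what followed the Candidate in the TT

extended, next_gap = dynamic_extend(syntex, candidate, completion)
print(extended)
print(next_gap)
```

The next Gap is computed only after the Extension is applied, which is the sense in which the Coextension is determined at run time.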

Our Wafer Dictionaries now have to keep track of Completions for each Candidate entry. And what length for the Completions? Static Extension stipulates a SKELETON of Gaps, where next Edges are determined in advance. The resulting SynTex is congruent to the original Training Text (TT) for this reason. Whereas with DE after each progressive Extension the next Coextension is determined dynamically, at run time. There's no pre-stipulated SKELETON.

The Gap is the Coextension SynTexGen intends to fill. Before Extending, SynTexGen only knows the Coextension of the Gap, so SynTexGen tries to find a Congruent Candidate to fill the Gap. This process works since Coextensive Slices are more likely, even quite likely, to resemble each other than otherwise.

Reference Resolution (RR)

RR is a challenge for SynTexGen. Consider these examples and notice how a later clause (Post) contains tokens that refer to other tokens in a prior clause.

Mary saw the wine and then she drank some                         wine      <--some
Her friends came over, but she didnt see them                     friends   <--them
Mary and John have shoes, but she doesnt wear hers indoors        Mary      <--she
                                                                  Mary,shoes<--hers

RR can be remote:

 World champion and Olympic gold medalist Simone Biles has  won over 19 medals,
          but one accolade was out of her reach, until now.      Simone Biles <-- her

Suppose our SynTex is currently:

   World champion and Olympic gold medalist Simone Biles has  won over 19 medals,

Then to Extend it, SynTexGen should choose a Candidate that doesn't Phaze RR. There could be many plausible Extensions, but SynTexGen doesn't attempt RR at Extend-time. Instead SynTexGen just avoids blatant MisResolutions (MR).

 World champion and Olympic gold medalist Simone Biles has  won over 19 medals,
          and naturally so
 World champion and Olympic gold medalist Simone Biles has  won over 19 medals,
          and that is amazing
 World champion and Olympic gold medalist Simone Biles has  won over 19 medals,
          and we cant dance                     ?????
 World champion and Olympic gold medalist Simone Biles has  won over 19 medals,
          and you are amazing                   ????

But how to implement RR? Every Extension would need cast analysis:

  Simone Biles has  won over 19 medals
  Simone Biles    female, singular
  medals          no gender, plural
  has won         singular, past

It's not clear this approach is practical.
Let's try a different way:

   Extension:   'World champion and Olympic gold medalist Simone Biles has won over 19 medals,'
   8gram Post:  'but one accolade was out of her reach'

The 8gram Post contains Trigger her, so we could prefer a Candidate Extension that also contains her, assuming that Candidate would be more likely to RR correctly. Some English Triggers include:

 I you we he she it they my your our his her its their 
 mine yours his hers its theirs ours
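
Trigger matching against the Post might be sketched as follows. The Trigger list is the one given above, while the function names and the third candidate sentence are invented for illustration.

```python
TRIGGERS = {
    t.lower()
    for t in (
        "I you we he she it they my your our his her its their "
        "mine yours his hers its theirs ours"
    ).split()
}


def triggers_in(tokens):
    """The set of Triggers that occur in a token list (case-insensitive)."""
    return {tok.lower() for tok in tokens} & TRIGGERS


def shares_trigger(post, candidate):
    """True if the Candidate contains at least one Trigger that the Post contains."""
    return bool(triggers_in(post) & triggers_in(candidate))


post = "but one accolade was out of her reach".split()
candidates = [
    "and that is amazing",
    "and we cant dance",
    "but her coach was not surprised",  # hypothetical Candidate
]
preferred = [c for c in candidates if shares_trigger(post, c.split())]
print(preferred)
```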

Or SynTexGen could choose the Extension which most closely resembles the Post, irrespective of particular resemblances. SynTexGen doesn't care if the resemblance is a Trigger or not, thus avoiding intractable Trigger processing. The verbatim Post itself is excluded, since it would be a Plage.

   Extension:   'World champion and Olympic gold medalist Simone Biles has won over 19 medals,'
   8gram Post:  'but one accolade was out of her reach'
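
Choosing by overall resemblance could be sketched like this. The text does not define how resemblance is measured, so the Jaccard overlap of token sets used here is purely an assumption, as are the function names and the second candidate.

```python
def resemblance(post, candidate):
    """Jaccard overlap between the token sets of the Post and a Candidate."""
    a, b = set(post), set(candidate)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def choose_extension(post, candidates):
    """Pick the Candidate most like the Post, skipping the verbatim Post (a Plage)."""
    eligible = [c for c in candidates if c != post]
    if not eligible:
        return None
    return max(eligible, key=lambda c: resemblance(post, c))


post = "but one accolade was out of her reach".split()
candidates = [
    post,                                          # verbatim Post, excluded
    "but that title was out of her hands".split(),  # hypothetical Candidate
    "and we cant dance".split(),
]
chosen = choose_extension(post, candidates)
print(" ".join(chosen))
```

Because the shared tokens here happen to include the Trigger her, the chosen Candidate also resolves plausibly, without any Trigger-specific processing.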